Experiment · Local AI Research + Prototype

Gemma on Metal

Local Gemma 4 + LoRA on Apple Silicon

A local-first experiment using Gemma 4, 4-bit quantization, and LoRA workflows to see how far a 16GB M2 MacBook can go for private multimodal extraction.

Project Brief

  • Machine: MacBook M2 / 16GB
  • Runtime: Ollama + Metal
  • Target model: Gemma 4 E4B
  • End use: Private extraction
01 - Project Brief

Problem, Hypothesis, Outcome.

Summary

A practical test of Gemma 4 on a 16GB MacBook M2: run the small edge model locally with Ollama and Metal, then shape a LoRA tuning path for private document and audio extraction.

Problem

A lot of parsing and transcription workflows break the moment the inputs are private, domain-specific, or too messy for generic OCR and speech pipelines.

Hypothesis

If the small Gemma 4 variants are quantized hard enough and adapted with lightweight LoRA training, a 16GB Apple Silicon laptop can handle useful private multimodal extraction without sending sensitive material to the cloud.
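A rough memory budget makes the hypothesis concrete. A minimal sketch, assuming roughly 4 billion effective parameters for the small variant (my assumption for illustration, not a measured figure; real footprints also include the KV cache and runtime overhead):

```python
# Back-of-envelope weight memory at different quantization widths.
# The 4e9 parameter count is an illustrative assumption, not a spec.

def weight_gib(params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GiB at a given quantization width."""
    return params * bits_per_param / 8 / 2**30

fp16 = weight_gib(4e9, 16)   # half precision, no quantization
q4 = weight_gib(4e9, 4.5)    # 4-bit quantization plus grouping overhead

print(f"fp16: {fp16:.1f} GiB, 4-bit: {q4:.1f} GiB")
```

The gap between roughly 7.5 GiB at fp16 and roughly 2 GiB at 4-bit is what makes a 16GB machine viable at all once the OS and context cache take their share.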

Outcome

Validated the local inference path around Gemma 4 and Ollama on Apple Silicon, narrowed the workable model sizes for 16GB unified memory, and framed a domain-tuning workflow aimed at private form extraction and specialized audio transcription.

02 - Goals & Stack

What the build was trying to do.

Goals

  • Find a Gemma 4 variant that runs comfortably on a 16GB M2 machine.
  • Keep document and audio inputs local instead of routing them through a hosted API.
  • Define a LoRA path for industry-specific extraction tasks.
  • Turn the experiment into a private multimodal parser pattern, not just a chatbot demo.

Technologies Used

Gemma 4 · Ollama · Apple Silicon · Metal · 4-bit quantization · Unsloth · LoRA / QLoRA · GGUF

03 - Breakdown & Notes

Implementation notes.

Breakdown

This project started with a simple constraint: keep the workflow local. The goal was not to squeeze the biggest open model onto a laptop for bragging rights. The goal was to see whether a 16GB MacBook M2 could become a useful private workstation for multimodal extraction.

Gemma 4 made that interesting because the small E2B and E4B variants are explicitly positioned for edge devices, and both support text, image, and audio input. That made them a much better fit for the job than a cloud-only stack or a larger model that would leave almost no headroom on a 16GB machine.

Ollama became the serving layer because it made the local path simple: pull the official Gemma 4 model, run it on macOS, and shape a tighter runtime with a Modelfile when the defaults needed to be constrained for laptop memory.

ollama pull gemma4:e4b
ollama run gemma4:e4b
ollama ps
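When the defaults needed tightening, a Modelfile kept the runtime inside laptop memory. A minimal sketch of that shape (the parameter values here are illustrative choices, not the project's exact settings):

```
# Modelfile: constrain the runtime for 16GB unified memory
FROM gemma4:e4b

# A smaller context window keeps the KV cache modest
PARAMETER num_ctx 4096

# Cap generation length for extraction-style outputs
PARAMETER num_predict 512
```

Built and run locally with `ollama create gemma4-local -f Modelfile` followed by `ollama run gemma4-local`.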

Build notes

  • The official Google and Ollama docs now describe the Gemma 4 family with e2b, e4b, 26b, and 31b tags. For a MacBook M2 with 16GB unified memory, the E4B path is the practical starting point.
  • Quantization mattered more than headline parameter counts. The local win came from choosing a quantized size that leaves room for the operating system, the context cache, and the rest of the workflow.
  • Audio support shaped the choice as much as memory did. In the current docs, audio is available on E2B and E4B, which nudged this project toward the smaller edge models instead of the larger workstation variants.
  • The tuning plan was centered on LoRA rather than full-model retraining. The point was to steer the model toward industry-specific field extraction and specialized transcription, not to rebuild the base model from scratch.
  • Unsloth is the most promising path for that layer right now, but the docs are moving quickly. Their newer Gemma 4 pages say the model can run and be fine-tuned in Unsloth Studio on macOS, while their beginner requirements still describe Apple and MLX training as in progress. I treated that as a moving boundary, not a settled fact.
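The arithmetic behind that choice is worth spelling out: a rank-r LoRA adapter on a d_out × d_in weight trains r·(d_in + d_out) parameters instead of d_out·d_in. A minimal sketch, using illustrative layer dimensions rather than Gemma's actual shapes:

```python
# Why LoRA is "lightweight": count trainable parameters for a rank-r
# adapter versus full fine-tuning of one projection matrix.
# The 4096x4096 shape below is illustrative, not Gemma's real geometry.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # A is (r x d_in), B is (d_out x r); the update is W + (alpha/r) * B @ A
    return r * d_in + d_out * r

full = 4096 * 4096                   # full fine-tune of one matrix
adapter = lora_params(4096, 4096, 8)  # rank-8 adapter on the same matrix

print(adapter, f"({100 * adapter / full:.2f}% of full)")
```

At rank 8 the adapter is well under one percent of the full matrix, which is what makes tuning on a laptop-class machine plausible at all.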

Lessons Learned

I went in thinking the interesting part would be the quantization trick. The more useful insight was that the real value is privacy plus structure. A local model like this becomes compelling when it can turn a messy form, screenshot, or short audio segment into fields that matter to a real workflow.
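The "structure" half of that insight reduces to something simple: the model's reply only becomes workflow-useful once it is validated against the fields the workflow actually needs. A minimal sketch, where the schema and the sample reply are hypothetical stand-ins, not artifacts from this project:

```python
import json

# Validate a model's JSON reply against the fields a form-extraction
# workflow expects. EXPECTED_FIELDS and the sample reply are invented
# for illustration.

EXPECTED_FIELDS = {"invoice_number", "date", "total"}

def parse_extraction(reply: str) -> dict:
    """Parse a JSON reply and fail loudly if required fields are missing."""
    data = json.loads(reply)
    missing = EXPECTED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"model omitted fields: {sorted(missing)}")
    return data

sample = '{"invoice_number": "INV-042", "date": "2024-05-01", "total": "129.95"}'
print(parse_extraction(sample)["invoice_number"])
```

Failing loudly on missing fields is the point: a private extraction pipeline with no human in the loop needs a hard error, not a silently incomplete record.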

04 - Analysis

Findings.

01

The current Gemma 4 stack is cleanly scoped: official docs expose `gemma4:e2b`, `gemma4:e4b`, `gemma4:26b`, and `gemma4:31b`, which makes the local sizing decision much clearer than older Gemma naming.

02

The smaller E2B and E4B models are the right class for a 16GB Apple Silicon laptop, especially when the goal is privacy-preserving extraction instead of raw leaderboard performance.

03

Audio support pushed the project toward the edge models, because current Gemma 4 docs reserve audio for E2B and E4B.

04

The real target was not generic chat. It was private multimodal extraction: forms, screenshots, OCR-heavy inputs, and short specialized audio segments.

[ Connect ]

Worth a conversation?

If you are trying to keep model inference close to the data, especially on Apple Silicon, or you need a sane path from local experimentation to domain-tuned extraction, I would be happy to compare notes.


You are reaching

John Meyer

Security Engineer → AI

  • Open to roles
  • Contract + consulting
  • Architecture advisory