Local Gemma 4 + LoRA on Apple Silicon
A local-first experiment using Gemma 4, 4-bit quantization, and LoRA workflows to see how far a 16GB M2 MacBook can go for private multimodal extraction.
Project Brief
Summary
A practical test of Gemma 4 on a 16GB MacBook M2: run the small edge model locally with Ollama and Metal, then shape a LoRA tuning path for private document and audio extraction.
Problem
A lot of parsing and transcription workflows break the moment the inputs are private, domain-specific, or too messy for generic OCR and speech pipelines.
Hypothesis
If the small Gemma 4 variants are quantized hard enough and adapted with lightweight LoRA training, a 16GB Apple Silicon laptop can handle useful private multimodal extraction without sending sensitive material to the cloud.
Outcome
Validated the local inference path around Gemma 4 and Ollama on Apple Silicon, narrowed the workable model sizes for 16GB unified memory, and framed a domain-tuning workflow aimed at private form extraction and specialized audio transcription.
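The LoRA side of that workflow is easy to state concretely. Instead of updating a full weight matrix W, you train two small low-rank factors A and B and apply W + (alpha/r)·BA at inference. A minimal numpy sketch of that idea (the shapes, rank, and scaling here are illustrative, not the project's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 1024, 1024, 8, 16  # illustrative sizes, not project values

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero-init so training starts from W unchanged

# Effective weight used at inference: base plus scaled low-rank update
W_eff = W + (alpha / r) * (B @ A)

full = W.size
lora = A.size + B.size
print(f"trainable params: {lora} vs full {full} ({lora / full:.1%})")
```

The point of the sketch is the parameter count: the two factors are about 1.6% of the full matrix here, which is why LoRA-style adaptation is plausible on a 16GB laptop at all.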
This project started with a simple constraint: keep the workflow local. The goal was not to squeeze the biggest open model onto a laptop for bragging rights. The goal was to see whether a 16GB MacBook M2 could become a useful private workstation for multimodal extraction.
Gemma 4 made that interesting because the small E2B and E4B variants are explicitly positioned for edge devices, and both support text, image, and audio input. That made them a much better fit for the job than a cloud-only stack or a larger model that would leave almost no headroom on a 16GB machine.
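As a back-of-the-envelope sizing check (assuming the E4B tag corresponds to roughly 4B effective parameters, which is my reading of the name rather than a documented figure), 4-bit quantization keeps the raw weights comfortably inside 16GB of unified memory:

```python
# Rough weight-memory estimate for an assumed ~4B-parameter model
# at different precisions. Ignores KV cache and runtime overhead.
params = 4e9

for bits in (16, 8, 4):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
```

At 16-bit that is already around 7.5 GiB before any context or activations, which is why 4-bit quantization is the practical default on this hardware.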
Ollama became the serving layer because it made the local path simple: pull the official Gemma 4 model, run it on macOS, and shape a tighter runtime with a Modelfile when the defaults needed to be constrained for laptop memory.
```shell
# pull the default Gemma 4 model
ollama pull gemma4

# run the E4B variant interactively
ollama run gemma4:e4b

# check which models are loaded and how much memory they use
ollama ps
```
Gemma 4 ships under e2b, e4b, 26b, and 31b tags. For a MacBook M2 with 16GB of unified memory, the E4B path is the practical starting point.

I went in thinking the interesting part would be the quantization trick. The more useful insight was that the real value is privacy plus structure: a local model like this becomes compelling when it can turn a messy form, screenshot, or short audio segment into the fields a real workflow actually needs.
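When the defaults needed tightening, a Modelfile was the lever. A minimal sketch (the context size and sampling settings here are illustrative choices for a 16GB machine, not the project's recorded configuration):

```
# Modelfile: constrain the runtime for a 16GB laptop
FROM gemma4:e4b

# A smaller context window keeps the KV cache from eating unified memory
PARAMETER num_ctx 4096

# Low temperature suits extraction tasks better than open-ended chat
PARAMETER temperature 0.2
```

Building the constrained variant is then `ollama create gemma4-local -f Modelfile`, after which it runs like any other local tag.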
01
The current Gemma 4 stack is cleanly scoped: official docs expose `gemma4:e2b`, `gemma4:e4b`, `gemma4:26b`, and `gemma4:31b`, which makes the local sizing decision much clearer than older Gemma naming.
02
The smaller E2B and E4B models are the right class for a 16GB Apple Silicon laptop, especially when the goal is privacy-preserving extraction instead of raw leaderboard performance.
03
Audio support pushed the project toward the edge models, because current Gemma 4 docs reserve audio for E2B and E4B.
04
The real target was not generic chat. It was private multimodal extraction: forms, screenshots, OCR-heavy inputs, and short specialized audio segments.
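That extraction target maps naturally onto Ollama's local HTTP API, which can be asked to constrain output to JSON. A sketch of the request shape (the field list, prompt wording, and model tag are illustrative; this assumes an Ollama server on the default local port):

```python
import json


def build_extraction_request(document_text: str, model: str = "gemma4:e4b") -> dict:
    """Payload for POST http://localhost:11434/api/generate, asking for JSON fields.

    The extracted field names below are hypothetical examples, not a fixed schema.
    """
    prompt = (
        "Extract the following fields from the document as JSON: "
        "name, date, total.\nDocument:\n" + document_text
    )
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",  # ask Ollama to constrain the response to valid JSON
        "stream": False,   # return one complete response instead of a token stream
    }


payload = build_extraction_request("Invoice for Jane Doe, 2024-05-01, total $42.00")
print(json.dumps(payload, indent=2))
```

Posting that payload with `curl` or `requests` against the local server is all the "pipeline" there is, which is the appeal: the sensitive document never leaves the machine.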
[ Connect ]
If you are trying to keep model inference close to the data, especially on Apple Silicon, or you need a sane path from local experimentation to domain-tuned extraction, I would be happy to compare notes.
You are reaching
John Meyer
Security Engineer → AI