local-first · google · BYOLLM · hardware · open-source · model-release

Gemma 4: Google's first Apache 2.0 open model is also its best

Bodega One · 7 min read
Quick answer

Google released Gemma 4 on April 2 under Apache 2.0. Four model sizes. The 31B ranks #3 among all open models globally and runs on a single RTX 4090 (~19 GB at Q4_K_M). If you avoided Gemma because of the old license terms, that objection is gone. Full Ollama support: ollama run gemma4.

Gemma 3 was a decent model with a frustrating license. Google's own Gemma terms required a separate review, added friction for commercial use, and made teams skip it in favor of Apache 2.0 models like Qwen or Llama. Gemma 4 changes that. Apache 2.0 across all four model sizes. No usage policy review, no separate agreement, no carve-outs. Just a standard open-source license.

The license change alone would be significant. The benchmark numbers make it a bigger story.

What Gemma 4 actually ships

Four model sizes, two architectures. The E2B and E4B come from Google's Gemma-3n architecture. "E" stands for effective parameters. The E4B has roughly 11B total parameters but runs inference like a 4B model, using per-layer embeddings and alternating attention (sliding-window local layers plus full-context global layers) to access a larger knowledge base without the compute cost. It's an unusual architecture, and the practical result is a 5 GB model that outperforms most 7B models from 2024.
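The alternating-attention idea is easy to see in code. A minimal sketch of how such a layout might interleave layers; the 5:1 local-to-global ratio is an assumption borrowed from earlier Gemma generations, not a confirmed Gemma 4 value:

```python
def attention_layout(n_layers: int, local_per_global: int = 5) -> list[str]:
    """Return the attention type per layer: sliding-window 'local' layers
    interleaved with full-context 'global' layers.
    The 5:1 ratio is an assumption, not a confirmed Gemma 4 spec."""
    layout = []
    for i in range(n_layers):
        # Every (local_per_global + 1)-th layer attends over the full context.
        if (i + 1) % (local_per_global + 1) == 0:
            layout.append("global")
        else:
            layout.append("local")
    return layout

layout = attention_layout(30)
print(layout.count("local"), layout.count("global"))  # 25 local, 5 global
```

Most layers only pay for a small attention window; the occasional global layer is what keeps the full 128K context reachable.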

The 26B is a Mixture-of-Experts model: 26B total parameters, 4B active per inference step. The 31B is a standard dense model. All four support text, image, and audio input with text output. Multimodal is built into the architecture, not added on after.

  • E2B: mobile and edge deployment. Not covered here.
  • E4B: ~5 GB VRAM at Q4_K_M. 128K context. Default Ollama pull.
  • 26B MoE: ~14 GB VRAM at Q4_K_M. 256K context. #6 open model on Arena AI.
  • 31B Dense: ~19 GB VRAM at Q4_K_M. 256K context. #3 open model on Arena AI.
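The MoE entry in that list comes down to simple arithmetic: all 26B parameters must sit in VRAM, but each token only activates 4B of them. A rough sketch, assuming the common ~2-FLOPs-per-active-parameter rule of thumb for a forward pass:

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per active parameter per token
    (a standard rule of thumb, not a measured Gemma 4 figure)."""
    return 2 * active_params

dense_31b = forward_flops_per_token(31e9)  # dense: all 31B parameters fire
moe_26b = forward_flops_per_token(4e9)     # MoE: only 4B of 26B fire per token

# Memory scales with total parameters; per-token compute with active ones.
print(f"dense/MoE compute ratio: {dense_31b / moe_26b:.2f}x")  # 7.75x
```

That asymmetry is why the 26B MoE fits a 14 GB card yet generates tokens at roughly 4B-model speed.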

The benchmark numbers

The AIME 2026 score is the headline. Gemma 3 27B scored 20.8%. Gemma 4 31B scores 89.2%. That's not a small improvement. It's the largest single-generation reasoning leap in open-source model history. For context: Qwen3.5-27B, the previous top single-GPU model for coding, scores 48.7% on AIME. Gemma 4 31B nearly doubles it.

LiveCodeBench tells a similar story. Gemma 3's 27B scored 29.1%. The 31B scores 80.0%. The 26B MoE sits at approximately 70%. The E4B comes in at 52%, competitive with other models in the 5 GB tier, and ahead of most that ran at this VRAM level a year ago.

On the Arena AI text leaderboard as of April 2026: the 31B ranks #3 among all open models. The 26B MoE ranks #6. Google's own description: "outcompetes models 20x its size." Independent testing confirms it.

GPQA Diamond (scientific reasoning): 84.3%. Multi-needle retrieval over long context: 66.4%, up from Gemma 3's 13.5%. The long-context improvements matter practically. 256K context is only useful if the model can actually reason over the full window, and the retrieval scores suggest it can.

What the Apache 2.0 change actually means

The old Gemma license was Google's own terms. Commercial use was technically allowed but came with restrictions: prohibited use cases defined by reference to other documents, usage policies that could change, and enough ambiguity that legal teams at many organizations flagged it. The practical result: teams building products defaulted to Apache 2.0 models and skipped Gemma entirely.

Apache 2.0 is unambiguous. Use the model, modify it, redistribute it, build products with it. Attribute Google. That's it. No usage policy review needed, no separate agreement, no questions about whether a specific application triggers a restricted-use clause.

For local-first deployments this matters less. You're running the model on your hardware, no cloud inference, no one auditing your use case. But for any team shipping a product that embeds a model, the license change removes a real obstacle.

Running it locally

All three consumer-grade sizes have full Ollama support as of release day. Pick based on your VRAM:

  • E4B (~5 GB): ollama run gemma4. Default pull. RTX 3050, MacBook M1, any 6 GB+ GPU.
  • 26B MoE (~14 GB): ollama run gemma4:26b. RTX 3080 10 GB is borderline; RTX 3080 12 GB or RTX 4070 and up is comfortable.
  • 31B Dense (~19 GB): ollama run gemma4:31b. RTX 3090, RTX 4090, or Mac M2 Max 32 GB+. The one to run if your hardware supports it.

Ollama pulls Q4_K_M by default. If you want higher precision or a specific quantization, pull from Hugging Face and run via ollama create. BF16 (full precision) requires 62 GB+ for the 31B. That's multi-GPU or server territory.
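These VRAM figures follow from bytes-per-parameter arithmetic. A minimal estimator, assuming Q4_K_M averages roughly 4.85 bits per weight (an approximation; real GGUF file sizes vary by architecture) and ignoring KV-cache and runtime overhead:

```python
def weight_gb(params: float, bits_per_param: float) -> float:
    """Model weight size in decimal GB for a given quantization level."""
    return params * bits_per_param / 8 / 1e9

# 31B dense at Q4_K_M (~4.85 bits/weight, an approximation) vs BF16 (16 bits).
print(f"Q4_K_M: {weight_gb(31e9, 4.85):.1f} GB")  # ~18.8 GB, the ~19 GB figure
print(f"BF16:   {weight_gb(31e9, 16):.1f} GB")    # 62.0 GB
```

Budget a few extra GB on top of the weights for context: the KV cache grows with both context length and batch size.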

LM Studio also supports Gemma 4 through the model browser. The GPU guide covers hardware expectations by tier if you're choosing hardware around these numbers.

Who should upgrade from Gemma 3

If you're running Gemma 3 12B today, the E4B is the obvious upgrade. Similar VRAM footprint, better benchmarks, cleaner license. Drop-in replacement on Ollama.

If you're on Gemma 3 27B, the 26B MoE is the equivalent slot in Gemma 4 with better numbers and slightly lower VRAM. The 31B is worth considering if your GPU can handle 19 GB. The performance gap between the 26B and 31B is larger than the gap between Gemma 3's 12B and 27B.

If you skipped Gemma entirely because of the license: now is the time to test it. The AIME and LCB scores for the 31B are the strongest argument for a single-GPU coding model in this VRAM range.

Using Gemma 4 with Bodega One

BYOLLM means the model choice is yours. Connect Bodega One to Ollama with Gemma 4 pulled. Any of the three sizes work as a provider preset. Local inference, no cloud dependency, nothing leaves your machine.
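Wiring this up yourself is one HTTP call against Ollama's documented local API. A minimal sketch of a non-streaming request to the /api/chat endpoint; the gemma4:31b tag comes from the release above, and localhost:11434 assumes a stock Ollama install:

```python
import json
from urllib import request

def ollama_chat_payload(prompt: str, model: str = "gemma4:31b") -> dict:
    """Build a non-streaming chat request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a token stream
    }

payload = ollama_chat_payload("Refactor this function to be pure.")

# Uncomment to send against a running Ollama instance:
# req = request.Request(
#     "http://localhost:11434/api/chat",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(json.loads(request.urlopen(req).read())["message"]["content"])
```

Swapping sizes is just a model-tag change: gemma4 for the E4B, gemma4:26b for the MoE.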

The 31B is the strongest single-GPU option right now for agentic coding tasks based on LCB and AIME scores. If your GPU can handle 19 GB at Q4_K_M, test it against your current pick. Use the VRAM calculator to verify your setup before pulling.

Waitlist is open at bodegaone.ai. Beta starts May 2026. Full launch July 6, 2026.

Ready to own your tools?

Beta opens May 2026. Complete 14 days and earn a $30 promo code.