Ranked by SWE-bench Verified, the benchmark that tests real GitHub issues, not toy problems. Opinionated picks for every VRAM tier. Not sure if local models are ready? Read our honest assessment.
47 models evaluated·9 research passes·SWE-bench, LCB, HumanEval+·Updated monthly
What should I run?
Pick your hardware. Get a model recommendation in seconds.
≤ 4 GB·Edge, integrated GPU, older laptops
★ Top Pick·BigCode RAIL-M·Ollama ✓
StarCoder2-3B
BigCode · 3B
The de facto standard for inline code completion. Continue.dev recommends it as the default FIM model. At 2 GB VRAM, it runs on almost any GPU — including integrated graphics with enough shared memory. Use the base model, not instruct — the fine-tuning actually hurts FIM quality. Not for chat; only for autocomplete.
† FIM fill-in-middle champion — not an instruct model
~2 GB·16K ctx·Inline autocomplete (FIM)
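FIM models don't chat; they complete the span between a prefix and a suffix. A minimal sketch of the prompt layout, assuming the StarCoder-family sentinel tokens (verify them against the model's tokenizer config before relying on this):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # Sentinel tokens as used by the StarCoder family (assumption: check
    # the model's tokenizer config; other FIM models use different tokens).
    # The model generates the code that belongs between prefix and suffix.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(2, 3))")
```

Your editor plugin (e.g. Continue.dev) builds this prompt for you; the sketch only shows why a base model with FIM training beats an instruct tune here: the task is span completion, not conversation.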
MoE·Apache 2.0
Granite 4.0 Tiny
IBM · 7B MoE / 1B active
HumanEval 82.41%
IBM's MoE entry in the sub-4GB tier. 82.41% HumanEval from a model that fits on a laptop GPU is genuinely impressive. The trade-off: 8K context is a hard limitation for anything beyond single-file edits. Apache 2.0 — cleanest possible license for enterprise use. IBM tracks data provenance per token, which matters for compliance teams.
~4 GB·8K ctx·Code generation
MIT·Ollama ✓
DeepSeek-R1-Distill-Qwen-1.5B
DeepSeek · 1.5B
LiveCodeBench 16.9%
The smallest model with chain-of-thought reasoning. Distilled from DeepSeek-R1. 16.9% LCB is modest but it reasons through problems instead of pattern-matching. MIT license. Useful for edge devices, CI environments, or when you genuinely have no VRAM budget. Don't expect production-grade code — expect a model that thinks before it answers.
~1.2 GB·128K ctx·Reasoning on edge devices
Apache 2.0
SmolLM3-3B
HuggingFace · 3B
LiveCodeBench 30%
HuggingFace's flagship compact model. 30% LCB from a 3B that fits in 2 GB VRAM is a real milestone — it beats every previous 3B coding model and reaches the range of some 7B models from 2024. Apache 2.0. 64K context. Not yet in the official Ollama library, but GGUF files are on HuggingFace. The pick for edge deployments, CI environments, or devices where even 4 GB isn't available. Solid code generation within genuine hardware constraints.
† Surpasses every previous 3B model on coding — near 7B 2024-class quality
~2 GB·64K ctx·Edge coding - top accuracy at 2 GB
Apache 2.0·Ollama ✓
Qwen3.5-4B
Alibaba · 4B
LiveCodeBench 55.8%
The best coding model under 4 GB that doesn't get enough attention. 55.8% LCB at 4B parameters puts it ahead of some 7B models from 2024. 262K context window at 3 GB VRAM is unmatched in this tier — every other tiny model tops out at 64K or less. Multimodal: reads images, screenshots, and diagrams alongside code. Apache 2.0. Available on Ollama. If your hardware limit is a 4 GB GPU or you're deploying to memory-constrained devices, this is the pick.
† 262K context and multimodal at 3 GB — ahead of some 7B models from 2024
~3 GB·262K ctx·Edge coding with long context
MIT·Ollama ✓
Phi-4-mini
Microsoft · 3.8B
LiveCodeBench 19.9%
HumanEval+ 68.3%
74.4% HumanEval at 3.8B and 2.3 GB VRAM. MIT license. The caveat: LiveCodeBench is only 19.9% — it handles function-level code generation well but struggles with harder algorithmic problems. The reason to pick it over Qwen3.5-4B: brand diversity, MIT license, and solid HumanEval score. Best choice if you're in a Microsoft-oriented stack or need a non-Qwen option at the ultra-small tier.
† Strong HumanEval but weak LCB — better at function-level tasks than hard algorithmic problems
~2.3 GB·128K ctx·Edge coding — Microsoft stack
Apache 2.0
Bonsai 8B
PrismML · 8B (1-bit)
1-bit quantization stores each weight in a single bit instead of 16 or 32. That gets 8B parameters down to 1.15 GB of RAM. Commercially licensed under Apache 2.0. First 1-bit model to hold up on real tasks: chat, summarization, tool calling. No Ollama support yet. Load via Hugging Face Transformers or PrismML's custom llama.cpp fork at github.com/PrismML-Eng/llama.cpp. Good fit for embedded hardware, edge deployments, or anything where every GB matters.
† No standard benchmark results published yet. Community reports competitive with Qwen3.5-8B on general tasks.
~1.2 GB·4K ctx·Extreme RAM-constrained devices
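The arithmetic behind that footprint is simple: weight memory is parameter count times bits per weight, divided by 8. A sketch (raw weights only; embeddings, quantization scales, and KV cache account for the rest of the quoted 1.15 GB):

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GiB: params x bits / 8 bytes per byte."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

print(round(weight_gib(8, 16), 1))  # fp16 baseline for an 8B: ~14.9 GiB
print(round(weight_gib(8, 1), 2))   # 1-bit: ~0.93 GiB
```

The same function explains every VRAM figure in this guide: Q4_K_M is roughly 4.5 bits per weight, which is why a 7B model lands near 4 GiB.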
Apache 2.0
Bonsai 1.7B
PrismML · 1.7B (1-bit)
Runs in under 256 MB of RAM using 1-bit quantization. Apache 2.0 license with no usage caps. Confirmed running on Raspberry Pi 5 at useful speeds. The use case is narrow but real: anywhere you need an LLM and have less than 512 MB to spare. Load via PrismML's llama.cpp fork or Hugging Face Transformers. No Ollama support yet.
† No published benchmarks. Designed for Raspberry Pi 5 and sub-1 GB RAM hardware.
~0.24 GB·4K ctx·Raspberry Pi 5 / embedded AI
5–8 GB·RTX 3050/4050, MacBook M1/M2 base
★ Top Pick·MoE·MIT·Ollama ✓
GLM-4.7-Flash
Zhipu AI (Z.ai) · ~9B active (MoE)
SWE-bench Verified† 59.2%
LiveCodeBench 84.9%
HumanEval 94.2%
The single most surprising model in this guide. 84.9% LCB and 94.2% HumanEval from a MoE that runs on a 6GB laptop GPU. For competitive programming and pure code generation, it outperforms most 32B dense models. SWE-bench is 59.2% (the 73.8% score belongs to its full 355B parent model — not Flash). MIT license. Available on Ollama. The sleeper pick of 2026.
† SWE-bench for Flash distill (NOT the full 355B parent's 73.8%)
~6 GB·200K ctx·Competitive programming & code gen
★ Top Pick·MIT
DeepSeek-R1-0528-Qwen3-8B
DeepSeek · 8B
LiveCodeBench 60.5%
60.5% LCB at 8B and 5GB VRAM. That's what 32B models scored in 2024. The jump from older R1-Distill-Qwen-7B (37.6% LCB) happened because this was distilled from a fundamentally stronger teacher — R1-0528, which itself jumped +9.8 LCB over the original R1. Best reasoning model at the laptop GPU tier. MIT license.
† Distilled from the updated R1-0528 teacher — dramatically better than older R1 distills
~5 GB·128K ctx·Reasoning + coding at 8B
MoE·Apache 2.0
Qwen3-30B-A3B
Alibaba · 30B / 3.3B active (MoE)
30 billion parameters of knowledge accessible through a 3.3B-active-parameter window — all on 6GB VRAM. The 256K context window is extraordinary for this tier; every other laptop GPU model tops out at 128K or less. Apache 2.0. Best pick when you need to hold a large codebase in context on limited hardware. Speed is slower than a pure 8B dense model.
† 256K context at laptop VRAM — major differentiator
~6 GB·256K ctx·Long-context coding at low VRAM
★ Top Pick·Apache 2.0
Granite 3.3 8B Instruct
IBM · 8B
HumanEval 89.73%
A massive jump from IBM's old code-specific Granite models. 89.73% HumanEval at 8B and 5GB VRAM. Apache 2.0 with IBM's full data provenance tracking — every training token is documented. The cleanest enterprise license in this tier. If your legal team needs to audit the training data, this is the only sub-8GB model that can satisfy that requirement.
~5 GB·128K ctx·Enterprise code generation
MIT
Seed-Coder-8B Instruct
ByteDance Seed · 8B
HumanEval 84.8%
Released June 2025 with 6 trillion tokens of training data — more code-specific pretraining than almost any model at this size. 84.8% HumanEval is state-of-the-art for an 8B model. MIT license. Essentially unknown outside research circles despite being one of the strongest 8B code models available. Requires GGUF download from HuggingFace — not in Ollama library.
~5 GB·64K ctx·Code generation
Apache 2.0·Ollama ✓
Yi-Coder-9B
01.ai · 9B
HumanEval 85.4%
85.4% HumanEval at 9B, 128K context, Apache 2.0, and full Ollama support. Supports 52 programming languages — broader than most models in this tier. The best Apache 2.0 coding model from a non-Alibaba Chinese lab at the laptop GPU size. Clean license, strong benchmark, straightforward to run.
~6 GB·128K ctx·Multilingual code generation
Hybrid·Apache 2.0·Ollama ✓
MiniCPM4-8B
Tsinghua / OpenBMB · 8B
Built for on-device and battery-constrained environments. Uses InfLLM v2 infinite inference architecture for 7x faster inference than comparable models on end-device chips. Quality matches Qwen3-8B. Apache 2.0. Best pick if you're deploying to user machines or a constrained environment where inference speed and power draw matter as much as benchmark scores.
† Matches Qwen3-8B quality with 7x faster inference on device chips
~5 GB·32K ctx·Edge / on-device deployment
Apache 2.0
InternLM3-8B
Shanghai AI Lab · 8B
LiveCodeBench 17.8%
HumanEval 82.3%
Trained on only 4 trillion tokens — extraordinarily data-efficient for its benchmark scores. Supports a deep thinking mode for harder problems. 82.3% HumanEval beats Llama 3.1 8B and Qwen2.5-7B at the same size. Apache 2.0. LCB is modest at 17.8% — stronger on function-level tasks than algorithmic competition problems. Community GGUF available.
~5 GB·128K ctx·Code generation with thinking mode
Llama 3.1·Ollama ✓
Llama 3.1 8B Instruct
Meta · 8B
HumanEval 72.6%
The baseline model the community measures everything else against. 72.6% HumanEval — not the highest in this tier, but Llama 3.1 8B has 108M+ Ollama pulls. Every local IDE integration (Continue.dev, Aider, etc.) has explicit docs for it. Tutorials, community support, and known behavior are unmatched at this size. The Llama license has a >700M MAU cap — check it before building a commercial product.
† 108M+ Ollama pulls — the ecosystem standard everyone compares against
~5 GB·128K ctx·General-purpose baseline
Apache 2.0·Ollama ✓
Qwen2.5-Coder-7B
Alibaba · 7B
HumanEval+ 84.1%
88.4% HumanEval at 7B and 5 GB VRAM. The gap between this and the 14B is minimal on single-file tasks (88.4% vs 89%). Apache 2.0. Ollama support. Strong FIM support for inline autocomplete. If you want a serious coding model on a MacBook M1 or an RTX 3050, this is the most efficient pick in the Qwen2.5-Coder family.
† Near-identical to the 14B on single-file tasks — 2x more VRAM-efficient
~5 GB·128K ctx·Code generation at laptop VRAM
Apache 2.0·Ollama ✓
Qwen3-8B
Alibaba · 8B
Qwen3's 8B dense member — a generalist with built-in thinking mode. Handles code, analysis, math, and multi-step reasoning within 5 GB VRAM. Improved instruction following over Qwen2.5-7B, which shows in complex prompts and continued conversations. Apache 2.0. Ollama support. The practical default if you want a thinking-capable model at 5 GB VRAM without chasing specialized coding benchmarks.
† Thinking mode included — better instruction following than Qwen2.5-7B across multi-step tasks
~5 GB·128K ctx·General-purpose coding assistant
Apache 2.0·Ollama ✓
Qwen3.5-9B
Alibaba · 9B
LiveCodeBench 65.6%
65.6% LCB at 9B is genuinely impressive — most 14B models from 2025 don't score that high. 262K context window extensible to 1M via YaRN. Multimodal: reads images, screenshots, and diagrams alongside code. Gated DeltaNet hybrid architecture (3:1 linear-to-softmax attention ratio) makes the long context practical without the VRAM penalty. Apache 2.0. Ollama support. If you're choosing between Qwen3-8B and Qwen3.5-9B on similar hardware — pick the 9B. It's stronger on real coding tasks at essentially the same VRAM.
† Outperforms Qwen3-30B (3x its size) on reasoning — 262K context, multimodal
~6.6 GB·262K ctx·General-purpose coding assistant
Apache 2.0·Ollama ✓
Gemma 4 E4B
Google DeepMind · ~11B total / 4B effective
LiveCodeBench 52%
Google's entry into the laptop-GPU tier, and the default Gemma 4 Ollama pull. E4B means 4B effective parameters: the model actually has ~11B total parameters but uses per-layer embeddings and alternating attention to run inference like a 4B while drawing on knowledge from the much larger base. 52% LCB at 5 GB VRAM is competitive with other models in this tier. Multimodal: reads images, diagrams, and screenshots alongside code. 128K context. Apache 2.0, cleaner than the old Gemma license, which required a separate Google terms review. If you already run Gemma 3, this is a drop-in upgrade.
† E = Effective params. Per-layer embeddings + alternating attention — runs like 4B, draws on 11B knowledge.
~5 GB·128K ctx·Multimodal coding at laptop VRAM
8–12 GB·RTX 3060 / 4060, MacBook M2 Pro
★ Top Pick·MIT
Phi-4-Reasoning
Microsoft · 14B
LiveCodeBench 53.8%
HumanEval+ 92.9%
Microsoft's hidden gem. 92.9% HumanEval+ — the same score as o1-mini — at 14B and 9GB VRAM. 53.8% LCB is strong for the size. The key differentiator is chain-of-thought: it reasons through problems before answering, catching logical errors that direct-generation models miss. MIT license. Trained on high-density synthetic reasoning data. Not on Ollama but GGUF available.
† HumanEval+ ties o1-mini
~9 GB·32K ctx·Reasoning-heavy coding
MIT·Ollama ✓
DeepSeek-R1-Distill-Qwen-14B
DeepSeek · 14B
LiveCodeBench 53.1%
Distilled from the original DeepSeek-R1 (not the stronger R1-0528 update). 53.1% LCB at 14B, MIT license, available on Ollama — the most accessible reasoning model at this VRAM tier. Strong for algorithm problems and math-heavy code. Note: if you can tolerate a GGUF download, the 8B R1-0528 distill actually scores higher on LCB at lower VRAM.
~9 GB·128K ctx·Reasoning + coding
Apache 2.0·Ollama ✓
Qwen3-14B
Alibaba · 14B
The balanced choice at the 9GB tier. Strong general coding ability, thinking mode for harder problems, 128K context, Ollama support, and Apache 2.0 license. Doesn't top any single benchmark but performs reliably across code generation, explanation, refactoring, and debugging. The default recommendation when someone asks "what 14B model should I run?"
Gemma Terms·Ollama ✓
Gemma 3 12B
Google DeepMind · 12B
HumanEval 85.4%
The strongest model in the 8–12 GB tier with multimodal capability — it reads screenshots, UI mockups, diagrams, or error images alongside code. 85.4% HumanEval, 128K context, Ollama support. License is Google's Gemma Terms: commercial use is allowed for most applications, but it requires reviewing Google's terms — not as clean as Apache 2.0 or MIT. Best pick if your workflow involves visual inputs.
† Multimodal — reads images and diagrams alongside code
~8 GB·128K ctx·Multimodal code generation
★ Top Pick·Apache 2.0·Ollama ✓
Qwen2.5-Coder-14B
Alibaba · 14B
HumanEval 89.1%
The top code model in the 8–12 GB VRAM range. 89.1% HumanEval, 128K context, Apache 2.0, Ollama support, FIM for autocomplete. The sweet spot of the Qwen2.5-Coder family: the 14B hits 89%+ on HumanEval while the 32B adds roughly 4 more points at 2x the VRAM. If you have mid-tier hardware and want the best pure coding performance available, this is it.
† Best HumanEval score in the 8–12 GB tier
~10 GB·128K ctx·Code generation — best in tier
12–16 GB·RTX 3080 / 4070, MacBook M2 Pro 16GB
★ Top Pick·Apache 2.0·Ollama ✓
Devstral Small 2
Mistral AI · 24B
SWE-bench Verified 68%
The best Apache 2.0 model that runs on a single consumer GPU. 68% SWE-bench — up from 46.8% at v1. That's a 21-point jump in one release, the largest single-model improvement of 2025. 256K context. Available on Ollama. Purpose-built for agentic workflows: fixing real GitHub issues, multi-file edits, running tests. If you need commercial-clean agentic coding on one card, this is the pick.
~16 GB·256K ctx·Agentic coding (Apache 2.0)
Mistral CL·Ollama ✓
Codestral 22B
Mistral AI · 22B
HumanEval 86.6%
86.6% HumanEval and excellent fill-in-middle performance for both chat and autocomplete use cases. 256K context on a 14GB GPU. The catch: Mistral's Commercial License means you need a Mistral agreement for production deployment — it's free for personal/research use only. If you're building a product, factor that in. For personal coding work, it's an outstanding model at this tier.
~14 GB·256K ctx·FIM completion + code gen
Apache 2.0·Ollama ✓
Mistral Small 3.2
Mistral AI · 24B
HumanEval 92.9%
Strong generalist model at 24B. 92.9% HumanEval is Pass@5 (five attempts, best selected) — not Pass@1. Individual generation quality is lower than that number implies. Still a capable coding model with 128K context and Apache 2.0. Good choice if you want a well-rounded assistant rather than a specialized coding model. Ollama support makes setup trivial.
† HumanEval score is Pass@5, not Pass@1 — individual generation quality lower
~15 GB·128K ctx·General coding assistant
MoE·Apache 2.0
GPT-OSS-20B
OpenAI · 20B / 3.6B active (MoE)
The headline: this is OpenAI's first open-weight release since GPT-2. Apache 2.0 — the most permissive license OpenAI has ever shipped. 20B total, 3.6B active (MoE), runs within 16 GB VRAM, available on Ollama with 7.7M+ pulls. Community reception was mixed — independent testing found it performs solidly but doesn't stand out against Qwen3.5 or DeepSeek at similar parameter counts. The reason to run it: drop-in API compatibility with OpenAI's format and the widest ecosystem support of any open-weight model.
† OpenAI's first open-weight model since GPT-2 (2019). Apache 2.0.
~14 GB·128K ctx·OpenAI-compatible local model
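That API compatibility means any OpenAI-format client can talk to a local runtime serving it. A sketch of the request body; the model tag and endpoint path are assumptions, so substitute your runtime's values:

```python
import json

# Hypothetical local model tag; your runtime may name it differently.
payload = {
    "model": "gpt-oss-20b",
    "messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Reverse a string in Python."},
    ],
    "temperature": 0.2,
}
# POST this to your runtime's OpenAI-compatible endpoint,
# typically something like /v1/chat/completions.
body = json.dumps(payload)
```

Because the message schema matches OpenAI's chat format, existing SDKs and IDE integrations work by pointing their base URL at the local server.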
MoE·Apache 2.0·Ollama ✓
Gemma 4 26B
Google DeepMind · 26B / 4B active (MoE)
LiveCodeBench 70%
The practical Gemma 4 for the 16 GB tier. MoE architecture: 26B total parameters, 4B active per inference step. That ratio keeps VRAM at ~14 GB while giving the model access to 26B worth of knowledge in routing decisions. 256K context. 70% LCB puts it in range of the best models in this tier. Ranks #6 among all open models on Arena AI. Apache 2.0, with no usage restrictions to review. Multimodal: text, image, and audio inputs. Ollama support with a direct pull command.
† #6 open model on Arena AI (April 2026). MoE: 26B total, 4B active per token.
~14 GB·256K ctx·Agentic coding — Apache 2.0
16–24 GB·RTX 3090 / 4090, Mac M2 Max 32GB
★ Top Pick·Apache 2.0·Ollama ✓
Gemma 4 31B
Google DeepMind · 31B
LiveCodeBench 80%
The top-of-stack Gemma 4 and the most significant open-weight release of early 2026. 80% LCB and 89.2% AIME; the AIME jump from Gemma 3's 20.8% is the largest single-generation reasoning improvement in open-source model history. Ranks #3 among all open models on the Arena AI text leaderboard. 256K context. Apache 2.0, commercially clean with no usage caps. Fits on a single RTX 4090 or Mac M2 Max at Q4_K_M (~19 GB). Multimodal: reads images, diagrams, and audio alongside code. For serious agentic workflows on a single consumer GPU, this is the new benchmark.
† #3 open model on Arena AI (April 2026). AIME 2026: 89.2% (Gemma 3 was 20.8%).
~19 GB·256K ctx·Agentic coding + reasoning
★ Top Pick·Apache 2.0·Ollama ✓
Qwen3.5-27B
Alibaba · 27B
SWE-bench Verified 72.4%
LiveCodeBench 80.7%
Best SWE-bench score that fits on a single consumer GPU. 72.4% SWE-bench, 80.7% LCB — outperforming its own 122B-A10B MoE sibling on coding tasks because dense beats MoE when full parameter engagement matters for complex multi-file reasoning. Gated DeltaNet hybrid architecture means 256K context at practical speed. Apache 2.0. Ollama support. The single-GPU pick of 2026.
† Outperforms the 122B-A10B MoE sibling on coding — dense beats MoE here
~17 GB·256K ctx·Agentic coding — best single GPU
Apache 2.0
KAT-Dev-32B
Kwaipilot (Kuaishou AI) · 32B
SWE-bench Verified 62.4%
KAT = Kwai-AutoThink. Built by Kuaishou AI (the ByteDance competitor). 62.4% SWE-bench at 32B — runs on a single RTX 3090 or 4090. Almost entirely missed by mainstream AI coverage. Beats Qwen2.5-Coder-32B on real-world agentic tasks through a 3-stage training pipeline: SFT, RL on coding tasks, RLHF alignment. Apache 2.0. GGUF available via community.
~20 GB·128K ctx·Agentic coding
Apache 2.0·Ollama ✓
Qwen3-32B
Alibaba · 32B
HumanEval 72.05%
The 32B version of Qwen3 — strong generalist with thinking mode, 128K context, Ollama support, Apache 2.0. For pure coding performance in this VRAM tier, Qwen3.5-27B (72.4% SWE-bench) is the better pick. Qwen3-32B is the choice when you want a generalist that handles code, analysis, and reasoning equally well within the same VRAM budget.
† EvalPlus score; strong generalist with thinking mode
~20 GB·128K ctx·General coding assistant
Apache 2.0
OLMo 3.1 32B Think
Allen AI · 32B
LiveCodeBench 83.3%
HumanEval+ 91.5%
91.5% HumanEval+ and 83.3% LCB are legitimate frontier-level numbers. But the real story is the license: Apache 2.0 with fully open training data under the ODC-BY license (Dolma dataset). If your organization needs to audit the training data for IP or compliance reasons — not just model weights — this is the only competitive coding model that can satisfy that requirement.
~20 GB·128K ctx·Fully open / compliance use
Research NC
EXAONE Deep 32B
LG AI Research · 32B
LiveCodeBench 59.5%
LG AI Research's reasoning model. 59.5% LCB — competitive with DeepSeek-R1-Distill-Qwen-32B on competitive programming despite being a different architecture. Requires a `<thought>` tag prefix to activate reasoning mode. Non-commercial research license — not suitable for product deployment. GGUF available. Best use: hard algorithm problems, AIME/math olympiad style tasks, research environments.
~18 GB·32K ctx·Algorithm reasoning
Llama 3.1
Hermes 4.3 36B
NousResearch · 36B
One remarkable property: 512K context on a single consumer GPU. No other single-GPU model comes close. Built on a ByteDance Seed base, post-trained using a decentralized compute network (Solana/Psyche). The Llama 3.1 license limits use to <700M MAU. If your use case requires holding enormous codebases, documentation corpora, or long conversation histories in context, this is the only option that fits on one card.
† 512K context — only single-GPU model in this class
~22 GB·512K ctx·Massive context window
★ Top Pick·Apache 2.0·Ollama ✓
Qwen2.5-Coder-32B
Alibaba · 32B
HumanEval+ 86.2%
The 2025 community gold standard for local coding agents. 92.7% HumanEval, 86.2% HumanEval+ — among the highest code generation scores of any model that fits on a single consumer GPU. Apache 2.0. Ollama support. 128K context. Purpose-built for software engineering, and the 32B hit the sweet spot: strong enough for real production code, small enough for an RTX 3090 or Mac M2 Max. Qwen3.5-27B has since taken the SWE-bench crown (72.4%), but for raw code generation, Qwen2.5-Coder-32B is still a top-tier pick.
† 2025 community gold standard for local coding agents
~20 GB·128K ctx·Code generation — community gold standard
MoE·Apache 2.0·Ollama ✓
Qwen3.5-35B-A3B
Alibaba · 35B / 3B active (MoE)
SWE-bench Verified 69.2%
The MoE alternative to Qwen3.5-27B in the 20 GB tier. 69.2% SWE-bench — 3 points below the dense 27B sibling (72.4%), but faster throughput because only 3B parameters activate per token. 256K context. Apache 2.0. The trade-off makes sense for agentic loops: when your workflow involves hundreds of model calls across a long session, that inference speed difference adds up more than the benchmark gap.
† MoE — faster inference than the 27B dense sibling at similar VRAM
~22 GB·256K ctx·Agentic coding — speed-optimized
Hybrid·NVIDIA Nemotron License·Ollama ✓
Nemotron 3 Nano 30B-A3B
NVIDIA · 31.6B / 3.2B active (MoE)
LiveCodeBench 68.3%
HumanEval 78.05%
NVIDIA's first consumer-runnable open model. 78.05% HumanEval and 68.3% LCB from a hybrid Mamba-2 + Transformer architecture — the only model at this tier built on SSM technology. 1M token context is practical here because Mamba-2 processes sequences in linear time, not quadratic. Outperforms Qwen3-30B-A3B on both math (82.88% vs 61.14%) and code (78.05% vs 70.73%). Fits on a single RTX 3090/4090. NVIDIA's own Nemotron license — commercially permissive, but review the terms before shipping a product.
† Hybrid Mamba-2 + Transformer — 1M context at 24 GB; outperforms Qwen3-30B-A3B on math and code
~24 GB·1M ctx·Long-context coding on 24 GB GPU
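The linear-versus-quadratic claim is the whole story at 1M tokens. A growth-rate sketch (interaction counts only; real costs include constants and hidden dimensions this ignores):

```python
def attention_interactions(n: int) -> int:
    # Standard attention: every token attends to every token.
    return n * n

def ssm_interactions(n: int) -> int:
    # Mamba-style state-space scan: one linear pass over the sequence.
    return n

# At a 1M-token context the pairwise-work gap is a factor of n itself.
ratio = attention_interactions(1_000_000) // ssm_interactions(1_000_000)
print(ratio)  # 1000000
```

This is why a hybrid architecture can advertise a 1M context on a 24 GB card while pure-transformer peers stop at 128K–256K.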
40–48 GB·Dual RTX 3090, A6000, Mac Pro 192GB
MoE·Apache 2.0
Mistral Small 4
Mistral AI · 119B (6B active, MoE)
The name is misleading. "Small" refers to 6B active parameters per forward pass, not total model size. Total is 119B across 128 experts, with 4 active per token. At Q4_K_M quantization all 119B weights load into ~67 GB VRAM, which is above a standard workstation card but within range of Mac Pro 192GB unified memory or a single A100 80GB. The payoff: 256K context window, Apache 2.0 license, strong multilingual benchmark results. No Ollama support at launch. Load via llama.cpp or Hugging Face Transformers.
† Strong multilingual benchmarks. "Small" refers to 6B active params per forward pass, not total model size.
~67 GB·256K ctx·High-context MoE on high-VRAM workstations
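The key point for MoE sizing: all experts must be resident in memory, so VRAM scales with total parameters, while only compute scales with the 6B active. A sketch of the estimate, assuming Q4_K_M averages ~4.5 bits per weight (the exact figure varies by tensor):

```python
def q4_weight_gb(total_params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate quantized weight storage in decimal GB.

    For MoE models, pass TOTAL parameters, not active parameters:
    every expert's weights must be loaded even if few fire per token.
    """
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(round(q4_weight_gb(119), 1))  # ~66.9 GB, in line with the ~67 GB figure
```

Plugging in the 6B active count instead would give ~3.4 GB, which is exactly the trap the "Small" name invites.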
★ Top Pick·MoE·Qwen License
Qwen3-Coder-Next
Alibaba · 80B / 3B active (MoE)
SWE-bench Verified 71.3%
LiveCodeBench 70.6%
71.3% SWE-bench on an 80B MoE that activates only 3B parameters per token. Purpose-built for software engineering. The catch: requires ~45-49GB VRAM at Q4_K_M — not a single RTX 4090. You need a 48GB workstation card (A6000, RTX 6000), a dual-GPU setup, or Apple Silicon 64GB+. If you have the hardware, this is the best locally-runnable coding model in the world.
† Needs 48GB+ card or dual GPU — NOT runnable on single RTX 4090
~45–49 GB·256K ctx·Frontier agentic coding
★ Top Pick·Apache 2.0
KAT-Dev-72B-Exp
Kwaipilot (Kuaishou AI) · 72B
SWE-bench Verified† 74.6%
74.6% SWE-bench at release made this the highest-scoring open-weight coding model in the world for several weeks. Dense 72B — same VRAM tier as Llama 3.3 70B and Kimi-Dev-72B. Apache 2.0. Community GGUF available. Built by Kwaipilot, Kuaishou's developer tooling team. Still almost entirely unknown outside Chinese research circles. If you have dual 3090s and want top-tier SWE-bench performance with a clean license, this is worth serious consideration.
Apache 2.0
Kimi-Dev-72B
Moonshot AI · 72B
SWE-bench Verified 60.4%
Purpose-built via large-scale reinforcement learning on Docker test suites — the model literally practiced resolving GitHub issues in containerized environments. 60.4% SWE-bench. Apache 2.0. Best MIT/Apache-licensed 72B coding model when VRAM budget allows. Community GGUF available. Kimi's cloud API is easier, but the weights are yours if you run it locally.
~40 GB·128K ctx·Agentic coding
MIT·Ollama ✓
DeepSeek-R1-Distill-Llama-70B
DeepSeek · 70B
SWE-bench Verified 49.2%
LiveCodeBench 57.5%
Reasoning-capable distill of the original DeepSeek-R1 at 70B. 57.5% LCB, 49.2% SWE-bench, MIT license, Ollama support. The main trade-off vs newer models: distilled from the original R1 (not R1-0528), so it's been surpassed on coding benchmarks. Still useful if you specifically want reasoning chains at 70B scale with a clean MIT license and want Ollama convenience.
~40 GB·128K ctx·Reasoning at 70B
Llama 3.3·Ollama ✓
Llama 3.3 70B
Meta · 70B
HumanEval 88.4%
88.4% HumanEval, 128K context, Ollama support — the workhorse 70B. The Llama license has a >700M MAU cap. For pure agentic coding at 40GB VRAM, KAT-Dev-72B-Exp (74.6% SWE-bench) and Kimi-Dev-72B (60.4% SWE-bench) are stronger. Llama 3.3 is the choice when you want a capable generalist that handles code as one of many tasks rather than a specialized coding agent.
~40 GB·128K ctx·General-purpose coding
All these models and more work in Bodega One
No config files. No YAML. Pick a model, connect a provider, start coding. One-time purchase.
SWE-bench tells you which models write code. These picks cover everything else: reasoning, research, writing, and math. All run locally, all work in Bodega One.
Reasoning
Chain-of-thought analysis, logical problem solving, and extended thinking for complex multi-step tasks.
~5 GB
DeepSeek-R1-0528-Qwen3-8B
60.5% LCB at 5GB. Best reasoning per watt available.
~10 GB
Phi-4-Reasoning
Distilled from o3-mini. Math and logic specialist from Microsoft.
~5 GB
OLMo 3.1 Think
Fully open Apache 2.0 thinking model. No license restrictions.
Long-context research
Document analysis, knowledge synthesis, and multi-source research requiring large context windows.
~24 GB
Hermes 4.3 36B
512K context window. Reads entire codebases or document sets.
~18 GB
Qwen3.5-27B
Best dense model at this weight. Strong on long-context and reasoning.
Server
Llama 3.3 70B
128K context. Meta flagship, top open-weight instruction follower.
Writing & editing
Prose, documentation, structured output, and natural instruction following for content tasks.
~5 GB
Qwen3-8B
Punches above its weight. Excellent at structured writing at 5GB.
Server
Llama 3.3 70B
Best open-weight instruction follower at any size class.
~8 GB
Mistral Nemo 12B
Strong multilingual writing. Apache 2.0, runs on 8GB cards.
Math & science
Symbolic computation, step-by-step proofs, competition math, and STEM reasoning tasks.
~10 GB
Phi-4-Reasoning
Purpose-built for mathematical reasoning. Top performer at 10GB.
~5 GB
DeepSeek-R1-0528-Qwen3-8B
Extended thinking mode. Strong on competition-level math.
~20 GB
QwQ-32B
72.9% MATH-500. Qwen reasoning model, math specialist.
Local AI that actually works
Every model on this page runs inside a full IDE with AI chat and an autonomous coding agent. Your data stays on your machine.
HumanEval is saturated
GLM-4.7-Flash scores 94.2% HumanEval on a 6GB laptop GPU. The benchmark no longer separates models. SWE-bench Verified and LiveCodeBench are the only meaningful signals for 2026.
Dense beats MoE on hard tasks
Qwen3.5-27B (dense, 27B params) outperforms Qwen3.5-122B-A10B (MoE, 10B active) on coding. When complex multi-file reasoning needs full parameter engagement, dense wins.
The 8B tier is now actually good
DeepSeek-R1-0528-Qwen3-8B scores 60.5% LCB at 5GB VRAM. That's what 32B models scored in 2024. Entry-level hardware is now competitive.
Devstral's 21-point jump
Devstral Small went from 46.8% to 68% SWE-bench between v1 and v2. The largest single-model improvement of the year. Best Apache 2.0 coding model on a single GPU.
Quantization matters sub-8B
Q4_K_M causes ~8-10% variance on coding tasks at 7B. Use Q6_K or Q8_0 for models under 8B. Q4 is fine at 14B and above.
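That rule of thumb can be written down directly. The thresholds below mirror this guide's guidance, not a measured law; the unaddressed 8–14B gap is handled conservatively:

```python
def recommended_quant(params_billion: float) -> str:
    """Pick a quantization level by parameter count (guide's rule of thumb)."""
    if params_billion < 8:
        return "Q6_K or Q8_0"   # Q4 costs ~8-10% on coding tasks at this size
    if params_billion < 14:
        return "Q6_K to be safe"  # the guidance only blesses Q4 at 14B+
    return "Q4_K_M"             # quality loss is negligible at 14B and above

print(recommended_quant(7))
print(recommended_quant(32))
```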
Context floor: 32K minimum
8K context disqualifies a model for repo-level work. 32K is the minimum. 64K–128K is the sweet spot. Larger than 128K can hurt via 'lost in the middle' degradation.
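A quick way to sanity-check whether a file set fits a given window: estimate tokens from characters. A sketch using the common ~4 characters-per-token rule of thumb for code (real tokenizers vary, so treat this as a first pass only):

```python
def fits_in_context(total_chars: int, ctx_tokens: int,
                    chars_per_token: float = 4.0) -> bool:
    """Rough token estimate: chars / chars_per_token vs. the context window."""
    return total_chars / chars_per_token <= ctx_tokens

print(fits_in_context(100_000, 32_768))  # ~25K tokens in a 32K window: True
print(fits_in_context(500_000, 32_768))  # ~125K tokens in a 32K window: False
```

Remember to reserve room for the system prompt, conversation history, and the model's own output before declaring a fit.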
Rankings updated monthly. Get notified when they change.
Beyond consumer hardware
These models require server infrastructure or multi-GPU setups. They set the ceiling for what open-weight models can achieve.
Model · Org · Params · SWE-bench · Min VRAM · License
MiniMax M2.5 · MiniMax · 229B / 10B active · 80.2% · ~128 GB (3-bit) · Commercial OK
GLM-5 · Zhipu AI · 744B / 40B active · 77.8% · ~180 GB (2-bit) · MIT
Kimi K2.5 · Moonshot AI · 1T / 32B active · 76.8% · ~375 GB (2-bit) · Modified MIT
Qwen3.5-397B-A17B · Alibaba · 397B / 17B active · 76.4% · ~220 GB · Apache 2.0
KAT-Dev-72B-Exp (borderline consumer) · Kwaipilot · 72B · 74.6% · ~40 GB (dual GPU) · Apache 2.0
DeepSeek V3.2 · DeepSeek · 685B / 37B active · 74.1% · Server · MIT
GLM-4.7 (full) · Zhipu AI · 355B / 9B active · 73.8% · Server · MIT
MiMo-V2-Flash (150 tok/s via MTP) · Xiaomi · 309B / 15B active · 73.4% · Multi-GPU · Apache 2.0
Devstral 2 Large · Mistral AI · 123B · 72.2% · Multi-GPU · Apache 2.0
Qwen3.5-122B-A10B (LCB 78.9%; the 27B dense beats it on coding at 1/4 the VRAM) · Alibaba · 122B / 10B active · 72% · ~70–81 GB (multi-GPU) · Apache 2.0
Nemotron 3 Super 120B-A12B (LCB 81.19%; hybrid Mamba-2 MoE, 1M context, 7.5x faster than Qwen3.5-122B) · NVIDIA · 120.6B / 12.7B active · 60.47% · ~87 GB Q4 (64 GB+ unified) · NVIDIA Nemotron
Benchmark glossary
SWE-bench Verified
% of real GitHub issues resolved autonomously. A human validated each issue. The most practical benchmark: it tests actual software engineering, not toy problems. Frontier models top out around 80%.
LiveCodeBench (LCB)
Contamination-free competitive programming problems collected after the training cutoffs of the models being tested. Harder to game than HumanEval. Updated continuously.
HumanEval / HumanEval+
Code generation at function level. HumanEval is largely saturated. Multiple 6GB models score above 90%. Use LCB and SWE-bench for real discrimination. HumanEval+ has stricter tests than the original.
VRAM figures
All VRAM numbers are at Q4_K_M quantization unless noted. For models under 8B, use Q6_K or Q8_0. Q4 causes ~8-10% variance on coding tasks at that scale.