Local LLM Guide

The best local LLMs for software engineering

Ranked by SWE-bench Verified, the benchmark that tests real GitHub issues, not toy problems. Opinionated picks for every VRAM tier. Not sure if local models are ready? Read our honest assessment.

47 models evaluated · 9 research passes · SWE-bench · LCB · HumanEval+ · Updated monthly

What should I run?

Pick your hardware. Get a model recommendation in seconds.

≤ 4 GB · Edge, integrated GPU, older laptops
★ Top Pick · BigCode RAIL-M · Ollama ✓

StarCoder2-3B

BigCode · 3B

The de facto standard for inline code completion. Continue.dev recommends it as the default FIM model. At 2 GB VRAM, it runs on almost any GPU — including integrated graphics with enough shared memory. Use the base model, not instruct — the fine-tuning actually hurts FIM quality. Not for chat; only for autocomplete.

Fill-in-the-middle (FIM) champion — not an instruct model

~2 GB · 16K ctx · Inline autocomplete (FIM)
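The FIM claim above in concrete terms: a fill-in-the-middle request is just a specially ordered prompt. A minimal sketch, assuming the StarCoder-family special tokens (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`) — verify the exact token strings against the model card before relying on them:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    # Prefix-suffix-middle (PSM) ordering: the model is asked to
    # generate the missing "middle" after this prompt.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Cursor sits after "return " — the model fills in the expression.
before_cursor = "def add(a, b):\n    return "
after_cursor = "\n\nprint(add(2, 3))\n"
prompt = fim_prompt(before_cursor, after_cursor)
```

No chat template is involved anywhere in this flow, which is why the base model — not the instruct variant — is the right choice for autocomplete.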
MoE · Apache 2.0

Granite 4.0 Tiny

IBM · 7B MoE / 1B active

HumanEval: 82.41%

IBM's MoE entry in the sub-4GB tier. 82.41% HumanEval from a model that fits on a laptop GPU is genuinely impressive. The trade-off: 8K context is a hard limitation for anything beyond single-file edits. Apache 2.0 — cleanest possible license for enterprise use. IBM tracks data provenance per token, which matters for compliance teams.

~4 GB · 8K ctx · Code generation
MIT · Ollama ✓

DeepSeek-R1-Distill-Qwen-1.5B

DeepSeek · 1.5B

LiveCodeBench: 16.9%

The smallest model with chain-of-thought reasoning. Distilled from DeepSeek-R1. 16.9% LCB is modest but it reasons through problems instead of pattern-matching. MIT license. Useful for edge devices, CI environments, or when you genuinely have no VRAM budget. Don't expect production-grade code — expect a model that thinks before it answers.

~1.2 GB · 128K ctx · Reasoning on edge devices
Apache 2.0

SmolLM3-3B

HuggingFace · 3B

LiveCodeBench: 30%

HuggingFace's flagship compact model. 30% LCB from a 3B that fits in 2 GB VRAM is a real milestone — it beats every previous 3B coding model and reaches the range of some 7B models from 2024. Apache 2.0. 64K context. Not yet in the official Ollama library, but GGUF files are on HuggingFace. The pick for edge deployments, CI environments, or devices where even 4 GB isn't available. Solid code generation within genuine hardware constraints.

Surpasses every previous 3B model on coding — near 7B 2024-class quality

~2 GB · 64K ctx · Edge coding — top accuracy at 2 GB
Apache 2.0 · Ollama ✓

Qwen3.5-4B

Alibaba · 4B

LiveCodeBench: 55.8%

The best coding model under 4 GB that doesn't get enough attention. 55.8% LCB at 4B parameters puts it ahead of some 7B models from 2024. 262K context window at 3 GB VRAM is unmatched in this tier — every other tiny model tops out at 64K or less. Multimodal: reads images, screenshots, and diagrams alongside code. Apache 2.0. Available on Ollama. If your hardware limit is a 4 GB GPU or you're deploying to memory-constrained devices, this is the pick.

262K context and multimodal at 3 GB — ahead of some 7B models from 2024

~3 GB · 262K ctx · Edge coding with long context
MIT · Ollama ✓

Phi-4-mini

Microsoft · 3.8B

LiveCodeBench: 19.9%
HumanEval+: 68.3%

74.4% HumanEval at 3.8B and 2.3 GB VRAM. MIT license. The caveat: LiveCodeBench is only 19.9% — it handles function-level code generation well but struggles with harder algorithmic problems. The reason to pick it over Qwen3.5-4B: brand diversity, MIT license, and solid HumanEval score. Best choice if you're in a Microsoft-oriented stack or need a non-Qwen option at the ultra-small tier.

Strong HumanEval but weak LCB — better at function-level tasks than hard algorithmic problems

~2.3 GB · 128K ctx · Edge coding — Microsoft stack
Apache 2.0

Bonsai 8B

PrismML · 8B (1-bit)

1-bit quantization stores each weight in a single bit instead of 16 or 32. That gets 8B parameters down to 1.15 GB of RAM. Commercially licensed under Apache 2.0. First 1-bit model to hold up on real tasks: chat, summarization, tool calling. No Ollama support yet. Load via Hugging Face Transformers or PrismML's custom llama.cpp fork at github.com/PrismML-Eng/llama.cpp. Good fit for embedded hardware, edge deployments, or anything where every GB matters.

No standard benchmark results published yet. Community reports competitive with Qwen3.5-8B on general tasks.

~1.2 GB · 4K ctx · Extreme RAM-constrained devices
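The size math above is easy to check: one bit per weight means parameters divided by eight, in bytes, plus overhead for the pieces (embeddings, quantization scales) that stay at higher precision. A back-of-envelope sketch — the 15% overhead fraction is an illustrative assumption chosen to match the quoted figures, not a published number:

```python
def one_bit_size_gb(n_params: float, overhead_frac: float = 0.15) -> float:
    # 1 bit/weight -> n_params / 8 bytes; overhead_frac covers
    # higher-precision embeddings and scales (assumed fraction).
    return (n_params / 8 / 1e9) * (1 + overhead_frac)

print(round(one_bit_size_gb(8e9), 2))    # ~1.15 GB, the Bonsai 8B figure
print(round(one_bit_size_gb(1.7e9), 2))  # ~0.24 GB, the Bonsai 1.7B figure
```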
Apache 2.0

Bonsai 1.7B

PrismML · 1.7B (1-bit)

Runs in under 256 MB of RAM using 1-bit quantization. Apache 2.0 license with no usage caps. Confirmed running on Raspberry Pi 5 at useful speeds. The use case is narrow but real: anywhere you need an LLM and have less than 512 MB to spare. Load via PrismML's llama.cpp fork or Hugging Face Transformers. No Ollama support yet.

No published benchmarks. Designed for Raspberry Pi 5 and sub-1 GB RAM hardware.

~0.24 GB · 4K ctx · Raspberry Pi 5 / embedded AI
5–8 GB · RTX 3050/4050, MacBook M1/M2 base
★ Top Pick · MoE · MIT · Ollama ✓

GLM-4.7-Flash

Zhipu AI (Z.ai) · ~9B active (MoE)

SWE-bench Verified: 59.2%
LiveCodeBench: 84.9%
HumanEval: 94.2%

The single most surprising model in this guide. 84.9% LCB and 94.2% HumanEval from a MoE that runs on a 6GB laptop GPU. For competitive programming and pure code generation, it outperforms most 32B dense models. SWE-bench is 59.2% (the 73.8% score belongs to its full 355B parent model — not Flash). MIT license. Available on Ollama. The sleeper pick of 2026.

SWE-bench for Flash distill (NOT the full 355B parent's 73.8%)

~6 GB · 200K ctx · Competitive programming & code gen
★ Top Pick · MIT

DeepSeek-R1-0528-Qwen3-8B

DeepSeek · 8B

LiveCodeBench: 60.5%

60.5% LCB at 8B and 5GB VRAM. That's what 32B models scored in 2024. The jump from older R1-Distill-Qwen-7B (37.6% LCB) happened because this was distilled from a fundamentally stronger teacher — R1-0528, which itself jumped +9.8 LCB over the original R1. Best reasoning model at the laptop GPU tier. MIT license.

Distilled from the updated R1-0528 teacher — dramatically better than older R1 distills

~5 GB · 128K ctx · Reasoning + coding at 8B
MoE · Apache 2.0

Qwen3-30B-A3B

Alibaba · 30B / 3.3B active (MoE)

30 billion parameters of knowledge accessible through a 3.3B-active-parameter window — all on 6GB VRAM. The 256K context window is extraordinary for this tier; every other laptop GPU model tops out at 128K or less. Apache 2.0. Best pick when you need to hold a large codebase in context on limited hardware. Speed is slower than a pure 8B dense model.

256K context at laptop VRAM — major differentiator

~6 GB · 256K ctx · Long-context coding at low VRAM
★ Top Pick · Apache 2.0

Granite 3.3 8B Instruct

IBM · 8B

HumanEval: 89.73%

A massive jump from IBM's old code-specific Granite models. 89.73% HumanEval at 8B and 5GB VRAM. Apache 2.0 with IBM's full data provenance tracking — every training token is documented. The cleanest enterprise license in this tier. If your legal team needs to audit the training data, this is the only sub-8GB model that can satisfy that requirement.

~5 GB · 128K ctx · Enterprise code generation
MIT

Seed-Coder-8B Instruct

ByteDance Seed · 8B

HumanEval: 84.8%

Released June 2025 with 6 trillion tokens of training data — more code-specific pretraining than almost any model at this size. 84.8% HumanEval is state-of-the-art for an 8B model. MIT license. Essentially unknown outside research circles despite being one of the strongest 8B code models available. Requires GGUF download from HuggingFace — not in Ollama library.

~5 GB · 64K ctx · Code generation
Apache 2.0 · Ollama ✓

Yi-Coder-9B

01.ai · 9B

HumanEval: 85.4%

85.4% HumanEval at 9B, 128K context, Apache 2.0, and full Ollama support. Supports 52 programming languages — broader than most models in this tier. The best Apache 2.0 coding model from a non-Alibaba Chinese lab at the laptop GPU size. Clean license, strong benchmark, straightforward to run.

~6 GB · 128K ctx · Multilingual code generation
Hybrid · Apache 2.0 · Ollama ✓

MiniCPM4-8B

Tsinghua / OpenBMB · 8B

Built for on-device and battery-constrained environments. Uses InfLLM v2 infinite inference architecture for 7x faster inference than comparable models on end-device chips. Quality matches Qwen3-8B. Apache 2.0. Best pick if you're deploying to user machines or a constrained environment where inference speed and power draw matter as much as benchmark scores.

Matches Qwen3-8B quality with 7x faster inference on device chips

~5 GB · 32K ctx · Edge / on-device deployment
Apache 2.0

InternLM3-8B

Shanghai AI Lab · 8B

LiveCodeBench: 17.8%
HumanEval: 82.3%

Trained on only 4 trillion tokens — extraordinarily data-efficient for its benchmark scores. Supports a deep thinking mode for harder problems. 82.3% HumanEval beats Llama 3.1 8B and Qwen2.5-7B at the same size. Apache 2.0. LCB is modest at 17.8% — stronger on function-level tasks than algorithmic competition problems. Community GGUF available.

~5 GB · 128K ctx · Code generation with thinking mode
Llama 3.1 · Ollama ✓

Llama 3.1 8B Instruct

Meta · 8B

HumanEval: 72.6%

The baseline model the community measures everything else against. 72.6% HumanEval — not the highest in this tier, but Llama 3.1 8B has 108M+ Ollama pulls. Every local IDE integration (Continue.dev, Aider, etc.) has explicit docs for it. Tutorials, community support, and known behavior are unmatched at this size. The Llama license has a >700M MAU cap — check it before building a commercial product.

108M+ Ollama pulls — the ecosystem standard everyone compares against

~5 GB · 128K ctx · General-purpose baseline
Apache 2.0 · Ollama ✓

Qwen2.5-Coder-7B

Alibaba · 7B

HumanEval+: 84.1%

88.4% HumanEval at 7B and 5 GB VRAM. The gap between this and the 14B is minimal on single-file tasks (88.4% vs 89%). Apache 2.0. Ollama support. Strong FIM support for inline autocomplete. If you want a serious coding model on a MacBook M1 or an RTX 3050, this is the most efficient pick in the Qwen2.5-Coder family.

Near-identical to the 14B on single-file tasks — 2x more VRAM-efficient

~5 GB · 128K ctx · Code generation at laptop VRAM
Apache 2.0 · Ollama ✓

Qwen3-8B

Alibaba · 8B

Qwen3's 8B dense member — a generalist with built-in thinking mode. Handles code, analysis, math, and multi-step reasoning within 5 GB VRAM. Improved instruction following over Qwen2.5-7B, which shows in complex prompts and continued conversations. Apache 2.0. Ollama support. The practical default if you want a thinking-capable model at 5 GB VRAM without chasing specialized coding benchmarks.

Thinking mode included — better instruction following than Qwen2.5-7B across multi-step tasks

~5 GB · 128K ctx · General-purpose coding assistant
Apache 2.0 · Ollama ✓

Qwen3.5-9B

Alibaba · 9B

LiveCodeBench: 65.6%

65.6% LCB at 9B is genuinely impressive — most 14B models from 2025 don't score that high. 262K context window extensible to 1M via YaRN. Multimodal: reads images, screenshots, and diagrams alongside code. Gated DeltaNet hybrid architecture (3:1 linear-to-softmax attention ratio) makes the long context practical without the VRAM penalty. Apache 2.0. Ollama support. If you're choosing between Qwen3-8B and Qwen3.5-9B on similar hardware — pick the 9B. It's stronger on real coding tasks at essentially the same VRAM.

Outperforms Qwen3-30B (3x its size) on reasoning — 262K context, multimodal

~6.6 GB · 262K ctx · General-purpose coding assistant
Apache 2.0 · Ollama ✓

Gemma 4 E4B

Google DeepMind · ~11B total / 4B effective

LiveCodeBench: 52%

Google's entry into the laptop-GPU tier, and the default Gemma 4 Ollama pull. E4B means 4B effective parameters — the model actually has ~11B total parameters but uses per-layer embeddings and alternating attention to run inference like a 4B while accessing knowledge from a much larger base. 52% LCB at 5 GB VRAM is competitive with other models in this tier. Multimodal: reads images, diagrams, and screenshots alongside code. 128K context. Apache 2.0 — cleaner than the old Gemma license, which required a separate Google terms review. If you already run Gemma 3, this is a drop-in upgrade.

E = Effective params. Per-layer embeddings + alternating attention — runs like 4B, draws on 11B knowledge.

~5 GB · 128K ctx · Multimodal coding at laptop VRAM
8–12 GB · RTX 3060 / 4060, MacBook M2 Pro
★ Top Pick · MIT

Phi-4-Reasoning

Microsoft · 14B

LiveCodeBench: 53.8%
HumanEval+: 92.9%

Microsoft's hidden gem. 92.9% HumanEval+ — the same score as o1-mini — at 14B and 9GB VRAM. 53.8% LCB is strong for the size. The key differentiator is chain-of-thought: it reasons through problems before answering, catching logical errors that direct-generation models miss. MIT license. Trained on high-density synthetic reasoning data. Not on Ollama but GGUF available.

HumanEval+ ties o1-mini

~9 GB · 32K ctx · Reasoning-heavy coding
MIT · Ollama ✓

DeepSeek-R1-Distill-Qwen-14B

DeepSeek · 14B

LiveCodeBench: 53.1%

Distilled from the original DeepSeek-R1 (not the stronger R1-0528 update). 53.1% LCB at 14B, MIT license, available on Ollama — the most accessible reasoning model at this VRAM tier. Strong for algorithm problems and math-heavy code. Note: if you can tolerate a GGUF download, the 8B R1-0528 distill actually scores higher on LCB at lower VRAM.

~9 GB · 128K ctx · Reasoning + coding
Apache 2.0 · Ollama ✓

Qwen3-14B

Alibaba · 14B

The balanced choice at the 9GB tier. Strong general coding ability, thinking mode for harder problems, 128K context, Ollama support, and Apache 2.0 license. Doesn't top any single benchmark but performs reliably across code generation, explanation, refactoring, and debugging. The default recommendation when someone asks "what 14B model should I run?"

128K context, thinking mode, strong generalist coding

~9 GB · 128K ctx · General coding assistant
Gemma Terms · Ollama ✓

Gemma 3 12B

Google DeepMind · 12B

HumanEval: 85.4%

The strongest model in the 8–12 GB tier with multimodal capability — it reads screenshots, UI mockups, diagrams, or error images alongside code. 85.4% HumanEval, 128K context, Ollama support. License is Google's Gemma Terms: commercial use is allowed for most applications, but it requires reviewing Google's terms — not as clean as Apache 2.0 or MIT. Best pick if your workflow involves visual inputs.

Multimodal — reads images and diagrams alongside code

~8 GB · 128K ctx · Multimodal code generation
★ Top Pick · Apache 2.0 · Ollama ✓

Qwen2.5-Coder-14B

Alibaba · 14B

HumanEval: 89.1%

The top code model in the 8–12 GB VRAM range. 89.1% HumanEval, 128K context, Apache 2.0, Ollama support, FIM for autocomplete. The sweet spot of the Qwen2.5-Coder family: the 14B hits 89%+ on HumanEval while the 32B adds roughly 4 more points at 2x the VRAM. If you have mid-tier hardware and want the best pure coding performance available, this is it.

Best HumanEval score in the 8–12 GB tier

~10 GB · 128K ctx · Code generation — best in tier
12–16 GB · RTX 3080 / 4070, MacBook M2 Pro 16GB
★ Top Pick · Apache 2.0 · Ollama ✓

Devstral Small 2

Mistral AI · 24B

SWE-bench Verified: 68%

The best Apache 2.0 model that runs on a single consumer GPU. 68% SWE-bench — up from 46.8% at v1. That's a 21-point jump in one release, the largest single-model improvement of 2025. 256K context. Available on Ollama. Purpose-built for agentic workflows: fixing real GitHub issues, multi-file edits, running tests. If you need commercial-clean agentic coding on one card, this is the pick.

~16 GB · 256K ctx · Agentic coding (Apache 2.0)
Mistral CL · Ollama ✓

Codestral 22B

Mistral AI · 22B

HumanEval: 86.6%

86.6% HumanEval and excellent fill-in-middle performance for both chat and autocomplete use cases. 256K context on a 14GB GPU. The catch: Mistral's Commercial License means you need a Mistral agreement for production deployment — it's free for personal/research use only. If you're building a product, factor that in. For personal coding work, it's an outstanding model at this tier.

~14 GB · 256K ctx · FIM completion + code gen
Apache 2.0 · Ollama ✓

Mistral Small 3.2

Mistral AI · 24B

HumanEval: 92.9%

Strong generalist model at 24B. 92.9% HumanEval is Pass@5 (five attempts, best selected) — not Pass@1. Individual generation quality is lower than that number implies. Still a capable coding model with 128K context and Apache 2.0. Good choice if you want a well-rounded assistant rather than a specialized coding model. Ollama support makes setup trivial.

HumanEval score is Pass@5, not Pass@1 — individual generation quality lower

~15 GB · 128K ctx · General coding assistant
MoE · Apache 2.0

GPT-OSS-20B

OpenAI · 20B / 3.6B active (MoE)

The headline: this is OpenAI's first open-weight release since GPT-2. Apache 2.0 — the most permissive license OpenAI has ever shipped. 20B total, 3.6B active (MoE), runs within 16 GB VRAM, available on Ollama with 7.7M+ pulls. Community reception was mixed — independent testing found it performs solidly but doesn't stand out against Qwen3.5 or DeepSeek at similar parameter counts. The reason to run it: drop-in API compatibility with OpenAI's format and the widest ecosystem support of any open-weight model.

OpenAI's first open-weight model since GPT-2 (2019). Apache 2.0.

~14 GB · 128K ctx · OpenAI-compatible local model
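The API-compatibility point in practice: any OpenAI-style client can target a local server instead of OpenAI's. A sketch assuming Ollama's OpenAI-compatible endpoint on its default port and a hypothetical `gpt-oss:20b` model tag — the request is only constructed here, not sent:

```python
import json

# Ollama exposes an OpenAI-compatible route under /v1 (default port 11434).
BASE_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "gpt-oss:20b",  # assumed local tag
    "messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a binary search in Python."},
    ],
    "temperature": 0.2,
}

body = json.dumps(payload)  # POST this to BASE_URL with any HTTP client
```

Because the request shape is identical to OpenAI's Chat Completions format, existing SDKs usually work by overriding only the base URL and API key.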
MoE · Apache 2.0 · Ollama ✓

Gemma 4 26B

Google DeepMind · 26B / 4B active (MoE)

LiveCodeBench: 70%

The practical Gemma 4 for the 16 GB tier. MoE architecture: 26B total parameters, 4B active per inference step. That ratio keeps VRAM at ~14 GB while giving the model access to 26B worth of knowledge in routing decisions. 256K context. 70% LCB puts it in range of the best models in this tier. Ranks #6 among all open models on Arena AI. Apache 2.0 — no usage restrictions to review. Multimodal: text, image, and audio inputs. Ollama support with a direct pull command.

#6 open model on Arena AI (April 2026). MoE: 26B total, 4B active per token.

~14 GB · 256K ctx · Agentic coding — Apache 2.0
16–24 GB · RTX 3090 / 4090, Mac M2 Max 32GB
★ Top Pick · Apache 2.0 · Ollama ✓

Gemma 4 31B

Google DeepMind · 31B

LiveCodeBench: 80%

The top-of-stack Gemma 4 and the most significant open-weight release of early 2026. 80% LCB and 89.2% AIME — the AIME jump from Gemma 3's 20.8% is the largest single-generation reasoning improvement in open-source model history. Ranks #3 among all open models on the Arena AI text leaderboard. 256K context. Apache 2.0 — commercially clean with no usage caps. Fits on a single RTX 4090 or Mac M2 Max at Q4_K_M (~19 GB). Multimodal: reads images, diagrams, and audio alongside code. For serious agentic workflows on a single consumer GPU, this is the new benchmark.

#3 open model on Arena AI (April 2026). AIME 2026: 89.2% (Gemma 3 was 20.8%).

~19 GB · 256K ctx · Agentic coding + reasoning
★ Top Pick · Apache 2.0 · Ollama ✓

Qwen3.5-27B

Alibaba · 27B

SWE-bench Verified: 72.4%
LiveCodeBench: 80.7%

Best SWE-bench score that fits on a single consumer GPU. 72.4% SWE-bench, 80.7% LCB — outperforming its own 122B-A10B MoE sibling on coding tasks because dense beats MoE when full parameter engagement matters for complex multi-file reasoning. Gated DeltaNet hybrid architecture means 256K context at practical speed. Apache 2.0. Ollama support. The single-GPU pick of 2026.

Outperforms the 122B-A10B MoE sibling on coding — dense beats MoE here

~17 GB · 256K ctx · Agentic coding — best single GPU
Apache 2.0

KAT-Dev-32B

Kwaipilot (Kuaishou AI) · 32B

SWE-bench Verified: 62.4%

KAT = Kwai-AutoThink. Built by Kuaishou AI (the ByteDance competitor). 62.4% SWE-bench at 32B — runs on a single RTX 3090 or 4090. Almost entirely missed by mainstream AI coverage. Beats Qwen2.5-Coder-32B on real-world agentic tasks through a 3-stage training pipeline: SFT, RL on coding tasks, RLHF alignment. Apache 2.0. GGUF available via community.

~20 GB · 128K ctx · Agentic coding
Apache 2.0 · Ollama ✓

Qwen3-32B

Alibaba · 32B

HumanEval: 72.05%

The 32B version of Qwen3 — strong generalist with thinking mode, 128K context, Ollama support, Apache 2.0. For pure coding performance in this VRAM tier, Qwen3.5-27B (72.4% SWE-bench) is the better pick. Qwen3-32B is the choice when you want a generalist that handles code, analysis, and reasoning equally well within the same VRAM budget.

EvalPlus score; strong generalist with thinking mode

~20 GB · 128K ctx · General coding assistant
Apache 2.0

OLMo 3.1 32B Think

Allen AI · 32B

LiveCodeBench: 83.3%
HumanEval+: 91.5%

91.5% HumanEval+ and 83.3% LCB are legitimate frontier-level numbers. But the real story is the license: Apache 2.0 with fully open training data under the ODC-BY license (Dolma dataset). If your organization needs to audit the training data for IP or compliance reasons — not just model weights — this is the only competitive coding model that can satisfy that requirement.

~20 GB · 128K ctx · Fully open / compliance use
Research NC

EXAONE Deep 32B

LG AI Research · 32B

LiveCodeBench: 59.5%

LG AI Research's reasoning model. 59.5% LCB — competitive with DeepSeek-R1-Distill-Qwen-32B on competitive programming despite being a different architecture. Requires a `<thought>` tag prefix to activate reasoning mode. Non-commercial research license — not suitable for product deployment. GGUF available. Best use: hard algorithm problems, AIME/math olympiad style tasks, research environments.

~18 GB · 32K ctx · Algorithm reasoning
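A sketch of seeding the reasoning mode described above. The `[|user|]` / `[|assistant|]` turn markers follow EXAONE's published chat format, but treat the exact template as an assumption — in practice, use the model tokenizer's own `apply_chat_template` rather than hand-built strings:

```python
def exaone_reasoning_prompt(question: str) -> str:
    # Seeding the assistant turn with <thought> activates the
    # model's extended reasoning mode (per the card above).
    return f"[|user|]{question}\n[|assistant|]<thought>\n"

prompt = exaone_reasoning_prompt("Prove that sqrt(2) is irrational.")
```

The model then emits its chain of thought inside the tag before the final answer, so post-processing should strip the thought block from user-facing output.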
Llama 3.1

Hermes 4.3 36B

NousResearch · 36B

One remarkable property: 512K context on a single consumer GPU. No other single-GPU model comes close. Built on a ByteDance Seed base, post-trained using a decentralized compute network (Solana/Psyche). The Llama 3.1 license limits use to <700M MAU. If your use case requires holding enormous codebases, documentation corpora, or long conversation histories in context, this is the only option that fits on one card.

512K context — only single-GPU model in this class

~22 GB · 512K ctx · Massive context window
★ Top Pick · Apache 2.0 · Ollama ✓

Qwen2.5-Coder-32B

Alibaba · 32B

HumanEval+: 86.2%

The 2025 community gold standard for local coding agents. 92.7% HumanEval, 86.2% HumanEval+ — among the highest code generation scores of any model that fits on a single consumer GPU. Apache 2.0. Ollama support. 128K context. Purpose-built for software engineering, and the 32B hit the sweet spot: strong enough for real production code, small enough for an RTX 3090 or Mac M2 Max. Qwen3.5-27B has since taken the SWE-bench crown (72.4%), but for raw code generation, Qwen2.5-Coder-32B is still a top-tier pick.

2025 community gold standard for local coding agents

~20 GB · 128K ctx · Code generation — community gold standard
MoE · Apache 2.0 · Ollama ✓

Qwen3.5-35B-A3B

Alibaba · 35B / 3B active (MoE)

SWE-bench Verified: 69.2%

The MoE alternative to Qwen3.5-27B in the 20 GB tier. 69.2% SWE-bench — 3 points below the dense 27B sibling (72.4%), but faster throughput because only 3B parameters activate per token. 256K context. Apache 2.0. The trade-off makes sense for agentic loops: when your workflow involves hundreds of model calls across a long session, that inference speed difference adds up more than the benchmark gap.

MoE — faster inference than the 27B dense sibling at similar VRAM

~22 GB · 256K ctx · Agentic coding — speed-optimized
Hybrid · NVIDIA Nemotron License · Ollama ✓

Nemotron 3 Nano 30B-A3B

NVIDIA · 31.6B / 3.2B active (MoE)

LiveCodeBench: 68.3%
HumanEval: 78.05%

NVIDIA's first consumer-runnable open model. 78.05% HumanEval and 68.3% LCB from a hybrid Mamba-2 + Transformer architecture — the only model at this tier built on SSM technology. 1M token context is practical here because Mamba-2 processes sequences in linear time, not quadratic. Outperforms Qwen3-30B-A3B on both math (82.88% vs 61.14%) and code (78.05% vs 70.73%). Fits on a single RTX 3090/4090. NVIDIA's own Nemotron license — commercially permissive, but review the terms before shipping a product.

Hybrid Mamba-2 + Transformer — 1M context at 24 GB; outperforms Qwen3-30B-A3B on math and code

~24 GB · 1M ctx · Long-context coding on a 24 GB GPU
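Why linear-time sequence mixing matters at this context length: a standard transformer's KV cache grows linearly with tokens and becomes enormous at 1M, while Mamba-2 layers keep a fixed-size state. A rough sketch of the transformer side — the layer/head configuration below is an illustrative 30B-class assumption, not Nemotron's actual one:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    # Each layer stores a key and a value vector per token:
    # 2 * kv_heads * head_dim elements, fp16 = 2 bytes each.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens / 1e9

# Hypothetical dense-transformer config at 1M tokens:
print(round(kv_cache_gb(1_000_000, layers=48, kv_heads=8, head_dim=128), 1))
# → 196.6 (GB of fp16 KV cache — far beyond any consumer GPU)
```

Swapping most attention layers for fixed-state SSM layers removes that per-token cost, which is what makes the 1M figure plausible on a 24 GB card.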
40–48 GB · Dual RTX 3090, A6000, Mac Pro 192GB
MoE · Apache 2.0

Mistral Small 4

Mistral AI · 119B (6B active, MoE)

The name is misleading. "Small" refers to 6B active parameters per forward pass, not total model size. Total is 119B across 128 experts, with 4 active per token. At Q4_K_M quantization all 119B weights load into ~67 GB VRAM, which is above a standard workstation card but within range of Mac Pro 192GB unified memory or a single A100 80GB. The payoff: 256K context window, Apache 2.0 license, strong multilingual benchmark results. No Ollama support at launch. Load via llama.cpp or Hugging Face Transformers.

Strong multilingual benchmarks. "Small" refers to 6B active params per forward pass, not total model size.

~67 GB · 256K ctx · High-context MoE on high-VRAM workstations
★ Top Pick · MoE · Qwen License

Qwen3-Coder-Next

Alibaba · 80B / 3B active (MoE)

SWE-bench Verified: 71.3%
LiveCodeBench: 70.6%

71.3% SWE-bench on an 80B MoE that activates only 3B parameters per token. Purpose-built for software engineering. The catch: requires ~45-49GB VRAM at Q4_K_M — not a single RTX 4090. You need a 48GB workstation card (A6000, RTX 6000), a dual-GPU setup, or Apple Silicon 64GB+. If you have the hardware, this is the best locally-runnable coding model in the world.

Needs 48GB+ card or dual GPU — NOT runnable on single RTX 4090

~45–49 GB · 256K ctx · Frontier agentic coding
★ Top Pick · Apache 2.0

KAT-Dev-72B-Exp

Kwaipilot (Kuaishou AI) · 72B

SWE-bench Verified: 74.6%

74.6% SWE-bench at release made this the highest-scoring open-weight coding model in the world for several weeks. Dense 72B — same VRAM tier as Llama 3.3 70B and Kimi-Dev-72B. Apache 2.0. Community GGUF available. Built by Kwaipilot, Kuaishou's developer tooling team. Still almost entirely unknown outside Chinese research circles. If you have dual 3090s and want top-tier SWE-bench performance with a clean license, this is worth serious consideration.

Was #1 SWE-bench at release (Jan 2026)

~40 GB · 128K ctx · Agentic coding — Apache 2.0 frontier
Apache 2.0

Kimi-Dev-72B

Moonshot AI · 72B

SWE-bench Verified: 60.4%

Purpose-built via large-scale reinforcement learning on Docker test suites — the model literally practiced resolving GitHub issues in containerized environments. 60.4% SWE-bench. Apache 2.0. Best MIT/Apache-licensed 72B coding model when VRAM budget allows. Community GGUF available. Kimi's cloud API is easier, but the weights are yours if you run it locally.

~40 GB · 128K ctx · Agentic coding
MIT · Ollama ✓

DeepSeek-R1-Distill-Llama-70B

DeepSeek · 70B

SWE-bench Verified: 49.2%
LiveCodeBench: 57.5%

Reasoning-capable distill of the original DeepSeek-R1 at 70B. 57.5% LCB, 49.2% SWE-bench, MIT license, Ollama support. The main trade-off vs newer models: distilled from the original R1 (not R1-0528), so it's been surpassed on coding benchmarks. Still useful if you specifically want reasoning chains at 70B scale with a clean MIT license and want Ollama convenience.

~40 GB · 128K ctx · Reasoning at 70B
Llama 3.3 · Ollama ✓

Llama 3.3 70B

Meta · 70B

HumanEval: 88.4%

88.4% HumanEval, 128K context, Ollama support — the workhorse 70B. The Llama license has a >700M MAU cap. For pure agentic coding at 40GB VRAM, KAT-Dev-72B-Exp (74.6% SWE-bench) and Kimi-Dev-72B (60.4% SWE-bench) are stronger. Llama 3.3 is the choice when you want a capable generalist that handles code as one of many tasks rather than a specialized coding agent.

~40 GB · 128K ctx · General-purpose coding

All these models and more work in Bodega One

No config files. No YAML. Pick a model, connect a provider, start coding. One-time purchase.

Join the waitlist →

Best local LLMs by use case

SWE-bench tells you which models write code. These picks cover everything else: reasoning, research, writing, and math. All run locally, all work in Bodega One.

Reasoning

Chain-of-thought analysis, logical problem solving, and extended thinking for complex multi-step tasks.

  • ~5 GB

    DeepSeek-R1-0528-Qwen3-8B

    60.5% LCB at 5GB. Best reasoning per watt available.

  • ~10 GB

    Phi-4-Reasoning

    Distilled from o3-mini. Math and logic specialist from Microsoft.

  • ~5 GB

    OLMo 3.1 Think

    Fully open Apache 2.0 thinking model. No license restrictions.

Long-context research

Document analysis, knowledge synthesis, and multi-source research requiring large context windows.

  • ~24 GB

    Hermes 4.3 36B

    512K context window. Reads entire codebases or document sets.

  • ~18 GB

    Qwen3.5-27B

    Best dense model at this weight. Strong on long-context and reasoning.

  • Server

    Llama 3.3 70B

    128K context. Meta flagship, top open-weight instruction follower.

Writing & editing

Prose, documentation, structured output, and natural instruction following for content tasks.

  • ~5 GB

    Qwen3-8B

    Punches above its weight. Excellent at structured writing at 5GB.

  • Server

    Llama 3.3 70B

    Best open-weight instruction follower at any size class.

  • ~8 GB

    Mistral Nemo 12B

    Strong multilingual writing. Apache 2.0, runs on 8GB cards.

Math & science

Symbolic computation, step-by-step proofs, competition math, and STEM reasoning tasks.

  • ~10 GB

    Phi-4-Reasoning

    Purpose-built for mathematical reasoning. Top performer at 10GB.

  • ~5 GB

    DeepSeek-R1-0528-Qwen3-8B

    Extended thinking mode. Strong on competition-level math.

  • ~20 GB

    QwQ-32B

    72.9% MATH-500. Qwen reasoning model, math specialist.

Local AI that actually works

Every model on this page runs inside a full IDE with AI chat and an autonomous coding agent. Your data stays on your machine.

Join the waitlist →

What the benchmarks actually tell you

HumanEval is saturated

GLM-4.7-Flash scores 94.2% HumanEval on a 6GB laptop GPU. The benchmark is done. SWE-bench Verified and LiveCodeBench are the only meaningful signals for 2026.

Dense beats MoE on hard tasks

Qwen3.5-27B (dense, 27B params) outperforms Qwen3.5-122B-A10B (MoE, 10B active) on coding. When complex multi-file reasoning needs full parameter engagement, dense wins.

The 8B tier is now actually good

DeepSeek-R1-0528-Qwen3-8B scores 60.5% LCB at 5GB VRAM. That's what 32B models scored in 2024. Entry-level hardware is now competitive.

Devstral's 21-point jump

Devstral Small went from 46.8% to 68% SWE-bench between v1 and v2. The largest single-model improvement of the year. Best Apache 2.0 coding model on a single GPU.

Quantization matters sub-8B

Q4_K_M causes ~8-10% variance on coding tasks at 7B. Use Q6_K or Q8_0 for models under 8B. Q4 is fine at 14B and above.
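The rule above as a trivial helper — the thresholds encode this guide's heuristic (the 8–14B gray zone is a judgment call labeled as such), not any library API:

```python
def recommended_quant(params_b: float) -> str:
    # Heuristic from the guidance above, not a llama.cpp API.
    if params_b < 8:
        return "Q6_K or Q8_0"    # Q4 variance too high at this size
    if params_b < 14:
        return "Q4_K_M or Q6_K"  # gray zone; assumed middle ground
    return "Q4_K_M"              # Q4 is fine at 14B and above

print(recommended_quant(7))   # Q6_K or Q8_0
print(recommended_quant(32))  # Q4_K_M
```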

Context floor: 32K minimum

8K context disqualifies a model for repo-level work. 32K is the minimum. 64K–128K is the sweet spot. Larger than 128K can hurt via 'lost in the middle' degradation.

Rankings updated monthly. Get notified when they change.

Beyond consumer hardware

These models require server infrastructure or multi-GPU setups. They set the ceiling for what open-weight models can achieve.

Model · Org · Params · SWE-bench · Min VRAM · License
MiniMax M2.5 · MiniMax · 229B / 10B active · 80.2% · ~128 GB (3-bit) · Commercial OK
GLM-5 · Zhipu AI · 744B / 40B active · 77.8% · ~180 GB (2-bit) · MIT
Kimi K2.5 · Moonshot AI · 1T / 32B active · 76.8% · ~375 GB (2-bit) · Modified MIT
Qwen3.5-397B-A17B · Alibaba · 397B / 17B active · 76.4% · ~220 GB · Apache 2.0
KAT-Dev-72B-Exp (borderline consumer) · Kwaipilot · 72B · 74.6% · ~40 GB (dual GPU) · Apache 2.0
DeepSeek V3.2 · DeepSeek · 685B / 37B active · 74.1% · Server · MIT
GLM-4.7 (full) · Zhipu AI · 355B / 9B active · 73.8% · Server · MIT
MiMo-V2-Flash (150 tok/s via MTP) · Xiaomi · 309B / 15B active · 73.4% · Multi-GPU · Apache 2.0
Devstral 2 Large · Mistral AI · 123B · 72.2% · Multi-GPU · Apache 2.0
Qwen3.5-122B-A10B (LCB 78.9% — the 27B dense beats it on coding at 1/4 the VRAM) · Alibaba · 122B / 10B active · 72% · ~70–81 GB (multi-GPU) · Apache 2.0
Nemotron 3 Super 120B-A12B (LCB 81.19% — hybrid Mamba-2 MoE, 1M context, 7.5x faster than Qwen3.5-122B) · NVIDIA · 120.6B / 12.7B active · 60.47% · ~87 GB Q4 (64 GB+ unified) · NVIDIA Nemotron

Benchmark glossary

SWE-bench Verified
% of real GitHub issues resolved autonomously. A human validated each issue. The most practical benchmark: it tests actual software engineering, not toy problems. Frontier models top out around 80%.
LiveCodeBench (LCB)
Contamination-free competitive programming problems collected after the training cutoffs of the models being tested. Harder to game than HumanEval. Updated continuously.
HumanEval / HumanEval+
Code generation at function level. HumanEval is largely saturated. Multiple 6GB models score above 90%. Use LCB and SWE-bench for real discrimination. HumanEval+ has stricter tests than the original.
VRAM figures
All VRAM numbers are at Q4_K_M quantization unless noted. For models under 8B, use Q6_K or Q8_0. Q4 causes ~8-10% variance on coding tasks at that scale.
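The tier VRAM figures are reproducible from a simple formula. A ballpark sketch — the effective bits/weight for Q4_K_M and the flat overhead term are assumed averages, not measured values:

```python
def est_vram_gb(params_b: float, bits_per_weight: float = 4.85,
                overhead_gb: float = 1.0) -> float:
    # Q4_K_M averages a bit under 5 bits/weight once quant scales
    # and higher-precision layers are counted (assumed figure);
    # overhead_gb approximates runtime buffers plus a small KV cache.
    return params_b * bits_per_weight / 8 + overhead_gb

for p in (8, 14, 32, 70):
    print(f"{p}B -> ~{est_vram_gb(p):.1f} GB")
```

At 8B this lands near the ~5 GB figures quoted above and at 32B near ~20 GB; real usage varies with context length, since the KV cache grows with the number of tokens held in context.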

Running local models efficiently also depends on agent-side optimizations: KV cache reuse and observation masking can cut token waste by 40–70%.

Run these models in a full IDE.

Bodega One supports every model on this page. One-time purchase. Your data never leaves.