Local LLM Guide

The best local LLMs for software engineering

Ranked by SWE-bench Verified, the benchmark that tests real GitHub issues, not toy problems. Opinionated picks for every VRAM tier. Not sure if local models are ready? Read our honest assessment.

47 models evaluated · 9 research passes · SWE-bench · LCB · HumanEval+ · Updated monthly

What should I run?

Pick your hardware. Get a model recommendation in seconds.

≤ 4 GB · Edge, integrated GPU, older laptops
★ Top Pick · BigCode RAIL-M · Ollama ✓

StarCoder2-3B

BigCode · 3B

The de facto standard for inline code completion. Continue.dev recommends it as the default FIM model. At 2 GB VRAM, it runs on almost any GPU — including integrated graphics with enough shared memory. Use the base model, not instruct — the fine-tuning actually hurts FIM quality. Not for chat; only for autocomplete.

Fill-in-the-middle (FIM) champion — not an instruct model

~2 GB · 16K ctx · Inline autocomplete (FIM)
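The FIM claim above in concrete terms: a fill-in-the-middle request is just a specially ordered prompt. A minimal sketch, assuming the StarCoder-family special tokens (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`) — verify the exact token strings against the model card before relying on them:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    # Prefix-suffix-middle (PSM) ordering: the model is asked to
    # generate the missing "middle" after this prompt.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Cursor sits after "return " — the model fills in the expression.
before_cursor = "def add(a, b):\n    return "
after_cursor = "\n\nprint(add(2, 3))\n"
prompt = fim_prompt(before_cursor, after_cursor)
```

No chat template is involved anywhere in this flow, which is why the base model — not the instruct variant — is the right choice for autocomplete.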
MoE · Apache 2.0

Granite 4.0 Tiny

IBM · 7B MoE / 1B active

HumanEval: 82.41%

IBM's MoE entry in the sub-4GB tier. 82.41% HumanEval from a model that fits on a laptop GPU is genuinely impressive. The trade-off: 8K context is a hard limitation for anything beyond single-file edits. Apache 2.0 — cleanest possible license for enterprise use. IBM tracks data provenance per token, which matters for compliance teams.

~4 GB · 8K ctx · Code generation
MIT · Ollama ✓

DeepSeek-R1-Distill-Qwen-1.5B

DeepSeek · 1.5B

LiveCodeBench: 16.9%

The smallest model with chain-of-thought reasoning. Distilled from DeepSeek-R1. 16.9% LCB is modest but it reasons through problems instead of pattern-matching. MIT license. Useful for edge devices, CI environments, or when you genuinely have no VRAM budget. Don't expect production-grade code — expect a model that thinks before it answers.

~1.2 GB · 128K ctx · Reasoning on edge devices
Apache 2.0

SmolLM3-3B

HuggingFace · 3B

LiveCodeBench: 30%

HuggingFace's flagship compact model. 30% LCB from a 3B that fits in 2 GB VRAM is a real milestone — it beats every previous 3B coding model and reaches the range of some 7B models from 2024. Apache 2.0. 64K context. Not yet in the official Ollama library, but GGUF files are on HuggingFace. The pick for edge deployments, CI environments, or devices where even 4 GB isn't available. Solid code generation within genuine hardware constraints.

Surpasses every previous 3B model on coding — near 7B 2024-class quality

~2 GB · 64K ctx · Edge coding — top accuracy at 2 GB
Apache 2.0 · Ollama ✓

Qwen3.5-4B

Alibaba · 4B

LiveCodeBench: 55.8%

The best coding model under 4 GB that doesn't get enough attention. 55.8% LCB at 4B parameters puts it ahead of some 7B models from 2024. 262K context window at 3 GB VRAM is unmatched in this tier — every other tiny model tops out at 64K or less. Multimodal: reads images, screenshots, and diagrams alongside code. Apache 2.0. Available on Ollama. If your hardware limit is a 4 GB GPU or you're deploying to memory-constrained devices, this is the pick.

262K context and multimodal at 3 GB — ahead of some 7B models from 2024

~3 GB · 262K ctx · Edge coding with long context
MIT · Ollama ✓

Phi-4-mini

Microsoft · 3.8B

LiveCodeBench: 19.9%
HumanEval+: 68.3%

74.4% HumanEval at 3.8B and 2.3 GB VRAM. MIT license. The caveat: LiveCodeBench is only 19.9% — it handles function-level code generation well but struggles with harder algorithmic problems. The reason to pick it over Qwen3.5-4B: brand diversity, MIT license, and solid HumanEval score. Best choice if you're in a Microsoft-oriented stack or need a non-Qwen option at the ultra-small tier.

Strong HumanEval but weak LCB — better at function-level tasks than hard algorithmic problems

~2.3 GB · 128K ctx · Edge coding — Microsoft stack
Apache 2.0

Bonsai 8B

PrismML · 8B (1-bit)

1-bit quantization stores each weight in a single bit instead of 16 or 32. That gets 8B parameters down to 1.15 GB of RAM. Commercially licensed under Apache 2.0. First 1-bit model to hold up on real tasks: chat, summarization, tool calling. No Ollama support yet. Load via Hugging Face Transformers or PrismML's custom llama.cpp fork at github.com/PrismML-Eng/llama.cpp. Good fit for embedded hardware, edge deployments, or anything where every GB matters.

No standard benchmark results published yet. Community reports competitive with Qwen3.5-8B on general tasks.

~1.2 GB · 4K ctx · Extreme RAM-constrained devices
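The size math above is easy to check: one bit per weight means parameters divided by eight, in bytes, plus overhead for the pieces (embeddings, quantization scales) that stay at higher precision. A back-of-envelope sketch — the 15% overhead fraction is an illustrative assumption chosen to match the quoted figures, not a published number:

```python
def one_bit_size_gb(n_params: float, overhead_frac: float = 0.15) -> float:
    # 1 bit/weight -> n_params / 8 bytes; overhead_frac covers
    # higher-precision embeddings and scales (assumed fraction).
    return (n_params / 8 / 1e9) * (1 + overhead_frac)

print(round(one_bit_size_gb(8e9), 2))    # ~1.15 GB, the Bonsai 8B figure
print(round(one_bit_size_gb(1.7e9), 2))  # ~0.24 GB, the Bonsai 1.7B figure
```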
Apache 2.0

Bonsai 1.7B

PrismML · 1.7B (1-bit)

Runs in under 256 MB of RAM using 1-bit quantization. Apache 2.0 license with no usage caps. Confirmed running on Raspberry Pi 5 at useful speeds. The use case is narrow but real: anywhere you need an LLM and have less than 512 MB to spare. Load via PrismML's llama.cpp fork or Hugging Face Transformers. No Ollama support yet.

No published benchmarks. Designed for Raspberry Pi 5 and sub-1 GB RAM hardware.

~0.24 GB · 4K ctx · Raspberry Pi 5 / embedded AI
5–8 GB · RTX 3050/4050, MacBook M1/M2 base
★ Top Pick · MoE · MIT · Ollama ✓

GLM-4.7-Flash

Zhipu AI (Z.ai) · ~9B active (MoE)

SWE-bench Verified: 59.2%
LiveCodeBench: 84.9%
HumanEval: 94.2%

The single most surprising model in this guide. 84.9% LCB and 94.2% HumanEval from a MoE that runs on a 6GB laptop GPU. For competitive programming and pure code generation, it outperforms most 32B dense models. SWE-bench is 59.2% (the 73.8% score belongs to its full 355B parent model — not Flash). MIT license. Available on Ollama. The sleeper pick of 2026.

SWE-bench for Flash distill (NOT the full 355B parent's 73.8%)

~6 GB · 200K ctx · Competitive programming & code gen
★ Top Pick · MIT

DeepSeek-R1-0528-Qwen3-8B

DeepSeek · 8B

LiveCodeBench: 60.5%

60.5% LCB at 8B and 5GB VRAM. That's what 32B models scored in 2024. The jump from older R1-Distill-Qwen-7B (37.6% LCB) happened because this was distilled from a fundamentally stronger teacher — R1-0528, which itself jumped +9.8 LCB over the original R1. Best reasoning model at the laptop GPU tier. MIT license.

Distilled from the updated R1-0528 teacher — dramatically better than older R1 distills

~5 GB · 128K ctx · Reasoning + coding at 8B
MoE · Apache 2.0

Qwen3-30B-A3B

Alibaba · 30B / 3.3B active (MoE)

30 billion parameters of knowledge accessible through a 3.3B-active-parameter window — all on 6GB VRAM. The 256K context window is extraordinary for this tier; every other laptop GPU model tops out at 128K or less. Apache 2.0. Best pick when you need to hold a large codebase in context on limited hardware. Speed is slower than a pure 8B dense model.

256K context at laptop VRAM — major differentiator

~6 GB · 256K ctx · Long-context coding at low VRAM
★ Top Pick · Apache 2.0

Granite 3.3 8B Instruct

IBM · 8B

HumanEval: 89.73%

A massive jump from IBM's old code-specific Granite models. 89.73% HumanEval at 8B and 5GB VRAM. Apache 2.0 with IBM's full data provenance tracking — every training token is documented. The cleanest enterprise license in this tier. If your legal team needs to audit the training data, this is the only sub-8GB model that can satisfy that requirement.

~5 GB · 128K ctx · Enterprise code generation
MIT

Seed-Coder-8B Instruct

ByteDance Seed · 8B

HumanEval: 84.8%

Released June 2025 with 6 trillion tokens of training data — more code-specific pretraining than almost any model at this size. 84.8% HumanEval is state-of-the-art for an 8B model. MIT license. Essentially unknown outside research circles despite being one of the strongest 8B code models available. Requires GGUF download from HuggingFace — not in Ollama library.

~5 GB · 64K ctx · Code generation
Apache 2.0 · Ollama ✓

Yi-Coder-9B

01.ai · 9B

HumanEval: 85.4%

85.4% HumanEval at 9B, 128K context, Apache 2.0, and full Ollama support. Supports 52 programming languages — broader than most models in this tier. The best Apache 2.0 coding model from a non-Alibaba Chinese lab at the laptop GPU size. Clean license, strong benchmark, straightforward to run.

~6 GB · 128K ctx · Multilingual code generation
Hybrid · Apache 2.0 · Ollama ✓

MiniCPM4-8B

Tsinghua / OpenBMB · 8B

Built for on-device and battery-constrained environments. Uses InfLLM v2 infinite inference architecture for 7x faster inference than comparable models on end-device chips. Quality matches Qwen3-8B. Apache 2.0. Best pick if you're deploying to user machines or a constrained environment where inference speed and power draw matter as much as benchmark scores.

Matches Qwen3-8B quality with 7x faster inference on device chips

~5 GB · 32K ctx · Edge / on-device deployment
Apache 2.0

InternLM3-8B

Shanghai AI Lab · 8B

LiveCodeBench: 17.8%
HumanEval: 82.3%

Trained on only 4 trillion tokens — extraordinarily data-efficient for its benchmark scores. Supports a deep thinking mode for harder problems. 82.3% HumanEval beats Llama 3.1 8B and Qwen2.5-7B at the same size. Apache 2.0. LCB is modest at 17.8% — stronger on function-level tasks than algorithmic competition problems. Community GGUF available.

~5 GB · 128K ctx · Code generation with thinking mode
Llama 3.1 · Ollama ✓

Llama 3.1 8B Instruct

Meta · 8B

HumanEval: 72.6%

The baseline model the community measures everything else against. 72.6% HumanEval — not the highest in this tier, but Llama 3.1 8B has 108M+ Ollama pulls. Every local IDE integration (Continue.dev, Aider, etc.) has explicit docs for it. Tutorials, community support, and known behavior are unmatched at this size. The Llama license has a >700M MAU cap — check it before building a commercial product.

108M+ Ollama pulls — the ecosystem standard everyone compares against

~5 GB · 128K ctx · General-purpose baseline
Apache 2.0 · Ollama ✓

Qwen2.5-Coder-7B

Alibaba · 7B

HumanEval+: 84.1%

88.4% HumanEval at 7B and 5 GB VRAM. The gap between this and the 14B is minimal on single-file tasks (88.4% vs 89%). Apache 2.0. Ollama support. Strong FIM support for inline autocomplete. If you want a serious coding model on a MacBook M1 or an RTX 3050, this is the most efficient pick in the Qwen2.5-Coder family.

Near-identical to the 14B on single-file tasks — 2x more VRAM-efficient

~5 GB · 128K ctx · Code generation at laptop VRAM
Apache 2.0 · Ollama ✓

Qwen3-8B

Alibaba · 8B

Qwen3's 8B dense member — a generalist with built-in thinking mode. Handles code, analysis, math, and multi-step reasoning within 5 GB VRAM. Improved instruction following over Qwen2.5-7B, which shows in complex prompts and continued conversations. Apache 2.0. Ollama support. The practical default if you want a thinking-capable model at 5 GB VRAM without chasing specialized coding benchmarks.

Thinking mode included — better instruction following than Qwen2.5-7B across multi-step tasks

~5 GB · 128K ctx · General-purpose coding assistant
Apache 2.0 · Ollama ✓

Qwen3.5-9B

Alibaba · 9B

LiveCodeBench: 65.6%

65.6% LCB at 9B is genuinely impressive — most 14B models from 2025 don't score that high. 262K context window extensible to 1M via YaRN. Multimodal: reads images, screenshots, and diagrams alongside code. Gated DeltaNet hybrid architecture (3:1 linear-to-softmax attention ratio) makes the long context practical without the VRAM penalty. Apache 2.0. Ollama support. If you're choosing between Qwen3-8B and Qwen3.5-9B on similar hardware — pick the 9B. It's stronger on real coding tasks at essentially the same VRAM.

Outperforms Qwen3-30B (3x its size) on reasoning — 262K context, multimodal

~6.6 GB · 262K ctx · General-purpose coding assistant
Apache 2.0 · Ollama ✓

Gemma 4 E4B

Google DeepMind · ~11B total / 4B effective

LiveCodeBench: 52%

Google's entry into the laptop-GPU tier, and the default Gemma 4 Ollama pull. E4B means 4B effective parameters — the model actually has ~11B total parameters but uses per-layer embeddings and alternating attention to run inference like a 4B while accessing knowledge from a much larger base. 52% LCB at 5 GB VRAM is competitive with other models in this tier. Multimodal: reads images, diagrams, and screenshots alongside code. 128K context. Apache 2.0 — cleaner than the old Gemma license, which required a separate Google terms review. If you already run Gemma 3, this is a drop-in upgrade.

E = Effective params. Per-layer embeddings + alternating attention — runs like 4B, draws on 11B knowledge.

~5 GB · 128K ctx · Multimodal coding at laptop VRAM
8–12 GB · RTX 3060 / 4060, MacBook M2 Pro
★ Top Pick · MIT

Phi-4-Reasoning

Microsoft · 14B

LiveCodeBench: 53.8%
HumanEval+: 92.9%

Microsoft's hidden gem. 92.9% HumanEval+ — the same score as o1-mini — at 14B and 9GB VRAM. 53.8% LCB is strong for the size. The key differentiator is chain-of-thought: it reasons through problems before answering, catching logical errors that direct-generation models miss. MIT license. Trained on high-density synthetic reasoning data. Not on Ollama but GGUF available.

HumanEval+ ties o1-mini

~9 GB · 32K ctx · Reasoning-heavy coding
MIT · Ollama ✓

DeepSeek-R1-Distill-Qwen-14B

DeepSeek · 14B

LiveCodeBench: 53.1%

Distilled from the original DeepSeek-R1 (not the stronger R1-0528 update). 53.1% LCB at 14B, MIT license, available on Ollama — the most accessible reasoning model at this VRAM tier. Strong for algorithm problems and math-heavy code. Note: if you can tolerate a GGUF download, the 8B R1-0528 distill actually scores higher on LCB at lower VRAM.

~9 GB · 128K ctx · Reasoning + coding
Apache 2.0 · Ollama ✓

Qwen3-14B

Alibaba · 14B

The balanced choice at the 9GB tier. Strong general coding ability, thinking mode for harder problems, 128K context, Ollama support, and Apache 2.0 license. Doesn't top any single benchmark but performs reliably across code generation, explanation, refactoring, and debugging. The default recommendation when someone asks "what 14B model should I run?"

128K context, thinking mode, strong generalist coding

~9 GB · 128K ctx · General coding assistant
Gemma Terms · Ollama ✓

Gemma 3 12B

Google DeepMind · 12B

HumanEval: 85.4%

The strongest model in the 8–12 GB tier with multimodal capability — it reads screenshots, UI mockups, diagrams, or error images alongside code. 85.4% HumanEval, 128K context, Ollama support. License is Google's Gemma Terms: commercial use is allowed for most applications, but it requires reviewing Google's terms — not as clean as Apache 2.0 or MIT. Best pick if your workflow involves visual inputs.

Multimodal — reads images and diagrams alongside code

~8 GB · 128K ctx · Multimodal code generation
★ Top Pick · Apache 2.0 · Ollama ✓

Qwen2.5-Coder-14B

Alibaba · 14B

HumanEval: 89.1%

The top code model in the 8–12 GB VRAM range. 89.1% HumanEval, 128K context, Apache 2.0, Ollama support, FIM for autocomplete. The sweet spot of the Qwen2.5-Coder family: the 14B hits 89%+ on HumanEval while the 32B adds roughly 4 more points at 2x the VRAM. If you have mid-tier hardware and want the best pure coding performance available, this is it.

Best HumanEval score in the 8–12 GB tier

~10 GB · 128K ctx · Code generation — best in tier
12–16 GB · RTX 3080 / 4070, MacBook M2 Pro 16GB
★ Top Pick · Apache 2.0 · Ollama ✓

Devstral Small 2

Mistral AI · 24B

SWE-bench Verified: 68%

The best Apache 2.0 model that runs on a single consumer GPU. 68% SWE-bench — up from 46.8% at v1. That's a 21-point jump in one release, the largest single-model improvement of 2025. 256K context. Available on Ollama. Purpose-built for agentic workflows: fixing real GitHub issues, multi-file edits, running tests. If you need commercial-clean agentic coding on one card, this is the pick.

~16 GB · 256K ctx · Agentic coding (Apache 2.0)
Mistral CL · Ollama ✓

Codestral 22B

Mistral AI · 22B

HumanEval: 86.6%

86.6% HumanEval and excellent fill-in-middle performance for both chat and autocomplete use cases. 256K context on a 14GB GPU. The catch: Mistral's Commercial License means you need a Mistral agreement for production deployment — it's free for personal/research use only. If you're building a product, factor that in. For personal coding work, it's an outstanding model at this tier.

~14 GB · 256K ctx · FIM completion + code gen
Apache 2.0 · Ollama ✓

Mistral Small 3.2

Mistral AI · 24B

HumanEval: 92.9%

Strong generalist model at 24B. 92.9% HumanEval is Pass@5 (five attempts, best selected) — not Pass@1. Individual generation quality is lower than that number implies. Still a capable coding model with 128K context and Apache 2.0. Good choice if you want a well-rounded assistant rather than a specialized coding model. Ollama support makes setup trivial.

HumanEval score is Pass@5, not Pass@1 — individual generation quality lower

~15 GB · 128K ctx · General coding assistant
MoE · Apache 2.0

GPT-OSS-20B

OpenAI · 20B / 3.6B active (MoE)

The headline: this is OpenAI's first open-weight release since GPT-2. Apache 2.0 — the most permissive license OpenAI has ever shipped. 20B total, 3.6B active (MoE), runs within 16 GB VRAM, available on Ollama with 7.7M+ pulls. Community reception was mixed — independent testing found it performs solidly but doesn't stand out against Qwen3.5 or DeepSeek at similar parameter counts. The reason to run it: drop-in API compatibility with OpenAI's format and the widest ecosystem support of any open-weight model.

OpenAI's first open-weight model since GPT-2 (2019). Apache 2.0.

~14 GB · 128K ctx · OpenAI-compatible local model
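The API-compatibility point in practice: any OpenAI-style client can target a local server instead of OpenAI's. A sketch assuming Ollama's OpenAI-compatible endpoint on its default port and a hypothetical `gpt-oss:20b` model tag — the request is only constructed here, not sent:

```python
import json

# Ollama exposes an OpenAI-compatible route under /v1 (default port 11434).
BASE_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "gpt-oss:20b",  # assumed local tag
    "messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a binary search in Python."},
    ],
    "temperature": 0.2,
}

body = json.dumps(payload)  # POST this to BASE_URL with any HTTP client
```

Because the request shape is identical to OpenAI's Chat Completions format, existing SDKs usually work by overriding only the base URL and API key.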
MoE · Apache 2.0 · Ollama ✓

Gemma 4 26B

Google DeepMind · 26B / 4B active (MoE)

LiveCodeBench: 70%

The practical Gemma 4 for the 16 GB tier. MoE architecture: 26B total parameters, 4B active per inference step. That ratio keeps VRAM at ~14 GB while giving the model access to 26B worth of knowledge in routing decisions. 256K context. 70% LCB puts it in range of the best models in this tier. Ranks #6 among all open models on Arena AI. Apache 2.0 — no usage restrictions to review. Multimodal: text, image, and audio inputs. Ollama support with a direct pull command.

#6 open model on Arena AI (April 2026). MoE: 26B total, 4B active per token.

~14 GB · 256K ctx · Agentic coding — Apache 2.0
16–24 GB · RTX 3090 / 4090, Mac M2 Max 32GB
★ Top Pick · Apache 2.0 · Ollama ✓

Gemma 4 31B

Google DeepMind · 31B

LiveCodeBench: 80%

The top-of-stack Gemma 4 and the most significant open-weight release of early 2026. 80% LCB and 89.2% AIME — the AIME jump from Gemma 3's 20.8% is the largest single-generation reasoning improvement in open-source model history. Ranks #3 among all open models on the Arena AI text leaderboard. 256K context. Apache 2.0 — commercially clean with no usage caps. Fits on a single RTX 4090 or Mac M2 Max at Q4_K_M (~19 GB). Multimodal: reads images, diagrams, and audio alongside code. For serious agentic workflows on a single consumer GPU, this is the new benchmark.

#3 open model on Arena AI (April 2026). AIME 2026: 89.2% (Gemma 3 was 20.8%).

~19 GB · 256K ctx · Agentic coding + reasoning
★ Top Pick · Apache 2.0 · Ollama ✓

Qwen3.5-27B

Alibaba · 27B

SWE-bench Verified: 72.4%
LiveCodeBench: 80.7%

Best SWE-bench score that fits on a single consumer GPU. 72.4% SWE-bench, 80.7% LCB — outperforming its own 122B-A10B MoE sibling on coding tasks because dense beats MoE when full parameter engagement matters for complex multi-file reasoning. Gated DeltaNet hybrid architecture means 256K context at practical speed. Apache 2.0. Ollama support. The single-GPU pick of 2026.

Outperforms the 122B-A10B MoE sibling on coding — dense beats MoE here

~17 GB · 256K ctx · Agentic coding — best single GPU
Apache 2.0

KAT-Dev-32B

Kwaipilot (Kuaishou AI) · 32B

SWE-bench Verified: 62.4%

KAT = Kwai-AutoThink. Built by Kuaishou AI (the ByteDance competitor). 62.4% SWE-bench at 32B — runs on a single RTX 3090 or 4090. Almost entirely missed by mainstream AI coverage. Beats Qwen2.5-Coder-32B on real-world agentic tasks through a 3-stage training pipeline: SFT, RL on coding tasks, RLHF alignment. Apache 2.0. GGUF available via community.

~20 GB · 128K ctx · Agentic coding
Apache 2.0 · Ollama ✓

Qwen3-32B

Alibaba · 32B

HumanEval: 72.05%

The 32B version of Qwen3 — strong generalist with thinking mode, 128K context, Ollama support, Apache 2.0. For pure coding performance in this VRAM tier, Qwen3.5-27B (72.4% SWE-bench) is the better pick. Qwen3-32B is the choice when you want a generalist that handles code, analysis, and reasoning equally well within the same VRAM budget.

EvalPlus score; strong generalist with thinking mode

~20 GB · 128K ctx · General coding assistant
Apache 2.0

OLMo 3.1 32B Think

Allen AI · 32B

LiveCodeBench: 83.3%
HumanEval+: 91.5%

91.5% HumanEval+ and 83.3% LCB are legitimate frontier-level numbers. But the real story is the license: Apache 2.0 with fully open training data under the ODC-BY license (Dolma dataset). If your organization needs to audit the training data for IP or compliance reasons — not just model weights — this is the only competitive coding model that can satisfy that requirement.

~20 GB · 128K ctx · Fully open / compliance use
Research NC

EXAONE Deep 32B

LG AI Research · 32B

LiveCodeBench: 59.5%

LG AI Research's reasoning model. 59.5% LCB — competitive with DeepSeek-R1-Distill-Qwen-32B on competitive programming despite being a different architecture. Requires a `<thought>` tag prefix to activate reasoning mode. Non-commercial research license — not suitable for product deployment. GGUF available. Best use: hard algorithm problems, AIME/math olympiad style tasks, research environments.

~18 GB · 32K ctx · Algorithm reasoning
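A sketch of seeding the reasoning mode described above. The `[|user|]` / `[|assistant|]` turn markers follow EXAONE's published chat format, but treat the exact template as an assumption — in practice, use the model tokenizer's own `apply_chat_template` rather than hand-built strings:

```python
def exaone_reasoning_prompt(question: str) -> str:
    # Seeding the assistant turn with <thought> activates the
    # model's extended reasoning mode (per the card above).
    return f"[|user|]{question}\n[|assistant|]<thought>\n"

prompt = exaone_reasoning_prompt("Prove that sqrt(2) is irrational.")
```

The model then emits its chain of thought inside the tag before the final answer, so post-processing should strip the thought block from user-facing output.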
Llama 3.1

Hermes 4.3 36B

NousResearch · 36B

One remarkable property: 512K context on a single consumer GPU. No other single-GPU model comes close. Built on a ByteDance Seed base, post-trained using a decentralized compute network (Solana/Psyche). The Llama 3.1 license limits use to <700M MAU. If your use case requires holding enormous codebases, documentation corpora, or long conversation histories in context, this is the only option that fits on one card.

512K context — only single-GPU model in this class

~22 GB · 512K ctx · Massive context window
★ Top Pick · Apache 2.0 · Ollama ✓

Qwen2.5-Coder-32B

Alibaba · 32B

HumanEval+: 86.2%

The 2025 community gold standard for local coding agents. 92.7% HumanEval, 86.2% HumanEval+ — among the highest code generation scores of any model that fits on a single consumer GPU. Apache 2.0. Ollama support. 128K context. Purpose-built for software engineering, and the 32B hit the sweet spot: strong enough for real production code, small enough for an RTX 3090 or Mac M2 Max. Qwen3.5-27B has since taken the SWE-bench crown (72.4%), but for raw code generation, Qwen2.5-Coder-32B is still a top-tier pick.

2025 community gold standard for local coding agents

~20 GB · 128K ctx · Code generation — community gold standard
MoE · Apache 2.0 · Ollama ✓

Qwen3.5-35B-A3B

Alibaba · 35B / 3B active (MoE)

SWE-bench Verified: 69.2%

The MoE alternative to Qwen3.5-27B in the 20 GB tier. 69.2% SWE-bench — 3 points below the dense 27B sibling (72.4%), but faster throughput because only 3B parameters activate per token. 256K context. Apache 2.0. The trade-off makes sense for agentic loops: when your workflow involves hundreds of model calls across a long session, that inference speed difference adds up more than the benchmark gap.

MoE — faster inference than the 27B dense sibling at similar VRAM

~22 GB · 256K ctx · Agentic coding — speed-optimized
Hybrid · NVIDIA Nemotron License · Ollama ✓

Nemotron 3 Nano 30B-A3B

NVIDIA · 31.6B / 3.2B active (MoE)

LiveCodeBench: 68.3%
HumanEval: 78.05%

NVIDIA's first consumer-runnable open model. 78.05% HumanEval and 68.3% LCB from a hybrid Mamba-2 + Transformer architecture — the only model at this tier built on SSM technology. 1M token context is practical here because Mamba-2 processes sequences in linear time, not quadratic. Outperforms Qwen3-30B-A3B on both math (82.88% vs 61.14%) and code (78.05% vs 70.73%). Fits on a single RTX 3090/4090. NVIDIA's own Nemotron license — commercially permissive, but review the terms before shipping a product.

Hybrid Mamba-2 + Transformer — 1M context at 24 GB; outperforms Qwen3-30B-A3B on math and code

~24 GB · 1M ctx · Long-context coding on a 24 GB GPU
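Why linear-time sequence mixing matters at this context length: a standard transformer's KV cache grows linearly with tokens and becomes enormous at 1M, while Mamba-2 layers keep a fixed-size state. A rough sketch of the transformer side — the layer/head configuration below is an illustrative 30B-class assumption, not Nemotron's actual one:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    # Each layer stores a key and a value vector per token:
    # 2 * kv_heads * head_dim elements, fp16 = 2 bytes each.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens / 1e9

# Hypothetical dense-transformer config at 1M tokens:
print(round(kv_cache_gb(1_000_000, layers=48, kv_heads=8, head_dim=128), 1))
# → 196.6 (GB of fp16 KV cache — far beyond any consumer GPU)
```

Swapping most attention layers for fixed-state SSM layers removes that per-token cost, which is what makes the 1M figure plausible on a 24 GB card.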
40–48 GB · Dual RTX 3090, A6000, Mac Pro 192GB
MoE · Apache 2.0

Mistral Small 4

Mistral AI · 119B (6B active, MoE)

The name is misleading. "Small" refers to 6B active parameters per forward pass, not total model size. Total is 119B across 128 experts, with 4 active per token. At Q4_K_M quantization all 119B weights load into ~67 GB VRAM, which is above a standard workstation card but within range of Mac Pro 192GB unified memory or a single A100 80GB. The payoff: 256K context window, Apache 2.0 license, strong multilingual benchmark results. No Ollama support at launch. Load via llama.cpp or Hugging Face Transformers.

Strong multilingual benchmarks. "Small" refers to 6B active params per forward pass, not total model size.

~67 GB · 256K ctx · High-context MoE on high-VRAM workstations
★ Top Pick · MoE · Qwen License

Qwen3-Coder-Next

Alibaba · 80B / 3B active (MoE)

SWE-bench Verified: 71.3%
LiveCodeBench: 70.6%

71.3% SWE-bench on an 80B MoE that activates only 3B parameters per token. Purpose-built for software engineering. The catch: requires ~45-49GB VRAM at Q4_K_M — not a single RTX 4090. You need a 48GB workstation card (A6000, RTX 6000), a dual-GPU setup, or Apple Silicon 64GB+. If you have the hardware, this is the best locally-runnable coding model in the world.

Needs 48GB+ card or dual GPU — NOT runnable on single RTX 4090

~45–49 GB · 256K ctx · Frontier agentic coding
★ Top Pick · Apache 2.0

KAT-Dev-72B-Exp

Kwaipilot (Kuaishou AI) · 72B

SWE-bench Verified: 74.6%

74.6% SWE-bench at release made this the highest-scoring open-weight coding model in the world for several weeks. Dense 72B — same VRAM tier as Llama 3.3 70B and Kimi-Dev-72B. Apache 2.0. Community GGUF available. Built by Kwaipilot, Kuaishou's developer tooling team. Still almost entirely unknown outside Chinese research circles. If you have dual 3090s and want top-tier SWE-bench performance with a clean license, this is worth serious consideration.

Was #1 SWE-bench at release (Jan 2026)

~40 GB · 128K ctx · Agentic coding — Apache 2.0 frontier
Apache 2.0

Kimi-Dev-72B

Moonshot AI · 72B

SWE-bench Verified: 60.4%

Purpose-built via large-scale reinforcement learning on Docker test suites — the model literally practiced resolving GitHub issues in containerized environments. 60.4% SWE-bench. Apache 2.0. Best MIT/Apache-licensed 72B coding model when VRAM budget allows. Community GGUF available. Kimi's cloud API is easier, but the weights are yours if you run it locally.

~40 GB · 128K ctx · Agentic coding
MIT · Ollama ✓

DeepSeek-R1-Distill-Llama-70B

DeepSeek · 70B

SWE-bench Verified: 49.2%
LiveCodeBench: 57.5%

Reasoning-capable distill of the original DeepSeek-R1 at 70B. 57.5% LCB, 49.2% SWE-bench, MIT license, Ollama support. The main trade-off vs newer models: distilled from the original R1 (not R1-0528), so it's been surpassed on coding benchmarks. Still useful if you specifically want reasoning chains at 70B scale with a clean MIT license and want Ollama convenience.

~40 GB · 128K ctx · Reasoning at 70B
Llama 3.3 · Ollama ✓

Llama 3.3 70B

Meta · 70B

HumanEval: 88.4%

88.4% HumanEval, 128K context, Ollama support — the workhorse 70B. The Llama license has a >700M MAU cap. For pure agentic coding at 40GB VRAM, KAT-Dev-72B-Exp (74.6% SWE-bench) and Kimi-Dev-72B (60.4% SWE-bench) are stronger. Llama 3.3 is the choice when you want a capable generalist that handles code as one of many tasks rather than a specialized coding agent.

~40 GB · 128K ctx · General-purpose coding

All these models and more work in Bodega One

No config files. No YAML. Pick a model, connect a provider, start coding. One-time purchase.

Join the waitlist →

Best local LLMs by use case

SWE-bench tells you which models write code. These picks cover everything else: reasoning, research, writing, and math. All run locally, all work in Bodega One.

Reasoning

Chain-of-thought analysis, logical problem solving, and extended thinking for complex multi-step tasks.

  • ~5 GB

    DeepSeek-R1-0528-Qwen3-8B

    60.5% LCB at 5GB. Best reasoning per watt available.

  • ~10 GB

    Phi-4-Reasoning

    Distilled from o3-mini. Math and logic specialist from Microsoft.

  • ~5 GB

    OLMo 3.1 Think

    Fully open Apache 2.0 thinking model. No license restrictions.

Long-context research

Document analysis, knowledge synthesis, and multi-source research requiring large context windows.

  • ~24 GB

    Hermes 4.3 36B

    512K context window. Reads entire codebases or document sets.

  • ~18 GB

    Qwen3.5-27B

    Best dense model at this weight. Strong on long-context and reasoning.

  • Server

    Llama 3.3 70B

    128K context. Meta flagship, top open-weight instruction follower.

Writing & editing

Prose, documentation, structured output, and natural instruction following for content tasks.

  • ~5 GB

    Qwen3-8B

    Punches above its weight. Excellent at structured writing at 5GB.

  • Server

    Llama 3.3 70B

    Best open-weight instruction follower at any size class.

  • ~8 GB

    Mistral Nemo 12B

    Strong multilingual writing. Apache 2.0, runs on 8GB cards.

Math & science

Symbolic computation, step-by-step proofs, competition math, and STEM reasoning tasks.

  • ~10 GB

    Phi-4-Reasoning

    Purpose-built for mathematical reasoning. Top performer at 10GB.

  • ~5 GB

    DeepSeek-R1-0528-Qwen3-8B

    Extended thinking mode. Strong on competition-level math.

  • ~20 GB

    QwQ-32B

    72.9% MATH-500. Qwen reasoning model, math specialist.

Local AI that actually works

Every model on this page runs inside a full IDE with AI chat and an autonomous coding agent. Your data stays on your machine.

Join the waitlist →

What the benchmarks actually tell you

HumanEval is saturated

GLM-4.7-Flash scores 94.2% HumanEval on a 6GB laptop GPU. The benchmark is done. SWE-bench Verified and LiveCodeBench are the only meaningful signals for 2026.

Dense beats MoE on hard tasks

Qwen3.5-27B (dense, 27B params) outperforms Qwen3.5-122B-A10B (MoE, 10B active) on coding. When complex multi-file reasoning needs full parameter engagement, dense wins.

The 8B tier is now actually good

DeepSeek-R1-0528-Qwen3-8B scores 60.5% LCB at 5GB VRAM. That's what 32B models scored in 2024. Entry-level hardware is now competitive.

Devstral's 21-point jump

Devstral Small went from 46.8% to 68% SWE-bench between v1 and v2. The largest single-model improvement of the year. Best Apache 2.0 coding model on a single GPU.

Quantization matters sub-8B

Q4_K_M causes ~8-10% variance on coding tasks at 7B. Use Q6_K or Q8_0 for models under 8B. Q4 is fine at 14B and above.
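The rule above as a trivial helper — the thresholds encode this guide's heuristic (the 8–14B gray zone is a judgment call labeled as such), not any library API:

```python
def recommended_quant(params_b: float) -> str:
    # Heuristic from the guidance above, not a llama.cpp API.
    if params_b < 8:
        return "Q6_K or Q8_0"    # Q4 variance too high at this size
    if params_b < 14:
        return "Q4_K_M or Q6_K"  # gray zone; assumed middle ground
    return "Q4_K_M"              # Q4 is fine at 14B and above

print(recommended_quant(7))   # Q6_K or Q8_0
print(recommended_quant(32))  # Q4_K_M
```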

Context floor: 32K minimum

8K context disqualifies a model for repo-level work. 32K is the minimum. 64K–128K is the sweet spot. Larger than 128K can hurt via 'lost in the middle' degradation.

Rankings updated monthly. Get notified when they change.

Beyond consumer hardware

These models require server infrastructure or multi-GPU setups. They set the ceiling for what open-weight models can achieve.

Model · Org · Params · SWE-bench · Min VRAM · License
MiniMax M2.5 · MiniMax · 229B / 10B active · 80.2% · ~128 GB (3-bit) · Commercial OK
GLM-5 · Zhipu AI · 744B / 40B active · 77.8% · ~180 GB (2-bit) · MIT
Kimi K2.5 · Moonshot AI · 1T / 32B active · 76.8% · ~375 GB (2-bit) · Modified MIT
Qwen3.5-397B-A17B · Alibaba · 397B / 17B active · 76.4% · ~220 GB · Apache 2.0
KAT-Dev-72B-Exp (borderline consumer) · Kwaipilot · 72B · 74.6% · ~40 GB (dual GPU) · Apache 2.0
DeepSeek V3.2 · DeepSeek · 685B / 37B active · 74.1% · Server · MIT
GLM-4.7 (full) · Zhipu AI · 355B / 9B active · 73.8% · Server · MIT
MiMo-V2-Flash (150 tok/s via MTP) · Xiaomi · 309B / 15B active · 73.4% · Multi-GPU · Apache 2.0
Devstral 2 Large · Mistral AI · 123B · 72.2% · Multi-GPU · Apache 2.0
Qwen3.5-122B-A10B (LCB 78.9% — the 27B dense beats it on coding at 1/4 the VRAM) · Alibaba · 122B / 10B active · 72% · ~70–81 GB (multi-GPU) · Apache 2.0
Nemotron 3 Super 120B-A12B (LCB 81.19% — hybrid Mamba-2 MoE, 1M context, 7.5x faster than Qwen3.5-122B) · NVIDIA · 120.6B / 12.7B active · 60.47% · ~87 GB Q4 (64 GB+ unified) · NVIDIA Nemotron

Benchmark glossary

SWE-bench Verified
% of real GitHub issues resolved autonomously. A human validated each issue. The most practical benchmark: it tests actual software engineering, not toy problems. Frontier models top out around 80%.
LiveCodeBench (LCB)
Contamination-free competitive programming problems collected after the training cutoffs of the models being tested. Harder to game than HumanEval. Updated continuously.
HumanEval / HumanEval+
Code generation at function level. HumanEval is largely saturated. Multiple 6GB models score above 90%. Use LCB and SWE-bench for real discrimination. HumanEval+ has stricter tests than the original.
VRAM figures
All VRAM numbers are at Q4_K_M quantization unless noted. For models under 8B, use Q6_K or Q8_0. Q4 causes ~8-10% variance on coding tasks at that scale.
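The tier VRAM figures are reproducible from a simple formula. A ballpark sketch — the effective bits/weight for Q4_K_M and the flat overhead term are assumed averages, not measured values:

```python
def est_vram_gb(params_b: float, bits_per_weight: float = 4.85,
                overhead_gb: float = 1.0) -> float:
    # Q4_K_M averages a bit under 5 bits/weight once quant scales
    # and higher-precision layers are counted (assumed figure);
    # overhead_gb approximates runtime buffers plus a small KV cache.
    return params_b * bits_per_weight / 8 + overhead_gb

for p in (8, 14, 32, 70):
    print(f"{p}B -> ~{est_vram_gb(p):.1f} GB")
```

At 8B this lands near the ~5 GB figures quoted above and at 32B near ~20 GB; real usage varies with context length, since the KV cache grows with the number of tokens held in context.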

Running local models efficiently also depends on agent-side optimizations: KV cache reuse and observation masking can cut token waste by 40–70%.

Run these models in a full IDE.

Bodega One supports every model on this page. One-time purchase. Your data never leaves.