Quick answer
For local AI coding in 2026: 6–8 GB VRAM (Qwen3-8B) is where it becomes actually useful, 12–16 GB is the sweet spot for serious work, 24 GB+ is gold-tier. RTX 3090 24 GB is the community's best-value pick. Use BYOLLM to switch between local and cloud when you need more power.
The question that comes up in every local AI thread: "Can my GPU run this?" Usually someone replies with a number, someone else disputes it, and the thread turns into a benchmark debate. Not super useful if you just want to know whether to bother.
Here's the practical answer. Not a spec sheet. This is a breakdown of what each VRAM tier can actually do, which models work well at each level, and what you'll notice in day-to-day coding use.
Why VRAM is the bottleneck
When you run a local model, the weights load into GPU memory. Not system RAM, not disk. VRAM. A quantized 7B model at Q4 takes roughly 4–5 GB. A 32B model at Q4 takes around 20 GB. If the model doesn't fit, it either falls back to CPU (slow) or fails to load.
System RAM matters for context window overflow and CPU fallback. But if you want GPU inference at useful speeds, VRAM is the number to watch.
Quantization (Q4, Q4_K_M, Q8, etc.) is how the community squeezes large models into smaller VRAM budgets. Q4_K_M is the sweet spot most people land on: a good trade-off between quality loss and file size, and fast enough for real use.
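You can sanity-check whether a model fits before downloading it with back-of-envelope math: parameters times bytes-per-parameter, plus headroom for KV cache and runtime overhead. Here's a minimal sketch; the bytes-per-parameter figures and the 20% overhead factor are rough assumptions, not exact format sizes.

```python
# Rough VRAM estimate for a quantized model: weight size plus a
# fudge factor for KV cache and runtime overhead.
BYTES_PER_PARAM = {
    "Q4_K_M": 0.57,  # ~4.5 bits/param on average (assumption)
    "Q8_0": 1.06,
    "F16": 2.0,
}

def estimate_vram_gb(params_billions: float, quant: str = "Q4_K_M",
                     overhead: float = 1.2) -> float:
    """Approximate VRAM needed to run the model, with ~20% headroom."""
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return round(weights_gb * overhead, 1)

print(estimate_vram_gb(8))   # Qwen3-8B at Q4_K_M  -> 5.5
print(estimate_vram_gb(32))  # 32B at Q4_K_M       -> 21.9
```

Those numbers line up with the figures above: a 7–8B model at Q4 in the 4–6 GB range, a 32B model at around 20 GB plus headroom.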
The VRAM tier table
These are the tiers Bodega One uses internally to match hardware to model recommendations. If you're using a local provider like Ollama or LM Studio, this maps directly to what you should be pulling.
| VRAM | Best agent model | Quality tier |
|---|---|---|
| Under 4 GB | SmolLM3-3B Q4 | Low |
| 6–8 GB | Qwen3-8B Q4_K_M | Good |
| 8–12 GB | Qwen3-14B Q4_K_M | Strong |
| 12–16 GB | GLM-4.7-Flash 30B-A3B Q4 (MoE) | Excellent |
| 16–24 GB | Qwen2.5-Coder-32B Q4 | Gold |
| 24–48 GB | Llama 3.3 70B Q4 | Frontier |
| Apple 16 GB | Qwen3-8B MLX Q4 | Good |
| Apple 64 GB+ | Qwen2.5-Coder-32B MLX | Gold |
"Best agent model" here means the largest model that fits comfortably with room for a reasonable context window. You can often run a bigger model if you're careful about context length, but these are the settings that work reliably.
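The table above is effectively a threshold lookup, which is easy to replicate in your own tooling. This sketch mirrors the discrete-GPU rows; the ceilings and model names come straight from the table, but the function itself is illustrative, not Bodega One's actual API.

```python
# Map detected VRAM (GB) to the tier table above.
# Ceilings and model tags mirror the table; this is a sketch, not an API.
TIERS = [
    (4,  "SmolLM3-3B Q4",             "Low"),
    (8,  "Qwen3-8B Q4_K_M",           "Good"),
    (12, "Qwen3-14B Q4_K_M",          "Strong"),
    (16, "GLM-4.7-Flash 30B-A3B Q4",  "Excellent"),
    (24, "Qwen2.5-Coder-32B Q4",      "Gold"),
    (48, "Llama 3.3 70B Q4",          "Frontier"),
]

def recommend(vram_gb: float) -> str:
    for ceiling, model, tier in TIERS:
        if vram_gb <= ceiling:
            return f"{model} ({tier})"
    return "Llama 3.3 70B Q4 (Frontier)"  # 48 GB+: same frontier tier

print(recommend(12))  # RTX 3060 12 GB -> Qwen3-14B Q4_K_M (Strong)
print(recommend(24))  # RTX 3090      -> Qwen2.5-Coder-32B Q4 (Gold)
```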
What each tier actually feels like
Under 4 GB: entry-level GPUs, integrated graphics
You can run something. SmolLM3-3B Q4 fits in around 2.5 GB. It handles simple tasks, short completions, quick Q&A. For serious coding work (multi-file edits, debugging complex logic, long context), it hits a wall fast. Treat this tier as proof-of-concept, not production workflow.
6–8 GB: RTX 3060, RTX 4060, RX 6600
This is where local AI becomes actually useful. Qwen3-8B Q4_K_M fits here and it punches above its size on code tasks. You'll feel the context limit on larger projects, but for focused single-file or small multi-file work it's solid. The RTX 3060 at 12 GB is actually a sweet-spot card, bumping you up to the next tier.
8–12 GB: RTX 3060 12GB, RTX 4060 Ti, Arc A770
Qwen3-14B Q4_K_M lands here. A real jump in reasoning quality. This tier handles most day-to-day coding tasks well: refactors, new component generation, debugging with reasonable context. If you're buying a single card for local AI work and have a $300–400 budget, aim here.
12–16 GB: RTX 3080, RTX 4070, RTX 4080
GLM-4.7-Flash 30B-A3B Q4 is a mixture-of-experts model. It activates only 3B parameters per token despite having 30B total, so it fits in 12 GB but reasons like a much larger dense model. This tier works extremely well for agentic coding work. Most people who use Bodega One as their primary dev tool are in this range.
16–24 GB: RTX 3090, RTX 4090, RTX 4080 Super
Qwen2.5-Coder-32B Q4 is the gold standard for local coding models as of 2026. It matches or exceeds GPT-4 class performance on many coding benchmarks. At this tier, the main limit is context window, not model quality. The RTX 3090 at 24 GB hits this tier and is the current community favorite for pure local AI work per dollar.
24–48 GB: RTX 3090 Ti, A5000, multi-card setups
Llama 3.3 70B Q4 fits here. Frontier-class quality. Unless you're doing something that specifically needs a 70B model (complex multi-step reasoning, research tasks, long context document analysis), most developers won't notice the difference over 32B for typical coding work. The 3090 24 GB is better value for coding unless you have specific needs.
Apple Silicon is different
Apple M-series chips use unified memory, so the GPU and CPU share the same pool. A MacBook Pro with 16 GB of RAM therefore has most of that pool available for model weights (the OS reserves a share), which puts its headroom in the same ballpark as a 16 GB discrete GPU. The architecture is different, though: Apple Silicon gets its best performance from MLX-optimized model formats rather than GGUF.
At 16 GB unified: Qwen3-8B MLX Q4 runs well and is the recommended starting point. At 32 GB: Qwen3-14B or Qwen2.5-Coder-7B MLX. At 64 GB+: Qwen2.5-Coder-32B MLX, the same gold tier as a high-end discrete GPU setup.
The M4 Max at 128 GB can run 70B class models comfortably. Expensive, but it's the only laptop setup that gets you frontier-quality local inference.
Bodega One detects Apple Silicon automatically and recommends MLX provider format. If you're on a Mac, set your provider to Ollama with an MLX-tagged model variant for best performance, or use the MLX provider preset directly.
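If you're building your own setup script rather than relying on auto-detection, the platform check itself is a one-liner. This is a generic sketch of how such a check could work, not Bodega One's actual detection code.

```python
import platform

def is_apple_silicon() -> bool:
    """True on macOS running natively on an M-series (arm64) chip.

    Note: a Python binary running under Rosetta 2 reports x86_64,
    so this can return False on M-series hardware in that case.
    """
    return platform.system() == "Darwin" and platform.machine() == "arm64"

if is_apple_silicon():
    print("Prefer an MLX-tagged model variant")
else:
    print("Prefer a GGUF quant sized to your VRAM")
```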
Coding agents need more than chat
Chat performance and agent performance aren't the same thing. A model that handles simple Q&A at 6 GB can struggle when used as an autonomous coding agent, because:
- Agent loops maintain longer context: system prompt, memory, file content, tool history, all at once
- Multi-file tasks accumulate tokens fast
- Instruction-following degrades toward the end of a long context, and smaller models suffer from this more
As a rough rule: bump up one tier from whatever you'd use for chat. If you run 8B for chatting comfortably, a 14B model will serve you better for agentic coding work. If you want serious autonomous agent work (the kind that rewrites files, runs tests, and handles multi-step tasks), 12 GB+ is the practical floor.
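To see why agent loops blow past chat-sized contexts, it helps to count tokens for a single iteration. All the numbers below are illustrative assumptions, not measurements, but the shape of the arithmetic is the point.

```python
# Back-of-envelope token budget for one agent loop iteration.
# Every figure here is an illustrative assumption.
system_prompt = 1_500
memory_notes  = 800
open_files    = 3 * 2_000   # three files at ~2k tokens each
tool_history  = 20 * 300    # 20 prior tool calls, ~300 tokens each

total = system_prompt + memory_notes + open_files + tool_history
print(total)  # 14300 tokens before the model writes a single line
```

A plain chat turn might use a tenth of that, which is why a model that feels fine in chat can fall apart mid-task as an agent.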
Bodega One's Quality Enforcement Layer runs verification steps throughout the agent loop, which helps compensate for smaller models' tendency to make hard-to-catch errors. But there's no substitute for raw model quality on complex tasks.
What about CPU-only?
It works for small models. llama.cpp runs on CPU; Ollama falls back to CPU if no GPU is detected. Speeds are typically 1–5 tokens/second depending on your CPU and model size, versus 30–80+ tokens/second on a mid-range GPU.
For interactive chat that's usable. For an agentic coding loop that might run 50+ tool calls per task, 2 tokens/second is painful. CPU-only is a fallback, not a recommendation.
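The arithmetic makes the gap concrete. Assuming a task with 50 tool calls generating roughly 400 tokens each (both figures are illustrative), generation time alone looks like this:

```python
# Wall-clock generation time for one agent task.
# Call count, tokens per call, and speeds are illustrative assumptions.
calls, tokens_per_call = 50, 400

def task_minutes(tokens_per_second: float) -> float:
    return round(calls * tokens_per_call / tokens_per_second / 60, 1)

print(task_minutes(2))   # CPU-only at 2 tok/s   -> 166.7 minutes
print(task_minutes(50))  # mid-range GPU, 50 tok/s -> 6.7 minutes
```

Nearly three hours versus seven minutes, before counting prompt processing or the tool calls themselves.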
If you're buying hardware
For pure local AI coding work in 2026, the community consensus:
- Budget pick: RTX 3060 12 GB. Cheap used, gets you into the 8–12 GB tier, runs 14B models well.
- Best value: RTX 3090 24 GB. Used market is strong. Fits Qwen2.5-Coder-32B Q4 with room for context. The card most serious local AI developers actually run.
- New card pick: RTX 4080 Super 16 GB. New warrantied hardware, excellent CUDA performance, fits into the excellent-tier models.
- Mac: M3 Pro 36 GB or M4 Max 48 GB+ if you're staying in the Apple ecosystem.
VRAM is the spec to optimize for. Everything else (CUDA cores, clock speed, power consumption) matters less for inference workloads than raw memory capacity.
Getting started
If you already have a GPU and want to wire it up to a local AI coding environment, the Bodega One quick start guide covers the full setup: provider configuration, model selection, and the first agent task. Bodega One auto-detects your hardware on launch and recommends a model tier based on your VRAM.
And if you want to go deeper on what BYOLLM means, and why it matters that you can swap models freely, that's the follow-up read.
For a full side-by-side comparison of every model at each tier, with benchmarks, license, Ollama availability, and descriptions, see the Local LLM Guide.
Ready to own your tools?
Beta opens May 2026. Complete 14 days and earn a $30 promo code.