Quick answer
For local AI coding in 2026: 6–8 GB VRAM (Qwen3-8B) is where it becomes actually useful, 12–16 GB is the sweet spot for serious work, 24 GB+ is gold-tier. RTX 3090 24 GB is the community's best-value pick. Use BYOLLM to switch between local and cloud when you need more power.
The question that comes up in every local AI thread: "Can my GPU run this?" Usually someone replies with a number, someone else disputes it, and the thread turns into a benchmark debate. Not super useful if you just want to know whether to bother.
Here's the practical answer. Not a spec sheet. This is a breakdown of what each VRAM tier can actually do, which models work well at each level, and what you'll notice in day-to-day coding use.
Why VRAM is the bottleneck
When you run a local model, the weights load into GPU memory. Not system RAM, not disk. VRAM. A quantized 7B model at Q4 takes roughly 4–5 GB. A 32B model at Q4 takes around 20 GB. If the model doesn't fit, it either falls back to CPU (slow) or fails to load.
System RAM matters for context window overflow and CPU fallback. But if you want GPU inference at useful speeds, VRAM is the number to watch.
Quantization (Q4, Q4_K_M, Q8, etc.) is how the community squeezes large models into smaller VRAM budgets. Q4_K_M is the sweet spot most people land on: a good trade-off between quality loss and file size, and fast enough for real use.
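You can sanity-check whether a model fits before downloading it with back-of-envelope math: parameters times bytes-per-parameter, plus headroom for KV cache and runtime overhead. Here's a minimal sketch; the bytes-per-parameter figures and the 20% overhead factor are rough assumptions, not exact format sizes.

```python
# Rough VRAM estimate for a quantized model: weight size plus a
# fudge factor for KV cache and runtime overhead.
BYTES_PER_PARAM = {
    "Q4_K_M": 0.57,  # ~4.5 bits/param on average (assumption)
    "Q8_0": 1.06,
    "F16": 2.0,
}

def estimate_vram_gb(params_billions: float, quant: str = "Q4_K_M",
                     overhead: float = 1.2) -> float:
    """Approximate VRAM needed to run the model, with ~20% headroom."""
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return round(weights_gb * overhead, 1)

print(estimate_vram_gb(8))   # Qwen3-8B at Q4_K_M  -> 5.5
print(estimate_vram_gb(32))  # 32B at Q4_K_M       -> 21.9
```

Those numbers line up with the figures above: a 7–8B model at Q4 in the 4–6 GB range, a 32B model at around 20 GB plus headroom.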
The VRAM tier table
These are the tiers Bodega One uses internally to match hardware to model recommendations. If you're using a local provider like Ollama or LM Studio, this maps directly to what you should be pulling.
| VRAM | Best agent model | Quality tier |
|---|---|---|
| Under 4 GB | SmolLM3-3B Q4 | Low |
| 6–8 GB | Qwen3-8B Q4_K_M | Good |
| 8–12 GB | Qwen3-14B Q4_K_M | Strong |
| 12–16 GB | GLM-4.7-Flash 30B-A3B Q4 (MoE) | Excellent |
| 16–24 GB | Qwen2.5-Coder-32B Q4 | Gold |
| 24–48 GB | Llama 3.3 70B Q4 | Frontier |
| Apple 16 GB | Qwen3-8B MLX Q4 | Good |
| Apple 64 GB+ | Qwen2.5-Coder-32B MLX | Gold |
"Best agent model" here means the largest model that fits comfortably with room for a reasonable context window. You can often run a bigger model if you're careful about context length, but these are the settings that work reliably.
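The table above is effectively a threshold lookup, which is easy to replicate in your own tooling. This sketch mirrors the discrete-GPU rows; the ceilings and model names come straight from the table, but the function itself is illustrative, not Bodega One's actual API.

```python
# Map detected VRAM (GB) to the tier table above.
# Ceilings and model tags mirror the table; this is a sketch, not an API.
TIERS = [
    (4,  "SmolLM3-3B Q4",             "Low"),
    (8,  "Qwen3-8B Q4_K_M",           "Good"),
    (12, "Qwen3-14B Q4_K_M",          "Strong"),
    (16, "GLM-4.7-Flash 30B-A3B Q4",  "Excellent"),
    (24, "Qwen2.5-Coder-32B Q4",      "Gold"),
    (48, "Llama 3.3 70B Q4",          "Frontier"),
]

def recommend(vram_gb: float) -> str:
    for ceiling, model, tier in TIERS:
        if vram_gb <= ceiling:
            return f"{model} ({tier})"
    return "Llama 3.3 70B Q4 (Frontier)"  # 48 GB+: same frontier tier

print(recommend(12))  # RTX 3060 12 GB -> Qwen3-14B Q4_K_M (Strong)
print(recommend(24))  # RTX 3090      -> Qwen2.5-Coder-32B Q4 (Gold)
```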
What each tier actually feels like
Under 4 GB: entry-level GPUs, integrated graphics
You can run something. SmolLM3-3B Q4 fits in around 2.5 GB. It handles simple tasks, short completions, quick Q&A. For serious coding work (multi-file edits, debugging complex logic, long context), it hits a wall fast. Treat this tier as proof-of-concept, not production workflow.
6–8 GB: RTX 3060, RTX 4060, RX 6600
This is where local AI becomes actually useful. Qwen3-8B Q4_K_M fits here and it punches above its size on code tasks. You'll feel the context limit on larger projects, but for focused single-file or small multi-file work it's solid. The RTX 3060 at 12 GB is actually a sweet-spot card, bumping you up to the next tier.
8–12 GB: RTX 3060 12GB, RTX 4060 Ti, Arc A770
Qwen3-14B Q4_K_M lands here. A real jump in reasoning quality. This tier handles most day-to-day coding tasks well: refactors, new component generation, debugging with reasonable context. If you're buying a single card for local AI work and have a $300–400 budget, aim here.
12–16 GB: RTX 3080, RTX 4070, RTX 4080
GLM-4.7-Flash 30B-A3B Q4 is a mixture-of-experts model. It activates only 3B parameters per token despite having 30B total, so it fits in 12 GB but reasons like a much larger dense model. This tier works extremely well for agentic coding work. Most people who use Bodega One as their primary dev tool are in this range.
16–24 GB: RTX 3090, RTX 4090, RTX 4080 Super
Qwen2.5-Coder-32B Q4 is the gold standard for local coding models as of 2026. It matches or exceeds GPT-4 class performance on many coding benchmarks. At this tier, the main limit is context window, not model quality. The RTX 3090 at 24 GB hits this tier and is the current community favorite for pure local AI work per dollar.
24–48 GB: RTX 3090 Ti, A5000, multi-card setups
Llama 3.3 70B Q4 fits here. Frontier-class quality. Unless you're doing something that specifically needs a 70B model (complex multi-step reasoning, research tasks, long context document analysis), most developers won't notice the difference over 32B for typical coding work. The 3090 24 GB is better value for coding unless you have specific needs.
Apple Silicon is different
Apple M-series chips use unified memory, so the GPU and CPU share the same pool. A MacBook Pro with 16 GB of RAM therefore has most of that pool available for model weights (the OS reserves a share), which puts its headroom in the same ballpark as a 16 GB discrete GPU. The architecture is different, though: Apple Silicon gets its best performance from MLX-optimized model formats rather than GGUF.
At 16 GB unified: Qwen3-8B MLX Q4 runs well and is the recommended starting point. At 32 GB: Qwen3-14B or Qwen2.5-Coder-7B MLX. At 64 GB+: Qwen2.5-Coder-32B MLX, the same gold tier as a high-end discrete GPU setup.
The M4 Max at 128 GB can run 70B class models comfortably. Expensive, but it's the only laptop setup that gets you frontier-quality local inference.
Bodega One detects Apple Silicon automatically and recommends MLX provider format. If you're on a Mac, set your provider to Ollama with an MLX-tagged model variant for best performance, or use the MLX provider preset directly.
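If you're building your own setup script rather than relying on auto-detection, the platform check itself is a one-liner. This is a generic sketch of how such a check could work, not Bodega One's actual detection code.

```python
import platform

def is_apple_silicon() -> bool:
    """True on macOS running natively on an M-series (arm64) chip.

    Note: a Python binary running under Rosetta 2 reports x86_64,
    so this can return False on M-series hardware in that case.
    """
    return platform.system() == "Darwin" and platform.machine() == "arm64"

if is_apple_silicon():
    print("Prefer an MLX-tagged model variant")
else:
    print("Prefer a GGUF quant sized to your VRAM")
```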
Coding agents need more than chat
Chat performance and agent performance aren't the same thing. A model that handles simple Q&A at 6 GB can struggle when used as an autonomous coding agent, because:
- Agent loops maintain longer context: system prompt, memory, file content, tool history, all at once
- Multi-file tasks accumulate tokens fast
- Instruction-following degrades toward the end of a long context, and smaller models suffer from this more
As a rough rule: bump up one tier from whatever you'd use for chat. If you run 8B for chatting comfortably, a 14B model will serve you better for agentic coding work. If you want serious autonomous agent work (the kind that rewrites files, runs tests, and handles multi-step tasks), 12 GB+ is the practical floor.
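To see why agent loops blow past chat-sized contexts, it helps to count tokens for a single iteration. All the numbers below are illustrative assumptions, not measurements, but the shape of the arithmetic is the point.

```python
# Back-of-envelope token budget for one agent loop iteration.
# Every figure here is an illustrative assumption.
system_prompt = 1_500
memory_notes  = 800
open_files    = 3 * 2_000   # three files at ~2k tokens each
tool_history  = 20 * 300    # 20 prior tool calls, ~300 tokens each

total = system_prompt + memory_notes + open_files + tool_history
print(total)  # 14300 tokens before the model writes a single line
```

A plain chat turn might use a tenth of that, which is why a model that feels fine in chat can fall apart mid-task as an agent.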
Bodega One's Quality Enforcement Layer runs verification steps throughout the agent loop, which helps compensate for smaller models' tendency to make hard-to-catch errors. But there's no substitute for raw model quality on complex tasks.
What about CPU-only?
It works for small models. llama.cpp runs on CPU; Ollama falls back to CPU if no GPU is detected. Speeds are typically 1–5 tokens/second depending on your CPU and model size, versus 30–80+ tokens/second on a mid-range GPU.
For interactive chat that's usable. For an agentic coding loop that might run 50+ tool calls per task, 2 tokens/second is painful. CPU-only is a fallback, not a recommendation.
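The arithmetic makes the gap concrete. Assuming a task with 50 tool calls generating roughly 400 tokens each (both figures are illustrative), generation time alone looks like this:

```python
# Wall-clock generation time for one agent task.
# Call count, tokens per call, and speeds are illustrative assumptions.
calls, tokens_per_call = 50, 400

def task_minutes(tokens_per_second: float) -> float:
    return round(calls * tokens_per_call / tokens_per_second / 60, 1)

print(task_minutes(2))   # CPU-only at 2 tok/s   -> 166.7 minutes
print(task_minutes(50))  # mid-range GPU, 50 tok/s -> 6.7 minutes
```

Nearly three hours versus seven minutes, before counting prompt processing or the tool calls themselves.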
If you're buying hardware
For pure local AI coding work in 2026, the community consensus:
- Budget pick: RTX 3060 12 GB. Cheap used, gets you into the 8–12 GB tier, runs 14B models well.
- Best value: RTX 3090 24 GB. Used market is strong. Fits Qwen2.5-Coder-32B Q4 with room for context. The card most serious local AI developers actually run.
- New card pick: RTX 4080 Super 16 GB. New warrantied hardware, excellent CUDA performance, fits into the excellent-tier models.
- Mac: M3 Pro 36 GB or M4 Max 48 GB+ if you're staying in the Apple ecosystem.
VRAM is the spec to optimize for. Everything else (CUDA cores, clock speed, power consumption) matters less for inference workloads than raw memory capacity.
Getting started
If you already have a GPU and want to wire it up to a local AI coding environment, the Bodega One quick start guide covers the full setup: provider configuration, model selection, and the first agent task. Bodega One auto-detects your hardware on launch and recommends a model tier based on your VRAM.
And if you want to go deeper on what BYOLLM means, and why it matters that you can swap models freely, that's the follow-up read.
For a full side-by-side comparison of every model at each tier, with benchmarks, license, Ollama availability, and descriptions, see the Local LLM Guide.
Ready to own your tools?
Beta opens May 2026. Complete 14 days and earn a $30 promo code.