How to use our free VRAM calculator for local LLMs

Bodega One · 6 min read

Quick answer

VRAM is the hard ceiling for local LLMs. If the model does not fit, it will not load. Use our free VRAM calculator to find exactly how much memory your model needs at any quantization level, and which GPU to buy.

Why VRAM matters more than RAM for local AI

When you run a local LLM, the model weights need to live in GPU memory (VRAM), not system RAM. Your CPU and system RAM handle everything else, but the model itself must fit on your GPU. If it does not fit, most runtimes will either refuse to load it or fall back to CPU offloading.

CPU offloading works, but it is 5-10x slower than full GPU inference. For a quick summary it is fine. For agentic tasks that make dozens of LLM calls, that slowdown compounds into minutes of dead time per session.

The rough formula: model VRAM (GB) = (parameters x bits per weight) / 8. A 7 billion parameter model at FP16 (16 bits) needs about 14 GB. At Q4 (4 bits), it needs about 3.5 GB. That is why quantization matters so much for consumer hardware.
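
In code, the back-of-the-envelope version of that formula looks like this (a sketch; real model files add a little metadata and per-tensor overhead on top):

```python
def model_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight footprint: (parameters x bits per weight) / 8 bits per byte."""
    return params_billions * bits_per_weight / 8

print(model_vram_gb(7, 16))  # 7B at FP16 -> 14.0 GB
print(model_vram_gb(7, 4))   # 7B at Q4   -> 3.5 GB
```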

How to use the calculator

Go to our free VRAM calculator. Pick your model size (7B, 13B, 34B, 70B, or a custom parameter count), then pick your quantization level (Q4_K_M, Q6_K, Q8_0, or FP16). That is it.

The calculator outputs a total VRAM requirement broken down into three buckets: model weights, KV cache, and runtime overhead. It shows both a minimum figure (model barely fits) and a recommended figure (comfortable inference with a full context window). Always aim for the recommended number. The minimum is technically true and practically painful.
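
The two figures relate in a simple way. Here is a minimal sketch of the split, assuming an illustrative 1 GB of runtime overhead (the calculator's exact constants may differ):

```python
def vram_figures(weights_gb: float, kv_cache_gb: float,
                 overhead_gb: float = 1.0) -> tuple[float, float]:
    """Return (minimum, recommended) VRAM in GB.

    minimum: the weights load, but the KV cache is squeezed.
    recommended: weights + full-context KV cache + runtime overhead.
    overhead_gb is an illustrative assumption, not the calculator's constant.
    """
    minimum = weights_gb + overhead_gb
    recommended = weights_gb + kv_cache_gb + overhead_gb
    return minimum, recommended

print(vram_figures(3.5, 2.0))  # (4.5, 6.5) for a 7B Q4 model with a 2 GB KV cache
```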

Understanding quantization

Quantization reduces the number of bits used to store each model weight. Less precision means smaller file, smaller VRAM footprint, faster inference. The quality tradeoff is real but smaller than most people expect.

  • FP16: 2 bytes per weight. Full precision. Best quality. Needs the most VRAM. Use when you have enough headroom and quality is critical.
  • Q8_0: 1 byte per weight. Minimal quality loss vs FP16. A good midpoint if you have the VRAM to spare.
  • Q4_K_M: Around 0.5 bytes per weight. Roughly 5% quality drop vs FP16 on most benchmarks. This is the sweet spot for most developer hardware. You get roughly 4x the model for the same VRAM budget.
  • Q6_K: Between Q4 and Q8. Useful when you want better quality than Q4 but cannot fit Q8.

Rule of thumb: start with Q4_K_M. If code quality feels off on harder reasoning tasks, try Q6_K or Q8_0. Most developers using Bodega One run Q4_K_M or Q6_K for day-to-day work.
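
Translating those bullets into numbers for a 7B model (the bytes-per-weight values for the K-quants are rough averages, since those formats mix precisions within one file):

```python
# Approximate bytes per weight; Q4_K_M and Q6_K are mixed-precision formats,
# so these are ballpark averages, not exact figures.
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8_0": 1.0, "Q6_K": 0.8, "Q4_K_M": 0.56}

def weights_gb(params_billions: float, quant: str) -> float:
    """Weight footprint in GB for a given parameter count and quant level."""
    return params_billions * BYTES_PER_WEIGHT[quant]

for quant in ("FP16", "Q8_0", "Q6_K", "Q4_K_M"):
    print(f"7B at {quant}: ~{weights_gb(7, quant):.1f} GB")
```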

VRAM tiers and what runs on each

Here is what fits at each common consumer VRAM level, using Q4_K_M quantization:

| VRAM tier | Example GPU | Best model at Q4_K_M | Quality level |
| --- | --- | --- | --- |
| 6-8 GB | RTX 3060, 4060 | Qwen3-8B | Good for light tasks |
| 8-12 GB | RTX 3070, 4070 | Qwen3-14B | Strong code quality |
| 16-24 GB | RTX 3090, 4080 | Qwen2.5-Coder-32B | Production grade |
| 24-48 GB | RTX 4090, dual 3090 | Llama 3.3 70B | Frontier local quality |
| Apple 16 GB unified | M3 16 GB | Qwen3-8B MLX Q4 | Good |
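
You can read the table as a simple lookup: take the largest tier floor your card clears. A sketch (the tier floors and model names mirror the table above; exact boundary handling is a judgment call):

```python
# Tier floors in GB -> recommended Q4_K_M model, mirroring the table above.
TIERS = [(6, "Qwen3-8B"), (8, "Qwen3-14B"),
         (16, "Qwen2.5-Coder-32B"), (24, "Llama 3.3 70B")]

def best_model(vram_gb: float) -> str:
    """Return the biggest recommended model whose tier floor this card clears."""
    chosen = "below 6 GB: consider smaller models"
    for floor, model in TIERS:
        if vram_gb >= floor:
            chosen = model  # keep upgrading while we clear each floor
    return chosen

print(best_model(12))  # Qwen3-14B
print(best_model(24))  # Llama 3.3 70B
```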

Apple Silicon is a special case. The unified memory architecture means your GPU and CPU share the same physical memory pool. A 16 GB M3 MacBook makes most of that 16 GB available to the GPU (macOS reserves a portion for the system), rather than the fixed VRAM/RAM split you get on a discrete GPU system. This is one of the reasons Apple Silicon is popular for local AI work.

Why the calculator shows recommended vs minimum

Minimum VRAM means the model loads, but barely. At minimum headroom, the KV cache is cramped, which means a shorter effective context window. Inference is slower because the GPU has no buffer. And any other process that uses VRAM (display, other apps) can push you over the edge mid-session.

Recommended VRAM means the full context window is available, inference runs at comfortable speed, and you have headroom for OS and driver overhead. Always target recommended if you are doing agentic tasks. Agents make many sequential LLM calls and pass large files back and forth. A cramped context window kills them fast.
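
To see why context length matters so much, note that the KV cache of a standard transformer grows linearly with tokens. A sketch using Llama 3 8B's published shape (32 layers, 8 KV heads, head dimension 128) with an FP16 cache:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """2 tensors (keys + values) x layers x KV heads x head dim x tokens x element size."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 2**30

# Llama 3 8B-style config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
print(kv_cache_gib(32, 8, 128, 8192))   # 1.0 GiB at an 8K context
print(kv_cache_gib(32, 8, 128, 32768))  # 4.0 GiB at a 32K context
```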

How Bodega One uses VRAM estimates

Bodega One's Context Inspector shows your live token budget before each LLM call. You can see how much of your context window is being used by the system prompt, conversation history, and the files you have open. This prevents the most common failure mode: hitting the context limit mid-task without realizing it.

BYOLLM means you can swap providers mid-session. If a local model is too slow for a complex task, switch to a cloud provider for that call. If you hit a VRAM limit, drop to a smaller quantization without restarting. The provider selection is live, not baked into the session.

Air-gap mode works identically to normal mode from a VRAM perspective. The only difference is network egress is blocked. VRAM constraints are a hardware limitation, not a network one. They apply equally regardless of whether you are online or offline.

Next steps

Try the VRAM calculator to find your exact VRAM requirement. Then check our local LLM rankings by SWE-bench score to find the best model for your hardware tier. If you want a full breakdown of which GPU to buy at each budget, read our developer GPU guide.

Ready to own your tools?

Beta opens May 2026. Complete 14 days and earn a $30 promo code.