How to use our free VRAM calculator for local LLMs

Bodega One · 6 min read

Quick answer

VRAM is the hard ceiling for local LLMs. If the model does not fit, it will not load. Use our free VRAM calculator to find exactly how much memory your model needs at any quantization level, and which GPU to buy.

Why VRAM matters more than RAM for local AI

When you run a local LLM, the model weights need to live in GPU memory (VRAM), not system RAM. Your CPU and system RAM handle everything else, but the model itself must fit on your GPU. If it does not fit, most runtimes will either refuse to load it or fall back to CPU offloading.

CPU offloading works, but it is 5-10x slower than full GPU inference. For a quick summary it is fine. For agentic tasks that make dozens of LLM calls, that slowdown compounds into minutes of dead time per session.

The rough formula: model VRAM (GB) = (parameters x bits per weight) / 8. A 7 billion parameter model at FP16 (16 bits) needs about 14 GB. At Q4 (4 bits), it needs about 3.5 GB. That is why quantization matters so much for consumer hardware.
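
In code, the back-of-the-envelope version of that formula looks like this (a sketch; real model files add a little metadata and per-tensor overhead on top):

```python
def model_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight footprint: (parameters x bits per weight) / 8 bits per byte."""
    return params_billions * bits_per_weight / 8

print(model_vram_gb(7, 16))  # 7B at FP16 -> 14.0 GB
print(model_vram_gb(7, 4))   # 7B at Q4   -> 3.5 GB
```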

How to use the calculator

Go to our free VRAM calculator. Pick your model size (7B, 13B, 34B, 70B, or a custom parameter count), then pick your quantization level (Q4_K_M, Q6_K, Q8_0, or FP16). That is it.

The calculator outputs a total VRAM requirement broken down into three buckets: model weights, KV cache, and runtime overhead. It shows both a minimum figure (model barely fits) and a recommended figure (comfortable inference with a full context window). Always aim for the recommended number. The minimum is technically true and practically painful.
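
The two figures relate in a simple way. Here is a minimal sketch of the split, assuming an illustrative 1 GB of runtime overhead (the calculator's exact constants may differ):

```python
def vram_figures(weights_gb: float, kv_cache_gb: float,
                 overhead_gb: float = 1.0) -> tuple[float, float]:
    """Return (minimum, recommended) VRAM in GB.

    minimum: the weights load, but the KV cache is squeezed.
    recommended: weights + full-context KV cache + runtime overhead.
    overhead_gb is an illustrative assumption, not the calculator's constant.
    """
    minimum = weights_gb + overhead_gb
    recommended = weights_gb + kv_cache_gb + overhead_gb
    return minimum, recommended

print(vram_figures(3.5, 2.0))  # (4.5, 6.5) for a 7B Q4 model with a 2 GB KV cache
```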

Understanding quantization

Quantization reduces the number of bits used to store each model weight. Less precision means smaller file, smaller VRAM footprint, faster inference. The quality tradeoff is real but smaller than most people expect.

  • FP16: 2 bytes per weight. Full precision. Best quality. Needs the most VRAM. Use when you have enough headroom and quality is critical.
  • Q8_0: 1 byte per weight. Minimal quality loss vs FP16. A good midpoint if you have the VRAM to spare.
  • Q4_K_M: Around 0.5 bytes per weight. Roughly 5% quality drop vs FP16 on most benchmarks. This is the sweet spot for most developer hardware. You get roughly 4x the model for the same VRAM budget.
  • Q6_K: Between Q4 and Q8. Useful when you want better quality than Q4 but cannot fit Q8.

Rule of thumb: start with Q4_K_M. If code quality feels off on harder reasoning tasks, try Q6_K or Q8_0. Most developers using Bodega One run Q4_K_M or Q6_K for day-to-day work.
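
Translating those bullets into numbers for a 7B model (the bytes-per-weight values for the K-quants are rough averages, since those formats mix precisions within one file):

```python
# Approximate bytes per weight; Q4_K_M and Q6_K are mixed-precision formats,
# so these are ballpark averages, not exact figures.
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8_0": 1.0, "Q6_K": 0.8, "Q4_K_M": 0.56}

def weights_gb(params_billions: float, quant: str) -> float:
    """Weight footprint in GB for a given parameter count and quant level."""
    return params_billions * BYTES_PER_WEIGHT[quant]

for quant in ("FP16", "Q8_0", "Q6_K", "Q4_K_M"):
    print(f"7B at {quant}: ~{weights_gb(7, quant):.1f} GB")
```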

VRAM tiers and what runs on each

Here is what fits at each common consumer VRAM level, using Q4_K_M quantization:

| VRAM tier | Example GPU | Best model at Q4_K_M | Quality level |
| --- | --- | --- | --- |
| 6-8 GB | RTX 3060, 4060 | Qwen3-8B | Good for light tasks |
| 8-12 GB | RTX 3070, 4070 | Qwen3-14B | Strong code quality |
| 16-24 GB | RTX 3090, 4080 | Qwen2.5-Coder-32B | Production grade |
| 24-48 GB | RTX 4090, dual 3090 | Llama 3.3 70B | Frontier local quality |
| Apple 16 GB unified | M3 16 GB | Qwen3-8B MLX Q4 | Good |
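
You can read the table as a simple lookup: take the largest tier floor your card clears. A sketch (the tier floors and model names mirror the table above; exact boundary handling is a judgment call):

```python
# Tier floors in GB -> recommended Q4_K_M model, mirroring the table above.
TIERS = [(6, "Qwen3-8B"), (8, "Qwen3-14B"),
         (16, "Qwen2.5-Coder-32B"), (24, "Llama 3.3 70B")]

def best_model(vram_gb: float) -> str:
    """Return the biggest recommended model whose tier floor this card clears."""
    chosen = "below 6 GB: consider smaller models"
    for floor, model in TIERS:
        if vram_gb >= floor:
            chosen = model  # keep upgrading while we clear each floor
    return chosen

print(best_model(12))  # Qwen3-14B
print(best_model(24))  # Llama 3.3 70B
```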

Apple Silicon is a special case. The unified memory architecture means your GPU and CPU share the same physical memory pool. A 16 GB M3 MacBook makes most of that 16 GB available to the GPU (macOS reserves a portion for the system), rather than the fixed VRAM/RAM split you get on a discrete GPU system. This is one of the reasons Apple Silicon is popular for local AI work.

Why the calculator shows recommended vs minimum

Minimum VRAM means the model loads, but barely. At minimum headroom, the KV cache is cramped, which means a shorter effective context window. Inference is slower because the GPU has no buffer. And any other process that uses VRAM (display, other apps) can push you over the edge mid-session.

Recommended VRAM means the full context window is available, inference runs at comfortable speed, and you have headroom for OS and driver overhead. Always target recommended if you are doing agentic tasks. Agents make many sequential LLM calls and pass large files back and forth. A cramped context window kills them fast.
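
To see why context length matters so much, note that the KV cache of a standard transformer grows linearly with tokens. A sketch using Llama 3 8B's published shape (32 layers, 8 KV heads, head dimension 128) with an FP16 cache:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """2 tensors (keys + values) x layers x KV heads x head dim x tokens x element size."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 2**30

# Llama 3 8B-style config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
print(kv_cache_gib(32, 8, 128, 8192))   # 1.0 GiB at an 8K context
print(kv_cache_gib(32, 8, 128, 32768))  # 4.0 GiB at a 32K context
```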

How Bodega One uses VRAM estimates

Bodega One's Context Inspector shows your live token budget before each LLM call. You can see how much of your context window is being used by the system prompt, conversation history, and the files you have open. This prevents the most common failure mode: hitting the context limit mid-task without realizing it.

BYOLLM means you can swap providers mid-session. If a local model is too slow for a complex task, switch to a cloud provider for that call. If you hit a VRAM limit, drop to a smaller quantization without restarting. The provider selection is live, not baked into the session.

Air-gap mode works identically to normal mode from a VRAM perspective. The only difference is network egress is blocked. VRAM constraints are a hardware limitation, not a network one. They apply equally regardless of whether you are online or offline.

Next steps

Try the VRAM calculator to find your exact VRAM requirement. Then check our local LLM rankings by SWE-bench score to find the best model for your hardware tier. If you want a full breakdown of which GPU to buy at each budget, read our developer GPU guide.

Ready to own your tools?

Beta opens May 2026. Complete 14 days and earn a $30 promo code.