Hardware & System Requirements

Bodega One runs entirely on your hardware - no cloud required. This section covers what Bodega detects at startup, how it uses that data to recommend models and cap context, and what hardware gets you which results.

Minimum and recommended specs

Minimum - the app runs, CPU-only inference with small models:


Component	Requirement
OS	Windows 10+, macOS 12+, Ubuntu 22.04+
RAM	8 GB
Storage	10 GB free
GPU	None required (CPU-only mode works with 1–4B models)
Internet	Not required

Recommended - comfortable agentic use with mid-size models:


Component	Requirement
RAM	16–32 GB
GPU	NVIDIA with 8+ GB VRAM (RTX 3060 or better; RTX 5070+ ideal)
Storage	50 GB free

If you have no discrete GPU, Bodega detects CPU-only mode and caps context based on system RAM. You'll see "CPU mode - context capped for sustainable inference" in the AI panels. 1–4B models are usable; anything larger will be slow.

What Bodega detects at startup

On every launch, Bodega probes your GPU VRAM using the systeminformation library. The result feeds three things:

Model recommendations - the Discover tab's Fits My GPU filter and first-run model cards
Context window ceiling - the max tokens Bodega will send to a local model (written to hardware.local_context_ceiling)
Cloud Boost escalation - whether your hardware tier triggers automatic escalation on reasoning-heavy tasks

The detection runs once at startup. VRAM capacity is cached for the session. Free VRAM refreshes every 3 seconds so that a model already loaded is correctly accounted for.

Apple Silicon: Bodega uses total system RAM minus 3 GB as the effective VRAM pool. This matches how unified memory actually works - your GPU and CPU share the same physical pool. The onboarding badge shows your chip name and memory figure.

No GPU: vramGB is reported as 0. Bodega sets isCpuOnly = true and switches to RAM-based context ceilings.

Multi-GPU: Bodega uses the largest single GPU's VRAM. Ollama loads models onto one GPU, not split across cards.
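The three cases above (Apple Silicon, no GPU, multi-GPU) can be sketched as a single function. This is a minimal illustration, not Bodega's actual code: the ProbeInput shape and the summarizeHardware name are hypothetical, and only vramGB and isCpuOnly come from this page.

```typescript
// Hypothetical input shape; Bodega's real probe result is not documented here.
interface ProbeInput {
  gpuVramGB: number[];     // VRAM of each detected discrete GPU, in GB
  totalRamGB: number;      // total system RAM, in GB
  isAppleSilicon: boolean; // e.g. platform is darwin and arch is arm64
}

interface HardwareSummary {
  vramGB: number;
  isCpuOnly: boolean;
}

// Reserve subtracted from unified memory on Apple Silicon (from the text above).
const APPLE_SILICON_RESERVE_GB = 3;

function summarizeHardware(input: ProbeInput): HardwareSummary {
  if (input.isAppleSilicon) {
    // Unified memory: total RAM minus the reserve is the effective VRAM pool.
    const vramGB = Math.max(0, input.totalRamGB - APPLE_SILICON_RESERVE_GB);
    return { vramGB, isCpuOnly: false };
  }
  if (input.gpuVramGB.length === 0) {
    // No GPU: VRAM is reported as 0 and RAM-based ceilings apply.
    return { vramGB: 0, isCpuOnly: true };
  }
  // Multi-GPU: use the largest single card, not the sum.
  return { vramGB: Math.max(...input.gpuVramGB), isCpuOnly: false };
}
```

For instance, a machine with an 8 GB card and a 24 GB card is treated as a 24 GB system, not a 32 GB one.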

Your System card

A live hardware readout appears at the top of the Hardware section in the Help panel.

Open the Help panel: gear icon (top right) → Help
Select Hardware in the left rail
The Your System card at the top shows: CPU model, total RAM, GPU model, VRAM, and a one-line capability summary

The summary uses these thresholds:

VRAM	Summary
40 GB+	Can run 70B models comfortably
20–40 GB	Can run 32B models comfortably
14–20 GB	Can run 14B models comfortably
7–14 GB	Can run 7–8B models comfortably
Under 7 GB	Limited VRAM - 7B models with short context
No GPU	CPU inference only (7–8B)

The card fetches from the backend on mount. If the backend isn't running when you open it, the card won't appear - no error is shown.
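As a rough illustration, the summary thresholds in the table can be written as a lookup. The function name capabilitySummary is hypothetical, and treating each lower bound as inclusive (so exactly 20 GB counts as the 20–40 GB row) is an assumption drawn from the "40 GB+" and "Under 7 GB" wording.

```typescript
// Sketch of the Your System card summary line.
// Assumption: lower bounds are inclusive; vramGB of 0 means no GPU was detected.
function capabilitySummary(vramGB: number): string {
  if (vramGB <= 0) return "CPU inference only (7–8B)";
  if (vramGB >= 40) return "Can run 70B models comfortably";
  if (vramGB >= 20) return "Can run 32B models comfortably";
  if (vramGB >= 14) return "Can run 14B models comfortably";
  if (vramGB >= 7) return "Can run 7–8B models comfortably";
  return "Limited VRAM - 7B models with short context";
}

console.log(capabilitySummary(24)); // a 24 GB card lands in the 20–40 GB row
```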

Two tier systems (and why they differ)

Bodega uses two separate hardware classification systems. They serve different purposes and have different thresholds - do not confuse them.

Discover tab tiers (used for model recommendations and the Fits My GPU filter):

Tier	VRAM
tiny	≤ 4 GB
small	4–8 GB
medium	8–16 GB
large	16–24 GB
xl	24+ GB

Routing tiers (used by the agentic loop for model selection and Cloud Boost escalation decisions):

Tier	VRAM	Largest model at Q4_K_M
minimal	< 6 GB	up to 3B
budget	6–10 GB	up to 7B
mid	10–18 GB	up to 13B
high	18–32 GB	up to 30B
prosumer	32+ GB	70B+

The routing tiers also expose two capability flags: canRunSmartTier (13B+ accessible, requires 10 GB) and canRunCodeTier (code-specialized 7B models, requires 6 GB).
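To see how the same card lands in different tiers, here is a small sketch of both classifiers. The function names are hypothetical; only the tier names, thresholds, and the two flag names come from this page. Where a boundary is shared by two rows (for example 8 GB in "4–8" and "8–16"), the choice made below is an assumption.

```typescript
type DiscoverTier = "tiny" | "small" | "medium" | "large" | "xl";
type RoutingTier = "minimal" | "budget" | "mid" | "high" | "prosumer";

// Discover tab tiers. "≤ 4 GB" and "24+ GB" are taken literally;
// the interior boundaries (8 and 16) are assumed to belong to the lower tier.
function discoverTier(vramGB: number): DiscoverTier {
  if (vramGB <= 4) return "tiny";
  if (vramGB <= 8) return "small";
  if (vramGB <= 16) return "medium";
  if (vramGB < 24) return "large";
  return "xl";
}

// Routing tiers. "< 6 GB" and "32+ GB" imply inclusive lower bounds.
function routingTier(vramGB: number): RoutingTier {
  if (vramGB < 6) return "minimal";
  if (vramGB < 10) return "budget";
  if (vramGB < 18) return "mid";
  if (vramGB < 32) return "high";
  return "prosumer";
}

// Capability flags exposed alongside the routing tier.
function capabilityFlags(vramGB: number): { canRunSmartTier: boolean; canRunCodeTier: boolean } {
  return {
    canRunSmartTier: vramGB >= 10, // 13B+ accessible
    canRunCodeTier: vramGB >= 6,   // code-specialized 7B models
  };
}
```

A 24 GB card, for example, is xl for the Discover tab but only high for routing, which is why the two systems should not be read interchangeably.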

VRAM sizing: the Q4_K_M rule of thumb

For Q4_K_M quantization - the default in Bodega's managed llama.cpp mode - total VRAM use is roughly:

VRAM (GB) ≈ parameters (B) × 0.55 + KV cache + ~1 GB overhead

Actual sizes from Bodega's catalog (Q4_K_M):

Model size	VRAM (weights only)
1.5B	1.1 GB
3B	1.9 GB
4B	2.5 GB
7B	4.7 GB
8B	5.2 GB
14B	9.0 GB
24B (MoE)	14.3 GB
30B (MoE)	18.6 GB
32B	19.9 GB
70B	42.5 GB

KV cache is the hidden cost. An 8B model's KV cache grows from about 1.2 GB at 8K tokens to about 20 GB at 128K tokens, so at long contexts the cache, not the weights, dominates VRAM use. Bodega's context ceilings are set conservatively to prevent thrashing during agentic runs.
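The rule of thumb and the KV cache figures can be combined into a quick estimate. This is a back-of-the-envelope sketch: the function names are hypothetical, and the linear scaling of KV cache with context length is an assumption anchored to the 8B figure of about 1.2 GB at 8K tokens. Note that the catalog sizes in the table run somewhat higher than 0.55 GB per billion parameters, so treat the result as a lower-end estimate.

```typescript
const GB_PER_BILLION_PARAMS_Q4_K_M = 0.55; // from the rule of thumb above
const OVERHEAD_GB = 1;                     // the "~1 GB overhead" term

// VRAM (GB) ≈ parameters (B) × 0.55 + KV cache + ~1 GB overhead
function estimateVramGB(paramsBillions: number, kvCacheGB: number): number {
  return paramsBillions * GB_PER_BILLION_PARAMS_Q4_K_M + kvCacheGB + OVERHEAD_GB;
}

// Assumption: KV cache grows linearly with context length,
// scaled from ~1.2 GB at 8,192 tokens for an 8B model.
function kvCacheGBFor8B(contextTokens: number): number {
  return 1.2 * (contextTokens / 8192);
}

// 8B model at 8K context vs. 128K context.
console.log(estimateVramGB(8, kvCacheGBFor8B(8192)).toFixed(1));
console.log(estimateVramGB(8, kvCacheGBFor8B(131072)).toFixed(1));
```

Under these assumptions the same 8B model goes from roughly 6.6 GB at 8K tokens to roughly 24.6 GB at 128K tokens, with the weights term unchanged.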

Context window ceilings

After VRAM detection, Bodega computes a context ceiling and writes it to hardware.local_context_ceiling. This caps how many tokens the agentic loop sends to local models.

GPU path (VRAM-based):

Effective VRAM	Context ceiling
≤ 4 GB	8,192 tokens
4–8 GB	16,384 tokens
8–24 GB	32,768 tokens
24+ GB	65,536 tokens

CPU-only path (RAM-based, no discrete GPU):

System RAM	Context ceiling
≤ 8 GB	4,096 tokens
8–16 GB	8,192 tokens
16–32 GB	16,384 tokens
32+ GB	32,768 tokens

The ceiling calculation uses max(free VRAM, total VRAM − 4 GB) rather than free VRAM alone. This prevents a model that's already loaded from causing the probe to misclassify your card. For example: a 32 GB GPU with a 26B model resident still probes as a 28 GB card for ceiling purposes, not a 4 GB card.
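Both lookup tables and the max(free VRAM, total VRAM − 4 GB) rule can be sketched together. The names CeilingInput, effectiveVramGB, and localContextCeiling are hypothetical, and the handling of values that sit exactly on a shared boundary (such as 8 GB) is an assumption; the first row's "≤" and the last row's "+" are taken literally.

```typescript
interface CeilingInput {
  isCpuOnly: boolean;
  totalVramGB: number;
  freeVramGB: number;
  systemRamGB: number;
}

// A resident model should not make a large card look like a small one.
function effectiveVramGB(totalVramGB: number, freeVramGB: number): number {
  return Math.max(freeVramGB, totalVramGB - 4);
}

function localContextCeiling(input: CeilingInput): number {
  if (input.isCpuOnly) {
    // CPU-only path: ceilings keyed on system RAM.
    const ram = input.systemRamGB;
    if (ram <= 8) return 4096;
    if (ram <= 16) return 8192;
    if (ram < 32) return 16384;
    return 32768;
  }
  // GPU path: ceilings keyed on effective VRAM.
  const vram = effectiveVramGB(input.totalVramGB, input.freeVramGB);
  if (vram <= 4) return 8192;
  if (vram <= 8) return 16384;
  if (vram < 24) return 32768;
  return 65536;
}

// The example from the text: 32 GB card with only 4 GB free.
console.log(
  localContextCeiling({ isCpuOnly: false, totalVramGB: 32, freeVramGB: 4, systemRamGB: 64 })
); // 65536, because the card is treated as 28 GB
```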

To raise the ceiling manually: Settings → (search) context_window_cap → set llm.context_window_cap to a positive integer. Set it to 0 to return to automatic. The budget bar in each AI panel reflects the active ceiling.

Reducing KV cache memory with Ollama

If you're using Ollama as your local provider and running into VRAM pressure at longer contexts, you can cut KV cache size by roughly 50% by setting an environment variable before launching Ollama:

export OLLAMA_KV_CACHE_TYPE=q8_0

On Windows, set this in System Properties → Environment Variables before starting Ollama.

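This is an Ollama setting, not a Bodega setting, so Bodega has no UI control for it. Per Ollama's documentation, KV cache quantization only applies when Flash Attention is enabled (OLLAMA_FLASH_ATTENTION=1). The variable has no effect in llama.cpp managed mode, where Bodega controls the server directly.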

GPU value picks

LLM token generation is usually limited by memory bandwidth, not compute. VRAM capacity and bandwidth matter more than GPU core count.

Pick	GPU	VRAM	Notes
Budget entry	Intel Arc B580	12 GB	$249; solid for 7–8B models
Best value used	RTX 3090	24 GB	$800–950; handles 32B at ~112 tok/s
Best value new	RTX 5070	12 GB GDDR7	$550–750; fast bandwidth
No compromises	RTX 5090	32 GB GDDR7	$2,000+; 32B models comfortably (70B at Q4_K_M needs about 42.5 GB for weights alone)

AMD: The RX 7900 XTX has 24 GB VRAM, but ROCm software support on Windows is inconsistent. On Linux, AMD gets the ROCm llama.cpp binary. On Windows, AMD gets the Vulkan build (HIP Radeon support is experimental).

NVIDIA on Linux: Bodega ships a Vulkan llama.cpp binary for Linux NVIDIA - not CUDA. CUDA binaries for Linux are not included in Bodega's managed installation.

Apple Silicon

On M-series Macs, the GPU and CPU share the same memory pool. Bodega detects Apple Silicon via process.platform === 'darwin' && process.arch === 'arm64' and uses total RAM − 3 GB as the effective VRAM figure.

Approximate memory bandwidth by chip (from Apple's published specs):

Chip	Bandwidth
M4 (base, 16–32 GB)	~120 GB/s
M4 Pro (24–48 GB)	~273 GB/s
M4 Max (64–128 GB)	~546 GB/s

Higher bandwidth means faster token generation. Once a model fits in memory, bandwidth matters more than extra capacity.
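A common way to reason about this is a bandwidth-bound ceiling: if generating each token requires streaming all model weights from memory once, tokens per second cannot exceed bandwidth divided by weight size. The sketch below applies that simplification (an assumption, not a Bodega measurement) to the chips in the table, using the 5.2 GB catalog size for an 8B model at Q4_K_M. It ignores KV cache reads and compute, so real throughput will be lower.

```typescript
// Upper bound on generation speed under the "read all weights once per token" assumption.
function maxTokensPerSecond(bandwidthGBps: number, weightsGB: number): number {
  if (weightsGB <= 0) {
    throw new RangeError("weightsGB must be positive");
  }
  return bandwidthGBps / weightsGB;
}

const WEIGHTS_8B_Q4_GB = 5.2; // from the catalog table on this page

// Bandwidth figures from the chip table above.
const chips: Array<[string, number]> = [
  ["M4", 120],
  ["M4 Pro", 273],
  ["M4 Max", 546],
];

for (const [name, bandwidth] of chips) {
  const ceiling = maxTokensPerSecond(bandwidth, WEIGHTS_8B_Q4_GB);
  console.log(`${name}: at most ~${ceiling.toFixed(0)} tok/s for an 8B model`);
}
```

Under this simplification the ceilings come out near 23, 52, and 105 tokens per second, which tracks the bandwidth ratio between the chips rather than their memory capacity.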

When you install llama.cpp through Bodega on macOS, it installs the macos-arm64 binary, which uses Metal acceleration. No configuration required.

If you're using Ollama on Apple Silicon, Ollama handles MLX acceleration automatically. MLX runs roughly 20–30% faster than llama.cpp for inference on Apple Silicon.

Free memory note: macOS manages unified memory dynamically (compression, paging). The free memory reading Bodega sees can fluctuate. The probe uses os.freemem() with a 1 GB kernel reserve as a proxy.

llama.cpp binary selection

When you install llama.cpp through Bodega (first-run onboarding or Settings → Providers → llama.cpp → Install), Bodega picks the right binary for your system automatically:

Platform	Condition	Binary
Windows NVIDIA	Driver reports CUDA ≥ 13.3	CUDA 13
Windows NVIDIA	Driver reports CUDA ≥ 12.4	CUDA 12
Windows NVIDIA	No parseable CUDA version	Vulkan
Windows AMD	Any	Vulkan
Linux NVIDIA	Any	Vulkan
Linux AMD	ROCm installed	ROCm
Linux AMD	ROCm absent	CPU fallback
macOS arm64	Any	Metal (macos-arm64)
CPU-only fallback	Any	CPU

Bodega reads the CUDA version directly from nvidia-smi output (not inferred from driver version numbers). If nvidia-smi times out or is absent, it falls back to driver major version, then to Vulkan or CPU. The install is never blocked by a missing nvidia-smi.
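The selection table reads naturally as a decision function. The sketch below is illustrative only: the type names, the pickBinary function, and the binary labels are hypothetical, and it leaves out the driver-major-version fallback described above. The table does not say what happens for a parseable CUDA version below 12.4; falling through to Vulkan in that case is an assumption.

```typescript
type Platform = "windows" | "linux" | "macos-arm64";
type GpuVendor = "nvidia" | "amd" | "none";
type Binary = "cuda13" | "cuda12" | "vulkan" | "rocm" | "metal" | "cpu";

interface CudaVersion {
  major: number;
  minor: number;
}

// True when version v is at least major.minor.
function atLeast(v: CudaVersion, major: number, minor: number): boolean {
  return v.major > major || (v.major === major && v.minor >= minor);
}

function pickBinary(
  platform: Platform,
  vendor: GpuVendor,
  cuda: CudaVersion | null, // null when no CUDA version could be parsed
  rocmInstalled: boolean
): Binary {
  if (platform === "macos-arm64") return "metal";
  if (vendor === "nvidia") {
    if (platform === "linux") return "vulkan"; // no CUDA build for Linux
    if (cuda !== null && atLeast(cuda, 13, 3)) return "cuda13";
    if (cuda !== null && atLeast(cuda, 12, 4)) return "cuda12";
    return "vulkan"; // unparseable version (and, by assumption, anything older)
  }
  if (vendor === "amd") {
    if (platform === "windows") return "vulkan";
    return rocmInstalled ? "rocm" : "cpu";
  }
  return "cpu"; // CPU-only fallback
}
```

Checking the higher CUDA threshold first matters: a driver reporting 13.3 also satisfies the 12.4 condition, so the order of the two checks decides which build is chosen.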

Air-gap mode: the binary install endpoint returns 403. Use llamacpp.binary_path_override in Settings to point to a manually downloaded binary.

Cloud Boost as a no-GPU path

Cloud Boost lets you configure a cloud provider (OpenAI, Anthropic, and others) as a secondary fallback. On low-end hardware, it activates automatically for tasks that need it.

To configure:

Go to Settings → Cloud Boost
Enable the toggle
Choose a cloud provider
Enter your API key
Set optional daily and monthly budget limits

Auto-escalation triggers when either condition holds. Trigger B requires both of its parts to be true:

(A) The local model's QEL verification score falls below 50 for two consecutive iterations in the same session, OR
(B) Hardware tier is minimal (< 6 GB VRAM) or budget (6–10 GB VRAM) AND the current task is classified as reasoning-heavy

Simple conversational messages on minimal hardware do not trigger auto-escalation. The task has to be a planning or reasoning task.
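The trigger logic can be expressed as a small predicate. This is a sketch under stated assumptions: the EscalationInput shape and the shouldAutoEscalate name are hypothetical, and "two consecutive iterations" is interpreted here as the two most recent scores in the session.

```typescript
type RoutingTier = "minimal" | "budget" | "mid" | "high" | "prosumer";

interface EscalationInput {
  qelScores: number[];       // QEL verification scores this session, oldest first
  tier: RoutingTier;         // hardware routing tier
  isReasoningHeavy: boolean; // task classified as planning or reasoning
}

const QEL_THRESHOLD = 50;

function shouldAutoEscalate(input: EscalationInput): boolean {
  // Trigger A: the two most recent scores are both below the threshold.
  const lastTwo = input.qelScores.slice(-2);
  const triggerA = lastTwo.length === 2 && lastTwo.every((s) => s < QEL_THRESHOLD);

  // Trigger B: low hardware tier AND a reasoning-heavy task.
  const lowTier = input.tier === "minimal" || input.tier === "budget";
  const triggerB = lowTier && input.isReasoningHeavy;

  return triggerA || triggerB;
}
```

With this shape, a casual chat message on a minimal-tier machine returns false unless the QEL scores have already dropped twice in a row.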

You can also toggle Cloud Boost manually in the chat input area.

Cost tracking: gear icon → Usage Dashboard

Air-gap mode: Cloud Boost is fully blocked when air-gap is enabled (Settings → Privacy & Safety → Air-Gap).

Free VRAM detection limits

Bodega primarily reads free VRAM from the GPU driver's memoryFree field via systeminformation. This works on most Windows NVIDIA and Linux AMD (ROCm) systems. On some Windows/driver combos that field comes back null even on an NVIDIA GPU - Bodega falls back to querying nvidia-smi directly for a free-VRAM reading (total VRAM is never overridden by this fallback, only free).

On Apple Silicon and some Windows AMD configurations, free VRAM still isn't reported by either source - Bodega uses total VRAM as the effective figure.

When free VRAM isn't available from any source, model fit scoring uses total VRAM. This means the Discover tab may show models as fitting even when another model is already resident. The requiresEviction indicator on a model card means it fits in your total VRAM but would require the current model to be unloaded first.
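The fallback order and the requiresEviction indicator can be sketched as follows. The VramReadings shape and both function names are hypothetical; only the order of sources (driver memoryFree, then nvidia-smi, then total VRAM) and the meaning of requiresEviction come from this section.

```typescript
interface VramReadings {
  totalGB: number;
  driverFreeGB: number | null;    // memoryFree via systeminformation, if reported
  nvidiaSmiFreeGB: number | null; // nvidia-smi fallback, if available
}

type FreeSource = "driver" | "nvidia-smi" | "total";

// Pick the first available free-VRAM reading; fall back to total VRAM.
function resolveFreeVramGB(r: VramReadings): { freeGB: number; source: FreeSource } {
  if (r.driverFreeGB !== null) return { freeGB: r.driverFreeGB, source: "driver" };
  if (r.nvidiaSmiFreeGB !== null) return { freeGB: r.nvidiaSmiFreeGB, source: "nvidia-smi" };
  return { freeGB: r.totalGB, source: "total" };
}

interface FitResult {
  fits: boolean;
  requiresEviction: boolean;
}

// A model "requires eviction" when it fits in total VRAM but not in what is free now.
function modelFit(requiredGB: number, r: VramReadings): FitResult {
  const { freeGB } = resolveFreeVramGB(r);
  if (requiredGB <= freeGB) return { fits: true, requiresEviction: false };
  if (requiredGB <= r.totalGB) return { fits: true, requiresEviction: true };
  return { fits: false, requiresEviction: false };
}
```

When neither source reports free VRAM, the free figure equals the total, so requiresEviction can never be true. That is the behavior described above, where the Discover tab may show a model as fitting even though another model is resident.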

Keyboard shortcuts

Keys	Action
Gear icon → Help → Hardware	Open Hardware section with Your System card

This page mirrors the in-app docs hub for app version 1.0.0-beta.32.1.