Start here
Hardware & System Requirements
Bodega One runs entirely on your hardware - no cloud required. This section covers what Bodega detects at startup, how it uses that data to recommend models and cap context, and what hardware gets you which results.
Minimum and recommended specs
Minimum - the app runs, CPU-only inference with small models:
| OS | Windows 10+, macOS 12+, Ubuntu 22.04+ |
| RAM | 8 GB |
| Storage | 10 GB free |
| GPU | None required (CPU-only mode works with 1–4B models) |
| Internet | Not required |
Recommended - comfortable agentic use with mid-size models:
| RAM | 16–32 GB |
| GPU | NVIDIA with 8+ GB VRAM (RTX 3060 or better; RTX 5070+ ideal) |
| Storage | 50 GB free |
If you have no discrete GPU, Bodega detects CPU-only mode and caps context based on system RAM. You'll see CPU mode - context capped for sustainable inference in the AI panels. 1–4B models are usable; anything larger will be slow.
What Bodega detects at startup
On every launch, Bodega probes your GPU VRAM using the systeminformation library. The result feeds three things:
- Model recommendations - the Discover tab's Fits My GPU filter and first-run model cards
- Context window ceiling - the max tokens Bodega will send to a local model (written to
hardware.local_context_ceiling) - Cloud Boost escalation - whether your hardware tier triggers automatic escalation on reasoning-heavy tasks
The detection runs once at startup. VRAM capacity is cached for the session. Free VRAM refreshes every 3 seconds so that a model already loaded is correctly accounted for.
Apple Silicon: Bodega uses total system RAM minus 3 GB as the effective VRAM pool. This matches how unified memory actually works - your GPU and CPU share the same physical pool. The onboarding badge shows your chip name and memory figure.
No GPU: vramGB is reported as 0. Bodega sets isCpuOnly = true and switches to RAM-based context ceilings.
Multi-GPU: Bodega uses the largest single GPU's VRAM. Ollama loads models onto one GPU, not split across cards.
Your System card
A live hardware readout appears at the top of the Hardware section in the Help panel.
- Open the Help panel: gear icon (top right) → Help
- Select Hardware in the left rail
- The Your System card at the top shows: CPU model, total RAM, GPU model, VRAM, and a one-line capability summary
The summary uses these thresholds:
| VRAM | Summary |
|---|---|
| 40 GB+ | Can run 70B models comfortably |
| 20–40 GB | Can run 32B models comfortably |
| 14–20 GB | Can run 14B models comfortably |
| 7–14 GB | Can run 7–8B models comfortably |
| Under 7 GB | Limited VRAM - 7B models with short context |
| No GPU | CPU inference only (7–8B) |
The card fetches from the backend on mount. If the backend isn't running when you open it, the card won't appear - no error is shown.
Two tier systems (and why they differ)
Bodega uses two separate hardware classification systems. They serve different purposes and have different thresholds - do not confuse them.
Discover tab tiers (used for model recommendations and the Fits My GPU filter):
| Tier | VRAM |
|---|---|
| tiny | ≤ 4 GB |
| small | 4–8 GB |
| medium | 8–16 GB |
| large | 16–24 GB |
| xl | 24+ GB |
Routing tiers (used by the agentic loop for model selection and Cloud Boost escalation decisions):
| Tier | VRAM | Largest model at Q4_K_M |
|---|---|---|
| minimal | < 6 GB | up to 3B |
| budget | 6–10 GB | up to 7B |
| mid | 10–18 GB | up to 13B |
| high | 18–32 GB | up to 30B |
| prosumer | 32+ GB | 70B+ |
The routing tiers also expose two capability flags: canRunSmartTier (13B+ accessible, requires 10 GB) and canRunCodeTier (code-specialized 7B models, requires 6 GB).
VRAM sizing: the Q4_K_M rule of thumb
For Q4_K_M quantization - the default in Bodega's managed llama.cpp mode - model weight size in VRAM is roughly:
VRAM (GB) ≈ parameters (B) × 0.55 + KV cache + ~1 GB overhead
Actual sizes from Bodega's catalog (Q4_K_M):
| Model size | VRAM (weights only) |
|---|---|
| 1.5B | 1.1 GB |
| 3B | 1.9 GB |
| 4B | 2.5 GB |
| 7B | 4.7 GB |
| 8B | 5.2 GB |
| 14B | 9.0 GB |
| 24B (MoE) | 14.3 GB |
| 30B MoE | 18.6 GB |
| 32B | 19.9 GB |
| 70B | 42.5 GB |
KV cache is the hidden cost. An 8B model's KV cache grows from about 1.2 GB at 8K tokens to ~20 GB at 128K tokens. Context length is where VRAM gets consumed faster than the model weights table suggests. Bodega's context ceilings are set conservatively to prevent thrashing during agentic runs.
Context window ceilings
After VRAM detection, Bodega computes a context ceiling and writes it to hardware.local_context_ceiling. This caps how many tokens the agentic loop sends to local models.
GPU path (VRAM-based):
| Effective VRAM | Context ceiling |
|---|---|
| ≤ 4 GB | 8,192 tokens |
| 4–8 GB | 16,384 tokens |
| 8–24 GB | 32,768 tokens |
| 24+ GB | 65,536 tokens |
CPU-only path (RAM-based, no discrete GPU):
| System RAM | Context ceiling |
|---|---|
| ≤ 8 GB | 4,096 tokens |
| 8–16 GB | 8,192 tokens |
| 16–32 GB | 16,384 tokens |
| 32+ GB | 32,768 tokens |
The ceiling calculation uses max(free VRAM, total VRAM − 4 GB) rather than free VRAM alone. This prevents a model that's already loaded from causing the probe to misclassify your card. For example: a 32 GB GPU with a 26B model resident still probes as a 28 GB card for ceiling purposes, not a 4 GB card.
To raise the ceiling manually: Settings → (search) context_window_cap → set llm.context_window_cap to a positive integer. Set it to 0 to return to automatic. The budget bar in each AI panel reflects the active ceiling.
Reducing KV cache memory with Ollama
If you're using Ollama as your local provider and running into VRAM pressure at longer contexts, you can cut KV cache size by roughly 50% by setting an environment variable before launching Ollama:
export OLLAMA_KV_CACHE_TYPE=q8_0
On Windows, set this in System Properties → Environment Variables before starting Ollama.
This is an Ollama setting, not a Bodega setting - Bodega has no UI control for it. It also has no effect in llama.cpp managed mode (where Bodega controls the server directly).
GPU value picks
LLM inference is memory-bandwidth-limited, not compute-limited. VRAM capacity and bandwidth matter more than GPU core count.
| Pick | GPU | VRAM | Notes |
|---|---|---|---|
| Budget entry | Intel Arc B580 | 12 GB | $249; solid for 7–8B models |
| Best value used | RTX 3090 | 24 GB | $800–950; handles 32B at ~112 tok/s |
| Best value new | RTX 5070 | 12 GB GDDR7 | $550–750; fast bandwidth |
| No compromises | RTX 5090 | 32 GB GDDR7 | $2,000+; 70B models comfortably |
AMD: The RX 7900 XTX has 24 GB VRAM but ROCm software support on Windows is inconsistent. On Linux, AMD gets the ROCm llama.cpp binary. On Windows, AMD gets the Vulkan build (HIP Radeon support is experimental).
NVIDIA on Linux: Bodega ships a Vulkan llama.cpp binary for Linux NVIDIA - not CUDA. CUDA binaries for Linux are not included in Bodega's managed installation.
Apple Silicon
On M-series Macs, the GPU and CPU share the same memory pool. Bodega detects Apple Silicon via process.platform === 'darwin' && process.arch === 'arm64' and uses total RAM − 3 GB as the effective VRAM figure.
Approximate memory bandwidth by chip (from Apple's published specs):
| Chip | Bandwidth |
|---|---|
| M4 (base, 16–32 GB) | ~120 GB/s |
| M4 Pro (24–48 GB) | ~273 GB/s |
| M4 Max (64–128 GB) | ~546 GB/s |
Higher bandwidth means faster token generation - bandwidth scales matter more than raw VRAM here.
When you install llama.cpp through Bodega on macOS, it installs the macos-arm64 binary, which uses Metal acceleration. No configuration required.
If you're using Ollama on Apple Silicon, Ollama handles MLX acceleration automatically. MLX runs roughly 20–30% faster than llama.cpp for inference on Apple Silicon.
Free memory note: macOS manages unified memory dynamically (compression, paging). The free memory reading Bodega sees can fluctuate. The probe uses os.freemem() with a 1 GB kernel reserve as a proxy.
llama.cpp binary selection
When you install llama.cpp through Bodega (first-run onboarding or Settings → Providers → llama.cpp → Install), Bodega picks the right binary for your system automatically:
| Platform | Condition | Binary |
|---|---|---|
| Windows NVIDIA | Driver reports CUDA ≥ 13.3 | CUDA 13 |
| Windows NVIDIA | Driver reports CUDA ≥ 12.4 | CUDA 12 |
| Windows NVIDIA | No parseable CUDA version | Vulkan |
| Windows AMD | Any | Vulkan |
| Linux NVIDIA | Any | Vulkan |
| Linux AMD | ROCm installed | ROCm |
| Linux AMD | ROCm absent | CPU fallback |
| macOS arm64 | Any | Metal (macos-arm64) |
| CPU-only fallback | Any | CPU |
Bodega reads the CUDA version directly from nvidia-smi output (not inferred from driver version numbers). If nvidia-smi times out or is absent, it falls back to driver major version, then to Vulkan or CPU. The install is never blocked by a missing nvidia-smi.
Air-gap mode: the binary install endpoint returns 403. Use llamacpp.binary_path_override in Settings to point to a manually downloaded binary.
Cloud Boost as a no-GPU path
Cloud Boost lets you configure a cloud provider (OpenAI, Anthropic, and others) as a secondary fallback. On low-end hardware, it activates automatically for tasks that need it.
To configure:
- Go to Settings → Cloud Boost
- Enable the toggle
- Choose a cloud provider
- Enter your API key
- Set optional daily and monthly budget limits
Auto-escalation triggers - both conditions must be true for trigger B:
- (A) The local model's QEL verification score falls below 50 for two consecutive iterations in the same session, OR
- (B) Hardware tier is
minimal(< 6 GB VRAM) orbudget(6–10 GB VRAM) AND the current task is classified as reasoning-heavy
Simple conversational messages on minimal hardware do not trigger auto-escalation. The task has to be a planning or reasoning task.
You can also toggle Cloud Boost manually in the chat input area.
Cost tracking: gear icon → Usage Dashboard
Air-gap mode: Cloud Boost is fully blocked when air-gap is enabled (Settings → Privacy & Safety → Air-Gap).
Free VRAM detection limits
Bodega can only read free VRAM on systems where the GPU driver reports the memoryFree field via systeminformation. This works on most Windows NVIDIA and Linux AMD (ROCm) systems.
On Apple Silicon and some Windows AMD configurations, free VRAM is not reported separately - Bodega uses total VRAM as the effective figure.
When free VRAM isn't available, model fit scoring uses total VRAM. This means the Discover tab may show models as fitting even when another model is already resident. The requiresEviction indicator on a model card means it fits in your total VRAM but would require the current model to be unloaded first.
Keyboard shortcuts
| Keys | Action |
|---|---|
| Gear icon → Help → Hardware | Open Hardware section with Your System card |
This page mirrors the in-app docs hub for app version 1.0.0-beta.26.1. Found something unclear or out of date? Tell us on Discord. New here? Download the free beta and follow along.