engineering · performance · BYOLLM · local-first

KV cache: how Bodega One gets 40-70% reuse on every LLM session

Bodega One · 5 min read

Every time you send a message to an LLM, it recomputes your entire context from scratch. Every token in your system prompt, every tool description, every rule you've defined gets re-tokenized and re-processed. For cloud APIs, you pay for it. For local models running on your GPU, you eat the latency.

Bodega One's agentic loop was built around three specific engineering decisions that avoid most of this waste. The result is 40-70% KV cache reuse on a typical session. Here's what we did and why it works.

What KV cache actually is

When an LLM processes tokens, it computes a set of "keys" and "values" for each one. These K/V pairs represent the attention state for that token. If the input on the next request is identical up to a certain point in the sequence, those pairs can be reused instead of recomputed.

Cloud providers like Anthropic and OpenAI cache this server-side. If your system prompt is identical to the previous request, you're billed at a lower rate for the cached portion. Local models running in Ollama keep KV pairs in GPU VRAM. When the prefix matches, follow-up messages in the same session are noticeably faster.

The catch: a change anywhere in the sequence invalidates the cache from that point onward. Change a single token in your system prompt, reorder your tools, or compact history in a way that touches the beginning of the context, and every cached token after the change gets recomputed.
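The prefix rule can be sketched as a simple common-prefix check. This is an illustration of the mechanism, not Bodega One's actual code:

```python
def cached_prefix_length(prev_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the shared prefix — the only part a KV cache can reuse."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

prev = [1, 2, 3, 4, 5]
cached_prefix_length(prev, [1, 2, 3, 4, 5, 6])  # 5 — full reuse, only the new token is computed
cached_prefix_length(prev, [1, 9, 3, 4, 5])     # 1 — one early change, reuse stops there
```

The asymmetry is the whole story: an edit at the end of the sequence costs almost nothing, while an edit near the front forces nearly everything to be recomputed.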

The three strategies

CACHE-01: Static system prompt

The system prompt is a constant string. Same content, same order, same whitespace, on every call, every session. Every message uses the same two-message pattern: static system message followed by a dynamic per-turn user message.

Cloud providers cache the system prompt after the first request. Every subsequent turn in the same session is only billed for the new tokens you added. For Ollama, the model keeps those KV pairs in VRAM and doesn't recompute them.
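In code, the two-message pattern looks roughly like this. The names (`SYSTEM_PROMPT`, `build_messages`) are illustrative, a minimal sketch rather than Bodega One's implementation:

```python
# Byte-for-byte constant across every call and every session.
SYSTEM_PROMPT = "You are a coding agent. <tools, rules, instructions...>"

def build_messages(user_turn: str) -> list[dict]:
    """Static system message first, dynamic user turn second.
    Because the system message never changes, the cached prefix survives every turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # cached after the first request
        {"role": "user", "content": user_turn},        # only this part is new tokens
    ]
```

The discipline is in what the sketch doesn't do: no timestamps, no per-session IDs, no interpolated state in the system message. Anything dynamic belongs in the user turn, after the cacheable prefix.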

CACHE-02: Deterministic tool ordering

All tools are registered in alphabetical order and appear in the system prompt in that same sorted order, every time. This matters because a single reordered tool invalidates the cache for every token that follows it.

Sorting by name means the cache key is stable regardless of which tools were recently added or updated. A new tool gets inserted at the right alphabetical position. The prefix above it stays intact.
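The registration step reduces to a single sort on the tool name. A minimal sketch, with hypothetical tool names:

```python
def register_tools(tools: list[dict]) -> list[dict]:
    """Sort tools by name so the serialized tool list is byte-identical
    across sessions, regardless of registration order."""
    return sorted(tools, key=lambda t: t["name"])

tools = register_tools([
    {"name": "write_file"},
    {"name": "bash"},
    {"name": "read_file"},
])
# Always emitted as: bash, read_file, write_file.
# Adding a new tool invalidates the cache only from its alphabetical
# position onward; everything sorted before it keeps its cached prefix.
```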

CACHE-03: Cache-safe compaction

When the context window approaches its limit, the loop compacts conversation history. The key decision is where to compact from: we trim from the end of history, not the beginning. The static system prompt block at the top stays untouched.

Compacting from the tail preserves the cached prefix. Cloud providers see the same system prompt they've already cached. Compacting from the front would destroy the prefix and cold-start the cache on the next request.
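The tail-trim described above can be sketched as a loop that pops messages until the context fits. The function names and the injected `count_tokens` callable are assumptions for illustration, not Bodega One's API:

```python
def compact(system_block: str, history: list[dict], token_budget: int,
            count_tokens) -> tuple[str, list[dict]]:
    """Trim messages from the end of history until the context fits the budget.
    The system block and the earlier history — the cached prefix — stay
    byte-identical, so the provider's cache is never invalidated."""
    kept = list(history)
    while kept and count_tokens(system_block, kept) > token_budget:
        kept.pop()  # trim from the tail, never the front
    return system_block, kept
```

The inverse strategy, dropping the oldest messages first, keeps more recent context but rewrites the sequence right after the system prompt, which is exactly the cold-start the post warns about.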

What 40-70% reuse means in practice

The first message in a session always cold-starts the context. Every token gets computed. From the second message onward, the caching kicks in.

For cloud APIs: if a session would cost $0.10 in raw tokens, it costs closer to $0.04-0.06 after caching reduces the billed tokens on each turn. Across a day of active coding, that compounds.
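The arithmetic behind that range is straightforward. A sketch, assuming cached tokens are billed at 10% of the raw rate (a common cloud discount; the exact figure varies by provider):

```python
def effective_cost(raw_cost: float, cache_reuse: float,
                   cached_discount: float = 0.1) -> float:
    """Blended cost when a fraction of tokens hits the cache.
    cached_discount is the billed fraction for cached tokens (illustrative)."""
    fresh = raw_cost * (1 - cache_reuse)          # billed at full rate
    cached = raw_cost * cache_reuse * cached_discount  # billed at the discount
    return fresh + cached

effective_cost(0.10, 0.40)  # ~ $0.064 at 40% reuse
effective_cost(0.10, 0.70)  # ~ $0.037 at 70% reuse
```

Plugging in the 40-70% reuse range turns a $0.10 session into roughly $0.04-0.06, which is where the headline numbers come from.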

For local models: the first message loads the context into GPU VRAM. Subsequent messages in the same session skip that work. Response latency drops on turn two and stays lower for the rest of the session.

None of this is magic. It's three engineering decisions applied consistently across every part of the loop that touches context assembly. The savings come from not undoing the work the model already did.

Why this matters for BYOLLM

These optimizations work regardless of which LLM you're running. Anthropic, OpenAI, Groq, Ollama, LM Studio. The three strategies apply across all of them because they're about how context is structured, not which model processes it.

If you're on a local model, you get faster follow-up responses. If you're on a cloud API, you get a lower token bill. Bring your own LLM; the cache savings come with it.


Questions about how the agentic loop handles context on your hardware or provider? Join us on Discord. And if you're picking a local provider, see all 15+ supported providers.
