engineering · performance · BYOLLM · local-first

KV cache: how Bodega One gets 40-70% reuse on every LLM session

Bodega One · 5 min read

Every time you send a message to an LLM, it recomputes your entire context from scratch. Every token in your system prompt, every tool description, every rule you've defined gets re-tokenized and re-processed. For cloud APIs, you pay for it. For local models running on your GPU, you eat the latency.

Bodega One's agentic loop was built around three specific engineering decisions that avoid most of this waste. The result is 40-70% KV cache reuse on a typical session. Here's what we did and why it works.

What KV cache actually is

When an LLM processes tokens, it computes a set of "keys" and "values" for each one. These K/V pairs represent the attention state for that token. If the input on the next request is identical up to a certain point in the sequence, those pairs can be reused instead of recomputed.

Cloud providers like Anthropic and OpenAI cache this server-side. If your system prompt is identical to the previous request, you're billed at a lower rate for the cached portion. Local models running in Ollama keep KV pairs in GPU VRAM. When the prefix matches, follow-up messages in the same session are noticeably faster.

The catch: a change anywhere in the sequence invalidates the cache from that point onward. Change a single token in your system prompt, reorder your tools, or compact history in a way that touches the beginning of the context, and every cached token after the change gets recomputed.
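The prefix rule can be sketched as a simple common-prefix check. This is an illustration of the mechanism, not Bodega One's actual code:

```python
def cached_prefix_length(prev_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the shared prefix — the only part a KV cache can reuse."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

prev = [1, 2, 3, 4, 5]
cached_prefix_length(prev, [1, 2, 3, 4, 5, 6])  # 5 — full reuse, only the new token is computed
cached_prefix_length(prev, [1, 9, 3, 4, 5])     # 1 — one early change, reuse stops there
```

The asymmetry is the whole story: an edit at the end of the sequence costs almost nothing, while an edit near the front forces nearly everything to be recomputed.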

The three strategies

CACHE-01: Static system prompt

The system prompt is a constant string. Same content, same order, same whitespace, on every call, every session. Every message uses the same two-message pattern: static system message followed by a dynamic per-turn user message.

Cloud providers cache the system prompt after the first request. Every subsequent turn in the same session is only billed for the new tokens you added. For Ollama, the model keeps those KV pairs in VRAM and doesn't recompute them.
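In code, the two-message pattern looks roughly like this. The names (`SYSTEM_PROMPT`, `build_messages`) are illustrative, a minimal sketch rather than Bodega One's implementation:

```python
# Byte-for-byte constant across every call and every session.
SYSTEM_PROMPT = "You are a coding agent. <tools, rules, instructions...>"

def build_messages(user_turn: str) -> list[dict]:
    """Static system message first, dynamic user turn second.
    Because the system message never changes, the cached prefix survives every turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # cached after the first request
        {"role": "user", "content": user_turn},        # only this part is new tokens
    ]
```

The discipline is in what the sketch doesn't do: no timestamps, no per-session IDs, no interpolated state in the system message. Anything dynamic belongs in the user turn, after the cacheable prefix.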

CACHE-02: Deterministic tool ordering

All tools are registered in alphabetical order and appear in the system prompt in that same sorted order, every time. This matters because a single reordered tool invalidates the cache for every token that follows it.

Sorting by name means the cache key is stable regardless of which tools were recently added or updated. A new tool gets inserted at the right alphabetical position. The prefix above it stays intact.
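The registration step reduces to a single sort on the tool name. A minimal sketch, with hypothetical tool names:

```python
def register_tools(tools: list[dict]) -> list[dict]:
    """Sort tools by name so the serialized tool list is byte-identical
    across sessions, regardless of registration order."""
    return sorted(tools, key=lambda t: t["name"])

tools = register_tools([
    {"name": "write_file"},
    {"name": "bash"},
    {"name": "read_file"},
])
# Always emitted as: bash, read_file, write_file.
# Adding a new tool invalidates the cache only from its alphabetical
# position onward; everything sorted before it keeps its cached prefix.
```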

CACHE-03: Cache-safe compaction

When the context window approaches its limit, the loop compacts conversation history. The key decision is where to compact from: we trim from the end of history, not the beginning. The static system prompt block at the top stays untouched.

Compacting from the tail preserves the cached prefix. Cloud providers see the same system prompt they've already cached. Compacting from the front would destroy the prefix and cold-start the cache on the next request.
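The tail-trim described above can be sketched as a loop that pops messages until the context fits. The function names and the injected `count_tokens` callable are assumptions for illustration, not Bodega One's API:

```python
def compact(system_block: str, history: list[dict], token_budget: int,
            count_tokens) -> tuple[str, list[dict]]:
    """Trim messages from the end of history until the context fits the budget.
    The system block and the earlier history — the cached prefix — stay
    byte-identical, so the provider's cache is never invalidated."""
    kept = list(history)
    while kept and count_tokens(system_block, kept) > token_budget:
        kept.pop()  # trim from the tail, never the front
    return system_block, kept
```

The inverse strategy, dropping the oldest messages first, keeps more recent context but rewrites the sequence right after the system prompt, which is exactly the cold-start the post warns about.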

What 40-70% reuse means in practice

The first message in a session always cold-starts the context. Every token gets computed. From the second message onward, the caching kicks in.

For cloud APIs: if a session would cost $0.10 in raw tokens, it costs closer to $0.04-0.06 after caching reduces the billed tokens on each turn. Across a day of active coding, that compounds.
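The arithmetic behind that range is straightforward. A sketch, assuming cached tokens are billed at 10% of the raw rate (a common cloud discount; the exact figure varies by provider):

```python
def effective_cost(raw_cost: float, cache_reuse: float,
                   cached_discount: float = 0.1) -> float:
    """Blended cost when a fraction of tokens hits the cache.
    cached_discount is the billed fraction for cached tokens (illustrative)."""
    fresh = raw_cost * (1 - cache_reuse)          # billed at full rate
    cached = raw_cost * cache_reuse * cached_discount  # billed at the discount
    return fresh + cached

effective_cost(0.10, 0.40)  # ~ $0.064 at 40% reuse
effective_cost(0.10, 0.70)  # ~ $0.037 at 70% reuse
```

Plugging in the 40-70% reuse range turns a $0.10 session into roughly $0.04-0.06, which is where the headline numbers come from.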

For local models: the first message loads the context into GPU VRAM. Subsequent messages in the same session skip that work. Response latency drops on turn two and stays lower for the rest of the session.

None of this is magic. It's three engineering decisions applied consistently across every part of the loop that touches context assembly. The savings come from not undoing the work the model already did.

Why this matters for BYOLLM

These optimizations work regardless of which LLM you're running. Anthropic, OpenAI, Groq, Ollama, LM Studio. The three strategies apply across all of them because they're about how context is structured, not which model processes it.

If you're on a local model, you get faster follow-up responses. If you're on a cloud API, you get a lower token bill. Bring your own LLM; the cache savings come with it.


Questions about how the agentic loop handles context on your hardware or provider? Join us on Discord. And if you're picking a local provider, see all 15+ supported providers.
