Quick answer
Context windows are finite and they fill up faster than you expect. Use our free context planner to allocate tokens across system prompt, conversation history, code, and response before you start a task, instead of after you hit the limit.
Context windows: the resource everyone underestimates
Most developers think about VRAM when planning a local LLM setup. The context window is the second constraint, and it bites you mid-task rather than at startup. Running out of context does not always throw a clear error. It truncates silently, causes the model to lose earlier instructions, or produces output that ignores files you passed in.
32K context and 128K context are not interchangeable. On a 32K model, a typical coding session (system prompt, a few back-and-forth turns, two medium-sized files) can fill the window in 10-15 minutes. On a 128K model, the same session has plenty of room. The difference shows up in output quality, not just session length.
How to use the context planner
Open the context planner and pick your model, or type in a custom context window size if yours is not on the list. Then fill in three numbers: how large your system prompt is, how many conversation turns you typically have, and your average message length.
The planner maps out what remains for code and files, and whether your response buffer is going to survive the task. It shows a visual breakdown so the problem is obvious before you start, not halfway through a two-hour agent session.
The four token budget buckets
Every token in your context window belongs to one of four buckets:
- System prompt: Your persona, rules, tool descriptions, and any standing instructions. Typically 500-2,000 tokens for a well-structured system prompt. Larger agent setups with many tool definitions can run 3,000-5,000 tokens.
- Conversation history: Every previous message in the session, both your messages and the model's responses. This is the bucket that grows automatically as you work. It is the main cause of hitting context limits on long sessions.
- Code and files: The files you are asking the model to read or edit. A single 500-line TypeScript file is roughly 2,000-3,000 tokens. A handful of files can dominate the context budget.
- Response budget: Room you reserve for the model's output. Do not cut this too thin. Agents generating code need 2,000-4,000 tokens minimum. Cut the response budget and you get truncated output that silently drops the last part of a function.
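The four buckets above can be sketched as a simple budget check. This is a minimal illustration; the bucket sizes below are mid-range assumptions drawn from the ranges above, not the planner's actual defaults.

```python
def plan_budget(context_window: int, system_prompt: int,
                history: int, code_files: int, response: int) -> dict:
    """Return each bucket plus the remaining headroom (negative = over budget)."""
    used = system_prompt + history + code_files + response
    return {
        "system_prompt": system_prompt,
        "history": history,
        "code_files": code_files,
        "response": response,
        "headroom": context_window - used,
    }

# Example: a 32K window with mid-range estimates for each bucket.
budget = plan_budget(
    context_window=32_000,
    system_prompt=1_500,   # well-structured system prompt
    history=10_000,        # several turns of conversation
    code_files=5_000,      # two medium files
    response=4_000,        # room for generated code
)
print(budget["headroom"])  # 11500 tokens left
```

Note that more than half of this example window is already spoken for before the model reads a single new file.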
Why conversation history is the silent killer
Each conversation turn adds two entries to the context: your message and the model's response. On a long coding session, this compounds quickly. After 10 turns of average-length messages, conversation history alone can consume 8,000-15,000 tokens. On a 32K model, that is 25-47% of your total budget used on messages, before you add any files.
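The compounding above is simple arithmetic. The per-message sizes here are illustrative assumptions (roughly 300-token questions and 800-token replies), chosen to land inside the 8,000-15,000 range quoted above.

```python
def history_tokens(turns: int, user_msg: int, model_reply: int) -> int:
    """Each turn adds one user message and one model reply to the window."""
    return turns * (user_msg + model_reply)

# 10 turns at ~300-token questions and ~800-token replies:
tokens = history_tokens(10, 300, 800)
print(tokens)                  # 11000
print(tokens / 32_000)         # ~0.34, i.e. about a third of a 32K window
```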
Bodega One handles this with four memory layers. Instead of leaving every fact and decision in the raw conversation window, the app pulls useful context into persistent memory and retrieves it on demand. Your session does not need to carry the weight of everything you did two hours ago just to write a new function now.
Matching context expectations to model size
Different model sizes come with different typical context windows:
| Model range | Typical context | Best for |
|---|---|---|
| 7B-8B models | 8K-32K | Focused, single-file tasks |
| 13B-14B models | 32K-128K | Multi-file editing, longer sessions |
| 32B+ models | 128K+ | Full project context, agentic tasks |
A useful planning rule: target 80% utilization as your ceiling, not 100%. Leave the last 20% as headroom for unexpectedly long model responses, tool call outputs, and mid-task file additions. Hitting 100% mid-task is worse than working with a smaller effective window from the start.
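The 80% rule reduces to planning against an effective window rather than the raw one. A minimal sketch, with the ceiling as a tunable assumption:

```python
def effective_window(context_window: int, ceiling: float = 0.8) -> int:
    """Plan against this number, not the raw window size."""
    return int(context_window * ceiling)

def fits(planned_tokens: int, context_window: int) -> bool:
    """True when the planned usage stays under the utilization ceiling."""
    return planned_tokens <= effective_window(context_window)

print(effective_window(32_000))  # 25600 usable tokens on a 32K model
print(fits(27_000, 32_000))      # False: over the 80% ceiling
```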
Practical strategies to stay within budget
When context is tight, these four approaches have the most impact:
- Per-file injection: Do not open every related file at the start of a task. Inject files only when the model needs them. This keeps the code bucket lean during early planning turns.
- Clear history at boundaries: When you finish one sub-task and move to the next, clear the conversation history. The model does not need to remember the debugging session from an hour ago to write a new function now.
- Use a memory system: Store project facts, key decisions, and architecture notes in persistent memory outside the context window. The model retrieves what it needs rather than holding everything in context.
- Check before calling: Bodega One's Context Inspector shows live token usage before each LLM call. You see the breakdown (system prompt, conversation, code, response buffer) before you send. No surprises mid-task.
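Per-file injection can be approximated with a pre-injection check: estimate the file's token cost, then skip the add when it would push you over the ceiling. The 4-characters-per-token ratio is a rough heuristic for English text and code, not an exact tokenizer, and this is illustrative code, not Bodega One's actual API.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text and code."""
    return len(text) // 4

def try_inject(context_used: int, window: int, file_text: str,
               ceiling: float = 0.8) -> bool:
    """Inject the file only if it fits under the utilization ceiling."""
    cost = estimate_tokens(file_text)
    return context_used + cost <= int(window * ceiling)

# A ~500-line file at ~20 chars/line is ~10,000 chars, ~2,500 tokens:
file_text = "x" * 10_000
print(estimate_tokens(file_text))            # 2500
print(try_inject(24_000, 32_000, file_text)) # False: would blow the 80% ceiling
```

In practice you would replace the heuristic with your model's real tokenizer, but the shape of the check stays the same.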
Next steps
Try the context planner to map out your token budget for a real task. Then check our local LLM rankings to see which models offer the largest context windows at each hardware tier. For a deeper look at what is possible on local hardware in 2026, read our local LLMs roundup.
Ready to own your tools?
Beta opens May 2026. Complete 14 days and earn a $30 promo code.