Build Log

The messy, honest version of development.

Not polished release notes. Real decisions, wrong turns, and why things took longer than expected. Updated as we build.

Build log entries

  1. Bodega One v1.0.0-beta.31.8 and beta.31.8.1 - out now

    beta.31.8 adds an optional quality gate for Bodega Mixture, makes startup and long sessions faster, and hardens air-gap mode and the approval gate. A same-day hotfix, beta.31.8.1, fixes the in-app Preview occasionally opening Bodega's own local model server instead of your project. Bring your own key; it all runs on your machine.

    Quality gate for Mixture reference drafts

    With Bodega Mixture on, a weak or off-topic reference draft could still be handed to the aggregator and drag down the final answer. A new opt-in setting (mixture.qel_gate) screens each reference draft before synthesis and drops the ones that fail, so the aggregator only works from drafts worth keeping. Off by default, like Mixture itself.
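
    As a rough illustration of the gate, here is a minimal Python sketch. Only the mixture.qel_gate setting name comes from the note above; the Draft type, the screen_drafts function, and the keyword-overlap check are hypothetical stand-ins for whatever scoring Bodega actually applies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Draft:
    model: str  # which reference model produced the draft (hypothetical field)
    text: str


def screen_drafts(
    drafts: List[Draft],
    passes_gate: Callable[[Draft], bool],
    gate_enabled: bool = False,  # mirrors mixture.qel_gate being off by default
) -> List[Draft]:
    """Return the drafts the aggregator is allowed to see."""
    if not gate_enabled:
        return list(drafts)
    return [d for d in drafts if passes_gate(d)]


def on_topic(question: str) -> Callable[[Draft], bool]:
    """Toy gate (an assumption): keep a draft that shares a word with the question."""
    wanted = set(question.lower().split())
    return lambda draft: bool(wanted & set(draft.text.lower().split()))
```

    With the gate off every draft passes through untouched; with it on, only drafts that clear the check reach synthesis.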

    Faster startup and snappier sessions

    Project file listings are cached instead of re-scanned on every request, session messages load through a new database index, and non-essential startup checks are deferred until after the app is up. Editor integrations also get a second "ready" signal once MCP servers and skills finish loading, so an external client can wait for the complete toolset instead of connecting into a partial one.

    Security hardening

    Air-gap mode now blocks more ways to reach the network from the shell: download and scripting tools that can fetch remote content (certutil, bitsadmin, perl, ruby, php) are blocked while air-gapped, and the check resists quoting tricks that previously slipped past it. If a configured hook rewrites a tool call into something dangerous, the rewritten command is re-checked with a fresh security context and always requires your approval. And two responses can no longer run against the same session at once, nor can a streaming session be deleted out from under itself.
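
    One way a check can resist quoting tricks is to let a shell lexer strip the quotes before comparing names. The sketch below is an illustration under that assumption, not Bodega's actual check: the blocked names come from the note above, while the function names and the separator list are hypothetical. It only looks at words in command position and does not handle Windows backslash paths or wrappers such as env.

```python
import shlex
from pathlib import PurePosixPath

# Tool names taken from the release note above.
BLOCKED_WHEN_AIR_GAPPED = {"certutil", "bitsadmin", "perl", "ruby", "php"}
# Tokens after which the next word starts a new command (an assumption of this sketch).
COMMAND_SEPARATORS = {"|", "||", "&&", ";", "&", "("}


def command_names(command: str) -> set:
    """Return normalized program names found in command position."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    names = set()
    expect_command = True
    for token in lexer:  # raises ValueError on an unterminated quote
        if token in COMMAND_SEPARATORS:
            expect_command = True
        elif expect_command:
            # Quotes are already removed by the lexer, so pe"r"l arrives as perl.
            name = PurePosixPath(token).name.lower()
            names.add(name[:-4] if name.endswith(".exe") else name)
            expect_command = False
    return names


def is_blocked(command: str) -> bool:
    """Fail closed: a command that cannot be parsed is treated as blocked."""
    try:
        return bool(command_names(command) & BLOCKED_WHEN_AIR_GAPPED)
    except ValueError:
        return True
```

    Treating an unparseable command as blocked is a deliberate choice in this sketch: a check that guesses on malformed input is the kind that quoting tricks get past.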

    The hotfix: Preview and Bodega's own ports

    Asking the assistant to open the preview could land on the local model server's built-in chat page (a "Hello there" screen on localhost:8080) instead of your project, because the dev-server port scan counted Bodega's own servers as dev servers. beta.31.8.1 makes the scan skip the ports Bodega itself listens on, corrects the assistant when it aims the preview at one of those addresses, and blocks direct preview navigation to them with a pointer to the right way in.
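
    The two halves of the fix (skip Bodega's own ports during the scan, refuse preview navigation to them) can be sketched as follows. This is an assumption-laden illustration: port 8080 comes from the note above, but the function names and the idea of a single own_ports set are hypothetical.

```python
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def dev_server_candidates(listening_ports, own_ports):
    """Ports that may be a project's dev server: everything listening minus our own."""
    return sorted(set(listening_ports) - set(own_ports))


def is_own_address(url: str, own_ports) -> bool:
    """True when a preview URL points at one of the app's own local servers."""
    parts = urlsplit(url)
    port = parts.port or {"http": 80, "https": 443}.get(parts.scheme)
    return parts.hostname in LOOPBACK_HOSTS and port in own_ports
```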

    Fixes

    Cloud Boost reasoning settings now consistently come from the cloud model that is actually running instead of the local base model. A "build X" request in an empty folder no longer auto-activates helper skills written for modifying an existing codebase. And the contract extractor that verifies a finished task no longer mistakes HTTP verbs for function names or trips on trailing punctuation in route paths, so verification scores fewer false misses.

    Still an open beta. Full notes are on the changelog page. Update through the app, or grab the latest installer from the download page.

  2. Bodega One v1.0.0-beta.31.7 - out now

    beta.31.7 adds Claude Sonnet 5, brings Claude Fable 5 back to the model list, and makes repeated turns on local models faster. It also fixes a Cloud Boost stall on Claude models and a handful of long-task and local-model bugs. Bring your own key; it all runs on your machine.

    Sonnet 5 and Fable 5

    Claude Sonnet 5 is a new Anthropic flagship with a 1M-token context window. Claude Fable 5 is back after being temporarily withdrawn from general availability: Anthropic restored it on July 1 once US export controls were lifted. Both are available for the Anthropic provider, OpenRouter, and Cloud Boost. Since Bodega is bring-your-own-model, you point it at your own key and the model list picks them up.

    Faster repeated turns on local models

    llama.cpp and Ollama now reuse the prompt cache and keep the model warm between turns, so follow-up messages in a conversation start responding sooner instead of paying the full warm-up cost every turn.

    Fixes

    Cloud Boost no longer stalls on long tasks with Claude models, where a boosted Claude request driven off a local base could be mis-handled as a local model and return nothing. Capable models now keep their full step budget on long tasks instead of being downgraded to a short limit after a few recoverable tool errors. A "build X" request no longer trips a refactoring helper that kept writing extra verification files after the work was done. Opening the preview with a text-only local model falls back gracefully instead of erroring, switching local models is serialized so rapid swaps do not race, the app still boots if its license file cannot be read, and the Mixture progress card always resolves.

    Still an open beta. Full notes are on the changelog page. Update through the app, or grab the latest installer from the download page.

  3. Bodega One v1.0.0-beta.31.6 - out now

    beta.31.6 adds Bodega Mixture, an optional Mixture-of-Agents engine, and lets the in-app Preview serve plain static sites with no dev server. It also fixes a five-minute cutoff on long Cloud Boost builds and a local-model context-window clamp. All of it runs on your machine.

    Bodega Mixture (Mixture-of-Agents)

    A new optional engine that runs several reference models in parallel on your turn, then has one aggregator model synthesize the single reply you see. Turn it on in Settings under Models, then pick "Mixture" from the model picker. The reference models run with no tools and the conversation text only; the aggregator owns the tools and writes the answer. Local reference models are near-free, so an all-local mixture costs about one cloud call, and an optional QEL quality gate can verify the synthesized output. It is off by default, honors air-gap (cloud references are dropped when you are offline), and falls back to a normal single-model turn when fewer than two references are available.
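
    The control flow described above (parallel references, one aggregator, air-gap dropping cloud references, fallback below two references) might look roughly like this. It is a sketch only; the Reference type and every function name are hypothetical, and the real engine also handles tools and streaming, which are left out here.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Reference:
    name: str
    is_cloud: bool
    generate: Callable[[str], str]  # takes the conversation text, returns a draft


def mixture_turn(
    prompt: str,
    references: List[Reference],
    aggregate: Callable[[str, List[str]], str],
    single_model: Callable[[str], str],
    air_gapped: bool = False,
) -> str:
    # Under air-gap, cloud references are dropped before anything runs.
    usable = [r for r in references if not (air_gapped and r.is_cloud)]
    # Fewer than two references left: behave like a normal single-model turn.
    if len(usable) < 2:
        return single_model(prompt)
    # Run the reference models in parallel; map() keeps the drafts in order.
    with ThreadPoolExecutor(max_workers=len(usable)) as pool:
        drafts = list(pool.map(lambda r: r.generate(prompt), usable))
    return aggregate(prompt, drafts)
```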

    Preview static sites and games, no dev server

    The in-app Preview could only attach to a running dev server like Vite or Next, so a plain project (a bare index.html with some script files and no build step, like a generated HTML5 game) had nothing to show. Now, when no dev server is running and the project has an index.html at its root, Bodega serves the folder locally and opens it in the Preview panel. Ask the agent to "open the preview," or use the "Preview this folder" button in the panel. It is served on loopback only, so it stays air-gap safe, and dotfiles are withheld.

    Fixes

    Long Cloud Boost builds no longer stop early at a five-minute mark, where a fixed time limit on the agentic loop could cut off a capable cloud model mid-build. Local GGUF models with no embedded context-length metadata are no longer silently clamped to a 4,096-token window. And Mixture reference models can now be added from the Settings UI, which had been saving the list in a format the backend rejected.

    Still an open beta. Full notes are on the changelog page. Update through the app, or grab the latest installer from the download page.

  4. Bodega One v1.0.0-beta.31.5 - out now

    beta.31.5 is a security and reliability rollup from two app-wide bug-hunt sweeps. No new features. It closes a set of approval-gate, air-gap, and DNS-rebinding gaps, prevents a few silent data-loss paths, and fixes a batch of Cloud Boost, provider, model-management, and streaming bugs. All of it runs on your machine, and air-gap is stricter than it was.

    Security hardening

    A few gaps worth naming plainly. MCP tools now go through the per-tool approval prompt in Ask and Plan modes, the same as the built-in write tools, so a connected server cannot open a pull request or send a message without asking first. Air-gap mode now also covers voice transcription, inline code completion, and the goal reviewer, which could each reach a remote endpoint before. Web fetch is hardened against DNS rebinding (it validates every address a hostname resolves to), and the local API and WebSocket now reject any request whose Host header is not the loopback address, closing a path that could let a web page in your browser reach the local server. Prompts from a committed project config are sanitized before use, and credential scanning now runs before long output is truncated so a secret split across the boundary cannot slip through.
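
    Two of those checks lend themselves to a short sketch: validating every address a hostname resolves to, and rejecting non-loopback Host headers. The code below is an illustration under assumptions, not the shipped logic. The function names are hypothetical, the address list would come from a resolver call such as socket.getaddrinfo (omitted so the sketch runs offline), and accepting the literal name localhost is a choice made here.

```python
import ipaddress
from urllib.parse import urlsplit


def all_addresses_public(addresses) -> bool:
    """Allow a fetch only if every resolved address is globally routable.

    Checking all of them matters: a rebinding attack can hide one private
    address among otherwise harmless answers.
    """
    if not addresses:
        return False
    return all(ipaddress.ip_address(a).is_global for a in addresses)


def host_header_is_loopback(host_header: str) -> bool:
    """True for Host values such as 127.0.0.1:8080, [::1]:3000 or localhost."""
    try:
        hostname = urlsplit("//" + host_header).hostname
    except ValueError:
        return False
    if hostname is None:
        return False
    if hostname == "localhost":
        return True
    try:
        return ipaddress.ip_address(hostname).is_loopback
    except ValueError:
        return False  # a DNS name other than localhost
```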

    Cloud Boost stops dropping runs

    When Boost (a cloud model like Opus) drove a long build, history trimming could leave a tool result without its matching tool call, which the cloud API rejected and stopped the run. The history sent to a cloud model is now always structurally valid. The context meter also reads the model that is actually running instead of staying on your small local window, a run that stops because your provider is out of credit now surfaces the real error instead of a generic "iteration limit," and the max-iterations setting is honored on capable models (a stale internal cap was stopping them at 25 steps).
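
    To make the history problem concrete, here is a minimal sketch of one repair step: dropping tool results whose matching call was trimmed away. The message shape (role, tool_calls, tool_call_id) follows the common OpenAI-style chat format and is an assumption, as is the function name; the reverse case, a call with no result, is not covered.

```python
def drop_orphan_tool_results(messages):
    """Remove tool results that no earlier assistant tool call accounts for."""
    seen_call_ids = set()
    kept = []
    for message in messages:
        if message.get("role") == "assistant":
            seen_call_ids.update(call["id"] for call in message.get("tool_calls", []))
        if message.get("role") == "tool" and message.get("tool_call_id") not in seen_call_ids:
            continue  # its call was trimmed; a cloud API would reject this message
        kept.append(message)
    return kept
```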

    Local models behave

    Drop a GGUF straight into your llama.cpp models folder and it shows up now: Bodega scans the folder on startup and when you open the model list, instead of only tracking files added through the Model Hub. Truncated downloads are caught and resume instead of failing later with a confusing error, switching models mid-stream actually loads the new one, a model deleted from disk is flagged rather than failing on load, and a custom --ctx-size is read as the real window. And undo on an agent-created file now deletes it and prunes the empty folders it made, instead of leaving a 0-byte file behind.

    Still an open beta. This release is mostly us fixing things we shipped, named plainly, with the security gaps called out rather than buried. Full notes are on the changelog page. Update through the app, or grab the latest installer from the download page.

  5. Bodega One v1.0.0-beta.31.4 - out now

    beta.31.4 is a local-model and provider reliability release. If you run a custom OpenAI-compatible server like vLLM or llama-swap, or a smaller local model that fumbles a tool call now and then, this is the one to grab. All of it runs on your machine.

    Custom servers stop guessing wrong

    On a custom OpenAI-compatible endpoint that serves both chat and embedding models, "Auto" could land on the embedder. The filter that keeps embedding-only models out of auto-selection missed size-suffixed names like Qwen3-Embedding-0.6B and text-embedding-3-*, so Bodega picked a model that cannot chat. It recognizes those names now and leaves real chat and coder models alone. And a new per-model control (Advanced Sampling panel, "Tool calling": Auto / Force on / Force off) lets you send native OpenAI tool definitions to a capable model that Bodega did not auto-detect as tool-capable, for a model behind vLLM or llama-swap exposed under a custom name.
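
    A name filter of this kind could be sketched as a single pattern. The regular expression and function names below are hypothetical and only illustrate the idea; the two model ids in the comment are the ones named above.

```python
import re

# Matches "embed", "embedding", "embedder" (optionally plural) as a separate
# word segment, e.g. Qwen3-Embedding-0.6B or text-embedding-3-small.
EMBEDDING_NAME = re.compile(r"(?<![a-z])embed(?:ding|der)?s?(?![a-z])", re.IGNORECASE)


def is_embedding_only(model_id: str) -> bool:
    return bool(EMBEDDING_NAME.search(model_id))


def pick_auto_model(model_ids):
    """Return the first model that does not look like an embedder, else None."""
    return next((m for m in model_ids if not is_embedding_only(m)), None)
```

    The letter lookarounds keep the pattern from firing on longer words, so a chat model whose name merely contains "embedded" is left alone.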

    The agent recovers from a tool slip

    When a model emitted a tool call missing a required argument, only non-native models got an actionable "fix the parameters and try again" nudge. Native-tool models, including local models served with tool calling, got just the raw error and would repeat or stall until a circuit breaker stepped in. Now every model gets a nudge that names the missing parameter with a per-tool hint, so a smaller local model can correct itself instead of looping. Separately, in-loop verification always failed on Windows because the auto-verify step appended a Unix-only no-op that is not a command there. It uses a cross-platform no-op now, so verification reports the real result.

    Privacy and MCP

    A new memory.auto_extract setting (on by default) can be switched off to stop the agent from mining facts out of conversations into the persistent memory store. And when an enabled MCP server exposes file tools that overlap Bodega's built-in file tool, MCP settings now flag it, because two ways to read and write files can confuse smaller local models. It is detection only; nothing is disabled automatically.

    Still an open beta, and these were real edges a few of you hit running local and custom-server setups, so we would rather name them. Full notes are on the changelog page. Update through the app, or grab the latest installer from the download page.

  6. Bodega One v1.0.0-beta.31.3 - out now

    beta.31.3 is a verification and agent-control release, with a new per-model sampling panel and a round of fixes from live testing. Two ideas run through it: Bodega checks its own work harder, and you get finer control over what the agent may do without being interrupted to approve it. Almost everything new here is off until you turn it on, and all of it runs on your machine.

    Bodega reviews the code it just wrote

    After a creation task passes QEL, an optional second pass now reads the files the agent actually changed and flags the obvious problems (single-responsibility violations, weak names, plain bugs) in a collapsible Code Quality section on the verification card. It never gates completion, caps at five findings, skips files over 50 KB, and bails after five seconds instead of stalling the run. The rubric grader is the other half: attach a free-text rubric to a task and, once QEL passes, a one-shot grader scores the output against it and pins a verdict, a score, and a short justification to the same card. Both are opt-in, both fall back to a clean "inconclusive" under air-gap or on any error, and neither adds a new event to the stream.

    Fewer approval prompts, and a way to teach new skills

    Smart approval removes the clicks you do not need. In Ask mode, when the agent calls a low-risk read-only tool that clearly matches what you just asked for (a grep, a glob, a web fetch), it can approve that one call itself instead of stopping to ask you. Write tools like the shell and file edits stay hard-blocked and always need a human click, and the classifier runs last, after every existing security gate, so it can only ever shrink the set of calls you approve, never widen it. Off by default, with a tunable confidence bar.
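
    The "shrink, never widen" property comes from where the classifier sits. A minimal sketch, assuming hypothetical tool ids, a hypothetical default bar of 0.8, and a confidence score supplied by some classifier:

```python
# Read-only examples named in the note; the ids themselves are assumptions.
READ_ONLY_LOW_RISK = {"grep", "glob", "web_fetch"}


def can_self_approve(
    tool: str,
    earlier_gates_allow: bool,  # result of every existing security gate
    confidence: float,          # classifier's confidence the call matches the request
    enabled: bool = False,      # off by default
    bar: float = 0.8,           # tunable confidence bar (value is hypothetical)
) -> bool:
    """True only when the call may skip the human approval click."""
    if not enabled or not earlier_gates_allow:
        return False
    if tool not in READ_ONLY_LOW_RISK:
        return False  # shell, file edits, and anything unknown still need a click
    return confidence >= bar
```

    Because the function can only return True for calls the earlier gates already allowed, it removes prompts without ever permitting a call that would otherwise have been refused.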

    The new /learn command authors a skill from a source: point it at a workspace folder or a URL, it drafts a spec-conformant skill file, shows you a preview, and writes it only on approval. URL sources are blocked under air-gap; local folders still work. And auto-skill capture closes the loop: when a creation task passes QEL cleanly through a real multi-step workflow, Bodega can save that run as a lightweight, per-user knowledge card so the next similar request recalls it on its own. Sanitized before it is stored, scoped to you, off by default, and it never leaves the machine.

    Per-model sampling, and a loop that trims itself

    Every installed model now has an Advanced Sampling panel next to the temperature and context controls: top_p, top_k, min_p, repeat_penalty, repeat_last_n, seed, and stop, set per model. Samplers a given backend cannot accept are greyed out for that provider, so you cannot set a value the server will only reject. Separately, the agentic loop now compacts a long conversation proactively, at a fill threshold you choose and on a clean iteration boundary, instead of waiting for the old hardcoded 85% emergency point. A long run trims itself before it hits the wall, and the summary is still pinned for cache stability exactly as before.

    Groundwork for an honest benchmark

    We are building toward publishing a harness-lift number: how much Bodega's harness adds on top of the raw model, measured on the Aider Polyglot set, run fully locally. This release is just the plumbing. A headless run can now be aborted to free its slot, a run honors a maximum-duration cap so one stuck task cannot wedge a batch, and there is a written adapter spec (fresh session per task, the hidden tests as the oracle, harness versus raw model). The numbers come later, and only if they clear a bar we would be willing to publish against ourselves.

    Fixes

    A correct creation could fail QEL with a passing result sitting right underneath it. A Python module with tests might score 68 out of 100 because optional language idioms that happened to be absent still counted against it, and the Python execution proof only ran for app entry points. Optional patterns are bonus-only now, the proof falls back to any Python deliverable, and the same file passes. The first-run beta gate was also collecting junk email addresses typed to get past it, so the screen and the backend now check the address against a shared validator before anything is sent on (local activation still works either way). And the Agent settings note that described the temperature schedule backwards now matches what the loop actually does: steadier while planning, a little higher while writing files.

    Still an open beta, and most of what is new here stays off until you switch it on. Full notes are on the changelog page. Update through the app, or grab the latest installer from the download page.

  7. Bodega One v1.0.0-beta.31.2 - out now

    beta.31.2 is the cleanup marathon after beta.31.1: a hands-off quality-of-life and provider-audit campaign (one pass worked 172 findings high to low), the bugs we hit in live diagnostics, and a round of preview and dev-server fixes. Mostly fixes, a few new tools. All of it runs on your machine, and air-gap is unchanged.

    The agent stops looking stuck

    Three things made capable models look broken. A cloud build of a full multi-module game stalled at iteration 24 because the model profile capped the budget below the user ceiling, so we raised the size-class recommendations and the default ceiling from 25 to 50 (the no-progress and degeneration circuit breakers still bound runaway loops). The contract extractor could mine prose into a garbage contract: a "build a 3D game with Three.js, steering (A/D)" prompt turned Three.js into the only deliverable and "steering (A/D)" into fake routes, so the firewall blocked the real package.json and the model wrote nothing. Known library filenames are no longer treated as deliverables, and load-bearing web scaffold is exempt when the project looks like web or JS. And a force-completed run used to print "Task complete" even when the quality check had failed, one line above the warning that said otherwise. The summary is now an honest "Wrote N file(s)" and leaves the verdict to the QEL line.

    Providers got a real audit

    A long pass over every provider closed a stack of papercuts. BYOK per-message cost showed $0 for Fireworks and cold-cache OpenRouter; that is fixed with pure id-resolution, no price guessing. Cloud rate-limiting over-throttled higher-tier accounts by pacing against a conservative hardcoded default, adding 15 to 45 seconds per iteration; it now learns the real per-window limit from the provider's own response headers. And rate-limit buckets were shared across every OpenAI-compatible provider, so one 429 on Groq throttled Together, OpenRouter, DeepSeek, and Fireworks too. They are keyed per preset now. Plus Anthropic extended-thinking round-trips, Ollama tool-result linkage, and OpenAI complete() parity with streaming.
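
    The two rate-limit changes (buckets keyed per preset, limits learned from response headers) can be sketched together. Everything here is illustrative: the class, the default of 20 requests, and the x-ratelimit-limit-requests header name are assumptions, and real providers differ in which headers they send.

```python
class RateLimitBuckets:
    """One bucket per provider preset, so a 429 on one never throttles another."""

    def __init__(self, default_per_window: int = 20):
        self._default = default_per_window  # conservative fallback (hypothetical value)
        self._learned = {}        # preset -> limit read from response headers
        self._blocked_until = {}  # preset -> timestamp when requests may resume

    def learn_from_headers(self, preset: str, headers: dict) -> None:
        # Header name is an assumption; keys are expected to be lowercase.
        raw = headers.get("x-ratelimit-limit-requests")
        if raw is not None and raw.isdigit():
            self._learned[preset] = int(raw)

    def limit_for(self, preset: str) -> int:
        return self._learned.get(preset, self._default)

    def record_429(self, preset: str, retry_after_s: float, now: float) -> None:
        self._blocked_until[preset] = now + retry_after_s

    def is_throttled(self, preset: str, now: float) -> bool:
        return now < self._blocked_until.get(preset, 0.0)
```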

    Preview owns the preview

    Dev servers used to pop the system browser, and a foreground dev command could hang the model until it timed out, which pushed weaker models into the doom-loop guard. Now dev servers run with the browser disabled so Bodega's in-app Preview owns the URL, a foreground dev server auto-runs in the background, and when it prints its localhost URL the Preview opens to it. The wrong-port bug (Vite increments past 5173, and people set custom ports) is fixed. Two new tools came out of it: open_preview, one action that finds the running server and opens the panel, and shell run_in_background, which starts a long-lived process without blocking the call, with the same approval and credential-scan guarantees as the foreground path.

    Security

    A sandbox escape through a symlinked parent directory is closed, memory routes and loop concurrency are now scoped per user, and PreToolUse hooks plus permission profiles gate read-only tools too. A deny on **/.env now stops a read, not just a write.
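
    For the deny rule, a simplified matcher shows the intent: the same check runs whether the tool reads or writes. The tool names and functions are hypothetical, and the glob handling is an approximation (fnmatch lets * cross directory separators, and a leading **/ is also allowed to match the project root).

```python
from fnmatch import fnmatchcase


def glob_matches(path: str, pattern: str) -> bool:
    path = path.replace("\\", "/")
    if fnmatchcase(path, pattern):
        return True
    # Let "**/.env" also cover a .env sitting at the project root.
    return pattern.startswith("**/") and fnmatchcase(path, pattern[3:])


def is_allowed(tool: str, path: str, deny_globs) -> bool:
    """Deny rules apply to every file tool, read-only ones included."""
    return not any(glob_matches(path, g) for g in deny_globs)
```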

    Still an open beta, and a good chunk of this is us fixing things we shipped, named plainly. Full notes are on the changelog page. Update through the app, or grab the latest installer from the download page.

  8. Bodega One v1.0.0-beta.31.1 - out now

    A same-day hotfix on top of beta.31, all of it about two cloud providers and how the app handles a model that cold-starts on the first request. If you run local models, nothing here touches you. If you bring an OpenRouter or Fireworks key, this is the one to grab.

    No more false OpenRouter disconnects

    A single slow health poll could flip the app to "disconnected" and blank the model picker, even though the provider was fine. OpenRouter just has a large catalog that sometimes takes longer than one fetch window. The health check now waits for two failed polls in a row before it shows the banner, clears it instantly on one good poll, and the catalog fetch gets a longer timeout so it finishes and caches instead of failing over and over.
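
    The banner logic is a small piece of state: two failures in a row to show it, one success to clear it. A minimal sketch, with a hypothetical class name:

```python
class HealthMonitor:
    def __init__(self, failures_before_banner: int = 2):
        self.failures_before_banner = failures_before_banner
        self.consecutive_failures = 0
        self.disconnected = False

    def record_poll(self, ok: bool) -> bool:
        """Record one health poll and return whether the banner should show."""
        if ok:
            # A single good poll clears the banner immediately.
            self.consecutive_failures = 0
            self.disconnected = False
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_before_banner:
                self.disconnected = True
        return self.disconnected
```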

    Cold models no longer time out into an empty reply

    OpenRouter returns headers right away, then holds the stream open while it warms the upstream model. If that warm-up ran long, it sent an error before the first token and the app silently swallowed it, ending with a blank response. Those pre-token errors now surface, cloud providers retry the transient ones automatically (safe, since nothing had streamed yet and the first request already warmed the model), and the connect window is wider so a slow-but-healthy start is not cut off.

    Fireworks shows its models again

    Fireworks' model-list endpoint errors out for serverless accounts, so the picker came up blank and the provider looked broken. Bodega now ships a curated list of current Fireworks serverless models, each one verified to chat, stream, and tool-call. While we were in there, OpenRouter got accurate per-message cost tracking from its live pricing, a picker organized by vendor and popularity, previously hidden Llama and Gemini models back on the list, and app-attribution headers so Bodega One registers on OpenRouter's app directory.

    Still an open beta, and these were real bugs a few of you hit, so we would rather name them than bury them. Full notes are on the changelog page. Update through the app, or grab the latest installer from the download page.

  9. Bodega One v1.0.0-beta.31 - out now

    beta.31 is the advanced-automation and verification wave. Two ideas run through it: you decide exactly what the agent is allowed to do, and Bodega can re-prove the work it already verified still holds. All of it runs on your machine, and air-gap still means nothing leaves.

    You gate what the agent can touch

    Lifecycle hooks let you run your own shell command before or after the agent uses a tool, to lint, format, validate, or block a change before it happens. A PreToolUse hook can stop a tool or rewrite its input; a PostToolUse hook just observes. Named permission profiles are reusable rule sets that only ever narrow what the agent may do: deny a tool by name, or confine file writes to a path glob like src/**. The safety line that matters: hooks you write in your own settings are trusted, but hooks committed in a repo's .bodega/config.json never run until you approve them, per exact command and project, so a cloned repo can't run anything behind your back.

    Bodega re-proves its own work

    When a creation task passes QEL verification, Bodega now remembers the contract it passed (the deliverables, the framework, the score) as a durable "this worked" baseline. Drift Radar re-proves those baselines against your current code and tells you which still hold, which regressed, which had their proof break, and which simply changed. Run it on demand from the new Drift tab, or let an optional nightly sweep watch in the background and raise a pill the moment something newly regresses. It is report-only: it never edits, applies, or retires anything, makes no network calls, and spawns no processes, so it stays air-gap-safe. Off by default.

    Verification you can hand to someone else

    Proof-carrying commits attach a signed Bodega-QEL trailer to a commit you verified: the score, pass or fail, the execution-proof result, and a hash of the contract. It is HMAC-signed with a secret that never leaves your machine, so a "verified" claim can't be forged or edited after the fact, and a Verify history action re-checks recent commits and marks each one verified, tampered, or unsigned. Air-gap attestation does the same for your network posture: a signed, tamper-evident record of whether air-gap is on and how many outbound attempts were recorded or refused. Both are honest by design. They attest what was actually recorded, not a guarantee that zero bytes ever left, and both are fully local.
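
    For readers who want to see why an HMAC-signed trailer resists editing, here is a small sketch. Only the Bodega-QEL trailer name and the use of HMAC come from the note above; the JSON layout, the sig= suffix, SHA-256 as the hash, and the function names are all assumptions for illustration.

```python
import hashlib
import hmac
import json

TRAILER_KEY = "Bodega-QEL"


def sign_trailer(secret: bytes, record: dict) -> str:
    """Serialize the record deterministically and append its HMAC."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    sig = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{TRAILER_KEY}: {body} sig={sig}"


def check_commit(secret: bytes, message: str) -> str:
    """Classify a commit message as verified, tampered, or unsigned."""
    prefix = TRAILER_KEY + ": "
    for line in message.splitlines():
        if line.startswith(prefix):
            body, sep, sig = line[len(prefix):].rpartition(" sig=")
            if not sep:
                return "tampered"  # trailer present but signature missing
            expected = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
            # compare_digest avoids leaking how many characters matched.
            ok = hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8"))
            return "verified" if ok else "tampered"
    return "unsigned"
```

    Since the secret stays on the machine, anyone who edits the score in the trailer cannot recompute a matching signature, which is what the "tampered" verdict reflects.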

    The bug we shipped, then caught

    QEL was checking plain JavaScript with the TypeScript compiler. In a dependency-free .js project that fails instantly, which sank otherwise-correct work to a failing score. This was a real cause of the "low score on a local model" reports. Vanilla .js and .jsx now get a parse-only node --check per file, and proof coverage widened to .mjs, .cjs, .php, .c, .cpp, and .swift so valid code in those languages isn't scored zero. If you tried a plain-JS task and got a bad grade, that was on us, and it is fixed.

    Downloads no longer 404 mid-release

    There was a 5 to 18 minute window during a release where the download links returned 404. GitHub flipped the release to public and latest before every platform's installer had finished uploading, so for a few minutes the latest release pointed at assets that were not there yet. Releases now upload into a draft and only flip to public once all platforms finish. Worth naming plainly, because a few of you hit it.

    The rest

    Watch-mode comment triggers act on inline // bodega: comments when you save, through the same verified, worktree-isolated apply gate as every Loop. Opt-in OpenTelemetry export sends run telemetry to an OTLP collector you run on localhost, loopback-only. Custom llama.cpp arguments let power users pass raw llama-server flags to tune for their hardware. Plus MCP per-server tool filtering, path-scoped project rules, contract preview in the composer, and a configurable shell environment policy. All off by default, or unchanged unless you reach for them.

    This is still an open beta. It has rough edges, and the JavaScript-scoring bug above is exactly the kind we would rather fix in the open than pretend away. If something breaks, tell us on Discord or GitHub. Full notes are in the in-app changelog under Settings, or on the changelog page. Update through the app, or grab the latest installer from the download page.

  10. Bodega One v1.0.0-beta.30 - out now

    beta.30 is the release where the agent stops being stateless. It remembers what worked on past tasks, your scheduled Loops learn from their own runs, and the quality checks finally fire on the runs that used to slip through ungraded. All of the learning stays on your machine. Air-gap still means nothing leaves.

    The agent learns across tasks

    Until now every task started cold. beta.30 gives the agent an in-context memory: it records what worked on similar tasks, surfaces those lessons on the next one, reflects after each run to distill a short takeaway, and learns from the tools you reject so it stops suggesting them. This is memory, not fine-tuning. Nothing about the model changes, and none of it leaves your device. It is scoped to you and the project, and it honors air-gap like everything else.

    Loops learn from their own history

    A scheduled Loop now reads its recent runs and carries forward what failed, so it stops repeating the same mistake every cycle. A Loop whose quality is steadily dropping auto-pauses instead of grinding on and burning spend. Loops can also pace themselves and back off when a run changes nothing, expire on a date, cap their total number of runs, and be defined in a committed .bodega/loop.md file so a repo ships its own automations the same way it ships its config.

    The quality checks fire when they matter

    This one is a real bug we shipped and then caught. The Quality Enforcement Layer used to skip grading in two cases: when a model thrashed and the loop force-completed, and when a request was phrased loosely ("make me a dashboard") instead of naming files. Those runs went through marked as passed without ever being checked. Now they are graded honestly, with a real score and a repair path, no matter how the run ended or how the request was worded. Verified live across a small local model, a large local model, and a cloud model.

    Groundwork: learning scoped per user

    Every learned rule, recorded mistake, cached tool pattern, and model-performance record is now tagged to its owner, so nothing crosses between accounts. You will not see this as a feature yet. It is the plumbing for shared deployments later, done now so the memory work above is built on it from the start.

    Full notes are in the in-app changelog under Settings, or on the changelog page. Update through the app, or grab the latest installer from the download page.

  11. Bodega One v1.0.0-beta.29.5 - out now

    beta.29.2 through beta.29.5 shipped over the last few days. None of them is a new direction. They are a reliability and privacy pass on the beta.29 routing line, mostly bug fixes, because this is an open beta and the boot path for local models had real problems worth fixing in the open. Here is the honest rundown.

    The local-model boot path was broken, and now it is not

    If you picked a local GGUF and ran it through llama.cpp, a few things could go wrong. beta.29.3 and .29.4 fix them.

    • A crashed llama-server used to wedge the app. It recovers now instead of hanging.
    • Downloaded GGUFs show in My Models under the llama.cpp preset even before llama-server is running. The list comes from what is on disk, not from a live server.
    • No more silent fall-back to Ollama. Once you pick a local GGUF, Bodega stays on llama.cpp. With nothing loaded it sits idle instead of switching to Ollama and logging a connection error every 30 seconds.
    • A stale model id left over from an old Ollama setup no longer jams startup with "Unknown GGUF id." It is cleared, and Bodega waits for you to pick a model.
    • Weights load in the background, so the window stays responsive while they load.
    • Test Connection with no model loaded now tells you to load a model first, instead of a raw "fetch failed."
    • Speculative-decoding draft models resolve from the right location, with a migration for installs on the old path.

    The editor works fully air-gapped now

    The Code-mode editor used to fetch its Monaco bundle from a CDN at runtime. That meant opening the editor made an outbound request, which is the wrong behavior for a local-first tool. beta.29.5 ships the Monaco bundle inside the app. With air-gap on, opening the editor makes zero outbound requests. It still loads on demand, so the main bundle stays within budget.

    Beta sign-up stopped dropping contacts

    The first-run email capture had a quiet bug: a network error at sign-up could lose your submission. beta.29.5 writes every submission to a durable on-device queue before it contacts Loops, then retries in the background. Offline or a flaky network at sign-up no longer loses the contact. Air-gap still sends nothing until you turn it off. beta.29.3 also added a name field alongside the email on that gate.

    A lightweight file viewer

    Opening a file by association or with "Open with" used to boot the whole app and load a model just to look at one file. beta.29.5 adds a fast viewer instead. Markdown gets a formatted preview. Any file opens in a real Monaco editor with line numbers, minimap, and syntax highlighting. You can edit, save with Ctrl/Cmd+S, and you get an unsaved-changes prompt on close. "Open in Bodega One" promotes the file into the full app when you want it.

    The smaller fixes

    • Installing llama.cpp or Ollama from Settings streams real download progress, keeps running when you leave the tab, has a Cancel button, and no longer offers to install a binary you already have. (beta.29.2)
    • The managed llama-server binary moved to build b9670, and five current GGUF models joined the catalog. More on that on the model-updates page. (beta.29.2)
    • An in-flight GGUF download re-attaches its progress bar after a full app restart. (beta.29.3)
    • "What's New" shows your real version instead of a stale "Unreleased" heading on a shipped build. (beta.29.3)

    Full notes are in the in-app changelog under Settings, or on the changelog page. Update through the app, or grab the latest installer from the download page.

  12. Bodega One v1.0.0-beta.29 - out now

    Bodega One beta.29 is out. Three things this release: you decide which model handles what, the agent does more on its own before it hands work back to you, and it can now take a task or a GitHub issue all the way to a reviewed pull request without your code ever leaving your machine. Still local-first. Air-gap still means nothing leaves your machine.

    Routing rules

    You write the routing table now. Ordered rules, first match wins, sitting above the built-in heuristics and below your explicit picks. Route by mode (chat or code), the kind of question, the agent step (read, write, plan, verify), file path, message size, or how much you have spent today. Send a rule's work to a specific model, or keep it local.

    • OR and NOT conditions for the cases that need them, like "auth or env files, but not read steps."
    • Teach the chat classifier your own vocabulary with custom patterns.
    • Ship rules per project in .bodega/config.json, so a repo routes the same way for the whole team.
    • Every routed message tells you which rule decided it. The Auto pill previews the rule as you type, and a dry-run tester routes a hypothetical message without sending it.
    • Smart Auto and the chat classifier are now visible as toggleable rules in the same place, so the whole routing story reads top to bottom in evaluation order.

    Hard limits still clamp everything afterward. A rule can never bypass air-gap, VRAM limits, or your spend caps.
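
    To make the evaluation order concrete, here is a small sketch of ordered, first-match-wins rules with OR and NOT groups, followed by a hard-limit clamp. The message shape, rule names, and model names are invented for illustration. The real rule format lives in the app and in .bodega/config.json and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable

Message = dict  # hypothetical shape: {"mode": str, "step": str, "path": str}
Condition = Callable[[Message], bool]


def any_of(*conds: Condition) -> Condition:
    """OR group: true when at least one condition matches."""
    return lambda m: any(c(m) for c in conds)


def all_of(*conds: Condition) -> Condition:
    """AND group: true only when every condition matches."""
    return lambda m: all(c(m) for c in conds)


def not_(cond: Condition) -> Condition:
    """NOT: inverts a single condition."""
    return lambda m: not cond(m)


@dataclass
class Rule:
    name: str
    when: Condition
    model: str


def route(rules: list[Rule], message: Message, fallback: str) -> tuple[str, str]:
    """Return (model, deciding rule). Rules are checked in order; the first match wins."""
    for rule in rules:
        if rule.when(message):
            return rule.model, rule.name
    return fallback, "built-in heuristics"


def clamp(model: str, cloud_models: set[str], air_gapped: bool, local_model: str) -> str:
    """Hard limit applied after routing: air-gap never lets a cloud model through."""
    return local_model if air_gapped and model in cloud_models else model


# "auth or env files, but not read steps"
sensitive_writes = all_of(
    any_of(lambda m: "auth" in m["path"], lambda m: m["path"].endswith(".env")),
    not_(lambda m: m["step"] == "read"),
)
RULES = [
    Rule("sensitive-writes", sensitive_writes, "local-model"),
    Rule("code-mode", lambda m: m["mode"] == "code", "cloud-model"),
]
```

    Returning the rule name alongside the model is what lets every routed message report which rule decided it.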

    Verified Private Automation

    Point Bodega at a plain task or a GitHub issue and it does the whole job on your machine. It works in an isolated git worktree, writes the code, runs full QEL verification (it boots your server and probes it, runs your tests, checks the contract), and opens a pull request with the verification trace right in the description. A passing run gives you a ready PR. A failing one opens a draft, so an unverified change is never presented as merge-ready. A dry run verifies without opening anything.

    Cursor's and Copilot's coding agents do this by shipping your code to their cloud. Ours runs on your hardware and proves its work before it commits. Nothing leaves your machine but the branch and the PR.

    • Run it with one of your own custom agents, so its system prompt, model, and tool allowlist drive the verified run. Three starter agents ship as one-click templates: Code Reviewer, Test Runner, and Doc Checker.
    • Every custom agent now has a run history: status, QEL score, files changed, and the full verification trace per run.
    • Connect a token in Settings → Integrations → GitHub, then click Automate in the Code activity bar, next to Source Control. Fine-grained GitHub tokens are scanned and redacted from agent output the same as any other secret.

    The agentic loop got smarter

    • Goals. Type /goal followed by what "done" means. The goal survives every message, its task list resumes where the last run stopped, and an approved plan becomes its tasks. It will not call itself done on its own. When it thinks a goal is finished, a second model you choose is sent in to attack the result and find what breaks. Anything it finds becomes new work.
    • It sees type errors as it writes. After every TypeScript, JavaScript, or Python edit, diagnostics from a bundled language server land in the tool result, so the agent fixes the error in the same pass instead of at the end. No install, works on bug-fix tasks too, works headless in Fleet and Loops runs, works under air-gap.
    • dispatch_scout. Send a read-only sub-agent to explore the codebase and report back a short digest, so the search does not fill your context. You get the answer, not the forty files it read. The scout cannot edit, run commands, or send its own scout.
    • get_diagnostics. The agent can pull the current type errors for a file before it touches it, instead of running the compiler through the shell.
    • Test-driven repair. When a build task fails verification twice and the project has a test runner, a capable model writes one failing test, proves it fails for the right reason, then fixes until it passes.
    • Two new skills. /decompose turns a big objective into a goal with three to seven verifiable tasks, each naming its own evidence. /onboard tours an unfamiliar repo, traces one real flow end to end, and saves what it learned as project knowledge.

    Your editor, your models, your privacy

    • The editor language server graduated from experimental, and now it works for everyone out of the box. Red squiggles, go-to-definition, hover types. The server ships with the app, so there is nothing to install and nothing to configure.
    • The full reasoning range, per model. The dial now shows every tier a model actually has. GPT-5 gains Minimal, and Claude 4.6 and later gain Extra High and Max. Each model's menu shows only the tiers it accepts, and switching models clamps your setting to the nearest one, so a request never gets rejected.
    • Cost tracking for your own models. Set token prices for self-hosted, custom, or internal-gateway models, and their cost shows in the dashboard like any cloud provider. It stays on your machine, no lookup, works in air-gap.
    • Export a session as a self-contained web page that opens in any browser with no network, or a self-hostable viewer you run on your own infrastructure. It is a local file you choose to share. Nothing is uploaded.
    • Driving Bodega from another editor over ACP is verified and documented. Zed, JetBrains AI Assistant, or any ACP client can run it as a headless agent over stdio, with every tool call still gated by your approval.
    • New models and two new providers. A catalog refresh adds fourteen models and two cloud providers, MiniMax and Z.ai. Cloud prices were corrected across the board so the dashboard and pre-send estimates match what you are billed.
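
    The "clamps your setting to the nearest one" behavior for reasoning tiers might look like the sketch below. The full tier ordering and the tie-break toward the lower tier are assumptions. The entry above names only Minimal, Extra High, and Max.

```python
# Assumed ordering, lowest to highest reasoning effort.
TIER_ORDER = ["minimal", "low", "medium", "high", "extra_high", "max"]


def clamp_tier(requested: str, supported: list[str]) -> str:
    """Pick the supported tier closest to the requested one (ties go to the lower tier)."""
    if not supported:
        raise ValueError("model exposes no reasoning tiers")
    want = TIER_ORDER.index(requested)
    return min(
        supported,
        key=lambda tier: (abs(TIER_ORDER.index(tier) - want), TIER_ORDER.index(tier)),
    )
```

    Switching from a model that accepts Max to one that tops out at High would then send High rather than a value the provider rejects.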

    Plus the polish. A visual pass over the whole app: buttons read as their accent color instead of filled purple boxes, the toggles are a thinner minimal design, and the Chat / Code switch is cleaner. You can also stop a running Loop mid-run from the indicator, and the partial work is left in its worktree for review, never auto-applied.

    Plus the usual round of fixes. Routing now fires correctly on live sends, and a budget condition inside an OR or NOT group now reads your real spend. Full notes are in the in-app changelog under Settings, or on the changelog page. Update through the app, or grab the latest installer from the download page.

  13. Bodega One v1.0.0-beta.28 - out now

    Bodega One beta.28 is out, one day after beta.27. This one is the polish release. We ran a 25-agent research sweep over everything already shipped and turned what it found into five waves of work: correctness fixes, performance, a UX consistency pass, and the small features people kept asking for. No new direction, no new flags to learn. beta.27, tightened.

    Run Tasks

    One-click dev-server launch from the terminal tab bar, plus a labeled "Run tasks" chip in the agent panel's action row so it is not hidden. It auto-detects dev, start, serve, and watch scripts from your project's package.json, takes custom named tasks via a run_configs key in .bodega/config.json, runs each task in its own terminal tab, and "Run all" launches everything at once. Stopping sends Ctrl+C so the process exits cleanly and the logs stay readable. If the project has a package.json but no node_modules, Run Tasks offers a one-click npm install first.
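
    The auto-detection step can be approximated in a few lines. This sketch looks only at the four script names listed above. The "npm run" command form is an assumption, and custom run_configs tasks are left out because their shape is not described here.

```python
import json

AUTO_DETECTED = ("dev", "start", "serve", "watch")


def detect_run_tasks(package_json_text: str) -> list[str]:
    """Return one command per auto-detected script found in package.json."""
    scripts = json.loads(package_json_text).get("scripts", {})
    return [f"npm run {name}" for name in AUTO_DETECTED if name in scripts]
```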

    Runs Inbox

    One topbar surface for everything waiting on you: fleet sessions needing approval or ready to apply, loop runs parked for review or below their QEL bar, each with click-through to the owning panel. Hidden when empty. The Fleet and Loops pills keep narrating live activity; the inbox is where the decisions queue up.

    Small features people kept asking for

    • Coming from another editor? Opening a project with Cursor, Copilot, Cline, Windsurf, or Continue rules offers a one-click import into .bodega-rules. Append-only, never overwrites yours, re-import skips what is already in.
    • Auto-approve read-only tools in Ask mode, off by default. Searching and reading can skip the approval prompt while every file write, web call, and shell command still asks. Shell can never auto-approve.
    • Pre-send cost estimate: on cloud models the status bar shows roughly what your next message will cost next to the Context meter. Local models show nothing, because there is nothing to pay.
    • Review only what changed: the source-control Review button skips files that are unchanged since your last review, with a "Review everything" escape hatch.
    • Interval and Map-staleness loops show a live "next in 14 min" countdown, and the topbar loops pill gained a per-loop Run-now button.
    • Every llama.cpp model swap now shows stage labels and elapsed time, not just the vision-triggered ones.
    • The agent can open the Preview tab when you ask it to, instead of only navigating one you already opened.
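
    The auto-approve rule for read-only tools reduces to a small policy check. The tool names below are placeholders, not Bodega's real tool ids. The point is the ordering: shell is checked first, so no setting can auto-approve it.

```python
READ_ONLY_TOOLS = {"read_file", "search"}  # hypothetical tool names


def needs_approval(tool: str, mode: str, auto_approve_read_only: bool) -> bool:
    """Decide whether a tool call must wait for the user's approval."""
    if tool == "shell":
        return True  # shell can never auto-approve
    if mode == "ask" and auto_approve_read_only and tool in READ_ONLY_TOOLS:
        return False  # opt-in: searching and reading skip the prompt in Ask mode
    return True  # writes, web calls, and everything else still ask
```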

    Fixes worth naming plainly

    • Cost tracking showed $0.00 for Claude Opus 4.8 and Fable 5. The pricing table was missing both flagships and carried stale rates for older models. If you watched your spend with those models, the number was lying to you. Fixed, and the per-message estimate above exists partly so this class of bug is visible immediately.
    • The air-gap warning about configured cloud providers never fired. It was checking a deprecated setting. The warning works now.
    • Two real leaks: long-lived ACP editor connections grew memory forever (idle sessions now evict after 30 minutes, never mid-turn), and quitting Bodega mid-run could kill headless loops mid-write (shutdown now aborts them cleanly and waits for writes to land).
    • When no model was selected, requests went out with a placeholder model name and failed confusingly. You now get a clear "No model selected" pointing at Settings.
    • A failed send buried your prompt under an error banner with no way out. The banner has a close button now and restores your draft into the composer.
    • Plus: dev-server chips clear themselves when the server dies, dragging a panel tab too far no longer detaches a phantom mini-window, the keyboard-shortcuts overlay stopped omitting eight shortcuts that actually work, and the Windows installer's finish page is readable in dark system themes.
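
    The ACP memory fix ("idle sessions now evict after 30 minutes, never mid-turn") follows a familiar pattern, sketched here. The Session fields and function name are hypothetical. Timestamps are plain seconds.

```python
from dataclasses import dataclass

IDLE_LIMIT_SECONDS = 30 * 60


@dataclass
class Session:
    last_activity: float  # seconds, same clock as `now`
    in_turn: bool = False  # True while a response is being produced


def evict_idle(sessions: dict[str, Session], now: float) -> list[str]:
    """Remove sessions idle for 30 minutes or more, skipping any that are mid-turn."""
    evicted = [
        sid
        for sid, s in sessions.items()
        if not s.in_turn and now - s.last_activity >= IDLE_LIMIT_SECONDS
    ]
    for sid in evicted:
        del sessions[sid]
    return evicted
```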

    Consistency pass

    One focus ring across 35 controls instead of a mix of browser defaults. Shared icons for the fleet glyph and the safety shield everywhere they appear. The top bar clips cleanly on narrow windows instead of overlapping the window controls. Status colors read correctly on light themes. And the slow spots got faster: Map staleness checks hash 16 files at a time, llama.cpp loading no longer blocks the backend on boot, and embedding index builds detach instead of hanging the request.

    Full notes are in the in-app changelog under Settings, or on the changelog page here. Update through the app, or grab the latest installer from the download page.

  14. Bodega One v1.0.0-beta.27 - out now

    Bodega One beta.27 is out. The headline is Bodega Loops: scheduled agent tasks that run on your machine, for free, with every change gated by verification before it touches your code.

    Loops

    A loop is a named task on a schedule. Pick what it runs with, one of your custom agents or a specific provider and model, so a fast local model takes the small jobs and a cloud model takes the heavy ones. Trigger it on a cron schedule, a fixed interval, or on Bodega Map staleness: fire when N of your codebase map's module summaries fall out of date with the code. Nothing else can express that trigger.

    Every run executes headless in its own git worktree and lands as a reviewable diff with its QEL score and full trace. Park for review is the default. Auto-apply only happens when the score clears the bar you set. Dry run never applies anything. There is a dashboard in the Code sidebar, quick-start recipes in Settings (nightly doc sync, dead-code sweep, test backfill, Map self-maintenance), run history with expandable traces, and a topbar pill that goes red and stays red if a run fails. A loop can never apply silently and it can never fail silently. Scheduled runs refuse to execute without worktree isolation, respect your spend caps, and inherit air-gap.
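
    The trigger and the apply policy described above can be summarized as two small functions. This is a sketch under assumptions: the names are invented, "clears the bar" is treated as greater-than-or-equal, and the status strings are illustrative.

```python
from typing import Optional


def should_fire(stale_summaries: int, threshold: int) -> bool:
    """Map-staleness trigger: fire once at least N module summaries are out of date."""
    return stale_summaries >= threshold


def loop_outcome(
    failed: bool,
    files_changed: int,
    qel_score: float,
    auto_apply_bar: Optional[float],
    dry_run: bool,
) -> str:
    """Decide what happens to a finished run. Parking for review is the default."""
    if failed:
        return "failed"  # surfaced loudly, never silent
    if files_changed == 0:
        return "no changes"
    if dry_run:
        return "dry run"  # a dry run never applies anything
    if auto_apply_bar is not None and qel_score >= auto_apply_bar:
        return "auto-applied"
    return "parked for review"
```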

    The verifier got rebuilt

    • Execution proofs. For a server task, QEL boots the generated app in a sandbox, forces an ephemeral port, and sends one real request to the route the task asked for. A response is the strongest pass evidence there is. A boot crash or a 5xx is a real failure. Anything environmental stays neutral and never counts against you. The QEL card shows the probe and the verification time.
    • More languages. JS and TS test files run under vitest the way Python tests ran under pytest. Ruby and PHP get syntax gates. SQL and Dockerfiles get structural lints, so a truncated CREATE TABLE or a prose Dockerfile fails instead of sailing through.
    • A calibration harness. 43 labeled known-good and known-bad scenarios run through the real verifier at every model tier with CI floors: broken work scoring a pass fails the build. The first sweep caught two real scoring holes. Scoring is evidence-driven now.
    • The semantic judge runs under air-gap when the judge model is served locally, and it can finally park a marginal pass it is confident is wrong instead of only handing out bonus points.
    • Verification runs once per turn instead of three or four times. Roughly twice as fast, same checks.
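
    The three-way outcome of an execution proof (pass, fail, neutral) can be sketched as a classifier. Treating every non-5xx response as a pass is an assumption made for this example. The parameter names are hypothetical too.

```python
from typing import Optional


def classify_probe(booted: bool, status: Optional[int], environmental_error: bool) -> str:
    """Map a probe result to pass / fail / neutral evidence."""
    if environmental_error:
        return "neutral"  # e.g. sandbox unavailable: never counts against the task
    if not booted:
        return "fail"  # the generated app crashed on boot
    if status is None:
        return "neutral"  # no response to judge either way
    if status >= 500:
        return "fail"  # the server answered, but with a 5xx
    return "pass"  # a real response from the requested route
```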

    Terminal

    A pinned header keeps showing which long-running command the scrolling output belongs to, with live status and click-to-jump. The Output panel is real now: backend logs with level filters, search, and follow-tail. The Debug Console mirrors the agent loop's live diagnostics with expandable detail. The action bar gained searchable command history.

    Editor and source control

    Open files with Bodega One: double-click a .md or .json or .ts, or use Open with, and it opens in an editor tab, no project folder required. Works from the shell too. In source control: hunk-level staging (stage exactly the lines you mean to commit), a conventional-commit type picker next to the message box, and project search keeps your last 10 queries.

    Map, agents, Fleet

    Bodega Map exports the visible graph as a Mermaid diagram, explains any edge on click (why does A depend on B), and Ask-the-Map gains a Fast mode that answers near-instantly from the generated summaries. Custom agents now show who is driving: the status bar reads AgentName plus model whenever one is selected, and you can export and import your whole agent roster as a JSON bundle. Fleet Parallel's compare view shows each session's QEL badge and the run's total cost, with a one-click re-race for the non-winning models. Fleet apply gets a don't-ask-again option like discard.

    Quality of life

    The Performance tab can clear its model pass-rate history. Crash reporting has an opt-out toggle. Keybindings is now an honest, searchable reference of bindings that actually exist. Cards across Fleet and Loops show status at a glance with color washes and a breathing glow on running work. Buttons, sliders, chips, and settings layout all got a consistency pass.

    Security

    Two items worth naming plainly. The local API server now honors air-gap: it refuses to start with air-gap on and rejects requests if air-gap is switched on mid-session. And scheduled loop runs are hardened end to end: isolation required, project paths bounded, iteration caps enforced, concurrency cap race-free.

    Here is the long list of fixes. Full notes are in the in-app changelog under Settings → About.

    Verification fixes

    • QEL was failing correct TypeScript. A workspace with no tsconfig made the bare compiler print its help text and exit non-zero. QEL read that as a compile failure and hard-capped every correct TypeScript task at 50. This was the real source of the "0% pass on local models" reading, not the models.
    • Edits and bug fixes are verified again. The entire modification path had been dead code since beta.25, so every edit got no score and, under loop auto-apply, parked forever.
    • Model performance stats only counted failures, manufacturing the same fake "0% pass rate." Successes count now, and you can clear the poisoned history.
    • A task the verifier could not actually verify can no longer auto-pass on structural credit alone. It parks for review.
    • An execution proof can no longer pass off a server you already had running. Ports that were already answering before the probe are excluded. Found because the test suite happened to run while the app was open.
    • A hung verification can no longer freeze the app. The post-write build check ran synchronously on the main process (every stream stalled while one build ran), and a stalled judge could block forever. Both have hard timeouts now.
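
    For the hard-timeout fix, a minimal sketch of a build check that cannot hang its caller is shown below. It covers only the timeout half of the fix. Moving the work off the main process is a separate concern. The function name and return values are illustrative.

```python
import subprocess


def run_build_check(cmd: list[str], timeout_s: float) -> str:
    """Run a build command and report ok / failed / timeout instead of blocking forever."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"  # subprocess.run kills the child before raising
    return "ok" if result.returncode == 0 else "failed"
```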

    Loops and Fleet fixes

    • The hard cloud spend cap now covers background sessions, Fleet runs, and loops, and records their cost. All three used to bypass it entirely and their spend was invisible.
    • Stale fleet sessions are discardable again, and discarded cards stay gone instead of resurrecting as "Unknown" on the next refresh. A one-time migration heals cards already stuck this way.
    • A loop that changes nothing now finishes as "no changes" instead of parking an empty worktree as "Ready to apply" with zero files.
    • The diff view's Apply button no longer plays dead when the file count desyncs from the diff you are looking at.
    • Running a loop no longer fires a "signal timed out" toast. The request held the connection open for the whole run while the frontend timed out at 15 seconds. The duplicate completion toast is gone too.
    • Fleet cards show the model that actually ran, not a placeholder.
    • A custom agent's pinned cloud model now routes to its own provider instead of erroring on the active local one.

    Editor, git, and UI fixes

    • Closing a dirty editor tab from the command palette now asks about unsaved changes, like Ctrl+W always did. This was a real data-loss path.
    • Streaming no longer re-renders the entire message history on every token.
    • The docs hub no longer drags the whole markdown pipeline into the startup bundle, undoing a 423 KB regression.
    • A failed search shows what went wrong (bad regex, internal error) instead of pretending there were no results.
    • Git stage and unstage failures surface in the error strip instead of silently doing nothing. Commit-message generation failures surface too.
    • Git review: "nothing to review" is a calm message instead of an API error, and refactor prompts no longer spin the commit loop.
    • Prompts and Skills tabs no longer render a backend failure as an empty list.
    • The terminal agent greeting no longer double-posts.
    • The About-page logo renders cleanly in every theme, and six Settings warning banners are readable on light themes.

    Update through the app, or grab the latest installer from the download page.

  15. Bodega One v1.0.0-beta.26 - out now

    We shipped Bodega One beta.26. This was a big one, so here's the full picture, not the highlight reel.

    Quick context if you're new: Bodega One is a local-first AI coding IDE. Chat mode and code mode in one window, your choice of models — local through Ollama or llama.cpp, or bring your own cloud key — a verification layer that scores the agent's work before it claims it's done, and an air-gap mode that means zero outbound network traffic when you flip it on. The whole point is to give you the modern agent workflow without handing your codebase or your wallet to someone else's server.

    beta.26 had two threads running through it: meet people in the editor they already use, and hand over more control over what the agent does and what it costs you.

    Bodega goes into your editor

    The headline is that you can now run Bodega from Zed.

    We already let you pull other agents into Bodega — Cursor, Claude Code, Gemini CLI, and Codex all run as Fleet members inside the app, sandboxed through Bodega's own tools. This release runs the other direction. Type bodega --acp and Bodega starts headless as an ACP (Agent Client Protocol) agent that an external editor drives over stdio. Register it in Zed's agent_servers and you're talking to Bodega's agent from inside Zed.

    The part I care about is that it's the same agent with the same guarantees. The working directory Zed hands it is confined to a folder you allow (acp.allowed_projects_dir, which defaults to your home directory). Tool calls route through the same permission gate as everywhere else in Bodega. And we route logs to stderr in ACP mode so nothing pollutes the JSON-RPC channel and corrupts the handshake. If your editor is Zed and your trust boundary is Bodega, you don't have to pick one.
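
    Confining a client-supplied working directory to an allowed root usually comes down to resolving both paths and comparing them. The sketch below shows that general technique. It is not Bodega's actual check, and the function name is made up.

```python
from pathlib import Path


def is_within(allowed_root: Path, requested: Path) -> bool:
    """True if `requested` resolves to the allowed root or somewhere beneath it."""
    root = allowed_root.resolve()
    target = requested.resolve()  # collapses ".." segments and follows symlinks
    return target == root or root in target.parents
```

    Resolving before comparing is the important part. A plain string-prefix test would accept a path like allowed/../elsewhere.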

    You can define your own agents

    A custom agent is a profile: a system prompt, a pinned model, a tool allowlist, a read-only flag, and a cap on how many iterations it's allowed to run. You build them in Settings → Custom Agents and pick one from the Agent panel's dropdown.

    The design decision worth calling out: the tool allowlist is enforced at the execution gate as an intersection with the panel's own tools. A custom agent can only ever narrow what's allowed to run, never widen it. A profile you label "read-only reviewer" stays read-only no matter what its prompt says or what the model tries to call. The same goes for the iteration cap — it's enforced on the loop, not suggested to it.
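
    In set terms, "an intersection with the panel's own tools" looks like this. The tool names are placeholders, and the read-only handling is an assumption about how the flag composes with the allowlist.

```python
def effective_tools(
    panel_tools: set[str],
    agent_allowlist: set[str],
    read_only: bool,
    write_tools: set[str],
) -> set[str]:
    """Tools a custom agent may actually run: it can narrow the panel's set, never widen it."""
    allowed = panel_tools & agent_allowlist  # anything not offered by the panel drops out
    if read_only:
        allowed -= write_tools  # the read-only flag wins over the allowlist
    return allowed
```

    An allowlist entry the panel does not offer simply disappears in the intersection, which is why a profile cannot grant itself new capabilities.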

    The Bodega Map grew up

    The Map was already a dependency graph of your codebase with an AI-written summary per file, cached by content hash. beta.26 added three things that make it useful instead of just pretty.

    Staleness tracking: when a file changes, its node gets an amber dot, because the cached summary is now describing code that no longer exists. You can see at a glance which explanations to trust.

    A project overview: a generated page that rolls up every module summary into one read, so a new contributor can get the shape of the codebase without clicking thirty nodes.

    And "Ask the Map" is no longer just a panel — it's a tool the agent can call mid-task (query_map). So when the agent needs to know where authentication is handled, it asks the Map and gets an answer grounded in your real code with sources, instead of guessing from the file it happens to have open. As far as I know nobody else exposes their codebase map to the agent as a callable tool. On top of that: per-symbol line-jump from the node drawer, a node accent bar that weights important files visually, and a Generate-Map run you can actually cancel.
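
    Staleness tracking on top of a content-hash cache can be sketched in a few lines. SHA-256 and the dictionary shapes are assumptions for the example. A file with no cached summary is treated as stale here.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash of a file's current contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_files(current: dict[str, str], cached_hashes: dict[str, str]) -> list[str]:
    """Paths whose contents no longer match the hash their summary was generated from."""
    return sorted(
        path
        for path, text in current.items()
        if cached_hashes.get(path) != content_hash(text)
    )
```

    Each path returned would get the amber dot. Regenerating its summary and storing the new hash clears it.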

    Reviewing what the agent did

    Two changes here, and they're the same story: trusting an agent means being able to check its work fast.

    Local code review — a Review button in the Git panel runs an AI review of your current working diff, the same flow as commit-message generation, on whatever model you've configured. No PR, no remote, no cloud requirement. It's "tell me what's wrong with what I'm about to commit," locally.

    Per-hunk diff review — when the agent edits a file, you don't have to accept the whole thing. You can accept or reject its changes one hunk at a time, and this release surfaces that one click from the Agent panel's change list, not buried in a separate panel. Accept the three good edits, reject the one you don't like, apply, done.

    Models and speed

    Claude Fast mode — a Fast toggle that sits right next to the reasoning control in the composer, shown only for Claude models. It skips extended thinking when you want a quick answer instead of a deliberation. The per-message reasoning dial still overrides it, so you can pull the thinking back for one message without touching the toggle. (And we fixed a bug where the old extended-thinking path was quietly overriding it — more on that below.)

    Managed llama.cpp embeddings — you can now point your codebase embeddings at a Bodega-managed llama-server instead of standing up Ollama just for embeddings. There's a picker that lists the GGUF models you already have, so it's one fewer moving part to run.

    Cost and privacy, made visible

    Spend caps — a hard ceiling on total cloud spend, not just Cloud Boost. When you hit it, new requests stop. This is the difference between "I think I'm being careful" and "the tool will not let me overspend."

    Air-Gap Active in the top bar — when air-gap is on, the top bar says so. The zero-network guarantee shouldn't be something you have to remember you turned on; now it's in your face the whole session.

    The terminal links to your code

    Ctrl/Cmd-click a path in terminal output — src/foo.ts:42, the TypeScript (42,10) form, a stack-trace frame — and it opens that file in the editor at that line. When a test or the compiler tells you where it broke, you click it instead of hunting for it. And the terminal finally has real copy and paste: Ctrl+Shift+C and Ctrl+Shift+V, plus a right-click menu with Copy, Paste, Select All, and Clear. Plain Ctrl+C still sends SIGINT, so interrupting a process still works.
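
    Recognizing those location formats is mostly a regular-expression problem. The patterns below are a simplified sketch, not the terminal's real matcher. They cover the path:line, path:line:col, and path(line,col) shapes and ignore edge cases such as paths containing spaces.

```python
import re
from typing import Optional

# path:line or path:line:col, e.g. src/foo.ts:42 or src/app.js:10:5
COLON_FORM = re.compile(r"(?P<path>[\w./\\-]+\.\w+):(?P<line>\d+)(?::(?P<col>\d+))?")
# path(line,col), the TypeScript compiler form, e.g. src/foo.ts(42,10)
PAREN_FORM = re.compile(r"(?P<path>[\w./\\-]+\.\w+)\((?P<line>\d+),(?P<col>\d+)\)")


def find_location(text: str) -> Optional[tuple[str, int, Optional[int]]]:
    """Return (path, line, column or None) for the first file location in a line of output."""
    for pattern in (PAREN_FORM, COLON_FORM):
        match = pattern.search(text)
        if match:
            col = match.group("col")
            return match.group("path"), int(match.group("line")), int(col) if col else None
    return None
```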

    The app itself

    The docs moved in. The Help panel is a real docs hub now — a section rail, search across everything, and a written guide for every feature (chat, code, the AI panels, models, vision, reasoning, fleet, the Map, memory and knowledge, QEL, tools, terminal, custom agents, integrations, privacy, hardware) plus a full keyboard and slash-command reference. You stop tab-switching to a website to learn how something works.

    Settings got a full overhaul. The whole surface was restructured from one long list into seven labeled groups — Look & Feel, Models & Providers, AI Behavior, Workspace, Integrations, Privacy & Safety, and About & Support — with rewritten copy and a visual pass, so settings sit where you'd expect them. Fleet got its own entry, RAG settings moved under Knowledge, and Knowledge picked up its own icon instead of sharing Memory's.

    And finished sessions now show their QEL verification score right in the chat sidebar, so you can see at a glance whether the agent's last creation task actually passed.

    The unglamorous half

    A real chunk of this release was hardening, and it doesn't demo, so it's easy to skip — but it's why the app holds up.

    The backend used to occasionally outlive the app on Windows and squat on port 3000, so the next launch hit "port in use." Now the app tree-kills its child process on quit and waits, and as a backstop the backend independently watches the app and self-exits if the app dies abruptly, whether by Task Manager, a crash, or a dev SIGINT. The backend also used to crash when its log pipe closed mid-shutdown (an EPIPE on a routine background write); the logger swallows pipe errors now and the cleanup timer can't throw out of its callback, so it survives the quit-and-reload race.

    We ran a specialist security pass over the whole batch — MCP server-command re-validation on update, a tighter subprocess environment allowlist, a dompurify bump — and it came back zero critical, zero high. Air-gap got extended to cover the Git AI features, so a commit-message or review request can't quietly call a cloud model and leak your diff when you've asked for no network. And we hid two placeholder bottom-panel tabs that did nothing but make the app look half-finished.

    None of that is a feature. All of it is why your next session doesn't fall over.

    The full changelog ships in the app under Settings → About. If you want to try it, grab the latest installer from the download page.

    Bug-fix TL;DR: backend port-3000 orphan on quit (fixed), backend EPIPE crash on reload (fixed), security review came back 0 critical / 0 high, air-gap now covers commit/PR/review, Claude Fast mode now actually takes precedence (it was being overridden), the Bodega Map renders Markdown instead of raw text and opens reliably on both layout engines, the fleet list refreshes the instant a background run finishes, per-hunk review is surfaced in the Agent panel, the queued-message indicator is a static dot instead of a spinner, the terminal Bodega agent no longer prints its greeting twice, and a pass of theme and legibility fixes across the Problems, Outline, Debug, and status-bar panels.

  16. Bodega One v1.0.0-beta.25 - out now

    beta.25 shipped today, and it's the one where Bodega really became an orchestration platform.

    Three things made it click:

    Agents can race each other now. Fleet Parallel fans one task out to multiple models in isolated worktrees, then QEL scores the diffs so you pick the winner. Watching three models attempt the same refactor and diffing the results is genuinely useful.

    You can bring your own agents. Through ACP, Cursor / Claude Code / Gemini CLI / Codex run inside Bodega's fleet — but routed through our own sandboxed filesystem and shell, so air-gap still holds. We didn't want to rebuild every agent; we wanted to conduct them.

    Your codebase talks back. We wired the dependency-graph view into the embedding index we were already building and turned it into a wiki you can interrogate — explain a file, document the repo, or just ask it questions grounded in your real code.

    Then the dogfooding caught the good bugs. Two favorites:

    → The Context Inspector was quietly applying the 64K context ceiling sized for our local GPU's VRAM to cloud models with 1M context. Three call sites hardcoded "local." (The agent was never actually capped; the display just lied.)

    → Typing "yo" in code mode got classified as a command, so the agent spent 12 iterations trying to "do" a greeting until it rate-limited itself.

    Before shipping we ran a 6-agent review fleet over everything — security, architecture, perf, tests, docs — in parallel. 6,203 backend + 1,682 frontend tests green.

    Still local-first, top to bottom. beta.26's already cooking.

  17. Bodega One v1.0.0-beta.24 - out now

    beta.23 was the layout reshape. beta.24 was supposed to be a quiet reach-and-polish pass: port one feature to one more engine, ship a first-run tour, done.

    Then we put the live build in front of a screen and actually used it. Four bugs fell out, three of them ship-blockers, none caught by a single unit test. One had been sitting in the codebase since beta.17. Here is the whole story.

    Local vision, now on llama.cpp

    beta.23 gave local text-only coding models a way to see images: the text model stays in the driver's seat, and when it needs to look at a screenshot it routes the question through a separately-bound local vision model in the background. You never leave your coding model. That shipped for Ollama. beta.24 ports the whole orchestration to llama.cpp, the managed engine Bodega installs and runs for you.

    The flow, end to end: you attach an image to a local text model, Bodega notices the active model can't read images but a VLM is bound, it hot-swaps the vision model in, answers your question against the image, and swaps your coding model back. One round trip, you stay in context. Cloud vision (Claude, GPT, Gemini) handles images natively and was never part of this. The architecture was already provider-agnostic from beta.23. llama.cpp was "just" the missing adapter.

    The model that deleted itself

    First smoke, fresh install. Onboarding downloads a small local model, everything green. Flip the engine preset from Ollama to llama.cpp to test the new path, and the model is gone. "llama.cpp not reachable."

    Root cause: the preset-switch cleanup routine wiped the onboarding-set model on the flip. It had been doing this since roughly beta.17, but every prior smoke happened to skip the onboarding model, so nobody ever flipped the preset with a freshly-downloaded model still pinned. The one path real users take on day one was the one path we had never exercised. The fix is small. The lesson is not: a bug that only fires on the exact first-run sequence is invisible to every test that starts from a configured state. You have to start from zero.

    Image input is not supported

    Model preserved, llama.cpp reachable. Attach a screenshot to the local text model and ask about it, and you get a raw HTTP 500: image input is not supported. The orchestration only fired when the model itself asked to look at an image via a tool call. A plain user attachment took a completely different path: it flowed inline, straight to the active text-only model, which has no idea what a PNG is. There was no guard that said "this model can't see, but a VLM is bound, divert."

    So we built one: a pre-loop check. Images present, active model is local text-only, a VLM is bound, route through the vision path instead of handing the image to a model that will choke on it. Confirmed it fires in both chat and code mode before the model call ever happens.
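
    A minimal sketch of that pre-loop guard, assuming the three conditions named above are all it checks. The type names, field names, and chooseRoute are made up for illustration; the real check is not shown in this log.

```typescript
// Hypothetical shapes for illustration only.
interface ActiveModel {
  id: string;
  isLocal: boolean;
  supportsVision: boolean;
}

interface TurnContext {
  imageCount: number;         // images attached to the user's message
  activeModel: ActiveModel;   // the model the user is talking to
  boundVlmId: string | null;  // bound local vision model, if any
}

type Route =
  | { kind: "direct" }
  | { kind: "vision-divert"; vlmId: string };

// Runs before the model call: divert only when all three conditions hold.
function chooseRoute(ctx: TurnContext): Route {
  const hasImages = ctx.imageCount > 0;
  const localTextOnly = ctx.activeModel.isLocal && !ctx.activeModel.supportsVision;
  if (hasImages && localTextOnly && ctx.boundVlmId !== null) {
    return { kind: "vision-divert", vlmId: ctx.boundVlmId };
  }
  return { kind: "direct" };
}

const coder: ActiveModel = { id: "local-coder", isLocal: true, supportsVision: false };
const diverted = chooseRoute({ imageCount: 1, activeModel: coder, boundVlmId: "local-vlm" });
const plain = chooseRoute({ imageCount: 0, activeModel: coder, boundVlmId: "local-vlm" });
```

    The sketch deliberately leaves out the "images present but no VLM bound" case, which the app handles with its own install card rather than a route.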

    The thrash

    Divert works. Now force a real cold swap, text model out, vision model in, and it intermittently 500s. The managed llama.cpp engine runs one model at a time, so a swap means terminating the outgoing process and spawning the new one. But the manager also runs a crash-recovery watcher: if the server process dies unexpectedly, it respawns it. You see where this goes. The manager terminated the outgoing model on purpose, its own watcher saw a dead process and called it a crash, respawned the model we were trying to replace, and raced the swap. Two processes fighting for one port. 500.

    The fix: detach the exit listener before the intentional terminate, so the manager does not mistake its own cleanup for a crash. After that, forced cold swaps ran single, clean transitions each way, every time.
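
    Roughly what that looks like, with a fake process object standing in for the real llama.cpp server so the sketch runs on its own. FakeProcess, EngineManager, and their methods are assumptions, not the actual manager's API.

```typescript
type ExitListener = () => void;

// Stand-in for a spawned server process.
class FakeProcess {
  private listeners: ExitListener[] = [];
  constructor(public readonly model: string) {}
  onExit(fn: ExitListener): void { this.listeners.push(fn); }
  offExit(fn: ExitListener): void {
    this.listeners = this.listeners.filter((l) => l !== fn);
  }
  // Simulates the process dying, whether killed on purpose or crashed.
  exit(): void { this.listeners.slice().forEach((fn) => fn()); }
}

class EngineManager {
  current: FakeProcess | null = null;
  respawns = 0;
  private watcher: ExitListener | null = null;

  start(model: string): void {
    const proc = new FakeProcess(model);
    const watcher: ExitListener = () => {
      // Crash recovery: an exit nobody asked for respawns the same model.
      this.respawns += 1;
      this.start(model);
    };
    proc.onExit(watcher);
    this.current = proc;
    this.watcher = watcher;
  }

  swapTo(model: string): void {
    const outgoing = this.current;
    if (outgoing && this.watcher) {
      // Detach first so the intentional terminate is not read as a crash.
      outgoing.offExit(this.watcher);
      outgoing.exit();
    }
    this.start(model);
  }
}

const manager = new EngineManager();
manager.start("text-model");
manager.swapTo("vision-model");   // intentional: watcher stays quiet
const respawnsAfterSwap = manager.respawns;
manager.current?.exit();          // unexpected: watcher respawns
const respawnsAfterCrash = manager.respawns;
```

    Without the offExit call, the swap would bump respawns and start the outgoing model again, which is the race described above.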

    The narration nobody could see

    Everything worked now, except the inline progress card that is supposed to show the swap happening never appeared. We only caught it by putting computer-use eyes on the live app and watching the actual pixels. The card was rendering fine. It was just sitting behind the full-screen overlay that pops for any llama.cpp swap. And that overlay fired twice per vision query: once for the swap in, once for the eager swap-back.

    Fix: tag every swap with a reason, manual vs vision. The full-screen overlay suppresses itself for vision swaps so the purpose-built inline card owns the screen. The progress emitter locks onto the first model and shows a friendly name, so the swap-back is not narrated as a confusing second event. This is the kind of bug no assertion finds. You have to look.

    How this got built

    Every one of those four surfaced from the same thing: sitting in front of the real binary and using it like a user would (fresh install, real local models, real screenshots), driven live through the app via computer-use. Unit tests verify what you thought to assert. Live smoke finds what you did not. Each fix was specced before code, reviewed by specialized agents at the merge gate (architecture, security, QA, in parallel), then live-re-verified against a forced cold swap before it shipped.

    The quieter wins that also landed: the guided-tour overhaul (a real walkthrough of Chat, Code, Fleet, the model picker, and settings), an in-app changelog in Settings, and a What's New popup trimmed to just the current release.

    By the numbers

    • 5,130 backend tests / 1,278 frontend (+168 / +154 since beta.23), zero regressions
    • SSE event count 21 to 22 (vision_swap_progress, the live swap narration)
    • Full local two-model vision pipeline working end to end on llama.cpp: forced cold swap each way, zero crash-misfires, the VLM read the test image correctly
    • Beta extended to November 1. The June 1 cutoff is gone.

    What did not ship (beta.25)

    The persistent "analyzed with X" summary that stays on the message after the swap completes. The swap narration fires correctly, the card just unmounts when it is done, and making it stick is a frontend lifecycle fix we would rather get right than rush. Multi-image and token-streaming for the vision path are queued behind it.

    Update through the app or grab the latest installer from the usual place.

  18. Bodega One v1.0.0-beta.23 - out now

    The biggest layout reshape since the IDE shipped. ~50 commits since beta.21. The headline: panels you can actually move.

    Dockview migration - drag any panel anywhere

    • Editor, Explorer, Terminal, Agent, Preview, Fleet, Outline - drag tabs by their headers, split panels in any direction, dock as tabs, reset with one click in Settings
    • A 5-zone overlay shows exactly where the drop will land (top / right / bottom / left / center) - no more "where did that go" guessing
    • Layout survives reload, crash, and project switch - schema-versioned, migration-guarded, hidden-state recovery built in
    • Legacy CSS-Grid layout is still there as a one-flip Settings escape hatch through beta.24

    Split editor - view two files side by side (Ctrl+\)

    • Toggle a second Monaco editor next to the first, each with its own tab bar, file icons, and breadcrumb trail
    • Click any open tab in the secondary pane to switch what's on the right - independent of the left
    • Each pane keeps its own Monaco model, so closing or switching files in one doesn't disturb the other
    • Same toggle wired to the command palette ("Toggle Split Editor") and the editor tab bar's split button

    Two-model VLM orchestration - vision for local text-only models (Ollama)

    • Cloud vision models (Claude 4.7, GPT-4o, Gemini) already handle images natively - nothing changes there
    • Local text-only coding models (DeepSeek, Qwen-Coder, Llama-3.3) couldn't see images. beta.23 fills that gap: the text model stays in the driver's seat and routes vision questions through a bound local VLM in the background
    • BoundVisionService auto-pairs the smallest installed VLM via a live PNG probe; air-gap mode + non-localhost Ollama refuse to bind
    • When no VLM exists, an inline card surfaces with one-click install - never echoes a model name, never silently fails
    • llama.cpp (the managed local engine) joins the orchestration in beta.24

    Polish that earned its keep

    • Settings opens as an editor tab in code mode (cleaner than the old overlay)
    • Direction B "Void Float" aesthetic - panels float on a darker canvas with subtle card lift
    • 3 composer dropdowns Portal'd so they no longer hide behind the editor
    • Cloud provider error classifier - HTTP 402 + "Insufficient Balance" + 6 sibling patterns now surface as payment_required with a direct billing link, across all cloud providers: OpenAI, Anthropic, DeepSeek, OpenRouter, Mistral, Together, Groq, Fireworks
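
    A rough sketch of the payment classifier from the last bullet. Only HTTP 402 and "Insufficient Balance" come from this entry; the six sibling patterns are not listed here, so the sketch leaves them out, and the type and function names are invented.

```typescript
type ProviderErrorKind = "payment_required" | "unknown";

interface ProviderError {
  httpStatus: number;
  message: string;
}

// Only the pattern named in the log; the real list has more entries.
const PAYMENT_PATTERNS: RegExp[] = [/insufficient balance/i];

function classifyProviderError(err: ProviderError): ProviderErrorKind {
  if (err.httpStatus === 402) return "payment_required";
  const matched = PAYMENT_PATTERNS.some((re) => re.test(err.message));
  return matched ? "payment_required" : "unknown";
}

const byStatus = classifyProviderError({ httpStatus: 402, message: "" });
const byMessage = classifyProviderError({ httpStatus: 400, message: "Insufficient Balance" });
const unrelated = classifyProviderError({ httpStatus: 500, message: "upstream timeout" });
```

    Matching on the message as well as the status matters because some providers report a billing problem under a generic 4xx code.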

    Under the hood

    • +73 frontend tests, +103 backend tests, no regressions
    • 6 god-file splits in one day (LLMService 816 → 640, WorktreeManager 913 → 669, ChatStage 525 → 434, etc.)
    • 16 rounds of live-smoke fixes after the tag (last-mile bugbash)

    Update through the app or grab the latest installer from the usual place.

  19. Bodega One v1.0.0-beta.21 - the agent can see your preview

    Open a localhost dev server in the Preview tab, ask any vision-capable cloud model "describe what you see" - it takes an actual screenshot of the rendered page and answers from pixels. Not from HTML - from what your users would actually see.

    What's new

    • PreviewInteractionTool - agent screenshot, DOM inspection, console capture, navigate, click.
    • Preview is a real editor tab now, not a bottom panel. Auto-opens when we detect a dev server in your terminal. URL persists per workspace.
    • llama.cpp vision support - LLaVA 1.6 Mistral 7B and Moondream2 in the catalog with paired mmproj download.
    • Welcome screen Send button actually works (chip + arrow = new session).
    • Ctrl+Alt+N for new session globally (Ctrl+N belongs to Monaco).
    • C/C++/Arduino syntax highlighting.
    • Vision chip on cloud + local VLMs in the model picker.
    • Grouped + collapsible model dropdowns (no more wall of names).
    • ~25 other smoke fixes from beta.20.

    About vision - read this

    The screenshot path works end-to-end with cloud vision models today:

    • Anthropic: Claude Sonnet 4.6, Claude Opus 4.7, Claude Haiku 4.5
    • OpenAI: GPT-4o, GPT-4o-mini, GPT-5 family
    • Google: Gemini 2.5 Pro, Gemini 2.5 Flash
    • Qwen cloud: qwen-vl-max

    Local 7B vision models (qwen2.5vl, LLaVA, Moondream) can SEE the image - the tool delivers the pixels correctly - but they can't reliably drive the agent loop at that size. They look at the screenshot, then forget to take the next action. Capability ceiling, not a bug.

    For now, if you want the screenshot loop on local, you're driving a 30B+ VLM (Pixtral, qwen-vl). The proper fix is in beta.22 ↓

    Auto-updater will pick it up. Restart Bodega and you're on beta.21.

  20. Bodega One v1.0.0-beta.20

    What shipped:

    Fleet - background agent sessions in isolated git worktrees. Send any session to fleet, it keeps running on its own branch while you work. Apply (merge/squash) or discard. 4 concurrent per project. Worktree GC, disk-pressure warnings, non-git degraded mode, discard confirmation. FleetCards show live activity ("Iter 5/13 - read style.css"), stale-amber timestamps, model badges, inline diff previews. TopBar + Agent panel status indicators.

    Sessions are now first-class persistent runs - survive SSE disconnect. Close panels, switch modes, navigate away, the loop keeps running. Re-attach with full mid-stream state. Auto-titled, browseable history.

    Multi-provider routing recovery - each session sticks to the provider it started with. No more dredged-up Ollama config when you switch to Kimi mid-chat. Two-tier picker (Cloud / Local → provider → model), mid-session switch warnings (warn, never block), Anthropic key in Settings, per-provider labeled key fields, catalog pre-warm.

    Auto-commit AI changes - every agent file write becomes a real git commit by Bodega AI. Apply dialog dry-runs the merge first; on uncommitted project edits it routes you to the Git panel with commit/discard guidance. Binary diff detection + pagination + conflict UX.

    Quality guards - multi-file QEL oscillation guard catches local models stuck rewriting the same file. Worktree path validation hard-aborts on env mismatch instead of silently leaking to the live project tree. Commit message sanitization, squash conflict-marker detection.

    Chat-mode agent tool gating fix - cross-session capability contamination was stripping tools off models that actually supported them. Symptom: ask the Chat-mode agent to do code work, get a hallucinated "I created the file" with no write. First fix after beta.19.

  21. Bodega One v1.0.0-beta.19 - out now

    13 PRs over 2 days. Real-user-driven fixes, structural cleanup, and the groundwork for the next release.

    From Cachev's beta testing:

    • Ollama timeout is now hardware-tier aware - minimal-tier machines (< 6 GB VRAM) get 360s for cold prefill instead of 180s
    • Ollama error classifier now recognizes the actual slow-load message so the UI doesn't bury it
    • Terminals open in your active project directory (every IDE does this, we finally do too)
    • Windows NSIS installer no longer renders dark text on a dark background

    Tool + provider polish:

    • file_system.read accepts offset/length params now - agents can paginate huge files
    • Featherless + DeepSeek-V3 native tool calling works end-to-end (XML override + strip-gating fix)
    • OpenAI payload capture diagnostic (dev-gated) - used to debug Qwen-cloud silent-tools issue
    • Multi-agent bug-hunt expedition surfaced 4 production fixes in one PR

    Under the hood:

    • ~2.15 MB shaved off the renderer bundle
    • Biggest god-file split in our history - SettingsService 915 → 335 lines, into 5 focused modules
    • Settings wiring audit - closed last unwired UI bindings (Monaco font default, Git AI audit trail toggle)
    • Settings → Safety → AI Audit Trail is now a real toggle
    • mac release CI fix (artifact-name collision)

  22. Bundle 5.42 MB → 1.55 MB, Featherless cold-start, IDE chat leak fix

    Going to start with the perf wins because they're the biggest user-visible change of the cycle.

    Main renderer bundle: 5.42 MB to 1.55 MB. A 71 percent reduction.

    Two paired fixes, neither would have worked alone.

    First fix was a tsconfig miss. The base tsconfig sets module to commonjs because the Electron main process needs CJS. The webpack-specific override was missing. So TypeScript was transpiling every import() call to a synchronous require(). Every React.lazy() in the codebase was a lie. The Monaco editor, the file tree, the terminal, the diff review, all the lazy-loaded panels were getting eagerly bundled into the main chunk. Adding "module: esnext" to tsconfig.webpack.json fixed the import() preservation. Bundle dropped to 2.11 MB.

    Second fix was Sentry. The crash reporter was being eagerly imported even when telemetry was disabled or air-gap mode was on. About 1.4 MB of Sentry code in the main bundle that almost never ran. Wrapped it in requestIdleCallback and a dynamic import so it loads after first paint, in its own async chunk. Bundle dropped to 1.55 MB.

    Also added perf instrumentation to the backend (endpoint timings, breadcrumb tracing, baseline metrics) so we have data to chase the next round of optimizations from.

    Featherless cold-start. Big multi-phase feature this cycle.

    Featherless is serverless inference. Their models live in a hibernated state when nobody's using them, and cold-starting a 70B model can take 30 to 60 seconds. Before this cycle, you'd send a message into a cold model, hit the timeout, see a red error, and have no idea what just happened.

    Phase 1: stretched the warmup timeout to 30 minutes and added an elapsed-time counter to the warming UI. At least you know it's working.

    Phase 2: full coordinator. State machine that tracks queued, requesting, loading, ready, and verified states per model. Dedicated SSE channel at /api/featherless/warmup/progress streams stage transitions and 10-second loading-progress ticks. Persistent WarmingBanner at the top of the chat that shows current stage and dismisses cleanly. SQLite persistence so the state survives restart. Send button is disabled while warming so you can't accidentally fire into a cold model.

    Layer 1 warmup. When you change the active model (Settings, Model Roles picker), Bodega fires a 1-token request in the background. By the time you actually send your first real message, the model is hot.

    Round 26 verified-warm. The banner doesn't just go away when the warmup probe succeeds. It stays until the first real chat completion goes through cleanly, because Featherless can lie about ready state (the probe at 1-token context can pass while a realistic 4K-context request still cold-starts).
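
    The coordinator's states and the verified-warm rule can be sketched as a small transition table. The five state names are from above; the event names and the strictly linear edges are assumptions, and the real coordinator may allow other transitions.

```typescript
type WarmupState = "queued" | "requesting" | "loading" | "ready" | "verified";
type WarmupEvent = "request_sent" | "provider_loading" | "probe_ok" | "chat_ok";

// Assumed transition table for illustration.
const TRANSITIONS: Record<WarmupState, Partial<Record<WarmupEvent, WarmupState>>> = {
  queued: { request_sent: "requesting" },
  requesting: { provider_loading: "loading", probe_ok: "ready" },
  loading: { probe_ok: "ready" },
  ready: { chat_ok: "verified" },
  verified: {},
};

function advance(state: WarmupState, event: WarmupEvent): WarmupState {
  // Events that do not apply leave the state unchanged.
  return TRANSITIONS[state][event] ?? state;
}

// The banner stays until a real completion succeeds, not just the probe.
function bannerVisible(state: WarmupState): boolean {
  return state !== "verified";
}

let warmup: WarmupState = "queued";
warmup = advance(warmup, "request_sent");
warmup = advance(warmup, "provider_loading");
warmup = advance(warmup, "probe_ok");
const visibleAfterProbe = bannerVisible(warmup);
warmup = advance(warmup, "chat_ok");
const visibleAfterChat = bannerVisible(warmup);
```

    Keeping "ready" and "verified" as separate states is what lets the banner survive a probe that passes while the model is still cold for realistic requests.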

    Warmup persistence queries now scoped by user_id per Sentinel audit.

    The IDE chat leaking into chat mode bug. The one that's been around for weeks.

    Sending a message in code mode would also write it to the chat session, and both panels would render the same conversation. Identical content in both modes. Joe screenshotted it across multiple test instances. Defied static analysis for hours.

    Both panels share the same useChat hook under the hood. One line in useChatSend used a nullish coalesce as a fallback: const currentSessionId = sessionId ?? activeSessionId.

    For chat mode's useChat instance: sessionId IS activeSessionId. The ?? never fires. Fine.

    For the IDE Agent panel's useChat with no code session yet: sessionId is null, so the ?? returns activeSessionId. That's the chat session's id. So the code-mode send wrote to the chat session in the database. The backend WebSocket broadcast then rendered it in chat. Meanwhile the optimistic addMessage rendered it in code. Same conversation in both panels.

    Took console.warn instrumentation on the slice setters with stack traces to pin it down. Once we saw setLocalMessages firing with sessionType=code and chat content, the call chain pointed straight at the ?? fallback.

    Fix is gating the fallback to sessionType chat only. Defense-in-depth check added on the WebSocket handler: now triple-checks sid against activeSessionId AND state.sessions AND not in state.codeSessions. Even if any future code path sets activeSessionId to a code session id, the type check structurally blocks the leak.
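
    Boiled down, the bug and the gate look something like this. The function names and signatures are illustrative, not the actual useChatSend code, and returning null for a code-mode send with no session is an assumption about how the caller then proceeds.

```typescript
type SessionType = "chat" | "code";

// Before: every caller fell back to the active chat session, so a code-mode
// panel with no session yet silently borrowed the chat session's id.
function resolveSessionIdBuggy(
  sessionId: string | null,
  activeSessionId: string | null,
): string | null {
  return sessionId ?? activeSessionId;
}

// After: only chat-mode callers may fall back to the active chat session.
function resolveSessionId(
  sessionType: SessionType,
  sessionId: string | null,
  activeSessionId: string | null,
): string | null {
  if (sessionId !== null) return sessionId;
  return sessionType === "chat" ? activeSessionId : null;
}

const leaked = resolveSessionIdBuggy(null, "chat-1");
const codeNoSession = resolveSessionId("code", null, "chat-1");
const chatFallback = resolveSessionId("chat", null, "chat-1");
const codeOwnSession = resolveSessionId("code", "code-7", "chat-1");
```

    The ?? operator itself did nothing wrong; the fallback value simply belonged to a different session type.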

    Provider switching cleanup. 5 backend services were doing legacy single-OpenAI lookups.

    The /llm/running-models poll was reading llm.openai_base_url for every cloud preset. If you used OpenAI then switched to Featherless or DeepSeek, it kept hammering api.openai.com every 3 seconds with no key or the wrong key. Now routes through the per-preset lookup helpers like the chat path already does. Same fix in four other backend paths: embedding, STT, test-connection, and the deferred chat-stream reconfigure.

    Stale role models on preset switch. We were only clearing 4 of the 11 role keys, so research, debug, and advisor panels could carry stale model names across a switch. Now clears all 11. Plus pre-clears the available-model list and refetches health right after the flip, so role pickers repopulate within 200ms instead of waiting for the next 30s health tick.

    Settings panel was zeroing out role model defaults on save. A previous perf optimization had reduced the settings prop to a 3-key subset for re-render perf, but the hydration code was reading the subset as if it were the full settings. populateFrom now reads the full snapshot inside the effect.

    Featherless WarmingBanner was flashing for every cloud preset switch (qwen, deepseek, etc.). Now only fires for Featherless, which is the only preset with actual cold-start cost.

    STOP READING FILES nudge. The agent was misclassifying "what are the contents of source-of-truth folder?" as a simple knowledge question because "what are" and "folder" weren't in the exploration intent classifier. DeepSeek looped on the contradiction for 440 seconds before producing anything. Enriched both VERBS and TARGETS lists. Regression tests locked in the prompts that hit this.

    DeepSeek raw function_calls XML. Anthropic-style plural "function_calls" form was leaking as visible text in chat mode because the stripper only knew about the singular "function_call". Fixed, along with the bare-word and empty-block variants.

    Qwen "/think" directive at the start of every response. Qwen via DashScope echoes its thinking-mode prefix at the start of every code-mode message. The stream stripper now silently eats it. Mid-stream "/think" is still treated as content so prose like "the /think directive" survives.
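
    One way to strip only a leading directive from a stream, sketched under the assumption that "/think" can arrive split across chunks. LeadingThinkStripper and stripStream are hypothetical names; the real stream stripper is not shown here.

```typescript
const DIRECTIVE = "/think";

class LeadingThinkStripper {
  private buffer = "";
  private decided = false;

  push(chunk: string): string {
    if (this.decided) return chunk; // past the start: everything is content
    this.buffer += chunk;
    // Still a prefix of the directive: hold the text back until we know more.
    if (this.buffer.length <= DIRECTIVE.length && DIRECTIVE.startsWith(this.buffer)) {
      return "";
    }
    this.decided = true;
    const held = this.buffer;
    this.buffer = "";
    // Require whitespace after the directive so "/thinking" is left alone.
    const isDirective =
      held.startsWith(DIRECTIVE) && /\s/.test(held.charAt(DIRECTIVE.length));
    return isDirective ? held.slice(DIRECTIVE.length).replace(/^\s+/, "") : held;
  }

  // Call when the stream closes to flush anything still held.
  end(): string {
    const held = this.buffer;
    this.buffer = "";
    this.decided = true;
    return held === DIRECTIVE ? "" : held;
  }
}

function stripStream(chunks: string[]): string {
  const stripper = new LeadingThinkStripper();
  return chunks.map((c) => stripper.push(c)).join("") + stripper.end();
}

const cleaned = stripStream(["/th", "ink\nHello", " the /think directive"]);
const untouched = stripStream(["Use the /think directive"]);
```

    Once the first decision is made the stripper becomes a pass-through, which is why a mid-stream "/think" survives.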

    Featherless DeepSeek-V3 emits python tool-call fences. The model writes its tool calls as python code blocks instead of using OpenAI native tool_calls. The stripper now removes them so you don't see broken python. Doesn't make the tool actually execute though, that's a deeper fix for tomorrow.

    Smaller stuff. Chat input was treating the "What can you help me with?" prompt as real input, so clicking in the middle and typing concatenated. Now it auto-selects the prefill so your first keystroke replaces. Retry button on error banners was passing a React SyntheticEvent to the retry handler as a model name. Wrapped properly. API key field looked unfilled when actually saved. Now shows "saved, paste to replace" and a green check hint.

    Code quality and refactors. We have a hard line limit on file sizes (700L services, 400L components, 300L route handlers). 6 files crossed limits this cycle and got split:

    • OpenAIProvider: 843L to 670L. Extracted model-cap and message-converter.
    • LLMService: 736L to 699L. Extracted preset-lookup helpers.
    • routes/llm.ts: 452L to 126L. Extracted health, warmup, and test-connection into sibling files.
    • ProviderCard: split out useProviderBaseUrl and useProviderApiKey hooks.
    • MyModelsTab: 445L to 395L. Extracted ModelRow and MultiModelVramWarning.
    • GuidedTourOverlay: 405L to 326L. Extracted tourTooltipPosition.

    configPath.ts also got a shared candidate builder extracted, to handle 4-level-deep service paths after install services moved into subdirs.

    Security audits this cycle: 4xx body.error XSS audit. Added a convention test that no error message from any 4xx body field is rendered as innerHTML. MCP OAuth 2.1 audit. Documented current state, gap analysis, effort estimate for full compliance. Sentinel LOW-2 cleanup. Replaced remaining String(err) patterns with the proper err instanceof Error check, matching the convention used everywhere else.

    Three quality gaps deferred to tomorrow, with notes:

    1. Qwen via DashScope doesn't invoke tools at all. The model claims folders don't exist instead of reading them. Backend log shows tools are declared and sent in the request, but DashScope's OpenAI-compat endpoint might be silently dropping them or wanting a different shape. Needs a captured network payload.
    2. Featherless DeepSeek-V3 tool execution. Today's fix stops the broken python output from leaking to the user, but the tool itself still doesn't run. Either force the prompt template to push XML format, or add a parser for python-style calls.
    3. file_system.read pagination. Tool result is capped at 16KB to prevent context blowout, so an 85KB file like our CHANGELOG can't be fully read. Adding offset and length params to the tool schema.
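
    For item 3, the offset/length idea can be sketched like this. It works on an in-memory string and counts characters rather than bytes, and readPage, readAll, and nextOffset are invented names, so treat it as the shape of the fix rather than the tool's actual schema.

```typescript
const MAX_RESULT_CHARS = 16 * 1024; // stand-in for the 16KB result cap

interface ReadPage {
  text: string;
  nextOffset: number | null; // null once the content is exhausted
}

function readPage(content: string, offset = 0, length = MAX_RESULT_CHARS): ReadPage {
  const start = Math.max(0, Math.min(offset, content.length));
  const size = Math.max(0, Math.min(length, MAX_RESULT_CHARS));
  const end = Math.min(start + size, content.length);
  return { text: content.slice(start, end), nextOffset: end < content.length ? end : null };
}

// An agent would keep calling with nextOffset until it comes back null.
function readAll(content: string): { pages: number; joined: string } {
  let offset: number | null = 0;
  let pages = 0;
  let joined = "";
  while (offset !== null) {
    const page = readPage(content, offset);
    joined += page.text;
    pages += 1;
    offset = page.nextOffset;
  }
  return { pages, joined };
}

const bigFile = "x".repeat(85 * 1024); // about the size of the CHANGELOG mentioned above
const paged = readAll(bigFile);
```

    Each call still respects the cap, so context blowout stays bounded while the whole file becomes reachable.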

    Pagination first tomorrow since it's the smallest scope. Good night.

  23. beta.18.1: a two-bug hotfix that became 26 rounds and 84 files

    Bodega One Code v1.0.0-beta.18.1 -- the full story

    This was supposed to be a quick two-bug hotfix this morning. By midnight it was 26 rounds, 84 files, ~3,800 lines, and a complete user-experience overhaul of the Featherless integration. Here's what happened and why each piece matters.

    The original two-bug hotfix

    1. Self-hosted providers were losing their custom Base URL. If you ran llama.cpp, LM Studio, vLLM, or any OpenAI-compatible local server on a non-default URL (or a different port), Bodega would let you type and save that URL -- but the moment you navigated away from the Models settings page, the value reverted to localhost. Discord users hit this within hours of beta.18 shipping.

    Root cause: the URL was being read from a generic llm.openai_base_url setting that every preset shared, but only the active preset's "Test Connection" button wrote to it. Switching presets on the menu wiped the previously-saved URL.

    Fix: a new lookupBaseUrlForPreset(presetId) helper generalizes the qwen/kimi region-override pattern so ANY OpenAI-compat preset can override its base URL via llm.<presetId>_base_url. The legacy single-key fallbacks are preserved so pre-Phase-2 setups don't break.
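
    In sketch form, assuming settings are a flat key/value map passed in explicitly (the real helper takes only presetId and reads settings itself). The preset ids "vllm" and "lmstudio" are placeholders, and built-in preset defaults are left out.

```typescript
type Settings = Record<string, string | undefined>;

const LEGACY_BASE_URL_KEY = "llm.openai_base_url";

function lookupBaseUrlForPreset(settings: Settings, presetId: string): string | undefined {
  // 1. Per-preset override: llm.<presetId>_base_url
  const perPreset = settings[`llm.${presetId}_base_url`];
  if (perPreset) return perPreset;
  // 2. Legacy shared key, kept so older setups keep working
  return settings[LEGACY_BASE_URL_KEY] || undefined;
}

const saved: Settings = {
  "llm.openai_base_url": "http://localhost:8080/v1",
  "llm.vllm_base_url": "http://10.0.0.5:9000/v1",
};

const vllmUrl = lookupBaseUrlForPreset(saved, "vllm");         // per-preset key wins
const lmStudioUrl = lookupBaseUrlForPreset(saved, "lmstudio"); // falls back to legacy key
```

    Because each preset writes to its own key, switching presets no longer overwrites a URL someone else saved.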

    2. No native preset for Featherless AI. Featherless's onboarding page was explicitly branded "for Bodega users" with rc_* API keys, but Bodega's preset list didn't include them. Users had to set them up as a Custom OpenAI provider with a typed-in base URL -- and once they did, they hit bug #1.

    Fix: Featherless added as a first-class preset with proper Bearer auth, https://api.featherless.ai/v1 default URL, the right setupTip pointing to featherless.ai/account, and dedicated llm.featherless_api_key storage. Wired into cloud-key validation, the V2 onboarding picker, the Settings → Cloud API Keys section, and the Cloud Boost provider picker.

    Then live testing happened

    I tested with my own Featherless key. The first thing it did was freeze.

    3. The 16,000-model freeze. Featherless's /v1/models endpoint returns every HuggingFace-mirrored model they host -- 16,275 entries on a free trial, more on paid plans. Our OpenAIProvider.listModels was trying to ingest all of them, ship per-model profiles for each in the /llm/health response, and let the model picker render a 16k-entry datalist. Result: Windows socket pool drained, /llm/health payload hit ~1MB, model picker would auto-select 000ADI/Qwen2.5-...-Gensyn-Swarm-grazing_locust (the alphabetically-first random fine-tune), and the renderer locked up.

    Three layers of fix:

    • capListedModels at the provider boundary caps the response at 500. When upstream exceeds the cap, models from a curated allowlist of foundation-model orgs (meta-llama, deepseek-ai, Qwen, mistralai, google, microsoft, NousResearch, HuggingFaceH4, CohereForAI, allenai, tiiuae, 01-ai, WizardLMTeam, moonshotai) are always retained; the rest fill remaining slots alphabetically.
    • /llm/health slimmed: when the model count exceeds 50, the response ships only model names. Per-model profiles are lazy-fetched via /api/models/:name/info on demand. Net JSON shrinks from ~1MB to ~10KB.
    • pickDefaultModelForPreset learned to prefer curated foundation models. CURATED_CLOUD_MODELS.featherless seeds 10 hand-picked entries (DeepSeek-V3-0324, Qwen 2.5-Coder-32B, Llama 3.1-70B-Instruct, etc.) so first-run users land on a real model.
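
    The capping rule from the first bullet, sketched with an abbreviated allowlist and a tiny cap so the behavior is easy to see. The signature and the orgOf helper are assumptions, and plain sort() (code-unit order) stands in for "alphabetically".

```typescript
const MODEL_CAP = 500;

// Abbreviated; the bullet above names the full set of orgs.
const CURATED_ORGS = new Set(["meta-llama", "deepseek-ai", "Qwen", "mistralai"]);

function orgOf(modelId: string): string {
  const slash = modelId.indexOf("/");
  return slash === -1 ? "" : modelId.slice(0, slash);
}

function capListedModels(ids: string[], cap: number = MODEL_CAP): string[] {
  if (ids.length <= cap) return ids.slice();
  const sorted = ids.slice().sort();
  const curated = sorted.filter((id) => CURATED_ORGS.has(orgOf(id)));
  const rest = sorted.filter((id) => !CURATED_ORGS.has(orgOf(id)));
  // Curated models are always kept; the rest fill whatever room is left.
  const room = Math.max(0, cap - curated.length);
  return curated.concat(rest.slice(0, room));
}

const upstream = [
  "zed/model-z",
  "000ADI/finetune",
  "deepseek-ai/DeepSeek-V3-0324",
  "abc/model-a",
  "Qwen/Qwen2.5-Coder-32B-Instruct",
];
const capped = capListedModels(upstream, 3);
```

    Note that "always retained" means the result can exceed the cap if the curated set alone is larger than it; this sketch accepts that rather than trimming curated models.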

    4. The 60fps re-render loop. After a Featherless reconfigure, the renderer would suddenly hit 49% sustained CPU. Profiler audit traced it to a feedback loop: /llm/health → setModelProfiles({...spread}) → every Zustand subscriber re-renders → one of those re-renders triggers another /llm/health → loop.

    Fix: removed the lazy-fetch from useLLMHealthCheck. Empty-dict guards added to setModelProfiles and setRecommendedSettings so an empty {} (which the slim path sends) doesn't trigger spurious re-renders. CPU dropped from 49% to 1.3%.

    5. The whole-settings selector cascade (BUG-DM-15). Nine components were subscribing to the entire settings object via useStore((s) => s.settings). Any change to any of the 50+ settings keys re-rendered all of them. Refactored to per-key selectors so each component only re-renders when the keys it actually reads change. ~10x reduction in re-renders during typical settings churn.

    6. The SQLite race conditions. Two distinct races:

    • SettingsService.setMany was firing concurrent BEGIN statements during onboarding, hitting "cannot start a transaction within a transaction." Fix: serialized via internal queue.
    • Cross-service: SettingsService and MessageService could both try BEGIN at the same time. Fix: BEGIN-retry wrapper that catches the race and retries with backoff.

    7. Eight Featherless models were OAuth-gated. Live API testing revealed that 8 of the 10 originally-curated Featherless models returned 403 with model_gated_needs_oauth -- they require HuggingFace account-linking that Bodega can't do. meta-llama/* (every Llama model) and google/* (every Gemma model) were affected.

    Fix: rewrote the curated list with verified-working IDs only (DeepSeek-V3-0324, Qwen 2.5-72B-Instruct, Qwen 2.5-Coder-32B, Kimi-K2, Hermes-3-Llama-3.1-70B, etc.) and added an OAUTH_REQUIRED_HF_ORGS filter at the boundary so meta-llama and google models never reach the dropdown at all.

    Then the silent-fail bug surfaced

    This is the one that took the longest to crack. After onboarding completed, the user would press Enter on the prefilled "What can you help me with?" message, the composer would clear, and absolutely nothing else would happen. No error banner, no chat session in the sidebar, no response. The send was reaching the backend, the session was being created, but the user saw silence.

    Twenty rounds of progressively-deeper retry mechanics shipped throughout the day. Each round helped a little. None of them actually fixed it.

    The root cause turned out to be embarrassingly simple: ChatErrorBanner only existed in the active-chat layout, never in the empty-state ChatGreeting. When the first send failed (because Featherless's cold-start blocked the backend's event loop while parsing the 500-model list), the error fired correctly -- but it had nowhere to render. The user saw nothing because there was no UI surface to put the error on.

    Fix: the banner now renders in both empty + active states. The error has somewhere to go. The user sees a clear "Request timed out -- the model may still be loading. Try again in a moment." with a Retry button instead of mysterious silence.

    Sub-fixes that landed alongside:

    • Code mode's ErrorBanner was rendering errors raw ("signal timed out") because it diverged from chat mode's ChatErrorBanner which used formatErrorMessage. Wired through.
    • Express keepAliveTimeout bumped from Node's 5s default to 65s. The 5s default caused Chromium's keep-alive socket pool to try reusing FIN-acked sockets during the post-onboarding settle, silently hanging follow-up requests.
    • Iteration-cap warning footer ("Reached the iteration limit. Response may be incomplete.") was appearing on pure-text conversational answers in code mode. Now only appended when the model actually used tools.
    • Toast on first retry ("Connection slow -- reconnecting...") so the previously-silent 15s window gives visible feedback.

    Then the cold-start UX problem

    Even with the banner visible, Featherless's serverless cold-start was 30 seconds to 5 minutes on a busy night. Users would see "Request timed out", click Retry, see "Request timed out" again, click Retry again. The Retry was working but Featherless wasn't responding fast enough to feel like the app worked.

    The proper architectural fix is to move LLM calls off the backend's main event loop into a worker thread (planned for beta.19, ~3-5 days). For tonight, two layers of mitigation:

    Layer 1 -- Pre-warm on onboarding. The moment the user finishes cloud onboarding (5-10 seconds before they actually press Enter), fire a 1-token completion request to the chosen model. Featherless spins up its GPU during the welcome-screen seconds. By the time the user types and hits Enter, the model is warm.

    New backend endpoint: POST /api/llm/warmup -- fire-and-forget, returns 202 immediately, runs the warmup in the background. New frontend hook useModelWarmup watches activeRoutedModel in the store and re-fires warmup whenever the user picks a different model from the dropdown or pins a new role-model in Settings. 60s same-model dedup so we don't spam Featherless.
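
    The 60s same-model dedup could be as small as this. WarmupDeduper is a made-up name, and the injected clock is only there so the window can be exercised without waiting.

```typescript
const DEDUP_WINDOW_MS = 60_000;

class WarmupDeduper {
  private lastFired = new Map<string, number>();

  constructor(private readonly now: () => number = () => Date.now()) {}

  // Returns true when a warmup request should actually be sent.
  shouldFire(modelId: string): boolean {
    const t = this.now();
    const prev = this.lastFired.get(modelId);
    if (prev !== undefined && t - prev < DEDUP_WINDOW_MS) return false;
    this.lastFired.set(modelId, t);
    return true;
  }
}

let fakeNow = 0;
const deduper = new WarmupDeduper(() => fakeNow);
const firstFire = deduper.shouldFire("model-a");   // fires
fakeNow = 30_000;
const repeatFire = deduper.shouldFire("model-a");  // suppressed, inside the window
const otherModel = deduper.shouldFire("model-b");  // fires, different model
fakeNow = 61_000;
const afterWindow = deduper.shouldFire("model-a"); // fires again
```

    Suppressed calls do not reset the timestamp, so flipping back and forth in the dropdown cannot hold a model's warmup off indefinitely.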

    Layer 2 -- Backend health cache + renderer health-poll pause. /llm/health now caches its response for 5 seconds with in-flight dedup. Coalesces the burst of health calls during onboarding (Providers tab + FIM + Embeddings + the main poll all fire on mount) plus the 30s steady-state poll. Cache key includes preset so reconfigure invalidates implicitly.
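
    A sketch of a TTL cache with in-flight dedup, assuming it is enough to cache the promise itself. HealthCache and its fetcher callback are illustrative, and here the 5 seconds count from when the request starts, which may differ from the real cache.

```typescript
const TTL_MS = 5_000;

interface CacheEntry<T> {
  key: string;
  at: number;
  value: Promise<T>;
}

class HealthCache<T> {
  private entry: CacheEntry<T> | null = null;

  constructor(
    private readonly fetcher: (preset: string) => Promise<T>,
    private readonly now: () => number = () => Date.now(),
  ) {}

  get(preset: string): Promise<T> {
    const e = this.entry;
    // Same preset and still fresh: reuse the promise, settled or in flight.
    if (e && e.key === preset && this.now() - e.at < TTL_MS) return e.value;
    const value = this.fetcher(preset);
    const fresh: CacheEntry<T> = { key: preset, at: this.now(), value };
    this.entry = fresh;
    // Do not keep a failed response around for the rest of the TTL.
    value.catch(() => {
      if (this.entry === fresh) this.entry = null;
    });
    return value;
  }
}

let clockMs = 0;
let fetchCalls = 0;
const healthCache = new HealthCache<string>(
  (preset) => {
    fetchCalls += 1;
    return Promise.resolve(`ok:${preset}`);
  },
  () => clockMs,
);

const firstCall = healthCache.get("featherless");
const burstCall = healthCache.get("featherless"); // coalesced with the first
const callsAfterBurst = fetchCalls;
clockMs = 6_000;
const expiredCall = healthCache.get("featherless"); // TTL passed: refetch
const otherPreset = healthCache.get("ollama");      // new key: implicit invalidation
const callsAtEnd = fetchCalls;
```

    Storing the promise rather than the resolved value gives both behaviors from one field: concurrent callers share the pending request, and later callers inside the window get the settled result.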

    useLLMHealthCheck skips the 30s poll while a chat or agent stream is active. Eliminates the "Cannot reach Featherless" yellow banner flickering mid-chat that previously fired every 30s when the backend's event loop was busy awaiting Featherless's response.

    Round 26 -- The persistent warming banner. Even with Layer 1 + 2, the cold-start window was still invisible. Users saw nothing happen, didn't know if the app was broken or just waiting. The transient Cannot reach Featherless banner flickering on/off REINFORCED the broken perception.

    A new persistent banner sits between TopBar and the mode layout: "• Warming up DeepSeek-V3-0324 -- first send may take 30-90 seconds while the model loads on the provider's GPU." It stays up from the moment the warmup fires until either /llm/health returns connected or a chat completion succeeds. The composer stays enabled. The user knows the truth.

    Then the security audit

    A Sentinel agent was dispatched in parallel with the live testing to audit the day's changes. It reported three HIGH-severity findings, all in the SSRF guard added for BUG-DM-18:

    • IPv6-mapped IPv4 (::ffff:127.0.0.1) wasn't being matched. Node's URL constructor returns [::ffff:7f00:1] for that host, which the original isPrivateHost regex didn't catch.
    • IPv6 ULA (fc00::/7) and link-local (fe80::/10) ranges weren't checked at all.
    • Trailing-dot FQDN form (localhost.) bypassed the exact-string match.

    Fixed all three plus added a length cap on the warmup endpoint (was logging unbounded user-controlled strings via pino).
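
    For illustration, a host check that covers the three bypasses might look like the sketch below. It is not the shipped isPrivateHost: the function body and the exact set of IPv4 ranges are assumptions, and it expects the hostname as the URL parser reports it (IPv6 in brackets). It does not address DNS rebinding.

```typescript
// Hypothetical sketch of a private-host check; ranges shown are a common subset.
function isPrivateHost(rawHost: string): boolean {
  // Lowercase, drop trailing dots ("localhost." -> "localhost"), drop IPv6 brackets.
  let host = rawHost.toLowerCase().replace(/\.+$/, "");
  if (host.startsWith("[") && host.endsWith("]")) host = host.slice(1, -1);

  if (host === "localhost" || host.endsWith(".localhost")) return true;

  // IPv4-mapped IPv6, dotted form (::ffff:127.0.0.1) -> plain IPv4.
  const dotted = /^::ffff:(\d{1,3}(?:\.\d{1,3}){3})$/.exec(host);
  if (dotted) host = dotted[1];

  // IPv4-mapped IPv6, hex form (::ffff:7f00:1) -> plain IPv4.
  const hex = /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/.exec(host);
  if (hex) {
    const hi = parseInt(hex[1], 16);
    const lo = parseInt(hex[2], 16);
    host = [hi >> 8, hi & 0xff, lo >> 8, lo & 0xff].join(".");
  }

  const v4 = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (v4) {
    const a = Number(v4[1]);
    const b = Number(v4[2]);
    return (
      a === 0 || a === 10 || a === 127 ||
      (a === 172 && b >= 16 && b <= 31) ||
      (a === 192 && b === 168) ||
      (a === 169 && b === 254)
    );
  }

  if (host.includes(":")) {
    if (host === "::1" || host === "::") return true;
    const first = parseInt(host.split(":")[0] || "0", 16);
    if ((first & 0xfe00) === 0xfc00) return true; // ULA, fc00::/7
    if ((first & 0xffc0) === 0xfe80) return true; // link-local, fe80::/10
  }
  return false;
}
```

    Converting mapped addresses back to IPv4 first means the IPv4 rules only have to be written once.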

    The same Sentinel pass also verified that the fixes for BUG-DM-16 (prefix-match boundary in pickDefaultModelForPreset -- was matching Qwen2.5-7B against Qwen2.5-7B-Vision-Instruct instead of Qwen2.5-7B-Instruct-FP8) and BUG-DM-17 (length cap on /api/models/:name/info path param) close their respective holes.

    30 new netGuards unit tests + 9 new cloud-key-validate integration tests cover the bypasses.

    Then the Models tab UI polish

    Live testing surfaced two cosmetic issues:

    The "Search models..." input at the top of the Models tab was rendering on all three sub-tabs (Discover, My Models, Providers). Useful on Discover (you're browsing a catalog). Redundant on My Models (the eight inline role pickers ARE search inputs with shared autocomplete). Useless on Providers (at most a dozen presets). Now only renders on Discover.

    When you picked a model from the Default role picker (the only <input list>-based picker in My Models), the input box turned white -- Chromium applies a :-webkit-autofill background highlight on inputs that get a value via native datalist autocomplete, and our dark theme hadn't overridden it. Fixed with the standard inset box-shadow CSS trick that overrides the autofill background.

    What didn't make it

    • Worker-thread refactor for LLM calls -- the proper architectural fix for the cold-start UX issue. Moves LLMService and providers into a worker_threads Worker so the Express main thread stays responsive during LLM calls. Eliminates the entire class of "backend looks dead during chat" bugs. ~3-5 days, planned for beta.19.
    • File splits -- LLMService.ts (863L), useFirstRunMachine.ts (663L), ProviderCard.tsx (465L) are all over their respective limits. Beta.19.
    • Warmup-debounce -- useModelWarmup currently fires twice during onboarding because of a transient state during the reconfigure cycle. Wastes one Featherless request per onboard. Trivial fix (~10 lines), beta.19.
    • Optimistic user-message-shows-immediately in empty state -- currently the typed text disappears the moment Send is pressed, before the new chat session UI renders. Should show in the chat area immediately. Filed for beta.19.

    Why it took so long

    The root-cause fix for the silent-fail bug (banner-in-empty-state) was a 30-line change. It took ~10 hours to get there because the symptom looked like a network/race issue -- POST timing out, optimistic message reconciling wrong, abort signal firing, fetch never reaching the backend. Twenty rounds of retry mechanics, keep-alive tweaks, in-flight dedup, and abort handling each helped a little but didn't fix it. The actual cause was upstream of all that: there was no UI surface to render the error on.

    Web research finally gave the angle to look at: "what if the error IS firing and we just can't see it?" Tracing render trees instead of network paths landed the fix in 30 minutes after that.

    The lesson, if there is one: when twenty rounds of patches each seem to almost-work but don't quite, the failure mode probably isn't where you think it is. Stop patching and trace the actual symptom path from the bottom up.

    That's beta.18.1. Beta.19 starts tomorrow with the worker-thread refactor.

  24. beta.18: V2 Cloud APIs + two ship-blockers caught live

    Update on today's bug fixing: found a lot of issues with cloud providers, specifically DeepSeek, Qwen, and Kimi. Fixes are in progress as we speak, followed by sanity testing before shipping beta.18. Be on the lookout for an update; I'll try my best to ship all these fixes today.

    Beta.18 is now live. Below is the full changelog for all work done today.

    V2 Cloud APIs -- what shipped

    Per-provider keys. Every cloud provider has its own settings field now. Switching presets keeps each provider's key -- flip from DeepSeek to Mistral to OpenAI without re-pasting anything. 13 providers wired: OpenAI, Anthropic, Gemini, OpenRouter, Azure, Mistral, Cohere, DeepSeek, Fireworks, Groq, Together, Qwen, Kimi.

    Cloud API onboarding flow. New "Cloud API" path in first-run leads to an 11-provider grid → per-provider key entry → validation → first chat. Region toggles for Qwen + Kimi (international vs China). Resource hostname for Azure. Mirrors how llama.cpp + Ollama install patterns work.

    Per-message cost tracking. Every cloud BYOK response now ships with a $0.0091-style badge next to message metadata. Click for input/output token split. Settings → Cloud API Keys → Spend summary shows session / today / this-month columns plus per-provider breakdown. Pricing tables verified 2026-05-08 for all 13 providers (DeepSeek $0.07/$0.28 flash, Anthropic Sonnet $3/$15, OpenAI gpt-4o $2.50/$10, etc.).

    Optional API key field for local providers. LocalAI behind a reverse proxy, llama-server with --api-key, LM Studio with auth -- every OpenAI-compatible local preset now exposes the key field, labeled "optional" for local and "required" for cloud.

    Agentic loop -- two ship-blocker bugs fixed live

    Bug 1: /think prefix sent to non-Qwen models. A previous change extended Qwen3's /think slash-prefix toggle to the DeepSeek family on the assumption R1+ uses the same toggle. It does not -- DeepSeek's thinking is automatic via inline <think> blocks. Result: every iteration prepended /think\n\n, the model parsed it as an unknown user command, and refused with "/think is not a registered command in this environment." 20-iteration refusal loop, $0.034 burned, no answer. Fixed: Qwen-only, with regression tests.
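
    The shape of the Qwen-only gate, as a sketch. The function name and the name-based family check are assumptions; the real code may detect the family differently.

```typescript
// Hypothetical sketch: only Qwen-family models get the /think slash prefix.
function withThinkPrefix(modelName: string, prompt: string, thinking: boolean): string {
  const isQwen = /qwen/i.test(modelName); // assumed family detection by name
  return thinking && isQwen ? `/think\n\n${prompt}` : prompt;
}
```

    DeepSeek-family models fall through unchanged, since their thinking is automatic and needs no prefix.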

    Bug 2: Over-eager-tool-use nudge fired on legitimate exploration. Two classifiers disagreed: RuntimeLayer correctly routed "explain this repo" to the full lane, but NudgeOrchestrator's simpler heuristic flagged it as a knowledge question and emitted "STOP READING FILES. Answer from training knowledge." Qwen3 quoted the contradiction back to the user as a 5-point rebuttal. Fixed: both nudges now route through RuntimeLayer.isExplorationIntent first, single source of truth.

    New: mid-loop reasoning visibility. ThinkingDisclosure was nested inside the streaming-response block, so it only showed AFTER the model started emitting final-response content. During tool-calling iterations the panel went dark. New render path surfaces reasoning during tool-calling phases. The same change would have caught both today's bugs visually within seconds.

    Plus: time-aware reassurance copy at 15s/60s/120s thresholds. Path context in tool rows (Listed agentic → Listed .../services/agentic). Cross-mode state leaks (chat ↔ code) closed for tool approvals, clarifications, and plan approvals.

    Also in beta.18

    Reasoning persistence (Phase 2J complete). Cloud thinking-mode models -- DeepSeek R1, Qwen3, Kimi K2 thinking -- now have reasoning blocks persisted to the DB AND threaded through the agentic loop's in-memory message array. Pre-fix, second tool-call iteration would 400 with "reasoning_content must be passed back."

    Smarter iteration cap. New isExplorationIntent classifier looks for an exploration verb (explain / describe / trace / how does / etc.) combined with a codebase target (repo / module / architecture / flow). "Explain this repo" → up to 25 iterations. "Pick a number 1-9" → simple-task fast path. The previous heuristic gave both prompts the same 8-iteration cap.
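
    A rough sketch of the verb-plus-target check. The regexes only include the words listed above (the real lists are longer, per the "etc."), and the standalone function here is illustrative rather than the RuntimeLayer implementation.

```typescript
// Hypothetical sketch: exploration intent = exploration verb AND codebase target.
const EXPLORATION_VERBS = /\b(explain|describe|trace|how does)\b/i;
const CODEBASE_TARGETS = /\b(repo|module|architecture|flow)\b/i;

function isExplorationIntent(prompt: string): boolean {
  // Both halves must match; a verb alone ("explain recursion") is a knowledge question.
  return EXPLORATION_VERBS.test(prompt) && CODEBASE_TARGETS.test(prompt);
}
```

    Requiring both halves is what separates "Explain this repo" from "Explain recursion".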

    Qwen iteration thinking-only loop fix. Qwen models occasionally emit reasoning_content for an entire iteration without producing content or tool calls. The loop used to retry on empty content. Now recognizes the pattern as legitimate progress.

    FIM model routing. Default fim.mode flipped to 'off' (was 'auto'). Strict per-preset model allowlist prevents Mistral models from being mis-routed to DeepSeek FIM endpoints. Local llamacpp / Ollama users still have free choice.

    DeepSeek v4 + Kimi K2 model profiles updated. DeepSeek v4-flash (131k/16k, ~50B MoE) and v4-pro (131k/16k, ~671B MoE) replace the deprecated deepseek-chat / deepseek-reasoner aliases (which hard-error 2026-07-24). Kimi K2 temperature locked to 1.0 per Moonshot's docs -- slider disables for locked models.

    Cross-mode state leaks closed. Tool approvals, clarification cards, and plan approvals now scope by sessionId so a pending action from a code-mode session can no longer surface in chat-mode and vice versa.

    How to update

    If you're already running beta.17:

    Bodega checks for updates automatically on launch. Or trigger manually: Settings → About → Check for updates.

    The auto-updater pulls the new build, verifies the signature, and prompts you to restart. Your settings, sessions, and API keys are preserved across the update.

    Fresh install or you'd rather download manually: https://bodegaone.ai/download

    Direct installers for Windows, macOS, and Linux. The site auto-detects your platform.

    On the BYOK migration: First launch on beta.18 runs a one-shot migration that copies your existing llm.openai_api_key to the active provider's per-provider key (only when the destination is empty). Cloud Boost retains its key. No existing setups lose access. The migration runs once per user, gated so it can't double-fire.
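
    The migration rules (copy only into an empty destination, run once) can be sketched as below. The per-provider key name and the done-flag name are hypothetical; only llm.openai_api_key appears in the entry.

```typescript
// Hypothetical sketch of a one-shot, gated key migration over a settings map.
type Settings = Map<string, string>;

const DONE_FLAG = "migrations.byok_per_provider_done"; // assumed flag name

function migrateLegacyKey(settings: Settings, activeProvider: string): boolean {
  // Gate: once the flag is set, the migration never runs again.
  if (settings.get(DONE_FLAG) === "1") return false;

  const legacy = settings.get("llm.openai_api_key") ?? "";
  const destKey = `llm.providers.${activeProvider}.api_key`; // assumed key layout
  const destEmpty = (settings.get(destKey) ?? "") === "";

  let copied = false;
  if (legacy !== "" && destEmpty) {
    settings.set(destKey, legacy); // never overwrite a key the user already entered
    copied = true;
  }
  settings.set(DONE_FLAG, "1");
  return copied;
}
```

    The flag is written even when nothing was copied, so a later key change cannot re-trigger the migration.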

    If you previously had Cloud Boost + a separate cloud provider configured, both still work -- they're now architecturally separated at the storage layer so you can run, say, DeepSeek as primary + OpenRouter as boost simultaneously.

    By the numbers

    • 13 PRs merged between beta.17 (May 7) and beta.18 (May 8)
    • Backend tests: 3,918 passing (+146 over beta.17)
    • Frontend tests: 525 passing (+64 over beta.17)
    • All tsc, ESLint, webpack clean
    • 2 ship-blocker bugs caught in pre-tag live verification -- both reproduced, both fixed with regression tests, both validated working post-fix
    • Live test: DeepSeek-v4-flash on "Explain this repo structure to me" → 2,556 output tokens at 103.2 tok/s for $0.0091. Pre-fix: 20-iteration refusal loop, $0.034, no answer. 73% cheaper, 25s vs 7+ minutes.
  25. One-click llama.cpp + Ollama setup, ~25 bugs squashed in beta.17

    One-click setup for both major local providers.

    If you've never touched a local model before, Bodega will install one for you. Pick "Set up llama.cpp" or "I'll install Ollama instead" on first launch. Bodega downloads the binary, verifies the SHA256 against the official release, runs the installer (UAC silent on Windows), waits for the service to come online, then drops you into a curated model picker sized for your hardware. You go from clicking the .exe to chatting with a local model in about 90 seconds. No terminal commands, no editing config files, no leaving the app.

    If you already have Ollama installed but the service is stopped, Bodega notices and just starts it instead of re-downloading 2 GB. Smart skip.

    ~25 bugs squashed between yesterday's beta.16 and today's beta.17:

    • Code mode broken on llama.cpp (tools weren't reaching the model). Fixed at the settings layer.
    • Modal dismissed before the model finished loading, leaving a confusing "no provider" banner. Now waits properly.
    • Internal model record IDs (managed-1778...) leaked into the UI in 5+ places. Replaced everywhere with the friendly filename.
    • Rate limiter was throttling localhost calls during onboarding, causing cascading "Catalog failed: Too Many Requests". Fixed.
    • Catalog downloads cancelled when you switched tabs. Now persist across navigation.
    • Guided tour rewrite: spotlight tracks animations, auto-flips when there's no room, opens panels for you instead of silently skipping.
    • High-VRAM users got auto-pulled a 17 GB model with no choice. Now you pick.
    • Shell injection hardening in the installer paths (caught by our pre-ship security review).
    • Plus a dozen smaller polish items. See the full changelog.

    Hit a bug? Help us help you.

    Settings → About → Export Diagnostics Bundle. One click, saves a redacted text report with your app settings, system info, and recent logs. API keys and secrets are stripped before the file gets written. Safe to paste into a support thread here.

  26. Packaging update: a few more days before beta ships

    Sorry for the delay. During packaging and Mac certification, quite a few bugs surfaced. We have made a lot of progress and are wrapping up final testing so we can be confident the beta is in a good state to ship. We ask for a bit more patience; it should be released within the next couple of days. Thank you again.

  27. T-2 beta hardening: troubleshooting docs, NSIS audit, unsigned binary math

    T-2 days to the May beta. Spent today running the final-push hardening pass -- every place users could hit a wall on day one.

    Two troubleshooting guides shipped. A short one in the public release repo for people who land on the GitHub release page directly (SmartScreen, install.ps1 ExecutionPolicy, AV quarantine, Mac Gatekeeper, AppImage chmod, FUSE2, sandbox helper, auto-update edge cases). And a comprehensive one for the website covering everything else: provider setup across Ollama / llama.cpp / LM Studio, model pull failures, Cloud Boost auth, BYOK keys, agentic-loop edge cases, FIM autocomplete, air-gap mode, workspace sandbox, settings + logs + factory reset, beta expiry, telemetry. Roughly 50 distinct failure modes documented, each grounded in actual code paths instead of guesses.

    NSIS installer audit caught three real bugs. Ran a real local Windows build instead of trusting the config. Three things would have shipped broken:

    1. The installer never produced a .exe. Two NSIS macro hooks (customUnWelcomePage, customUnInstFilesPage) aren't actually exposed by electron-builder, so the dark-theme functions they referenced were dead code. NSIS escalated the unreferenced-function warning to error and killed the build before the .exe stage. Would have been a silent CI fail.

    2. package.json:homepage still pointed at my old personal handle on the now-private source repo. NSIS bakes that URL into the uninstaller's "About / Help / Updates" metadata. Every end user would see it. Flipped to bodegaone.ai.

    3. The portable zip target had its filename config in the wrong block. Portable config existed but portable wasn't in the win.target array, so the actual zip target was producing a version-suffixed filename with spaces in it. The branded /download/portable URL would have 404'd. Two-line fix.

    All three caught because I built locally. The CI was happy. Configs lie; outputs don't.

    One thing to be upfront about: the Windows beta ships unsigned. SmartScreen will throw a warning the first time anyone runs the .exe. That's expected, not malware. Two paths around it:

    • PowerShell one-liner (irm https://bodegaone.ai/install.ps1 | iex) -- runs in-memory, calls Unblock-File post-extract, doesn't trip SmartScreen. This is the headline install method.
    • NSIS installer -- works fine, just needs "More info → Run anyway" the first time.

    The reason it's unsigned: Microsoft changed the EV cert rules in 2024 -- paying for an EV cert no longer gives you instant SmartScreen pass-through. So a $300/yr cert buys you nothing the PowerShell path doesn't already give for free. We'll sign for v1.0 GA when there's a real signal premium for it. Mac signing is paid + pending Apple Dev approval (~24h ETA), so notarized DMGs come online with the next tag. Linux AppImage isn't signed because Linux doesn't really care.

    On deck for tomorrow. Apple Dev cert approval should land. Patch the security findings, build the beta-expired screen, swap the eager imports for React.lazy, run a final swarm pass on the patched code, then it's tag time.

    The pattern with these final-push days is the same: the code can pass every test, every type-check, every lint, and still ship with a half-dozen bugs that only surface when a real user touches it for the first time. That's the gap the swarm exists to close. Catching three NSIS bugs and a credentials leak in one morning is not a flex -- it's what should happen before a binary goes public.

    Building in public means showing the cleanup, too.

  28. 60+ commits, the last god file, and two bugs the cleanup exposed

    One day, 60+ commits, the biggest structural cleanup the codebase has seen. Also found two real bugs that were hiding in plain sight: one silently dropping messages on every cloud call, one quietly capping everyone's context window to 32K regardless of their hardware. Both had shipped and been in production for weeks. Both fixed.

    Breaking this into the three things that actually happened:

    1. Killed the last god file

    AgenticChatService.ts was 1,386 lines. The 700-line service limit has been enforced across the codebase for months. This file kept getting a pass because "it's the orchestrator, orchestrators are allowed to be big." That stopped being true today.

    The split: 7 new modules under services/agentic/. RequestNormalizer owns input validation and intent classification. ProviderResolver resolves the LLM provider, model, and boost-routing decision. AgenticLoopSetup builds everything the loop needs before iteration 0 -- messages, tools, policy, contract, skill context, telemetry. AgenticLoopCoordinator runs the actual loop. Branches A (native tool calls) and B (XML-extracted tool calls) each live in their own file now with a shared BranchPipelineHelpers module underneath.

    AgenticChatService itself is now 230 lines of pure HTTP-facing dispatch.

    Three other files also went under the knife:

    • LLMCallExecutor (668 → 420) -- context-overflow recovery and model-fallback wrappers extracted into ProviderFallbackHandler.
    • StructuralVerifier (611 → 129) -- per-language verifiers (TypeScript, Python, Go/Rust, Generic) moved to a verifiers/ subfolder, dispatcher stays tiny.
    • PostLoopProcessor (609 → 206) -- MemoryExtractor and LoopOutcomeRecorder pulled out as siblings.

    Frontend got the same treatment: ChatInput.tsx 464 → 186 (three hooks + two presentational components extracted), AgentChatPanel.tsx 419 → 276 (useAgentChatPanel hook extracted). Zero files over the god-file limit anywhere in the codebase for the first time I can remember.

    2. The bug the split exposed

    The split itself was mechanical. The interesting part started after the merge, when Cloud Boost tests began failing. Not because the split broke anything -- it didn't -- but because I was now tracing one code path end-to-end for the first time instead of assuming it worked.

    Backend log showed a 400 from Anthropic: messages: at least one message is required.

    The user's turn was being stripped before Claude ever saw it. Always. For weeks. I'd written it off as "boost is flaky" the handful of times I'd noticed.

    Root cause: ContextAssemblyService.trimToContextWindow was reading the global llm.context_window setting (default 32K, baked in back when Ollama was the only supported provider) and Math.min-ing it against the model's real capability. On my 5090 calling Claude Sonnet 4 with its 200K native context, the 20K Bodega system prompt overflowed the imagined 32K budget. The trimmer returned only system messages, dropped the user turn, Anthropic rejected the request, the agentic loop caught the error, the stream silently closed with a "done" frame. The user sees "the model didn't answer."

    Fixed two ways: thread the resolved model name into the trimmer so cloud providers use their actual native context window, and add a hard safety floor that guarantees the last user turn is always preserved even when system overflow is catastrophic. Cloud providers will never get messages: [] from us again.
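
    The safety floor, sketched under simplifying assumptions: a crude 4-characters-per-token estimate and a trimmer that takes the budget directly. The real ContextAssemblyService.trimToContextWindow has a different signature and tokenizer.

```typescript
// Hypothetical sketch of a trimmer that can never drop the last user turn.
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Crude token estimate (~4 chars per token); an assumption for this sketch.
const estimateTokens = (m: Msg): number => Math.ceil(m.content.length / 4);

function trimToContextWindow(messages: Msg[], budget: number): Msg[] {
  // Find the last user turn.
  let lastUser = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") { lastUser = i; break; }
  }

  // Always keep system messages and the last user turn (the safety floor),
  // even if together they already exceed the budget.
  const keep = new Set<number>();
  let used = 0;
  messages.forEach((m, i) => {
    if (m.role === "system" || i === lastUser) {
      keep.add(i);
      used += estimateTokens(m);
    }
  });

  // Spend whatever budget is left on the most recent remaining messages.
  for (let i = messages.length - 1; i >= 0; i--) {
    if (keep.has(i)) continue;
    const cost = estimateTokens(messages[i]);
    if (used + cost > budget) break;
    keep.add(i);
    used += cost;
  }
  return messages.filter((_, i) => keep.has(i));
}
```

    With the floor in place the result may exceed the budget when the system prompt alone overflows it. That trades a possible provider-side length error for the guaranteed empty-messages rejection described above.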

    3. The Ollama tax

    The trimmer fix was tactical. The underlying setting -- llm.context_window as a global ceiling -- was the strategic problem. It was making every provider look like Ollama.

    5090 user with 32GB VRAM? Could easily run 128K context on a 7B model locally. Setting says 32K. Locked to 32K.

    Calling Claude Sonnet (200K native)? Setting says 32K. Locked to 32K.

    Calling GPT-4o (128K)? Locked to 32K.

    Nobody was getting what their provider was capable of. The default was silently the ceiling for everyone.

    Spent the rest of the day rebuilding context window resolution as provider-kind aware. New module ContextWindowResolver with a four-level priority chain:

    1. Per-model override (if user explicitly set one for this model)
    2. Explicit global cap (if user set one non-default -- 0 is the new "unset" sentinel)
    3. Cloud providers: use the model's native max, no VRAM math
    4. Local providers: apply a hardware-computed ceiling based on VRAM tier

    Hardware probe now writes a recommended local ceiling to settings on first run. VRAM tier table: ≤4GB → 8K, 4-8GB → 16K, 8-16GB → 32K, 16-24GB → 64K, 24GB+ → 128K. The probe does this automatically; users see the detected value, can cap it further if they want.
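
    The priority chain and tier table, as a sketch. The input shape and function names are hypothetical. It also assumes "K" means multiples of 1,024 tokens, that tier boundaries at exactly 8GB and 16GB fall into the lower tier, and that caps are clamped to the model's native max.

```typescript
// Hypothetical sketch of a four-level context window resolution.
type ProviderKind = "cloud" | "local";

interface ResolveInput {
  perModelOverride?: number; // level 1: explicit per-model setting
  globalCap: number;         // level 2: 0 is the "unset" sentinel
  providerKind: ProviderKind;
  modelNativeMax: number;    // the model's native max context, in tokens
  vramGb: number;            // only consulted for local providers
}

// VRAM tier table from the entry; exact boundary handling is an assumption.
function localCeilingForVram(vramGb: number): number {
  if (vramGb <= 4) return 8 * 1024;
  if (vramGb <= 8) return 16 * 1024;
  if (vramGb <= 16) return 32 * 1024;
  if (vramGb < 24) return 64 * 1024;
  return 128 * 1024;
}

function resolveContextWindow(input: ResolveInput): number {
  // 1. Per-model override wins outright.
  if (input.perModelOverride !== undefined) return input.perModelOverride;
  // 2. Explicit, non-sentinel global cap (clamped to what the model supports).
  if (input.globalCap > 0) return Math.min(input.globalCap, input.modelNativeMax);
  // 3. Cloud: the model's native max, no VRAM math.
  if (input.providerKind === "cloud") return input.modelNativeMax;
  // 4. Local: hardware-computed ceiling.
  return Math.min(input.modelNativeMax, localCeilingForVram(input.vramGb));
}
```

    With the sentinel in place, a cloud model with a 200K native window resolves to 200K instead of the old 32K default.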

    Migration is idempotent: users who explicitly tuned their context window get that value preserved. Users on the old default get the sentinel, which means auto-resolution takes over.

    Lesson of the day: refactoring clean code finds nothing. Refactoring tangled code finds bugs the tangle was hiding.

  29. 27 bugs, 8 features, zero regressions

    Big session. Ran a full end-to-end audit of the entire app, closed 27 bugs, then started on 8 features pulled from competitive research. 2,523 tests passing at the end.

    The security ones matter most. Three sandbox escape paths in GrepTool, GlobTool, and DiffFileTool -- arguments weren't being re-anchored to the workspace root, so a carefully crafted path could read outside it. Command injection in RunTestsTool: test arguments flowed straight into the shell. All four closed before anything touched a release tag.
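
    "Re-anchoring to the workspace root" means resolving the argument first and only then checking containment. Here is a simplified sketch for POSIX-style paths. The function is hypothetical, it does no symlink resolution, and production code would more likely lean on the platform's path and realpath utilities than hand-rolled normalization.

```typescript
// Hypothetical sketch: resolve a tool argument and reject it if it leaves the workspace.
function resolveInsideWorkspace(workspaceRoot: string, userPath: string): string | null {
  const root = workspaceRoot.replace(/\/+$/, "");
  const rootSegments = root.split("/").filter(Boolean);

  // Absolute inputs resolve from "/", relative inputs from the workspace root.
  const stack: string[] = userPath.startsWith("/") ? [] : [...rootSegments];
  for (const seg of userPath.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") stack.pop();
    else stack.push(seg);
  }
  const resolved = "/" + stack.join("/");

  // Compare on a segment boundary so "/ws-evil" does not pass as inside "/ws".
  const inside = resolved === root || resolved.startsWith(root + "/");
  return inside ? resolved : null;
}
```

    The check runs on the resolved path, not the raw argument, so "../" sequences are collapsed before the containment test.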

    The "why is this broken" ones. Image attachments via the file picker rejected every image as "binary." Chat mode's agent was going rogue with docs attached: trying code tools, burning iterations, then hallucinating a fake answer. Research mode was firing web searches against the embedded doc content instead of the actual question. All three were integration bugs that passed unit tests but fell apart under real usage.

    The off-by-ones. Circuit breaker was tripping at exactly 80% of budget, killing valid conversations at the threshold. Budget enforcement blocked at the limit instead of when exceeded. Air-gap mode wasn't disabling the cloud boost toggle in the UI -- it was enforcing air-gap at the network layer, but the toggle was still live. These are the bugs that make people lose trust. Fixed.

    Onboarding completion never persisted, so the checklist reappeared every restart. Session deletion orphaned memory entries. The memory route crashed on corrupt JSON. The write mutex had no timeout and could hang the app forever. Plus 363 lines of dead CSS from themes we removed weeks ago. And 12 more across the stack.

    What I shipped on top of that:

    • Chat mode now has read-only code tools (grep, glob, file read, code search). You can ask the model about your codebase without switching to Code mode. This was the #1 feature request after the last audit.
    • FIM inline completions now default ON. They were behind an opt-in toggle that nobody found. Better first-run experience.
    • Smart tool stripping. Some small models repeatedly try to call tools that are blocked for their context. After a few failures, the tool list gets removed entirely and the model is forced to respond as text. Reduces the "agent flailing" loop that kills small-model usability.
    • Settings progressive disclosure. New users see Theme, Models, Profile, Boost. That's it. Everything advanced lives behind a toggle. The settings page was turning into a wall.

    Next session: cloud provider as primary (not just boost -- for users without a local GPU), auto-verify after code changes (run tests and build, feed errors back), panel hand-off buttons ("Fix this" from Debug or Research jumps into Agent), and conversation export (JSON + ZIP backup).

  30. The agent that learns mid-run

    First time I watched Qwen 3.5 hallucinate spawn_worker three times in one run, I knew the cross-session learning system wouldn't be enough. Today we closed both gaps.

    Within-loop learning. Before today, Bodega's agent learning was cross-session only. The model makes a mistake, we record it, and the next session gets a rule injected into its context: "don't do that." Works great. 5-stage pipeline, Bayesian confidence tracking, hard pre-execution blocking. But if the model hallucinates a tool in iteration 3 of a 15-iteration run, it could repeat that mistake in iterations 4, 5, and 6 before the session ends and the rule kicks in.

    Now it can't. SessionRuleBuffer records failures in-memory during the run. Iteration 3 fails → iteration 4 sees a SESSION RULES block and the tool is hard-blocked before execution. Max 3 temp rules per session to avoid prompt instability. Rules are ephemeral: injected before each LLM call, stripped after. One new file, 76 lines.
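
    A stripped-down sketch of the idea. The method names, the rule wording, and the choice to ignore new rules once the buffer is full are assumptions; the 3-rule cap and the SESSION RULES block are from the description above.

```typescript
// Hypothetical sketch of an in-memory, per-run rule buffer.
class SessionRuleBuffer {
  private static readonly MAX_RULES = 3;
  private blockedTools: string[] = [];

  /** Record a failed tool so later iterations in the same run are blocked from calling it. */
  recordFailure(toolName: string): void {
    if (this.blockedTools.includes(toolName)) return;
    // Assumed policy: once the cap is reached, additional rules are dropped.
    if (this.blockedTools.length >= SessionRuleBuffer.MAX_RULES) return;
    this.blockedTools.push(toolName);
  }

  /** Pre-execution check: true means the call is refused before it runs. */
  isBlocked(toolName: string): boolean {
    return this.blockedTools.includes(toolName);
  }

  /** Text injected before an LLM call; empty when there are no rules yet. */
  render(): string {
    if (this.blockedTools.length === 0) return "";
    const lines = this.blockedTools.map((t) => `- Do not call ${t}; it failed earlier in this run.`);
    return ["SESSION RULES", ...lines].join("\n");
  }
}
```

    Nothing is persisted: the buffer lives for one run and is discarded with it, which is what keeps these rules separate from the cross-session pipeline.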

    Rule confidence persistence. We had a Bayesian tracker computing which learned rules were actually working (Beta distribution, alpha/beta posteriors). It flagged rules with confidence below 0.3 for demotion. The math was there. The DB write wasn't. Rules that looked good on paper but triggered false positives were accumulating forever. 20-line fix: shouldDemote() now calls deactivateRule() and the rule goes inactive in SQLite.
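
    In sketch form, the confidence is the mean of a Beta(alpha, beta) posterior, and the fix is simply acting on the demotion decision. The uniform Beta(1, 1) prior, the in-memory rule shape, and the callback standing in for the SQLite write are assumptions.

```typescript
// Hypothetical sketch of Beta-posterior rule confidence with demotion.
interface RuleStats {
  alpha: number;   // prior + times the rule helped
  beta: number;    // prior + times it misfired (false positive)
  active: boolean;
}

const DEMOTE_BELOW = 0.3;

// Uniform Beta(1, 1) prior; an assumption for this sketch.
const newRule = (): RuleStats => ({ alpha: 1, beta: 1, active: true });

function observe(rule: RuleStats, helped: boolean): void {
  if (helped) rule.alpha += 1;
  else rule.beta += 1;
}

// Posterior mean of Beta(alpha, beta).
const confidence = (rule: RuleStats): number => rule.alpha / (rule.alpha + rule.beta);

const shouldDemote = (rule: RuleStats): boolean => confidence(rule) < DEMOTE_BELOW;

/** The missing step: persist the decision instead of only computing it. */
function reviewRule(rule: RuleStats, deactivateRule: (r: RuleStats) => void): void {
  if (rule.active && shouldDemote(rule)) deactivateRule(rule);
}
```

    Starting from Beta(1, 1), three false positives with no successes give 1 / (1 + 4) = 0.2, which is under the 0.3 threshold.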

  31. 156 E2E tests, Playwright Electron, dark installer, Sentry crash reporting

    Spent the day doing something most indie devs skip: writing real end-to-end tests that launch the actual Electron app and exercise every feature through the UI.

    Not unit tests. Not API tests. Full Playwright Electron tests that boot the app, switch between modes, send messages to local LLMs, verify tool calls hit the disk, and confirm that context survives compaction. 156 tests across 21 spec files. 98.7% pass rate.

    What the tests found: compact was using a stale model config, agent panels weren't auto-scrolling, reasoning-only models showed blank responses, and the auto-router was sending simple prompts to tiny models. All fixed. Also shipped dark-themed NSIS installer with matching uninstaller, crash reporting via Sentry (opt-in, respects air-gap), TopBar layout rearranged per feedback.

  32. Loop write guard, approval card fix, E2E Round 2

    Two things were driving me crazy about the agentic loop. One: the agent would write a file, re-verify, decide it wasn't done, and write the file again. And again. Same content, same path, different iteration. Two: approval cards would appear mid-stream and you'd never see them because they rendered outside the scroll container.

    Both fixed. The repeat-write guard now tracks writes per file path across the loop -- after 3 writes to the same file in a single session, it injects a system message, marks the deliverable satisfied, and breaks the cycle. Approval cards moved inside the scroll container so they actually travel with the content. E2E Round 2 ran 31 tests. Found 11 bugs across todo_write registration, model routing, panel scroll, web search iteration caps, and VRAM warning noise. All 11 fixed and committed.
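
    The per-path counter behind the repeat-write guard could be as small as this. The class and method names are hypothetical, and treating the third write as the trigger is one reading of "after 3 writes".

```typescript
// Hypothetical sketch of a per-path write counter for one loop run.
class RepeatWriteGuard {
  private counts = new Map<string, number>();
  private limit: number;

  constructor(limit = 3) {
    this.limit = limit;
  }

  /** Records a write; returns true once this path has reached the limit. */
  recordWrite(path: string): boolean {
    const n = (this.counts.get(path) ?? 0) + 1;
    this.counts.set(path, n);
    return n >= this.limit;
  }
}
```

    When recordWrite returns true, the caller would inject the system message, mark the deliverable satisfied, and break out of the loop.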

  33. Chat → Runtime → Loop → QEL

    Shipped the Runtime Layer today. This one is more architectural than visible, but it matters.

    The problem: before each agentic loop, there were ~150 lines of scattered conditionals spread across the chat orchestrator. Is this session in a panel? What iteration budget applies? Does this model support tools? What happens after 3 consecutive failures? These questions were answered in different places with inconsistent logic.

    RuntimeLayer.ts consolidates all of that into a single typed LoopPolicy that gets produced before the loop starts. The classify() call looks at the request, the model's capability profile, the panel context, and the session failure history -- then produces a LoopPolicy with a single executionLane value.

    Four execution lanes:

    • advisory -- bypasses the loop entirely, single LLM call, no tools. Fast. For panels that just need a quick answer.
    • guided -- up to 8 iterations, limited tool set. For supervised agent work.
    • restricted -- panel-constrained tool allowlist. Research panel only gets research tools.
    • full -- complete tool access, computed iteration budget. Normal code mode.

    The capability detection piece is new: CapabilityProfile reads the model's known abilities (tool calling tier: native/xml/weak/none; structured output; reasoning) and can downgrade the lane automatically if the model can't support what was requested. No more sending tool calls to a model that'll ignore them.

    Dynamic failure tracking: if a session sees 3 consecutive tool failures, the lane downgrades automatically for the rest of the session. The model gets fewer chances to break things.
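
    A much-reduced sketch of how a single classify() call can fold capability and failure history into one LoopPolicy. The input shape, the specific downgrade targets (no tool support → advisory, repeated failures → guided), and the budget used for the restricted lane are all assumptions; the entry only says that downgrades happen.

```typescript
// Hypothetical sketch of lane selection; not the real RuntimeLayer.classify().
type Lane = "advisory" | "guided" | "restricted" | "full";
type ToolTier = "native" | "xml" | "weak" | "none";

interface LoopPolicy {
  executionLane: Lane;
  maxIterations: number;
}

interface ClassifyInput {
  requestedLane: Lane;
  toolTier: ToolTier;              // from the model's capability profile
  consecutiveToolFailures: number; // session failure history
  computedBudget: number;          // iteration budget computed elsewhere
}

function classify(input: ClassifyInput): LoopPolicy {
  let lane: Lane = input.requestedLane;

  if (input.toolTier === "none") {
    // The model cannot call tools at all: skip the loop (assumed target lane).
    lane = "advisory";
  } else if (input.consecutiveToolFailures >= 3 && lane === "full") {
    // Three consecutive tool failures: step down for the rest of the session.
    lane = "guided";
  }

  // advisory is a single LLM call, guided allows up to 8 iterations,
  // restricted and full use the computed budget (assumed for restricted).
  const maxIterations =
    lane === "advisory" ? 1 : lane === "guided" ? 8 : input.computedBudget;

  return { executionLane: lane, maxIterations };
}
```

    The point of the design is that every downstream consumer reads one executionLane value instead of re-deriving it from scattered conditionals.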

  34. Mar 24-26 -- Phase 9A through 9E

    Shipped the full memory system this week. Five phases in three days. This is the one I'm most proud of so far.

    The problem: every agentic loop iteration starts from scratch. The model has no memory of which files you've been editing together, what patterns you prefer, what errors you've hit before. Every session is day one.

    Phase 9 changes that. Here's what we built:

    • 9A -- HeuristicExtractor wired into the post-loop processor. After every agentic iteration, it extracts facts from what the agent observed and stores them in SQLite. Compression ratio confirmed at 5x+ on real sessions.
    • 9B -- FileAffinityTracker (tracks which files you co-edit, how often, how recently) + ImportGraphExtractor (static import graph for JS/TS/Python/Rust/Go). The context assembler uses both to inject the right files into the next session without you having to specify them.
    • 9C -- LLMObserver -- a second-pass LLM call that extracts implicit facts from assistant turns. Things the heuristic extractor misses. Runs async post-loop on hardware that can afford it, falls back to heuristic-only on low VRAM.
    • 9D -- Memory time decay. Observations have configurable half-lives by type. Stale memories fade instead of polluting context forever. BM25 relevance scoring added alongside recency decay.
    • 9E -- Evaluation harness. 25 scenarios covering injection, retrieval, dedup, decay, and cross-session recall. Memory metrics API exposed for debugging.
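
    The half-life decay in 9D follows the usual exponential form, weight = 0.5^(age / halfLife). A small sketch: the observation types, their half-lives, and multiplying the decay into a relevance score are illustrative assumptions, not the shipped values or the actual way BM25 is combined.

```typescript
// Hypothetical sketch of per-type half-life decay for stored observations.
type ObservationType = "error" | "preference" | "file_affinity";

// Illustrative half-lives in days; the real configuration is not shown in this log.
const HALF_LIFE_DAYS: Record<ObservationType, number> = {
  error: 7,
  preference: 90,
  file_affinity: 30,
};

/** 1.0 when fresh, 0.5 after one half-life, 0.25 after two, and so on. */
function decayWeight(ageDays: number, halfLifeDays: number): number {
  return Math.pow(0.5, ageDays / halfLifeDays);
}

/** Combine a relevance score (e.g. from BM25) with recency; the product is an assumption. */
function memoryScore(relevance: number, ageDays: number, type: ObservationType): number {
  return relevance * decayWeight(ageDays, HALF_LIFE_DAYS[type]);
}
```

    A week-old error observation with a 7-day half-life counts for half as much as a fresh one at the same relevance.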

    Total: 8 new service files, 2 new tools (CreateDocument, DeepResearch), memory pipeline fully wired end to end. This is what makes Bodega feel like it knows you over time.

  35. Mar 23 -- 30 bugs, one session

    Ran what we're calling Operation Fumigate last Sunday. The goal: clear every known bug before the next beta tag. Final count: 30 bugs fixed in one session.

    It was deliberately parallel. Stood up 4 squads, each with a defined scope and a dedicated branch. No overlap, no conflicts.

    • Squad 1 hit the code editor and FIM (fill-in-middle): 9 bugs. Monaco diff decoration bugs, inline fix streaming edge cases, FIM fence stripping failures.
    • Squad 2 took terminal and the Problems panel: 7 bugs. Terminal duplicate input handlers, xterm focus tracking using the wrong event, OSC 133 command block edge cases.
    • Squad 3 handled streaming and session infrastructure: 8 bugs. Double SSE events, streaming interrupted on panel navigation, session data leaks, permission mode enforcement in chat mode.
    • Squad 4 closed out settings, memory, and project management: 6 bugs. Settings not persisting across restarts, memory rate limit bypasses, orphaned settings keys.

    All 4 squads merged to dev by end of day. Doc sweep ran afterward -- all counts, changelogs, and references updated to match. Tagged beta.6 that same evening.

    The thing that made this work: a clear blast radius per squad, no shared files, and every fix verified against a real test case. 30 bugs with no regressions.

  36. Mar 17-18 -- Brain MCP + 15-agent team

    This is the part that doesn't look like normal solo indie dev.

    I've been building with an AI agent team. Not AI-assisted -- an actual team of 15 specialized agents coordinated through a shared memory system called Bodega Brain. Each agent has a defined role, its own identity file, and stays in its lane.

    The roster: Co-Dev (lead), Architect (structural health), Engineer (implementation), Fixer (bugs), Sentinel (security scanning), Scout (competitive intel), Strategist (product direction), QA Engineer, Doc Guardian, Performance Profiler, Integration Tester, Release Manager, Reviewer, UX Auditor, Writer.

    Each one runs on its own git branch. Co-Dev reviews their work, creates PRs, merges after CI passes. I have final say on anything touching main. It's a proper dev workflow, just with agents instead of contractors.

    The Brain is how they coordinate -- a shared system with messaging, task queues, workspace claiming, decision logging, and a live dashboard. When two agents might conflict on the same files, they claim workspaces and check for conflicts before starting.

    This session: 8 PRs reviewed, 5 merged to dev (LSP integration, unified model hub, god-file splits, security hardening, test coverage). The acceleration this enables is real: Phases 0-3 of the V2 overhaul shipped in 48 hours.

  37. QEL ships

    Spent the last few days hardening what I'm calling the QEL -- Quality Enforcement Layer. This was the biggest early architecture decision and it's worth explaining why it exists.

    Most AI coding assistants work like this: you ask a question, the model responds, done. There's no verification that what was produced actually matches what was asked. No check that the code compiles. No detection of stubs. The model hallucinates a solution and calls it a day.

    QEL changes that. Every agentic loop iteration runs three passes: contract extraction (what did the user actually ask for?), completion verification (did the response satisfy it?), and a mode firewall that prevents the wrong class of task from sneaking through. There's a test suite with a letter-grade output system -- the agent has to get an A or B before the response goes out.
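
    A rough sketch of the contract-then-verify idea with a grade gate on top. It assumes the grade is derived from the fraction of extracted requirements a response satisfies and uses made-up thresholds; the log states neither, and the mode firewall is left out entirely. All type and function names here are hypothetical.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

// One requirement pulled from the user's request (contract extraction).
interface Requirement {
  description: string;
  // Hypothetical check; a real verifier would inspect files, run builds, and so on.
  isSatisfiedBy: (response: string) => boolean;
}

// Assumed grading scale: fraction of requirements met, mapped to a letter.
function gradeResponse(contract: Requirement[], response: string): Grade {
  if (contract.length === 0) return "A"; // nothing was asked for, nothing to fail
  const met = contract.filter((r) => r.isSatisfiedBy(response)).length;
  const ratio = met / contract.length;
  if (ratio >= 0.9) return "A";
  if (ratio >= 0.8) return "B";
  if (ratio >= 0.7) return "C";
  if (ratio >= 0.6) return "D";
  return "F";
}

// The gate from the entry above: only an A or B goes out.
function canShip(grade: Grade): boolean {
  return grade === "A" || grade === "B";
}
```

    A stub detector fits naturally as one Requirement, for example a check that the response contains no TODO placeholder, so a half-finished answer pulls the grade below the gate.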

    The architecture underneath is Express + SQLite for the backend, with a streaming pipeline that pushes Server-Sent Events to the frontend in real time. 15 defined SSE event types covering everything from tool calls to plan approvals to QEL verification results.
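
    For readers unfamiliar with SSE, each event on the wire is an "event:" line, a "data:" line, and a blank line. The sketch below formats typed events that way. The three event names and their payload fields are hypothetical stand-ins, not the actual 15 types.

```typescript
// Hypothetical subset of event types; names and fields are illustrative only.
type StreamEvent =
  | { type: "tool_call"; tool: string }
  | { type: "plan_approval"; planId: string }
  | { type: "qel_result"; grade: string };

// Serialize one event as a Server-Sent Events frame.
// JSON.stringify without an indent argument emits no raw newlines,
// so the payload always fits on a single "data:" line.
function toSseFrame(event: StreamEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}
```

    In an Express handler, frames like these would be written with res.write() on a response whose Content-Type is text/event-stream, and the frontend would receive them through EventSource listeners keyed by event name.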

    The other decision I made early: no god files. I've worked on enough codebases that became unmaintainable from one class doing everything. Bodega has hard line limits: 700 lines for service files, 400 for React components. When something hits the limit, it splits. This decision has already paid off four times.
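
    A line limit like this is easy to enforce mechanically. The sketch below uses the two limits from the paragraph above; the function names are hypothetical, and treating a file as over the limit only when it exceeds the number (rather than reaches it) is an assumption.

```typescript
// Limits from the log: 700 lines for service files, 400 for React components.
const LINE_LIMITS = { service: 700, component: 400 } as const;
type FileKind = keyof typeof LINE_LIMITS;

// Count lines, ignoring a single trailing newline at end of file.
function countLines(source: string): number {
  if (source.length === 0) return 0;
  const parts = source.split("\n").length;
  return source[source.length - 1] === "\n" ? parts - 1 : parts;
}

// True when the file is past its limit and should be split.
function exceedsLimit(source: string, kind: FileKind): boolean {
  return countLines(source) > LINE_LIMITS[kind];
}
```

    Run over every service and component file in CI or a pre-commit hook, a check like this turns the rule from a convention into something that fails the build.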

    Current state: QEL shipping, 630 tests passing, agentic loop running on Ollama and OpenAI-compatible providers.

  38. Initial commit day

    Started building Bodega One Code. Here's what it is and why I'm building it.

    It's a local-first AI desktop IDE. Two modes: Chat Mode for general AI conversation, Code Mode for agentic software development. Runs entirely on your machine. No cloud dependency unless you want one.

    I got tired of tools that route everything through someone else's servers. Not because I have something to hide -- because I don't want to depend on a company's uptime, rate limits, or pricing decisions to do my work. Your code, your hardware, your data.

    The tech stack: Electron 40 for the desktop shell, React 19 + TypeScript on the frontend, Express + SQLite on the backend. It supports Ollama out of the box, with OpenAI-compatible endpoints as a fallback for when you need a heavier model.

    The thing I kept noticing with other AI coding tools is that they're mostly fancy autocomplete with a chat window bolted on. What I wanted was something that could actually reason about what it's doing -- extract requirements from what you ask, verify its own output, and refuse to ship half-finished work. That's the Quality Enforcement Layer. More on that later.

    First commit dropped today. It's rough but it runs. The bones are there.

    Building this in public. Wins, bugs, architecture decisions -- all of it.

For polished release notes, see the Changelog · Join Discord
