Build Log

The messy, honest version of development.

Not polished release notes. Real decisions, wrong turns, and why things took longer than expected. Updated as we build.

Build log entries

  1. Bundle 5.42 MB → 1.55 MB, Featherless cold-start, IDE chat leak fix

    Going to start with the perf wins because they're the biggest user-visible change of the cycle.

    Main renderer bundle: 5.42 MB to 1.55 MB. A 71 percent reduction.

    Two paired fixes, neither would have worked alone.

    First fix was a tsconfig miss. The base tsconfig sets module to commonjs because the Electron main process needs CJS, and the webpack-specific override was missing. So TypeScript was transpiling every import() call to a synchronous require(). Every React.lazy() in the codebase was a lie. The Monaco editor, the file tree, the terminal, the diff review, all the lazy-loaded panels were getting eagerly bundled into the main chunk. Adding "module": "esnext" to tsconfig.webpack.json preserved the import() calls. Bundle dropped to 2.11 MB.

    Second fix was Sentry. The crash reporter was being eagerly imported even when telemetry was disabled or air-gap mode was on. About 1.4 MB of Sentry code in the main bundle that almost never ran. Wrapped it in requestIdleCallback and a dynamic import so it loads after first paint, in its own async chunk. Bundle dropped to 1.55 MB.
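
    A minimal sketch of the deferred-load pattern described above, not the actual Bodega code. The names deferCrashReporter, TelemetryGate, and loadCrashReporter are hypothetical; the loader stands in for a dynamic import() of the crash-reporter module, and the scheduler is injectable so the gate logic can be exercised outside a browser.

```typescript
// Hypothetical names throughout; this only illustrates the pattern.
type TelemetryGate = { telemetryEnabled: boolean; airGapMode: boolean };
type IdleScheduler = (cb: () => void) => void;

// Prefer requestIdleCallback where the runtime provides it (Chromium does);
// fall back to a zero-delay timer elsewhere.
const scheduleWhenIdle: IdleScheduler = (cb) => {
  const g = globalThis as { requestIdleCallback?: (fn: () => void) => unknown };
  if (typeof g.requestIdleCallback === "function") {
    g.requestIdleCallback(() => cb());
  } else {
    setTimeout(cb, 0);
  }
};

// Returns true if a load was scheduled, false if the gate blocked it.
function deferCrashReporter(
  gate: TelemetryGate,
  loadCrashReporter: () => Promise<unknown>,
  schedule: IdleScheduler = scheduleWhenIdle,
): boolean {
  // Never schedule the load when telemetry is off or air-gap mode is on.
  if (!gate.telemetryEnabled || gate.airGapMode) return false;
  schedule(() => {
    // In an app the loader would be a dynamic import(), which a bundler
    // can split into its own async chunk.
    void loadCrashReporter().catch(() => undefined);
  });
  return true;
}
```

    The point of the gate check happening before scheduling is that a disabled reporter costs nothing at all, not even a deferred chunk fetch.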

    Also added perf instrumentation to the backend (endpoint timings, breadcrumb tracing, baseline metrics) so we have data to chase the next round of optimizations from.

    Featherless cold-start. Big multi-phase feature this cycle.

    Featherless is serverless inference. Their models live in a hibernated state when nobody's using them, and cold-starting a 70B model can take 30 to 60 seconds. Before this cycle, you'd send a message into a cold model, hit the timeout, see a red error, and have no idea what just happened.

    Phase 1: stretched the warmup timeout to 30 minutes and added an elapsed-time counter to the warming UI. At least you know it's working.

    Phase 2: full coordinator. State machine that tracks queued, requesting, loading, ready, and verified states per model. Dedicated SSE channel at /api/featherless/warmup/progress streams stage transitions and 10-second loading-progress ticks. Persistent WarmingBanner at the top of the chat that shows current stage and dismisses cleanly. SQLite persistence so the state survives restart. Send button is disabled while warming so you can't accidentally fire into a cold model.

    Layer 1 warmup. When you change the active model (Settings, Model Roles picker), Bodega fires a 1-token request in the background. By the time you actually send your first real message, the model is hot.

    Round 26 verified-warm. The banner doesn't just go away when the warmup probe succeeds. It stays until the first real chat completion goes through cleanly, because Featherless can lie about ready state (the probe at 1-token context can pass while a realistic 4K-context request still cold-starts).
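
    The Phase 2 stages and the verified-warm rule can be sketched roughly as below. This is an illustration under assumptions: WarmupTracker and its methods are hypothetical names, and the strictly linear transition table is a simplification (the real coordinator may allow skips or failure states that the log does not describe).

```typescript
type WarmupStage = "queued" | "requesting" | "loading" | "ready" | "verified";

// Assumed linear progression; each stage has at most one legal successor.
const NEXT_STAGE: Record<WarmupStage, WarmupStage | null> = {
  queued: "requesting",
  requesting: "loading",
  loading: "ready",
  ready: "verified",
  verified: null,
};

class WarmupTracker {
  private stages = new Map<string, WarmupStage>();

  enqueue(model: string): void {
    this.stages.set(model, "queued");
  }

  // Applies a transition only if it is the legal next stage for that model.
  advance(model: string, to: WarmupStage): boolean {
    const current = this.stages.get(model);
    if (current === undefined || NEXT_STAGE[current] !== to) return false;
    this.stages.set(model, to);
    return true;
  }

  stageOf(model: string): WarmupStage | undefined {
    return this.stages.get(model);
  }

  // The banner outlives "ready": a passing probe is not trusted until a
  // real chat completion has moved the model to "verified".
  bannerVisible(model: string): boolean {
    const stage = this.stages.get(model);
    return stage !== undefined && stage !== "verified";
  }
}
```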

    Warmup persistence queries now scoped by user_id per Sentinel audit.

    The IDE chat leaking into chat mode bug. The one that's been around for weeks.

    Sending a message in code mode would also write it to the chat session, and both panels would render the same conversation. Identical content in both modes. Joe screenshotted it across multiple test instances. Defied static analysis for hours.

    Both panels share the same useChat hook under the hood. One line in useChatSend used a nullish coalesce as a fallback: const currentSessionId = sessionId ?? activeSessionId.

    For chat mode's useChat instance: sessionId IS activeSessionId. The ?? never fires. Fine.

    For the IDE Agent panel's useChat with no code session yet: sessionId is null, so the ?? returns activeSessionId. That's the chat session's id. So the code-mode send wrote to the chat session in the database. The backend WebSocket broadcast then rendered it in chat. Meanwhile the optimistic addMessage rendered it in code. Same conversation in both panels.

    Took console.warn instrumentation on the slice setters with stack traces to pin it down. Once we saw setLocalMessages firing with sessionType=code and chat content, the call chain pointed straight at the ?? fallback.

    Fix is gating the fallback to sessionType chat only. Defense-in-depth check added on the WebSocket handler: now triple-checks sid against activeSessionId AND state.sessions AND not in state.codeSessions. Even if any future code path sets activeSessionId to a code session id, the type check structurally blocks the leak.
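
    A small sketch of the gated fallback and the defense-in-depth check, assuming a simplified state shape. resolveSessionId, shouldRenderInChat, and the SessionState interface are hypothetical names; only the ?? fallback and the three conditions come from the description above.

```typescript
type SessionType = "chat" | "code";

// Before the fix the equivalent line was: sessionId ?? activeSessionId,
// which handed the chat session's id to a code panel with no session yet.
function resolveSessionId(
  sessionType: SessionType,
  sessionId: string | null,
  activeSessionId: string | null,
): string | null {
  if (sessionId !== null) return sessionId;
  // Only chat mode may fall back to the active (chat) session.
  return sessionType === "chat" ? activeSessionId : null;
}

// Assumed store shape: session records keyed by id.
interface SessionState {
  activeSessionId: string | null;
  sessions: Record<string, unknown>;
  codeSessions: Record<string, unknown>;
}

const hasKey = (obj: Record<string, unknown>, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// WebSocket-side guard: all three conditions must hold before a broadcast
// message is rendered in the chat panel.
function shouldRenderInChat(state: SessionState, sid: string): boolean {
  return (
    sid === state.activeSessionId &&
    hasKey(state.sessions, sid) &&
    !hasKey(state.codeSessions, sid)
  );
}
```

    With this shape, a code-mode send with no session yet resolves to null, so the caller has to create a code session rather than silently borrowing the chat one.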

    Provider switching cleanup. 5 backend services were doing legacy single-OpenAI lookups.

    The /llm/running-models poll was reading llm.openai_base_url for every cloud preset. If you used OpenAI then switched to Featherless or DeepSeek, it kept hammering api.openai.com every 3 seconds with no key or the wrong key. Now routes through the per-preset lookup helpers like the chat path already does. Same fix in four other backend paths: embedding, STT, test-connection, and the deferred chat-stream reconfigure.

    Stale role models on preset switch. We were only clearing 4 of the 11 role keys, so research, debug, and advisor panels could carry stale model names across a switch. Now clears all 11. Plus pre-clears the available-model list and refetches health right after the flip, so role pickers repopulate within 200ms instead of waiting for the next 30s health tick.

    Settings panel was zeroing out role model defaults on save. A previous perf optimization had reduced the settings prop to a 3-key subset for re-render perf, but the hydration code was reading the subset as if it were the full settings. populateFrom now reads the full snapshot inside the effect.

    Featherless WarmingBanner was flashing for every cloud preset switch (qwen, deepseek, etc.). Now only fires for Featherless, which is the only preset with actual cold-start cost.

    STOP READING FILES nudge. The agent was misclassifying "what are the contents of source-of-truth folder?" as a simple knowledge question because "what are" and "folder" weren't in the exploration intent classifier. DeepSeek looped on the contradiction for 440 seconds before producing anything. Enriched both VERBS and TARGETS lists. Regression tests locked in the prompts that hit this.

    DeepSeek raw function_calls XML. Anthropic-style plural "function_calls" form was leaking as visible text in chat mode because the stripper only knew about the singular "function_call". Fixed plus bare-word and empty-block variants.

    Qwen "/think" directive at the start of every response. Qwen via DashScope echoes its thinking-mode prefix at the start of every code-mode message. The stream stripper now silently eats it. Mid-stream "/think" is still treated as content so prose like "the /think directive" survives.
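
    One way to strip a leading directive from a chunked stream while leaving mid-stream mentions alone is sketched here. LeadingDirectiveStripper is a hypothetical name and the boundary rule (the directive must be followed by whitespace or end of stream) is an assumption, not a description of the shipped stripper.

```typescript
const DIRECTIVE = "/think";

class LeadingDirectiveStripper {
  private buffer = "";
  private decided = false;

  // Feed each streamed chunk; returns the text that is safe to display.
  push(chunk: string): string {
    if (this.decided) return chunk; // past the start: everything is content
    this.buffer += chunk;
    const trimmed = this.buffer.trimStart();
    // Still a possible prefix of the directive: hold output until we know.
    if (trimmed.length <= DIRECTIVE.length && DIRECTIVE.startsWith(trimmed)) {
      return "";
    }
    this.decided = true;
    const rest = trimmed.slice(DIRECTIVE.length);
    const isDirective = trimmed.startsWith(DIRECTIVE) && /^\s/.test(rest);
    const out = isDirective ? rest.trimStart() : this.buffer;
    this.buffer = "";
    return out;
  }

  // Call at end of stream to release anything still held back.
  flush(): string {
    if (this.decided) return "";
    this.decided = true;
    const out = this.buffer.trimStart() === DIRECTIVE ? "" : this.buffer;
    this.buffer = "";
    return out;
  }
}
```

    Buffering only until the first few characters are known keeps the added latency to one or two chunks, and a word like "/thinking" passes through untouched because it fails the boundary check.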

    Featherless DeepSeek-V3 emits python tool-call fences. The model writes its tool calls as python code blocks instead of using OpenAI native tool_calls. The stripper now removes them so you don't see broken python. Doesn't make the tool actually execute though, that's a deeper fix for tomorrow.

    Smaller stuff. Chat input was treating the "What can you help me with?" prompt as real input, so clicking in the middle and typing concatenated. Now it auto-selects the prefill so your first keystroke replaces. Retry button on error banners was passing a React SyntheticEvent to the retry handler as a model name. Wrapped properly. API key field looked unfilled when actually saved. Now shows "saved, paste to replace" and a green check hint.

    Code quality and refactors. We have a hard line limit on file sizes (700L services, 400L components, 300L route handlers). 6 files crossed limits this cycle and got split:

    • OpenAIProvider: 843L to 670L. Extracted model-cap and message-converter.
    • LLMService: 736L to 699L. Extracted preset-lookup helpers.
    • routes/llm.ts: 452L to 126L. Extracted health, warmup, and test-connection into sibling files.
    • ProviderCard: split out useProviderBaseUrl and useProviderApiKey hooks.
    • MyModelsTab: 445L to 395L. Extracted ModelRow and MultiModelVramWarning.
    • GuidedTourOverlay: 405L to 326L. Extracted tourTooltipPosition.

    configPath.ts also got a shared candidate builder extracted, to handle 4-level-deep service paths after install services moved into subdirs.

    Security audits this cycle:

    • 4xx body.error XSS audit. Added a convention test that no error message from any 4xx body field is rendered as innerHTML.
    • MCP OAuth 2.1 audit. Documented current state, gap analysis, effort estimate for full compliance.
    • Sentinel LOW-2 cleanup. Replaced remaining String(err) patterns with the proper err instanceof Error check, matching the convention used everywhere else.

    Three quality gaps deferred to tomorrow, with notes:

    1. Qwen via DashScope doesn't invoke tools at all. The model claims folders don't exist instead of reading them. Backend log shows tools are declared and sent in the request, but DashScope's OpenAI-compat endpoint might be silently dropping them or wanting a different shape. Needs a captured network payload.
    2. Featherless DeepSeek-V3 tool execution. Today's fix stops the broken python output from leaking to the user, but the tool itself still doesn't run. Either force the prompt template to push XML format, or add a parser for python-style calls.
    3. file_system.read pagination. Tool result is capped at 16KB to prevent context blowout, so an 85KB file like our CHANGELOG can't be fully read. Adding offset and length params to the tool schema.
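
    For the pagination item, a rough sketch of what offset and length params could look like. This is a guess at the shape, not the planned schema: readPage, ReadPage, and nextOffset are hypothetical, and treating offset and length as byte counts is an assumption.

```typescript
const MAX_RESULT_BYTES = 16 * 1024; // the 16KB cap mentioned above

interface ReadPage {
  content: string;
  offset: number;            // byte offset this page starts at
  nextOffset: number | null; // null once the end of the file is reached
  totalBytes: number;
}

function readPage(
  file: Uint8Array,
  offset: number = 0,
  length: number = MAX_RESULT_BYTES,
): ReadPage {
  const start = Math.min(Math.max(0, offset), file.length);
  // Clamp to [1, cap] so a caller can never request a zero-progress page.
  const size = Math.min(Math.max(1, length), MAX_RESULT_BYTES);
  const end = Math.min(start + size, file.length);
  // Note: slicing by bytes can split a multi-byte UTF-8 character at a page
  // boundary; a production version would need to align to character edges.
  const content = new TextDecoder().decode(file.subarray(start, end));
  return {
    content,
    offset: start,
    nextOffset: end < file.length ? end : null,
    totalBytes: file.length,
  };
}
```

    Under these assumptions an 85KB file takes six calls to read in full, with each result still under the cap.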

    Pagination first tomorrow since it's the smallest scope. Good night.

  2. beta.18.1: a two-bug hotfix that became 26 rounds and 84 files

    Bodega One v1.0.0-beta.18.1 -- the full story

    This was supposed to be a quick two-bug hotfix this morning. By midnight it was 26 rounds, 84 files, ~3,800 lines, and a complete user-experience overhaul of the Featherless integration. Here's what happened and why each piece matters.

    The original two-bug hotfix

    1. Self-hosted providers were losing their custom Base URL. If you ran llama.cpp, LM Studio, vLLM, or any OpenAI-compatible local server on a non-default URL (or a different port), Bodega would let you type and save that URL -- but the moment you navigated away from the Models settings page, the value reverted to localhost. Discord users hit this within hours of beta.18 shipping.

    Root cause: the URL was being read from a generic llm.openai_base_url setting that every preset shared, but only the active preset's "Test Connection" button wrote to it. Switching presets on the menu wiped the previously-saved URL.

    Fix: a new lookupBaseUrlForPreset(presetId) helper generalizes the qwen/kimi region-override pattern so ANY OpenAI-compat preset can override its base URL via llm.<presetId>_base_url. The legacy single-key fallbacks are preserved so pre-Phase-2 setups don't break.
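
    A sketch of the lookup order, under assumptions. The real helper takes only a presetId; here the settings bag and preset descriptor are passed in so the example is self-contained, and the usesLegacyKey flag is a hypothetical way to scope the legacy fallback (the log does not say how the real code decides which presets may use it).

```typescript
type Settings = Record<string, string | undefined>;

interface PresetInfo {
  id: string;
  defaultBaseUrl: string;
  usesLegacyKey?: boolean; // hypothetical: opt-in to the shared legacy key
}

const LEGACY_BASE_URL_KEY = "llm.openai_base_url";

function lookupBaseUrlForPreset(settings: Settings, preset: PresetInfo): string {
  // 1. Per-preset override, e.g. llm.lmstudio_base_url.
  const perPreset = settings[`llm.${preset.id}_base_url`]?.trim();
  if (perPreset) return perPreset;
  // 2. Legacy shared key, only for presets that historically relied on it.
  if (preset.usesLegacyKey) {
    const legacy = settings[LEGACY_BASE_URL_KEY]?.trim();
    if (legacy) return legacy;
  }
  // 3. The preset's built-in default.
  return preset.defaultBaseUrl;
}
```

    Because each preset reads its own key first, saving a URL for one preset can no longer clobber another's.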

    2. No native preset for Featherless AI. Featherless's onboarding page was explicitly branded "for Bodega users" with rc_* API keys, but Bodega's preset list didn't include them. Users had to set them up as a Custom OpenAI provider with a typed-in base URL -- and once they did, they hit bug #1.

    Fix: Featherless added as a first-class preset with proper Bearer auth, https://api.featherless.ai/v1 default URL, the right setupTip pointing to featherless.ai/account, and dedicated llm.featherless_api_key storage. Wired into cloud-key validation, the V2 onboarding picker, the Settings → Cloud API Keys section, and the Cloud Boost provider picker.

    Then live testing happened

    I tested with my own Featherless key. The first thing it did was freeze.

    3. The 6,700-model freeze. Featherless's /v1/models endpoint returns every HuggingFace-mirrored model they host -- 16,275 entries on a free trial, more on paid plans. Our OpenAIProvider.listModels was trying to ingest all of them, ship per-model profiles for each in the /llm/health response, and let the model picker render a 16k-entry datalist. Result: Windows socket pool drained, /llm/health payload hit ~1MB, model picker would auto-select 000ADI/Qwen2.5-...-Gensyn-Swarm-grazing_locust (the alphabetically-first random fine-tune), and the renderer locked up.

    Three layers of fix:

    • capListedModels at the provider boundary caps the response at 500. When upstream exceeds the cap, models from a curated allowlist of foundation-model orgs (meta-llama, deepseek-ai, Qwen, mistralai, google, microsoft, NousResearch, HuggingFaceH4, CohereForAI, allenai, tiiuae, 01-ai, WizardLMTeam, moonshotai) are always retained; the rest fill remaining slots alphabetically.
    • /llm/health slimmed: when the model count exceeds 50, the response ships only model names. Per-model profiles are lazy-fetched via /api/models/:name/info on demand. Net JSON shrinks from ~1MB to ~10KB.
    • pickDefaultModelForPreset learned to prefer curated foundation models. CURATED_CLOUD_MODELS.featherless seeds 10 hand-picked entries (DeepSeek-V3-0324, Qwen 2.5-Coder-32B, Llama 3.1-70B-Instruct, etc.) so first-run users land on a real model.
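
    The capping rule from the first bullet can be sketched like this. It is an approximation: the function body is a guess at the behavior described, and it assumes model IDs follow the org/name form so the org is whatever precedes the first slash.

```typescript
const MODEL_CAP = 500;

// Allowlist of foundation-model orgs, as listed above.
const CURATED_ORGS = new Set([
  "meta-llama", "deepseek-ai", "Qwen", "mistralai", "google", "microsoft",
  "NousResearch", "HuggingFaceH4", "CohereForAI", "allenai", "tiiuae",
  "01-ai", "WizardLMTeam", "moonshotai",
]);

function capListedModels(ids: string[], cap: number = MODEL_CAP): string[] {
  if (ids.length <= cap) return ids; // under the cap: pass through untouched
  const curated: string[] = [];
  const rest: string[] = [];
  for (const id of ids) {
    const slash = id.indexOf("/");
    const org = slash > 0 ? id.slice(0, slash) : "";
    (CURATED_ORGS.has(org) ? curated : rest).push(id);
  }
  rest.sort(); // remaining slots are filled alphabetically
  // Curated entries go first so they survive the cut. If the curated set
  // alone exceeded the cap, this sketch would still truncate it.
  return [...curated, ...rest].slice(0, cap);
}
```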

    4. The 60fps re-render loop. After a Featherless reconfigure, the renderer would suddenly hit 49% sustained CPU. Profiler audit traced it to a feedback loop: /llm/health → setModelProfiles({...spread}) → every Zustand subscriber re-renders → one of those re-renders triggers another /llm/health → loop.

    Fix: removed the lazy-fetch from useLLMHealthCheck. Empty-dict guards added to setModelProfiles and setRecommendedSettings so an empty {} (which the slim path sends) doesn't trigger spurious re-renders. CPU dropped from 49% to 1.3%.

    5. The whole-settings selector cascade (BUG-DM-15). Nine components were subscribing to the entire settings object via useStore((s) => s.settings). Any change to any of the 50+ settings keys re-rendered all of them. Refactored to per-key selectors so each component only re-renders when the keys it actually reads change. ~10x reduction in re-renders during typical settings churn.

    6. The SQLite race conditions. Two distinct races:

    • SettingsService.setMany was firing concurrent BEGIN statements during onboarding, hitting "cannot start a transaction within a transaction." Fix: serialized via internal queue.
    • Cross-service: SettingsService and MessageService could both try BEGIN at the same time. Fix: BEGIN-retry wrapper that catches the race and retries with backoff.
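
    Both fixes follow well-known patterns, sketched here in isolation. SerialQueue and withBeginRetry are hypothetical names, the retry count and delays are made-up defaults, and matching on the error message text is an assumption about how the race is detected.

```typescript
// Runs async tasks strictly one after another, in submission order.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even if this task rejects.
    this.tail = result.catch(() => undefined);
    return result;
  }
}

const NESTED_TX_MESSAGE = "cannot start a transaction within a transaction";

// Retries a transaction that lost the BEGIN race, with exponential backoff.
async function withBeginRetry<T>(
  tx: () => Promise<T>,
  attempts: number = 5,
  baseDelayMs: number = 10,
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await tx();
    } catch (err) {
      const message = err instanceof Error ? err.message : "";
      const isRace = message.includes(NESTED_TX_MESSAGE);
      // Unrelated errors, or the final attempt, propagate to the caller.
      if (!isRace || i >= attempts - 1) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
}
```

    The queue handles contention inside one service; the retry wrapper covers the case where two services that do not share a queue collide anyway.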

    7. Eight Featherless models were OAuth-gated. Live API testing revealed that 8 of the 10 originally-curated Featherless models returned 403 with model_gated_needs_oauth -- they require HuggingFace account-linking that Bodega can't do. meta-llama/* (every Llama model) and google/* (every Gemma model) were affected.

    Fix: rewrote the curated list with verified-working IDs only (DeepSeek-V3-0324, Qwen 2.5-72B-Instruct, Qwen 2.5-Coder-32B, Kimi-K2, Hermes-3-Llama-3.1-70B, etc.) and added an OAUTH_REQUIRED_HF_ORGS filter at the boundary so meta-llama and google models never reach the dropdown at all.

    Then the silent-fail bug surfaced

    This is the one that took the longest to crack. After onboarding completed, the user would press Enter on the prefilled "What can you help me with?" message, the composer would clear, and absolutely nothing else would happen. No error banner, no chat session in the sidebar, no response. The send was reaching the backend, the session was being created, but the user saw silence.

    Twenty rounds of progressively-deeper retry mechanics shipped throughout the day. Each round helped a little. None of them actually fixed it.

    The root cause turned out to be embarrassingly simple: ChatErrorBanner only existed in the active-chat layout, never in the empty-state ChatGreeting. When the first send failed (because Featherless's cold-start blocked the backend's event loop while parsing the 500-model list), the error fired correctly -- but it had nowhere to render. The user saw nothing because there was no UI surface to put the error on.

    Fix: the banner now renders in both empty + active states. The error has somewhere to go. The user sees a clear "Request timed out -- the model may still be loading. Try again in a moment." with a Retry button instead of mysterious silence.

    Sub-fixes that landed alongside:

    • Code mode's ErrorBanner was rendering errors raw ("signal timed out") because it diverged from chat mode's ChatErrorBanner which used formatErrorMessage. Wired through.
    • Express keepAliveTimeout bumped from Node's 5s default to 65s. The 5s default caused Chromium's keep-alive socket pool to try reusing FIN-acked sockets during the post-onboarding settle, silently hanging follow-up requests.
    • Iteration-cap warning footer ("Reached the iteration limit. Response may be incomplete.") was appearing on pure-text conversational answers in code mode. Now only appended when the model actually used tools.
    • Toast on first retry ("Connection slow -- reconnecting...") so the previously-silent 15s window gives visible feedback.

    Then the cold-start UX problem

    Even with the banner visible, Featherless's serverless cold-start was 30 seconds to 5 minutes on a busy night. Users would see "Request timed out", click Retry, see "Request timed out" again, click Retry again. The Retry was working but Featherless wasn't responding fast enough to feel like the app worked.

    The proper architectural fix is to move LLM calls off the backend's main event loop into a worker thread (planned for beta.19, ~3-5 days). For tonight, two layers of mitigation:

    Layer 1 -- Pre-warm on onboarding. The moment the user finishes cloud onboarding (5-10 seconds before they actually press Enter), fire a 1-token completion request to the chosen model. Featherless spins up its GPU during the welcome-screen seconds. By the time the user types and hits Enter, the model is warm.

    New backend endpoint: POST /api/llm/warmup -- fire-and-forget, returns 202 immediately, runs the warmup in the background. New frontend hook useModelWarmup watches activeRoutedModel in the store and re-fires warmup whenever the user picks a different model from the dropdown or pins a new role-model in Settings. 60s same-model dedup so we don't spam Featherless.
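
    The 60s same-model dedup could look something like the following. WarmupDeduper is a hypothetical name, and the injectable clock exists only to make the window testable.

```typescript
const DEDUP_WINDOW_MS = 60_000;

class WarmupDeduper {
  private lastFired = new Map<string, number>();
  private readonly now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Returns true (and records the time) if a warmup should be sent now.
  shouldFire(model: string): boolean {
    const t = this.now();
    const last = this.lastFired.get(model);
    if (last !== undefined && t - last < DEDUP_WINDOW_MS) return false;
    this.lastFired.set(model, t);
    return true;
  }
}
```

    Keying the window by model means switching to a different model always fires immediately, while flipping back and forth within a minute does not re-warm the first one.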

    Layer 2 -- Backend health cache + renderer health-poll pause. /llm/health now caches its response for 5 seconds with in-flight dedup. Coalesces the burst of health calls during onboarding (Providers tab + FIM + Embeddings + the main poll all fire on mount) plus the 30s steady-state poll. Cache key includes preset so reconfigure invalidates implicitly.
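
    A compact sketch of a TTL cache with in-flight dedup, keyed by preset as described. HealthCache is a hypothetical class, not the backend's actual code; the fetcher and clock are injected so the behavior can be checked without a server.

```typescript
class HealthCache<T> {
  private cached: { key: string; value: T; at: number } | null = null;
  private inFlight: { key: string; promise: Promise<T> } | null = null;
  private readonly fetchHealth: (preset: string) => Promise<T>;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(
    fetchHealth: (preset: string) => Promise<T>,
    ttlMs: number = 5_000,
    now: () => number = Date.now,
  ) {
    this.fetchHealth = fetchHealth;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(preset: string): Promise<T> {
    // Fresh cached value for the same preset: serve it.
    const c = this.cached;
    if (c && c.key === preset && this.now() - c.at < this.ttlMs) {
      return Promise.resolve(c.value);
    }
    // A request for the same preset is already running: share it.
    const f = this.inFlight;
    if (f && f.key === preset) return f.promise;

    const promise = this.fetchHealth(preset)
      .then((value) => {
        this.cached = { key: preset, value, at: this.now() };
        return value;
      })
      .finally(() => {
        if (this.inFlight?.promise === promise) this.inFlight = null;
      });
    this.inFlight = { key: preset, promise };
    return promise;
  }
}
```

    Because the preset is part of the key, a reconfigure to a different preset misses the cache without any explicit invalidation call.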

    useLLMHealthCheck skips the 30s poll while a chat or agent stream is active. Eliminates the "Cannot reach Featherless" yellow banner flickering mid-chat that previously fired every 30s when the backend's event loop was busy awaiting Featherless's response.

    Round 26 -- The persistent warming banner. Even with Layer 1 + 2, the cold-start window was still invisible. Users saw nothing happen, didn't know if the app was broken or just waiting. The transient Cannot reach Featherless banner flickering on/off REINFORCED the broken perception.

    A new persistent banner sits between TopBar and the mode layout: "• Warming up DeepSeek-V3-0324 -- first send may take 30-90 seconds while the model loads on the provider's GPU." It stays up from the moment the warmup fires until either /llm/health returns connected or a chat completion succeeds. The composer stays enabled. The user knows the truth.

    Then the security audit

    A Sentinel agent was dispatched in parallel with the live testing to audit the day's changes. It found three HIGH-severity findings, all in the SSRF guard added for BUG-DM-18:

    • IPv6-mapped IPv4 (::ffff:127.0.0.1) wasn't being matched. Node's URL constructor returns [::ffff:7f00:1] for that host, which the original isPrivateHost regex didn't catch.
    • IPv6 ULA (fc00::/7) and link-local (fe80::/10) ranges weren't checked at all.
    • Trailing-dot FQDN form (localhost.) bypassed the exact-string match.

    Fixed all three plus added a length cap on the warmup endpoint (was logging unbounded user-controlled strings via pino).
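
    For illustration, a host check that covers the three bypasses might look like this. It is a simplified sketch, not the project's netGuards code: the function bodies are assumptions, and it only inspects the literal host (it does not resolve DNS, so a public name pointing at a private address would pass).

```typescript
type Octets = [number, number, number, number];

function parseIPv4(host: string): Octets | null {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return null;
  const o = m.slice(1).map(Number);
  if (o.some((n) => n > 255)) return null;
  return [o[0], o[1], o[2], o[3]] as Octets;
}

function isPrivateIPv4([a, b]: Octets): boolean {
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

function isPrivateHost(rawUrl: string): boolean {
  let host: string;
  try {
    host = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return true; // unparsable input is refused rather than trusted
  }
  host = host.replace(/\.$/, ""); // trailing-dot FQDN form, e.g. "localhost."
  if (host === "localhost" || host.endsWith(".localhost")) return true;

  const v4 = parseIPv4(host);
  if (v4) return isPrivateIPv4(v4);

  if (host.startsWith("[") && host.endsWith("]")) {
    const v6 = host.slice(1, -1);
    if (v6 === "::1" || v6 === "::") return true;
    // The URL parser serializes ::ffff:127.0.0.1 as ::ffff:7f00:1,
    // so the embedded IPv4 address has to be rebuilt from two hextets.
    const mapped = /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/.exec(v6);
    if (mapped) {
      const hi = parseInt(mapped[1], 16);
      const lo = parseInt(mapped[2], 16);
      return isPrivateIPv4([hi >> 8, hi & 0xff, lo >> 8, lo & 0xff]);
    }
    const first = parseInt(v6.split(":")[0] || "0", 16);
    if ((first & 0xfe00) === 0xfc00) return true; // ULA fc00::/7
    if ((first & 0xffc0) === 0xfe80) return true; // link-local fe80::/10
  }
  return false;
}
```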

    The same Sentinel pass also verified that the fixes for BUG-DM-16 (prefix-match boundary in pickDefaultModelForPreset, which was matching Qwen2.5-7B against Qwen2.5-7B-Vision-Instruct instead of Qwen2.5-7B-Instruct-FP8) and BUG-DM-17 (length cap on the /api/models/:name/info path param) close their respective holes.

    30 new netGuards unit tests + 9 new cloud-key-validate integration tests cover the bypasses.

    Then the Models tab UI polish

    Live testing surfaced two cosmetic issues:

    The "Search models..." input at the top of the Models tab was rendering on all three sub-tabs (Discover, My Models, Providers). Useful on Discover (you're browsing a catalog). Redundant on My Models (the eight inline role pickers ARE search inputs with shared autocomplete). Useless on Providers (at most a dozen presets). Now only renders on Discover.

    When you picked a model from the Default role picker (the only <input list>-based picker in My Models), the input box turned white -- Chromium applies a :-webkit-autofill background highlight on inputs that get a value via native datalist autocomplete, and our dark theme hadn't overridden it. Fixed with the standard inset box-shadow CSS trick that overrides the autofill background.

    What didn't make it

    • Worker-thread refactor for LLM calls -- the proper architectural fix for the cold-start UX issue. Moves LLMService and providers into a worker_threads Worker so the Express main thread stays responsive during LLM calls. Eliminates the entire class of "backend looks dead during chat" bugs. ~3-5 days, planned for beta.19.
    • File splits -- LLMService.ts (863L), useFirstRunMachine.ts (663L), ProviderCard.tsx (465L) are all over their respective limits. Beta.19.
    • Warmup-debounce -- useModelWarmup currently fires twice during onboarding because of a transient state during the reconfigure cycle. Wastes one Featherless request per onboard. Trivial fix (~10 lines), beta.19.
    • Optimistic user-message-shows-immediately in empty state -- currently the typed text disappears the moment Send is pressed, before the new chat session UI renders. Should show in the chat area immediately. Filed for beta.19.

    Why it took so long

    The root-cause fix for the silent-fail bug (banner-in-empty-state) was a 30-line change. It took ~10 hours to get there because the symptom looked like a network/race issue -- POST timing out, optimistic message reconciling wrong, abort signal firing, fetch never reaching the backend. Twenty rounds of retry mechanics, keep-alive tweaks, in-flight dedup, and abort handling each helped a little but didn't fix it. The actual cause was upstream of all that: there was no UI surface to render the error on.

    Web research finally gave the angle to look at: "what if the error IS firing and we just can't see it?" Tracing render trees instead of network paths landed the fix in 30 minutes after that.

    The lesson, if there is one: when twenty rounds of patches each seem to almost-work but don't quite, the failure mode probably isn't where you think it is. Stop patching, trace the actual symptom path from the bottom up.

    That's beta.18.1. Beta.19 starts tomorrow with the worker-thread refactor.

  3. beta.18: V2 Cloud APIs + two ship-blockers caught live

    Update on today's bug fixing: found a lot of issues with cloud providers, specifically DeepSeek, Qwen, and Kimi. Doing fixes as we speak, then sanity testing before shipping beta.18. Be on the lookout for an update; will try my best to ship all these fixes today.

    Beta.18 is now live. Below is the full changelog for all work done today.

    V2 Cloud APIs -- what shipped

    Per-provider keys. Every cloud provider has its own settings field now. Switching presets keeps each provider's key -- flip from DeepSeek to Mistral to OpenAI without re-pasting anything. 13 providers wired: OpenAI, Anthropic, Gemini, OpenRouter, Azure, Mistral, Cohere, DeepSeek, Fireworks, Groq, Together, Qwen, Kimi.

    Cloud API onboarding flow. New "Cloud API" path in first-run leads to an 11-provider grid → per-provider key entry → validation → first chat. Region toggles for Qwen + Kimi (international vs China). Resource hostname for Azure. Mirrors how llama.cpp + Ollama install patterns work.

    Per-message cost tracking. Every cloud BYOK response now ships with a $0.0091-style badge next to message metadata. Click for input/output token split. Settings → Cloud API Keys → Spend summary shows session / today / this-month columns plus per-provider breakdown. Pricing tables verified 2026-05-08 for all 13 providers (DeepSeek $0.07/$0.28 flash, Anthropic Sonnet $3/$15, OpenAI gpt-4o $2.50/$10, etc.).
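
    The badge arithmetic is presumably along these lines. This sketch assumes the price pairs quoted above are USD per million input tokens and per million output tokens (the log does not state the unit), and messageCostUsd and formatBadge are hypothetical names.

```typescript
interface Pricing {
  inputPerMTok: number;  // assumed unit: USD per 1M input tokens
  outputPerMTok: number; // assumed unit: USD per 1M output tokens
}

function messageCostUsd(inputTokens: number, outputTokens: number, p: Pricing): number {
  return (inputTokens * p.inputPerMTok + outputTokens * p.outputPerMTok) / 1_000_000;
}

// Four decimal places, matching the $0.0091-style badge.
function formatBadge(usd: number): string {
  return `$${usd.toFixed(4)}`;
}

// Example with the gpt-4o pair quoted above ($2.50 in / $10 out):
// 1,000 input tokens and 500 output tokens.
const exampleCost = messageCostUsd(1_000, 500, { inputPerMTok: 2.5, outputPerMTok: 10 });
console.log(formatBadge(exampleCost)); // prints "$0.0075"
```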

    Optional API key field for local providers. LocalAI behind a reverse proxy, llama-server with --api-key, LM Studio with auth -- every OpenAI-compatible local preset now exposes the key field, labeled "optional" for local and "required" for cloud.

    Agentic loop -- two ship-blocker bugs fixed live

    Bug 1: /think prefix sent to non-Qwen models. A previous change extended Qwen3's /think slash-prefix toggle to the DeepSeek family on the assumption R1+ uses the same toggle. It does not -- DeepSeek's thinking is automatic via inline <think> blocks. Result: every iteration prepended /think\n\n, the model parsed it as an unknown user command, and refused with "/think is not a registered command in this environment." 20-iteration refusal loop, $0.034 burned, no answer. Fixed: Qwen-only, with regression tests.

    Bug 2: Over-eager-tool-use nudge fired on legitimate exploration. Two classifiers disagreed: RuntimeLayer correctly routed "explain this repo" to the full lane, but NudgeOrchestrator's simpler heuristic flagged it as a knowledge question and emitted "STOP READING FILES. Answer from training knowledge." Qwen3 quoted the contradiction back to the user as a 5-point rebuttal. Fixed: both nudges now route through RuntimeLayer.isExplorationIntent first, single source of truth.

    New: mid-loop reasoning visibility. ThinkingDisclosure was nested inside the streaming-response block, so it only showed AFTER the model started emitting final-response content. During tool-calling iterations the panel went dark. New render path surfaces reasoning during tool-calling phases. The same change would have caught both today's bugs visually within seconds.

    Plus: time-aware reassurance copy at 15s/60s/120s thresholds. Path context in tool rows (Listed agentic → Listed .../services/agentic). Cross-mode state leaks (chat ↔ code) closed for tool approvals, clarifications, and plan approvals.

    Also in beta.18

    Reasoning persistence (Phase 2J complete). Cloud thinking-mode models -- DeepSeek R1, Qwen3, Kimi K2 thinking -- now have reasoning blocks persisted to the DB AND threaded through the agentic loop's in-memory message array. Pre-fix, second tool-call iteration would 400 with "reasoning_content must be passed back."

    Smarter iteration cap. New isExplorationIntent classifier looks for an exploration verb (explain / describe / trace / how does / etc.) combined with a codebase target (repo / module / architecture / flow). "Explain this repo" → up to 25 iterations. "Pick a number 1-9" → simple-task fast path. The previous heuristic gave both prompts the same 8-iteration cap.
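
    A toy version of the verb-plus-target idea, for illustration only. The word lists contain just the examples named above (the real lists are longer), the whole-word regex matching is an assumption, and laneFor is a hypothetical helper.

```typescript
// Partial lists: only the examples named in the log.
const EXPLORATION_VERBS = ["explain", "describe", "trace", "how does"];
const CODEBASE_TARGETS = ["repo", "module", "architecture", "flow"];

const EXPLORATION_MAX_ITERATIONS = 25;

// Whole-word match with an optional plural "s" (assumed matching rule).
function containsAny(text: string, words: string[]): boolean {
  return words.some((w) => new RegExp(`\\b${w}s?\\b`).test(text));
}

// Both halves are required: a verb alone ("explain recursion") is a
// knowledge question, not a request to explore the codebase.
function isExplorationIntent(prompt: string): boolean {
  const text = prompt.toLowerCase();
  return containsAny(text, EXPLORATION_VERBS) && containsAny(text, CODEBASE_TARGETS);
}

function laneFor(prompt: string): "exploration" | "simple" {
  return isExplorationIntent(prompt) ? "exploration" : "simple";
}
```

    Requiring both a verb and a target is what separates "Explain this repo" from "Pick a number 1-9", though a keyword matcher like this will miss phrasings that are not in its lists.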

    Qwen iteration thinking-only loop fix. Qwen models occasionally emit reasoning_content for an entire iteration without producing content or tool calls. The loop used to retry on empty content. Now recognizes the pattern as legitimate progress.

    FIM model routing. Default fim.mode flipped to 'off' (was 'auto'). Strict per-preset model allowlist prevents Mistral models from being mis-routed to DeepSeek FIM endpoints. Local llamacpp / Ollama users still have free choice.

    DeepSeek v4 + Kimi K2 model profiles updated. DeepSeek v4-flash (131k/16k, ~50B MoE) and v4-pro (131k/16k, ~671B MoE) replace the deprecated deepseek-chat / deepseek-reasoner aliases (which hard-error 2026-07-24). Kimi K2 temperature locked to 1.0 per Moonshot's docs -- slider disables for locked models.

    Cross-mode state leaks closed. Tool approvals, clarification cards, and plan approvals now scope by sessionId so a pending action from a code-mode session can no longer surface in chat-mode and vice versa.

    How to update

    If you're already running beta.17:

    Bodega checks for updates automatically on launch. Or trigger manually: Settings → About → Check for updates.

    The auto-updater pulls the new build, verifies the signature, and prompts you to restart. Your settings, sessions, and API keys are preserved across the update.

    Fresh install or you'd rather download manually: https://bodegaone.ai/download

    Direct installers for Windows, macOS, and Linux. The site auto-detects your platform.

    On the BYOK migration: First launch on beta.18 runs a one-shot migration that copies your existing llm.openai_api_key to the active provider's per-provider key (only when the destination is empty). Cloud Boost retains its key. No existing setups lose access. The migration runs once per user, gated so it can't double-fire.
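
    The migration rules above (copy only into an empty destination, run once) can be sketched as follows. The Map-based store and the migrations.byok_v2_done flag name are hypothetical; the per-provider key pattern follows the llm.<presetId>_api_key form used elsewhere in this log.

```typescript
type SettingsStore = Map<string, string>;

const LEGACY_API_KEY = "llm.openai_api_key";
const MIGRATION_FLAG = "migrations.byok_v2_done"; // hypothetical flag name

// Returns true if the migration ran on this call, false if already done.
function migrateLegacyApiKey(store: SettingsStore, activePresetId: string): boolean {
  if (store.get(MIGRATION_FLAG) === "1") return false; // gate: never double-fire

  const legacy = store.get(LEGACY_API_KEY);
  const destination = `llm.${activePresetId}_api_key`;
  // Copy, not move: the legacy key stays, and an existing per-provider
  // key is never overwritten.
  if (legacy && !store.get(destination)) {
    store.set(destination, legacy);
  }
  store.set(MIGRATION_FLAG, "1");
  return true;
}
```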

    If you previously had Cloud Boost + a separate cloud provider configured, both still work -- they're now architecturally separated at the storage layer so you can run, say, DeepSeek as primary + OpenRouter as boost simultaneously.

    By the numbers

    • 13 PRs merged between beta.17 (May 7) and beta.18 (May 8)
    • Backend tests: 3,918 passing (+146 over beta.17)
    • Frontend tests: 525 passing (+64 over beta.17)
    • All tsc, ESLint, webpack clean
    • 2 ship-blocker bugs caught in pre-tag live verification -- both reproduced, both fixed with regression tests, both validated working post-fix
    • Live test: DeepSeek-v4-flash on "Explain this repo structure to me" → 2,556 output tokens at 103.2 tok/s for $0.0091. Pre-fix: 20-iteration refusal loop, $0.034, no answer. 73% cheaper, 25s vs 7+ minutes.

  4. One-click llama.cpp + Ollama setup, ~25 bugs squashed in beta.17

    One-click setup for both major local providers.

    If you've never touched a local model before, Bodega will install one for you. Pick "Set up llama.cpp" or "I'll install Ollama instead" on first launch. Bodega downloads the binary, verifies the SHA256 against the official release, runs the installer (UAC silent on Windows), waits for the service to come online, then drops you into a curated model picker sized for your hardware. You go from clicking the .exe to chatting with a local model in about 90 seconds. No terminal commands, no editing config files, no leaving the app.

    If you already have Ollama installed but the service is stopped, Bodega notices and just starts it instead of re-downloading 2 GB. Smart skip.

    ~25 bugs squashed between yesterday's beta.16 and today's beta.17:

    • Code mode broken on llama.cpp (tools weren't reaching the model). Fixed at the settings layer.
    • Modal dismissed before the model finished loading, leaving a confusing "no provider" banner. Now waits properly.
    • Internal model record IDs (managed-1778...) leaked into the UI in 5+ places. Replaced everywhere with the friendly filename.
    • Rate limiter was throttling localhost calls during onboarding, causing cascading "Catalog failed: Too Many Requests". Fixed.
    • Catalog downloads cancelled when you switched tabs. Now persist across navigation.
    • Guided tour rewrite: spotlight tracks animations, auto-flips when there's no room, opens panels for you instead of silently skipping.
    • High-VRAM users got auto-pulled a 17 GB model with no choice. Now you pick.
    • Shell injection hardening in the installer paths (caught by our pre-ship security review).
    • Plus a dozen smaller polish items. See the full changelog.
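
    On the rate limiter item: the log doesn't say how it was fixed, so treat this as one plausible shape rather than the real patch. The sketch exempts loopback addresses from a simple fixed-window limiter. RateLimiter and isLoopback are hypothetical names.

```typescript
// True for IPv4 127.0.0.0/8, IPv6 ::1, and IPv4-mapped loopback (::ffff:127.0.0.1).
export function isLoopback(address: string): boolean {
  const ip = address.startsWith("::ffff:") ? address.slice(7) : address;
  return ip === "::1" || /^127\.\d{1,3}\.\d{1,3}\.\d{1,3}$/.test(ip);
}

// Hypothetical fixed-window limiter: at most `limit` calls per `windowMs` per address.
export class RateLimiter {
  private hits = new Map<string, { count: number; windowStart: number }>();

  constructor(private readonly limit: number, private readonly windowMs: number) {}

  allow(address: string, now: number = Date.now()): boolean {
    // The app's own localhost calls are never throttled.
    if (isLoopback(address)) return true;

    const entry = this.hits.get(address);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(address, { count: 1, windowStart: now });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}
```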

    Hit a bug? Help us help you.

    Settings → About → Export Diagnostics Bundle. One click, saves a redacted text report with your app settings, system info, and recent logs. API keys and secrets are stripped before the file gets written. Safe to paste into a support thread here.
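
    To make "stripped before the file gets written" concrete, here is a rough sketch of key-based redaction over a settings object. The key pattern and the [REDACTED] placeholder are assumptions for illustration; the real bundle may redact differently.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Assumed pattern for sensitive setting names; extend as needed.
const SENSITIVE_KEY = /(api[_-]?key|secret|token|password)/i;

// Return a deep copy with every value under a sensitive-looking key replaced.
export function redact(value: Json): Json {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}
```

    Redacting by key name alone misses secrets pasted into free-text fields, so a real exporter would likely also scan values and log lines.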

  5. Packaging update: a few more days before beta ships

    Sorry for the delay. During packaging and Mac certification, quite a few bugs surfaced. We've made a lot of progress and are wrapping up final testing so we can be confident the beta is in a good state to ship. We ask for a bit more patience; it should be out within the next couple of days. Thank you again.

  6. T-2 beta hardening: troubleshooting docs, NSIS audit, unsigned binary math

    T-2 days to the May beta. Spent today running the final-push hardening pass -- every place users could hit a wall on day one.

    Two troubleshooting guides shipped. A short one in the public release repo for people who land on the GitHub release page directly (SmartScreen, install.ps1 ExecutionPolicy, AV quarantine, Mac Gatekeeper, AppImage chmod, FUSE2, sandbox helper, auto-update edge cases). And a comprehensive one for the website covering everything else: provider setup across Ollama / llama.cpp / LM Studio, model pull failures, Cloud Boost auth, BYOK keys, agentic-loop edge cases, FIM autocomplete, air-gap mode, workspace sandbox, settings + logs + factory reset, beta expiry, telemetry. Roughly 50 distinct failure modes documented, each grounded in actual code paths instead of guesses.

    NSIS installer audit caught three real bugs. Ran a real local Windows build instead of trusting the config. Three things would have shipped broken:

    1. The installer never produced a .exe. Two NSIS macro hooks (customUnWelcomePage, customUnInstFilesPage) aren't actually exposed by electron-builder, so the dark-theme functions they referenced were dead code. NSIS escalated the unreferenced-function warning to error and killed the build before the .exe stage. Would have been a silent CI fail.

    2. package.json:homepage still pointed at my old personal handle on the now-private source repo. NSIS bakes that URL into the uninstaller's "About / Help / Updates" metadata. Every end user would see it. Flipped to bodegaone.ai.

    3. The portable zip target had its filename config in the wrong block. Portable config existed but portable wasn't in the win.target array, so the actual zip target was producing a version-suffixed filename with spaces in it. The branded /download/portable URL would have 404'd. Two-line fix.

    All three caught because I built locally. The CI was happy. Configs lie; outputs don't.

    One thing to be upfront about: the Windows beta ships unsigned. SmartScreen will throw a warning the first time anyone runs the .exe. That's expected, not malware. Two paths around it:

    • PowerShell one-liner (irm https://bodegaone.ai/install.ps1 | iex) -- runs in-memory, calls Unblock-File post-extract, doesn't trip SmartScreen. This is the headline install method.
    • NSIS installer -- works fine, just needs "More info → Run anyway" the first time.

    The reason it's unsigned: Microsoft changed the EV cert rules in 2024 -- paying for an EV cert no longer gives you instant SmartScreen pass-through. So a $300/yr cert buys you nothing the PowerShell path doesn't already give for free. We'll sign for v1.0 GA when there's a real signal premium for it. Mac signing is paid + pending Apple Dev approval (~24h ETA), so notarized DMGs come online with the next tag. Linux AppImage isn't signed because Linux doesn't really care.

    On deck for tomorrow. Apple Dev cert approval should land. Patch the security findings, build the beta-expired screen, swap the eager imports for React.lazy, run a final swarm pass on the patched code, then it's tag time.

    The pattern with these final-push days is the same: the code can pass every test, every type-check, every lint, and still ship with a half-dozen bugs that only surface when a real user touches it for the first time. That's the gap the swarm exists to close. Catching three NSIS bugs and a credentials leak in one morning is not a flex -- it's what should happen before a binary goes public.

    Building in public means showing the cleanup, too.

  7. 60+ commits, the last god file, and two bugs the cleanup exposed

    One day, 60+ commits, the biggest structural cleanup the codebase has seen. Also found two real bugs that were hiding in plain sight: one silently dropping messages on every cloud call, one quietly capping everyone's context window to 32K regardless of their hardware. Both shipped-and-in-production-for-weeks type bugs. Both fixed.

    Breaking this into the three things that actually happened:

    1. Killed the last god file

    AgenticChatService.ts was 1,386 lines. The 700-line service limit has been enforced across the codebase for months. This file kept getting a pass because "it's the orchestrator, orchestrators are allowed to be big." That stopped being true today.

    The split: 7 new modules under services/agentic/. RequestNormalizer owns input validation and intent classification. ProviderResolver resolves the LLM provider, model, and boost-routing decision. AgenticLoopSetup builds everything the loop needs before iteration 0 -- messages, tools, policy, contract, skill context, telemetry. AgenticLoopCoordinator runs the actual loop. Branches A (native tool calls) and B (XML-extracted tool calls) each live in their own file now with a shared BranchPipelineHelpers module underneath.

    AgenticChatService itself is now 230 lines of pure HTTP-facing dispatch.

    Three other files also went under the knife:

    • LLMCallExecutor (668 → 420) -- context-overflow recovery and model-fallback wrappers extracted into ProviderFallbackHandler.
    • StructuralVerifier (611 → 129) -- per-language verifiers (TypeScript, Python, Go/Rust, Generic) moved to a verifiers/ subfolder, dispatcher stays tiny.
    • PostLoopProcessor (609 → 206) -- MemoryExtractor and LoopOutcomeRecorder pulled out as siblings.

    Frontend got the same treatment: ChatInput.tsx 464 → 186 (three hooks + two presentational components extracted), AgentChatPanel.tsx 419 → 276 (useAgentChatPanel hook extracted). Zero files over the god-file limit anywhere in the codebase for the first time I can remember.

    2. The bug the split exposed

    The split itself was mechanical. The interesting part started after the merge, when Cloud Boost tests began failing. Not because the split broke anything -- it didn't -- but because I was now tracing one code path end-to-end for the first time instead of assuming it worked.

    Backend log showed a 400 from Anthropic: messages: at least one message is required.

    The user's turn was being stripped before Claude ever saw it. Always. For weeks. I'd written it off as "boost is flaky" the handful of times I'd noticed.

    Root cause: ContextAssemblyService.trimToContextWindow was reading the global llm.context_window setting (default 32K, baked in back when Ollama was the only supported provider) and Math.min-ing it against the model's real capability. On my 5090 calling Claude Sonnet 4 with its 200K native context, the 20K Bodega system prompt overflowed the imagined 32K budget. The trimmer returned only system messages, dropped the user turn, Anthropic rejected the request, the agentic loop caught the error, the stream silently closed with a "done" frame. The user sees "the model didn't answer."

    Fixed two ways: thread the resolved model name into the trimmer so cloud providers use their actual native context window, and add a hard safety floor that guarantees the last user turn is always preserved even when system overflow is catastrophic. Cloud providers will never get messages: [] from us again.
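
    The safety floor is easiest to see in code. This is a simplified sketch, not the real trimToContextWindow: the token estimate (about 4 characters per token) and the priority order after the last user turn are assumptions.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Crude token estimate, for illustration only.
const estimateTokens = (m: ChatMessage): number => Math.ceil(m.content.length / 4);

export function trimToBudget(messages: ChatMessage[], budget: number): ChatMessage[] {
  const lastUserIndex = messages.map((m) => m.role).lastIndexOf("user");
  const keep = new Set<number>();
  let used = 0;

  // Safety floor: the last user turn is kept even if it alone exceeds the budget.
  if (lastUserIndex >= 0) {
    keep.add(lastUserIndex);
    used += estimateTokens(messages[lastUserIndex]);
  }

  // System messages next, as long as they still fit.
  messages.forEach((m, i) => {
    if (m.role === "system" && used + estimateTokens(m) <= budget) {
      keep.add(i);
      used += estimateTokens(m);
    }
  });

  // Fill what is left with history, newest first.
  for (let i = messages.length - 1; i >= 0; i--) {
    if (keep.has(i) || messages[i].role === "system") continue;
    const cost = estimateTokens(messages[i]);
    if (used + cost > budget) break;
    keep.add(i);
    used += cost;
  }

  // Preserve the original ordering of whatever survived.
  return messages.filter((_, i) => keep.has(i));
}
```

    The point of reserving the user turn first is that an oversized system prompt can no longer produce a request with no user message in it.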

    3. The Ollama tax

    The trimmer fix was tactical. The underlying setting, llm.context_window as a global ceiling, was the strategic problem. It was making every provider look like Ollama.

    5090 user with 32GB VRAM? Could easily run 128K context on a 7B model locally. Setting says 32K. Locked to 32K.

    Calling Claude Sonnet (200K native)? Setting says 32K. Locked to 32K.

    Calling GPT-4o (128K)? Locked to 32K.

    Nobody was getting what their provider was capable of. The default was silently the ceiling for everyone.

    Spent the rest of the day rebuilding context window resolution as provider-kind aware. New module ContextWindowResolver with a four-level priority chain:

    1. Per-model override (if user explicitly set one for this model)
    2. Explicit global cap (if user set one non-default -- 0 is the new "unset" sentinel)
    3. Cloud providers: use the model's native max, no VRAM math
    4. Local providers: apply a hardware-computed ceiling based on VRAM tier

    Hardware probe now writes a recommended local ceiling to settings on first run. VRAM tier table: ≤4GB → 8K, 4-8GB → 16K, 8-16GB → 32K, 16-24GB → 64K, 24GB+ → 128K. The probe does this automatically; users see the detected value, can cap it further if they want.
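
    The priority chain and tier table reduce to a small pure function. A sketch under stated assumptions: "K" means 1024 tokens, tier boundaries go to the lower tier except 24 GB (the "24GB+" wording puts it in the top tier), and the local ceiling is clamped to the model's native max. The field names are hypothetical, not the real ContextWindowResolver API.

```typescript
type ProviderKind = "cloud" | "local";

interface ResolveInput {
  perModelOverride?: number; // tokens, set explicitly by the user for this model
  globalCap: number;         // tokens, 0 is the "unset" sentinel
  providerKind: ProviderKind;
  modelNativeMax: number;    // tokens the model supports natively
  vramGb: number;            // detected VRAM, only consulted for local providers
}

const K = 1024;

// VRAM tier table from the entry: ≤4GB → 8K, 4-8GB → 16K, 8-16GB → 32K, 16-24GB → 64K, 24GB+ → 128K.
export function localCeiling(vramGb: number): number {
  if (vramGb <= 4) return 8 * K;
  if (vramGb <= 8) return 16 * K;
  if (vramGb <= 16) return 32 * K;
  if (vramGb < 24) return 64 * K;
  return 128 * K;
}

export function resolveContextWindow(input: ResolveInput): number {
  // 1. Per-model override wins.
  if (input.perModelOverride !== undefined && input.perModelOverride > 0) return input.perModelOverride;
  // 2. Explicit global cap, unless it is the 0 sentinel.
  if (input.globalCap > 0) return input.globalCap;
  // 3. Cloud providers: native max, no VRAM math.
  if (input.providerKind === "cloud") return input.modelNativeMax;
  // 4. Local providers: hardware ceiling, never above what the model supports.
  return Math.min(input.modelNativeMax, localCeiling(input.vramGb));
}
```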

    Migration is idempotent. Users who explicitly tuned their context window get that value preserved. Users on the old default get the sentinel, which means auto-resolution takes over.

    Lesson of the day: refactoring clean code finds nothing. Refactoring tangled code finds bugs the tangle was hiding.

  8. 27 bugs, 8 features, zero regressions

    Big session. Ran a full end-to-end audit of the entire app, closed 27 bugs, then started on 8 features pulled from competitive research. 2,523 tests passing at the end.

    The security ones matter most. Three sandbox escape paths in GrepTool, GlobTool, and DiffFileTool -- arguments weren't being re-anchored to the workspace root, so a carefully crafted path could read outside it. Command injection in RunTestsTool: test arguments flowed straight into the shell. All four closed before anything touched a release tag.
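
    For anyone wondering what "re-anchored to the workspace root" looks like, here is a minimal sketch using Node's path module. resolveInsideWorkspace is a hypothetical helper, not the actual tool code.

```typescript
import * as path from "node:path";

// Resolve a tool argument against the workspace root and reject anything that lands outside it.
export function resolveInsideWorkspace(workspaceRoot: string, requested: string): string {
  const root = path.resolve(workspaceRoot);
  const target = path.resolve(root, requested);
  const relative = path.relative(root, target);

  const escapes =
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows

  if (escapes) {
    throw new Error(`Path escapes workspace: ${requested}`);
  }
  return target;
}
```

    This is a purely lexical check. It does not follow symlinks, so a hardened version would also compare real paths on disk.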

    The "why is this broken" ones. Image attachments via the file picker rejected every image as "binary." Chat mode's agent was going rogue with docs attached: trying code tools, burning iterations, then hallucinating a fake answer. Research mode was firing web searches against the embedded doc content instead of the actual question. All three were integration bugs that passed unit tests but fell apart under real usage.

    The off-by-ones. Circuit breaker was tripping at exactly 80% of budget, killing valid conversations at the threshold. Budget enforcement blocked at the limit instead of when exceeded. Air-gap mode wasn't disabling the cloud boost toggle in the UI -- it was enforcing air-gap at the network layer, but the toggle was still live. These are the bugs that make people lose trust. Fixed.

    Onboarding completion never persisted, so the checklist reappeared every restart. Session deletion orphaned memory entries. The memory route crashed on corrupt JSON. The write mutex had no timeout and could hang the app forever. Plus 363 lines of dead CSS from themes we removed weeks ago. And 12 more across the stack.

    What I shipped on top of that:

    • Chat mode now has read-only code tools (grep, glob, file read, code search). You can ask the model about your codebase without switching to Code mode. This was the #1 feature request after the last audit.
    • FIM inline completions now default ON. They were behind an opt-in toggle that nobody found. Better first-run experience.
    • Smart tool stripping. Some small models repeatedly try to call tools that are blocked for their context. After a few failures, the tool list gets removed entirely and the model is forced to respond as text. Reduces the "agent flailing" loop that kills small-model usability.
    • Settings progressive disclosure. New users see Theme, Models, Profile, Boost. That's it. Everything advanced lives behind a toggle. The settings page was turning into a wall.
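
    Smart tool stripping is a small amount of state. A sketch, assuming "a few failures" means three and using hypothetical names:

```typescript
interface ToolDef {
  name: string;
}

// Counts blocked tool calls and, past a threshold, offers the model no tools at all.
export class ToolStripper {
  private blockedAttempts = 0;

  constructor(private readonly maxBlockedAttempts = 3) {}

  recordBlockedCall(): void {
    this.blockedAttempts += 1;
  }

  // Tools to send with the next LLM call: the full list, or nothing once the model has flailed enough.
  toolsForNextCall(tools: ToolDef[]): ToolDef[] {
    return this.blockedAttempts >= this.maxBlockedAttempts ? [] : tools;
  }
}
```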

    Next session: cloud provider as primary (not just boost -- for users without a local GPU), auto-verify after code changes (run tests and build, feed errors back), panel hand-off buttons ("Fix this" from Debug or Research jumps into Agent), and conversation export (JSON + ZIP backup).

  9. The agent that learns mid-run

    First time I watched Qwen 3.5 hallucinate spawn_worker three times in one run, I knew the cross-session learning system wouldn't be enough. Today we closed both gaps.

    Within-loop learning. Before today, Bodega's agent learning was cross-session only. The model makes a mistake, we record it, and the next session gets a rule injected into its context: "don't do that." Works great. 5-stage pipeline, Bayesian confidence tracking, hard pre-execution blocking. But if the model hallucinates a tool in iteration 3 of a 15-iteration run, it could repeat that mistake in iterations 4, 5, and 6 before the session ends and the rule kicks in.

    Now it can't. SessionRuleBuffer records failures in-memory during the run. Iteration 3 fails → iteration 4 sees a SESSION RULES block and the tool is hard-blocked before execution. Max 3 temp rules per session to avoid prompt instability. Rules are ephemeral: injected before each LLM call, stripped after. One new file, 76 lines.
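
    Roughly what a buffer like this could look like. This is a sketch of the idea, not the 76-line file itself; the method names and the wording of the injected block are assumptions.

```typescript
// In-memory, per-run record of tools that failed and should not be retried.
export class SessionRuleBuffer {
  private readonly blocked = new Map<string, string>(); // tool name -> reason

  constructor(private readonly maxRules = 3) {}

  // Record a failure; ignored once the cap is reached or the tool is already listed.
  recordFailure(toolName: string, reason: string): void {
    if (this.blocked.has(toolName) || this.blocked.size >= this.maxRules) return;
    this.blocked.set(toolName, reason);
  }

  // Checked before execution to hard-block a repeat call.
  isBlocked(toolName: string): boolean {
    return this.blocked.has(toolName);
  }

  // Text injected ahead of the next LLM call; empty when there are no rules.
  renderBlock(): string {
    if (this.blocked.size === 0) return "";
    const lines = Array.from(this.blocked.entries()).map(
      ([tool, reason]) => `- Do not call ${tool}: ${reason}`
    );
    return ["SESSION RULES", ...lines].join("\n");
  }
}
```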

    Rule confidence persistence. We had a Bayesian tracker computing which learned rules were actually working (Beta distribution, alpha/beta posteriors). It flagged rules with confidence below 0.3 for demotion. The math was there. The DB write wasn't. Rules that looked good on paper but triggered false positives were accumulating forever. 20-line fix: shouldDemote() now calls deactivateRule() and the rule goes inactive in SQLite.
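
    The math plus the missing write, as a sketch. It assumes a uniform Beta(1, 1) prior and uses the posterior mean as the confidence; the real tracker may weight things differently. demoteIfNeeded and the callback shape are hypothetical.

```typescript
export interface RuleStats {
  helped: number;         // times the rule fired and was right
  falsePositives: number; // times the rule fired and was wrong
}

// Posterior mean of Beta(1 + helped, 1 + falsePositives).
export function confidence(stats: RuleStats): number {
  const alpha = 1 + stats.helped;
  const beta = 1 + stats.falsePositives;
  return alpha / (alpha + beta);
}

export function shouldDemote(stats: RuleStats, threshold = 0.3): boolean {
  return confidence(stats) < threshold;
}

// The step that was missing: act on the decision instead of only computing it.
export function demoteIfNeeded(
  ruleId: string,
  stats: RuleStats,
  deactivateRule: (id: string) => void
): boolean {
  if (!shouldDemote(stats)) return false;
  deactivateRule(ruleId);
  return true;
}
```

    With one success and six false positives the confidence is 2/9, about 0.22, so the rule is deactivated.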

  10. 156 E2E tests, Playwright Electron, dark installer, Sentry crash reporting

    Spent the day doing something most indie devs skip: writing real end-to-end tests that launch the actual Electron app and exercise every feature through the UI.

    Not unit tests. Not API tests. Full Playwright Electron tests that boot the app, switch between modes, send messages to local LLMs, verify tool calls hit the disk, and confirm that context survives compaction. 156 tests across 21 spec files. 98.7% pass rate.

    What the tests found: compact was using a stale model config, agent panels weren't auto-scrolling, reasoning-only models showed blank responses, and the auto-router was sending simple prompts to tiny models. All fixed. Also shipped dark-themed NSIS installer with matching uninstaller, crash reporting via Sentry (opt-in, respects air-gap), TopBar layout rearranged per feedback.

  11. Loop write guard, approval card fix, E2E Round 2

    Two things were driving me crazy about the agentic loop. One: the agent would write a file, re-verify, decide it wasn't done, and write the file again. And again. Same content, same path, different iteration. Two: approval cards would appear mid-stream and you'd never see them because they rendered outside the scroll container.

    Both fixed. The repeat-write guard now tracks writes per file path across the loop -- after 3 writes to the same file in a single session, it injects a system message, marks the deliverable satisfied, and breaks the cycle. Approval cards moved inside the scroll container so they actually travel with the content. E2E Round 2 ran 31 tests. Found 11 bugs across todo_write registration, model routing, panel scroll, web search iteration caps, and VRAM warning noise. All 11 fixed and committed.
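
    The guard itself is just a counter keyed by path. A minimal sketch with hypothetical names; injecting the system message and marking the deliverable satisfied are left to the caller:

```typescript
// Counts writes per file path within one agentic loop.
export class RepeatWriteGuard {
  private readonly writes = new Map<string, number>();

  constructor(private readonly maxWrites = 3) {}

  // Returns true once the same path has been written maxWrites times,
  // which is the caller's signal to break the rewrite cycle.
  recordWrite(filePath: string): boolean {
    const count = (this.writes.get(filePath) ?? 0) + 1;
    this.writes.set(filePath, count);
    return count >= this.maxWrites;
  }
}
```

    In practice the path should be normalized before counting, otherwise ./src/a.ts and src/a.ts would be tracked as different files.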

  12. Chat → Runtime → Loop → QEL

    Shipped the Runtime Layer today. This one is more architectural than visible, but it matters.

    The problem: before each agentic loop, there were ~150 lines of scattered conditionals spread across the chat orchestrator. Is this session in a panel? What iteration budget applies? Does this model support tools? What happens after 3 consecutive failures? These questions were answered in different places with inconsistent logic.

    RuntimeLayer.ts consolidates all of that into a single typed LoopPolicy that gets produced before the loop starts. The classify() call looks at the request, the model's capability profile, the panel context, and the session failure history -- then produces a LoopPolicy with a single executionLane value.

    Four execution lanes:

    • advisory -- bypasses the loop entirely, single LLM call, no tools. Fast. For panels that just need a quick answer.
    • guided -- up to 8 iterations, limited tool set. For supervised agent work.
    • restricted -- panel-constrained tool allowlist. Research panel only gets research tools.
    • full -- complete tool access, computed iteration budget. Normal code mode.

    The capability detection piece is new: CapabilityProfile reads the model's known abilities (tool calling tier: native/xml/weak/none; structured output; reasoning) and can downgrade the lane automatically if the model can't support what was requested. No more sending tool calls to a model that'll ignore them.

    Dynamic failure tracking: if a session sees 3 consecutive tool failures, the lane downgrades automatically for the rest of the session. The model gets fewer chances to break things.
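
    A stripped-down sketch of the classify step. The lane names, the tool tiers, the 8-iteration cap for guided, and the 3-failure trigger come from the description above. Everything else is an assumption made for illustration: which lane a downgrade lands on, treating advisory as a budget of 1, and passing the computed budget in as a number. The real RuntimeLayer also looks at panel context, which is omitted here.

```typescript
type ExecutionLane = "advisory" | "guided" | "restricted" | "full";
type ToolTier = "native" | "xml" | "weak" | "none";

interface ClassifyInput {
  requestedLane: ExecutionLane;
  toolTier: ToolTier;
  consecutiveToolFailures: number;
  computedBudget: number; // iteration budget computed elsewhere
}

interface LoopPolicy {
  executionLane: ExecutionLane;
  maxIterations: number;
}

export function classify(input: ClassifyInput): LoopPolicy {
  let lane = input.requestedLane;

  if (input.toolTier === "none") {
    // Assumed: a model that cannot call tools gets the no-tools lane.
    lane = "advisory";
  } else if (lane === "full" && (input.toolTier === "weak" || input.consecutiveToolFailures >= 3)) {
    // Assumed: weak tool calling or repeated failures drop full access to guided.
    lane = "guided";
  }

  const maxIterations =
    lane === "advisory" ? 1 : lane === "guided" ? 8 : input.computedBudget;

  return { executionLane: lane, maxIterations };
}
```

    The value of producing one typed policy up front is that the loop never has to re-ask these questions mid-run.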

  13. Mar 24-26 -- Phase 9A through 9E

    Shipped the full memory system this week. Five phases in three days. This is the one I'm most proud of so far.

    The problem: every agentic loop iteration starts from scratch. The model has no memory of which files you've been editing together, what patterns you prefer, what errors you've hit before. Every session is day one.

    Phase 9 changes that. Here's what we built:

    • 9A -- HeuristicExtractor wired into the post-loop processor. After every agentic iteration, it extracts facts from what the agent observed and stores them in SQLite. Compression ratio confirmed at 5x+ on real sessions.
    • 9B -- FileAffinityTracker (tracks which files you co-edit, how often, how recently) + ImportGraphExtractor (static import graph for JS/TS/Python/Rust/Go). The context assembler uses both to inject the right files into the next session without you having to specify them.
    • 9C -- LLMObserver -- a second-pass LLM call that extracts implicit facts from assistant turns. Things the heuristic extractor misses. Runs async post-loop on hardware that can afford it, falls back to heuristic-only on low VRAM.
    • 9D -- Memory time decay. Observations have configurable half-lives by type. Stale memories fade instead of polluting context forever. BM25 relevance scoring added alongside recency decay.
    • 9E -- Evaluation harness. 25 scenarios covering injection, retrieval, dedup, decay, and cross-session recall. Memory metrics API exposed for debugging.
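
    The half-life decay in 9D comes down to one formula: weight = 0.5^(age / halfLife). A sketch follows. The observation types and their half-lives are made-up examples, and multiplying a relevance score by the decay weight is an assumption about how the two signals are combined.

```typescript
// Hypothetical half-lives in days, keyed by observation type.
export const HALF_LIFE_DAYS: Record<string, number> = {
  preference: 90,
  error: 14,
};

// Exponential decay: the weight halves every halfLifeDays.
export function decayWeight(ageDays: number, halfLifeDays: number): number {
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Combine a relevance score (for example from BM25) with recency decay.
export function memoryScore(relevance: number, ageDays: number, halfLifeDays: number): number {
  return relevance * decayWeight(ageDays, halfLifeDays);
}
```

    With a 14-day half-life, a memory that is 28 days old keeps a quarter of its original weight.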

    Total: 8 new service files, 2 new tools (CreateDocument, DeepResearch), memory pipeline fully wired end to end. This is what makes Bodega feel like it knows you over time.

  14. Mar 23 -- 30 bugs, one session

    Ran what we're calling Operation Fumigate last Sunday. The goal: clear every known bug before the next beta tag. Final count: 30 bugs fixed in one session.

    It was deliberately parallel. Stood up 4 squads, each with a defined scope and a dedicated branch. No overlap, no conflicts.

    • Squad 1 hit the code editor and FIM (fill-in-middle): 9 bugs. Monaco diff decoration bugs, inline fix streaming edge cases, FIM fence stripping failures.
    • Squad 2 took terminal and the Problems panel: 7 bugs. Terminal duplicate input handlers, xterm focus tracking using the wrong event, OSC 133 command block edge cases.
    • Squad 3 handled streaming and session infrastructure: 8 bugs. Double SSE events, streaming interrupted on panel navigation, session data leaks, permission mode enforcement in chat mode.
    • Squad 4 closed out settings, memory, and project management: 6 bugs. Settings not persisting across restarts, memory rate limit bypasses, orphaned settings keys.

    All 4 squads merged to dev by end of day. Doc sweep ran afterward -- all counts, changelogs, and references updated to match. Tagged beta.6 that same evening.

    The thing that made this work: clear blast radius per squad, no shared files, every fix against a real test case. 30 bugs with no regressions.

  15. Mar 17-18 -- Brain MCP + 15-agent team

    This is the part that doesn't look like normal solo indie dev.

    I've been building with an AI agent team. Not AI-assisted -- an actual team of 15 specialized agents coordinated through a shared memory system called Bodega Brain. Each agent has a defined role, its own identity file, and stays in its lane.

    The roster: Co-Dev (lead), Architect (structural health), Engineer (implementation), Fixer (bugs), Sentinel (security scanning), Scout (competitive intel), Strategist (product direction), QA Engineer, Doc Guardian, Performance Profiler, Integration Tester, Release Manager, Reviewer, UX Auditor, Writer.

    Each one runs on its own git branch. Co-Dev reviews their work, creates PRs, merges after CI passes. I have final say on anything touching main. It's a proper dev workflow, just with agents instead of contractors.

    The Brain is how they coordinate -- a shared system with messaging, task queues, workspace claiming, decision logging, and a live dashboard. When two agents might conflict on the same files, they claim workspaces and check for conflicts before starting.

    This session: 8 PRs reviewed, 5 merged to dev (LSP integration, unified model hub, god-file splits, security hardening, test coverage). The acceleration this enables is real. Phase 0-3 of the V2 overhaul shipped in 48 hours.

  16. QEL ships

    Spent the last few days hardening what I'm calling the QEL -- Quality Enforcement Layer. This was the biggest early architecture decision and it's worth explaining why it exists.

    Most AI coding assistants work like this: you ask a question, the model responds, done. There's no verification that what was produced actually matches what was asked. No check that the code compiles. No detection of stubs. The model hallucinates a solution and calls it a day.

    QEL changes that. Every agentic loop iteration runs three passes: contract extraction (what did the user actually ask for?), completion verification (did the response satisfy it?), and a mode firewall that prevents the wrong class of task from sneaking through. There's a test suite with a letter-grade output system -- the agent has to get an A or B before the response goes out.

    The architecture underneath is Express + SQLite for the backend, with a streaming pipeline that pushes Server-Sent Events to the frontend in real time. 15 defined SSE event types covering everything from tool calls to plan approvals to QEL verification results.
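
    For readers who haven't worked with Server-Sent Events: each event is a few text lines followed by a blank line. A minimal formatter sketch; the event name in the usage note is a hypothetical example, not one of the 15 real types.

```typescript
// Serialize one SSE frame. The payload must be JSON-serializable;
// JSON.stringify emits no raw newlines, so a single data: line is enough.
export function formatSseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
```

    An Express handler would set Content-Type: text/event-stream and then call something like res.write(formatSseEvent("tool_call", { name: "grep" })) for each event.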

    The other decision I made early: no god files. I've worked on enough codebases that became unmaintainable from one class doing everything. Bodega has hard line limits: 700 lines for service files, 400 for React components. When something hits the limit, it splits. This decision has already paid off four times.

    Current state: QEL shipping, 630 tests passing, agentic loop running on Ollama and OpenAI-compatible providers.

  17. Initial commit day

    Started building Bodega One. Here's what it is and why I'm building it.

    It's a local-first AI desktop IDE. Two modes: Chat Mode for general AI conversation, Code Mode for agentic software development. Runs entirely on your machine. No cloud dependency unless you want one.

    I got tired of tools that route everything through someone else's servers. Not because I have something to hide -- because I don't want to depend on a company's uptime, rate limits, or pricing decisions to do my work. Your code, your hardware, your data.

    The tech stack: Electron 40 for the desktop shell, React 19 + TypeScript on the frontend, Express + SQLite on the backend. It supports Ollama out of the box, with OpenAI-compatible endpoints as a fallback for when you need a heavier model.

    The thing I kept noticing with other AI coding tools is that they're mostly fancy autocomplete with a chat window bolted on. What I wanted was something that could actually reason about what it's doing -- extract requirements from what you ask, verify its own output, and refuse to ship half-finished work. That's the Quality Enforcement Layer. More on that later.

    First commit dropped today. It's rough but it runs. The bones are there.

    Building this in public. Wins, bugs, architecture decisions -- all of it.

For polished release notes, see the Changelog · Join Discord
