Models & Providers

This section covers everything in Settings → Models: how to connect a provider, configure your model roles, set up FIM autocomplete and codebase embeddings, and tune vision routing. If you're not sure where to start, open the Providers tab and pick a provider, then set your roles in My Models.

Opening the Model Hub

There are two ways to get there:

  • Status bar click - In Code mode, click the active model name at the bottom of the screen. This opens Settings and navigates directly to the Models section.
  • Settings nav - Press Ctrl+, and click Models in the left nav.

The panel has three tabs at the top: Discover, My Models, and Providers. The search field (top-right of the tab bar) works across catalog results and the installed model list simultaneously.

A hardware tier badge (e.g., Large - 24 GB) appears in the panel header when GPU data is available. An air-gap notice appears in the footer when air-gap mode is active.

Providers tab - connecting a provider

Bodega ships 24 pre-configured provider presets - 9 local (no API key required) and 14 cloud (API key required), plus one Custom OpenAI-Compatible slot.

To connect any provider:

  1. Open Settings → Models → Providers.
  2. Find the provider card and click its name to expand it.
  3. Fill in the base URL (pre-filled for most presets) and your API key if required.
  4. Click Test Connection - success shows the model count and latency (Connected - 5 model(s)).
  5. Click Set as Active to make it your primary provider. This saves the key and URL and clears stale model names from your previous provider.

After a successful test, Use This Provider appears as a shortcut that does steps 4 and 5 in one click.
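To sanity-check an endpoint outside the app, a connection test can be approximated as in the sketch below. It assumes the server exposes the OpenAI-style model listing at GET {base_url}/models, returning a JSON object with a data array. How Bodega runs its own test is not documented here, and summarize_models and check_connection are hypothetical names.

```python
import json
import time
import urllib.request


def summarize_models(payload: dict) -> str:
    """Format an OpenAI-style model listing like the status line in step 4."""
    count = len(payload.get("data", []))
    return f"Connected - {count} model(s)"


def check_connection(base_url: str, api_key: str = "", timeout: float = 5.0) -> tuple[str, float]:
    """Return (status text, latency in ms). Raises on network or HTTP errors."""
    request = urllib.request.Request(base_url.rstrip("/") + "/models")
    if api_key:
        # Assumption: the provider accepts a standard Bearer token header.
        request.add_header("Authorization", f"Bearer {api_key}")
    started = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    latency_ms = (time.perf_counter() - started) * 1000
    return summarize_models(payload), latency_ms
```

For a local server that serves an OpenAI-compatible API under /v1, the call would look like check_connection("http://localhost:11434/v1"). Cloud providers additionally need the api_key argument.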

Auto-detect: Click Auto-Detect Providers at the top of the Providers tab. Bodega probes default ports for every local preset. Detected servers appear with a green badge (3 models - 45ms). Auto-detect only checks default ports - custom ports need manual URL entry.
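The idea behind auto-detect can be sketched as a plain TCP probe of each local preset's default port. This is only an illustration: the port numbers come from the local presets table on this page, while the probing method, the timeout, and the function names are assumptions rather than Bodega's actual implementation.

```python
import socket

# Default ports from the local presets table; several presets share 8080.
DEFAULT_PORTS = {
    "Ollama": 11434,
    "LM Studio": 1234,
    "vLLM": 8000,
    "llama.cpp": 8080,
    "LocalAI": 8080,
    "KoboldCpp": 5001,
    "GPT4All": 4891,
    "MLX": 8080,
    "Jan": 1337,
}


def is_port_open(host: str, port: int, timeout: float = 0.25) -> bool:
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def detect_local_servers(ports: dict = DEFAULT_PORTS, host: str = "127.0.0.1") -> dict:
    """Return the presets whose default port is currently listening."""
    return {name: port for name, port in ports.items() if is_port_open(host, port)}
```

A bare TCP check only shows that something is listening. Because llama.cpp, LocalAI, and MLX all default to port 8080, a real detector would also need to query the server to tell them apart.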

Local server type: A Local server type dropdown appears above the provider list when a local preset is active. Leave it on Auto unless you hit tool-call or FIM issues - choosing a type that does not match the server you are actually running will break those features.

All 24 provider presets

Local (no API key)

Provider             Default port  Notes
Ollama               11434         Hot-switches between models; best local default
LM Studio            1234          Single model at a time; restart to switch
vLLM                 8000          High-throughput; model set at launch
llama.cpp            8080          Bodega can manage the server process (see below)
LocalAI              8080          Self-hosted OpenAI-compatible
KoboldCpp            5001          GGUF support; single model at a time
GPT4All              4891          Single model at a time
MLX (Apple Silicon)  8080          M1/M2/M3/M4 only; single model at a time
Jan                  1337          Hot-switches between models

Cloud (API key required)

Provider                  Where to get a key
Anthropic (Claude)        console.anthropic.com
OpenAI                    platform.openai.com
Google Gemini             aistudio.google.com
Groq                      console.groq.com
Together AI               api.together.ai
OpenRouter                openrouter.ai/keys
Azure OpenAI              Azure Portal → Azure OpenAI
Mistral AI                console.mistral.ai
Cohere                    dashboard.cohere.com
DeepSeek                  platform.deepseek.com
Fireworks AI              fireworks.ai/settings/users/api-keys
Qwen (Alibaba DashScope)  bailian.console.aliyun.com
Kimi (Moonshot AI)        platform.moonshot.ai
Featherless AI            featherless.ai/account (key prefix rc_...)

Plus: one Custom OpenAI-Compatible slot for any endpoint not in the list.

Switching the active provider automatically clears all 9 model role assignments - model names from one provider won't work on another.

Provider behavior differences

Single-model providers (LM Studio, llama.cpp, KoboldCpp, GPT4All, MLX) load one model at a time. Switching the active model requires a server restart. Auto-routing locks to whichever model is currently loaded.

Hot-switch providers (Ollama, Jan, vLLM) can serve any installed model without restarting. Multi-role setups work best here.

Cloud providers have no switch cost but every token is metered. Running FIM autocomplete against a cloud provider is expensive - see the FIM section below.

Featherless uses HuggingFace org/model format for model names (e.g., deepseek-ai/DeepSeek-V4-Pro). Model availability depends on your subscription tier - gated models return 403 on lower plans even though they appear in the listing.

Qwen and Kimi have separate China and international endpoints. If your account is on the China region, override the base URL in the provider card to the .cn endpoint.

Discover tab - Ollama: browsing and downloading models

When Ollama is the active provider, the Discover tab shows a curated catalog and a download input.

Pull any model by name:

  1. Type a model name in the Pull Any Model field at the top (e.g., llama3.3:70b, mistral:7b, phi4).
  2. Press Enter or click Pull.
  3. Progress shows inline: stage label (Queued → Connecting → Downloading → Validating → Finalizing), percent, and bytes/sec. Click Cancel to abort.

After download completes, the model appears in My Models automatically.
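For context on where the percent figure can come from: Ollama's own HTTP API offers POST /api/pull, which streams one JSON object per line with a status field and, while layers download, total and completed byte counts. The sketch below parses such a line. Whether Bodega uses this endpoint, and how the raw statuses map to the stage labels above, is an assumption; pull_progress is a hypothetical helper.

```python
import json
from typing import Optional, Tuple


def pull_progress(line: str) -> Tuple[str, Optional[float]]:
    """Parse one streamed status line into (status, percent or None)."""
    event = json.loads(line)
    status = event.get("status", "")
    total = event.get("total")
    completed = event.get("completed", 0)
    # Lines without byte counts (manifest, verification) carry no percentage.
    percent = round(100 * completed / total, 1) if total else None
    return status, percent
```

Feeding each streamed line through pull_progress yields a status string for the stage label and a percentage for the progress bar.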

Browse the catalog:

  • Filter chips: Fits My GPU toggle (hides models that exceed your VRAM) and category filters (All / Code / Chat / General / Reasoning).
  • Recommended for Your GPU - up to 4 models that fit and have recommendation badges. Disappears when a filter is active.
  • Featured Models - full catalog, sorted by VRAM fit then headroom.
  • Cloud Alternatives - collapsed section; expand with the chevron.
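The filter and sort behavior in the list above can be pictured with the following sketch. The CatalogModel fields, the definition of headroom as GPU VRAM minus the model's estimated requirement, and the choice to list the largest headroom first are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CatalogModel:
    name: str
    vram_gb: float  # hypothetical field: estimated VRAM needed to run the model


def headroom(model: CatalogModel, gpu_vram_gb: float) -> float:
    """VRAM left over after loading the model; negative means it does not fit."""
    return gpu_vram_gb - model.vram_gb


def sort_catalog(models, gpu_vram_gb: float, fits_only: bool = False):
    """Optionally drop non-fitting models, then sort by fit, then by headroom."""
    if fits_only:
        models = [m for m in models if headroom(m, gpu_vram_gb) >= 0]
    # Fitting models first, then the most headroom first (assumed direction).
    return sorted(
        models,
        key=lambda m: (headroom(m, gpu_vram_gb) < 0, -headroom(m, gpu_vram_gb)),
    )
```

With fits_only=True the function behaves like the Fits My GPU toggle; with it off, models that exceed VRAM sink to the end of the list instead of disappearing.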

The search bar (top-right) narrows by display name, model family, and description.

Air-gap mode disables the Pull input and shows a warning. The curated catalog still loads from local JSON.

Discover tab - llama.cpp: browsing and downloading GGUFs

When llama.cpp is the active provider, the Discover tab switches to a GGUF browser (the Ollama UI disappears).

Download from the curated catalog:

  1. Set llama.cpp as active: Models → Providers → llama.cpp → Set as Active.
  2. Go to Models → Discover.
  3. Browse ~17 curated entries. Filter by category (All / Code / General / Reasoning) or toggle Only fitting to hide quants that exceed your VRAM.
  4. Click a model card to expand it. Per-quant options (Q4/Q5/Q6/Q8) appear - greyed-out quants exceed VRAM. Click a quant button to select it.
  5. Click Download <quant> to start. Progress shows stage, percent, and bytes/sec inline. Click Cancel to abort.

HuggingFace search:

  1. Expand Browse all GGUFs on HuggingFace at the bottom of the tab.
  2. Type a query and press Enter or click Search.
  3. Results show download count and likes. HuggingFace results do not have VRAM scoring. To use one, download the GGUF manually from the HF page and side-load it via My Models → llama.cpp section.

Air-gap mode disables HF search. The curated catalog still works.

My Models tab - 9 model roles

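Bodega routes different tasks to different models using 9 role assignments: the eight pickers listed below plus FIM autocomplete, which is configured in its own panel. You set the eight pickers in Models → My Models → Model Roles.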

Global:

  • Default - fallback for any role that is left empty.

Chat mode:

  • Chat - main conversational model.
  • Fast - quick responses; auto-routing picks this for simple questions.
  • Smart - complex reasoning; auto-routing picks this for hard questions.

Code mode panels:

  • Agent - the coding agent (Ctrl+L).
  • Research - Research panel (Ctrl+Shift+R).
  • Debug - Debug panel (Ctrl+Shift+E).
  • Advisor - Advisor panel (Ctrl+Shift+A).

Each role uses a text input with typeahead. Type part of a model name to filter the dropdown. Leave any role empty to fall back to Default. After making changes, click Save Model Settings.

Note: FIM (Fill-in-the-Middle) autocomplete has its own dedicated panel below the role grid. It does not appear in the 8-picker role grid and model cards do not show a FIM badge.

Embedding models are excluded from all role pickers - they cannot chat.

Assigning model roles

  1. Open Settings → Models → My Models.
  2. In the Model Roles card, find the role you want to assign.
  3. Click the input for that role and type part of a model name - the dropdown narrows to matching models.
  4. Select a model or type a full model ID (Featherless users can paste HuggingFace org/model IDs directly).
  5. Repeat for other roles. Leave roles you don't need empty - they fall back to Default.
  6. Click Save Model Settings.
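The fallback rule in step 5, together with the rule that switching the active provider clears role assignments, can be modeled in a few lines. This is a sketch of the described behavior only; the ModelRoles class and its method names are hypothetical and do not reflect Bodega's internals.

```python
ROLES = ("default", "chat", "fast", "smart", "agent", "research", "debug", "advisor")


class ModelRoles:
    """Role-to-model assignments for one active provider."""

    def __init__(self, provider: str):
        self.provider = provider
        self.assignments = {}

    def assign(self, role: str, model: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.assignments[role] = model

    def resolve(self, role: str) -> str:
        """Return the role's model, else the Default model, else an empty string."""
        return self.assignments.get(role) or self.assignments.get("default", "")

    def switch_provider(self, provider: str) -> None:
        """Model names are provider-specific, so a switch clears every role."""
        if provider != self.provider:
            self.provider = provider
            self.assignments.clear()
```

For example, with Default set to llama3.3:70b and Agent set to phi4, resolving Debug returns llama3.3:70b, and after a provider switch every role resolves to empty until it is reassigned.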

A Multi-Model VRAM Warning banner appears automatically below the role pickers if your Chat and Agent models together would exceed your estimated VRAM. This does not block saving - it is informational.

Installed model list - favorites and categories

Below the Model Roles card, My Models shows all installed models grouped by category: Reasoning, Code, Chat, Multimodal, General, Embeddings, Other.

  • Star icon on any model row marks it as a favorite. Favorites float to the top in a separate section.
  • Click a model row to expand it: shows role badges (which roles that model is assigned to), per-model override controls (temperature, max tokens, context window, reasoning effort), and a Delete button.
  • Delete prompts for confirmation, then permanently removes the model from Ollama. There is no undo.
  • The search bar at the top also narrows the installed model list and the role picker dropdowns simultaneously.

If a model appears in Other, it does not have a recognized model profile yet - it still works, it just has no VRAM estimate or category metadata.

Task Performance - QEL pass rates

At the bottom of My Models, the Task Performance card shows per-model QEL (Quality Evaluation Layer) pass rates from the last 100 tasks.

Column              What it means
Pass count / Total  How many creation tasks passed QEL out of total scored
Average score       Mean score out of 100
Color               Green >80%, amber 60–80%, red <60%

This data populates automatically after the first code-creation task. No setup needed.

"No QEL data yet" means the model has not completed any creation tasks (new files, functions, routes). Conversational queries and read-only tasks are not scored.

A 0% pass rate at a low average score (e.g., 45/100) is accurate - it means the model is consistently below the QEL threshold for creation tasks, not that something is broken.
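The color thresholds from the table translate into a small function. The sketch below assumes the color is driven by the pass rate (passed divided by total) and treats exactly 80% as amber, since the table reserves green for values above 80%; pass_rate_color is a hypothetical name.

```python
def pass_rate_color(passed: int, total: int) -> str:
    """Map a pass count to the badge color described in the table."""
    if total == 0:
        return "No QEL data yet"
    rate = 100 * passed / total
    if rate > 80:
        return "green"
    if rate >= 60:
        return "amber"
    return "red"
```

A model with 0 passes out of 5 scored tasks therefore shows red, which matches the note above: the data is valid, the model is simply below the threshold.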

FIM (Fill-in-the-Middle) autocomplete

FIM is the inline code completion that fires as you type in the editor. It is configured separately from the primary chat provider.

Three modes (select in Models → My Models → FIM (Inline Completion)):

  • Primary - reuse the active chat provider for completions.
  • Custom - dedicate a separate provider with its own URL and API key.
  • Off - disable autocomplete.

Provider quality tiers:

  • Native FIM (best quality, lowest latency): Ollama, vLLM, llama.cpp. These use the /v1/completions endpoint with proper FIM tokens.
  • Prompt-injection fallback (slower, still works): LocalAI, LM Studio, KoboldCpp, GPT4All, MLX, Jan, Custom.
  • Not recommended (every completion request is metered, so you pay per keystroke): all cloud providers - OpenAI, Anthropic, Gemini, Groq, Together, OpenRouter, Mistral, Cohere, DeepSeek, Fireworks, Qwen, Kimi, Azure.

qwen3-coder does not have native FIM tokens and falls back to prompt-injection - slightly slower but functional.
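The difference between the two tiers comes down to how the prompt is built. The sketch below contrasts a native FIM prompt with a prompt-injection fallback and wraps either one in a /v1/completions request body. The special tokens shown are the ones used by the Qwen2.5-Coder family and are only an example, since each model family defines its own; the fallback wording, the <CURSOR> marker, and all function names are invented for illustration.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Prefix-suffix-middle layout using Qwen2.5-Coder-style FIM tokens."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"


def build_fallback_prompt(prefix: str, suffix: str) -> str:
    """Plain-instruction fallback for models without FIM tokens (hypothetical wording)."""
    return (
        "Complete the code at <CURSOR>. Reply with only the missing code.\n\n"
        f"{prefix}<CURSOR>{suffix}"
    )


def completion_payload(model: str, prefix: str, suffix: str,
                       native_fim: bool, max_tokens: int = 64) -> dict:
    """Build a JSON body for an OpenAI-style /v1/completions request."""
    builder = build_fim_prompt if native_fim else build_fallback_prompt
    return {
        "model": model,
        "prompt": builder(prefix, suffix),
        "max_tokens": max_tokens,
        "temperature": 0,
    }
```

The native path lets the model fill the gap directly, while the fallback has to spend tokens on instructions and relies on the model following them, which is why it tends to be slower.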

In Custom mode, set the Base URL, optionally set an API key, click Test Connection, then pick a model from the dropdown (or leave on (auto-detect)).

Codebase embeddings

Embeddings power the semantic codebase index used by Ask the Map and codebase search. Configure the embedding provider at Models → My Models → Codebase Embeddings.

Providers:

  • Ollama (local, free) - recommended default. Set Base URL (default: http://localhost:11434) and pick a model. Recommended models: qwen3-embedding:4b, nomic-embed-text, mxbai-embed-large, all-minilm, snowflake-arctic-embed.
  • llama.cpp (managed) - Bodega spawns the embedding server automatically (see below).
  • OpenAI - cloud, metered. Models: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002.
  • Off - disables codebase search.

Toggle Auto-index on project open to build the index 60 seconds after a project loads.

Changing the embedding model after an index exists requires a full rebuild. A banner will appear - click Rebuild now. The old index uses a different vector dimension and is incompatible with the new model.
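The rebuild requirement follows from how vector search works: every stored vector must come from the same model and have the same length as the query vector. For instance, nomic-embed-text produces 768-dimensional vectors and mxbai-embed-large produces 1024-dimensional ones. A minimal compatibility check could look like this, where IndexInfo and needs_rebuild are hypothetical names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    model: str       # embedding model the index was built with
    dimension: int   # length of each stored vector


def needs_rebuild(index: IndexInfo, new_model: str, new_dimension: int) -> bool:
    """An index is reusable only if both the model and the vector size match."""
    return index.model != new_model or index.dimension != new_dimension
```

The model name is compared as well as the dimension because two different models can share a vector size while still producing vectors that are not comparable with each other.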

Managed llama.cpp embedding server

With the managed option, Bodega spawns a dedicated llama-server instance for embeddings so you do not have to run it yourself.

Requirements: The llama.cpp binary must be installed (Settings → Models → llama.cpp section). At least one GGUF must be downloaded via the llama.cpp Discover tab.

Setup:

  1. In Models → My Models → Codebase Embeddings, select llama.cpp.
  2. Enable Let Bodega manage the embedding server.
  3. Pick an installed GGUF from the dropdown, or type a full file path.
  4. Set the port (default: 8081). This is separate from the chat server on port 8080.
  5. Save settings. Bodega starts the server on next use.

GGUFs appear in the dropdown only after they have been downloaded via the llama.cpp Discover tab.
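For reference, running the embedding server by hand would use a command roughly like the one this sketch builds. The --model, --port, and --embedding options are llama-server flags; the exact command line Bodega generates is not documented here, so treat the argument list and the embedding_server_command helper as assumptions.

```python
def embedding_server_command(gguf_path: str, port: int = 8081,
                             chat_port: int = 8080,
                             binary: str = "llama-server") -> list:
    """Build an argv list for a dedicated llama.cpp embedding server."""
    if port == chat_port:
        # The embedding server must not collide with the chat server's port.
        raise ValueError("embedding port must differ from the chat server port")
    return [binary, "--model", gguf_path, "--port", str(port), "--embedding"]
```

The resulting list can be passed to subprocess.Popen. Keeping the port separate from 8080 is what allows the chat and embedding servers to run side by side.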

Vision routing

When you attach an image to a message, Bodega routes it based on your active provider:

  • Cloud vision-capable models (Claude Sonnet, GPT-4o, Gemini) - the image is sent inline. No swap, no delay.
  • llama.cpp with a text-only model - if Allow llama.cpp model swap is enabled and a VLM (vision language model) is installed, Bodega pauses the text model, loads the VLM, answers the vision question, and resumes. The UI shows swap progress: swap_started → loading_weights → ready → querying → complete. Swap typically takes 10–60 seconds.
  • Ollama with a multimodal model - Ollama handles the swap internally with no visible pause.

Configure vision in Settings → Models → Vision Binding:

  • Vision engine - Auto (prefers Ollama to avoid the pause) / Ollama / llama.cpp.
  • Allow llama.cpp model swap - disable this if you want vision questions declined rather than triggering a slow swap.

To install a vision model:

  • llama.cpp: download a VLM-capable GGUF (e.g., LLaVA) via Models → Discover.
  • Ollama: pull a multimodal model (llava:13b, etc.) via the Pull Any Model input.

If llama.cpp swap is disabled and there is no Ollama VLM, vision questions are declined with a message directing you to install a VLM.
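The routing rules in this section can be summarized as a decision function. This is a simplified reading of the behavior described above, not Bodega's code: the argument names, the string labels, and the exact precedence between the checks are assumptions.

```python
def route_vision(provider_kind: str, engine: str = "auto",
                 ollama_has_vlm: bool = False, llamacpp_has_vlm: bool = False,
                 allow_swap: bool = True) -> str:
    """Decide how an attached image is handled."""
    if provider_kind == "cloud_vision":
        return "inline"  # vision-capable cloud model: image is sent as-is
    if engine in ("auto", "ollama") and ollama_has_vlm:
        return "ollama"  # Auto prefers Ollama to avoid the swap pause
    if engine in ("auto", "llama.cpp") and llamacpp_has_vlm and allow_swap:
        return "llama.cpp swap"  # pause text model, load VLM, answer, resume
    return "declined"  # no usable VLM: ask the user to install one
```

Turning off Allow llama.cpp model swap corresponds to allow_swap=False, which leaves Ollama as the only local path and otherwise declines the question.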

Claude Fast Mode

Fast Mode skips extended thinking on Claude models, so replies start sooner. The underlying model is the same - only the reasoning pre-pass is skipped.

When a Claude model is selected, click the Fast toggle in the message composer (next to the reasoning control).

From this point, all Claude calls skip extended thinking unless you explicitly set a high reasoning level on a specific message using the composer's reasoning control. The per-message control always wins over this setting.

Fast Mode applies only to Claude models (model name contains claude). It has no effect on other providers.
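The precedence between Fast Mode and the per-message reasoning control can be written out as a short predicate. This sketch assumes the name match is case-insensitive and that the per-message level is represented as a string such as "high"; both details, along with the function name, are illustrative.

```python
from typing import Optional


def fast_mode_skips_thinking(model_name: str, fast_mode: bool,
                             message_reasoning: Optional[str] = None) -> bool:
    """True when Fast Mode causes extended thinking to be skipped for this call."""
    if "claude" not in model_name.lower():
        return False  # Fast Mode has no effect on non-Claude models
    if message_reasoning == "high":
        return False  # the per-message control always wins
    return fast_mode
```

So a Claude call with Fast Mode on skips thinking by default, but a single message sent with a high reasoning level still gets the full reasoning pass.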

llama.cpp advanced flags

Power-user controls for the managed llama-server process are in Models → My Models → Advanced Flags (visible only when llama.cpp is the active provider).

Available controls: GPU layers, context size, batch size, and a free-text field for extra flags.

Changes require a server restart to take effect. The server manager attempts crash recovery if a bad value causes the process to exit.
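As a rough guide to what these controls mean at the command line, the sketch below turns them into llama-server arguments. --n-gpu-layers, --ctx-size, and --batch-size are llama.cpp options; the mapping from the UI fields to these flags, and the advanced_flag_args helper itself, are assumptions.

```python
import shlex


def advanced_flag_args(gpu_layers=None, context_size=None,
                       batch_size=None, extra_flags: str = "") -> list:
    """Translate the Advanced Flags fields into llama-server arguments."""
    args = []
    if gpu_layers is not None:
        args += ["--n-gpu-layers", str(gpu_layers)]
    if context_size is not None:
        args += ["--ctx-size", str(context_size)]
    if batch_size is not None:
        args += ["--batch-size", str(batch_size)]
    # The free-text field is split shell-style so quoted values stay intact.
    args += shlex.split(extra_flags)
    return args
```

Since the free-text field is passed through unvalidated in this sketch, a typo there is the most likely cause of the kind of startup crash that the recovery logic mentioned above is meant to handle.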

Hardware tier scoring

Bodega detects your GPU and assigns a hardware tier, shown in the Models panel header and used to score which catalog models fit.

Tier    VRAM
Tiny    ≤ 4 GB
Small   4–8 GB
Medium  8–16 GB
Large   16–24 GB
XL      24+ GB

Apple Silicon uses unified memory - all RAM counts as VRAM.

The Fits My GPU filter on the Discover tab hides models that would exceed your tier. The Multi-Model VRAM Warning in Model Roles uses the same data to flag when your Chat and Agent models combined would exceed available VRAM.
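The tier table and the combined-VRAM warning can be expressed as two small functions. The table does not say which tier a boundary value belongs to, so this sketch assumes upper bounds are inclusive (24 GB counts as Large, in line with the Large - 24 GB badge example); the function names and the simple sum used for the warning are assumptions.

```python
def hardware_tier(vram_gb: float) -> str:
    """Map detected VRAM (or unified memory on Apple Silicon) to a tier name."""
    if vram_gb <= 4:
        return "Tiny"
    if vram_gb <= 8:
        return "Small"
    if vram_gb <= 16:
        return "Medium"
    if vram_gb <= 24:
        return "Large"
    return "XL"


def multi_model_warning(chat_gb: float, agent_gb: float, vram_gb: float) -> bool:
    """True when the Chat and Agent models together exceed available VRAM."""
    return chat_gb + agent_gb > vram_gb
```

For example, a 14 GB Chat model and a 12 GB Agent model on a 24 GB card would trigger the warning, even though each model fits on its own.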

Keyboard shortcuts

Keys                             Action
Ctrl+,                           Open Settings (then click Models in the left nav)
Click model name in status bar   Open Settings → Models directly from Code mode
Enter (in Pull Any Model input)  Start an Ollama model download
Ctrl+L                           Open Agent panel (uses the Agent model role)
Ctrl+Shift+R                     Open Research panel (uses the Research model role)
Ctrl+Shift+E                     Open Debug panel (uses the Debug model role)
Ctrl+Shift+A                     Open Advisor panel (uses the Advisor model role)

This page mirrors the in-app docs hub for app version 1.0.0-beta.26.1.