Vision (Image Understanding)

Bodega can look at images - screenshots, diagrams, UI mockups, error dialogs - using a vision-capable model. You do not have to switch your main model to do it.

How vision works

Keep your text model as the default. When you attach an image and your active model can't read images, Bodega routes just that question to a vision model and brings the answer back - your text model stays in place.

Three paths, depending on your setup:

  • Cloud model (GPT-4o, Claude, Gemini) - reads the image directly. No swap, nothing to install.
  • Ollama - uses an Ollama vision model directly; no process swap.
  • llama.cpp - briefly hot-swaps to your downloaded vision model, answers, then restores your text model (see below).
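
The three paths above can be read as a small routing decision. The sketch below is illustrative only, not Bodega's actual code: the Model class, the engine labels, and route_image_question are hypothetical names, and the preference for no-swap engines over the llama.cpp swap is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    engine: str   # hypothetical labels: "cloud", "ollama", or "llamacpp"
    vision: bool  # True if the model can read images


def route_image_question(active: Model, installed: list[Model]) -> tuple[Model, bool]:
    """Return (model that answers the image question, whether a swap is needed)."""
    # If the active model can read images, it answers directly.
    if active.vision:
        return active, False

    vision_models = [m for m in installed if m.vision]

    # Assumed preference: engines that answer without a process swap come first.
    for m in vision_models:
        if m.engine in ("cloud", "ollama"):
            return m, False

    # llama.cpp holds one model at a time, so its vision model requires a swap.
    for m in vision_models:
        if m.engine == "llamacpp":
            return m, True

    raise LookupError("no vision-capable model is installed")
```

In every branch the active text model stays the default; only the image question is sent elsewhere.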

Attach an image

  1. Drag an image file onto the chat input, or click + → Attach Screenshot to capture one.
  2. Ask your question about it.
  3. The answer streams back in the conversation.

Works in both Chat and Code mode.

Local vision: the auto-swap (llama.cpp)

On llama.cpp, Bodega owns one model process, so seeing an image means temporarily loading the vision model. You'll see a brief "switching to the vision model" indicator with progress. The first swap takes roughly 10–60 s while the weights load; subsequent queries are fast. After the answer, your text model is loaded back automatically.
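
The swap-and-restore pattern described above can be sketched with a context manager. This is a minimal illustration under stated assumptions, not the app's implementation: SingleModelProcess and vision_swap are hypothetical names, and restoring the text model even when the query fails is an assumption.

```python
from contextlib import contextmanager


class SingleModelProcess:
    """Stand-in for an engine process that holds exactly one model at a time."""

    def __init__(self, model: str):
        self.loaded = model

    def load(self, model: str) -> None:
        # A real engine would load weights here (the slow part of the swap).
        self.loaded = model


@contextmanager
def vision_swap(proc: SingleModelProcess, vision_model: str):
    """Load the vision model for one query, then put the previous model back."""
    previous = proc.loaded
    proc.load(vision_model)
    try:
        yield proc
    finally:
        # Runs after the answer, and also if the query raised an error.
        proc.load(previous)


proc = SingleModelProcess("my-text-model")
with vision_swap(proc, "my-vision-model"):
    pass  # ask the image question here
```

After the with block exits, proc.loaded is back to "my-text-model".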

Choose how vision is handled in Settings → Models → Vision: either prefer the no-swap engine (Ollama) when one is available, or allow the llama.cpp swap.

Get a vision model

Download a VLM from Models → Discover:

  • Qwen2.5-VL - strong general-purpose vision
  • LLaVA - widely supported
  • Moondream - tiny, fast, low-VRAM

Ollama also has its own vision models. Once a VLM is installed, the routing above picks it up automatically - no role assignment needed.

Vision in the agent loop

The agent can use vision as a tool, not just in chat:

  • vision_query takes an image and a question and sends both to your bound vision model - this is how a text-only agent (e.g. Claude driving the loop) can "see."
  • preview_interaction can screenshot the live Preview tab; the agent then runs vision_query on that screenshot to read what rendered. That's how it can check a UI it just built.
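
The screenshot-then-ask flow can be sketched as two chained calls. The real argument shapes of preview_interaction and vision_query are not documented on this page, so the callables below are hypothetical stand-ins with assumed signatures, and check_preview is an illustrative helper, not a Bodega API.

```python
from typing import Callable


def check_preview(
    take_screenshot: Callable[[], bytes],
    vision_query: Callable[[bytes, str], str],
    question: str,
) -> str:
    """Capture the preview, then ask the vision model a question about it."""
    image = take_screenshot()             # stands in for preview_interaction
    return vision_query(image, question)  # stands in for vision_query


# Stub tools so the sketch runs without the app.
def fake_screenshot() -> bytes:
    return b"placeholder image bytes"


def fake_vision_query(image: bytes, question: str) -> str:
    return f"(vision answer to {question!r} for {len(image)} image bytes)"


answer = check_preview(fake_screenshot, fake_vision_query, "Is the Save button visible?")
```

The text-only agent never handles pixels itself; it only receives the vision model's text answer.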

Air-gap

In air-gap mode, only local vision models are used - cloud vision (GPT-4o, Claude, Gemini) is blocked along with all other cloud traffic.
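
As a rough sketch, the air-gap rule acts as a filter on the candidate vision models. The function name and the (name, is_local) tuple shape are assumptions made for illustration.

```python
def usable_vision_models(models, air_gap):
    """Return the names of models that may be used.

    Each entry in `models` is a hypothetical (name, is_local) tuple.
    """
    if air_gap:
        # Air-gap mode: cloud models are excluded, local ones remain.
        return [name for name, is_local in models if is_local]
    return [name for name, _ in models]
```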

This page mirrors the in-app docs hub for app version 1.0.0-beta.26.1.