ollama · local-first · BYOLLM · setup

Setting up Ollama with Bodega One (and which models to actually run)

Bodega One · 8 min read

The first time I saw Ollama mentioned, the post was three lines. Install it. Pull llama3.2. Done.

That's not wrong. But if you've hit a wall (model not responding, terminal just sitting there, responses making no sense), there's usually a gap between the three-line version and actually knowing what's going on. This is the fuller version.

What Ollama is, plain

Ollama runs large language models on your own hardware. Install it, pull a model, and you have a working API endpoint at localhost:11434. No account, no API key, no data leaving your machine.

It handles the parts that used to be painful: model quantization (compressing models so they actually fit in your RAM), GPU offloading if you have one, and a REST API that other tools can talk to. The reason r/LocalLLaMA recommends it by default is that it just works for most setups. You're not configuring CUDA drivers or wrestling with Python version conflicts. You install it, pull a model, and it runs.
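Once a model is pulled, you can see that API for yourself with plain HTTP. A minimal sketch (assumes llama3.2 is already pulled; "stream": false returns one JSON object instead of a token stream):

```shell
# Ask the local Ollama server for a one-shot completion.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain what a quantized model is in one sentence.",
  "stream": false
}'
```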

What hardware you actually need

Be honest with yourself here. People skip this and spend two hours wondering why their machine is crawling.

The rough breakdown:

  • 8GB RAM: Runs 3B models fine, 7B models slowly. Stick to llama3.2:3b or phi3:mini.
  • 16GB RAM: 7B runs comfortably. You can push a quantized 13B model. Most developers land here and it's enough.
  • 32GB RAM: 13B to 33B range depending on the model and quantization level.

A GPU helps but is not required. Ollama runs on CPU. It's just slower.

If you're on Apple Silicon (M1 through M4), the picture is better than the RAM numbers suggest. Ollama uses Apple's Metal GPU acceleration and unified memory natively. Benchmarks on an M3 Pro show 28-35 tokens per second on Llama 3.1 8B. Fast enough for real work. On a base M1 with 16GB you're looking at 12-15 tokens per second. Slower, but usable.

If you're on a Mac: run Ollama natively, not in Docker. Native Ollama runs 5-6x faster because Docker containers on macOS don't get access to Apple Silicon's GPU. That's not a small difference.
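You can check where a loaded model actually landed with `ollama ps` (assumes a reasonably recent Ollama version; the exact column layout may differ):

```shell
# Load a model, then inspect where Ollama placed it.
ollama run llama3.2 "hello" > /dev/null
# The PROCESSOR column reads e.g. "100% GPU" or "100% CPU".
ollama ps
```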

The things nobody mentions

Before you pull anything, here are the mistakes that cost people hours:

The naming typo. Ollama uses llama3.2 (no hyphen). Type llama-3.2 and it either fails silently or does the wrong thing. The naming conventions across the library are not consistent, and Ollama's error messages don't always tell you what went wrong.

The context window default. Ollama defaults to a 2048-token context window regardless of what the model actually supports. Most modern models support 8k, 16k, or 128k. If your model seems to forget earlier parts of a conversation, this is why. You can set a higher context in your client, but the default will catch you off guard.
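Two ways to raise it, as a sketch (`num_ctx` is Ollama's context-length option; pick a value your RAM can actually hold):

```shell
# Option 1: per-request, via the API's options field.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the conversation so far.",
  "options": { "num_ctx": 8192 },
  "stream": false
}'

# Option 2: interactively, inside `ollama run`:
#   >>> /set parameter num_ctx 8192
```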

"llama runner exited." This error means you ran out of RAM or VRAM loading the model. It's not a bug. It's your hardware limit. The fix is to pull a smaller model or find a more aggressively quantized version of the same one.
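Both fixes look like this (the quantized tag below is an example, not a guarantee; exact tag names vary by model, so check the model's Tags page on ollama.com before pulling):

```shell
# Fix 1: drop to a smaller parameter count.
ollama pull llama3.2:3b

# Fix 2: pull a more aggressively quantized build of the same model,
# e.g. a q4 variant if the library lists one for your model.
ollama pull llama3.1:8b-instruct-q4_0
```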

Getting it running

Step 1: Install. Download from ollama.com for your platform. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Mac and Windows it installs as a background service that starts automatically.

Step 2: Pull a model. Start conservative.

ollama pull llama3.2

for 8GB RAM, or

ollama pull llama3.1:8b

for 16GB RAM and up. First pull takes a few minutes. After that, models are cached locally.
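To see what's already cached, and how much disk each model takes:

```shell
# List downloaded models with their size and last-modified time.
ollama list
```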

Step 3: Verify.

curl http://localhost:11434

"Ollama is running" means you're good. Test the model directly with:

ollama run llama3.2

That opens an interactive chat in your terminal.
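`ollama run` also accepts a one-shot prompt as an argument, which makes it easy to script or pipe:

```shell
# Non-interactive: prompt as an argument, answer on stdout.
ollama run llama3.2 "Write a one-line shell command that counts files in the current directory."
```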

Which models are worth pulling

The library has over 100 models. Most people need three at most. Here's what the community actually uses:

Start here:

llama3.1:8b: 108 million downloads on Ollama. Meta's flagship general-purpose model. Fast, handles most tasks well, good at following instructions. This is the default recommendation because it's reliable and well-tested, not because it's the most hyped.

For coding specifically:

deepseek-coder:6.7b: Trained on code. More accurate at technical output than a general model of the same size. If you're using a local model for coding tasks, this one earns its place.

codellama:13b: Stronger at code completion and explanation. Needs 16GB RAM minimum.

For reasoning tasks (if you have the hardware):

deepseek-r1: 78.8 million pulls on Ollama. An open-source reasoning model that competes with frontier-model quality on math and logic. The community reaction was genuine surprise: a locally running model that gets close to what you'd expect from a hosted API on reasoning tasks.

For limited hardware:

phi3:mini: Microsoft's 3.8B model. Surprisingly capable for its size, battery-efficient on laptops.

llama3.2:3b: Good enough for basic coding assistance on 8GB RAM.

One thing to keep in mind: model quality varies a lot by task type. A model that struggles with complex reasoning can still be fast and accurate at reading code and writing straightforward functions. If something isn't working, try a different model before blaming your hardware.

Ollama vs LM Studio

If you've seen LM Studio recommended alongside Ollama and aren't sure which to use, here's the actual difference:

Ollama is terminal-first. It runs as an API server that other tools connect to. No built-in GUI. If you want to script it, automate it, or connect it to a coding environment, Ollama is the right choice.

LM Studio is GUI-first. Download it, open it, start chatting. Lower friction if you just want to talk to a model without touching a terminal.

You can use both. They're not competing for the same job.

Connecting Ollama to Bodega One

Once Ollama is running locally, wiring it to a real coding environment takes about two minutes.

Bodega One ships with Ollama as a first-class provider preset. Open Settings, select Ollama from the LLM providers list, and the endpoint is already set to http://localhost:11434. Pick the model you pulled from the dropdown.

Your code, prompts, and responses stay on your machine. Bodega One talks directly to your Ollama server with no cloud routing.
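Ollama also exposes an OpenAI-compatible endpoint under /v1, which is what many editor integrations speak. A quick sanity check before pointing any tool at it (a sketch, assuming llama3.2 is pulled):

```shell
# OpenAI-style chat completions against the local Ollama server.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Say hello."}]
  }'
```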

The practical difference between running Ollama in a terminal and running it through a coding environment is significant. Instead of pasting code snippets into a chat window, the AI has full access to your file tree, can run shell commands, and writes actual diffs to your codebase. Bodega One ships 23 built-in tools and an autonomous coding agent with a Quality Enforcement Layer that validates changes before they land in your project.

That's a different category of tool than a local chat window.

What to expect

Local models are slower than hosted APIs. A 7B model on CPU responds in a few seconds, not milliseconds. With a GPU or Apple Silicon the gap closes significantly. For most coding tasks (reading files, writing functions, explaining code, refactoring) the speed is workable.

What you get in return: no rate limits, no upstream outages, no model behavior changes that ship silently, no token bill at the end of the month.

If a specific task needs a stronger model (complex architectural decisions, unusual reasoning chains), Bodega One lets you switch providers in seconds. Keep Ollama for everyday work, switch to Claude or OpenAI when you need the heavy lifting. That's the BYOLLM approach: you pick the right tool for the job rather than committing to one model for everything.

Try it yourself

The full setup (Ollama running, a model pulled, connected to a real coding environment) takes about 20 minutes the first time. After that it just runs in the background.

Beta opens May 1, 2026. Join the waitlist and you'll get access in the first batch. Complete the 14-day beta and we'll send you a $30 promo code before launch.

Not sure which Ollama model to actually pull? The Local LLM Guide ranks 30+ models by SWE-bench Verified score with per-VRAM-tier picks, including every model available on Ollama.
