ollama · local-first · BYOLLM · setup

Setting up Ollama with Bodega One (and which models to actually run)

Bodega One · 8 min read

The first time I saw Ollama mentioned, the post was three lines. Install it. Pull llama3.2. Done.

That's not wrong. But if you've hit a wall (model not responding, terminal just sitting there, responses making no sense), there's usually a gap between the three-line version and actually knowing what's going on. This is the fuller version.

What Ollama is, plain

Ollama runs large language models on your own hardware. Install it, pull a model, and you have a working API endpoint at localhost:11434. No account, no API key, no data leaving your machine.

It handles the parts that used to be painful: model quantization (compressing models so they actually fit in your RAM), GPU offloading if you have one, and a REST API that other tools can talk to. The reason r/LocalLLaMA recommends it by default is that it just works for most setups. You're not configuring CUDA drivers or wrestling with Python version conflicts. You install it, pull a model, and it runs.
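Once a model is pulled, you can see that API for yourself with plain HTTP. A minimal sketch (assumes llama3.2 is already pulled; "stream": false returns one JSON object instead of a token stream):

```shell
# Ask the local Ollama server for a one-shot completion.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain what a quantized model is in one sentence.",
  "stream": false
}'
```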

What hardware you actually need

Be honest with yourself here. People skip this and spend two hours wondering why their machine is crawling.

The rough breakdown:

  • 8GB RAM: Runs 3B models fine, 7B models slowly. Stick to llama3.2:3b or phi3:mini.
  • 16GB RAM: 7B runs comfortably. You can push a quantized 13B model. Most developers land here and it's enough.
  • 32GB RAM: 13B to 33B range depending on the model and quantization level.

A GPU helps but is not required. Ollama runs on CPU. It's just slower.

If you're on Apple Silicon (M1 through M4), the picture is better than the RAM numbers suggest. Ollama uses Apple's Metal GPU acceleration and unified memory natively. Benchmarks on an M3 Pro show 28-35 tokens per second on Llama 3.1 8B. Fast enough for real work. On a base M1 with 16GB you're looking at 12-15 tokens per second. Slower, but usable.

If you're on a Mac: run Ollama natively, not in Docker. Native Ollama runs 5-6x faster because Docker containers on macOS don't get access to Apple Silicon's GPU. That's not a small difference.
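You can check where a loaded model actually landed with `ollama ps` (assumes a reasonably recent Ollama version; the exact column layout may differ):

```shell
# Load a model, then inspect where Ollama placed it.
ollama run llama3.2 "hello" > /dev/null
# The PROCESSOR column reads e.g. "100% GPU" or "100% CPU".
ollama ps
```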

The things nobody mentions

Before you pull anything, here are the mistakes that cost people hours:

The naming typo. Ollama uses llama3.2 (no hyphen). Type llama-3.2 and it either fails silently or does the wrong thing. The naming conventions across the library are not consistent, and Ollama's error messages don't always tell you what went wrong.

The context window default. Ollama defaults to a 2048-token context window regardless of what the model actually supports. Most modern models support 8k, 16k, or 128k. If your model seems to forget earlier parts of a conversation, this is why. You can set a higher context in your client, but the default will catch you off guard.
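Two ways to raise it, as a sketch (`num_ctx` is Ollama's context-length option; pick a value your RAM can actually hold):

```shell
# Option 1: per-request, via the API's options field.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the conversation so far.",
  "options": { "num_ctx": 8192 },
  "stream": false
}'

# Option 2: interactively, inside `ollama run`:
#   >>> /set parameter num_ctx 8192
```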

"llama runner exited." This error means you ran out of RAM or VRAM loading the model. It's not a bug. It's your hardware limit. The fix is to pull a smaller model or find a more aggressively quantized version of the same one.
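Both fixes look like this (the quantized tag below is an example, not a guarantee; exact tag names vary by model, so check the model's Tags page on ollama.com before pulling):

```shell
# Fix 1: drop to a smaller parameter count.
ollama pull llama3.2:3b

# Fix 2: pull a more aggressively quantized build of the same model,
# e.g. a q4 variant if the library lists one for your model.
ollama pull llama3.1:8b-instruct-q4_0
```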

Getting it running

Step 1: Install. Download from ollama.com for your platform. On Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Mac and Windows it installs as a background service that starts automatically.

Step 2: Pull a model. Start conservative.

ollama pull llama3.2

for 8GB RAM, or

ollama pull llama3.1:8b

for 16GB RAM and up. First pull takes a few minutes. After that, models are cached locally.
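To see what's already cached, and how much disk each model takes:

```shell
# List downloaded models with their size and last-modified time.
ollama list
```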

Step 3: Verify.

curl http://localhost:11434

"Ollama is running" means you're good. Test the model directly with:

ollama run llama3.2

That opens an interactive chat in your terminal.
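`ollama run` also accepts a one-shot prompt as an argument, which makes it easy to script or pipe:

```shell
# Non-interactive: prompt as an argument, answer on stdout.
ollama run llama3.2 "Write a one-line shell command that counts files in the current directory."
```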

Which models are worth pulling

The library has over 100 models. Most people need three at most. Here's what the community actually uses:

Start here:

llama3.1:8b: 108 million downloads on Ollama. Meta's flagship general-purpose model. Fast, handles most tasks well, good at following instructions. This is the default recommendation because it's reliable and well-tested, not because it's the most hyped.

For coding specifically:

deepseek-coder:6.7b: Trained on code. More accurate at technical output than a general model of the same size. If you're using a local model for coding tasks, this one earns its place.

codellama:13b: Stronger at code completion and explanation. Needs 16GB RAM minimum.

For reasoning tasks (if you have the hardware):

deepseek-r1: 78.8 million pulls on Ollama. An open-source reasoning model that competes with frontier-model quality on math and logic. The community reaction was genuine surprise: a locally running model that gets close to what you'd expect from a hosted API on reasoning tasks.

For limited hardware:

phi3:mini: Microsoft's 3.8B model. Surprisingly capable for its size, battery-efficient on laptops.

llama3.2:3b: Good enough for basic coding assistance on 8GB RAM.

One thing to keep in mind: model quality varies a lot by task type. A model that struggles with complex reasoning can still be fast and accurate at reading code and writing straightforward functions. If something isn't working, try a different model before blaming your hardware.

Ollama vs LM Studio

If you've seen LM Studio recommended alongside Ollama and aren't sure which to use, here's the actual difference:

Ollama is terminal-first. It runs as an API server that other tools connect to. No built-in GUI. If you want to script it, automate it, or connect it to a coding environment, Ollama is the right choice.

LM Studio is GUI-first. Download it, open it, start chatting. Lower friction if you just want to talk to a model without touching a terminal.

You can use both. They're not competing for the same job.

Connecting Ollama to Bodega One

Once Ollama is running locally, wiring it to a real coding environment takes about two minutes.

Bodega One ships with Ollama as a first-class provider preset. Open Settings, select Ollama from the LLM providers list, and the endpoint is already set to http://localhost:11434. Pick the model you pulled from the dropdown.

Your code, prompts, and responses stay on your machine. Bodega One talks directly to your Ollama server with no cloud routing.
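Ollama also exposes an OpenAI-compatible endpoint under /v1, which is what many editor integrations speak. A quick sanity check before pointing any tool at it (a sketch, assuming llama3.2 is pulled):

```shell
# OpenAI-style chat completions against the local Ollama server.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Say hello."}]
  }'
```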

The practical difference between running Ollama in a terminal and running it through a coding environment is significant. Instead of pasting code snippets into a chat window, the AI has full access to your file tree, can run shell commands, and writes actual diffs to your codebase. Bodega One ships 23 built-in tools and an autonomous coding agent with a Quality Enforcement Layer that validates changes before they land in your project.

That's a different category of tool than a local chat window.

What to expect

Local models are slower than hosted APIs. A 7B model on CPU responds in a few seconds, not milliseconds. With a GPU or Apple Silicon the gap closes significantly. For most coding tasks (reading files, writing functions, explaining code, refactoring) the speed is workable.

What you get in return: no rate limits, no upstream outages, no model behavior changes that ship silently, no token bill at the end of the month.

If a specific task needs a stronger model (complex architectural decisions, unusual reasoning chains), Bodega One lets you switch providers in seconds. Keep Ollama for everyday work, switch to Claude or OpenAI when you need the heavy lifting. That's the BYOLLM approach: you pick the right tool for the job rather than committing to one model for everything.

Try it yourself

The full setup (Ollama running, a model pulled, connected to a real coding environment) takes about 20 minutes the first time. After that it just runs in the background.

Beta opens May 1, 2026. Join the waitlist and you'll get access in the first batch. Complete the 14-day beta and we'll send you a $30 promo code before launch.

Not sure which Ollama model to actually pull? The Local LLM Guide ranks 30+ models by SWE-bench Verified score with per-VRAM-tier picks, including every model available on Ollama.
