Ollama Setup Guide: Run Any LLM Locally in 5 Minutes

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: Ollama is the fastest way to get a local LLM running on your machine. One install, one command, and you're chatting with Llama 3, Mistral, or Gemma locally. If you have a GPU with at least 8 GB of VRAM, you'll get fast responses. This guide walks you through everything from install to a full browser-based chat interface.

Why Ollama?

There are several ways to run LLMs locally -- we compare the big three here -- but Ollama wins on simplicity. It wraps llama.cpp (the open-source inference engine that actually runs models) in a clean command-line interface with automatic model management, GPU detection, and a local API server. You don't need to understand quantization formats or compile anything.

Think of it as Docker for LLMs: pull a model, run it, done.

What You Need Before Starting

Minimum hardware:

8 GB RAM (16 GB recommended)
Any modern CPU (2018 or newer)
10 GB free disk space for the app plus your first model

For GPU acceleration (strongly recommended):

NVIDIA GPU with 6+ GB VRAM and CUDA support (GTX 1060 or newer)
AMD GPU with ROCm support (RX 6000 series or newer, Linux only as of March 2026)
Apple Silicon Mac (M1 or newer -- these are excellent for local LLMs)

If you're unsure whether your GPU has enough VRAM, check our guide on how much VRAM you actually need.

Step 1: Install Ollama

Windows

Go to ollama.com/download
Download the Windows installer
Run the .exe -- it installs in about 30 seconds
Ollama runs as a background service automatically

That's it. No PATH configuration, no environment variables. It just works.

One thing to know: On Windows, Ollama runs as a system service. It starts automatically when your PC boots. You'll see a small llama icon in your system tray.

macOS

Go to ollama.com/download
Download the .dmg for macOS
Drag Ollama to your Applications folder
Launch it once -- it'll ask for permission to install the command-line tool

On Apple Silicon Macs (M1, M2, M3, M4), Ollama automatically uses the GPU cores through Metal. No drivers to install, no configuration. This is one reason Apple Silicon is so good for local LLMs.

Linux

The one-liner:

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama and sets it up as a systemd service. It auto-detects NVIDIA GPUs if you have CUDA drivers installed, and AMD GPUs if you have ROCm.

For NVIDIA users: Make sure you have the NVIDIA driver installed first (version 525+ recommended). You don't need the full CUDA toolkit -- just the driver. Check with:

nvidia-smi

If that command shows your GPU, you're good. If not, install drivers first:

# Ubuntu/Debian
sudo apt install nvidia-driver-550
sudo reboot

For AMD GPU users on Linux: You need ROCm 6.0+. This is a bit more involved:

# Check AMD's official ROCm install docs for your distro
# After ROCm is installed, Ollama detects it automatically

Step 2: Pull and Run Your First Model

Open a terminal (Command Prompt or PowerShell on Windows, Terminal on Mac/Linux) and run:

ollama run llama3.2

That's the whole command. Ollama will:

Download the Llama 3.2 3B model (about 2 GB)
Load it into your GPU (or CPU if no GPU is detected)
Drop you into a chat interface

Type a message and hit Enter. You're now running an LLM locally. No API key, no cloud, no subscription.

Picking Your First Model

Llama 3.2 3B is the default for a reason -- it's small, fast, and surprisingly capable. But here's a quick decision guide:

If you have 8 GB VRAM: Start with ollama run llama3.1:8b. The 8-billion parameter version of Llama 3.1 is the sweet spot of quality vs. speed for most GPUs. Uses about 5-6 GB VRAM in the default quantization (the process of compressing a model to use less memory while keeping most of its quality).

If you have 16+ GB VRAM: Try ollama run llama3.1:8b first, then step up to ollama run qwen2.5:14b for better reasoning. With 16 GB you can also try ollama run deepseek-r1:14b for strong coding and math.

If you have 24 GB VRAM (RTX 3090/4090): You can run ollama run llama3.1:70b in a q4 quantization. This is genuinely close to GPT-4 quality for many tasks. See our best GPUs for local LLMs guide for more on what each GPU tier can handle.

If you're on Apple Silicon: An M2 Pro with 16 GB unified memory (memory shared between CPU and GPU -- Apple Silicon's secret weapon for LLMs) handles 8B models beautifully. M4 Max with 64 GB+ can run 70B models. Check our M4 Max vs RTX 4090 comparison for specifics.

Useful Model Commands

# List available models to download
ollama list

# Pull a model without running it
ollama pull mistral

# See what models you have downloaded
ollama list

# Remove a model to free disk space
ollama rm llama3.2

# Run a specific quantization
ollama run llama3.1:8b-q8_0

# Run with a specific context window
ollama run llama3.1:8b --context 8192

Step 3: Verify GPU Acceleration Is Working

This is where most people get tripped up. You pulled a model, it's running, but is it actually using your GPU? Or is it limping along on CPU?

How to Check

While a model is running, open another terminal and run:

ollama ps

This shows loaded models and whether they're using GPU or CPU. You'll see something like:

NAME              ID          SIZE   PROCESSOR    UNTIL
llama3.1:8b      a23456...   5.5 GB 100% GPU     4 minutes from now

"100% GPU" means you're good. Every layer of the model is on your GPU.

"100% CPU" means GPU acceleration isn't working. See troubleshooting below.

"48% GPU / 52% CPU" means your model is too big for your VRAM and some layers are offloaded to CPU. This still works but you'll see slower tokens per second (the standard measure of LLM speed -- how many words per second the model generates).

NVIDIA Troubleshooting

If ollama ps shows CPU when you expected GPU:

Check that nvidia-smi works. If it doesn't, your drivers aren't installed properly.
Check VRAM availability. Run nvidia-smi and look at "Memory-Usage." If your VRAM is already full (gaming, other apps), Ollama falls back to CPU.
Restart Ollama. Sometimes the GPU detection happens at startup:

# Linux
sudo systemctl restart ollama

# Windows - right-click the tray icon and quit, then reopen

# Mac - quit from the menu bar and reopen

Check the logs. On Linux: journalctl -u ollama -f. Look for lines mentioning CUDA or GPU detection.

Apple Silicon Troubleshooting

Apple Silicon should just work. If it's slow:

Make sure you're not running the x86 version through Rosetta. Check in Activity Monitor -- the "Kind" column should say "Apple" not "Intel."
Close memory-hungry apps. Apple Silicon shares memory between CPU and GPU, so a browser with 50 tabs eats into your LLM's available memory.

Step 4: Set Up Open WebUI for a Browser-Based Chat Interface

The terminal chat is fine for quick tests, but for real use you want Open WebUI -- a self-hosted chat interface that looks like ChatGPT but runs entirely on your machine.

Install with Docker (Recommended)

Make sure Docker is installed, then:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. Create an account (it's local-only, just for the UI). Select a model from the dropdown. Start chatting.

What you get with Open WebUI:

ChatGPT-style interface with conversation history
Ability to switch between all your downloaded Ollama models
System prompt customization per conversation
Document upload for RAG (retrieval-augmented generation)
Multi-user support if others in your household want to use it
Mobile-friendly -- access from your phone on the same network

Install Without Docker

If you don't want Docker:

pip install open-webui
open-webui serve

This requires Python 3.11+. The Docker method is cleaner, but this works.

Connect Open WebUI to Ollama

By default, Open WebUI connects to Ollama at http://localhost:11434. If Ollama is running on the same machine (which it should be for this setup), it auto-connects.

If you see "No models found" in Open WebUI:

Make sure Ollama is running (ollama list should work in terminal)
Check that port 11434 is accessible
On Linux with Docker, the --add-host flag in the command above handles this

Step 5: The Ollama API -- For Developers and Power Users

Ollama runs a local API server on port 11434. This is what makes it powerful beyond just chatting -- you can plug it into any app that supports the OpenAI API format.

Quick API Test

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "What is the capital of France?",
  "stream": false
}'

OpenAI-Compatible Endpoint

Ollama supports the OpenAI API format, which means tools built for ChatGPT can talk to your local model:

curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

This compatibility is huge. It means you can use Ollama with:

Continue.dev (VS Code AI assistant)
Aider (AI pair programming)
LangChain and LlamaIndex for building apps
Any tool that lets you set a custom OpenAI base URL

Just point the tool to http://localhost:11434/v1 as the API base URL.

Model Management Tips

Where Models Are Stored

macOS: ~/.ollama/models
Linux: /usr/share/ollama/.ollama/models (or ~/.ollama/models if running as your user)
Windows: C:\Users\<username>\.ollama\models

Models are big. A 7B model is 4-5 GB, a 70B model is 40+ GB. Plan your disk space accordingly.

Moving the Model Directory

If your boot drive is small, move models to another drive:

# Linux/Mac - set environment variable
export OLLAMA_MODELS=/path/to/external/drive/ollama/models

# Windows - set system environment variable
# OLLAMA_MODELS = D:\ollama\models

Restart Ollama after changing this.

Custom Modelfiles

You can create custom model configurations with a Modelfile:

FROM llama3.1:8b
SYSTEM "You are a helpful coding assistant. Be concise."
PARAMETER temperature 0.7
PARAMETER num_ctx 8192

Then build it:

ollama create my-coding-assistant -f Modelfile
ollama run my-coding-assistant

This is great for setting up specialized assistants -- a coding helper, a writing assistant, a research tool -- each with their own system prompts and parameters.

Performance Expectations

Here's what you should roughly expect with the default Llama 3.1 8B model (q4_K_M quantization):

NVIDIA GPUs:

RTX 3060 12GB: ~35-45 tokens/sec
RTX 3080 10GB: ~50-60 tokens/sec
RTX 4070 Ti 12GB: ~65-75 tokens/sec
RTX 4090 24GB: ~90-110 tokens/sec

Apple Silicon:

M1 Pro 16GB: ~20-25 tokens/sec
M2 Max 32GB: ~35-40 tokens/sec
M4 Max 64GB: ~50-60 tokens/sec

CPU only (no GPU):

Modern i7/Ryzen 7: ~5-10 tokens/sec (usable but slow)

For a deep dive into performance across configurations, check our local LLM speed test.

These numbers are for prompt evaluation speed. If tokens feel slow, make sure GPU acceleration is working (Step 3).

Common Issues and Fixes

"Error: model not found" You typed the model name wrong. Run ollama list to see exact model names you have. Model names are case-sensitive.

Ollama is slow to start First run after boot can take 10-15 seconds as it initializes. Subsequent model loads are faster. If it's always slow, check that your antivirus isn't scanning the model files.

Out of memory errors The model is too big for your VRAM. Either use a smaller model or a more aggressive quantization. Drop from q8_0 to q4_K_M, or from a 13B model to an 8B model. See how much VRAM you need for sizing guidance.

Model outputs garbage This usually means a corrupt download. Remove and re-pull:

ollama rm llama3.1:8b
ollama pull llama3.1:8b

Port 11434 already in use Another instance of Ollama is running, or something else is on that port. Kill existing processes:

# Linux/Mac
sudo lsof -i :11434
kill <PID>

# Windows
netstat -ano | findstr 11434
taskkill /PID <PID> /F

What's Next

Once Ollama is running, you've got the foundation for everything else in local AI:

Compare your options: See how Ollama stacks up against the alternatives in our llama.cpp vs Ollama vs LM Studio comparison
Optimize your hardware: Check the best GPUs for local LLMs in 2026 if you're thinking about an upgrade
Build a dedicated rig: Our $3,000 dual-GPU LLM build guide shows you how to build a machine purpose-built for local AI
Go deeper: The llama.cpp advanced guide covers the flags and tuning that Ollama abstracts away

Benchmarks in this article are from March 2026 testing. Prices and performance may shift with new releases.

Ollama Setup in 5 Minutes

graph TD
    A["Install Ollama"] --> B["ollama serve"]
    B --> C["ollama pull model"]
    C --> D{"Model Size?"}
    D -->|"3B-7B"| E["Works on 6GB+ VRAM"]
    D -->|"13B"| F["Needs 10GB+ VRAM"]
    D -->|"30B+"| G["Needs 20GB+ VRAM"]
    E --> H["ollama run llama3.2"]
    F --> H
    G --> H
    H --> I["localhost:11434/api"]
    style A fill:#1A1A2E,color:#fff
    style H fill:#F5A623,color:#000
    style I fill:#22C55E,color:#000