
Ollama

A tool that makes running local LLMs as simple as a single terminal command.

Ollama is an open-source application that wraps llama.cpp into a clean, Docker-like interface for running local language models. It handles model downloading, storage, GPU configuration, and serving — everything that's tedious to manage manually when using llama.cpp directly.

The core experience is a single command: ollama run llama3.2 downloads the model if needed and starts an interactive chat session. For most users, this is the fastest path to running a local LLM.

What Ollama Handles For You

  • Model library: Ollama maintains a curated library of models at ollama.com/library. ollama pull downloads them to a local cache. Popular options include Llama 3.2, Qwen 2.5, Mistral, Gemma, DeepSeek, and Phi.
  • GPU detection: Automatically detects and uses your GPU (NVIDIA CUDA, AMD ROCm, or Apple Metal). No manual driver configuration.
  • Model versioning: You can pull specific quantization variants — ollama pull llama3.2:3b-instruct-q4_K_M for example. (Llama 3.2's text models ship in 1B and 3B sizes; there is no 7B variant.)
  • Background service: Ollama runs as a background daemon, keeping a model loaded in VRAM between requests (five minutes by default) to avoid reload time.
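The day-to-day workflow reduces to a handful of commands. A typical session might look like this (a sketch — it assumes Ollama is installed and the daemon is running, and the model tags shown must exist in the library):

```shell
# Download a specific quantization variant from the library
ollama pull llama3.2:3b-instruct-q4_K_M

# Start an interactive chat (downloads the default tag first if missing)
ollama run llama3.2

# Show downloaded models and their sizes on disk
ollama list

# Show which models are currently loaded in VRAM
ollama ps

# Free disk space by deleting a model
ollama rm llama3.2:3b-instruct-q4_K_M
```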

The REST API

Ollama exposes a local REST API at http://localhost:11434, with native endpoints under /api and OpenAI-compatible endpoints under /v1. The OpenAI compatibility is partial, but it covers the common chat-completion path — you can point many tools that support OpenAI's API format at Ollama by changing the base URL, with no API key required. This makes it easy to integrate Ollama with tools like Open WebUI, Continue (VS Code extension), or custom scripts.
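A minimal sketch of calling that endpoint from Python with only the standard library (the model name "llama3.2" is an example and must already be pulled locally):

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible base URL


def build_chat_request(model, user_message):
    """Build an OpenAI-style chat completion request targeting Ollama.

    No API key is needed; Ollama does not require an Authorization header.
    """
    url = f"{OLLAMA_BASE}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return url, payload


def chat(model, user_message):
    """Send the request to the local Ollama daemon and return the reply text."""
    url, payload = build_chat_request(model, user_message)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running Ollama daemon with the model pulled):
#   print(chat("llama3.2", "Why is the sky blue?"))
```

Because the request shape matches OpenAI's, the official OpenAI client libraries also work against Ollama if you set their base URL to http://localhost:11434/v1 and any placeholder API key.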

Limitations

Ollama's model library covers popular open models but not every GGUF file available on Hugging Face. For models not in the library, you can create a custom Modelfile pointing to a local GGUF, or use llama.cpp directly. Fine-tuned models and custom quantizations require this approach.
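A minimal Modelfile sketch for that case — the GGUF filename, parameter values, and system prompt here are placeholders, not defaults:

```
# Modelfile — point Ollama at a GGUF you downloaded yourself
FROM ./my-finetune-q4_K_M.gguf

# Optional generation defaults
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# Optional system prompt baked into the model
SYSTEM You are a concise technical assistant.
```

Register and run it with ollama create my-finetune -f Modelfile, then ollama run my-finetune.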

Platform Support

Ollama runs on macOS, Linux, and Windows. The macOS build is particularly well optimized, using the Metal backend for Apple Silicon; on Windows, a native build is available (earlier releases required WSL).

Why It Matters for Local AI

Ollama removed the main barrier to entry for local LLMs: the complexity of setup. If you want to try local models without configuration headaches, Ollama is the right starting point. Its OpenAI-compatible API also makes it easy to build local-first applications without changing much code.

Related guides:

  • Ollama vs LM Studio vs llama.cpp vs vLLM — when to use Ollama versus alternatives.
  • How to run LLMs locally: beginner's guide — start-to-finish setup with Ollama.
  • vLLM on a single consumer GPU — when you've outgrown Ollama's single-user queuing and need multi-user serving.