CraftRigs
Software & Tools

llama.cpp

A C++ inference engine that runs quantized GGUF models on CPU, GPU, or both simultaneously.

llama.cpp is an open-source C++ library created by Georgi Gerganov in March 2023 as a way to run Meta's LLaMA model on a MacBook without requiring a GPU. It has since evolved into the foundational inference engine for the local LLM ecosystem, with support for hundreds of model architectures and virtually every consumer hardware platform.

Ollama, LM Studio, Jan, and many other user-friendly tools are built on top of llama.cpp. When you run a model in those tools, llama.cpp is doing the actual computation under the hood.

What Makes It Distinctive

CPU+GPU split inference is llama.cpp's key feature for VRAM-constrained hardware. If a model is too large to fit entirely in VRAM, llama.cpp loads as many transformer layers as possible into GPU VRAM and keeps the remainder in system RAM, where they run on the CPU. You can specify exactly how many layers go to the GPU with the --n-gpu-layers flag (short form -ngl).
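A typical direct invocation looks something like the following sketch. The binary name (llama-cli) and flags match recent llama.cpp builds, but the model path is a placeholder and the layer count is illustrative:

```shell
# Offload 60 transformer layers to the GPU; the rest run on CPU.
# -m: path to a GGUF model file (placeholder path)
# -c: context window size in tokens
./llama-cli -m ./models/llama-70b-q4_k_m.gguf \
    --n-gpu-layers 60 \
    -c 4096 \
    -p "Explain quantization in one paragraph."
```

Setting --n-gpu-layers to a number larger than the model's layer count (e.g. 999) is a common shorthand for "put every layer on the GPU"; llama.cpp caps it at the actual number of layers.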

The result is a usable, if slower, setup for models that would otherwise be impossible to run. A 70B Q4_K_M model that needs roughly 40GB will not fit on a 24GB card, but offloading 60 of its 80 layers to the GPU and running the remaining 20 on the CPU might yield 8–12 t/s instead of full-GPU speed (15–20 t/s), which is still workable.
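A rough way to pick a starting value for --n-gpu-layers is to divide the model's file size by its layer count and see how many layers fit in your VRAM budget. This sketch uses the illustrative figures above (a ~40GB, 80-layer model on a 24GB card) plus an assumed reserve for KV cache and runtime overhead; real per-layer sizes vary, so treat the result as a starting point to tune from:

```python
# Back-of-the-envelope layer split for CPU+GPU offloading.
# All constants are assumptions taken from the example in the text.
MODEL_GB = 40.0     # approximate GGUF file size
N_LAYERS = 80       # transformer layers in a 70B Llama-family model
VRAM_GB = 24.0      # total GPU memory
OVERHEAD_GB = 4.0   # assumed reserve: KV cache, activations, driver context

per_layer_gb = MODEL_GB / N_LAYERS            # ~0.5 GB per layer
budget_gb = VRAM_GB - OVERHEAD_GB             # VRAM left for weights
gpu_layers = min(N_LAYERS, int(budget_gb / per_layer_gb))

print(f"~{per_layer_gb:.2f} GB/layer; try --n-gpu-layers {gpu_layers}")
```

If the run crashes with an out-of-memory error, lower the layer count; if VRAM usage comes in well under budget, raise it.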

Hardware Support

llama.cpp supports:

  • NVIDIA GPUs via CUDA
  • AMD GPUs via ROCm/HIP
  • Apple Silicon via Metal (often the best-optimized backend)
  • Intel GPUs via SYCL (experimental)
  • CPU-only on any x86_64 or ARM machine, with AVX2/AVX-512 optimizations
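Each backend is selected at build time. The CMake option names below match recent llama.cpp versions, but they have been renamed before (older builds used flags like LLAMA_CUBLAS), so check the repository's build docs for your version:

```shell
# Clone and build with one backend enabled (pick the line for your hardware).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# NVIDIA (CUDA):
cmake -B build -DGGML_CUDA=ON
# AMD (ROCm/HIP):
# cmake -B build -DGGML_HIP=ON
# Apple Silicon: Metal is enabled by default on macOS, no flag needed.
# CPU-only: a plain `cmake -B build` uses AVX2/AVX-512 where available.

cmake --build build --config Release -j
```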

Direct vs Wrapped Usage

You can run llama.cpp directly from the command line, which gives you full control over every parameter. Most users interact with it through Ollama or LM Studio, which handle model management and provide cleaner interfaces while calling llama.cpp internally.
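Besides the CLI, llama.cpp ships an HTTP server (llama-server) that exposes an OpenAI-compatible API, which is roughly what the wrapper tools build on. A minimal sketch, with placeholder model path, port, and layer count:

```shell
# Start the built-in server (runs until killed; & puts it in the background).
./llama-server -m ./models/model.gguf --port 8080 --n-gpu-layers 35 &

# Query it like any OpenAI-style chat endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```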

Why It Matters for Local AI

llama.cpp is why running models locally became accessible. Its support for quantized GGUF models, CPU offloading, and every major hardware platform made local LLMs practical for anyone with a capable laptop or desktop. Understanding it helps you troubleshoot issues in Ollama or LM Studio, since most problems trace back to llama.cpp's behavior.

Related guides:

  • CPU+GPU hybrid inference with llama.cpp: run 70B models on 16GB VRAM (how to configure --n-gpu-layers for split inference)
  • Ollama vs LM Studio vs llama.cpp vs vLLM (when to use llama.cpp directly versus a wrapper)
  • AMD vs NVIDIA for local LLMs (how different GPU platforms interact with llama.cpp's backends)