CPU — Local AI Glossary | CraftRigs

The CPU is the general-purpose processor in your rig — the chip that runs your OS, schedules work, and feeds data to the GPU. For local LLM inference it can also run model layers directly, but it's an order of magnitude slower than a GPU at the matrix math that dominates transformer workloads.

CPU vs GPU for Inference

GPUs win at LLM inference because they have thousands of parallel cores and far higher memory bandwidth than any consumer CPU. A modern desktop CPU pulls from system RAM at roughly 80–100 GB/s; a VRAM-resident model on an RTX 3090 sees nearly 1 TB/s (936 GB/s on the spec sheet). Since decode is bandwidth-bound, that gap shows up directly in tokens per second. Llamafile 0.10.0 explicitly framed CPU-only inference at 2–3 tok/s as "a punishment" worth escaping by restoring CUDA.
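To see why a bandwidth gap turns into a tokens-per-second gap, here is a minimal back-of-the-envelope sketch. It assumes every weight byte is read once per generated token and that nothing else costs time, so it gives a ceiling, not a benchmark. The function name and the 18 GB model size are illustrative assumptions, not figures from a real run.

```python
def decode_ceiling_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed for a bandwidth-bound model.

    Assumes every weight byte is read once per generated token and that
    compute, KV-cache reads, and runtime overhead are free.
    """
    if model_gb <= 0 or bandwidth_gb_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    model_gb = 18.0  # hypothetical quantized model size, for illustration only
    for label, bandwidth in [("system RAM", 90.0), ("RTX 3090 VRAM", 936.0)]:
        ceiling = decode_ceiling_tok_s(model_gb, bandwidth)
        print(f"{label}: at most ~{ceiling:.0f} tok/s")
```

Real throughput lands below these ceilings, but the ratio between the two results tracks the bandwidth ratio, which is the point of the comparison.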

When the CPU Gets Pulled In

The CPU gets work when a model is too large for VRAM and layers spill to system memory, known as VRAM offloading. Running a 32B model on a 16GB card with CPU offload typically throttles output to around 3 tok/s, because every token now waits on the offloaded layers, which either execute on the slower processor out of system RAM or have to cross the PCIe bus, depending on the runtime. The hardware upgrade ladder treats "fits fully in VRAM, no CPU offloading" as the threshold where 27B–34B models become practically usable. Everything below that is a compromise.
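The cost of a partial spill can be sketched with the same bandwidth reasoning. The model below is a simplification: it assumes the offloaded layers run on the CPU out of system RAM, that per-token time is just the bytes read on each side divided by that side's bandwidth, and that PCIe hand-offs and compute are free. The function name and all the numbers in the usage section are hypothetical.

```python
def offload_ceiling_tok_s(
    model_gb: float,
    vram_fraction: float,
    gpu_bw_gb_s: float,
    cpu_bw_gb_s: float,
) -> float:
    """Upper-bound decode speed when part of the model sits in system RAM.

    vram_fraction is the share of the weights resident in VRAM (0.0 to 1.0).
    Per-token time is the GPU portion's read time plus the CPU portion's.
    """
    if not 0.0 <= vram_fraction <= 1.0:
        raise ValueError("vram_fraction must be between 0 and 1")
    if model_gb <= 0 or gpu_bw_gb_s <= 0 or cpu_bw_gb_s <= 0:
        raise ValueError("sizes and bandwidths must be positive")
    gpu_seconds = model_gb * vram_fraction / gpu_bw_gb_s
    cpu_seconds = model_gb * (1.0 - vram_fraction) / cpu_bw_gb_s
    return 1.0 / (gpu_seconds + cpu_seconds)


if __name__ == "__main__":
    # Hypothetical rig: 20 GB model, 500 GB/s VRAM, 80 GB/s system RAM.
    for fraction in (1.0, 0.9, 0.7, 0.5):
        rate = offload_ceiling_tok_s(20.0, fraction, 500.0, 80.0)
        print(f"{fraction:.0%} in VRAM: at most ~{rate:.1f} tok/s")
```

Even in this optimistic model the slow side dominates quickly: moving a minority of the weights to system RAM costs most of the throughput, and measured numbers are lower still once transfer and synchronization overhead are included.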

Why It Matters for Local AI

Your CPU choice rarely bottlenecks a well-sized local rig: any modern 6-core part is enough to feed a GPU doing all the inference work. The real CPU question is how much you'll lean on it. If your model fits in VRAM, the CPU barely matters; if it doesn't, the CPU and your system RAM bandwidth become the entire performance story. That's why CraftRigs builds are sized around VRAM first, with CPU specs treated as secondary: the cheapest way to make a CPU "fast" for LLMs is to stop using it for inference.
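A quick way to check which side of that line a build falls on is to estimate the weight footprint and compare it with the card's VRAM. The sketch below is a rough sizing aid under stated assumptions: weights take parameters times bits per weight, and a flat overhead_gb allowance (a guess, defaulting to 2 GB here) stands in for KV cache and runtime buffers. Real quantization formats and context lengths will shift these numbers.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("parameter count and bit width must be positive")
    return params_billions * bits_per_weight / 8.0


def fits_in_vram(
    params_billions: float,
    bits_per_weight: float,
    vram_gb: float,
    overhead_gb: float = 2.0,  # assumed allowance for KV cache and buffers
) -> bool:
    """True if the estimated footprint fits without spilling to system RAM."""
    needed = weights_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= vram_gb


if __name__ == "__main__":
    for vram in (16.0, 24.0):
        verdict = "fits" if fits_in_vram(32.0, 4.0, vram) else "spills to CPU"
        print(f"32B at 4-bit on a {vram:.0f} GB card: {verdict}")
```

Under these assumptions a 32B model at 4 bits per weight needs about 16 GB for weights alone, so it spills on a 16 GB card and fits on a 24 GB one.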