GPU — Local AI Glossary | CraftRigs

The GPU is the single component that decides what models you can run locally and how fast they answer. Everything else in a local AI rig — CPU, system RAM, storage — is supporting cast around the GPU and its onboard memory.

What the GPU Actually Does for LLMs

For inference, the GPU holds the model weights in VRAM and streams them through its compute units (tensor cores on NVIDIA cards) to produce each token. Two limits matter most: VRAM capacity, which sets the largest model you can fit, and memory bandwidth, which sets how many tokens per second you get. Compute matters too, but generating each token of a dense model means reading essentially all of the weights from VRAM once, so for most local LLM workloads bandwidth is the bottleneck. That is why GDDR7 cards punch above their FLOPS rating.
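
The capacity and bandwidth limits reduce to simple arithmetic. The sketch below is a rough back-of-the-envelope estimate, not a benchmark: it assumes a dense model whose weights are all read once per generated token, ignores KV cache and runtime overhead, and uses illustrative numbers (an 8B-parameter model at 4 bits per weight, a card with 1000 GB/s of bandwidth). The function names are made up for this example.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def max_decode_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation speed if every token reads all weights once.

    Real throughput will be lower: this ignores KV cache reads, kernel
    overhead, and any time the GPU spends waiting on compute.
    """
    return bandwidth_gb_s / weights_gb


# Illustrative inputs only, not measurements of any specific card or model.
weights = weight_size_gb(params_billion=8, bits_per_weight=4)
ceiling = max_decode_tokens_per_s(bandwidth_gb_s=1000, weights_gb=weights)
print(f"weights: {weights} GB, bandwidth-bound ceiling: {ceiling} tokens/s")
```

Doubling the bandwidth doubles the ceiling, and doubling the model size halves it, which is why the same card feels fast on a small quantized model and slow on one that barely fits.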

Tiers, Tradeoffs, and Stacking

A first GPU for local LLMs typically lands in the $400–600 range, and the practical ceiling for single-card builds is the 24GB tier. Past that, builders either jump to workstation hardware, like the Lenovo ThinkStation P5 Gen 2's dual NVIDIA RTX Pro 6000 Blackwell Max-Q setup with 96GB of ECC GDDR7 per card and 192GB combined, or stack two consumer cards into a dual-GPU rig and split the model across both cards' VRAM. Apple's Mac Mini and Mac Studio are the alternative path: weaker raw compute, but unified memory lets a single chip address a far larger model than a same-priced discrete GPU can hold. The software ecosystem matters here too: CUDA is still the path of least resistance, with ROCm and MLX as the AMD and Apple alternatives.
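
To see where the tiers fall, a quick fit check helps. This is a minimal sketch under stated assumptions: the 20% headroom reserved for KV cache and runtime overhead is a guess rather than a measured figure, VRAM across cards is treated as simply additive (real multi-GPU splits lose a little to per-card overhead), and `fits_in_vram` is a hypothetical helper, not part of any runtime.

```python
def fits_in_vram(weights_gb: float, vram_per_card_gb: float,
                 cards: int = 1, headroom: float = 0.2) -> bool:
    """Return True if the weights fit after reserving a fraction of VRAM.

    headroom is an assumed reserve for KV cache, activations, and the
    runtime itself; tune it for your context length and software stack.
    """
    usable_gb = vram_per_card_gb * cards * (1 - headroom)
    return weights_gb <= usable_gb


# Hypothetical 70B-parameter model at 4 bits per weight: 70 * 4 / 8 = 35 GB.
weights_gb = 70 * 4 / 8

print(fits_in_vram(weights_gb, 24))           # one 24GB card -> False
print(fits_in_vram(weights_gb, 24, cards=2))  # two 24GB cards -> True
print(fits_in_vram(weights_gb, 96))           # one 96GB workstation card -> True
```

Under these assumptions the same model that cannot load on a single 24GB card fits on a dual-card rig or a single 96GB workstation card, which is the tradeoff the tiers above describe.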

Why It Matters for Local AI

Pick the wrong GPU and you're either stuck running tiny models or paying for compute you can't feed. VRAM caps which models load at all, bandwidth caps how fast they generate, and the software stack caps which runtimes (Ollama, llama.cpp, ExLlamaV2) actually work without friction. Know the ceiling of the card you're buying — it's the ceiling of every model you'll run on it.