Local LLM — Local AI Glossary | CraftRigs

A local LLM is a large language model whose weights, runtime, and prompt processing all sit on hardware you control — a workstation, gaming PC, Mac, or homelab server. Nothing is sent to OpenAI, Anthropic, or any third-party API. The cost is the hardware and the electricity; the trade-off is that you are responsible for the throughput, the memory budget, and the model file itself.

Why People Run Models Locally

Privacy — sensitive prompts (legal, medical, business notes) never leave the device.
Cost — no per-token charges; once the GPU is paid for, every additional token is electricity-only.
Latency and offline use — works on a plane, in a clinic, behind an air-gap, or anywhere the network is unreliable.
Control — pick the exact model, quantization, and runtime; no surprise rate limits or model retirements.
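
The "electricity-only" claim can be checked with a little arithmetic. The sketch below is a rough estimate, not a measurement: the function names are hypothetical, and the power draw, generation speed, electricity price, and API price are all inputs you would supply from your own setup.

```python
def electricity_cost_per_million_tokens(watts: float,
                                        tokens_per_second: float,
                                        price_per_kwh: float) -> float:
    """Electricity cost of generating one million tokens locally."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts * seconds / 3_600_000  # watt-seconds -> kilowatt-hours
    return kwh * price_per_kwh


def break_even_million_tokens(hardware_cost: float,
                              api_price_per_million: float,
                              local_cost_per_million: float) -> float:
    """Millions of tokens after which the hardware has paid for itself."""
    saving = api_price_per_million - local_cost_per_million
    if saving <= 0:
        raise ValueError("local generation is not cheaper per token")
    return hardware_cost / saving


# Assumed example inputs: 200 W draw, 50 tokens/s, 0.15 per kWh.
local = electricity_cost_per_million_tokens(200, 50, 0.15)
print(f"electricity per million tokens: {local:.2f}")
```

The break-even point is very sensitive to how many tokens you actually generate, so a lightly used rig may never recoup its cost on price alone.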

What "Local" Actually Demands

Two specs dominate local-LLM hardware: VRAM (how big a model fits) and memory bandwidth (how fast tokens come out). A 7B model in Q4 quantization needs ~5 GB; a 70B Q4 model needs ~40 GB and almost always means a 24 GB card with CPU offload, dual GPUs, or a high-bandwidth Apple Silicon Mac.
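
The figures above follow from simple arithmetic, sketched here under stated assumptions. The function names are hypothetical, and 4.5 bits per weight is an assumed average for a typical Q4 format (the exact figure varies by quantization scheme). The throughput function gives only a ceiling: it assumes every weight is read from memory once per generated token and ignores compute, KV cache traffic, and other overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def decode_ceiling_tokens_per_second(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream generation speed.

    Assumes each new token requires streaming all weights from memory once.
    """
    return bandwidth_gb_s / weights_gb


for size in (7, 13, 70):
    gb = weight_memory_gb(size)
    # 400 GB/s is an assumed, illustrative bandwidth, not a specific card.
    ceiling = decode_ceiling_tokens_per_second(gb, 400)
    print(f"{size}B at Q4: ~{gb:.1f} GB of weights, at most ~{ceiling:.0f} tokens/s")
```

The weights alone come to roughly 4 GB for 7B and 39 GB for 70B; the KV cache and runtime buffers account for the remainder of the ~5 GB and ~40 GB totals, and they grow with context length.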

Most people start with Ollama or LM Studio because both package llama.cpp behind a simple install: Ollama as a command-line tool and local server, LM Studio as a desktop app. Heavier setups move to vLLM for multi-user serving or ExLlamaV2 for raw single-stream speed.
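
As an illustration of what "local" looks like in practice, here is a minimal sketch of sending a prompt to an Ollama server running on the same machine. It assumes Ollama is installed, is listening on its default port 11434, and has already pulled the model you name. The helper names `build_request` and `generate` are hypothetical; only the `/api/generate` endpoint and its `model`, `prompt`, and `stream` fields come from Ollama itself.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing here leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("<model-name>", "Define quantization in one sentence.")` with the name of a model you have pulled returns the completion as a string. The call fails with a connection error if no server is running.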

Common Misconceptions

"Local means slow." Modern consumer GPUs hit dozens to hundreds of tokens per second on 7B–13B models. When generation does feel slow, the usual cause is a model too large to fit entirely in VRAM, not a weak GPU.
"I need a $4,000 rig." A $400 used RTX 3060 12 GB runs almost every 13B model usefully at Q4 quantization. The hardware floor is much lower than headline benchmark articles suggest.
"Local models are far behind cloud models." For most everyday tasks (drafting, summarizing, code completion on smaller projects, RAG over personal docs), a Q4 70B local model is competitive. The cloud's edge is concentrated at the very largest frontier-tier models.