CraftRigs
Software & Tools

Local LLM

A large language model that runs entirely on your own hardware — no cloud API, no per-token billing, no data leaving the machine.

A local LLM is a large language model whose weights, runtime, and prompt processing all sit on hardware you control — a workstation, gaming PC, Mac, or homelab server. Nothing is sent to OpenAI, Anthropic, or any third-party API. The cost is the hardware and the electricity; the trade-off is that you are responsible for the throughput, the memory budget, and the model file itself.

Why People Run Models Locally

  • Privacy — sensitive prompts (legal, medical, business notes) never leave the device.
  • Cost — no per-token charges; once the GPU is paid for, every additional token is electricity-only.
  • Latency and offline use — works on a plane, in a clinic, behind an air-gap, or anywhere the network is unreliable.
  • Control — pick the exact model, quantization, and runtime; no surprise rate limits or model retirements.
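
The cost point above comes down to simple arithmetic. The sketch below is a rough estimate, assuming a steady power draw and a steady generation speed. The 200 W, 50 tokens/s, and $0.15/kWh inputs are illustrative placeholders, not measurements of any particular rig.

```python
def electricity_cost_per_million_tokens(
    power_watts: float,
    tokens_per_second: float,
    price_per_kwh: float,
) -> float:
    """Electricity cost of generating one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second      # time to produce 1M tokens
    kwh = power_watts / 1000 * seconds / 3600    # energy used over that time
    return kwh * price_per_kwh


# Illustrative inputs only (assumptions): 200 W draw, 50 tokens/s, $0.15 per kWh.
cost = electricity_cost_per_million_tokens(200, 50, 0.15)
print(f"${cost:.2f} per million tokens")
```

Comparing that figure with a per-token API price, plus the up-front hardware cost, gives a rough sense of when a local setup pays for itself.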

What "Local" Actually Demands

Two specs dominate local-LLM hardware: VRAM (how big a model fits) and memory bandwidth (how fast tokens come out). A 7B model in Q4 quantization needs roughly 5 GB including a modest context. A 70B Q4 model needs roughly 40 GB for the weights alone, which is more than a single 24 GB card holds, so in practice it means a 24 GB card with CPU offload, dual GPUs, or a high-bandwidth Apple Silicon Mac with enough unified memory.
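
The memory figures above follow from parameter count times bits per weight. The sketch below is a minimal estimate, assuming about 4.5 effective bits per weight for a Q4-style quantization and a flat 1 GB allowance for context and runtime overhead. Both numbers are assumptions for illustration; real files and runtimes vary.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the weights in GB (1 GB = 10**9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(
    params_billions: float,
    vram_gb: float,
    bits_per_weight: float = 4.5,   # assumed effective bits for Q4-style quantization
    overhead_gb: float = 1.0,       # assumed allowance for KV cache and runtime
) -> bool:
    """True if the weights plus the overhead allowance fit in the given VRAM."""
    return weight_size_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


for size in (7, 13, 70):
    print(f"{size}B at ~4.5 bits/weight: {weight_size_gb(size):.1f} GB of weights")

print(fits_in_vram(7, 12))    # 7B on a 12 GB card
print(fits_in_vram(70, 24))   # 70B on a single 24 GB card
```

Under these assumptions a 7B model comes to about 3.9 GB of weights and a 70B model to about 39.4 GB, which is why the larger model does not fit on one 24 GB card without offload.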

Most people start with Ollama or LM Studio because both build on llama.cpp and hide its setup: Ollama behind a command-line tool, LM Studio behind a desktop app. Heavier setups move to vLLM for multi-user serving or ExLlamaV2 for raw single-stream speed.
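
To show what talking to one of these runtimes looks like, here is a minimal sketch of a call to Ollama's local HTTP API using only the Python standard library. It assumes an Ollama server is already running on its default port (11434) and that a model has been pulled. The helper names `build_request` and `generate` are hypothetical, introduced only for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),   # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, a call such as `generate("llama3.2", "Explain quantization in one sentence.")` would return the reply as a string. The model name here is only an example. The request goes to localhost, so the prompt never leaves the machine.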

Common Misconceptions

  • "Local means slow." Modern consumer GPUs hit dozens to hundreds of tokens per second on 7B–13B models. Speed drops mainly when the model outgrows VRAM and layers spill to system RAM, so the usual bottleneck is the model size you choose, not the GPU itself.
  • "I need a $4,000 rig." A $400 used RTX 3060 12 GB fits almost every 13B model at Q4 quantization and runs it at a usable speed. The hardware floor is much lower than headline benchmark articles suggest.
  • "Local models are far behind cloud models." For most everyday tasks (drafting, summarizing, code completion on smaller projects, RAG over personal docs), a Q4 70B local model is competitive with cloud offerings. The cloud's edge is concentrated in the very largest frontier-tier models.
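
The "local means slow" point can be checked with a back-of-the-envelope bound. For a dense model generating one token at a time, each token requires reading roughly all the weights once, so decode speed cannot exceed memory bandwidth divided by model size. The sketch below applies that bound. The bandwidth and size figures are round illustrative numbers (assumptions), and real throughput lands below the ceiling.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Each generated token reads roughly all weights once, so throughput
    cannot exceed memory bandwidth divided by model size.
    """
    return bandwidth_gb_s / model_gb


# Illustrative pairings (assumptions): round bandwidth and model-size figures.
examples = [
    ("7B Q4 (~4 GB) on a 360 GB/s GPU", 360, 4),
    ("70B Q4 (~40 GB) on an 800 GB/s unified-memory machine", 800, 40),
    ("70B Q4 (~40 GB) served from ~50 GB/s system RAM", 50, 40),
]
for label, bandwidth, size in examples:
    ceiling = decode_ceiling_tokens_per_s(bandwidth, size)
    print(f"{label}: at most ~{ceiling:.0f} tokens/s")
```

The last row shows why spilling a large model into system RAM feels slow: the same arithmetic that gives a small model a high ceiling on a GPU gives a 40 GB model only a token or two per second from ordinary system memory.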