System RAM (vs VRAM)
General-purpose memory used by the CPU and operating system — slower than VRAM for GPU inference, but essential for CPU-only setups.
System RAM (sometimes called main memory or DDR RAM) is the memory your CPU uses — storing the operating system, running applications, browser tabs, and everything else your computer is doing. It's distinct from VRAM, which is dedicated to the GPU.
For local LLM inference on a discrete GPU, system RAM is not a substitute for VRAM. The GPU cannot reach system RAM at GPU speeds during inference — it can only work from what's in VRAM. If a model doesn't fit in VRAM, the layers that overflow to system RAM must be read back through the CPU and the PCIe bus, which is roughly 10–50x slower than direct VRAM access.
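The size of that gap falls out of the bandwidth numbers. A rough sketch — the figures below are ballpark assumptions (GDDR6X-class VRAM around 1,000 GB/s, PCIe 4.0 x16 around 32 GB/s, a hypothetical 0.5 GB transformer layer), not measurements:

```python
# Why VRAM overflow hurts: compare the time to stream one layer's weights
# from VRAM vs. across the PCIe bus. Bandwidths are assumed round numbers.
VRAM_BW_GBPS = 1000  # assumed GDDR6X-class VRAM bandwidth
PCIE_BW_GBPS = 32    # assumed PCIe 4.0 x16 ceiling

def time_to_read_ms(gigabytes: float, bandwidth_gbps: float) -> float:
    """Milliseconds to stream `gigabytes` of weights at `bandwidth_gbps`."""
    return gigabytes / bandwidth_gbps * 1000

layer_gb = 0.5  # assumed size of one transformer layer's weights
in_vram = time_to_read_ms(layer_gb, VRAM_BW_GBPS)
offloaded = time_to_read_ms(layer_gb, PCIE_BW_GBPS)
print(f"layer in VRAM:   {in_vram:.2f} ms")
print(f"layer offloaded: {offloaded:.2f} ms ({offloaded / in_vram:.0f}x slower)")
```

With these assumed numbers the offloaded layer takes about 31x longer per read — squarely inside the 10–50x range above.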
System RAM Specs for LLM Rigs
For GPU-based inference:
- 16GB system RAM — minimum for running the OS plus inference software
- 32GB — comfortable, allows other applications alongside inference
- 64GB+ — useful for heavy offloading or CPU-only inference of large models
System RAM speed matters more for CPU-only inference. DDR5 (5600 MT/s and above) is significantly faster than DDR4 for LLM workloads that stay on CPU.
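The DDR4/DDR5 gap is easy to quantify from the standard formula: peak bandwidth = channels x transfer rate (MT/s) x 8 bytes per transfer. A minimal sketch, assuming a typical dual-channel desktop:

```python
# Theoretical peak RAM bandwidth for a dual-channel desktop:
# channels x transfer rate (MT/s) x 8 bytes per 64-bit transfer.
def peak_bandwidth_gbps(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000  # MB/s -> GB/s

ddr4 = peak_bandwidth_gbps(2, 3200)  # DDR4-3200, dual channel
ddr5 = peak_bandwidth_gbps(2, 5600)  # DDR5-5600, dual channel
print(f"DDR4-3200 dual channel: {ddr4:.1f} GB/s")  # 51.2 GB/s
print(f"DDR5-5600 dual channel: {ddr5:.1f} GB/s")  # 89.6 GB/s
```

That's roughly a 75% jump in theoretical bandwidth, which is why DDR5 translates directly into faster CPU-only token generation.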
CPU-Only Inference
Without a GPU, the model runs entirely in system RAM accessed by the CPU. This is viable for smaller models:
- 7B Q4_K_M on a modern CPU: 5–15 t/s depending on core count, AVX support, and RAM bandwidth
- 13B Q4_K_M on CPU: 2–8 t/s — slow but functional for batch tasks
- 70B on CPU: 0.5–2 t/s — mostly impractical for interactive use
CPU inference is practical when you have a powerful multi-core processor with AVX2 (AVX-512 helps further where available, e.g. Ryzen 7000 and later — most modern Intel consumer chips lack it), fast RAM, and only need occasional or batched inference.
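The throughput figures above follow from RAM bandwidth: at batch size 1, generating each token requires streaming essentially all model weights through the CPU once, so tokens/s is capped at bandwidth divided by model size. A back-of-envelope estimate — model sizes are approximate Q4_K_M file sizes, and the bandwidth is the assumed DDR5-5600 dual-channel peak:

```python
# Bandwidth-bound ceiling on CPU decode speed: each generated token reads
# every weight once, so tokens/s <= RAM bandwidth / model size.
def max_tokens_per_s(bandwidth_gbps: float, model_gb: float) -> float:
    return bandwidth_gbps / model_gb

ram_bw = 89.6  # assumed DDR5-5600 dual-channel peak, GB/s
for name, size_gb in [("7B Q4_K_M", 4.4), ("13B Q4_K_M", 8.0), ("70B Q4_K_M", 40.0)]:
    print(f"{name}: <= {max_tokens_per_s(ram_bw, size_gb):.1f} t/s ceiling")
```

These ceilings (~20, ~11, and ~2 t/s) sit just above the real-world ranges quoted above — actual throughput lands below the ceiling because sustained bandwidth, cache behavior, and compute overhead all take a cut.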
The Apple Silicon Exception
Apple Silicon blurs the line between system RAM and VRAM entirely. Unified memory is accessible by both CPU and GPU at GPU-class bandwidth. A Mac with 64GB of unified memory can run a 70B model on its GPU without the PCIe offload penalty that system RAM overflow incurs on a discrete-GPU PC.
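The same bandwidth-bound arithmetic shows why this works. A sketch, assuming a Max-class M-series chip's ~400 GB/s unified memory and a ~40 GB 4-bit 70B model (both round-number assumptions):

```python
# Bandwidth-bound decode ceiling on unified memory vs. a DDR5 desktop.
# 400 GB/s is an assumed Max-class M-series spec; 89.6 GB/s is assumed
# DDR5-5600 dual-channel peak; 40 GB approximates a 70B 4-bit model.
model_gb = 40.0
unified = 400.0 / model_gb  # tokens/s ceiling on unified memory
desktop = 89.6 / model_gb   # tokens/s ceiling on desktop DDR5
print(f"70B ceiling, unified memory: {unified:.1f} t/s")
print(f"70B ceiling, desktop DDR5:   {desktop:.1f} t/s")
```

A ~10 t/s ceiling is interactive; a ~2 t/s ceiling is not — which is the practical difference unified memory makes for large models.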
Why It Matters for Local AI
System RAM is relevant in two scenarios: as overflow storage when VRAM runs out (with a major speed penalty), and as the primary inference medium for CPU-only setups. For GPU-based inference, prioritize VRAM first — system RAM is background infrastructure, not a performance lever.