System RAM (vs VRAM)
General-purpose memory used by the CPU and operating system — slower than VRAM for GPU inference, but essential for CPU-only setups.
System RAM (sometimes called main memory or DDR RAM) is the memory your CPU uses — storing the operating system, running applications, browser tabs, and everything else your computer is doing. It's distinct from VRAM, which is dedicated to the GPU.
For local LLM inference on a discrete GPU, system RAM is not a substitute for VRAM. The GPU cannot reach system RAM at GPU speeds during inference — it can only work from what's in VRAM. If a model doesn't fit in VRAM, the layers that overflow to system RAM must be read back through the CPU and the PCIe bus, which is roughly 10–50x slower than direct VRAM access.
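The size of that gap falls out of the bandwidth numbers. A rough sketch — the figures below are ballpark assumptions (GDDR6X-class VRAM around 1,000 GB/s, PCIe 4.0 x16 around 32 GB/s, a hypothetical 0.5 GB transformer layer), not measurements:

```python
# Why VRAM overflow hurts: compare the time to stream one layer's weights
# from VRAM vs. across the PCIe bus. Bandwidths are assumed round numbers.
VRAM_BW_GBPS = 1000  # assumed GDDR6X-class VRAM bandwidth
PCIE_BW_GBPS = 32    # assumed PCIe 4.0 x16 ceiling

def time_to_read_ms(gigabytes: float, bandwidth_gbps: float) -> float:
    """Milliseconds to stream `gigabytes` of weights at `bandwidth_gbps`."""
    return gigabytes / bandwidth_gbps * 1000

layer_gb = 0.5  # assumed size of one transformer layer's weights
in_vram = time_to_read_ms(layer_gb, VRAM_BW_GBPS)
offloaded = time_to_read_ms(layer_gb, PCIE_BW_GBPS)
print(f"layer in VRAM:   {in_vram:.2f} ms")
print(f"layer offloaded: {offloaded:.2f} ms ({offloaded / in_vram:.0f}x slower)")
```

With these assumed numbers the offloaded layer takes about 31x longer per read — squarely inside the 10–50x range above.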
System RAM Specs for LLM Rigs
For GPU-based inference:
- 16GB system RAM — minimum for running the OS plus inference software
- 32GB — comfortable, allows other applications alongside inference
- 64GB+ — useful for heavy offloading or CPU-only inference of large models
System RAM speed matters more for CPU-only inference. DDR5 (5600 MT/s and above) is significantly faster than DDR4 for LLM workloads that stay on CPU.
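The DDR4/DDR5 gap is easy to quantify from the standard formula: peak bandwidth = channels x transfer rate (MT/s) x 8 bytes per transfer. A minimal sketch, assuming a typical dual-channel desktop:

```python
# Theoretical peak RAM bandwidth for a dual-channel desktop:
# channels x transfer rate (MT/s) x 8 bytes per 64-bit transfer.
def peak_bandwidth_gbps(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000  # MB/s -> GB/s

ddr4 = peak_bandwidth_gbps(2, 3200)  # DDR4-3200, dual channel
ddr5 = peak_bandwidth_gbps(2, 5600)  # DDR5-5600, dual channel
print(f"DDR4-3200 dual channel: {ddr4:.1f} GB/s")  # 51.2 GB/s
print(f"DDR5-5600 dual channel: {ddr5:.1f} GB/s")  # 89.6 GB/s
```

That's roughly a 75% jump in theoretical bandwidth, which is why DDR5 translates directly into faster CPU-only token generation.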
CPU-Only Inference
Without a GPU, the model runs entirely in system RAM accessed by the CPU. This is viable for smaller models:
- 7B Q4_K_M on a modern CPU: 5–15 t/s depending on core count, AVX support, and RAM bandwidth
- 13B Q4_K_M on CPU: 2–8 t/s — slow but functional for batch tasks
- 70B on CPU: 0.5–2 t/s — mostly impractical for interactive use
CPU inference is practical when you have a powerful multi-core processor with AVX2 (AVX-512 helps further where available, e.g. Ryzen 7000 and later — most modern Intel consumer chips lack it), fast RAM, and only need occasional or batched inference.
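The throughput figures above follow from RAM bandwidth: at batch size 1, generating each token requires streaming essentially all model weights through the CPU once, so tokens/s is capped at bandwidth divided by model size. A back-of-envelope estimate — model sizes are approximate Q4_K_M file sizes, and the bandwidth is the assumed DDR5-5600 dual-channel peak:

```python
# Bandwidth-bound ceiling on CPU decode speed: each generated token reads
# every weight once, so tokens/s <= RAM bandwidth / model size.
def max_tokens_per_s(bandwidth_gbps: float, model_gb: float) -> float:
    return bandwidth_gbps / model_gb

ram_bw = 89.6  # assumed DDR5-5600 dual-channel peak, GB/s
for name, size_gb in [("7B Q4_K_M", 4.4), ("13B Q4_K_M", 8.0), ("70B Q4_K_M", 40.0)]:
    print(f"{name}: <= {max_tokens_per_s(ram_bw, size_gb):.1f} t/s ceiling")
```

These ceilings (~20, ~11, and ~2 t/s) sit just above the real-world ranges quoted above — actual throughput lands below the ceiling because sustained bandwidth, cache behavior, and compute overhead all take a cut.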
The Apple Silicon Exception
Apple Silicon blurs the line between system RAM and VRAM entirely. Unified memory is accessible by both CPU and GPU at GPU-class bandwidth. A Mac with 64GB of unified memory can run a 70B model on its GPU without the PCIe offload penalty that system RAM overflow incurs on a discrete-GPU PC.
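The same bandwidth-bound arithmetic shows why this works. A sketch, assuming a Max-class M-series chip's ~400 GB/s unified memory and a ~40 GB 4-bit 70B model (both round-number assumptions):

```python
# Bandwidth-bound decode ceiling on unified memory vs. a DDR5 desktop.
# 400 GB/s is an assumed Max-class M-series spec; 89.6 GB/s is assumed
# DDR5-5600 dual-channel peak; 40 GB approximates a 70B 4-bit model.
model_gb = 40.0
unified = 400.0 / model_gb  # tokens/s ceiling on unified memory
desktop = 89.6 / model_gb   # tokens/s ceiling on desktop DDR5
print(f"70B ceiling, unified memory: {unified:.1f} t/s")
print(f"70B ceiling, desktop DDR5:   {desktop:.1f} t/s")
```

A ~10 t/s ceiling is interactive; a ~2 t/s ceiling is not — which is the practical difference unified memory makes for large models.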
Why It Matters for Local AI
System RAM is relevant in two scenarios: as overflow storage when VRAM runs out (with a major speed penalty), and as the primary inference medium for CPU-only setups. For GPU-based inference, prioritize VRAM first — system RAM is background infrastructure, not a performance lever.