RAM (System RAM)
General-purpose computer memory used by the CPU and OS — distinct from VRAM, but relevant for LLM offloading and CPU-only inference.
RAM (Random Access Memory) is your computer's main working memory — the fast temporary storage that the CPU uses for all active processes. In local LLM contexts, RAM is distinct from VRAM (the dedicated memory on your GPU), and the two serve different roles.
RAM vs. VRAM for Local LLMs
For GPU-accelerated inference, VRAM is the critical resource. RAM becomes relevant in two specific scenarios:
1. VRAM offloading — When a model is too large to fit entirely in VRAM, some model layers are offloaded to system RAM. The model still runs, but any layer that isn't in VRAM requires a PCIe bus transfer before processing — dramatically slower than pure VRAM inference. A model generating 80 tokens/second fully in VRAM might drop to 8-15 tokens/second when partially offloaded.
2. CPU-only inference — With no GPU, system RAM holds the model weights directly (likewise on Apple Silicon, where unified memory means system RAM and "VRAM" are the same pool). llama.cpp and Ollama both support CPU-only inference. Speed depends heavily on CPU memory bandwidth — roughly 50-90 GB/s theoretical for dual-channel DDR5, and less in practice, versus 600-1,000+ GB/s for GPU VRAM.
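The bandwidth figures above can be turned into a rough decode-speed estimate. The sketch below (plain Python; all bandwidth and model-size numbers are illustrative assumptions) treats token generation as purely memory-bandwidth bound, since every weight is read once per token, which is a reasonable first-order model for single-stream decoding.

```python
def tokens_per_second(model_bytes: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if generation is purely bandwidth-limited."""
    return bandwidth_gb_s * 1e9 / model_bytes

def offloaded_tokens_per_second(vram_bytes: float, offloaded_bytes: float,
                                vram_gb_s: float = 1000.0,
                                pcie_gb_s: float = 32.0) -> float:
    """Per token, VRAM-resident layers stream from VRAM while offloaded
    layers must cross the PCIe bus. Both bandwidths are assumed round figures."""
    seconds = (vram_bytes / (vram_gb_s * 1e9)
               + offloaded_bytes / (pcie_gb_s * 1e9))
    return 1.0 / seconds

# A 7B model at 4-bit quantization: ~7e9 params * 0.5 bytes = ~3.5 GB
model_bytes = 7e9 * 0.5

print(f"fully in VRAM (1000 GB/s): {tokens_per_second(model_bytes, 1000):.0f} tok/s")
print(f"CPU-only DDR5 (  60 GB/s): {tokens_per_second(model_bytes, 60):.0f} tok/s")
print(f"70% VRAM, 30% offloaded:   "
      f"{offloaded_tokens_per_second(0.7 * model_bytes, 0.3 * model_bytes):.0f} tok/s")
```

Even with 70% of the model still in VRAM, the PCIe transfers dominate the per-token time, which matches the order-of-magnitude slowdown described above.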
How Much RAM You Need
| Use case | Minimum | Recommended |
|---|---|---|
| GPU inference (model in VRAM) | 16GB | 32GB |
| GPU inference with partial offloading | 32GB | 64GB |
| CPU-only inference (7B model) | 16GB | 32GB |
| CPU-only inference (70B model) | 96GB | 128GB |
For most GPU-based setups where the model fits in VRAM, 32GB of system RAM is the practical target — enough for the OS, running applications, and comfortable headroom.
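The table's figures can be sanity-checked with a quick weight-size estimate. The sketch below is a rough heuristic, not a sizing tool: the 20% overhead factor for KV cache and runtime buffers is an assumption that varies widely by context length and runtime, and the table's recommendations add further room for the OS and other applications.

```python
def est_ram_gb(params_billion: float, bits_per_weight: int,
               overhead: float = 1.2) -> float:
    """Rough RAM needed to hold model weights, with an assumed ~20%
    overhead for KV cache, activations, and runtime buffers."""
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead

print(f"7B  @ 4-bit: ~{est_ram_gb(7, 4):.1f} GB")
print(f"70B @ 4-bit: ~{est_ram_gb(70, 4):.1f} GB")
print(f"70B @ 8-bit: ~{est_ram_gb(70, 8):.1f} GB")
```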
RAM Speed and Local AI
DDR5 offers higher bandwidth than DDR4, which matters for CPU inference far more than for GPU inference. For pure GPU inference where the data lives in VRAM, RAM speed has minimal impact. For offloaded or CPU-only workloads, faster RAM with higher bandwidth directly improves token generation speed.
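Theoretical peak bandwidth follows directly from channel count and transfer rate. A quick sketch (the channel width is the standard 64 bits per channel; real-world sustained bandwidth typically lands well below these peaks):

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: int,
                        bus_width_bits: int = 64) -> float:
    """Theoretical peak: channels x transfers/s x bytes per transfer."""
    return channels * mt_per_s * 1e6 * (bus_width_bits / 8) / 1e9

print(f"dual-channel DDR4-3200: {peak_bandwidth_gb_s(2, 3200):.1f} GB/s")
print(f"dual-channel DDR5-6000: {peak_bandwidth_gb_s(2, 6000):.1f} GB/s")
```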
On Apple Silicon, unified memory is shared between CPU and GPU at LPDDR5X speeds (roughly 410-546 GB/s on M4 Max, depending on configuration). This is why Apple chips punch above their weight for local AI — the memory bandwidth available to the "GPU" side is far higher than a discrete GPU can get when pulling weights from system RAM over PCIe.
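The gap can be put in numbers. Using an assumed theoretical PCIe 4.0 x16 figure of ~32 GB/s against a base M4 Max unified-memory figure of 410 GB/s (both round numbers for illustration):

```python
pcie4_x16_gb_s = 32.0    # theoretical PCIe 4.0 x16, one direction (assumed)
m4_unified_gb_s = 410.0  # Apple M4 Max unified memory, base config (assumed)

# A discrete GPU streaming offloaded weights over PCIe sees roughly an
# order of magnitude less bandwidth than the unified memory pool.
print(f"unified memory vs. PCIe transfer: ~{m4_unified_gb_s / pcie4_x16_gb_s:.0f}x")
```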