CraftRigs

Unified Memory

A single memory pool shared by the CPU and GPU on Apple Silicon chips.

On a traditional PC, the CPU has system RAM and the GPU has its own VRAM — two separate pools that don't overlap. Moving data between them requires copying across a bus (PCIe), which adds latency and limits throughput.

Apple Silicon eliminates this split. The CPU cores, GPU cores, Neural Engine, and memory controller all share a single pool of high-bandwidth memory. There's no copying between CPU and GPU — both access the same data directly.

Why This Changes Local LLM Math

On a discrete GPU setup, "16GB VRAM" means the GPU gets exactly 16GB. If the model exceeds that, it spills to system RAM and performance collapses.

With Apple Silicon, "16GB unified memory" means the entire chip — CPU and GPU both — shares that 16GB. A model that needs 14GB can load into unified memory and the GPU can run inference against it without any data transfer overhead.

This is why an M4 Pro with 24GB of unified memory can run models that would choke a discrete card with 16GB of VRAM. The effective usable capacity is higher: the GPU can claim most of the pool (macOS reserves a slice for the system), and it accesses that memory with no bus penalty.
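The memory math above can be sketched with a rule-of-thumb calculation. The function and the 20% overhead factor below are illustrative assumptions, not a precise accounting — real usage also depends on context length and the inference runtime:

```python
def model_footprint_gb(params_b: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Rough memory footprint of a (possibly quantized) model.

    params_b        parameter count in billions
    bits_per_weight e.g. 16 for fp16, ~4.5 for a typical 4-bit quant
    overhead        assumed multiplier for KV cache and runtime buffers
    """
    weights_gb = params_b * bits_per_weight / 8  # billions of params x bytes per weight
    return weights_gb * overhead

# A 14B model at a ~4.5-bit quantization vs the same model at fp16:
print(f"{model_footprint_gb(14, 4.5):.1f} GB")  # 4-bit quant
print(f"{model_footprint_gb(14, 16):.1f} GB")   # fp16
```

At roughly 9.5GB quantized, the 14B model fits a 16GB unified pool with headroom; at fp16 (≈33.6GB) it would not come close.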

The Bandwidth Trade-Off

Unified memory bandwidth is competitive but not top-of-class:

  • M4 Max (128GB config): ~546 GB/s
  • RTX 4090: 1,008 GB/s
  • RTX 5090: ~1,800 GB/s

For models that fit comfortably in its VRAM, the RTX 4090's bandwidth advantage translates directly into faster token generation. But the M4 Max can run much larger models without offloading at all, which often matters more for local LLM work.
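Token generation in the decode phase is largely memory-bandwidth-bound: each new token requires streaming roughly the full set of weights through the compute units. That gives a simple throughput ceiling — the 70% efficiency factor below is an assumption, and real numbers vary with runtime and batch size:

```python
def est_tokens_per_sec(bandwidth_gb_s: float, model_gb: float,
                       efficiency: float = 0.7) -> float:
    # Decode reads ~all weights once per token, so throughput is capped at
    # bandwidth / model size; efficiency is an assumed fraction of peak.
    return bandwidth_gb_s / model_gb * efficiency

for name, bw in [("M4 Max", 546), ("RTX 4090", 1008)]:
    # A ~9GB model (e.g. a 4-bit 14B quant) that fits on both devices
    print(f"{name}: ~{est_tokens_per_sec(bw, 9.0):.0f} tok/s ceiling")
```

The 4090's ceiling scales with its bandwidth advantage, which is why it wins on models small enough to fit in its VRAM.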

Why It Matters for Local AI

If you're choosing between a Mac with 64GB+ unified memory and a PC GPU with 24GB VRAM, the Mac wins on model size flexibility. You can run 70B quantized models that would be impossible on a single consumer GPU. The trade-off is peak generation speed on smaller models — the RTX 4090 will outrun Apple Silicon on a 7B model, but can't touch it on 70B.
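The 70B claim can be checked with the same kind of arithmetic. The usable-fraction and overhead values here are assumptions (macOS caps GPU wired memory by default, and discrete cards lose some VRAM to drivers and the display):

```python
def fits(params_b: float, bits_per_weight: float, pool_gb: float,
         usable_fraction: float = 0.75) -> bool:
    # ~20% overhead assumed for KV cache and runtime buffers
    need_gb = params_b * bits_per_weight / 8 * 1.2
    return need_gb <= pool_gb * usable_fraction

print(fits(70, 4.5, 64))  # 70B at a ~4.5-bit quant on a 64GB unified Mac
print(fits(70, 4.5, 24))  # the same model on a 24GB discrete GPU
```

At roughly 47GB needed, the 70B quant squeaks into a 64GB unified pool and is nowhere close to fitting in 24GB of VRAM.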

Related guides: AMD Strix Halo mini PC vs Mac Mini M4 for local AI — how AMD's unified memory architecture compares to Apple Silicon. M5 Max 128GB local LLM benchmarks vs expectations — real-world performance of Apple unified memory at scale. AMD vs NVIDIA for local LLMs — how discrete VRAM compares to unified memory pools across the full hardware landscape.