
Memory Bandwidth

How fast data moves between memory and the processor, measured in GB/s.

Memory bandwidth is the rate at which data can be read from or written to memory — measured in gigabytes per second (GB/s). For local LLM inference, it's the single biggest predictor of how fast a model generates tokens, once the model fits in VRAM.

Why Bandwidth Dominates Inference Speed

LLM inference is memory-bandwidth bound, not compute bound. During the decode phase (when tokens stream out), the GPU reads through model weights repeatedly — once per token generated. A 7B model at Q4_K_M is about 4.5GB of weights. Generating each token requires reading all those weights off the memory chips. The faster that transfer happens, the faster tokens appear.
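This read-all-weights-per-token picture yields a simple upper bound on decode speed: divide memory bandwidth by model size. A minimal Python sketch of that arithmetic, using the numbers from the paragraph above (a ~4.5 GB Q4_K_M 7B model; real throughput lands below this ceiling because of KV-cache traffic and other overhead):

```python
# Back-of-the-envelope decode-speed ceiling: assumes each generated
# token requires exactly one full pass over the quantized weights.
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Theoretical upper bound on tokens/second during decode."""
    return bandwidth_gb_s / weights_gb

# RTX 4090 (1,008 GB/s) streaming a ~4.5 GB 7B Q4_K_M model:
print(decode_ceiling_tok_s(1008, 4.5))  # 224.0 tokens/s ceiling
```

Measured speeds typically come in well under this number, but the ceiling scales linearly with bandwidth, which is the point.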

This is why two GPUs with similar compute specs but different memory configurations produce very different inference speeds.

Real-World Bandwidth Numbers

| Hardware | Memory Bandwidth |
| --- | --- |
| RTX 4090 | 1,008 GB/s |
| RTX 4080 Super | 736 GB/s |
| RTX 4070 Ti Super | 672 GB/s |
| RTX 4070 | 504 GB/s |
| RTX 5090 | ~1,800 GB/s |
| M4 Max (64GB) | ~546 GB/s |
| M4 Pro (48GB) | ~273 GB/s |

The RTX 4090's 1,008 GB/s is roughly 2x the RTX 4070's 504 GB/s — and that translates almost directly to 2x the tokens per second on the same model.
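The 2x relationship can be checked with the same back-of-the-envelope ceiling (bandwidth divided by weight size). A small Python sketch, assuming a ~4.5 GB Q4_K_M 7B model; these are theoretical upper bounds, not measured benchmarks:

```python
# Bandwidth-per-token ceiling applied to both cards from the table above.
WEIGHTS_GB = 4.5  # approximate 7B Q4_K_M weight size

bandwidth_gb_s = {"RTX 4090": 1008, "RTX 4070": 504}

ceilings = {gpu: bw / WEIGHTS_GB for gpu, bw in bandwidth_gb_s.items()}
for gpu, tok_s in ceilings.items():
    print(f"{gpu}: {tok_s:.0f} tok/s ceiling")
# The 4090's ceiling is exactly 2x the 4070's, matching its 2x bandwidth.
```

Because both cards read the same weights per token, the bandwidth ratio carries straight through to the throughput ratio.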

Bandwidth vs VRAM Capacity

These are different constraints. VRAM capacity determines which models you can load; memory bandwidth determines how fast you can run them. An RTX 3090 has 24GB of VRAM (the same as the 4090) but only 936 GB/s of bandwidth, about 7% less, which makes it slightly slower at token generation. The RTX 4090's edge in decode speed comes primarily from its higher bandwidth, not from extra capacity.

Why It Matters for Local AI

When comparing GPUs for local LLM work, look at bandwidth alongside VRAM. A card with the same VRAM but lower bandwidth (like the RTX 3090 next to the RTX 4090) will generate tokens more slowly on the same model. Bandwidth is the throughput ceiling for every model you run.

Related guides:
- Best GPUs for local LLMs in 2026 — full GPU comparison ranked by bandwidth and tokens-per-second performance.
- Best 16GB GPU for local LLMs — how bandwidth differences play out across RTX 4060 Ti, RTX 5060 Ti, and Arc B580 at the same VRAM tier.