
Memory Bandwidth

How fast data moves between memory and the processor, measured in GB/s.

Memory bandwidth is the rate at which data can be read from or written to memory — measured in gigabytes per second (GB/s). For local LLM inference, it's the single biggest predictor of how fast a model generates tokens, once the model fits in VRAM.

Why Bandwidth Dominates Inference Speed

LLM inference is memory-bandwidth bound, not compute bound. During the decode phase (when tokens stream out), the GPU reads through model weights repeatedly — once per token generated. A 7B model at Q4_K_M is about 4.5GB of weights. Generating each token requires reading all those weights off the memory chips. The faster that transfer happens, the faster tokens appear.
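This read-all-weights-per-token picture yields a simple upper bound on decode speed: divide memory bandwidth by model size. A minimal Python sketch of that arithmetic, using the numbers from the paragraph above (a ~4.5 GB Q4_K_M 7B model; real throughput lands below this ceiling because of KV-cache traffic and other overhead):

```python
# Back-of-the-envelope decode-speed ceiling: assumes each generated
# token requires exactly one full pass over the quantized weights.
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Theoretical upper bound on tokens/second during decode."""
    return bandwidth_gb_s / weights_gb

# RTX 4090 (1,008 GB/s) streaming a ~4.5 GB 7B Q4_K_M model:
print(decode_ceiling_tok_s(1008, 4.5))  # 224.0 tokens/s ceiling
```

Measured speeds typically come in well under this number, but the ceiling scales linearly with bandwidth, which is the point.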

This is why two GPUs with similar compute specs but different memory configurations produce very different inference speeds.

Real-World Bandwidth Numbers

| Hardware | Memory Bandwidth |
| --- | --- |
| RTX 4090 | 1,008 GB/s |
| RTX 4080 Super | 736 GB/s |
| RTX 4070 Ti Super | 672 GB/s |
| RTX 4070 | 504 GB/s |
| RTX 5090 | ~1,800 GB/s |
| M4 Max (64GB) | ~546 GB/s |
| M4 Pro (48GB) | ~273 GB/s |

The RTX 4090's 1,008 GB/s is roughly 2x the RTX 4070's 504 GB/s — and that translates almost directly to 2x the tokens per second on the same model.
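The 2x relationship can be checked with the same back-of-the-envelope ceiling (bandwidth divided by weight size). A small Python sketch, assuming a ~4.5 GB Q4_K_M 7B model; these are theoretical upper bounds, not measured benchmarks:

```python
# Bandwidth-per-token ceiling applied to both cards from the table above.
WEIGHTS_GB = 4.5  # approximate 7B Q4_K_M weight size

bandwidth_gb_s = {"RTX 4090": 1008, "RTX 4070": 504}

ceilings = {gpu: bw / WEIGHTS_GB for gpu, bw in bandwidth_gb_s.items()}
for gpu, tok_s in ceilings.items():
    print(f"{gpu}: {tok_s:.0f} tok/s ceiling")
# The 4090's ceiling is exactly 2x the 4070's, matching its 2x bandwidth.
```

Because both cards read the same weights per token, the bandwidth ratio carries straight through to the throughput ratio.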

Bandwidth vs VRAM Capacity

These are different constraints. VRAM capacity determines which models you can load; memory bandwidth determines how fast you can run them. An RTX 3090 has 24GB of VRAM (the same as the 4090) but only 936 GB/s of bandwidth, about 7% less, which makes it slightly slower at token generation. The RTX 4090's edge in decode speed comes primarily from its higher bandwidth, not from extra capacity.

Why It Matters for Local AI

When comparing GPUs for local LLM work, look at bandwidth alongside VRAM. A card with the same VRAM but lower bandwidth (like the RTX 3090 next to the RTX 4090) will generate tokens more slowly on the same model. Bandwidth is the throughput ceiling for every model you run.

Related guides:
- Best GPUs for local LLMs in 2026 — full GPU comparison ranked by bandwidth and tokens-per-second performance.
- Best 16GB GPU for local LLMs — how bandwidth differences play out across RTX 4060 Ti, RTX 5060 Ti, and Arc B580 at the same VRAM tier.