
Decode Speed

The rate at which output tokens stream out during the token-generation (decode) phase of LLM inference.

Decode is the second phase of LLM inference — after the model has processed your input (prefill), it begins generating output tokens one at a time. This sequential generation process is what you observe as text streaming onto the screen. Decode speed is measured in tokens per second (t/s) and is the number most people mean when they say "how fast is this model."

Why Decode is Memory-Bandwidth Bound

During decode, the GPU generates one token per step. Each step requires reading the full set of model weights from memory — all of them, every time. For a 7B Q4_K_M model, that's ~4.5GB of data read per token. For a 70B Q4_K_M model, it's ~40GB per token.

The GPU's compute cores are mostly idle during this process — they're waiting for data to arrive from memory. This makes decode fundamentally limited by memory bandwidth (GB/s), not compute (TFLOPS).

This is a critical distinction: a GPU with double the CUDA cores but the same memory bandwidth will not meaningfully improve decode speed. A GPU with double the bandwidth will roughly double it.
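This ceiling is easy to estimate: if every token requires one full read of the weights, decode speed can never exceed memory bandwidth divided by model size. A back-of-envelope sketch using the figures above:

```python
# Back-of-envelope decode ceiling: each token requires reading every weight
# once, so tokens/second cannot exceed bandwidth divided by model size.
def decode_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical upper bound on decode speed, in tokens per second."""
    return bandwidth_gb_s / model_gb

# RTX 4090 (1,008 GB/s) running a 7B Q4_K_M (~4.5 GB of weights):
print(round(decode_ceiling(1008, 4.5)))  # 224 t/s ceiling
```

Real-world numbers land well below this ceiling (kernel overhead, KV-cache reads, sampling), but the bound explains why no amount of extra compute helps.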

Decode Speed at Different Hardware Tiers

Running Llama 3.1 8B Q4_K_M (a common benchmark model):

  • RTX 4090 (1,008 GB/s): ~127 t/s
  • RTX 4080 Super (736 GB/s): ~90 t/s
  • M4 Max 64GB (546 GB/s): ~65–75 t/s
  • RTX 4070 (504 GB/s): ~55 t/s

The relationship between bandwidth and decode speed is close to linear for single-user inference on a model that fits fully in VRAM.
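That near-linear relationship makes bandwidth a useful predictor: scale a known measurement by the bandwidth ratio. A sketch using the figures listed above:

```python
# Predict decode speed on a new card by scaling a known measurement
# by the ratio of memory bandwidths (valid for single-user inference
# on a model that fits fully in VRAM).
def scale_by_bandwidth(known_ts: float, known_bw: float, target_bw: float) -> float:
    return known_ts * target_bw / known_bw

# RTX 4090 measures ~127 t/s at 1,008 GB/s; predict the 4080 Super (736 GB/s):
print(round(scale_by_bandwidth(127, 1008, 736)))  # ~93 t/s, vs ~90 measured
```

The prediction is within a few percent for the 4080 Super; lower-tier cards can undershoot the linear estimate slightly as fixed overheads take a larger share of each step.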

Batch Size Effects

Decode speed changes significantly when serving multiple simultaneous requests. Under batched inference (e.g., running a local API server for multiple users), the weights are read from memory once per step but a token is produced for every request in the batch, so the bandwidth cost is amortized across the batch and GPU compute starts to matter more. Data center deployments optimize heavily for batched throughput. For single-user local setups, batch size is usually 1 and bandwidth dominates.
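A rough roofline-style sketch of why batching shifts the bottleneck. The `compute_tok_s` figure here is an illustrative assumption, not a measured spec:

```python
# Toy roofline model: per step, the GPU reads the weights once (bandwidth
# cost is fixed) but must compute one token per request in the batch.
# Throughput grows with batch size until compute becomes the limit.
def batched_throughput(batch: int, model_gb: float,
                       bw_gb_s: float, compute_tok_s: float) -> float:
    weight_read_time = model_gb / bw_gb_s        # seconds per step (fixed)
    compute_time = batch / compute_tok_s         # seconds per step (grows)
    step_time = max(weight_read_time, compute_time)
    return batch / step_time                     # tokens/second overall

# Illustrative: 4.5 GB model, 1,008 GB/s, assumed 5,000 t/s compute limit.
print(round(batched_throughput(1, 4.5, 1008, 5000)))   # 224: bandwidth-bound
print(round(batched_throughput(32, 4.5, 1008, 5000)))  # 5000: compute-bound
```

At batch size 1 the model reproduces the single-user bandwidth ceiling; at larger batches throughput saturates at the compute limit, which is why data-center GPUs chase TFLOPS while local rigs chase GB/s.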

VRAM Offloading Penalty

When a model doesn't fully fit in VRAM and layers offload to system RAM, decode speed collapses. Accessing system RAM through the CPU for each token is 10–50x slower than reading from VRAM. A setup that runs at 80 t/s with full VRAM can drop to 3–8 t/s when even a fraction of layers offload to RAM.
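The collapse follows directly from the two bandwidths involved: per-token time is the sum of the fast VRAM read and the slow system-RAM read, so the slow path dominates even when it covers a small fraction of the weights. A minimal model (the ~60 GB/s system-RAM figure is an assumed example):

```python
# Minimal offload model: per-token time = time to read the VRAM-resident
# layers + time to read the RAM-resident layers through the CPU.
def offload_decode_speed(model_gb: float, vram_fraction: float,
                         vram_bw_gb_s: float, ram_bw_gb_s: float) -> float:
    vram_time = (model_gb * vram_fraction) / vram_bw_gb_s
    ram_time = (model_gb * (1 - vram_fraction)) / ram_bw_gb_s
    return 1.0 / (vram_time + ram_time)

# 40 GB model, 90% in VRAM at 1,008 GB/s, 10% in system RAM at ~60 GB/s:
# even with most layers on the GPU, speed collapses toward the slow path.
print(round(offload_decode_speed(40, 0.9, 1008, 60), 1))   # ~9.8 t/s
print(round(offload_decode_speed(40, 1.0, 1008, 60), 1))   # 25.2 t/s all-VRAM
```

Offloading just 10% of the weights costs over 60% of the speed in this sketch, which is the arithmetic behind the "fits in VRAM or it doesn't" rule of thumb.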

Decode speed is the number that determines how your local AI setup feels to use day-to-day.