Decode speed is the number of tokens your GPU generates per second — 50+ feels instant, 20-30 is acceptable, below 10 is laggy. It's almost entirely determined by VRAM memory bandwidth, not raw compute power. A GPU with high bandwidth beats one with high TFLOPS for local LLM use, every time.
If you've ever wondered why your setup feels slow even though your GPU has "good specs," this is the answer.
What Is Decode Speed and Why Does It Feel Slow?
Decode speed is the rate at which your GPU produces output tokens. Each token is roughly 0.75 words, so 50 tok/s translates to about 37 words per second.
Here's the technical reality: during text generation, the model produces one token per forward pass. Each pass loads the relevant weight matrices from VRAM into the compute cores. The bottleneck isn't math — it's how fast VRAM can stream those weights. That's decode speed.
Think of it like a printer's page-per-minute rating. You care about the output rate you actually see, not the theoretical maximum processing speed of the print head. TFLOPS is the print head spec. Bandwidth is the page-per-minute.
Why Decode Speed Defines the Feel of Local AI
This is the metric that determines whether your local LLM feels like ChatGPT or like a typewriter.
At 50+ tok/s, responses stream faster than you can comfortably read — it's indistinguishable from a cloud API in feel. At 8-12 tok/s, you're watching words appear one at a time, and the rhythm breaks every conversation.
Human reading speed is roughly 200-250 words per minute, which works out to 3-4 words per second, or about 4-5 tok/s. Technically, anything above roughly 5 tok/s keeps pace with your reading. But the visual rhythm below 20 tok/s still feels slow, even when you're not being outpaced. There's a psychological component: the moment you're waiting on the model, the flow breaks.
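The arithmetic behind those thresholds, as a quick sanity check (using the 0.75 words-per-token ratio from above):

```python
# Convert human reading speed into the tok/s a model must sustain.
words_per_min = 225            # midpoint of the 200-250 wpm range
words_per_sec = words_per_min / 60
tokens_per_word = 1 / 0.75     # 1 token ≈ 0.75 words
toks_per_sec = words_per_sec * tokens_per_word
print(round(toks_per_sec, 1))  # → 5.0
```

Anything above that rate means the text arrives faster than you can read it; the 20 tok/s comfort threshold is about perceived rhythm, not raw reading speed.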
Practical Tok/s Thresholds and User Experience
| Tok/s | Experience |
|---|---|
| 80+ | Responses appear instantly — indistinguishable from ChatGPT in feel |
| 50-80 | Fast streaming, comfortable for daily driver use |
| 20-50 | Acceptable; slight lag but not disruptive |
| 10-20 | Mildly frustrating for long responses; fine for quick Q&A |
| <10 | Noticeably slow; model may be partially CPU-offloaded |
Decode speed is the first thing to benchmark after getting a model running. It tells you immediately whether your hardware matches your model choice.
How GPU Memory Bandwidth Determines Decode Speed
Each decode step loads all relevant model weights from VRAM into the compute cores. A 7B model at Q4 quantization requires reading roughly 4.5 GB of data per token generated.
The math is straightforward:
- RTX 4090 (1,008 GB/s bandwidth): can read 4.5 GB weights ~224 times per second → theoretical ceiling ~224 tok/s for a 7B Q4 model
- RTX 4070 (504 GB/s bandwidth): ~112 tok/s theoretical ceiling (actual: 55-65 tok/s with overhead)
- RTX 3060 (360 GB/s bandwidth): ~80 tok/s theoretical ceiling (actual: 40-50 tok/s)
The gap between theoretical and actual comes from overhead: KV cache reads, attention operations, and framework-level costs. But the relationship is linear — double the bandwidth, double the decode speed (roughly).
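That bandwidth-over-bytes relationship is simple enough to sketch directly. A minimal calculation using the 4.5 GB figure for a 7B Q4 model from above (real-world numbers land at roughly half these ceilings):

```python
# Theoretical decode ceiling: tok/s = bandwidth / weight bytes read per token.
weights_gb = 4.5  # 7B model at Q4 quantization

gpus = {"RTX 4090": 1008, "RTX 4070": 504, "RTX 3060": 360}  # GB/s
for name, bw in gpus.items():
    print(f"{name}: ~{bw / weights_gb:.0f} tok/s ceiling")
# RTX 4090: ~224 tok/s ceiling
# RTX 4070: ~112 tok/s ceiling
# RTX 3060: ~80 tok/s ceiling
```

Swap in your own card's bandwidth and model size to get a quick upper bound before benchmarking.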
How Decode Speed Works — Prefill vs Decode Explained
Inference has two distinct phases with completely different performance profiles. Most people don't know this, and it causes a lot of confusion.
Prefill is compute-bound. The GPU processes all your input tokens in parallel — this is where TFLOPS actually matters.
Decode is memory-bandwidth-bound. The GPU generates one token at a time, loading weights each step — VRAM bandwidth determines everything here. TFLOPS are nearly irrelevant.
Most conversations are more than 90% decode time. This is why memory bandwidth is the spec that matters for real-world feel.
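One way to see why decode dominates: estimate both phases for a typical chat turn. The ~4,000 tok/s prefill throughput below is a ballpark assumption for a mid-range card, not a measured figure:

```python
# Split a chat turn into its two phases and compare where the time goes.
prompt_tokens = 1000     # input (prefill: parallel, compute-bound)
output_tokens = 500      # output (decode: serial, bandwidth-bound)
prefill_tps = 4000       # assumed prefill throughput (tok/s)
decode_tps = 50          # assumed decode speed (tok/s)

prefill_s = prompt_tokens / prefill_tps   # 0.25 s
decode_s = output_tokens / decode_tps     # 10.0 s
decode_share = decode_s / (prefill_s + decode_s)
print(f"decode is {decode_share:.0%} of total time")  # → decode is 98% of total time
```

Even with a long prompt and a fast prefill path, nearly all of the wall-clock time is decode.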
Prefill Speed (Prompt Processing)
Prefill processes all your input tokens in parallel, so a 1,000-token prompt doesn't take 1,000 times longer than a 1-token prompt. On an RTX 4070, a 1K-token prompt takes about 0.3 seconds to process; a 4K-token prompt takes about 0.8 seconds.
This is what "time to first token" (TTFT) measures — the delay before the model starts responding. Prefill is compute-bound, so higher TFLOPS helps here. But prefill is a small fraction of total response time for most use cases.
Decode Speed (Token Generation)
Decode is serial and autoregressive: each token depends on the previous one, so generation cannot be parallelized across output tokens. Loading the weights is the bottleneck, not the math.
This is why an RTX 4090 decodes roughly 2× faster than an RTX 4070 despite having ~3× the compute. The bandwidth ratio between them is about 2×, not 3×. The compute advantage is nearly invisible.
Quantization helps decode speed directly: a Q4 model loads a quarter of the data per step compared with FP16, which translates to roughly 4× faster decode on the same GPU. This is why running Q4_K_M instead of FP16 is almost always the right call on consumer hardware.
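The data-per-token arithmetic behind that claim, treating each format as exactly its nominal bits per weight (Q4_K_M actually averages slightly more than 4 bits, so real files run a bit larger):

```python
# Approximate weight bytes streamed per decode step for a 7B model.
params_billion = 7
for fmt, bits in [("FP16", 16), ("Q8_0", 8), ("Q4", 4)]:
    gb_per_token = params_billion * bits / 8  # GB read each step
    print(f"{fmt}: ~{gb_per_token:.1f} GB per token")
# FP16: ~14.0 GB per token
# Q8_0: ~7.0 GB per token
# Q4: ~3.5 GB per token
```

Since decode speed scales inversely with bytes read per token, halving the bits roughly doubles the tok/s on the same bandwidth.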
CPU Offload and Its Decode Speed Penalty
When a model doesn't fully fit in VRAM, layers get offloaded to system RAM. System RAM bandwidth (DDR5: ~50-80 GB/s) is 6-20× slower than VRAM bandwidth. You feel it immediately.
A concrete example with a 13B Q4 model on an RTX 3060 (12 GB):
- Model fits in VRAM fully (8.3 GB): ~45 tok/s
- One layer offloaded to RAM: drops to ~30 tok/s
- Half the model in RAM: ~12 tok/s
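These numbers track a simple bandwidth-weighted model: per-token time is the time to stream each portion of the weights from its memory pool. A sketch, assuming 360 GB/s VRAM (RTX 3060) and ~60 GB/s DDR5 system RAM:

```python
# Theoretical decode ceiling for a partially CPU-offloaded model.
def decode_ceiling(model_gb, frac_in_ram, vram_gbps=360.0, ram_gbps=60.0):
    """tok/s ceiling when frac_in_ram of the weights live in system RAM."""
    s_per_token = (model_gb * (1 - frac_in_ram) / vram_gbps
                   + model_gb * frac_in_ram / ram_gbps)
    return 1 / s_per_token

# 13B Q4 (~8.3 GB) on an RTX 3060:
print(round(decode_ceiling(8.3, 0.0)))  # fully in VRAM → 43
print(round(decode_ceiling(8.3, 0.5)))  # half offloaded → 12
```

Because system RAM is so much slower, even a modest offloaded fraction dominates the per-token time, which is why the penalty is so steep.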
The rule: if a model doesn't fit in VRAM with at least 1 GB of headroom, you'll feel the penalty. Run a smaller model or higher quantization instead of fighting the offload.
"My GPU Has High TFLOPS, So It Should Be Fast" — Wrong
This is the most common misconception for first-time local LLM builders, and GPU spec sheets make it worse by leading with TFLOPS everywhere.
TFLOPS measures how many math operations per second your GPU can perform. But decode is bottlenecked by data movement, not math. An RTX 4080 with 49 TFLOPS FP32 decodes within 30% of an RTX 4090 with 82 TFLOPS — because their bandwidth ratio (716 GB/s vs 1,008 GB/s) is much closer than their compute ratio.
The RTX 4060 Ti 16 GB vs RTX 4070 12 GB comparison makes this painfully clear. The 4060 Ti has more VRAM — 16 GB vs 12 GB. But it has 288 GB/s bandwidth vs the 4070's 504 GB/s. The 4070 decodes roughly 2× faster despite less VRAM — 58 tok/s vs 28 tok/s on a 7B Q4 model. More VRAM, slower GPU for LLM work.
Integrated GPUs with shared system memory are even worse. They share the same pool of DDR5 memory (50-80 GB/s) that the CPU uses. A discrete GPU with 8 GB of GDDR6 and 250 GB/s bandwidth will crush an iGPU sharing 32 GB of system RAM.
TFLOPS is the most advertised GPU metric because it maps to gaming and workstation rendering performance, and it sounds impressive. For LLM decode, it's close to irrelevant.
Tip
When comparing GPUs for local LLM use, look up the memory bandwidth spec, not TFLOPS. TechPowerUp's GPU database lists it for virtually every card. Sort by GB/s per dollar, not TFLOPS per dollar.
Decode Speed in Practice — RTX 4070 vs RTX 3060 vs RTX 4090
Test setup: Mistral 7B Q4_K_M in Ollama 0.3.x, single user, 512-token output, 256-token prompt, 4K context window.
| GPU | Experience |
|---|---|
| RTX 3060 | Comfortable daily driver |
| RTX 4070 | Fast, streaming feel |
| RTX 4070 Ti | Excellent, essentially instant |
| RTX 4090 | Overkill for 7B, shines on 13B-70B |

The RTX 3060 at 42 tok/s is a legitimate daily driver — it clears the 20 tok/s threshold comfortably and the response rhythm feels acceptable. The RTX 4070 Ti at 74 tok/s is where it starts to feel genuinely fast rather than merely acceptable.
The RTX 4090 is overkill for a 7B model, but it earns its place on larger models. One caveat: Llama 3.1 70B at Q4_K_M needs roughly 40 GB of weights, well beyond the 4090's 24 GB of VRAM, so running it on a single card means heavy CPU offload and single-digit tok/s. Where the 4090 shines is the 13B-34B class of quantized models that fit entirely in VRAM, with headroom left for context and even a second simultaneous conversation.
Note
These benchmarks reflect single-user decode with default Ollama settings. Actual numbers vary based on context length, batch size, and system configuration. The relative ordering between GPUs is consistent — the absolute numbers may shift 5-15% depending on your setup.
How to check your own numbers: Run ollama run mistral --verbose and look at the eval rate in the output. That's your tok/s. Compare it to the table above to see where you land.
Warning
If your tok/s is dramatically lower than expected for your GPU, check for CPU offloading first. Run ollama run mistral --verbose and look for offloaded layers in the output. Any non-zero number there will tank your decode speed.
Related Concepts for Local AI Builders
Understanding decode speed fully means understanding the surrounding concepts:
Prefill speed — the other half of inference. Prefill processes your input prompt before decode begins and is compute-bound rather than bandwidth-bound.
VRAM bandwidth — the single spec that directly determines your decode speed. Every hardware comparison for local LLM should start here.
Quantization — lowers model data size, which directly improves decode speed in proportion. Q4 is roughly 4× faster than FP16 at the same model size.
Batch size — increasing batch size improves total throughput for multi-user setups but cuts per-user decode speed. Single-user rigs should run batch=1.
For picking the right GPU tier based on bandwidth-per-dollar, see the best GPU for local LLM guide. For measuring your setup's actual tok/s, the Ollama setup guide covers the --verbose flag in detail.
The bottom line: bandwidth is the spec. Everything else is secondary.