
Tokens Per Second (t/s)

The primary speed metric for LLM inference — how many tokens the model generates each second.

Tokens per second (t/s) is the standard benchmark for local LLM speed. It measures how many tokens the model outputs during the decode phase — the part where you watch text stream onto the screen. It does not measure how fast the model processes your input prompt (that's prefill).

What the Numbers Feel Like

  • < 5 t/s — Noticeably slow. You're waiting for individual sentences.
  • 5–15 t/s — Usable but sluggish. You'll read faster than the model outputs.
  • 20–40 t/s — Comfortable. Roughly the pace of fast human reading.
  • 60–100 t/s — Fast. The model keeps up with or outpaces reading speed.
  • 100+ t/s — Very fast. Entire paragraphs appear nearly instantly.

Most people find 20 t/s to be the minimum for a smooth experience; below that, inference starts to feel like waiting rather than reading.
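The thresholds above are easy to sanity-check with simple arithmetic: divide a typical response length by the generation rate. The 500-token response length here is an illustrative assumption (roughly a few paragraphs):

```python
# How long does a 500-token answer take to stream at different speeds?
RESPONSE_TOKENS = 500  # illustrative length, ~a few paragraphs

for tps in (5, 15, 20, 40, 100):
    seconds = RESPONSE_TOKENS / tps
    print(f"{tps:>3} t/s -> {seconds:5.1f} s")
# At 5 t/s that's 100 s of waiting; at 20 t/s, 25 s; at 100 t/s, 5 s.
```

The jump from 5 t/s to 20 t/s cuts a 100-second wait to 25 seconds, which is why that range dominates the perceived difference.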

Real Benchmark Numbers

On a 7B Q4_K_M model (one of the most common local setups):

  • RTX 4090: ~127 t/s
  • RTX 3090: ~90 t/s
  • RTX 4080 Super: ~90 t/s
  • RTX 4070 Ti Super: ~80 t/s
  • RTX 4070: ~55 t/s
  • M4 Max (64GB): ~65–75 t/s
  • M4 Pro (24GB): ~35–45 t/s

For larger models, speed drops sharply. A 70B Q4_K_M weighs in around 40GB, which doesn't fit in the RTX 4090's 24GB of VRAM — with layers spilling into system RAM, the same card manages only around 12–15 t/s. Keeping a model that size fully in VRAM requires a multi-GPU setup or a larger-VRAM card.

What Determines t/s

For the decode phase, memory bandwidth is the dominant factor. The GPU must read the full set of model weights for every token generated. Higher bandwidth = faster reads = more tokens per second.

Compute (CUDA cores, shader units) matters more during prefill and for batched inference with multiple concurrent users. For a single-user local setup, bandwidth rules.
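Because decode reads the full weight set once per token, you can put a hard ceiling on t/s: bandwidth divided by model size. A sketch of that roofline estimate — the bandwidth and file-size figures below are approximate assumptions, not measured specs from this article:

```python
def max_decode_tps(bandwidth_gb_s, model_size_gb):
    """Bandwidth roofline: the GPU reads all weights once per generated
    token, so t/s cannot exceed bandwidth / model size. Real throughput
    lands well below this ceiling (kernel launch overhead, KV-cache
    reads, activation traffic, etc.)."""
    return bandwidth_gb_s / model_size_gb

# Assumed figures: RTX 4090 ~1008 GB/s memory bandwidth;
# a 7B Q4_K_M GGUF file is roughly 4.4 GB.
ceiling = max_decode_tps(1008, 4.4)
print(f"Theoretical ceiling: ~{ceiling:.0f} t/s")
# The measured ~127 t/s is a bit over half this roofline,
# which is a typical fraction for single-user decode.
```

This is why quantization speeds up decode even when compute is unchanged: a smaller weight file means fewer bytes to stream per token.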

Why It Matters for Local AI

t/s is the number to optimize for in a local rig. When comparing GPU options, look at memory bandwidth as a proxy for t/s on your target model size: a card with 2x the bandwidth will produce roughly 2x the tokens per second on the same model, provided the model fits entirely in VRAM on both cards.
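The bandwidth-as-proxy comparison is a one-line ratio. The spec-sheet bandwidths below are assumed illustrative values:

```python
def estimated_speedup(bandwidth_a_gb_s, bandwidth_b_gb_s):
    """Rough t/s ratio between two cards running the same model,
    using memory bandwidth as a proxy. Only meaningful when decode
    is memory-bound and the model fits in VRAM on both cards."""
    return bandwidth_a_gb_s / bandwidth_b_gb_s

# Assumed spec-sheet figures: RTX 4090 ~1008 GB/s, RTX 4070 ~504 GB/s.
print(f"~{estimated_speedup(1008, 504):.1f}x")  # ~2.0x expected
```

That ~2x prediction lines up reasonably with the measured 7B numbers above (~127 t/s vs ~55 t/s), with the gap beyond 2x attributable to compute and architectural differences.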