What is the most important benchmark metric for local LLM inference?

Token generation speed (TG) in tokens per second — the rate at which the model outputs tokens after it starts generating. This is what you feel as the response streams out. Prompt processing speed (PP) matters secondarily for long-document tasks. Both are reported by llama-bench.

How do I run a benchmark on my local LLM setup?

Use llama.cpp's built-in tool: run `./llama-bench -m path/to/model.gguf`. By default it performs a prompt processing test (512 tokens) and a token generation test (128 tokens) and reports both speeds. Each test is repeated several times and reported as an average with a standard deviation. If the spread is large, rerun with more repetitions, since cold-start effects and background load can skew individual runs.

What token speeds should I expect on a typical consumer GPU?

On a 24GB GPU running a 7B model at Q4_K_M, expect roughly 120–150 tokens/second on an RTX 4090 and 85–110 on an RTX 3090. On a 13B model, roughly 80–100 T/s on the 4090 and 55–75 on the 3090. A 70B Q4 model does not fit in 24GB, so it has to be split across two GPUs or partly offloaded to CPU, at approximately 10–25 T/s. These are decode rates: actual generation speed, not prompt processing.

My benchmark is slower than expected — what should I check first?

Check that GPU layers are actually loading to the GPU, not silently falling back to CPU. Run nvidia-smi or rocm-smi while inference runs and confirm GPU utilization is high (80%+). Second, verify the quantization — Q8_0 uses roughly double the VRAM of Q4_K_M and runs slower on VRAM-constrained setups. Third, confirm your context size (-c flag) isn't set unreasonably high, which inflates KV cache overhead.

How to Benchmark Your Local LLM Setup: Tokens/Sec and Beyond


TL;DR: Tokens per second is the number that matters for local LLM performance. Run llama.cpp's built-in bench tool, compare against the baselines in this guide, and if you're consistently below expected numbers, the cause is almost always a driver or configuration issue, not bad hardware.

Most people who build a local LLM rig never actually measure what it's doing. They notice it feels fast or slow, but they don't have a baseline to compare against. This guide gives you a simple way to measure performance, understand what the numbers mean, and diagnose when something is underperforming.

The Primary Metric: Tokens Per Second

Tokens per second (T/s or tok/s) measures how quickly a model generates output tokens during inference. When you ask a model a question and watch the words stream out, each word fragment is roughly a token. Higher T/s means faster output, shorter wait times, better usability.

There are actually two separate measurements worth knowing:

Prompt processing speed (PP or "prefill"): How fast the model processes your input prompt before it starts generating. Measured in tokens/sec for the input. This matters more when you're sending long documents or code for analysis.

Token generation speed (TG): How fast the model generates each output token. This is the number you feel in real time, and the one that matters most for day-to-day use.

When benchmarking, you'll see both. Use TG to judge how responsive the model feels while it streams, and PP to judge how long a long prompt takes to ingest before the first token appears.
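To see how the two speeds combine, a rough estimate of end-to-end response time is prompt tokens divided by PP plus output tokens divided by TG. The helper below is a minimal sketch of that arithmetic. The function name `response_seconds` and the sample figures are illustrative assumptions, not measurements.

```shell
# Hypothetical helper: estimate total response time in seconds.
# Args: prompt_tokens pp_speed(T/s) output_tokens tg_speed(T/s)
response_seconds() {
  awk -v p="$1" -v pp="$2" -v n="$3" -v tg="$4" \
    'BEGIN { printf "%.1f\n", p / pp + n / tg }'
}

# Example: 4000-token prompt at 2000 T/s prefill, 500-token reply at 100 T/s.
response_seconds 4000 2000 500 100   # prints 7.0
```

With these sample numbers, generation accounts for 5 of the 7 seconds, which is why TG dominates for short prompts and PP only starts to matter as prompts get long.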

Running llama.cpp Bench

llama.cpp includes a built-in benchmarking tool: llama-bench. This is the right way to measure your system.

Basic benchmark command:

./llama-bench -m path/to/your/model.gguf

This runs a standard test with:

Prompt processing test (512 token prompt by default)
Token generation test (128 tokens by default)
Reports both PP and TG in tokens/second

More comprehensive benchmark:

./llama-bench -m path/to/model.gguf -p 512,1024,2048 -n 128,256 -ngl 99

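This tests multiple prompt lengths and generation lengths. The -ngl 99 flag asks llama.cpp to offload up to 99 layers, which covers every layer of most models, so the whole model runs on the GPU as long as it fits in VRAM (essential for accurate GPU benchmarks).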
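For results you want to keep and compare later, it can help to raise the repetition count and write machine-readable output. The wrapper below is a sketch that assumes llama-bench's `-r` (repetitions) and `-o json` (output format) options. The function name `bench_json` and the `LLAMA_BENCH` override variable are hypothetical conveniences, not part of llama.cpp.

```shell
# Hypothetical wrapper around llama-bench.
# Args: model_path [repetitions, default 5]
# LLAMA_BENCH can point at the binary if it is not in the current directory.
bench_json() {
  model="$1"
  reps="${2:-5}"
  "${LLAMA_BENCH:-./llama-bench}" -m "$model" -ngl 99 -p 512 -n 128 -r "$reps" -o json
}
```

A typical call would be `bench_json path/to/model.gguf 10 > baseline.json`, which saves a ten-repetition run to a file you can diff against later.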

Running with specific GPU layers: If you want to compare full GPU inference vs CPU offload:

# Full GPU
./llama-bench -m model.gguf -ngl 99

# CPU only (no GPU offload)
./llama-bench -m model.gguf -ngl 0

The difference between -ngl 0 and -ngl 99 shows you exactly how much your GPU is helping.
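Once you have a TG figure from each run, the comparison is a simple ratio. The helper below is a small sketch. The name `gpu_speedup` and the sample speeds are illustrative assumptions.

```shell
# Hypothetical helper: how many times faster the GPU run is than the CPU run.
# Args: gpu_tg(T/s) cpu_tg(T/s)
gpu_speedup() {
  awk -v gpu="$1" -v cpu="$2" 'BEGIN { printf "%.1fx\n", gpu / cpu }'
}

# Example: 120 T/s with -ngl 99 versus 10 T/s with -ngl 0.
gpu_speedup 120 10   # prints 12.0x
```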

What Good Performance Looks Like

These are approximate baselines based on common hardware configurations running Q4_K_M quantized models. Your results will vary based on exact model, quantization, context length, and system configuration.

RTX 4090 (24GB):

7B model: ~120–150 T/s generation
13B model: ~80–100 T/s generation
34B model: ~35–50 T/s generation
70B Q4 (split to CPU or dual GPU): ~15–25 T/s generation

RTX 3090 (24GB):

7B model: ~85–110 T/s generation
13B model: ~55–75 T/s generation
34B model: ~25–40 T/s generation
70B Q4 (split): ~10–18 T/s generation

RTX 3080 (10GB):

7B model: ~65–85 T/s generation
13B model (partial offload): ~25–40 T/s generation (CPU handles overflow layers)

Apple M3 Pro (18GB unified memory, ~150 GB/s):

7B model: ~25–35 T/s generation
13B model: ~12–18 T/s generation

CPU-only (Ryzen 9 7950X):

7B model: ~8–15 T/s generation
13B model: ~4–8 T/s generation

The GPU vs CPU gap is enormous. This is why VRAM matters so much — you need the model to fit entirely on the GPU to get usable generation speeds.

GPU vs CPU Inference: The Real Comparison

GPU inference isn't just faster — it's a completely different class of speed. The reason is memory bandwidth.

RTX 4090 VRAM bandwidth: ~1,008 GB/s
RTX 3090 VRAM bandwidth: ~936 GB/s
Ryzen 9 7950X system RAM bandwidth: ~77 GB/s in practice (dual-channel DDR5-6000, which has a theoretical peak of 96 GB/s)

Token generation is fundamentally memory-bandwidth bound. On the figures above, the GPU has roughly 12–13x the memory bandwidth of even fast system RAM. Every time the model generates a token, it reads through the entire set of model weights, so faster memory bandwidth means faster token generation.

This is why adding more CPU cores doesn't help much with local LLM speed. The bottleneck is memory bandwidth, not compute cores.
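The bandwidth argument gives a quick sanity check: if every token requires reading all the weights once, generation speed for a dense model cannot exceed memory bandwidth divided by model size. The sketch below applies that to the bandwidth figures listed earlier. The function name `tps_ceiling` is hypothetical, and the ~4.1 GB size for a 7B Q4_K_M file is an assumption.

```shell
# Rough upper bound on generation speed for a dense model:
# every token reads all weights once, so T/s <= bandwidth / model size.
# Args: bandwidth_GB_per_s model_size_GB
tps_ceiling() {
  awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.0f\n", bw / size }'
}

# Assumed model size: ~4.1 GB for a 7B model at Q4_K_M.
tps_ceiling 1008 4.1   # RTX 4090: prints 246
tps_ceiling 77 4.1     # Ryzen 9 7950X system RAM: prints 19
```

Measured speeds land below these ceilings because of compute, KV cache reads, and other overhead. A benchmark result above the ceiling usually means the assumed size or bandwidth is wrong.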

Common Causes of Underperformance

If your numbers are significantly below the baselines above:

Not enough GPU layers (-ngl not set correctly): If you're not passing -ngl 99 (or a high enough number), some or all layers are being handled by the CPU. Run with verbose output to see layer allocation: add -v to your llama.cpp command and look for a line such as "offloaded X/Y layers to GPU" in the model load output. X should equal Y for full GPU inference.

Wrong GPU being used: If you have an integrated GPU and a discrete GPU, llama.cpp might be using the wrong one. Check with nvidia-smi that your discrete GPU is showing activity during inference.

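Thermal throttling: If the GPU overheats, it reduces its clock speed to protect itself. Check GPU temperature during inference. NVIDIA cards typically start lowering clocks once the core reaches the mid-80s °C, and memory junction (hotspot) readings run hotter than the core, with throttling on GDDR6X cards at roughly 100–110°C. Exact limits vary by card. Monitor with nvidia-smi dmon or GPU-Z.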
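The first three checks (layers on the GPU, the right GPU, and temperature) can be watched in one place while a benchmark runs in another terminal. The wrapper below is a sketch built on nvidia-smi's `--query-gpu` interface. The function name `gpu_watch` and the `NVIDIA_SMI` override variable are hypothetical.

```shell
# Hypothetical wrapper: print per-GPU utilization, VRAM use, temperature,
# and SM clock as CSV, refreshing every N seconds (default 1).
gpu_watch() {
  "${NVIDIA_SMI:-nvidia-smi}" \
    --query-gpu=index,name,utilization.gpu,memory.used,temperature.gpu,clocks.sm \
    --format=csv -l "${1:-1}"
}
```

Run `gpu_watch 2` during inference: the discrete GPU's row should show high utilization and VRAM use close to the model size, and an SM clock that falls as temperature rises points to throttling.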

Driver issues: Outdated CUDA drivers can cause significant performance penalties. Keep NVIDIA drivers current. For llama.cpp specifically, CUDA 12.x performs better than 11.x in most cases.

Wrong quantization for your hardware: Q8_0 quantization is higher quality but uses more VRAM and bandwidth. Q4_K_M is the standard balance. If a Q8 model barely fits in VRAM, it may run slower than a Q4 version: it reads nearly twice as many bytes per token, and it leaves little room for the KV cache, which can force layers onto the CPU.

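Context length too high: Long contexts slow generation in two ways. Each new token attends over everything already in the KV cache, so per-token cost grows with the number of tokens in context (and prompt processing cost grows roughly quadratically with prompt length). The KV cache itself also takes VRAM, so a large context setting can push model layers off the GPU. Generating at 32k tokens of context is noticeably slower than at 4k. If you're seeing slow generation, check your context length setting.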
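Part of the cost of a large context is memory: the KV cache grows linearly with context length. A rough size estimate is 2 (keys and values) × layers × KV heads × head dimension × context tokens × bytes per element. The sketch below uses that formula. The function name `kv_cache_gib` is hypothetical, and the example shape (32 layers, 32 KV heads, head dimension 128, 16-bit cache) is an assumption matching a Llama-2-7B-style model without grouped-query attention.

```shell
# Approximate KV cache size in GiB:
# 2 (K and V) x layers x KV heads x head dim x context tokens x bytes per element.
# Args: n_layers n_kv_heads head_dim n_ctx bytes_per_element
kv_cache_gib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%.1f\n", 2 * l * h * d * c * b / 1073741824 }'
}

# Assumed shape: 32 layers, 32 KV heads, head dim 128, f16 cache (2 bytes).
kv_cache_gib 32 32 128 4096 2    # prints 2.0
kv_cache_gib 32 32 128 32768 2   # prints 16.0
```

On a 24GB card, 16 GiB of cache on top of a ~4 GB model leaves very little headroom, which is one way a 32k setting ends up spilling work to the CPU. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.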

The Benchmark Comparison That Actually Matters

The most useful benchmark isn't against published numbers — it's before and after you make a change.

Before upgrading RAM: run llama-bench. After upgrading: run again. Before installing new drivers: benchmark. After: benchmark.

This gives you a real-world measurement of whether changes actually helped on your specific system with your specific workload. Don't chase someone else's numbers — understand your own hardware.
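A before-and-after comparison comes down to a percent change in TG. The helper below is a minimal sketch. The name `pct_change` and the sample speeds are illustrative assumptions.

```shell
# Hypothetical helper: percent change in tokens/sec between two runs.
# Args: before(T/s) after(T/s)
pct_change() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%+.1f%%\n", (after - before) * 100 / before }'
}

pct_change 100 112   # prints +12.0%
pct_change 100 95    # prints -5.0%
```

Treat small differences with caution: if the change is no larger than the run-to-run spread llama-bench reports, it is probably noise rather than a real improvement.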