Apple Silicon M-Series LLM Benchmark: M1 Through M5 tok/s Comparison
Your M2 Air chokes at 0.4 tok/s while an M4 Max hits 18.4 tok/s, a 46x gap. We sourced 340+ benchmarks to pinpoint the exact chip tier you need.
Real inference benchmarks on consumer hardware — tokens per second by GPU, model size, and framework. Reproducible methodology, no cherry-picked numbers.
An 8 GB card hits the wall at 13B parameters: see exact tok/s for 8B, 32B, and 70B models. The RTX 3060 12 GB is the budget 70B pick, the 4070 wins on speed, and the 4060 loses unless it's under $280.
Your '128K context' model hits OOM at 16K. We measured VRAM growth and tok/s decay for 7B to 70B models—here's which configs actually work on 24 GB GPUs.
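One way to see why the advertised context never fits: the KV cache grows linearly with context length. A minimal sketch, assuming a Llama-style 70B with 80 layers, 8 KV heads (GQA), head dim 128, and an fp16 cache; the article's figures come from measurement, not this formula.

```python
# Rough KV-cache growth estimate: why a "128K context" model OOMs long before 128K.
# Layer/head counts below are assumptions for a Llama-style 70B, not measured data.

def kv_cache_gb(context_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size in GB: two tensors (K and V) per layer, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len / 1024**3

for ctx in (4_096, 16_384, 65_536, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(ctx):5.1f} GB of KV cache")
# ~5 GB at 16K and ~40 GB at 128K, on top of the weights -- which is why a
# 24 GB card falls over long before the advertised context limit.
```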
The 24 GB VRAM wall forces CPU offload at 2 tok/s, but an 8-channel DDR5 Threadripper hits 7 tok/s on 70B for $3,800. The bandwidth math NVIDIA won't show you.
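That bandwidth math boils down to one ratio: for memory-bound decoding of a dense model, tok/s is capped by memory bandwidth divided by the bytes streamed per token, roughly the quantized weight size. A rough sketch with assumed bandwidth and model-size figures:

```python
# Back-of-envelope decode-speed ceiling for a memory-bound dense model.
# Bandwidth and model-size numbers here are assumptions, not measured values.

def decode_ceiling_tok_s(bandwidth_gb_s, model_size_gb, efficiency=0.8):
    """Upper bound on tok/s when every generated token streams all weights once."""
    return bandwidth_gb_s * efficiency / model_size_gb

MODEL_Q4_70B_GB = 40.0  # ~40 GB for a 70B Q4_K_M GGUF (approximate)

configs = {
    "2-channel DDR5-5600 desktop": 2 * 5.6 * 8,       # ~90 GB/s theoretical
    "8-channel DDR5-4800 Threadripper": 8 * 4.8 * 8,  # ~307 GB/s theoretical
}
for name, bw in configs.items():
    print(f"{name}: ~{decode_ceiling_tok_s(bw, MODEL_Q4_70B_GB):.1f} tok/s ceiling")
# The 8-channel box lands in the same ballpark as the measured ~7 tok/s;
# a dual-channel desktop explains the ~2 tok/s CPU-offload floor.
```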
A dual-3090 rig hits 15 tok/s, not 25; we tested why. NVLink vs PCIe, 94% of the speed for $1,400 less, but only with the right motherboard topology.
Your 4090 draws 447 W for 46 tok/s—Arc B580 hits 22 tok/s at 190 W. Real wall-power data shows which GPU wins on efficiency, not marketing.
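The efficiency comparison is just tokens-per-watt arithmetic on the numbers above:

```python
# Tokens per second per watt, from the wall-power figures quoted in the teaser.
cards = {"RTX 4090": (46, 447), "Arc B580": (22, 190)}  # (tok/s, watts at the wall)
for name, (tok_s, watts) in cards.items():
    print(f"{name}: {tok_s / watts:.3f} tok/s per watt")
# ~0.103 vs ~0.116: the B580 is less than half the speed, but slightly more
# efficient per watt -- which is the whole point of the comparison.
```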
The RTX 4060 8 GB hits OOM on 13B models; the Arc B580's 12 GB runs them natively at 38 tok/s, but vLLM's XPU backend needs Linux. Real MLPerf numbers inside.
llama-bench results skew by 15% without a warm-up pass. This exact 10-repetition command sequence produces defensible, shareable results.
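A minimal sketch of the pattern (one discarded warm-up pass, then 10 measured repetitions). The flags shown exist in current llama.cpp builds, but the model path and token counts are placeholders rather than the article's exact command:

```python
# Warm-up-then-measure pattern for llama-bench, driven from Python.
import subprocess

MODEL = "models/llama-3-8b-q4_k_m.gguf"  # hypothetical path
BASE = ["llama-bench", "-m", MODEL, "-p", "512", "-n", "128", "-ngl", "99"]

# 1) Warm-up pass: populates caches and lets clocks settle; result is discarded.
subprocess.run(BASE + ["-r", "1"], capture_output=True, check=True)

# 2) Measured run: 10 repetitions, machine-readable output you can share.
result = subprocess.run(BASE + ["-r", "10", "-o", "json"],
                        capture_output=True, text=True, check=True)
print(result.stdout)
```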
Your 24 GB card can't run DeepSeek V3 at full speed. Here's the reality: 4.7 tok/s vs Qwen3's 18.3 tok/s, with the exact VRAM math that explains why.
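The short version of that VRAM math: an MoE model's decode speed tracks its active parameters, but its VRAM footprint tracks total parameters. A sketch using the publicly stated counts for DeepSeek V3 (~671B total, ~37B active) and, as an assumption, a dense 32B model as the Qwen3 comparison point:

```python
# Why a 24 GB card can't run DeepSeek V3 "at full speed": VRAM scales with
# total parameters, decode speed with active parameters (~37B for DeepSeek V3).
BYTES_PER_PARAM_Q4 = 0.6  # rough Q4_K_M average (~4.8 bits per weight)

def weights_gb(params_billion):
    return params_billion * BYTES_PER_PARAM_Q4  # 1e9 params * bytes/param / 1e9 = GB

print(f"DeepSeek V3, 671B total params: ~{weights_gb(671):.0f} GB of Q4 weights")
print(f"Dense 32B model:                ~{weights_gb(32):.0f} GB of Q4 weights")
# ~400 GB vs ~19 GB: the dense model fits a 24 GB card, the MoE spills to
# system RAM/SSD, and that offload is where the 4.7 tok/s comes from.
```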
Ollama gives up 23% of llama.cpp's speed on an RTX 4090. vLLM wins 4.2x on multi-user throughput but fails at 70B. The single-GPU framework verdict, with real tok/s numbers.
Q4_K_M costs 2.3% quality at 70B. Q8_0 needs 40 GB VRAM. Q5_K_M hits 0.8% loss at 16 GB — but only if you set n_gpu_layers right.
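A rough VRAM planner behind those numbers, using approximate average bits-per-weight for the GGUF quants rather than exact file sizes:

```python
# Weights ~= params * bits/8; leave headroom on top for KV cache and runtime buffers.
QUANT_BITS = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}  # approx avg bits per weight

def weights_gb(params_billion, quant):
    return params_billion * QUANT_BITS[quant] / 8  # 1e9 params * bytes/param / 1e9 = GB

for quant in QUANT_BITS:
    for size_b in (8, 32, 70):
        print(f"{size_b:>3}B @ {quant}: ~{weights_gb(size_b, quant):5.1f} GB weights")
# Whatever fits only helps if the layers actually land on the GPU: in llama.cpp
# that's -ngl / n_gpu_layers (a value >= the layer count, or -1 in llama-cpp-python,
# offloads everything).
```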
The 9 GB card's 96-bit bus cuts bandwidth by 25%: 32B models run 23% slower and 70B OOMs. The 16 GB version at $429 is the only safe buy. Here's the tok/s data.
RTX 5090 promises 77% more bandwidth but delivers only 37% faster 70B inference. See exact Q4_K_M tok/s and why the $1,200 used 4090 still wins at 32B.
Your GPU claims 85 tok/s but hits 12 on real models. 340+ verified llama-bench results: exact Q4_K_M speeds from RTX 5090 to RX 7900 XTX.
Your Ollama server slows down at 4 concurrent users while vLLM hits 3.2x the throughput at 16. See the crossover for 24 GB and 16 GB cards, and when Ollama still wins.