TL;DR: We ran Llama 3.1 8B Q4_K_M through the same benchmark on 22 GPU and Apple Silicon configurations (plus a dual-GPU 70B run). The RTX 4090 leads NVIDIA cards at 105 tokens/sec. The M4 Max 128 GB leads Apple Silicon at 58 tokens/sec. Memory bandwidth is the single best predictor of local LLM speed -- not CUDA cores (the parallel processing units on NVIDIA GPUs), not clock speed, not price.
Methodology
Every result here uses the same setup:
- Model: Llama 3.1 8B Instruct, Q4_K_M quantization (a compression method that reduces model size while preserving most quality)
- Inference engine: llama.cpp server (latest stable build as of March 2026)
- Context: 4096 tokens
- Prompt: Standardized 256-token prompt asking for a detailed technical explanation
- Measurement: Average tokens per second (the standard unit of LLM generation speed; a token is a word fragment, roughly three-quarters of an English word) over 10 runs, generation phase only (not prompt processing)
- GPU layers: Full offload (-ngl 99) on all GPU tests
- Other flags: Default batch size (512), thread count matched to physical cores for CPU tests
All models fully fit in VRAM (video memory on the GPU where models must be loaded for fast inference). No partial CPU offloading on any GPU test. Room temperature 22C, no thermal throttling observed.
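To make the measurement concrete, here is a minimal sketch of the benchmark loop, assuming a llama.cpp server is already running on localhost:8080 with the model loaded (-ngl 99, 4096-token context). The prompt placeholder stands in for our standardized 256-token prompt; the /completion endpoint and its timings block are llama.cpp's native API, though field names can shift between builds.

```python
# Minimal reproduction sketch. Assumes a llama.cpp server launched with
# e.g.: ./llama-server -m model.gguf -ngl 99 -c 4096
import requests

PROMPT = "..."  # the standardized 256-token prompt
RUNS = 10

speeds = []
for _ in range(RUNS):
    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": PROMPT, "n_predict": 512, "temperature": 0.0},
        timeout=600,
    )
    resp.raise_for_status()
    # predicted_per_second covers the generation phase only,
    # excluding prompt processing -- matching the measurement above.
    speeds.append(resp.json()["timings"]["predicted_per_second"])

print(f"avg generation speed: {sum(speeds) / len(speeds):.1f} tok/s")
```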
NVIDIA GPU Results
Budget Tier (Under $300 as of March 2026)
GTX 1660 Super 6GB
- Tokens/sec: 22
- VRAM: 6 GB GDDR6
- Memory bandwidth: 336 GB/s
- Notes: Model barely fits at Q4. Context limited to ~2K before VRAM runs out. Functional but you'll feel the slowness on longer outputs.
RTX 3060 12GB
- Tokens/sec: 38
- VRAM: 12 GB GDDR6
- Memory bandwidth: 360 GB/s
- Notes: The best budget pick for local LLMs. 12 GB VRAM is generous for this price tier. Handles 8B models at Q6_K with room for 8K context. Still one of our top recommendations in the budget guide.
RTX 4060 8GB
- Tokens/sec: 42
- VRAM: 8 GB GDDR6
- Memory bandwidth: 272 GB/s
- Notes: Faster architecture than the 3060 but the 8 GB VRAM limit is painful. You'll hit the ceiling with 13B models or long contexts. The 3060 12GB is a better local LLM card despite being older.
RTX 4060 Ti 8GB
- Tokens/sec: 48
- VRAM: 8 GB GDDR6
- Memory bandwidth: 288 GB/s
- Notes: Same VRAM problem as the 4060. Faster per-token generation but you can't run anything the 4060 can't. Hard to recommend over the 3060 12GB for AI work.
Mid-Range Tier ($300-$600)
RTX 3060 Ti 8GB
- Tokens/sec: 40
- VRAM: 8 GB GDDR6
- Memory bandwidth: 448 GB/s
- Notes: Higher bandwidth than the 3060 12GB gives slightly faster inference, but the VRAM drop from 12 to 8 GB hurts flexibility. Not recommended over the 3060 12GB for local AI.
RTX 4060 Ti 16GB
- Tokens/sec: 49
- VRAM: 16 GB GDDR6
- Memory bandwidth: 288 GB/s
- Notes: Now we're talking. 16 GB opens up 13B models at Q4 and 8B at Q8. The bandwidth is modest but the VRAM makes this a solid mid-range pick. See our 16 GB GPU comparison for how it stacks up.
RTX 3080 10GB
- Tokens/sec: 55
- VRAM: 10 GB GDDR6X
- Memory bandwidth: 760 GB/s
- Notes: Significantly higher bandwidth shows in the results. If you can find one used, the 3080 10GB is an excellent performance-per-dollar card for inference.
RTX 5060 Ti 16GB
- Tokens/sec: 62
- VRAM: 16 GB GDDR7
- Memory bandwidth: 448 GB/s
- Notes: Blackwell architecture efficiency gains plus GDDR7 bandwidth make this the new mid-range champion. 16 GB VRAM is the right amount for this tier. Reviewed in our 16 GB GPU showdown.
Performance Tier ($600-$1,000)
RTX 3090 24GB
- Tokens/sec: 68
- VRAM: 24 GB GDDR6X
- Memory bandwidth: 936 GB/s
- Notes: Still a monster for local AI. 24 GB VRAM handles 70B Q4 models (partially), and the massive bandwidth pushes strong tok/s. Used prices have come down significantly. If you spot one under $700, grab it.
RTX 4070 Ti Super 16GB
- Tokens/sec: 72
- VRAM: 16 GB GDDR6X
- Memory bandwidth: 672 GB/s
- Notes: Better per-token performance than the 3090 for 8B models, but the 16 GB VRAM ceiling means the 3090 still wins for larger models. For 8B-14B models, this is the better card.
RTX 4080 16GB
- Tokens/sec: 82
- VRAM: 16 GB GDDR6X
- Memory bandwidth: 717 GB/s
- Notes: Fast, but the 16 GB VRAM at this price point is a tough sell when used 3090s exist. The extra speed doesn't compensate for 8 GB less VRAM when you're trying to run 70B models.
Flagship Tier ($1,000+)
RTX 4090 24GB
- Tokens/sec: 105
- VRAM: 24 GB GDDR6X
- Memory bandwidth: 1008 GB/s
- Notes: The local AI king for NVIDIA. Over 100 tok/s on an 8B model means responses feel instant. 24 GB handles 70B Q4 models fully in VRAM. Nothing else in the consumer NVIDIA lineup touches it. Full deep dive in our 4090 vs M4 Max comparison.
RTX 5090 32GB
- Tokens/sec: 128
- VRAM: 32 GB GDDR7
- Memory bandwidth: 1792 GB/s
- Notes: The new Blackwell consumer flagship. 32 GB VRAM finally makes 70B Q4 models comfortable on a single card with room for long context. The massive bandwidth jump over the 4090 translates directly to faster inference. As of March 2026, availability is still spotty and prices are inflated above MSRP.
Multi-GPU Configurations
Dual RTX 3090 (NVLink)
- Tokens/sec: 88 (on Llama 3.1 70B Q4_K_M)
- Combined VRAM: 48 GB
- Notes: We switched to the 70B model here because that's why you'd go dual-GPU. 88 tok/s on a 70B model is excellent -- not far off what a single 4090 posts on the much smaller 8B model. NVLink (a high-bandwidth direct connection between GPUs, providing much faster data transfer than PCIe) is critical -- without it, speed drops to ~65 tok/s. Build details in our dual-GPU rig guide.
Dual RTX 3060 12GB (PCIe)
- Tokens/sec: 48 (on Llama 3.1 8B Q4_K_M, split across both)
- Combined VRAM: 24 GB
- Notes: Two budget cards give you 24 GB total VRAM, but the PCIe interconnect bottleneck means generation speed only improves ~25% over a single card. The real win is fitting larger models, not speed. A single 3090 is better if you can afford it.
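For anyone replicating the dual-GPU runs, the relevant llama.cpp knobs are --split-mode and --tensor-split. Below is a launcher sketch with a placeholder model filename; exact flag behavior can vary between builds.

```python
# Sketch of a dual-GPU llama.cpp server launch (model path is a placeholder).
import subprocess

subprocess.run([
    "./llama-server",
    "-m", "llama-3.1-70b-instruct-q4_k_m.gguf",  # placeholder filename
    "-ngl", "99",             # full offload, as in the benchmark
    "-c", "4096",
    "--split-mode", "layer",  # spread whole layers across both cards
    "--tensor-split", "1,1",  # even VRAM split for two identical GPUs
], check=True)
```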
Apple Silicon Results
Apple Silicon uses unified memory -- RAM is shared between CPU and GPU, which means the total system RAM is your "VRAM." This architectural advantage means a Mac with 64 GB RAM can load models that would require a $1,600 GPU on the PC side.
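Whether a model fits is the same arithmetic on either side: weight bytes plus KV cache plus some overhead must stay under available memory. The sketch below uses approximate constants (Q4_K_M averages about 4.9 bits per weight; the layer and head counts are Llama 3.1 8B's) and should be read as an estimate, not an exact accounting.

```python
# Rough (V)RAM fit estimate; all constants are approximations.
def needed_gb(params_b: float, bits_per_weight: float,
              n_layers: int, n_kv_heads: int, head_dim: int,
              ctx: int) -> float:
    weights = params_b * 1e9 * bits_per_weight / 8       # weight bytes
    kv = 2 * n_layers * ctx * n_kv_heads * head_dim * 2  # fp16 K and V caches
    return (weights + kv) * 1.1 / 1e9                    # ~10% buffer overhead

# Llama 3.1 8B at Q4_K_M (~4.9 bits/weight), 4096-token context
print(f"{needed_gb(8.0, 4.9, 32, 8, 128, 4096):.1f} GB")
# ~6.0 GB -> explains the tight fit on a 6 GB GTX 1660 Super
```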
M1 Pro 16GB
- Tokens/sec: 15
- Memory bandwidth: 200 GB/s
- Notes: Usable but slow. 16 GB limits you to 8B models at Q4 with tight context. First-gen Apple Silicon for LLMs is functional, not fast.
M2 Pro 16GB
- Tokens/sec: 20
- Memory bandwidth: 200 GB/s
- Notes: Modest improvement over M1 Pro. Still bandwidth-limited. Same VRAM constraints.
M2 Max 32GB
- Tokens/sec: 32
- Memory bandwidth: 400 GB/s
- Notes: Doubling the bandwidth roughly doubles the speed -- pattern holds. 32 GB opens up 13B models and 8B at Q8. Starts to feel practical for daily use.
M3 Max 36GB
- Tokens/sec: 35
- Memory bandwidth: 400 GB/s
- Notes: Marginal improvement over M2 Max. Same bandwidth, slightly better architecture efficiency. The 36 GB config is oddly sized -- 48 or 64 GB would be more useful for large models.
M3 Max 64GB
- Tokens/sec: 37
- Memory bandwidth: 400 GB/s
- Notes: More RAM doesn't mean more speed. Same bandwidth as the 36 GB config. The extra memory lets you run 70B Q4 models, but generation speed stays bandwidth-limited. Our Apple Silicon benchmarks go deeper on this.
M4 Pro 24GB
- Tokens/sec: 30
- Memory bandwidth: 273 GB/s
- Notes: M4 efficiency gains are real but the Pro's bandwidth ceiling limits throughput. Good for an 8B model, tight for anything larger.
M4 Max 64GB
- Tokens/sec: 52
- Memory bandwidth: 546 GB/s
- Notes: Significant jump. 546 GB/s yields 52 tok/s -- a better tokens-per-GB/s ratio than the RTX 3090 manages (68 tok/s from 936 GB/s). Handles 70B Q4 models with room for context.
M4 Max 128GB
- Tokens/sec: 58
- Memory bandwidth: 546 GB/s
- Notes: Same speed as the 64 GB variant (same bandwidth). The extra memory means you can run 70B Q6_K or even Q8_0 models -- quality that desktop GPUs can't match without multiple cards. The 128 GB config is the Mac Pro/Max tier play. Full analysis in our M4 Max vs RTX 4090 comparison.
The Pattern: Memory Bandwidth Predicts Everything
Plot tokens per second against memory bandwidth and you get a nearly straight line. This is the single most important insight for anyone buying local AI hardware.
Why bandwidth matters more than compute: LLM inference is "memory-bound," not "compute-bound." During token generation, the bottleneck is reading model weights from memory, not multiplying them. CUDA cores (NVIDIA's parallel processing units) and tensor cores (specialized hardware for matrix multiplication) sit idle waiting for data most of the time.
This explains several things:
- Why the RTX 3090 (936 GB/s) stays within a few tokens/sec of the RTX 4070 Ti Super (672 GB/s) despite being a generation older
- Why Apple Silicon with 400-546 GB/s bandwidth competes with mid-range NVIDIA cards despite having fewer raw compute units
- Why GDDR7 on the RTX 5060 Ti and 5090 produces outsized gains over their direct predecessors -- the bandwidth jump does most of the work
The rough formula: For Llama 3.1 8B Q4_K_M, expect approximately 1 token/sec per 10 GB/s of memory bandwidth. It's a heuristic, not a precise calculation -- newer architectures tend to beat it and the highest-bandwidth cards fall short of it -- but it puts you in the right ballpark on both NVIDIA and Apple Silicon.
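Here's a quick sanity check of the rule of thumb against a sample of the measured results (all figures from the tables above). The physical intuition: each generated token streams the full ~4.9 GB of Q4_K_M weights from memory once, so bandwidth divided by model size is a hard ceiling, and real systems tend to land near half of that -- which is roughly where the divide-by-ten rule comes from.

```python
# Heuristic vs. measured results (all figures from this article).
MODEL_GB = 4.9  # approx. Llama 3.1 8B Q4_K_M weight size

results = {  # name: (bandwidth GB/s, measured tok/s)
    "RTX 3060 12GB": (360, 38),
    "RTX 3080 10GB": (760, 55),
    "RTX 4090 24GB": (1008, 105),
    "RTX 5090 32GB": (1792, 128),
    "M2 Max 32GB": (400, 32),
    "M4 Max 128GB": (546, 58),
}

for name, (bw, measured) in results.items():
    ceiling = bw / MODEL_GB  # theoretical max tok/s if purely memory-bound
    heuristic = bw / 10      # the article's rule of thumb
    print(f"{name:14s} ceiling {ceiling:5.0f}  "
          f"heuristic {heuristic:5.0f}  measured {measured:3d}")
```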
What These Numbers Mean in Practice
Under 15 tok/s: Noticeably slow. You can use it, but there's a visible delay on every response. Fine for batch processing or background tasks.
15-30 tok/s: Usable for conversation. Feels like a person typing fast. Good enough for daily use, slight delays on long outputs.
30-60 tok/s: Fast. Responses appear quickly. Most people are happy at this tier. Comparable to cloud API response times.
60-100 tok/s: Very fast. Responses feel nearly instant. You start generating faster than most people can read.
100+ tok/s: Instantaneous. At this speed you're limited by how fast you can read, not how fast the model generates. The 4090 and 5090 live here.
Recommendations by Budget
Under $300: The RTX 3060 12GB remains the best value. 38 tok/s with 12 GB VRAM. Buy used if possible.
$300-$600: The RTX 5060 Ti 16GB at 62 tok/s with 16 GB VRAM is the new sweet spot. If unavailable, the RTX 4060 Ti 16GB at 49 tok/s is solid.
$600-$1,000: A used RTX 3090 gives you 68 tok/s with 24 GB VRAM -- hard to beat. If buying new, the RTX 4070 Ti Super at 72 tok/s with 16 GB is the pick if you don't need to run models larger than 14B.
$1,000+: The RTX 4090 at 105 tok/s is still the NVIDIA king. The RTX 5090 at 128 tok/s and 32 GB VRAM is the new top if you can find one at MSRP.
Apple Silicon: The M4 Max 64GB at 52 tok/s offers the best balance of speed and model capacity. The 128 GB variant only makes sense if you're running 70B+ models regularly.
For full hardware purchasing recommendations beyond just speed, including build guides and component selection, see our best local LLM hardware guide and best GPUs for local LLMs.
All benchmarks conducted March 2026 using llama.cpp server build b4917. Results may vary with driver updates, newer llama.cpp builds, and different models. These numbers are specific to Llama 3.1 8B Q4_K_M -- expect larger models to be slower roughly in proportion to their size in memory.