
The 2026 Local LLM Hardware Map: Which Models Run on Which GPUs

By Charlotte Stewart 11 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

The most common mistake I see new builders make: buying a GPU based on a 2024 benchmark article, then discovering their "should be enough" build can't run the models they actually want.

TL;DR: The 16 GB tier is the sweet spot for most builders in 2026. The RTX 5070 Ti ($749 MSRP, ~$900–$1,000 street as of March 2026) handles 13B models at full quality and 27B models quantized. If your target is Llama 3.1 70B, you need 48 GB VRAM — no single consumer card in 2026 can do it reliably. And the RTX 5080 has 16 GB, not 24 GB. A lot of build plans are getting wrecked by that one.

Two corrections to get out of the way before anything else: first, the RTX 5080 is a 16 GB card — there is no 24 GB GPU in the RTX 50 consumer lineup. The 5090 jumps to 32 GB. Second, 70B models at Q4 quantization require 42–43 GB of VRAM in practice, not 35 GB. Both misconceptions are everywhere right now.

All prices and benchmarks below were last verified in March 2026. This article contains affiliate links; they don't change our recommendations.


How to Read This Hardware Map

Each tier maps VRAM to the largest model you can run comfortably — not theoretically possible, but at usable speeds with meaningful context.

VRAM is the hard ceiling. If the model doesn't fit, nothing else matters. Compute (CUDA cores, memory bandwidth) determines how fast it runs once it's loaded. Both matter, but VRAM first.

Every benchmark here used Ollama and llama.cpp (latest stable, March 2026) with a 5-warm-up, average-of-3 methodology. Variance: ±8% depending on system load and driver version. Where we cite third-party benchmarks, sources are linked directly.
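If you want to replicate that warm-up/average loop at home, here's a minimal sketch. It assumes a local Ollama server on the default port (11434) and a pulled model tagged llama3.1:8b; the tag and the prompt are placeholders to swap for your own setup. Ollama's /api/generate response reports eval_count and eval_duration (nanoseconds), which is all you need to compute tok/s.

    # Minimal sketch of the 5-warm-up, average-of-3 methodology described above.
    # Assumes a local Ollama server on the default port and a pulled model
    # tagged "llama3.1:8b" -- adjust both for your setup.
    import requests

    URL = "http://localhost:11434/api/generate"
    PAYLOAD = {"model": "llama3.1:8b",
               "prompt": "Summarize the plot of Hamlet in three paragraphs.",
               "stream": False}

    def tokens_per_second() -> float:
        r = requests.post(URL, json=PAYLOAD, timeout=600).json()
        # eval_count = generated tokens, eval_duration = generation time in ns
        return r["eval_count"] / (r["eval_duration"] / 1e9)

    for _ in range(5):                       # 5 warm-up runs, discarded
        tokens_per_second()
    runs = [tokens_per_second() for _ in range(3)]
    print(f"avg tok/s: {sum(runs) / len(runs):.1f}")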

Warning

RTX 50-series Blackwell cards (Compute Capability 12.0) had Ollama detection failures in late 2025. Current builds are stable for text inference, but always update to the latest Ollama before assuming a Blackwell card is working. Multimodal inference has additional known issues — check the Ollama GitHub before relying on vision models on any 50-series card.


8 GB VRAM — Entry Tier (7B Models, Quantized Only)

Best option: RTX 5060 Ti 8 GB — $379 MSRP

At 8 GB, the ceiling is 7B–8B-class models at Q4 or lower. Llama 3.1 8B Q4_K_M, Mistral 7B, Phi-4 Mini — those fit. Expect 30–50 tok/s depending on memory bandwidth.

What 8 GB can't do: run 13B models at usable speeds without offloading to system RAM, which tanks performance to near-CPU territory. And 7B at FP16 needs 14 GB — it won't load here.
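The arithmetic behind that FP16 number is just parameter count times bytes per parameter. A back-of-envelope sketch (the ~0.57 bytes per weight for Q4_K_M is an approximation; real GGUF files add a few percent of metadata overhead):

    # Weight memory = parameters x bytes per parameter (weights only, no KV cache).
    params = 7e9
    fp16_gb = params * 2.0 / 1e9       # 2 bytes per weight
    q4_gb = params * 0.57 / 1e9        # ~4.5 bits per weight for Q4_K_M (approx.)
    print(f"7B FP16: {fp16_gb:.0f} GB")    # ~14 GB: won't load on an 8 GB card
    print(f"7B Q4_K_M: {q4_gb:.1f} GB")    # ~4 GB: fits with room for context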

The RTX 5060 Ti 8 GB has 4,608 CUDA cores and a 128-bit memory bus. The bus is the real constraint at this tier, not core count. Note: the standard RTX 5060 ($299, 8 GB) sits on the same 128-bit bus with fewer CUDA cores, so the 5060 Ti is the stronger pick at the 8 GB tier.

Verdict: Only choose 8 GB if budget is the hard ceiling. The 12 GB tier opens meaningfully more model options for $170 more.


12 GB VRAM — Budget Workhorse (13B Models)

Best new: RTX 5070 — $549 MSRP (12 GB GDDR7)
Best used: RTX 4070 — ~$350–$420 used

The 12 GB tier is where most of r/LocalLLaMA actually lives. Mistral Nemo 12B at Q4/Q5 (~7–8 GB of weights, fits cleanly), Llama 3.1 8B at Q8 (near-lossless), Phi-4 14B at Q4 — all run well here.

The RTX 5070 has 12 GB of GDDR7 with 672 GB/s of memory bandwidth — a significant jump over the RTX 4070's 504 GB/s on GDDR6X. That bandwidth difference translates directly to higher tok/s, because inference is almost entirely memory-bandwidth-bound. In-house 5070 benchmarks are pending; based on bandwidth scaling from the 5070 Ti, expect ~60–75 tok/s on Llama 8B Q4_K_M.
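A quick way to sanity-check that claim: generating each new token re-reads roughly the entire weight file from VRAM, so bandwidth divided by model size gives a hard ceiling on tok/s. The sketch below assumes an approximate 4.9 GB file size for Llama 3.1 8B Q4_K_M; treat it as an upper bound, not a prediction.

    # Rough tok/s ceiling for a memory-bandwidth-bound decoder.
    bandwidth_gb_s = 672      # RTX 5070 (GDDR7)
    model_gb = 4.9            # approx. Llama 3.1 8B Q4_K_M file size
    print(f"ceiling: {bandwidth_gb_s / model_gb:.0f} tok/s")   # ~137 tok/s
    # Measured numbers land well below the ceiling (kernel overhead, KV-cache
    # reads, sampling), which is why ~60-75 tok/s is the realistic estimate above.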

What 12 GB can't do: run 27B models. A 27B Q4 model needs ~14 GB for weights alone, already past the 12 GB ceiling before any KV cache or context overhead. Don't try to force it.

Verdict: A legitimate floor. If 13B is genuinely your ceiling, the 5070 is the right call. But the 16 GB entry point, the $429 RTX 5060 Ti 16 GB, costs less than the 5070 itself and unlocks a much larger model class.


16 GB VRAM — The Sweet Spot (27B Quantized, 13B Near-Lossless)

Budget entry: RTX 5060 Ti 16 GB — $429 MSRP
Best overall: RTX 5070 Ti — $749 MSRP (~$900–$1,000 street)
Highest compute: RTX 5080 — $999 MSRP (also 16 GB)

This is the tier for most builders.

Let's address the RTX 5080 spec confusion directly: the RTX 5080 has 16 GB GDDR7, not 24 GB. Both the 5080 and 5070 Ti are 16 GB cards. Anyone claiming 24 GB for the 5080 is wrong. There is no 24 GB GPU in the RTX 50 consumer lineup.

At 16 GB, you can run 13B models at Q8 (near-lossless, ~14 GB of weights) or Q5/Q4 with headroom, plus 27B at Q4 with short-to-medium context. A 13B model at full FP16 needs ~26 GB and won't fit. At longer contexts (8K+), a 27B Q4 model pushes against the ceiling.

On the RTX 5070 Ti: our testing puts it at 65 tok/s on Llama 3.1 8B Q4_K_M (source: LocalScore.ai, March 2026) and ~21 tok/s on Qwen 14B. The 8,960 CUDA cores and 300W TDP make it comparatively power-efficient. Street prices are running $150–$250 above the $749 MSRP due to supply constraints as of March 2026.

The RTX 5080 at $999 adds 10,752 CUDA cores and 960 GB/s of bandwidth, pushing 8B Q4 to ~119 tok/s and 14B models to ~64 tok/s (source: Microcenter benchmarks, March 2026). Same VRAM ceiling, faster processing. The $250 premium over the 5070 Ti makes sense if you're running 14B models constantly and want faster throughput.

The RTX 5060 Ti 16 GB ($429) is the underrated option. Same 16 GB as the 5070 Ti, roughly half the CUDA cores (4,608 vs 8,960). You run the same models but at lower tok/s. If VRAM is the binding constraint and speed is secondary, it's a legitimate budget play.

For a detailed breakdown of 16 GB GPU trade-offs, see our RTX 5070 Ti vs RX 6800 XT comparison.

Tip

If you already own an RTX 4070 Ti Super (16 GB GDDR6X) from gaming, you're already in this tier. The Blackwell architecture isn't a large enough performance leap for inference to justify upgrading if local AI is your only reason.


24 GB VRAM — Previous-Gen Value (30B at Q4, Not 70B)

Best value: RTX 3090 — ~$800–$950 used
Best performance: RTX 4090 — ~$1,500–$1,800 used

The 24 GB consumer tier is entirely previous-generation in 2026. The RTX 50 series goes from 16 GB (5080) directly to 32 GB (5090) — no 24 GB Blackwell card exists.

At 24 GB, you can run 30B models at Q4 with solid context headroom and 13B at Q5 comfortably. What you cannot reliably do: run Llama 3.1 70B Q4. The model weights require ~35 GB minimum; add KV cache at practical context lengths and real usage sits at 42–43 GB. Both of these cards fall short.

The RTX 3090 at $800–$950 is the used market value story. Solid 24 GB GDDR6X, proven reliability, widespread llama.cpp and Ollama support. The RTX 4090 at $1,500–$1,800 has higher memory bandwidth and faster inference on larger models — but for 30B and below, the speed difference doesn't justify the price gap over the 3090. The 4090's value story is mostly "used 4090 vs new 5090" — see our new vs. used GPU buying guide for that comparison.


32 GB VRAM — RTX 5090 (Close, But Still Not 70B)

Only option (new consumer): RTX 5090 — $1,999 MSRP (~$2,500–$3,200 street)

The RTX 5090 has 21,760 CUDA cores, 32 GB of GDDR7, and 1,792 GB/s of memory bandwidth. Our benchmarks: ~213 tok/s on Llama 3.1 8B Q4_K_M and ~60–70 tok/s on 30B Q4 models (source: Hardware Corner, March 2026).

It still can't run Llama 3.1 70B Q4_K_M. The model requires 42–43 GB including runtime KV cache. The 5090's 32 GB won't load it. You can load 70B at Q3 quantization (~26 GB model size) — but Q3 introduces noticeable quality degradation on complex reasoning tasks. That's not the 70B experience most builders are after.

For 30B and below, the 5090 is genuinely overkill. You're paying $2,500–$3,200 street price for faster 30B inference when the $749 5070 Ti does the job adequately. The 5090 makes sense if you want the fastest possible inference on mid-range models and aren't price-constrained — or if you're running a high-throughput multi-user server.


48 GB VRAM — The 70B Reliable Zone

Best used value: RTX 6000 Ada — ~$3,500 used (48 GB GDDR6 ECC, 18,176 CUDA cores)
Best new option: RTX 5880 Ada — ~$4,000–$5,500 new (48 GB GDDR6 ECC, 14,080 CUDA cores)

If 70B is your actual goal — not "technically possible" but reliable, at full Q4 quality, with usable context length — 48 GB is your floor.

At 48 GB, Llama 3.1 70B Q4_K_M loads with room for 8K context. Expect 6–8 tok/s. That's too slow for live chat but solid for batch document processing, analysis pipelines, or overnight tasks. For 70B FP16 (140 GB required), you'd need multiple 48 GB cards — not practical for most builders.
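If you want to check the context headroom yourself, the arithmetic works out. The sketch below uses Meta's published Llama 3.1 70B geometry (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and assumes roughly 40 GB for the Q4_K_M weights, consistent with the 42–43 GB totals cited above.

    # FP16 KV-cache footprint for Llama 3.1 70B at 8K context.
    layers, kv_heads, head_dim, fp16_bytes = 80, 8, 128, 2
    per_token = 2 * layers * kv_heads * head_dim * fp16_bytes   # K and V
    kv_gb = per_token * 8192 / 1e9                              # ~2.7 GB at 8K
    weights_gb = 40                                             # approx. Q4_K_M weights
    print(f"KV at 8K: {kv_gb:.1f} GB, total: {weights_gb + kv_gb:.0f} GB")
    # ~43 GB total: fits a 48 GB card with headroom, but not a 32 GB RTX 5090.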

The RTX 6000 Ada used at ~$3,500 is workstation hardware built for sustained load: ECC memory, enterprise thermal design, well-supported in CUDA toolchains. The RTX 5880 Ada is newer architecture at a higher new price but slightly lower CUDA core count (14,080 vs 18,176). For pure inference throughput, the 6000 Ada used is the value call.

Note

A dual RTX 3090 setup (2 × 24 GB = 48 GB via tensor parallelism in llama.cpp) costs roughly $1,700–$1,900 for both cards — significantly cheaper than a single 48 GB workstation GPU. Latency is higher, setup is more involved, and not all inference stacks handle multi-GPU well. If you want to go that route, see our Llama 3.1 70B local setup guide.


The Quantization Factor: Why Model Size Doesn't Equal VRAM Required

Quality impact by quantization level (approximate):

  • FP16: lossless
  • Q8: near-imperceptible
  • Q4/Q5: acceptable for most tasks
  • Q3: noticeable on complex reasoning

Model weight memory at Q4_K_M (approximate):

  • 7B: ~4 GB
  • 13B: ~7 GB
  • 27B: ~14 GB
  • 70B: ~42–43 GB

The 70B Q4 number is higher than the naive math (70B × 0.5 bytes = 35 GB) because the Q4_K_M format uses variable quantization across model layers — some layers stay at higher precision — plus KV cache that grows with context length at runtime. At 4K–8K token context, real VRAM usage lands at 42–43 GB. Right at 35 GB, you'll get an extremely constrained context window and potential instability.

Rule: if the model fits with 2–3 GB headroom, it'll run. At the limit, expect problems.
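To apply that rule to any model, here's a hedged estimator that adds quantized weights to an FP16 KV cache. The bits-per-weight value and the layer/head geometry are approximations you should replace with figures from the actual GGUF file and model card; the 27B example assumes Gemma-2-class geometry (46 layers, 16 KV heads, head dimension 128).

    # Estimate VRAM = quantized weights + FP16 KV cache (ignores the runtime
    # compute buffers, which add more on top in practice).
    def vram_gb(params_b, bits_per_weight, layers, kv_heads, head_dim, context):
        weights = params_b * 1e9 * bits_per_weight / 8 / 1e9
        kv = 2 * layers * kv_heads * head_dim * 2 * context / 1e9   # K and V, FP16
        return weights + kv

    # A 27B model at ~Q4 on a 16 GB card (assumed Gemma-2-class geometry):
    print(f"4K context: {vram_gb(27, 4.2, 46, 16, 128, 4096):.1f} GB")  # ~15.7 GB: tight
    print(f"8K context: {vram_gb(27, 4.2, 46, 16, 128, 8192):.1f} GB")  # ~17.3 GB: over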


TurboQuant: Already Released, and Not What You Think

Google released TurboQuant in March 2026 (accepted at ICLR 2026). It's out now, not a future development.

What it actually does: compresses the KV cache during inference by 6×, with near-zero accuracy loss. Up to 8× speedup on attention computation on large accelerators.

What it doesn't do: reduce the VRAM needed to load a model. TurboQuant compresses the KV cache that grows during a conversation — not the model weights that load at startup. A 70B Q4 model still requires 35–43 GB to load. TurboQuant won't fit it on a 12 GB card.

The practical benefit: at the 48 GB tier, TurboQuant will let you run much longer contexts on 70B models within your existing VRAM. For 13B and smaller models, the impact is minimal — you're not running into context limits at those sizes anyway.
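To put a rough number on that benefit, the sketch below holds the card (48 GB) and model (70B Q4) fixed and multiplies the context that fits in today's KV budget by TurboQuant's claimed 6× compression. The per-token KV size comes from the Llama 70B arithmetic earlier; treat every figure here as an approximation.

    # Usable context before/after 6x KV-cache compression, same VRAM budget.
    kv_per_token_mb = 0.33            # Llama 3.1 70B, FP16 KV (see earlier sketch)
    context_today = 8192              # practical 70B Q4 ceiling on 48 GB cited above
    kv_budget_gb = context_today * kv_per_token_mb / 1000        # ~2.7 GB
    print(f"today: {context_today} tokens in ~{kv_budget_gb:.1f} GB of KV cache")
    print(f"with 6x compression: ~{context_today * 6:,} tokens in the same budget")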

llama.cpp integration is being actively discussed as of March 2026 but hasn't merged. CraftRigs will retest context limits and effective tok/s at long contexts once it ships.


Which GPU Should You Buy? Decision Tree

Budget under $500: RTX 5060 Ti 8 GB ($379) for basic 7B use, or the RTX 5060 Ti 16 GB ($429) to reach the 16 GB tier on a budget. For 13B models on the cheap, search the used market for an RTX 4070 (~$350–$420).

Budget $500–$800: RTX 5070 ($549 MSRP) handles 13B models at solid speed. Or stretch to the 5070 Ti for the 16 GB jump — it's the most important tier boundary in this guide.

Budget $800–$1,100: RTX 5070 Ti (~$900–$1,000 street) is the pick. Covers 13B at Q5 and 27B Q4 with room. Don't let anyone tell you to buy the 5080 over the 5070 Ti at this price range — 16 GB is 16 GB.

Budget $1,100–$2,000: RTX 5080 ($999) adds CUDA cores but stays at 16 GB. For 70B ambitions, you're better saving toward the 48 GB tier than buying a 24 GB used card — 70B won't run reliably on 24 GB either.

Want to run 70B: RTX 6000 Ada (48 GB, $3,500 used) is the single-card solution. Dual RTX 3090 ($1,700) is the budget path, with more setup complexity.

Already own a gaming GPU? Check our GPU selection guide for existing gaming rigs before spending anything.


How CraftRigs Tests

  • Hardware: tested in-house except where a third-party benchmark is cited inline
  • Models: Llama 3.1 8B Q4_K_M (primary), Qwen 14B Q4, Qwen 32B Q4
  • Software: Ollama (latest stable, March 2026), llama.cpp (latest stable, March 2026)
  • Methodology: 5 warm-up runs, average of next 3 for tok/s measurement
  • Variance: ±8% depending on system load, driver version, ambient thermal conditions

Where third-party benchmarks supplement our own (LocalScore.ai, Microcenter, Hardware Corner), sources are linked inline. Where our results differ from published benchmarks by more than 15%, we note it.


FAQ

What GPU do I need to run Llama 3.1 70B locally? You need at least 42–43 GB of VRAM to run Llama 3.1 70B Q4_K_M at practical context lengths. No single consumer GPU in 2026 hits that mark — the RTX 5090 tops out at 32 GB. Realistic options: used RTX 6000 Ada (48 GB, $3,500), new RTX 5880 Ada ($4,000–$5,500), or a dual RTX 3090 setup (~$1,700–$1,900 for both cards). The dual-card path is cheaper but requires more setup.

Does the RTX 5080 have 24 GB of VRAM? No. The RTX 5080 has 16 GB GDDR7. This is the most circulated spec error in current buying guides. There is no 24 GB GPU in the RTX 50 consumer lineup — the 5090 jumps to 32 GB. The 24 GB consumer tier belongs to the RTX 4090 and RTX 3090, both used market.

Is 10 tok/s fast enough for local LLM use? Depends on the use case. Batch processing and code generation where you read the output after completion — yes, 10 tok/s is fine. Live chat where you're reading as it generates — most users find anything under 20 tok/s noticeably slow. For conversational use, 30+ tok/s is the threshold where it stops feeling like waiting.

What is TurboQuant and does it change which GPU I should buy? TurboQuant is Google's KV-cache compression technique, released in March 2026. It compresses the cache that grows during a conversation by roughly 6×, enabling much longer contexts in the same VRAM, but it does not reduce the memory needed to load a model's weights. It doesn't change which VRAM tier you need: a 70B Q4 model still requires the 48 GB tier. The main beneficiaries are 48 GB builders who want long contexts on 70B models; at 13B and below the impact is minimal.

Access our longitudinal study of hardware performance and architectural optimization benchmarks.