Running LLMs locally means one thing matters more than anything else: VRAM. Not CUDA cores. Not clock speed. VRAM.
A model has to fit in your GPU’s memory or performance falls off a cliff. Everything in this guide flows from that single constraint.
Here’s exactly what each GPU can run, what it costs right now, and which one you should buy.
Why VRAM Is the Only Spec That Matters
When you generate text with a local LLM, your GPU spends most of its time streaming model weights out of VRAM, not doing math. This makes inference memory-bandwidth bound, not compute-bound.
Two things follow from this:
- If the model doesn’t fit in VRAM, it spills to system RAM. Speeds drop from 50+ tokens/second to 2-5 tokens/second. Unusable for anything interactive.
- More bandwidth = faster tokens. Between two cards with the same VRAM, the one with higher memory bandwidth generates text faster.
The second factor that matters is quantization — compressing a model to use less memory. A 7-billion parameter model needs about 14GB in full precision (FP16), but only 4.3GB when quantized to Q4_K_M. You trade a small amount of output quality for dramatically lower VRAM requirements.
Common quantization levels:
- Q4_K_M — Most popular. ~4.5 bits per parameter. Fits large models into consumer GPUs with minimal quality loss.
- Q5_K_M — Better quality, slightly larger. Good middle ground.
- Q8_0 — Near-lossless. Roughly double the size of Q4. Use this when VRAM allows.
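The arithmetic behind those numbers is simple enough to sketch. Below is a rough back-of-the-envelope estimator; the bits-per-weight figures are approximations for GGUF-style quants, not exact file sizes, and the estimate covers weights only (KV cache and runtime overhead add a couple of GB on top).

```python
# Back-of-the-envelope VRAM estimate for model weights at a given quantization.
# Bits-per-weight values are approximations (quant formats store scales too),
# so treat the output as a ballpark, not an exact file size.

BITS_PER_WEIGHT = {
    "FP16":   16.0,
    "Q8_0":    8.5,   # ~8.5 bits/weight once scales are included
    "Q5_K_M":  5.7,
    "Q4_K_M":  4.8,
}

def weights_gb(params_billion: float, quant: str) -> float:
    """Size of the weights alone in GB; KV cache and overhead come on top."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("FP16", "Q8_0", "Q4_K_M"):
    print(f"7B @ {quant:7s}: ~{weights_gb(7, quant):.1f} GB")
# 7B @ FP16: ~14.0 GB   7B @ Q8_0: ~7.4 GB   7B @ Q4_K_M: ~4.2 GB
```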
What Each GPU Can Actually Run
Before looking at specific cards, here’s the cheat sheet. This is what determines your purchase — find the model sizes you want to run, then buy the cheapest card that fits them.
| Model Size | Quantization | VRAM Needed | Minimum GPU |
|---|---|---|---|
| 7B | Q4_K_M | ~4.3 GB | RTX 4060 Ti 8GB |
| 7B | Q8_0 | ~7.6 GB | Any 16GB card (8GB is too tight) |
| 13B | Q4_K_M | ~7.4 GB | Any 16GB card |
| 13B | Q8_0 | ~13 GB | Any 16GB card |
| 14B | Q4_K_M | ~8.5 GB | Any 16GB card |
| 30B | Q4_K_M | ~16.7 GB | RTX 4090 (24GB) |
| 30B | Q8_0 | ~30 GB | RTX 5090 (32GB, tight) |
| 70B | Q4_K_M | ~37 GB | Dual GPU setup |
Key takeaway: No single consumer GPU can run 70B models at good quality. You either need two cards or aggressive quantization that visibly degrades output.
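Before buying anything (or to see where your current card lands), it's worth checking how much VRAM you actually have free; the desktop and other apps eat a chunk of it. Here's a minimal sketch using the nvidia-ml-py package (imported as pynvml), assuming an NVIDIA GPU and driver are present:

```python
# Query total and free VRAM on the first GPU via NVML.
# Assumes: pip install nvidia-ml-py (imported as pynvml) and an NVIDIA driver.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)      # GPU 0
name = pynvml.nvmlDeviceGetName(handle)            # bytes on older versions
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)       # values in bytes

free_gb = mem.free / 1e9
total_gb = mem.total / 1e9
label = name.decode() if isinstance(name, bytes) else name
print(f"{label}: {free_gb:.1f} GB free of {total_gb:.1f} GB")

# Rough sanity check against the table above: 13B Q4_K_M wants ~7.4 GB of
# weights plus a couple of GB for KV cache and runtime overhead.
if free_gb < 10:
    print("13B Q4_K_M will be tight here; try a 7B model or a shorter context.")

pynvml.nvmlShutdown()
```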
The GPUs, Ranked
Best Overall: RTX 4090 (24GB) — ~$1,700
The proven workhorse. 24GB of GDDR6X at 1,008 GB/s of bandwidth. Runs everything up through 34B parameters at high-quality quantization (Q4/Q5) without breaking a sweat.
Performance:
- Llama 3.1 8B (Q4): ~149 tokens/second
- 34B models at Q5_K_M: fits comfortably with room for context
- 70B at Q4: does NOT fit on a single card
Power draw: 450W. You’ll want an 850W+ PSU.
The 4090 has been the enthusiast standard since launch and nothing has fully replaced it. The RTX 5090 has more VRAM and bandwidth but costs nearly double at current street prices.
If you’re running models in the 13B-34B range and want headroom for larger context windows, this is the card.
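If you want to verify tokens-per-second figures like these on your own hardware rather than trusting benchmark roundups, a quick timing loop is enough. Here's a minimal sketch using llama-cpp-python with full GPU offload; the GGUF file name is a placeholder for whatever quantized model you've downloaded:

```python
# Measure end-to-end generation speed with llama-cpp-python.
# Assumes a CUDA-enabled build of llama-cpp-python and a local GGUF file
# (the path below is a placeholder).
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-8b-instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=4096,
    verbose=False,
)

prompt = "Explain in two sentences why LLM inference is memory-bandwidth bound."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```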
Best Value: RTX 4070 Ti Super (16GB) — ~$900
The sweet spot most developers land on. 16GB GDDR6X with 672 GB/s bandwidth — enough to run 7B and 13B models fast, with room to stretch to 14B.
Performance:
- Llama 3.1 8B (Q4): ~90-100 tokens/second
- 13B models: smooth, plenty of VRAM headroom
- 30B+: won’t fit
Power draw: 285W.
Multiple benchmark sites call this the “best value for most developers” as of early 2026. You get solid speed at a price point that doesn’t require justifying the purchase to anyone.
The 16GB ceiling is real — you can’t run 30B models. But for the majority of practical use cases (coding assistants, chatbots, document analysis), 13-14B quantized models are more than capable.
Best Budget: RTX 4060 Ti 16GB — ~$480
The entry point for serious local AI work. 16GB of GDDR6, same VRAM as the 4070 Ti Super, at almost half the price.
The catch: memory bandwidth. At 288 GB/s (versus 672 GB/s on the 4070 Ti Super), you generate tokens at roughly half the speed on the same models.
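You can see why with a simple ceiling estimate: every generated token has to stream the full set of weights out of VRAM once, so tokens per second can't exceed bandwidth divided by model size. A rough sketch (real numbers land below this ceiling because of compute and framework overhead):

```python
# Roofline-style ceiling on generation speed: tokens/s <= bandwidth / model size.
# Rough numbers only; measured throughput is always some fraction of this.

def ceiling_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

MODEL_GB = 4.9  # ~8B model at Q4_K_M, weights only
for card, bw in [("RTX 4060 Ti 16GB", 288), ("RTX 4070 Ti Super", 672)]:
    print(f"{card}: ceiling ~{ceiling_tok_per_s(bw, MODEL_GB):.0f} tok/s")
# 4060 Ti: ~59 tok/s ceiling (measured ~48); 4070 Ti Super: ~137 (measured ~90-100)
```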
Performance:
- Llama 3.1 8B (Q4): ~48 tokens/second
- Qwen 2.5 14B (Q4): ~26 tokens/second
- 13B Q8: fits, runs at ~20 tokens/second
Power draw: 165W — the most efficient card on this list.
48 tokens/second on an 8B model is perfectly usable. That’s faster than you can read. If budget is the primary constraint and you don’t need blazing speed, this card gives you access to the same model range as cards costing twice as much.
Important: Avoid the 8GB version. It tops out at 7B models; nothing larger fits. The 16GB model is worth every dollar of the price difference.
Best New-Gen: RTX 5070 Ti (16GB) — ~$940
The Blackwell architecture brought a meaningful bandwidth upgrade. The 5070 Ti delivers ~896 GB/s on 16GB of GDDR7, a 33% bandwidth jump over the 4070 Ti Super at roughly the same price.
Performance:
- Llama 3.1 8B (Q4): ~80 tokens/second
- Qwen 2.5 14B (Q4): ~48 tokens/second
- Blackwell-exclusive NVFP4 quantization: 1.6x throughput vs BF16
Power draw: 300W.
The NVFP4 quantization format is exclusive to Blackwell GPUs and delivers significantly better speed with only 2-4% quality loss compared to full precision. As more inference engines add NVFP4 support, this card’s advantage will grow.
If you’re buying new and want a 16GB card, the 5070 Ti is the pick over the 4070 Ti Super: same VRAM, more memory bandwidth, and a clearer upgrade path as NVFP4 support spreads.
Skip: RTX 5080 (16GB) — ~$1,450
The 5080 costs 54% more than the 5070 Ti for the same 16GB VRAM ceiling and only marginally higher bandwidth (960 vs 896 GB/s). For LLM inference specifically, the extra compute cores don’t help — you’re memory-bound.
At $1,450, you’re approaching RTX 4090 territory, which gives you 24GB and runs an entire class of models the 5080 can’t touch.
The 5080 is a fine gaming GPU. For local LLMs, the value math doesn’t work.
The Flagship: RTX 5090 (32GB) — ~$3,000+
The new consumer king. 32GB of GDDR7 at 1,792 GB/s, roughly 78% more bandwidth than the 4090.
Performance:
- Llama 3.1 70B: ~85 tokens/second (with quantization to fit 32GB)
- 8B models: 200+ tokens/second
- Two cards in one machine: 64GB of combined VRAM, enough to run 70B at Q4_K_M without compromise (consumer Blackwell has no NVLink, so the split runs over PCIe)
Power draw: 575W. Two-slot cooler, needs serious airflow and a 1000W+ PSU.
The 5090 is objectively the best consumer GPU for local AI. But the MSRP of $1,999 is a fantasy in February 2026 — a global DRAM shortage has pushed street prices to $3,000-3,500, with projections of $5,000 by mid-2026. Founders Edition cards sold out in 8 minutes at launch.
At MSRP, the 5090 is a compelling upgrade. At current street prices, you’re paying a 50-75% premium for the privilege of owning one now.
The Dark Horse: RTX 3090 Used (~$800)
Here’s a move most buyers overlook: a used RTX 3090 gives you 24GB of VRAM — the same as a 4090 — at less than half the price.
Performance:
- Bandwidth: 936 GB/s (only 7% less than the 4090’s 1,008 GB/s)
- 70B models at aggressive quantization: ~42 tokens/second
- Same practical model ceiling as the 4090
Power draw: 350W.
The 3090 was the AI workhorse before the 4090 existed. Used prices have dropped as miners and early adopters move to newer cards. At $800 you’re paying $33 per GB of VRAM, the best ratio on this list, compared to $71/GB on a 4090 and $94+/GB on a 5090.
Downsides: no warranty, older architecture, no NVFP4 support, and you’re buying used. But for pure model-fitting ability, nothing beats it at this price.
What About AMD?
The RX 7900 XTX offers 24GB at ~$950 — cheaper than a 4090 with competitive bandwidth (960 GB/s). On paper, it’s a strong option.
In practice, AMD’s ROCm software stack still has gaps. It works with llama.cpp and some inference servers, but not all frameworks support it, debugging resources are thin, and you’ll spend more time troubleshooting setup than with NVIDIA’s CUDA ecosystem.
If you’re comfortable with Linux, ROCm, and potential compatibility issues, the 7900 XTX is real competition. For everyone else, stick with NVIDIA.
The 2026 Memory Crisis
A factor you need to know about: GPU prices in early 2026 are inflated. A global GDDR/DRAM shortage — driven by massive AI datacenter buildouts — has pushed prices 25-75% above MSRP across the board. RAM module prices have risen roughly 172% since mid-2025.
This means:
- RTX 5090: MSRP $1,999, street price $3,000+
- RTX 5080: MSRP $999, street price ~$1,450
- RTX 5070 Ti: MSRP $749, street price ~$940
If you can wait 6-12 months, prices will likely normalize. If you need a card now, factor the premium into your budget and lean toward used 40-series or 30-series cards for better value.
Quick Recommendations
“I just want to try local AI” — RTX 4060 Ti 16GB (~$480). Runs 7B-14B models, low power draw, won’t break the bank.
“I’m serious about this” — RTX 4070 Ti Super ($900) or RTX 5070 Ti ($940). Fast enough for daily use, handles 13-14B models well.
“I want the best I can get under $2,000” — RTX 4090 (~$1,700). 24GB handles 30B+ models, proven track record, massive community support.
“I want maximum capability” — RTX 5090 (~$3,000+). 32GB, the fastest bandwidth available, and a second card gets you to 64GB. But you’re paying a steep markup right now.
“I’m on a tight budget” — Used RTX 3090 (~$800). 24GB VRAM, same model ceiling as a 4090, best value per GB.
Multi-GPU: When It Makes Sense
Adding a second GPU is only useful when your model doesn’t fit on a single card. If the model already fits, a second GPU adds communication overhead and actually makes inference slightly slower.
Two realistic multi-GPU setups:
- 2x RTX 3090 (~$1,600 total): 48GB pooled VRAM. Runs 70B at Q4_K_M with room to spare. Best budget path to 70B models.
- 2x RTX 5090: 64GB combined across PCIe. Runs 70B at higher-quality quants (Q5/Q6) or quantized 100B+ models. The premium option.
Your motherboard needs to support x8/x8 PCIe lanes for dual GPUs, and your PSU needs to handle the combined power draw. For dual 3090s, plan for a 1200W+ power supply.
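If you do go dual-GPU, llama.cpp-based runtimes can split one model across both cards without NVLink. Here's a minimal llama-cpp-python sketch for a 2x 24GB setup; the model path is a placeholder, and the 50/50 split assumes two identical cards:

```python
# Split a 70B Q4_K_M model across two GPUs with llama-cpp-python.
# Assumes a CUDA-enabled build, two visible GPUs, and a local GGUF file
# (the path is a placeholder).
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.1-70b-instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,          # offload all layers
    tensor_split=[0.5, 0.5],  # share of the model on GPU 0 vs GPU 1
    n_ctx=4096,
    verbose=False,
)

out = llm("List two tradeoffs of dual-GPU inference.", max_tokens=128)
print(out["choices"][0]["text"])
```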
Bottom Line
The GPU market in early 2026 is unusual — a DRAM shortage has inflated prices and limited stock, especially on new Blackwell cards. The fundamentals haven’t changed though: buy the most VRAM you can afford, favor bandwidth over raw compute, and pick your card based on the model sizes you plan to run.
For most people getting into local AI, the RTX 4070 Ti Super or RTX 5070 Ti hits the sweet spot. For power users, the RTX 4090 remains hard to beat. And if budget is king, a used RTX 3090 gives you flagship-tier VRAM at mid-range prices.
The best GPU is the one that fits your models and your budget. Everything else is a spec sheet distraction.