The 163 tokens per second headline that went around Reddit last month is real. It's also completely misleading.
That number comes from running Llama 3.2 1B Instruct on the RX 9070 XT — a model so small it fits in the VRAM of a gaming laptop from 2019. Nobody builds a local AI rig for 1B models. The number people actually care about, 7B inference with HIP on Windows, is closer to 71 t/s. Still decent. Still worth knowing. But not 163.
The RTX 5090 meanwhile is sitting at $4,100 street price while scalpers lick their chops and NVIDIA quietly diverts production toward datacenter orders. MSRP is $1,999. Good luck finding one for that.
So here's the honest comparison.
The Hardware, Stripped Down
RTX 5090
- 32GB GDDR7
- 1,792 GB/s memory bandwidth
- Blackwell architecture (GB202)
- 575W TDP
- Street price: $3,800–$4,100+ (MSRP $1,999, perpetually unavailable at that price)
RX 9070 XT
- 16GB GDDR6
- ~645 GB/s memory bandwidth
- RDNA4 architecture (gfx1201)
- 304W TDP
- Street price: $729–$810 (actually in stock)
The bandwidth gap is real — the 5090 has roughly 2.8x the memory bandwidth, and for LLM inference that matters more than compute. Memory bandwidth is how fast weights flow through the GPU every token. Wider pipe, faster generation.
But 2.8x bandwidth doesn't mean 2.8x real-world throughput. Software overhead, quantization format, and the specific model architecture all chip away at the theoretical ceiling.
The 163 t/s Number: What's Actually Being Measured
LocalScore benchmarks the 9070 XT at 163 tokens/sec generation speed — on Llama 3.2 1B Instruct Q4_K-Medium. That model weighs about 0.8GB. The GPU is basically sitting idle between token generations.
Drop to the 3B model in the same benchmark suite and it falls to 81 t/s. Still good, but you've already halved it by choosing a slightly less tiny model.
With llama.cpp's HIP backend on Windows — the configuration most people actually run — the 9070 XT lands around 71 t/s, but that figure comes from a Qwen 3.5 4B-class workload. A proper 7B at Q4 sits in the 55–70 t/s range depending on quantization and context length.
Note
Why model size matters for throughput: LLM generation is memory-bandwidth-bound, not compute-bound. Each token requires loading the full model weights through VRAM. Smaller models = fewer bytes moved = faster tokens. A 1B model and a 7B model live in different performance universes on the same hardware.
The 5090 on a 7B model generates roughly 200–240 t/s. Independent benchmarks from hardware-corner.net and localllm.in corroborate this — the 5090's 32GB GDDR7 with 1.8 TB/s bandwidth puts it in a genuinely different class for generation speed. That works out to roughly 3–3.5x the 9070 XT at the 7B level.
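If you want to sanity-check these numbers yourself, the back-of-envelope version is just bandwidth divided by model size. A minimal sketch, assuming spec-sheet bandwidth and approximate Q4 file sizes (both assumptions, not measurements):

```python
# Rough ceiling on generation speed: every token streams the full set of
# quantized weights through VRAM, so t/s can't exceed bandwidth / model size.
def ceiling_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

cards = {"RX 9070 XT": 645, "RTX 5090": 1792}   # GB/s, spec sheet
models = {"1B Q4": 0.8, "7B Q4": 4.4}           # GB, approximate file sizes

for card, bw in cards.items():
    for model, size in models.items():
        print(f"{card:11s} {model}: <= {ceiling_tps(bw, size):5.0f} t/s")
```

Measured numbers land at roughly half those ceilings once KV-cache reads, dequantization, and kernel overhead get involved, which is why the 5090's real-world 7B advantage roughly tracks its bandwidth advantage.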
The Price Math Nobody Wants to Do
The 9070 XT is $729. The 5090 is $4,100 street. That's a 5.6x price difference.
You're paying 5.6x more for 3–3.5x the throughput.
Per dollar of GPU, the 9070 XT generates roughly 97 tokens per second per $1,000 spent. The 5090 clears about 51 t/s per $1,000 at street price. The AMD card wins on price-performance by nearly 2:1, which is a fairly brutal ratio when you write it out.
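If your local prices differ (and they will), the value math is trivial to rerun:

```python
# Tokens per second per $1,000 of GPU, using street prices and the
# mid-range 7B generation figures quoted above. Swap in your own numbers.
cards = {
    "RX 9070 XT": {"price": 729,  "tps_7b": 71},
    "RTX 5090":   {"price": 4100, "tps_7b": 210},   # midpoint of 200-240
}

for name, c in cards.items():
    per_1k = c["tps_7b"] / (c["price"] / 1000)
    print(f"{name}: {per_1k:.0f} t/s per $1,000")
```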
The one place the 5090 genuinely wins the value argument is if you need to run large models. The 9070 XT's 16GB ceiling means 30B+ models require aggressive quantization or partial CPU offload — and the moment layers start hitting system RAM, throughput craters. The 5090's 32GB handles 30B Q4 comfortably and can run 70B at Q3 quantization with full GPU residency.
For 7B and 13B models — the workhorses of most local setups — the extra VRAM buys you nothing practical.
AMD's Software Stack: The Part Review Sites Gloss Over
The 9070 XT's gfx1201 architecture is new enough that the software situation in early 2026 is genuinely messy on Windows.
The main Ollama release didn't recognize the card at launch. A community fork (ollama-for-amd) patched it first, and official support arrived by mid-2025, but getting it working on Windows still requires manually replacing DLLs in some configurations. There are active GitHub issues as of March 2026 where users report ROCm initializing, detecting the GPU, then falling back to CPU anyway.
The Vulkan vs. HIP split is the real gotcha. On Windows, Vulkan should "just work" as a backend — and it does, for older models. But newer architectures like Qwen 3.5 see severe performance regression under Vulkan. One documented case shows 71 t/s with HIP dropping to 18 t/s under Vulkan on the same 4B model. That's not a rounding error. That's a different product.
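Don't take any backend numbers on faith, including these. If you're running Ollama, it reports token counts and generation time for every request, so measuring your own box is a ten-line script. A minimal sketch, assuming a local Ollama instance on the default port and a model you've already pulled (the model name is a placeholder):

```python
# Measure real generation speed from Ollama's own counters. The final
# response of /api/generate includes eval_count (tokens generated) and
# eval_duration (nanoseconds spent generating them).
import json
import urllib.request

payload = json.dumps({
    "model": "llama3.2",          # placeholder: use whatever you're testing
    "prompt": "Explain memory bandwidth in two sentences.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

tokens = result["eval_count"]
seconds = result["eval_duration"] / 1e9
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} t/s")
```

If the number you get looks more like the Vulkan column than the HIP one, backend selection is the first thing to check.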
Caution
Windows HIP setup on the 9070 XT is not plug-and-play. The FP8 (e4m3fn) operators needed for some optimizations still throw NotImplementedError on RDNA4 under Windows as of March 2026. If you're running Windows and want maximum performance, plan for an afternoon of driver archaeology and DLL swapping — or use WSL2.
Linux is a different story. ROCm 6.4.1 on Ubuntu 25.04 with gfx1201 target compiles cleanly. Ollama via Docker with ROCm, llama.cpp HIP builds, and vLLM all work. If Linux is viable for your setup, the 9070 XT is a significantly less painful experience.
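Before debugging anything higher up the stack, it's worth confirming the GPU is visible at all. If you have a ROCm build of PyTorch installed, three lines answer that:

```python
# Sanity check for a ROCm PyTorch install: torch.version.hip is set only
# on ROCm builds, and ROCm reuses the torch.cuda API surface.
import torch

print("ROCm build:", torch.version.hip is not None)
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```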
NVIDIA's CUDA ecosystem simply doesn't have this problem. The 5090 works with Ollama, LM Studio, llama.cpp, vLLM, and anything else on day one. No forks, no DLL replacements, no backend selection drama.
What the 5090's 32GB Actually Unlocks
At 16GB, the 9070 XT handles:
- 7B models at any quantization
- 13B models at Q4 and below
- Some 22B models at aggressive quantization (tight)
- 30B+ models: partial CPU offload required
At 32GB, the 5090 handles:
- Everything above, plus
- 30B Q4 fully in VRAM
- 70B at Q3_K_S (roughly 30GB)
- Experimental 32B models without compromise
The jump from 70B CPU-offloaded to 70B fully GPU-resident is enormous. Offloaded, you're looking at 3–5 t/s; fully resident, the bandwidth math alone allows up to roughly 60 t/s on the 5090 (1.8 TB/s streaming ~30GB of weights per token), and even with real-world overhead you stay an order of magnitude ahead of offloading. If running frontier-class models locally is the actual goal, 32GB isn't a luxury.
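A rough way to gut-check whether a model fits before you download 30GB of weights: multiply parameter count by the quant format's effective bits per weight. The bits-per-weight figures below are approximations, and KV cache plus runtime buffers come on top, so treat the output as a floor rather than a guarantee:

```python
# Rough weight footprint: params x effective bits-per-weight / 8.
# KV cache and runtime buffers add a few more GB at modest context
# lengths, which is why "just barely fits" cases are fragile.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

for label, params, bpw in [("7B Q4_K_M",  7,  4.8),
                           ("13B Q4_K_M", 13, 4.8),
                           ("30B Q4_K_M", 30, 4.8),
                           ("70B Q3_K_S", 70, 3.5)]:
    print(f"{label}: ~{weights_gb(params, bpw):.0f} GB of weights")
```

That's the logic behind the lists above: 30B at Q4 clears 16GB only with offload, and 70B at Q3 squeaks under 32GB with almost nothing to spare.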
Tip
Running 70B models locally? The 5090's 32GB gets Llama 3.3 70B at Q3_K_S just barely into VRAM, at genuinely interactive generation speeds. No other consumer GPU pulls this off without offloading. If 70B inference is your main use case, the calculus flips — the 5090 becomes the only real option short of going multi-GPU.
Who Should Buy What
Buy the RX 9070 XT if:
- You're on Linux (or willing to use WSL2)
- 7B–13B models cover 90%+ of your usage
- Budget is a real constraint — $729 is $729
- You want the card now, not when scalper prices finally crack
Buy the RTX 5090 if:
- You regularly need 30B+ models fully in VRAM
- You're running Windows and want zero configuration headaches
- You're serving multiple users from the same machine and need raw throughput
- You can actually find one anywhere near $1,999 (good luck)
- Money is genuinely not a factor
The awkward middle case: if you're on Windows, primarily running 7B models, and the $3,371 price difference matters at all to you — the 9070 XT makes more sense on paper but requires accepting that AMD's Windows software stack is a work-in-progress. That's not a dealbreaker, but it's also not nothing.
The Honest Verdict
The 9070 XT is the right GPU for most people building a local AI rig in 2026. It's available, it's fast enough for the models most people actually use, and at $729 the value ratio isn't close.
The RTX 5090 is genuinely better at everything — faster generation, more VRAM, zero software friction. It's also, at $4,100 street price, asking you to pay $3,371 more for the privilege. That buys you 3x the throughput on 7B models and the ability to run 70B natively.
Whether that trade makes sense depends entirely on what you're running and how much workflow friction you can stomach. The 163 t/s headline was never the honest number. But 71 t/s on real models, in stock, for $729? That's still pretty good.
See Also
- RTX 3090 vs RX 9070 XT: Which Card Is Right for Your Local LLM Setup? — 9070 XT vs the 24GB used alternative
- Gemma 4 Is Coming: The GPU Sweet Spot for Google's Next Open Model — which tier covers the next Google open model
- RTX 5090 vs RTX 4090 for Local AI: Is the Upgrade Worth It? — if you're choosing between NVIDIA cards