
Tenstorrent QuietBox 2 vs. Dual RTX 5090: Which $10K Local AI Setup Wins?

By Chloe Smith · 7 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

476.5 tokens per second on Llama 3.1 70B. That's what Tenstorrent claims for the QuietBox 2, a freshly announced RISC-V workstation that plugs into a standard 120-volt wall outlet and costs $9,999.

A dual RTX 5090 system runs the same model at about 27 tokens per second in Ollama.

On paper, it's not even close. But those numbers are measuring completely different things, and if you're seriously considering dropping $10K on local AI compute right now, you need to understand what each machine is actually good for before you wire the transfer.

What You're Actually Buying

The QuietBox 2 ships in Q2 2026. Four Blackhole ASICs, each carrying 32GB of GDDR6, for 128GB of accelerator memory total. Add 256GB of DDR5 system RAM, an AMD EPYC 8124P processor (16 cores, 32 threads), and liquid cooling quiet enough to sit on an office desk. Tenstorrent calls the compute 2,654 TFLOPS at BlockFP8 precision. Every layer of the stack — from kernel to compiler — is open source under Apache 2.0.

The dual RTX 5090 doesn't arrive as a product. You build it. Two Blackwell cards at roughly $2,000 each, a motherboard with dual PCIe x16 slots, a 1,600-watt power supply (the GPUs pull 575W each at load), 128GB of DDR5, and a case big enough to fit the whole thing. Minimum realistic cost for a capable build: $8,500. A properly specced workstation with a Threadripper Pro and 256GB of RAM pushes past $12,000. Let's call the honest comparison point around $10,500 once you've bought real components rather than the cheapest motherboard that technically fits.

The 5090 gives you 32GB of GDDR7 per card, 64GB combined, with 1,792 GB/s of memory bandwidth per card. No HBM. But the bandwidth is genuinely fast — fast enough that a single 5090 running tiny models can break 1,000 tokens/sec with aggressive kernel optimization.
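That "break 1,000 tokens/sec" claim passes a back-of-envelope test. A hedged sketch (my own arithmetic, not a vendor benchmark): assume single-stream decoding is memory-bandwidth-bound and every weight is read once per generated token, so decode speed is capped at bandwidth divided by model size.

```python
# Roofline-style ceiling for single-stream decode speed, assuming the decode
# loop is memory-bandwidth-bound and reads every weight once per token.
# (Ignores KV-cache traffic, kernel overhead, and interconnect costs.)

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/sec for one user: bandwidth / bytes of weights."""
    return bandwidth_gb_s / model_size_gb

# One RTX 5090 (1,792 GB/s) running a tiny model whose weights fit in ~1.5 GB:
print(round(max_tokens_per_sec(1792, 1.5)))  # ceiling around 1,195 tok/s

# A 70B model quantized to ~70 GB of weights across both cards, optimistically
# assuming perfect 2-way bandwidth scaling (2 x 1,792 GB/s):
print(round(max_tokens_per_sec(2 * 1792, 70), 1))  # ceiling around 51 tok/s
```

That ~51 tok/s ceiling also puts the measured numbers below in context: real software stacks sit well under the roofline.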

The Numbers That Look Obvious Aren't

Let me explain why 476.5 vs. 27 is misleading.

Tenstorrent's 476.5 tok/s figure for Llama 3.1 70B is a throughput benchmark. The machine is serving multiple concurrent requests, batching them together, and reporting aggregate output. Think of it as how many tokens the system produces per second across all users simultaneously. This is the relevant metric if you're building an API backend or running agentic pipelines where dozens of requests fly in parallel.

The dual 5090's 27 tok/s comes from an Ollama benchmark measuring single-user sequential inference. One request, one response, measured end-to-end. That's the latency-focused use case: you're sitting at a terminal, asking a question, waiting for an answer.

[!INFO] Throughput vs. latency: Throughput measures total tokens/sec across all concurrent users. Latency measures how fast a single user gets their response. Batch-oriented systems (like the QuietBox 2's Tensix architecture) dominate throughput benchmarks. Single-user interactive tools (Ollama on CUDA) tend to optimize for latency.

For a solo developer doing code review and Q&A, 27 tok/s feels fine — that's faster than most people read. For a team of eight running simultaneous agentic tasks, 27 tok/s becomes a bottleneck almost immediately.
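To make the team scenario concrete, here's a naive per-user split of the two headline numbers (my own arithmetic, with a hypothetical eight-person team). One caveat: the 27 tok/s figure was measured single-user in Ollama, so a batching server like vLLM on the 5090s would do better under concurrency than an even split implies.

```python
# Naive per-user view of aggregate throughput across concurrent users.
# Caveat: the dual-5090's 27 tok/s came from a single-user Ollama run; a
# batching server (e.g. vLLM) would scale better than an even split suggests.

def per_user_rate(aggregate_tok_s: float, concurrent_users: int) -> float:
    """Even division of total throughput, ignoring batching effects."""
    return aggregate_tok_s / concurrent_users

team = 8
print(per_user_rate(476.5, team))  # QuietBox 2 claim: ~59.6 tok/s per user
print(per_user_rate(27.0, team))   # naive dual-5090 split: ~3.4 tok/s per user
```

Even with generous assumptions for the 5090 side, the batch-throughput gap is the whole story of this comparison.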

One more thing the 476.5 number doesn't tell you: those are batch-optimized conditions. Real-world throughput on the QuietBox 2 with cold starts, varied sequence lengths, and the software actually installed on your machine will look different. The Register reviewed the original QuietBox last November and described it as "a high-performance RISC-V AI workstation trapped in a software blackhole." That was the first-generation machine, but the software maturity problem hasn't disappeared overnight.

The CUDA Problem (and Why It Actually Matters)

NVIDIA's CUDA ecosystem is 18 years old. That's not marketing. It means 18 years of kernel optimizations, 18 years of library development, 18 years of Stack Overflow answers.

Drop a dual RTX 5090 into a Linux box and within 15 minutes you can run Ollama, LM Studio, vLLM, ComfyUI, Automatic1111, Whisper, any fine-tuning framework you've ever heard of. PyTorch assumes CUDA. Transformers assumes CUDA. Every ML paper that shipped sample code in the last decade assumed CUDA.

Tenstorrent is not CUDA. Their stack is TT-Metalium — a custom low-level programming model — with the TT-NN neural-network library and their own inference framework layered on top. The software is genuinely open source and architecturally interesting. But it means every tool you want to run has to be ported or shimmed to work with it.

[!WARNING] The CUDA compatibility gap is real. As of March 2026, Tenstorrent maintains its own fork of vLLM. PyTorch integration goes through tt-xla (a PJRT device backend), and the torch-xla version compatible with vLLM 0.17 wasn't finished at launch. Fine-tuning frameworks like Axolotl don't run on Tenstorrent hardware at all. If you're doing anything beyond inference with supported models, you're in early-adopter territory.

What does work on the QuietBox 2? Llama 3.1 70B. GPT-OSS 120B (OpenAI's recently released open-weights model, which runs fully on the device). Qwen models. A curated list of workloads that Tenstorrent pre-validates.

What doesn't work? Pretty much anything you're not explicitly told works.

The dual RTX 5090 setup runs everything. Image generation, video generation, TTS, fine-tuning on custom datasets, multimodal models, embeddings. If a model exists and you have enough VRAM, it runs.

The Power and Noise Reality

Two RTX 5090s at full load draw around 1,150 watts from the GPUs alone. A full system at load easily clears 1,400 watts. That's around 12 amps on a 120V circuit, which means if your office runs on standard 15-amp circuits, the GPU rig consumes most of one circuit by itself. A 20-amp dedicated circuit is the right answer.
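The arithmetic behind those amperage figures, as a small sketch. The 80% continuous-load rule is standard NEC breaker practice; the 1,400 W system figure is this article's estimate, not a measurement.

```python
# Circuit-load check for the dual-5090 build. Standard NEC practice: keep a
# continuous load under 80% of the breaker's rating.

def amps(watts: float, volts: float = 120.0) -> float:
    return watts / volts

def headroom_watts(breaker_amps: float, load_watts: float, volts: float = 120.0) -> float:
    """Watts remaining before the load exceeds 80% of the breaker's capacity."""
    return 0.8 * breaker_amps * volts - load_watts

system_load_w = 1400
print(round(amps(system_load_w), 1))      # ~11.7 A
print(headroom_watts(15, system_load_w))  # 40.0 W left on a 15 A circuit
print(headroom_watts(20, system_load_w))  # 520.0 W left on a 20 A circuit
```

Forty watts of headroom is a desk lamp. Hence the dedicated 20-amp circuit.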

The QuietBox 2 runs on a standard 120V outlet. Tenstorrent explicitly markets this — no special power, no infrastructure changes, plug it in where you sit. Idle power dropped 50% from the first generation. Under full inference load, the system likely pulls somewhere in the 600-900W range (Tenstorrent hasn't published a precise figure for the QB2 yet).

[!TIP] Office deployment reality check: If you're putting this in an office that wasn't designed as a server room, the QuietBox 2's power requirements make installation straightforward. A dual 5090 workstation under sustained load will trip a standard 15-amp circuit if you're not careful about what else is on the circuit.

The "QuietBox" name is also literal. The liquid cooling keeps acoustic noise low enough for an office desk. A dual 5090 system with adequate cooling will run louder under AI load than most people want next to them all day.

Who Wins on Price (It's Complicated)

The QuietBox 2 starting at $9,999 is a complete, tested, supported system. You get the hardware, the software stack, and presumably some level of support from Tenstorrent. The full 4x Blackhole configuration with the highest-spec cards is estimated at approximately $11,999 (only the $9,999 starting price is confirmed on Tenstorrent's website at time of writing).

A dual RTX 5090 at $10,000 is a pile of parts. You're buying the GPUs, the motherboard, the PSU, the CPU, the RAM, the cooling, the case. You're assembling it, configuring it, and supporting yourself when something goes wrong.

If you value your time and want a system that works out of the box, the QuietBox 2's pricing is more defensible than it looks. If you already build systems and want flexibility, the dual 5090 route gives you a machine that does more things.

The Training Question

If you're fine-tuning models — even LoRA adapters on a 7B base — the dual RTX 5090 is the only real choice right now. CUDA is where all the fine-tuning tooling lives. Axolotl, Unsloth, LLaMA-Factory, TRL — none of these work on Tenstorrent hardware.
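For scale, here's why LoRA in particular is feasible on consumer VRAM: only a tiny fraction of the weights train. This sketch uses hypothetical Llama-7B-class dimensions (32 layers, hidden size 4096) and rank-8 adapters on the q and v projections — common defaults, not figures from this article.

```python
# Trainable-parameter count for LoRA on a 7B-class transformer. Each adapted
# square (hidden x hidden) matrix gains two low-rank factors, A (hidden x r)
# and B (r x hidden), so it adds r * 2 * hidden trainable parameters.

def lora_trainable_params(layers: int, hidden: int, rank: int, adapted_matrices: int) -> int:
    """Total trainable params: adapters on `adapted_matrices` projections per layer."""
    return layers * adapted_matrices * rank * 2 * hidden

# Rank-8 adapters on q_proj and v_proj across 32 layers:
trainable = lora_trainable_params(layers=32, hidden=4096, rank=8, adapted_matrices=2)
print(trainable)                # 4,194,304 trainable parameters
print(trainable / 7e9 * 100)    # ~0.06% of a 7B model
```

Roughly four million trainable parameters against seven billion frozen ones — which is why two 5090s can fine-tune what they could never fully train.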

The QuietBox 2 is an inference machine. Tenstorrent is honest about this: inference has overtaken training as the dominant AI workload, and they built accordingly. If your workflow is purely "run models, get outputs," the inference-only limitation doesn't sting.

But if there's any chance you'll want to adapt a model to your domain — add your company's documentation, fine-tune on your codebase — the dual 5090 preserves that option. The QuietBox 2 doesn't.

The Real Verdict

Buy the dual RTX 5090 if: you want to run the widest possible range of models and tools, you care about fine-tuning, you need image or video generation alongside text inference, or you want to use mainstream frameworks without wrestling with software ports.

Buy the QuietBox 2 if: you're running a fixed inference workload at scale, you specifically need to serve 70B+ models to multiple concurrent users at low latency, you care deeply about open-source silicon, or you want a fully supported system that doesn't require infrastructure changes to deploy.

The honest answer is that for most buyers spending $10K on local AI compute today, the dual RTX 5090 setup wins on practical versatility. CUDA's ecosystem depth isn't a lock-in tax — it's 18 years of collective optimization that you get for free the moment you plug the cards in.

But Tenstorrent's hardware is genuinely impressive. 476.5 tok/s on a 70B model from a box that runs on a standard wall outlet is not a small thing. As the software matures — and it is maturing, faster than most CUDA alternatives have — the gap will narrow.

The QuietBox 2 is a bet on where AI infrastructure is going. The dual RTX 5090 is what works right now.

tenstorrent rtx-5090 local-llm inference hardware-comparison blackhole cuda 2026
