CraftRigs
Technical Report

Tenstorrent QuietBox 2: The First Open-Source AI Workstation — What Local LLM Builders Need to Know

By Chloe Smith · 5 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Quick Summary

  • What it is: 4x Blackhole ASIC workstation, 128GB GDDR6, 2,654 TFLOPS BlockFP8, $9,999, ships Q2 2026
  • The catch: TT-Forge software stack is early-stage; llama.cpp/Ollama support unconfirmed — this is a developer buy, not a plug-and-play rig
  • The alternative: Dual RTX 4090 ($4k-6k, 48GB VRAM, mature CUDA) or M5 Max 128GB ($4,999, unified memory) both offer better software maturity today

Jim Keller has spent decades designing chips at AMD, Intel, Apple, and Tesla. Now at Tenstorrent, he's taking a different bet: an open-source AI chip company built on RISC-V cores and a clean-room software stack. The QuietBox 2 is his most concrete statement yet — a $9,999 AI workstation that ships Q2 2026 and targets every pain point with NVIDIA's closed ecosystem.

For local LLM builders, this is worth paying close attention to. Whether you buy one is a different question.

What's Inside the QuietBox 2

The QuietBox 2 centers on four Blackhole ASICs — Tenstorrent's latest-generation chip, fabricated at TSMC on a 6nm process. Each Blackhole die includes a large array of Tensix processing cores alongside RISC-V management cores that handle scheduling and communication without depending on a separate CPU. The workstation ships with 128GB GDDR6 across the four chips, which puts it ahead of dual RTX 4090 configurations (48GB) and on par with the M5 Max 128GB in raw memory capacity.

The headline compute figure is 2,654 TFLOPS at BlockFP8 precision. That's the format Tenstorrent has optimized its hardware for — not the standard FP8 you'll see on NVIDIA spec sheets. In practice, BlockFP8 delivers better accuracy than standard INT8 while hitting similar throughput, which matters for larger-parameter models where quantization artifacts become visible.
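The core idea behind block floating-point formats is that values in a block share a single exponent while each value keeps its own 8-bit mantissa — roughly FP-like dynamic range at INT8-like storage and throughput. Here's a minimal sketch of that general idea; the block size and bit layout below are illustrative assumptions, not Tenstorrent's published BlockFP8 format:

```python
import math

def blockfp8_quantize(values, block_size=16):
    """Quantize floats with one shared power-of-two exponent per block
    and an 8-bit signed mantissa per value. General block floating-point
    sketch; the real BlockFP8 layout may differ."""
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        # Shared exponent: pick a power-of-two scale so the largest
        # magnitude in the block fits an 8-bit signed mantissa (max 127)
        max_abs = max(abs(v) for v in block) or 1.0
        scale = 2.0 ** math.ceil(math.log2(max_abs / 127.0))
        # Each value stores only its own quantized mantissa
        mantissas = [max(-127, min(127, round(v / scale))) for v in block]
        out.extend(m * scale for m in mantissas)
    return out

# Large values in a block survive well; a tiny value sharing a block
# with a large one loses precision -- the core block-FP tradeoff
deq = blockfp8_quantize([0.1, -0.5, 3.2, 0.01], block_size=4)
```

The tradeoff is visible in the example: values near the block maximum round-trip almost exactly, while the 0.01 entry is crushed toward zero by the shared exponent — which is why the format performs best when nearby tensor values have similar magnitudes.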

The chassis design reflects the "Quiet" in QuietBox — acoustic dampening, managed airflow, and a form factor designed to sit next to a workstation desk rather than in a server room.

The Open-Source Angle

This is where Tenstorrent genuinely differentiates. The entire software stack is open source:

TT-Forge is the primary compiler, handling model ingestion from PyTorch, ONNX, and JAX. It performs graph optimization, operator fusion, and the mapping of compute graphs onto Tensix cores. The compiler repo is public on GitHub, which means you can inspect exactly what transformations are happening to your models — something NVIDIA's TensorRT doesn't offer.

TT-Metalium is the low-level runtime, equivalent to CUDA at the kernel level. It's also open source, which means researchers can write custom kernels that run directly on Blackhole hardware without going through vendor approval.

TT-NN provides the neural network op library that sits between TT-Metalium and TT-Forge — think cuDNN but without the closed-source binary distribution.

For enterprise and research users concerned about vendor lock-in, this stack represents something genuinely new. You own the entire chain from hardware to model output with no black boxes.

Software Maturity: The Honest Assessment

The open-source stack is real, but it's early. Here's what you need to know before writing a check:

llama.cpp integration is unconfirmed. The llama.cpp project has no current Tenstorrent backend. Running Llama 3, Mistral, Qwen, or any of the popular open-weight models through llama.cpp requires porting work that hasn't shipped publicly. Ollama, which sits on top of llama.cpp, has the same gap.

TT-Forge supports PyTorch models — so if you're willing to run models directly through PyTorch with the TT-Forge compiler, you have access to anything on HuggingFace. The workflow is more involved than ollama run llama3:70b, but it's not theoretical.

Community size is small. The NVIDIA ecosystem has thousands of contributors, commercially motivated tooling vendors, and years of production hardening. Tenstorrent has a passionate but small developer community. If you hit a bug or an unsupported op, the turnaround time for fixes is measured in weeks, not days.

The RISC-V management cores are an unknown for most builders. They handle scheduling internally, which is actually a feature (less CPU bottleneck), but troubleshooting issues requires understanding a different architecture than x86.

How It Compares to Your Alternatives

vs. Dual RTX 4090 ($4,000-6,000, 48GB VRAM)

The dual 4090 setup costs significantly less and delivers the most mature GPU inference stack available. Every model on Ollama works. llama.cpp runs natively. ExLlamaV2, vLLM, TGI — all production-ready. The VRAM gap (48GB vs 128GB) is the legitimate concern: running 70B models in Q4 quantization requires ~40GB, which fits dual 4090s but leaves no headroom. The QuietBox 2's 128GB gives you room for unquantized 34B or mixed-precision 70B runs.
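The VRAM arithmetic above is easy to sanity-check: weight memory is roughly parameter count times bits per weight, with KV cache and runtime buffers adding several GB on top. A quick back-of-envelope helper (the ~4.5 bits/weight figure for Q4-class quantization is an approximation, not an exact spec):

```python
def weight_gb(params_billion, bits_per_weight):
    """Weight memory only, in GB. KV cache and runtime buffers
    add several GB on top, so treat these as lower bounds."""
    return params_billion * bits_per_weight / 8

# 70B at ~4.5 bits/weight (typical 4-bit quant): ~39 GB -> tight on 48GB
q4_70b = weight_gb(70, 4.5)

# 34B unquantized at FP16: 68 GB -> needs a 128GB-class machine
fp16_34b = weight_gb(34, 16)
```

Run the same formula against any model card: a 7B model at FP16 is ~14 GB, and at 4-bit roughly 4 GB, which is why consumer GPUs handle the small end comfortably.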

Choose dual 4090 if: You want to run production models today with minimal friction.

Choose QuietBox 2 if: You're specifically researching Tenstorrent's architecture or need the memory headroom and can absorb the software setup cost.

vs. Apple M5 Max 128GB ($4,999)

The M5 Max 128GB is the more direct competitor on memory capacity. Apple Silicon's unified memory architecture means models load directly without VRAM limitations, and mlx-lm provides a mature inference layer purpose-built for Apple's hardware. Real-world 70B inference runs at 12-18 t/s on the M5 Max — slower than what Tenstorrent promises but in a machine that costs $5,000 less.
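Those throughput figures are consistent with single-stream decode being memory-bandwidth bound: generating each token streams the full weight set through memory once. A rough roofline estimate makes the shape of the numbers clear — note the bandwidth figure below is an illustrative assumption for a unified-memory machine, not a published Apple spec:

```python
def decode_tps_upper_bound(mem_bandwidth_gbs, model_gb):
    """Roofline upper bound on single-stream decode tokens/sec:
    one full pass over the weights per token. Real throughput lands
    below this due to compute, KV-cache reads, and scheduling."""
    return mem_bandwidth_gbs / model_gb

# Assuming ~550 GB/s unified-memory bandwidth (illustrative) and a
# ~39 GB 70B Q4 model: ~14 t/s -- the same ballpark as reported figures
bound = decode_tps_upper_bound(550, 39)
```

The same formula explains why quantization speeds up decode even when compute is plentiful: halving bytes per weight roughly doubles the bandwidth-bound token rate.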

The M5 Max is also silent, runs macOS with full application compatibility, and requires zero driver management.

Choose M5 Max if: You want high-memory-capacity inference with a mature software stack and don't need raw TFLOPS.

Choose QuietBox 2 if: You specifically need the compute density and are building for research or multi-model concurrent workloads.

See our GPU comparison guide, the M5 Max 128GB benchmark reality check, and M4 Max vs RTX 4090 head-to-head for deeper numbers.

The Jim Keller Factor

Keller's track record is real. Zen (AMD), A4/A5 (Apple), FSD (Tesla) — the chip design resume is legitimate. Tenstorrent's architecture bets are coherent: RISC-V management cores reduce dependency on proprietary microcontrollers, open-source software reduces switching costs, and the Tensix dataflow architecture handles sparse computation patterns better than traditional GPU tensor cores.

The question isn't whether the hardware is well-designed. It's whether the software ecosystem can close the gap with CUDA before the next NVIDIA generation ships.

Who Should Buy This

Buy if:

  • You're a researcher or developer specifically studying Tenstorrent's architecture
  • You need 128GB+ inference capacity and budget rules out multi-GPU NVIDIA setups
  • You want to be early on an ecosystem and can contribute to TT-Forge integration work
  • Your organization has a strategic reason to reduce NVIDIA dependency

Wait if:

  • You want to run Ollama or llama.cpp models out of the box
  • You need reliable production inference today
  • Budget is the primary constraint — the dual 4090 path costs $4k-6k less
  • You're not prepared to debug compiler issues or write PyTorch integration code

The QuietBox 2 is a serious machine from a serious team. It's also a first-mover product in an ecosystem that is 12-24 months away from matching CUDA's operational maturity. For most local LLM builders, the right move is to watch closely, let the community build the llama.cpp backend, and revisit when the QuietBox 3 arrives. For what NVIDIA is announcing for local AI builders this week, see our GTC 2026 coverage hub.

If the QuietBox 2's software immaturity gives you pause and you want a proven alternative for high-memory-capacity inference, the NVIDIA DGX Spark vs Mac Studio vs AMD Strix Halo comparison covers the current state of purpose-built AI workstations with mature software stacks. For builders who want large model capacity without the $9,999 price tag, the AMD Strix Halo mini PC vs Mac Mini M4 guide covers 128GB unified memory options starting at $1,100.

tenstorrent ai-workstation risc-v local-llm hardware
