TT-Forge software stack — Local AI Glossary | CraftRigs

TT-Forge is Tenstorrent's software stack for compiling and executing ML workloads on its Tensix-core accelerators, including the QuietBox 2 RISC-V AI workstation. It's the bridge between PyTorch/ONNX-style model graphs and the underlying Tenstorrent silicon — playing the same role CUDA plays for NVIDIA or ROCm plays for AMD.

What It Actually Does

TT-Forge ingests model graphs from frameworks like PyTorch, lowers them through MLIR-based compiler passes, and emits kernels that run on Tenstorrent's grid of Tensix cores. It sits on top of lower-level components that expose progressively more direct hardware control: TT-NN, an operator library, and beneath it TT-Metalium (the tt-metal repository), the kernel-level programming layer. The stack is open source, which is part of Tenstorrent's pitch against the closed NVIDIA ecosystem, but openness doesn't equal maturity.
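To make the ingest, lower, emit flow concrete, here is a toy sketch in plain Python. It does not use TT-Forge or MLIR. The `Op` type, the `decompose_linear` pass, and the `*_kernel` names are hypothetical, invented only to show the general shape of a lowering pipeline: a high-level op is rewritten into primitives, and each primitive is then mapped to a kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str      # operator kind, e.g. "linear" (hypothetical op names)
    inputs: tuple  # names of the values this op reads
    output: str    # name of the value this op produces


def decompose_linear(graph):
    """Lowering pass: rewrite each high-level 'linear' op as matmul + add."""
    lowered = []
    for op in graph:
        if op.name == "linear":
            x, w, b = op.inputs
            tmp = op.output + "_mm"  # intermediate value for the matmul result
            lowered.append(Op("matmul", (x, w), tmp))
            lowered.append(Op("add", (tmp, b), op.output))
        else:
            lowered.append(op)  # already a primitive; pass through unchanged
    return lowered


def emit_kernels(graph):
    """Final stage: map each primitive op to a (hypothetical) kernel call."""
    return [
        f"{op.name}_kernel({', '.join(op.inputs)}) -> {op.output}"
        for op in graph
    ]


# A two-op "model graph" as a framework front end might hand it over.
model_graph = [
    Op("linear", ("x", "w1", "b1"), "h"),
    Op("relu", ("h",), "y"),
]

kernels = emit_kernels(decompose_linear(model_graph))
for line in kernels:
    print(line)
```

A real compiler runs many such passes (layout, tiling, fusion, placement onto cores), but each one has this same graph-in, graph-out shape.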

The Maturity Problem

TT-Forge is early-stage. Support for the runtimes most local AI builders actually use, llama.cpp and Ollama, is unconfirmed at the time of the QuietBox 2 launch. Until that changes, popular GGUF workflows, the LM Studio ecosystem, and most quantized-model pipelines can't be assumed to drop in cleanly. Expect to work closer to the metal: porting models, writing or adapting kernels, and chasing operator coverage gaps. tinygrad's software stack faces the same gap: promising architecture, thin runtime support.
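An operator coverage gap is simply the set of ops a model needs that the backend does not implement. The sketch below shows that check in plain Python. The op names and the supported set are made-up illustration data, not TT-Forge's actual operator list, and `coverage_gaps` is a hypothetical helper rather than part of any Tenstorrent tool.

```python
def coverage_gaps(model_ops, supported_ops):
    """Return, sorted, the operators the model uses but the backend lacks."""
    return sorted(set(model_ops) - set(supported_ops))


# Hypothetical data: ops found in a model graph, and ops a backend supports.
model_ops = ["matmul", "add", "rms_norm", "rotary_embedding", "softmax", "matmul"]
supported_ops = {"matmul", "add", "softmax", "relu"}

missing = coverage_gaps(model_ops, supported_ops)
print(missing)  # ['rms_norm', 'rotary_embedding']
```

Every entry in `missing` is work for whoever is porting the model: decompose the op into supported primitives, or write a kernel for it.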

Why It Matters for Local AI

For most local LLM builders, "does it run Ollama?" is a hard requirement, and TT-Forge can't yet answer yes with confidence. That makes Tenstorrent hardware a developer buy: appealing if you want to contribute to an open accelerator stack or run custom inference pipelines, but a poor fit if you just want to load a Q4_K_M model and start chatting. Until TT-Forge ships first-class llama.cpp and Ollama paths, an NVIDIA or Apple Silicon rig will have a model serving tokens sooner, with far less setup, even if the raw silicon is less interesting.