Tensor Cores
Specialized compute units inside NVIDIA GPUs designed for matrix multiplication — the core operation in neural network inference.
Tensor Cores are dedicated matrix math units built into NVIDIA GPUs starting with the Volta architecture (2017). They're purpose-built for the matrix multiplication operations that dominate neural network computation, and they're the reason modern AI inference on NVIDIA GPUs is so fast.
Why Tensor Cores Matter for Local AI
LLM inference is fundamentally a sequence of matrix multiplications: every generated token requires multiplying large weight matrices by activation vectors. Standard CUDA cores can do this computation, but Tensor Cores do it roughly 8–16x faster at the same clock speed and power, because each Tensor Core instruction operates on a small matrix tile (e.g., a 16×16 fragment) rather than a single pair of floating-point values.
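As a rough worked example of the scale involved (the 4096-wide layer and the 7B parameter count below are hypothetical round numbers, not figures from this article), a matrix-vector multiply costs about 2·n² floating-point operations:

```latex
% One hypothetical 4096x4096 layer applied to a single activation vector:
2 \cdot 4096^2 \approx 3.4 \times 10^{7} \ \text{FLOPs per layer, per token}

% Summed over a hypothetical 7B-parameter model
% (rule of thumb: ~2 FLOPs per parameter per generated token):
2 \cdot 7 \times 10^{9} \approx 1.4 \times 10^{10} \ \text{FLOPs per token}
```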
When you see benchmark numbers like "127 t/s" on an RTX 4090, Tensor Cores are doing most of the heavy lifting. Without them, inference speeds would drop to a fraction of those numbers.
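To make "operating on matrix tiles" concrete, here is a minimal CUDA sketch using the WMMA API, where one warp issues a single 16×16×16 matrix-multiply-accumulate on the Tensor Cores. The kernel name, dimensions, and layouts are illustrative only; this is not how any particular inference engine is written.

```cuda
// Minimal sketch: one warp computes D = A * B for 16x16 half-precision
// tiles on the Tensor Cores via the CUDA WMMA API (requires sm_70+).
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void wmma_16x16x16(const half *A, const half *B, float *D) {
    // Fragments are per-warp register tiles that the Tensor Cores consume.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);    // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);  // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // one Tensor Core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp (e.g. `wmma_16x16x16<<<1, 32>>>(dA, dB, dD);`) and compiled with `nvcc -arch=sm_70` or newer, this performs one tile's worth of work; real inference kernels tile thousands of these fragments across the GPU.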
Tensor Core Generations
Each GPU generation brings improved Tensor Cores:
- Turing (RTX 20 series): 2nd gen Tensor Cores (Volta introduced the 1st), FP16/INT8/INT4 support
- Ampere (RTX 30 series): 3rd gen — added TF32 and BF16, doubled throughput vs. Turing
- Ada Lovelace (RTX 40 series): 4th gen — added FP8 support, significant throughput increases
- Blackwell (RTX 50 series): 5th gen — FP4 support, further throughput improvements
For local AI in 2026, Ampere (RTX 30 series) or newer is recommended. Ada Lovelace Tensor Cores have meaningful advantages for quantized inference at INT8 and FP8.
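If you want to check which generation a given card falls into, one rough approach (a sketch assuming the usual mapping from CUDA compute capability to Tensor Core generation) is to query the device properties with the CUDA runtime API:

```cuda
// Sketch: map a GPU's compute capability to its Tensor Core generation.
// The mapping table below is an assumption based on published spec sheets.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0; error checking omitted

    const char *gen = "no Tensor Cores (pre-Volta)";
    if (prop.major == 7 && prop.minor == 0) gen = "1st gen (Volta)";
    else if (prop.major == 7 && prop.minor == 5) gen = "2nd gen (Turing)";
    else if (prop.major == 8 && prop.minor <= 7) gen = "3rd gen (Ampere)";
    else if (prop.major == 8 && prop.minor == 9) gen = "4th gen (Ada Lovelace)";
    else if (prop.major == 9)                    gen = "4th gen (Hopper)";
    else if (prop.major >= 10)                   gen = "5th gen (Blackwell)";

    printf("%s: compute capability %d.%d -> %s\n",
           prop.name, prop.major, prop.minor, gen);
    return 0;
}
```

Consumer cards report compute capability 8.6 for Ampere (RTX 30 series) and 8.9 for Ada Lovelace (RTX 40 series).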
Precision Formats and Throughput
Tensor Cores operate on multiple precision formats; as a rule of thumb, each halving of precision roughly doubles peak throughput:
- FP16/BF16: ~2x standard CUDA core (FP32) throughput on recent consumer GPUs
- INT8: ~2x FP16 Tensor Core throughput
- INT4: ~2x INT8 throughput (supported since Turing)
- FP8: comparable throughput to INT8, with better dynamic range (Ada and newer)
Most Q4-quantized local models are executed through INT8 (and sometimes INT4) integer Tensor Core kernels, and the 4-bit weights also cut memory traffic, which is why quantized inference on an RTX 4090 is dramatically faster than naive FP16 computation.
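To exercise a Tensor Core path without writing kernels yourself, a library GEMM is enough. The sketch below (illustrative only; the function name and dimensions are made up, and this is not how any specific local inference engine works) calls cuBLAS with an FP16 compute type, which dispatches to Tensor Core kernels on Volta-class and newer GPUs:

```cuda
// Sketch: FP16 GEMM through cuBLAS (column-major, no transposes).
// Link with -lcublas; error checking omitted for brevity.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_fp16_tensor_cores(cublasHandle_t handle,
                            const __half *A, const __half *B, __half *C,
                            int m, int n, int k) {
    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    // CUBLAS_COMPUTE_16F lets cuBLAS pick Tensor Core kernels where available.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT);
}
```

Quantized inference engines typically pair custom dequantization or integer dot-product kernels with the Tensor Cores rather than calling a generic FP16 GEMM like this one.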