Tensor Cores
Specialized compute units inside NVIDIA GPUs designed for matrix multiplication — the core operation in neural network inference.
Tensor Cores are dedicated matrix math units built into NVIDIA GPUs starting with the Volta architecture (2017). They're purpose-built for the matrix multiplication operations that dominate neural network computation, and they're the reason modern AI inference on NVIDIA GPUs is so fast.
Why Tensor Cores Matter for Local AI
LLM inference is fundamentally a sequence of matrix multiplications: every generated token requires multiplying large weight matrices by activation vectors. Standard CUDA cores can do this computation, but Tensor Cores do it roughly 8–16x faster at the same clock speed and power, because each Tensor Core instruction operates on a small matrix tile (e.g., a 16×16 fragment) rather than a single pair of floating-point values.
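As a rough worked example of the scale involved (the 4096-wide layer and the 7B parameter count below are hypothetical round numbers, not figures from this article), a matrix-vector multiply costs about 2·n² floating-point operations:

```latex
% One hypothetical 4096x4096 layer applied to a single activation vector:
2 \cdot 4096^2 \approx 3.4 \times 10^{7} \ \text{FLOPs per layer, per token}

% Summed over a hypothetical 7B-parameter model
% (rule of thumb: ~2 FLOPs per parameter per generated token):
2 \cdot 7 \times 10^{9} \approx 1.4 \times 10^{10} \ \text{FLOPs per token}
```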
When you see benchmark numbers like "127 t/s" on an RTX 4090, Tensor Cores are doing most of the heavy lifting. Without them, inference speeds would drop to a fraction of those numbers.
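To make "operating on matrix tiles" concrete, here is a minimal CUDA sketch using the WMMA API, where one warp issues a single 16×16×16 matrix-multiply-accumulate on the Tensor Cores. The kernel name, dimensions, and layouts are illustrative only; this is not how any particular inference engine is written.

```cuda
// Minimal sketch: one warp computes D = A * B for 16x16 half-precision
// tiles on the Tensor Cores via the CUDA WMMA API (requires sm_70+).
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void wmma_16x16x16(const half *A, const half *B, float *D) {
    // Fragments are per-warp register tiles that the Tensor Cores consume.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);    // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);  // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // one Tensor Core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp (e.g. `wmma_16x16x16<<<1, 32>>>(dA, dB, dD);`) and compiled with `nvcc -arch=sm_70` or newer, this performs one tile's worth of work; real inference kernels tile thousands of these fragments across the GPU.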
Tensor Core Generations
Each GPU generation brings improved Tensor Cores:
- Turing (RTX 20 series): 2nd gen Tensor Cores (Volta introduced the 1st), FP16/INT8/INT4 support
- Ampere (RTX 30 series): 3rd gen — added TF32 and BF16, doubled throughput vs. Turing
- Ada Lovelace (RTX 40 series): 4th gen — added FP8 support, significant throughput increases
- Blackwell (RTX 50 series): 5th gen — FP4 support, further throughput improvements
For local AI in 2026, Ampere (RTX 30 series) or newer is recommended. Ada Lovelace Tensor Cores have meaningful advantages for quantized inference at INT8 and FP8.
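If you want to check which generation a given card falls into, one rough approach (a sketch assuming the usual mapping from CUDA compute capability to Tensor Core generation) is to query the device properties with the CUDA runtime API:

```cuda
// Sketch: map a GPU's compute capability to its Tensor Core generation.
// The mapping table below is an assumption based on published spec sheets.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0; error checking omitted

    const char *gen = "no Tensor Cores (pre-Volta)";
    if (prop.major == 7 && prop.minor == 0) gen = "1st gen (Volta)";
    else if (prop.major == 7 && prop.minor == 5) gen = "2nd gen (Turing)";
    else if (prop.major == 8 && prop.minor <= 7) gen = "3rd gen (Ampere)";
    else if (prop.major == 8 && prop.minor == 9) gen = "4th gen (Ada Lovelace)";
    else if (prop.major == 9)                    gen = "4th gen (Hopper)";
    else if (prop.major >= 10)                   gen = "5th gen (Blackwell)";

    printf("%s: compute capability %d.%d -> %s\n",
           prop.name, prop.major, prop.minor, gen);
    return 0;
}
```

Consumer cards report compute capability 8.6 for Ampere (RTX 30 series) and 8.9 for Ada Lovelace (RTX 40 series).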
Precision Formats and Throughput
Tensor Cores operate on multiple precision formats; as a rule of thumb, each halving of precision roughly doubles peak throughput:
- FP16/BF16: ~2x standard CUDA core (FP32) throughput on recent consumer GPUs
- INT8: ~2x FP16 Tensor Core throughput
- INT4: ~2x INT8 throughput (supported since Turing)
- FP8: comparable throughput to INT8, with better dynamic range (Ada and newer)
Most Q4-quantized local models are executed through INT8 (and sometimes INT4) integer Tensor Core kernels, and the 4-bit weights also cut memory traffic, which is why quantized inference on an RTX 4090 is dramatically faster than naive FP16 computation.
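To exercise a Tensor Core path without writing kernels yourself, a library GEMM is enough. The sketch below (illustrative only; the function name and dimensions are made up, and this is not how any specific local inference engine works) calls cuBLAS with an FP16 compute type, which dispatches to Tensor Core kernels on Volta-class and newer GPUs:

```cuda
// Sketch: FP16 GEMM through cuBLAS (column-major, no transposes).
// Link with -lcublas; error checking omitted for brevity.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_fp16_tensor_cores(cublasHandle_t handle,
                            const __half *A, const __half *B, __half *C,
                            int m, int n, int k) {
    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    // CUBLAS_COMPUTE_16F lets cuBLAS pick Tensor Core kernels where available.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT);
}
```

Quantized inference engines typically pair custom dequantization or integer dot-product kernels with the Tensor Cores rather than calling a generic FP16 GEMM like this one.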