Models & Quantization

FP16 (Half Precision)

16-bit floating-point format used for AI model weights — half the memory of FP32 with minimal quality loss for inference.

FP16 (16-bit floating-point) is the default precision format for deploying neural network models. It cuts model storage size in half compared to the FP32 (32-bit) format used during training, while preserving enough numerical precision for inference to be indistinguishable from FP32 for most tasks.

Memory Impact

The math is straightforward: 2 bytes per parameter. A 7B model at FP16 needs 14GB for the weights alone; a 13B model, 26GB; a 70B model, 140GB (the KV cache and activations come on top of that). This is why FP16 alone doesn't work on consumer hardware for larger models: VRAM requirements exceed what any single GPU offers. Quantization schemes like Q4_K_M are what make 70B models accessible on 40–48GB setups.
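The arithmetic above can be sketched as a quick sizing helper (a minimal sketch; the helper name and the decimal-GB convention are illustrative choices, not a standard API):

```python
def fp16_weight_size_gb(n_params_billion: float) -> float:
    """Weights-only size of a model stored at FP16 (2 bytes per parameter)."""
    bytes_total = n_params_billion * 1e9 * 2  # 2 bytes per FP16 value
    return bytes_total / 1e9                  # report in decimal GB

for size in (7, 13, 70):
    print(f"{size}B model at FP16: {fp16_weight_size_gb(size):.0f} GB weights")
# 7B -> 14 GB, 13B -> 26 GB, 70B -> 140 GB
```

Note this counts weights only; a real VRAM budget also needs room for the KV cache and activations.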

FP16 vs. FP32

During training, FP32 is preferred for numerical stability: its wide dynamic range (±3.4 × 10³⁸) keeps small gradients from underflowing to zero. But for inference, the model weights are fixed. The dynamic range of FP16 (±65,504) is sufficient for running a forward pass through a transformer, and the 2× memory saving is worth it.
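FP16's limited range is easy to demonstrate without any ML libraries: Python's stdlib `struct` module supports the IEEE 754 half-precision format via the `'e'` format code (the `to_fp16` helper below is an illustrative name, not a standard function):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

print(to_fp16(65504.0))   # largest normal FP16 value: survives intact
print(to_fp16(1e-10))     # far below FP16's smallest subnormal: flushes to 0.0

try:
    struct.pack('e', 70000.0)   # beyond FP16's ±65,504 range
except OverflowError as exc:
    print("overflow:", exc)
```

This is exactly the failure mode that makes FP16 awkward for training (gradients and loss values can leave this range) but harmless for a forward pass over fixed weights.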

Most inference software defaults to FP16 when you download an unquantized model. If you have the VRAM, FP16 produces the highest quality output, since no precision is lost relative to the original trained weights.
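When a checkpoint was trained or distributed at higher precision, casting it to FP16 does round each weight; a sketch of how small that rounding typically is (the sample value 0.1 is arbitrary, and `to_fp16` is an illustrative helper built on the stdlib `struct` module):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

w = 0.1                       # a typical small weight value
w16 = to_fp16(w)
rel_err = abs(w16 - w) / w
print(w16)        # 0.0999755859375
print(rel_err)    # ~2.4e-4: FP16 keeps roughly 3 decimal digits per weight
```

A per-weight relative error on the order of 10⁻⁴ is why FP16 inference is, for most tasks, indistinguishable from FP32.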

Hardware Acceleration

NVIDIA Tensor Cores, introduced with the Volta architecture and brought to consumer GPUs with Turing, natively accelerate FP16 matrix multiplication, making FP16 inference significantly faster than FP32 on modern hardware. This hardware path is why FP16 became the standard inference format.

FP16 vs. BF16

BF16 (Brain Float 16) uses the same 16 bits but allocates them differently: 8 exponent bits (the same range as FP32) in exchange for a shorter mantissa (7 bits versus FP16's 10). Because its range matches FP32's, BF16 handles extreme values gracefully with little risk of overflow, which is why it displaced FP16 as the training format on hardware that supports it. For inference, both formats produce similar quality on most models.
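One way to see the trade-off concretely: BF16 is FP32 with the low 16 mantissa bits dropped, so it can be emulated by bit-truncation (a sketch; real BF16 hardware rounds to nearest rather than truncating, and both helper names are illustrative):

```python
import struct

def to_bf16(x: float) -> float:
    """Approximate BF16 by truncating an FP32 value to its top 16 bits
    (1 sign + 8 exponent + 7 mantissa)."""
    bits32 = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits32 & 0xFFFF0000))[0]

def to_fp16(x: float) -> float:
    """Round-trip through IEEE 754 half precision (1 sign + 5 exp + 10 mantissa)."""
    return struct.unpack('e', struct.pack('e', x))[0]

big = 1e38
print(to_bf16(big))                    # representable: BF16 shares FP32's exponent range
try:
    to_fp16(big)                       # far beyond FP16's ±65,504
except OverflowError:
    print("FP16 overflow")

# The cost: fewer mantissa bits make BF16 coarser than FP16.
print(abs(to_bf16(0.1) - 0.1) / 0.1)   # ~3.9e-3, vs ~2.4e-4 for FP16
```

The asymmetry matches the section above: range matters most during training (BF16 wins), per-weight precision matters most at inference, where both formats are good enough for most models.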