BF16 (Brain Float 16)
Google's 16-bit floating-point format with the same exponent range as FP32 — preferred over FP16 for training and increasingly common in inference.
BF16 (Brain Float 16, or bfloat16) is a 16-bit floating-point format developed by Google Brain. Like FP16, it stores each value in 16 bits (2 bytes per parameter), but it allocates those bits differently: 1 sign bit, 8 exponent bits (matching FP32's range), and 7 mantissa bits (vs. FP16's 5 exponent bits and 10 mantissa bits).
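The bit split is easy to see in pure Python, because a BF16 value is just the top 16 bits of the corresponding FP32 encoding. A minimal sketch (the function names are hypothetical; real converters use round-to-nearest-even, while this one simply truncates):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the FP32 encoding: 1 sign bit,
    8 exponent bits, and the top 7 of FP32's 23 mantissa bits.
    Hardware converters round-to-nearest-even; this sketch truncates."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_fp32(bits: int) -> float:
    """Widen BF16 back to FP32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

b = fp32_to_bf16_bits(3.14159)
print(f"sign={b >> 15}, exponent={(b >> 7) & 0xFF}, mantissa={b & 0x7F}")
print(bf16_bits_to_fp32(b))  # 3.140625: pi with only 7 mantissa bits
```

Note that widening BF16 to FP32 is lossless, which is one reason the format is convenient in mixed-precision pipelines.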
Why BF16 Replaced FP16 for Training
The critical advantage: BF16 covers the same range of values as FP32 (up to ~3.4 × 10³⁸), while FP16 tops out at 65,504. During training, gradients can spike to large values. In FP16 these spikes overflow to infinity, which then propagates into NaN (not-a-number) values; FP16 training works around this with loss scaling, while BF16 absorbs large values natively. This is why modern training runs on TPUs (which originated BF16) and on newer NVIDIA GPUs (Ampere and beyond) default to BF16.
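The range difference can be demonstrated from the standard library alone. This sketch uses struct's "e" format (IEEE half precision) for FP16, and a hypothetical truncating helper for BF16:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE FP16 ("e" format). struct raises
    OverflowError for values beyond FP16's maximum of 65504."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Round-trip through BF16 by truncating FP32's low 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

grad = 70000.0  # a gradient spike just past FP16's ceiling
print(to_bf16(grad))  # 69632.0: coarse, but finite
try:
    to_fp16(grad)
except OverflowError:
    print("FP16 cannot represent 70000")
```

The BF16 result is off by several hundred, illustrating the trade: range is preserved at the cost of mantissa precision.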
BF16 vs. FP16 for Inference
For inference, the practical difference is small. Both formats use 2 bytes per parameter. BF16 has 3 fewer mantissa bits than FP16 (7 vs. 10), so its rounding steps are 8× coarser, which can cause marginally more rounding error on operations that require fine-grained precision. In practice, output quality differences are rarely perceptible.
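The precision gap shows up as the spacing between representable values: near 1.0, FP16 steps by 2⁻¹⁰ (~0.001) while BF16 steps by 2⁻⁷ (~0.008). A sketch (the BF16 helper truncates rather than rounds, which is enough to show the coarser grid):

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE half precision.
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    # Round-trip through BF16 by truncating FP32's low 16 bits.
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.003
print(to_fp16(x))  # 1.0029296875: within one FP16 step (2**-10)
print(to_bf16(x))  # 1.0: the 0.003 offset is below one BF16 step (2**-7)
```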
The real distinction matters for hardware compatibility. NVIDIA GPUs before Ampere (RTX 20 series and older) don't have native BF16 Tensor Core support — they emulate it, which is slower. Ampere and Ada Lovelace (RTX 30 and 40 series) handle BF16 natively and match FP16 throughput.
Models Shipped in BF16
Many recent models (Llama 3, Mistral v3, Gemma 2) are distributed in BF16 format. If your GPU supports it natively, you run them as-is. If not, inference software typically converts to FP16 automatically, with no user action needed.
Relationship to Quantization
Whether a model starts in FP16 or BF16 matters mostly if you're quantizing yourself. The GGUF quantization workflow (used by llama.cpp and Ollama) converts the source model to lower bit-depth formats. Starting from BF16 vs. FP16 produces nearly identical quantized outputs in practice.
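A toy experiment illustrates why the starting format barely matters: quantize the same weights to 4-bit codes once via an FP16 round trip and once via a BF16 round trip, and the codes almost always agree, because the quantization step is far coarser than either format's rounding error. This is a hypothetical symmetric absmax scheme, not llama.cpp's actual GGUF quantization:

```python
import random
import struct

def to_fp16(x: float) -> float:
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    bits = struct.unpack("<I", struct.pack("<f", x))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def quantize_q4(weights):
    """Toy symmetric absmax quantization to integer codes in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7
    return [round(w / scale) for w in weights]

random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(256)]
q_from_fp16 = quantize_q4([to_fp16(w) for w in weights])
q_from_bf16 = quantize_q4([to_bf16(w) for w in weights])
agree = sum(a == b for a, b in zip(q_from_fp16, q_from_bf16))
print(f"{agree}/256 quantized codes identical")
```

Only weights that happen to sit right on a quantization boundary can flip, which is why the two starting points yield nearly identical quantized models.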