TL;DR — BF16 uses half the VRAM of FP32 with negligible accuracy loss. RTX 40-series runs it natively at full speed. RTX 30-series and older should use FP16 instead — same memory savings, broader hardware and software support. Only use FP32 if you're debugging or running a niche model that explicitly requires it.
What Are BF16 and FP32?
FP32 (32-bit float) stores each model parameter in 4 bytes. BF16 (brain float 16) stores the same parameter in 2 bytes, using a different bit layout designed specifically for neural networks.
That's the practical summary. Here's the technical difference: FP32 uses 1 sign bit, 8 exponent bits, and 23 mantissa bits. BF16 uses 1 sign bit, 8 exponent bits, and 7 mantissa bits. Same dynamic range — the exponent determines how large or small a number can be — half the mantissa precision.
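The bit split is easy to see directly. A sketch in pure Python (no ML libraries): BF16 is, to a first approximation, the top 16 bits of the FP32 bit pattern, so masking off the low 16 bits shows exactly what precision is lost.

```python
import struct

def bits32(x: float) -> str:
    """FP32 bit pattern as sign | 8-bit exponent | 23-bit mantissa."""
    b = format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")
    return f"{b[0]} {b[1:9]} {b[9:]}"

def truncate_to_bf16(x: float) -> float:
    """Keep the top 16 FP32 bits: sign + 8 exponent bits + 7 mantissa bits."""
    i = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", i & 0xFFFF0000))[0]

print(bits32(3.14159))            # sign, exponent, mantissa fields of FP32 pi
print(truncate_to_bf16(3.14159))  # 3.140625: same exponent, less mantissa detail
```

Real BF16 hardware rounds to nearest rather than truncating, but the layout point is the same: the exponent field survives intact, so the representable range doesn't shrink.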
Think of it like image compression. FP32 is the full-quality original. BF16 keeps the same tonal range but fewer gradations between tones — reduced color depth, not a cropped range. The mantissa is the color depth; the exponent is the tonal range, and it stays intact. In neural networks, range is what matters. The fine gradations you lose in the mantissa don't measurably affect outputs.
Why Data Types Matter for Your Local LLM Build
This is the single biggest lever on VRAM consumption before you start quantizing. Most guides skip it entirely.
The math is simple: every billion parameters costs 4 GB in FP32 and 2 GB in BF16.
- Llama 3.1 8B in FP32: 32 GB VRAM — won't fit on any consumer GPU
- Llama 3.1 8B in BF16: 16 GB — fits on an RTX 4080, tight
- Llama 3.1 8B Q4_K_M: 4.9 GB — fits on an RTX 3060 with room to spare
That's the difference between needing a $1,600 RTX 4080 and fitting on a $300 used card. The dtype setting isn't a subtle optimization — it's the difference between the model loading and not loading.
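The per-parameter arithmetic above fits in a one-liner (weight memory only; KV cache and runtime overhead come on top):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: int) -> float:
    # 1e9 parameters at N bytes each = N GB per billion parameters
    return params_billion * bytes_per_param

print(weight_vram_gb(8, 4))  # FP32: 32.0 GB
print(weight_vram_gb(8, 2))  # BF16: 16.0 GB
```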
BF16 vs FP32: VRAM by Model Size

| Model size | FP32 VRAM | BF16 VRAM | Consumer GPU That Fits |
|---|---|---|---|
| 7B | 28 GB | 14 GB | RTX 4080 (BF16) |
| 8B | 32 GB | 16 GB | RTX 4080 (BF16) |

FP32 versions of 7B+ models are not viable without quantization — they don't fit on any single consumer GPU. BF16 gets you 7B-8B models on high-end consumer cards. You still need quantization for most builds — but dtype is the first decision to make.
BF16 vs FP32: Speed by GPU Generation
Memory footprint is only half the story. Execution speed depends on whether your GPU has native BF16 hardware support.
RTX 40-series (Ada Lovelace): Native BF16 tensor cores. BF16 ops run at the same speed as FP16 — no conversion overhead, no performance penalty. This is the tier where BF16 is the clear default choice.
RTX 30-series (Ampere): Native BF16 tensor core support at full speed. BF16 and FP16 run at equivalent throughput on Ampere. Use FP16 for slightly better software compatibility with older inference tools (some early llama.cpp builds had more FP16 testing), but BF16 has no hardware performance penalty on RTX 30-series.
RTX 20-series and older: No native BF16 support. BF16 ops fall back to software conversion, which adds a real speed penalty. Use FP16 on these cards — same 2 bytes per parameter, better hardware utilization.
Tip
The rule of thumb: RTX 40-series → BF16. RTX 30-series and older → FP16. Both give you the same memory footprint. The hardware speed difference only matters on Turing (RTX 20-series) and older; on Ampere, FP16 is purely a software-compatibility preference.
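The rule of thumb maps cleanly onto CUDA compute capability (Turing is 7.5, Ampere 8.0/8.6, Ada 8.9). A sketch — `pick_dtype` is a hypothetical helper encoding this article's rule, not a library function:

```python
def pick_dtype(compute_capability: tuple[int, int]) -> str:
    # Ada (8.9) and newer: BF16 is the default.
    # Ampere (8.0/8.6) runs BF16 at full speed too, but the rule above
    # prefers FP16 there for broader software compatibility.
    return "bf16" if compute_capability >= (8, 9) else "fp16"

print(pick_dtype((8, 9)))  # RTX 40-series (Ada): bf16
print(pick_dtype((8, 6)))  # RTX 30-series (Ampere): fp16
print(pick_dtype((7, 5)))  # RTX 20-series (Turing): fp16
```

In PyTorch you can get the tuple from `torch.cuda.get_device_capability()`.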
How BF16 and FP32 Actually Work
Both formats represent floating-point numbers — real numbers with a decimal point — but they make different tradeoffs between range and precision.
FP32 has 23 mantissa bits. That's enough precision to represent extremely fine gradations between values. This matters a lot during training, when tiny gradient updates need to accumulate accurately across thousands of steps. It matters much less during inference, where you're just running weights that were already trained.
BF16 cuts the mantissa to 7 bits but keeps the same 8-bit exponent as FP32. The dynamic range stays identical. Google Brain developed BF16 specifically for deep learning because neural networks are more sensitive to underflow/overflow (range) than to fine-grained precision (mantissa bits).
FP32 (Full Precision)
4 bytes per parameter. High enough precision to prevent gradient underflow during training. For local inference, it's almost always pure VRAM waste. The only legitimate use cases are numerical debugging, very small models where 2 extra GB doesn't matter, or a niche model that was specifically validated only in FP32.
BF16 (Brain Float 16)
2 bytes per parameter. Developed by Google Brain for deep learning. Same dynamic range as FP32 with half the memory. Native hardware support on RTX 40-series, A100, H100, and Apple Silicon M1+. Default dtype for Llama 3, Mistral, Qwen 2.5, Gemma 2 — these models were trained and validated at BF16. Running them in FP32 upcasts to a precision they were never validated at.
FP16 (Half Precision)
Also 2 bytes per parameter. Same memory footprint as BF16, but uses 5 exponent bits and 10 mantissa bits instead of BF16's 8/7 split. The smaller exponent range makes FP16 prone to overflow and underflow during training — but for inference only, it's essentially equivalent to BF16 in output quality. The preferred choice on RTX 30-series and older where native BF16 hardware support is incomplete.
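The range difference is easy to demonstrate with NumPy, which ships FP16 (but not BF16):

```python
import numpy as np

# FP16's 5-bit exponent tops out at 65504; larger values overflow to
# infinity. BF16's 8-bit exponent matches FP32's range (up to ~3.4e38),
# so the same value is no problem there.
print(np.float16(70000.0))  # inf
print(np.float32(70000.0))  # 70000.0, comfortably in range
```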
"BF16 Hurts Model Quality" — The Myth That Won't Die
This idea surfaces constantly in forums. It's wrong for inference, and here's why.
The accuracy gap between BF16 and FP32 is unmeasurable on standard benchmarks. MMLU, HellaSwag, and HumanEval scores for BF16 inference are within noise margin versus FP32 — we're talking fractions of a percentage point, not meaningful differences.
The models you're downloading were already trained in BF16 mixed precision. Llama 3, Mistral, Qwen 2.5 — their official weights are released in BF16. When you run them in FP32, you're not adding precision; you're upcasting to a format the model was never validated at. You're not running higher quality — you're running different.
Quantized models introduce far more precision loss than BF16-vs-FP32 ever would. Q4_K_M drops MMLU by roughly 1 point from the FP16 baseline. BF16 vs FP32 is a rounding error compared to that. And everyone runs quantized models without concern.
Where did the myth come from? BF16 was historically a training dtype, introduced in the late 2010s as a way to save memory during gradient computation. In that context, precision genuinely mattered — tiny miscalculations in gradients compound across training steps. People learned "BF16 = precision loss" in that era and never updated when inference became the dominant use case. The concern is real for training. It doesn't apply to running a finished model.
Note
If you see a forum post arguing for FP32 over BF16 for inference quality, ask when it was written. Pre-2021 advice about model dtypes is almost certainly training advice, not inference advice.
BF16 in Practice — RTX 4070 Running Llama 3.1 8B
Real test configuration: RTX 4070 (12 GB GDDR6X), Ollama 0.3.x with llama.cpp backend.
Llama 3.1 8B BF16: 16.1 GB VRAM required — doesn't fit on the 4070's 12 GB. It loads partially to CPU RAM and runs at around 8 tok/s decode. That's technically working, but painfully slow. Not a usable daily driver.
Llama 3.1 8B Q8_0: 8.6 GB VRAM — fits with 3.4 GB headroom. Around 42 tok/s decode. Output quality is nearly identical to BF16. This is the sweet spot for 12 GB cards: BF16-level quality at 53% of the VRAM.
Llama 3.1 8B Q4_K_M: 4.9 GB VRAM — 55 tok/s decode. Minor quality drop on complex multi-step reasoning, but faster and leaves room for longer context windows.
The practical conclusion for 12 GB cards: Q8_0 gives you BF16 quality. Run BF16 native only on RTX 4080 (16 GB) or RTX 4090 (24 GB) where the full model fits without compromises.
| Format | VRAM | Decode speed (RTX 4070) | Quality vs FP32 |
|---|---|---|---|
| BF16 | 16.1 GB | ~8 tok/s (CPU offload) | N/A (won't fit) |
| Q8_0 | 8.6 GB | ~42 tok/s | −0.2 MMLU |
| Q4_K_M | 4.9 GB | ~55 tok/s | −0.9 MMLU |
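A quick way to apply these numbers to your own card. The figures are the Llama 3.1 8B measurements above; the 1.5 GB headroom for KV cache and CUDA overhead is an assumption — tune it for your context length:

```python
LLAMA31_8B_GB = {"bf16": 16.1, "q8_0": 8.6, "q4_k_m": 4.9}

def viable_formats(vram_gb: float, headroom_gb: float = 1.5) -> list[str]:
    # A format is viable only if weights plus headroom fit entirely in
    # VRAM; anything else means CPU offload, which is not worth running.
    return [f for f, gb in LLAMA31_8B_GB.items() if gb + headroom_gb <= vram_gb]

print(viable_formats(12.0))  # RTX 4070: ['q8_0', 'q4_k_m']
print(viable_formats(24.0))  # RTX 4090: ['bf16', 'q8_0', 'q4_k_m']
```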
Warning
Running BF16 with CPU offload (partial VRAM fit) gives you the worst of both worlds — high memory use and slow speed. If your model doesn't fit in VRAM, quantize it. Don't rely on CPU RAM as an overflow.
Dtype by Budget Tier
$1,200 tier (RTX 3060 12 GB / RTX 4060 8 GB): These cards can't fit any 7B+ model in BF16 or FP16. Run Q4_K_M or Q8_0 quantized models. Dtype selection happens at the quantization layer, not at the FP32/BF16 setting.
$2,000 tier (RTX 4070 Ti Super 16 GB): Llama 3.1 8B BF16 at 16.1 GB essentially fills the card — workable only with minimal context. More practical to run Q8_0 at 8.6 GB and leave room for context. It's an Ada card with native BF16 support, so when you do run BF16, it's at full speed.
$4,500+ tier (RTX 4090 24 GB): Run BF16 natively. 8B models load comfortably at 16 GB, leaving 8 GB for context. This is the first tier where "just run BF16" is the clean answer without VRAM math gymnastics.
Related Concepts
- Quantization — the step after dtype. INT4 and INT8 shrink models further than BF16 can, enabling large models on small GPUs.
- VRAM — dtype determines how much of it you consume. Every billion parameters costs 2 GB in BF16, 4 GB in FP32.
- Memory bandwidth — the other half of dtype speed. LLM decoding is largely memory-bound, so BF16's 2-byte weights move half the data of FP32 per token; higher-bandwidth cards execute it faster regardless of tensor core generation.
- FP16 — the BF16 alternative for older GPUs. Same 2-byte footprint, slightly different numeric range tradeoffs, better hardware support on Ampere and Turing.