Quantization
Reducing a model's numerical precision to shrink its memory footprint with minimal quality loss.
Neural network weights are originally trained at high precision — typically FP16 or BF16, using 16 bits per value. Quantization compresses those values to fewer bits, reducing the model's size in memory. A 70B model at FP16 occupies ~140GB of VRAM. Quantized to Q4_K_M, the same model drops to around 40GB — a 72% reduction in size, with a modest drop in output quality.
This is what makes running large models on consumer hardware possible at all.
Common Quantization Levels
| Format | Bits per weight | 7B size | 13B size | 70B size |
|---|---|---|---|---|
| FP16 | 16 | ~14GB | ~26GB | ~140GB |
| Q8_0 | 8 | ~7GB | ~13GB | ~70GB |
| Q6_K | 6 | ~5.5GB | ~10GB | ~53GB |
| Q5_K_M | 5 | ~4.8GB | ~8.5GB | ~48GB |
| Q4_K_M | 4 | ~4.5GB | ~7.8GB | ~40GB |
| Q3_K_M | 3 | ~3.5GB | ~6GB | ~30GB |
| Q2_K | 2 | ~2.7GB | ~4.7GB | ~24GB |
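The table values follow from simple arithmetic: size is roughly parameter count times bits per weight. A minimal sketch (the function name and the ~4.8 effective bits-per-weight figure for Q4_K_M are illustrative assumptions, not from a specific tool):

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough model size estimate: parameters x bits per weight.

    Real quantized files run slightly larger than the nominal bit depth
    suggests, because block-wise formats store per-block scales and the
    file carries tokenizer/metadata alongside the weights.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# FP16 at 70B: 70e9 params x 2 bytes each = ~140GB, matching the table
print(round(approx_model_size_gb(70, 16)))   # 140
# Q4_K_M effectively averages closer to ~4.8 bits/weight, which is why
# the table shows ~40GB for 70B rather than a flat 35GB
print(round(approx_model_size_gb(70, 4.8)))  # 42
```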
Quality vs Size Trade-Offs
More aggressive quantization (fewer bits per weight) means a smaller model but more noticeable quality loss. The "K" variants (Q4_K_M, Q5_K_M, etc.) use a method called K-quants that applies different precision levels to different parts of the model, preserving quality better than naive uniform quantization at the same nominal bit depth.
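The core mechanics can be illustrated with a naive symmetric 4-bit block quantizer. This is a simplified sketch of the general idea, not the actual K-quant algorithm, which refines it with super-blocks, quantized scales, and non-uniform precision:

```python
import numpy as np

def quantize_block_q4(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Naive symmetric 4-bit quantization of one block of weights.

    Each block stores one floating-point scale plus a small signed
    integer per weight (range -8..7). Storage drops from 16 bits per
    weight to ~4 bits plus the shared scale.
    """
    scale = float(np.abs(block).max()) / 7.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero block
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate FP32 weights from the quantized block."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=32).astype(np.float32)  # 32-weight blocks are typical
q, s = quantize_block_q4(w)
w_hat = dequantize_block(q, s)
# The reconstruction error (at most half the scale per weight) is the
# "quality loss" quantization trades for the smaller footprint
err = np.abs(w - w_hat).max()
```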
Practical recommendations:
- Q4_K_M — Best balance of size and quality. Default choice for most users.
- Q5_K_M / Q6_K — Noticeably better quality, modest size increase. Worth it for coding and reasoning tasks.
- Q8_0 — Near FP16 quality. Only useful if you have VRAM to spare.
- Q2_K / Q3_K_M — Significant quality degradation. Use only when VRAM is severely constrained.
When Quantization Hurts Most
Aggressive quantization (Q3 and below) tends to affect complex reasoning, instruction following, and factual accuracy more than it affects casual conversation. For coding, math, and structured outputs, higher-precision levels like Q5_K_M or Q6_K are worth the VRAM cost.
Why It Matters for Local AI
Without quantization, most local LLM use cases simply wouldn't exist on consumer hardware. Q4_K_M is the format that made running capable models on a 12–16GB GPU viable, and it remains the practical standard for most local inference setups.
Related guides:
- GGUF vs GPTQ vs AWQ vs EXL2: which quantization format should you use? — choosing between quantization formats for your hardware.
- Why your VRAM runs out mid-conversation: the KV cache explained — how quantization level affects available KV cache headroom.
- Use the VRAM calculator to see exactly how much VRAM different models and quantization levels require.