Quantization
Reducing a model's numerical precision to shrink its memory footprint with minimal quality loss.
Neural network weights are originally trained at high precision — typically FP16 or BF16, using 16 bits per value. Quantization compresses those values to fewer bits, reducing the model's size in memory. A 70B model at FP16 occupies ~140GB of VRAM. Quantized to Q4_K_M, the same model drops to around 40GB — a 72% reduction in size, with a modest drop in output quality.
This is what makes running large models on consumer hardware possible at all.
Common Quantization Levels
| Format | Bits per weight | 7B size | 13B size | 70B size |
|---|---|---|---|---|
| FP16 | 16 | ~14GB | ~26GB | ~140GB |
| Q8_0 | 8 | ~7GB | ~13GB | ~70GB |
| Q6_K | 6 | ~5.5GB | ~10GB | ~53GB |
| Q5_K_M | 5 | ~4.8GB | ~8.5GB | ~48GB |
| Q4_K_M | 4 | ~4.5GB | ~7.8GB | ~40GB |
| Q3_K_M | 3 | ~3.5GB | ~6GB | ~30GB |
| Q2_K | 2 | ~2.7GB | ~4.7GB | ~24GB |
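The table values follow from simple arithmetic: size is roughly parameter count times bits per weight. A minimal sketch (the function name and the ~4.8 effective bits-per-weight figure for Q4_K_M are illustrative assumptions, not from a specific tool):

```python
def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough model size estimate: parameters x bits per weight.

    Real quantized files run slightly larger than the nominal bit depth
    suggests, because block-wise formats store per-block scales and the
    file carries tokenizer/metadata alongside the weights.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# FP16 at 70B: 70e9 params x 2 bytes each = ~140GB, matching the table
print(round(approx_model_size_gb(70, 16)))   # 140
# Q4_K_M effectively averages closer to ~4.8 bits/weight, which is why
# the table shows ~40GB for 70B rather than a flat 35GB
print(round(approx_model_size_gb(70, 4.8)))  # 42
```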
Quality vs Size Trade-Offs
More aggressive quantization (fewer bits per weight) means a smaller model but more noticeable quality loss. The "K" variants (Q4_K_M, Q5_K_M, etc.) use a method called K-quants that applies different precision levels to different parts of the model, preserving quality better than naive uniform quantization at the same nominal bit depth.
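The core mechanics can be illustrated with a naive symmetric 4-bit block quantizer. This is a simplified sketch of the general idea, not the actual K-quant algorithm, which refines it with super-blocks, quantized scales, and non-uniform precision:

```python
import numpy as np

def quantize_block_q4(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Naive symmetric 4-bit quantization of one block of weights.

    Each block stores one floating-point scale plus a small signed
    integer per weight (range -8..7). Storage drops from 16 bits per
    weight to ~4 bits plus the shared scale.
    """
    scale = float(np.abs(block).max()) / 7.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero block
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate FP32 weights from the quantized block."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=32).astype(np.float32)  # 32-weight blocks are typical
q, s = quantize_block_q4(w)
w_hat = dequantize_block(q, s)
# The reconstruction error (at most half the scale per weight) is the
# "quality loss" quantization trades for the smaller footprint
err = np.abs(w - w_hat).max()
```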
Practical recommendations:
- Q4_K_M — Best balance of size and quality. Default choice for most users.
- Q5_K_M / Q6_K — Noticeably better quality, modest size increase. Worth it for coding and reasoning tasks.
- Q8_0 — Near FP16 quality. Only useful if you have VRAM to spare.
- Q2_K / Q3_K_M — Significant quality degradation. Use only when VRAM is severely constrained.
When Quantization Hurts Most
Aggressive quantization (Q3 and below) tends to affect complex reasoning, instruction following, and factual accuracy more than it affects casual conversation. For coding, math, and structured outputs, higher-precision levels like Q5_K_M or Q6_K are worth the VRAM cost.
Why It Matters for Local AI
Without quantization, most local LLM use cases simply wouldn't exist on consumer hardware. Q4_K_M is the format that made running capable models on a 12–16GB GPU viable, and it remains the practical standard for most local inference setups.
Related guides:
- GGUF vs GPTQ vs AWQ vs EXL2: which quantization format should you use? — choosing between quantization formats for your hardware.
- Why your VRAM runs out mid-conversation: the KV cache explained — how quantization level affects available KV cache headroom.
- Use the VRAM calculator to see exactly how much VRAM different models and quantization levels require.