CraftRigs
Models & Quantization

Q4_K_M

A 4-bit quantization format for GGUF models that shrinks weights to roughly 30% of their FP16 size while preserving most of the model's quality.

Q4_K_M is a 4-bit "K-quant" mixed-precision format used in GGUF model files. It is the default quantization most local AI builders reach for first because it strikes the cleanest balance between file size, VRAM footprint, and output quality.

How the Format Works

The "K" refers to the K-quants family introduced in llama.cpp, which groups weights into super-blocks of 256, splits each super-block into smaller blocks, and stores a quantized scale and minimum per block instead of rounding every weight against one uniform scale. The trailing "M" means medium: selected attention and feed-forward tensors that matter most for quality get bumped to higher precision (typically Q6_K), while the bulk of the weights stay at 4 bits. The result is a model that averages about 4.85 bits per weight instead of a flat 4.
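The arithmetic behind that average can be sketched in a few lines. This is a rough illustration, not llama.cpp's own code: the super-block layouts (256 weights, 6-bit scales and minimums for Q4_K, 8-bit scales for Q6_K, FP16 super-block scales) are assumptions stated from the published K-quant descriptions, and the function names are hypothetical.

```python
# Bits per weight implied by the K-quant super-block layouts.
# The layout numbers are assumptions for illustration; check them against
# the llama.cpp source before relying on them.

SUPER_BLOCK_WEIGHTS = 256  # weights per K-quant super-block


def q4_k_bits_per_weight() -> float:
    quants = SUPER_BLOCK_WEIGHTS * 4   # one 4-bit value per weight
    scales_and_mins = 8 * (6 + 6)      # 8 blocks, 6-bit scale + 6-bit min each
    super_scales = 2 * 16              # two FP16 values per super-block
    return (quants + scales_and_mins + super_scales) / SUPER_BLOCK_WEIGHTS


def q6_k_bits_per_weight() -> float:
    quants = SUPER_BLOCK_WEIGHTS * 6   # one 6-bit value per weight
    scales = 16 * 8                    # 16 blocks, 8-bit scale each
    super_scale = 16                   # one FP16 value per super-block
    return (quants + scales + super_scale) / SUPER_BLOCK_WEIGHTS


def q6_k_fraction_for_average(target_bpw: float) -> float:
    """Share of weights that must sit at Q6_K to reach a target average."""
    low = q4_k_bits_per_weight()
    high = q6_k_bits_per_weight()
    return (target_bpw - low) / (high - low)


print(q4_k_bits_per_weight())   # 4.5
print(q6_k_bits_per_weight())   # 6.5625
print(round(q6_k_fraction_for_average(4.85), 3))
```

Under these assumptions, an average of 4.85 bits per weight corresponds to roughly one sixth of the weights being stored at Q6_K and the rest at Q4_K. The exact share varies by model architecture.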

Sizing and Hardware Implications

Because Q4_K_M cuts weights to roughly 30% of FP16 size, it largely dictates which models fit on which cards. A 70B model at Q4_K_M needs roughly 43–47GB of VRAM to load, which is more than a single RTX 5090's 32GB or an RTX 4090's 24GB; that is why dual-GPU stacks exist for that tier. At the small end, a Pi 5 can run Llama 3.2 3B at Q4_K_M through Ollama at 4–6 tokens per second, and a Qwen3 reasoning setup fits comfortably inside roughly 7.3GB of combined VRAM.
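A back-of-the-envelope version of that sizing is parameter count times bits per weight, divided by eight. The sketch below assumes the ~4.85 bits-per-weight average and decimal gigabytes; the `overhead_gb` parameter is a hypothetical stand-in for KV cache and runtime buffers, and the 3GB used in the example is an illustrative guess, not a measured value.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float, vram_gb: float,
         overhead_gb: float = 0.0) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM."""
    return weights_size_gb(n_params, bits_per_weight) + overhead_gb <= vram_gb


q4_k_m_bpw = 4.85  # average from the format description above

print(weights_size_gb(70e9, 16))          # FP16 weights for a 70B model
print(weights_size_gb(70e9, q4_k_m_bpw))  # same model at Q4_K_M

# Hypothetical 3 GB of overhead for context and buffers.
print(fits(70e9, q4_k_m_bpw, vram_gb=32, overhead_gb=3.0))      # single 32GB card
print(fits(70e9, q4_k_m_bpw, vram_gb=2 * 24, overhead_gb=3.0))  # two 24GB cards
```

The weights alone come to about 42.4GB for a 70B model, so the 43–47GB range quoted above leaves only a few gigabytes for everything else. Long contexts can push real usage higher.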

Tradeoffs vs Other Quants

Compared to Q5_K_M or Q6_K, Q4_K_M accepts slightly higher perplexity in exchange for roughly 15–25% less memory. Compared to flat Q4_0 or INT4, the K-quant scheme produces noticeably better outputs at the same nominal 4-bit width, with only slightly larger files, because critical layers are protected. Q4_K_M is widely treated as the practical floor where a model still "feels" like itself; drop below it and instruction following, math, and code generation start visibly degrading.
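The memory side of that tradeoff follows directly from the average bits per weight of each format. The figures in the table below are approximate values assumed for illustration (the K-quant mixes vary a little by model), and `memory_saving` is a hypothetical helper, not part of any library.

```python
# Approximate average bits per weight; treat these as assumptions.
APPROX_BPW = {
    "Q4_0": 4.5,
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.5625,
    "FP16": 16.0,
}


def memory_saving(smaller: str, larger: str) -> float:
    """Fraction of weight memory saved by using `smaller` instead of `larger`."""
    return 1.0 - APPROX_BPW[smaller] / APPROX_BPW[larger]


for other in ("Q5_K_M", "Q6_K", "FP16"):
    saving = memory_saving("Q4_K_M", other)
    print(f"Q4_K_M vs {other}: {saving:.0%} smaller")
```

With these assumed figures, Q4_K_M comes out about 15% smaller than Q5_K_M, about 26% smaller than Q6_K, and about 70% smaller than FP16.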

Why It Matters for Local AI

VRAM is the binding constraint for almost every local rig, and Q4_K_M is the format that decides whether a given model is reachable on your hardware at all. Sizing against it correctly is the difference between a 70B model running across two GPUs and a 70B model that won't load. For most builders sizing a first or second rig, "what's the Q4_K_M size?" is the right opening question before looking at anything else.