TL;DR — Bits-per-weight tells you how many bits store each model parameter. 16 bits = full quality at max VRAM, 4 bits = roughly half the VRAM with minor quality loss, 2-3 bits = fits on almost anything but noticeably dumber. For most builds, 4-bit (Q4_K_M in GGUF) is the best quality-per-GB tradeoff.
What Is Bits-Per-Weight?
Bits-per-weight is how many bits of storage are used per model parameter — lower means a smaller file, less VRAM, and faster decode speed, at some accuracy cost.
The technical version: quantization maps floating-point weights (typically 16-bit) to a lower-bit integer representation using a scale factor and zero point per group of weights. A "group" is usually 32 or 128 consecutive weights. Each group gets its own scaling factors to limit how much rounding error accumulates.
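As a concrete sketch, here is what quantizing one 32-weight group looks like in plain NumPy (an illustrative toy, not any specific library's kernel; the function names are mine):

```python
import numpy as np

def quantize_group(weights, bits=4):
    """Map one group of float weights to unsigned ints plus (scale, zero_point)."""
    qmax = 2 ** bits - 1                        # 4-bit -> integers 0..15
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / qmax              # float range covered per integer step
    zero_point = w_min                          # the value that integer 0 maps back to
    q = np.round((weights - zero_point) / scale).astype(np.uint8)
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    return q.astype(np.float32) * scale + zero_point

rng = np.random.default_rng(0)
group = rng.normal(0.0, 0.05, size=32).astype(np.float32)  # one 32-weight group
q, scale, zp = quantize_group(group)
restored = dequantize_group(q, scale, zp)
max_err = float(np.abs(group - restored).max())  # bounded by ~scale / 2
```

Every weight lands within half a step (`scale / 2`) of its original value, which is why small groups, each with a scale fitted to its own range, keep rounding error low.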
Think of it like streaming audio bitrate. FP16 is 320 kbps — full quality, big file. Q4 is 128 kbps — most people can't hear the difference on normal content. Q2 is 64 kbps — you start noticing compression artifacts in complex passages. The music is still recognizable, but something's off.
At Q4, your local LLM is still "the music." At Q2, complex reasoning tasks start sounding compressed.
Why Bits-Per-Weight Is the Most Important Spec to Know
It's the primary lever that determines whether a model fits in your GPU — before you pick a model size, before you optimize settings, before anything else.
A 7B model at 16-bit needs roughly 14 GB VRAM. At 4-bit, the same model needs 4.5 GB. The RTX 3060 that can't fit the 16-bit version runs the 4-bit version comfortably and still has room for context.
Real numbers for Llama 3.1 8B across quantization levels:
| Quantization | Quality vs FP16 |
|---|---|
| Q5_K_M | −0.5 pts (imperceptible) |
| Q4_K_M | −0.9 pts (imperceptible) |
| Q3_K_M | −2.6 pts (noticeable on reasoning) |
| Q2_K | −6.9 pts (clearly degraded) |

These deltas are drawn from the arXiv quantization study on Llama 3.1 8B Instruct (2025). The absolute MMLU score depends on evaluation configuration (5-shot vs 0-shot, base vs instruct) — what matters is the gap between levels: Q4 to FP16 is negligible, Q3 to Q4 is where quality starts slipping, and Q2 is clearly compromised.
VRAM Per Billion Parameters by Bit Depth
If you want to estimate any model quickly:
| Bit Depth | VRAM per 1B Parameters |
|---|---|
| FP16 / BF16 | ~2.0 GB |
| Q8_0 | ~1.08 GB |
| Q5_K_M | ~0.71 GB |
| Q4_K_M | ~0.58 GB |
| Q3_K_M | ~0.45 GB |
| Q2_K | ~0.35 GB |
Multiply by parameter count, add ~1-2 GB for overhead and context, and you know whether your GPU can handle it before downloading anything.
A 13B model at Q4_K_M: 13 × 0.58 = 7.5 GB base + ~1.5 GB overhead = ~9 GB total. That fits on a 12 GB RTX 3060.
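That back-of-envelope formula is easy to wrap in a helper. The per-level numbers come from the table above; the 1.5 GB overhead default is a rough assumption, and the function name is mine:

```python
# Per-billion-parameter VRAM costs from the bit-depth table above.
GB_PER_BILLION_PARAMS = {
    "FP16": 2.0, "Q8_0": 1.08, "Q5_K_M": 0.71,
    "Q4_K_M": 0.58, "Q3_K_M": 0.45, "Q2_K": 0.35,
}

def estimate_vram_gb(params_billion, quant, overhead_gb=1.5):
    """Rough total VRAM: weights plus a fixed overhead/context allowance."""
    return params_billion * GB_PER_BILLION_PARAMS[quant] + overhead_gb

print(f"{estimate_vram_gb(13, 'Q4_K_M'):.1f} GB")  # the 13B example above: ~9.0 GB
```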
MMLU Quality at Each Bit Depth (Llama 3.1 8B, relative deltas)
- FP16 baseline: varies by eval config (69.4 on Meta's 5-shot; ~59 in some 0-shot studies)
- Q8_0: −0.2 pts (imperceptible, within benchmark noise)
- Q4_K_M: −0.9 pts (imperceptible in practice)
- Q3_K_M: −2.6 pts (noticeable on multi-step reasoning)
- Q2_K: −6.9 pts (clearly degraded on complex tasks)
The cliff is between Q3 and Q4. Above Q4, extra bits buy quality gains you won't notice. Below Q3, the VRAM savings cost quality you will.
**Warning:** Q2_K is only worth considering if you have no other option — if you're trying to fit a 70B model on a card that can barely handle it. For 7B-13B models on 12 GB GPUs, you have better choices.
How Weight Quantization Works Under the Hood
Full-precision models store weights as FP16 floating-point numbers — each a 16-bit value capable of representing roughly 65,000 distinct values. Quantization maps those continuous floating-point values to a smaller set of integers.
At 4-bit, you have 16 possible values per weight. At 2-bit, you have 4 possible values. The more you compress, the more rounding error you introduce — and the more the model's internal representations drift from what it learned during training.
Modern quantization uses grouped quantization to limit the damage. Instead of using a single scale factor for the entire model, it divides weights into groups of 32 or 128 and assigns each group its own scale and zero point. Rounding errors stay contained within the group rather than cascading across layers.
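A quick way to see the containment effect: quantize the same weights with one global scale versus per-group scales, with a single outlier weight thrown in (a toy NumPy sketch with my own function names, not production kernel code):

```python
import numpy as np

def mean_quant_error(weights, bits=4, group_size=None):
    """Mean absolute reconstruction error with one scale per group (or per tensor)."""
    gs = group_size or len(weights)
    errors = []
    for i in range(0, len(weights), gs):
        g = weights[i:i + gs]
        scale = (g.max() - g.min()) / (2 ** bits - 1)
        q = np.round((g - g.min()) / scale)             # quantize...
        errors.append(np.abs(q * scale + g.min() - g))  # ...and measure round-trip error
    return float(np.concatenate(errors).mean())

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.05, size=4096).astype(np.float32)
w[100] = 3.0                                    # a single outlier weight
per_tensor = mean_quant_error(w)                # outlier stretches the one shared scale
per_group = mean_quant_error(w, group_size=32)  # outlier only hurts its own group
```

With a single scale, the outlier forces a step size large enough to cover it, so every weight gets coarse steps; with 32-weight groups, only the outlier's group pays that price.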
K-quants — the "K" in Q4_K_M — go further. They use mixed-precision grouping where different layer types get different bit depths. Attention layers, which tend to be more quality-sensitive, stay at slightly higher precision. Less critical layers get compressed more aggressively. The result is better quality at the same average bit depth.
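The averaging itself is just weighted arithmetic. A hypothetical mix (layer names, sizes, and bit assignments are invented for illustration, not the actual Q4_K_M recipe):

```python
# (name, parameter count, bits) for a made-up four-layer slice of a model.
layers = [
    ("attn.k_proj", 1.0e8, 6),  # quality-sensitive attention weights: 6-bit
    ("attn.q_proj", 1.0e8, 4),
    ("ffn.up",      2.8e8, 4),  # bulk of the parameters: 4-bit
    ("ffn.down",    2.8e8, 5),
]
total_bits = sum(count * bits for _, count, bits in layers)
total_params = sum(count for _, count, _ in layers)
avg_bpw = total_bits / total_params  # lands between 4 and 5 despite the "4-bit" label
print(f"average bits per weight: {avg_bpw:.2f}")
```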
GGUF Format (llama.cpp / Ollama / LM Studio)
The most widely supported format for local inference. Runs on CPU, GPU, or mixed CPU+GPU offload across all major platforms. This is what you're using if you run models through Ollama or LM Studio.
The naming convention follows a pattern: Q{bits}_K_{size}. Q4_K_M means 4-bit, K-quant, Medium configuration — some layers are stored at 5-6 bits while the average comes out to about 4.5 bits per weight. The "M" (Medium) variant is the standard balance point; "S" (Small) is slightly smaller, "L" (Large) is slightly larger.
Q4_K_M is the de facto recommendation for most builds because it hits the best quality-per-GB in the format. There's a reason it's what most guides default to.
File sizes for Llama 3.1 8B in GGUF:
- Q4_K_M: 4.9 GB
- Q5_K_M: 5.7 GB
- Q8_0: 8.6 GB
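Those sizes roughly follow from size ≈ parameters × average bits-per-weight ÷ 8. A sanity check (the bpw averages and the 8.03B parameter count are approximate figures I'm assuming here, not values from this article):

```python
# Approximate effective bits-per-weight per GGUF level (community-reported
# averages; K-quants mix precisions, so these are not exact).
PARAMS = 8.03e9  # Llama 3.1 8B's approximate true parameter count
for quant, bpw in [("Q4_K_M", 4.85), ("Q5_K_M", 5.69), ("Q8_0", 8.5)]:
    size_gb = PARAMS * bpw / 8 / 1e9  # bits -> bytes -> GB
    print(f"{quant}: ~{size_gb:.1f} GB")
```

The results land within rounding distance of the listed file sizes; metadata and the tokenizer add a little on top.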
AWQ Format (vLLM / TGI)
Activation-Aware Weight Quantization. Before quantizing, AWQ runs sample prompts through the model to figure out which weights matter most for output quality — those get protected during compression. Typically INT4 with 128-group size.
The result is slightly better quality than naive INT4 at the same bit depth. On pure CUDA setups (single GPU, no CPU offload), AWQ can deliver better throughput than GGUF. Preferred for vLLM-based serving.
The downside: less flexible for mixed CPU+GPU scenarios. If you're running Ollama on a 12 GB card and need some layers on CPU RAM, GGUF handles that gracefully. AWQ doesn't.
GPTQ and EXL2 Format (ExllamaV2 / AutoGPTQ)
GPTQ (GPT Quantization) optimizes quantization layer by layer, minimizing reconstruction error. ExllamaV2 extends this further with EXL2 format, which supports fractional bits-per-weight — you can run at 3.5 bpw, for example, instead of being locked to whole-number bit depths.
The performance advantage is real: ExllamaV2 with EXL2 at 4 bpw reaches 2-3× the tokens per second of GGUF Q4 on the same RTX hardware. If you want maximum inference speed on a pure-CUDA single-GPU build, this is the path.
The tradeoff: setup complexity. It's not plug-and-play like Ollama. You're compiling the inference engine, managing model conversions, and running Python inference scripts directly. Worth it for dedicated inference servers — probably not for a general daily driver.
**Note:** For most people starting out: GGUF in Ollama or LM Studio. For pure-speed optimization on a dedicated RTX rig: ExllamaV2 with EXL2. AWQ fills the middle if you're building a vLLM server.
"Lower Bits Always Mean Garbage Outputs" — Not True
This gets repeated constantly, and above the Q3 line it's simply wrong.
Q4_K_M is not garbage. On coding, writing, summarization, and most conversational tasks, it is indistinguishable from FP16 for human raters. The MMLU gap is 0.9 points — a statistical artifact, not something you'll notice in practice.
The quality cliff is at Q3, not Q4. Q3_K_M starts showing noticeable degradation on multi-step math and logic chains. Q2_K is genuinely compromised — you'll see it fall apart on anything requiring careful reasoning. That's the threshold to care about.
There's also a counterintuitive principle that flips the whole debate: a larger model at lower quantization often beats a smaller model at higher precision.
Llama 3.1 13B Q4_K_M fits in about 9 GB of VRAM and outperforms Llama 3.1 8B FP16 on most benchmarks — while using less VRAM than the 8B at full precision. You get a better model on hardware that can't fit the smaller model's uncompressed version. This is why quantization exists, and why "just run full precision" is often the wrong instinct.
Where did the "lower bits = garbage" reputation come from? Early quantization methods from 2021-2022 — first-generation GPTQ and naive INT4 — genuinely did degrade quality significantly. The tooling was immature. Modern K-quants and AWQ are substantially better, but the reputation stuck.
**Tip:** If your model's responses feel off on complex tasks, try stepping up one quantization level before assuming the model itself is the problem. The jump from Q3_K_M to Q4_K_M makes a real difference. Q4 to Q5 is usually marginal.
Picking the Right Quantization for Your GPU
The decision rule: fit the largest model at the highest bit depth that leaves ~2 GB VRAM headroom for context.
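The rule is mechanical enough to write down. A sketch using the per-bit-depth table from earlier; the function name is mine, and the 2 GB default comes from the headroom rule above:

```python
# Quantization levels ordered best quality first, with GB per billion
# parameters taken from the bit-depth table earlier in the article.
QUANT_LADDER = [("Q8_0", 1.08), ("Q5_K_M", 0.71), ("Q4_K_M", 0.58),
                ("Q3_K_M", 0.45), ("Q2_K", 0.35)]

def pick_quant(params_billion, vram_gb, headroom_gb=2.0):
    """Highest bit depth whose weights still leave the requested headroom."""
    for name, gb_per_billion in QUANT_LADDER:
        if params_billion * gb_per_billion + headroom_gb <= vram_gb:
            return name
    return None  # doesn't fit even at Q2_K

print(pick_quant(8, 12))  # 8B on a 12 GB card -> Q8_0, matching the 3060 pick below
```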
RTX 3060 (12 GB): Best option is Llama 3.1 8B Q8_0 at 8.6 GB — excellent quality, 3 GB headroom for context. Alternatively, Llama 3.1 13B Q4_K_M at ~8.3 GB gives you a bigger model at lower bits. The 13B Q4 will outperform the 8B Q8 on most complex tasks.
RTX 4070 (12 GB): Same model fit as the 3060, but expect 15-20% faster decode thanks to higher memory bandwidth. The VRAM math is identical — same recommendations apply.
RTX 4070 Ti (16 GB): Llama 3.1 8B FP16 needs roughly 16.1 GB for the weights alone, which already exceeds the card before any context overhead. More practical: Llama 3.1 13B Q6_K at ~9.8 GB with comfortable headroom and excellent quality. Or Llama 3.1 8B Q8_0 at 8.6 GB for near-full precision.
RTX 4090 (24 GB): Run Llama 3.1 8B FP16 with 8K context comfortably. Llama 3.1 70B Q4_K_M is out of reach at ~42 GB on a single card; for 70B models you need a second GPU or CPU RAM offload, with significant speed penalties.
| Option | Notes |
|---|---|
| RTX 3060 · 8B Q8_0 | Best quality for budget tier |
| RTX 3060 · 13B Q4_K_M | Bigger model, same VRAM |
| RTX 4070 | 15-20% faster decode |
| RTX 4070 Ti · 13B Q6_K | Quality sweet spot |
| RTX 4090 · 8B FP16 | Full precision, full speed |

Start with Q4_K_M as your default when trying a new model. If response quality seems degraded on complex tasks, step up to Q5_K_M or Q8_0. Only step down to Q3 if nothing else fits.
Related Concepts
- VRAM — the hard limit that bits-per-weight directly controls. More bits = more VRAM, no exceptions.
- GGUF — the file format that packages quantized models for llama.cpp, Ollama, and LM Studio. The most common format you'll encounter.
- ExllamaV2 — the inference engine that uses EXL2 fractional-bit quantization for maximum throughput on pure CUDA builds.
- Model accuracy benchmarks — MMLU and HellaSwag are the standard quality comparisons across quantization levels. Understand what they measure before treating the numbers as gospel.