INT4 (4-bit Integer)
4-bit integer quantization format — the practical minimum precision for running large language models on consumer hardware.
INT4 is a 4-bit integer quantization format — each model weight is stored using only 4 bits, compared to 16 bits for FP16. This 4x size reduction relative to FP16 is what makes running large models on consumer GPUs possible at all.
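To make the format concrete, here is a minimal sketch of block-wise INT4 quantization in plain Python. It assumes simple absmax scaling with one float scale per block; real formats (such as GGUF's Q4 variants) store additional per-block metadata, and all function names here are illustrative, not from any library.

```python
def quantize_int4_block(weights, block_size=32):
    """Block-wise absmax quantization to signed 4-bit ints in [-8, 7].

    Simplified sketch: one float scale per block. Real GGUF Q4 formats
    also store per-block minimums and sub-block scales.
    """
    assert len(weights) % block_size == 0
    q, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        # Map the block's largest magnitude onto 7 so every value fits in [-8, 7].
        scale = max(abs(w) for w in block) / 7.0 or 1.0  # 1.0 avoids div-by-zero
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in block)
    return q, scales

def dequantize_int4_block(q, scales, block_size=32):
    """Reconstruct approximate floats; per-weight error is at most scale / 2."""
    return [q[i] * scales[i // block_size] for i in range(len(q))]

def pack_nibbles(q):
    """Two 4-bit values per byte: this packing is where the 4x saving comes from."""
    u = [v + 8 for v in q]  # shift [-8, 7] -> [0, 15] so each fits in a nibble
    return bytes(u[i] | (u[i + 1] << 4) for i in range(0, len(u), 2))
```

The packed output is half a byte per weight, plus one scale per block, which is exactly why real-world 4-bit formats land slightly above 4 bits per weight.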
Memory Savings in Practice
The math: 4 bits = 0.5 bytes per parameter. A 7B model at INT4 = ~3.5GB; a 70B model = ~35GB. Compare to FP16: 14GB for 7B, 140GB for 70B. INT4 makes the difference between a 70B model requiring an A100 80GB and fitting across two 24GB consumer GPUs.
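A quick sanity check of these figures (weight storage only; the KV cache and activations add a few more GB at runtime):

```python
def weight_size_gb(params_billions, bits_per_weight):
    """Approximate weight storage in decimal GB, ignoring KV cache and activations."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Reproduces the figures in the text: 14/3.5 GB for 7B, 140/35 GB for 70B.
for params in (7, 70):
    for name, bits in (("FP16", 16), ("INT4", 4)):
        print(f"{params}B @ {name}: {weight_size_gb(params, bits):.1f} GB")
```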
In GGUF format (used by llama.cpp, Ollama, LM Studio), the common INT4 equivalent is Q4_K_M, which uses slightly more than 4 bits per weight because each block of weights also stores scale and minimum metadata, landing around 4.8 bits per weight on average.
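You can check the effective bits-per-weight of any quantized file directly from its size and parameter count. The file size below is illustrative, not a measured value:

```python
def effective_bpw(file_size_bytes, n_params):
    """Effective bits per weight: total file bits divided by parameter count.

    Comes out above the nominal 4 bits for Q4_K_M because per-block
    scale/min metadata is part of the file.
    """
    return file_size_bytes * 8 / n_params

# Illustrative: a ~4.1 GB Q4_K_M file for a 7B-parameter model.
print(round(effective_bpw(4.1e9, 7e9), 2))
```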
Quality Trade-offs
INT4 is the point where quality loss becomes measurable. The reduction from 8 bits to 4 bits introduces enough rounding error that complex reasoning tasks (math, coding, structured output) see detectable degradation compared to FP16. Casual conversation, summarization, and writing tasks are largely unaffected.
Practical guidance: for coding and reasoning, prefer Q5_K_M or Q6_K if you have the VRAM. For general use, Q4_K_M is the standard and widely considered acceptable.
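That guidance can be sketched as a small helper. The bits-per-weight figures are rough community estimates, and `pick_quant` is a hypothetical function, not part of any tool:

```python
# Rough bits-per-weight estimates for common GGUF quants (varies by model).
QUANTS = [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]

def pick_quant(params_billions, vram_gb, overhead_gb=1.5):
    """Hypothetical helper: highest-quality quant whose weights fit in VRAM.

    overhead_gb is a rough reserve for KV cache and activations.
    """
    budget = vram_gb - overhead_gb
    for name, bpw in QUANTS:  # ordered highest quality first
        if params_billions * bpw / 8 <= budget:
            return name
    return None  # nothing fits; consider Q3/Q2 or CPU offloading
```

For example, a 7B model on a 24GB card comfortably fits Q8_0, while on an 8GB card the same model drops to Q6_K.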
Hardware Acceleration
NVIDIA Tensor Cores have exposed native INT4 math since the Turing generation: on Turing, Ampere (RTX 30 series), and Ada Lovelace (RTX 40 series), INT4 Tensor Core operations run at roughly twice the INT8 rate, per NVIDIA's architecture whitepapers.
In practice, though, token generation in local runtimes is bound by memory bandwidth rather than compute, so most of INT4's speedup comes from reading a quarter as many weight bytes as FP16; bandwidth differences between GPU generations usually matter more than their integer compute paths.
INT4 vs. Other Formats
The quantization ladder for local models: FP16 → Q8_0 → Q6_K → Q5_K_M → Q4_K_M (INT4 range) → Q3_K_M → Q2_K. INT4 (Q4_K_M) is the pragmatic floor — below it, quality degrades noticeably for most use cases. Above it, you're trading VRAM for incremental quality gains.
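The ladder's VRAM cost can be tabulated with rough, model-dependent bits-per-weight estimates (the figures below are approximations, not exact format constants):

```python
# Rough bits-per-weight estimates down the ladder (varies by model and tensor mix).
LADDER = [("FP16", 16.0), ("Q8_0", 8.5), ("Q6_K", 6.6),
          ("Q5_K_M", 5.7), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9), ("Q2_K", 3.4)]

for name, bpw in LADDER:
    gb = 7e9 * bpw / 8 / 1e9  # weight bytes for a 7B model, in decimal GB
    print(f"{name:>7}: ~{gb:.1f} GB")
```

Each rung down the ladder trades reconstruction accuracy for a smaller footprint, which is the VRAM-for-quality trade described above.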