GPTQ — Local AI Glossary | CraftRigs

GPTQ (Generative Pre-trained Transformer Quantization) is a post-training quantization algorithm that shrinks large language model weights down to 4-bit (or 3-bit) precision, letting much bigger models fit inside a single consumer GPU's VRAM. It was one of the first quantization formats to make 13B–34B models genuinely usable on a single 24GB card. A 4-bit 70B model needs roughly 35GB for weights alone, so it still calls for two such cards or a larger GPU.

How GPTQ Works

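GPTQ quantizes weights one layer at a time using a small calibration dataset. Within each layer it rounds weights in sequence and nudges the not-yet-quantized weights to compensate, using approximate second-order information from the calibration activations so the layer's output stays as close as possible to the original FP16 model's. The result is a GPU-native format: weights stay packed at low precision in VRAM and are dequantized on the fly inside GPU kernels, rather than being laid out for CPU offload. Unlike GGUF, which targets llama.cpp and CPU/GPU split inference, GPTQ assumes you have enough VRAM to load the whole model on the GPU.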
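To make the layer-wise idea concrete, here is a minimal NumPy sketch of GPTQ-style error compensation on one toy linear layer, compared against plain round-to-nearest. It is an illustration under stated assumptions, not the reference implementation: the function names, the per-row symmetric 4-bit grid, the damping value, and the synthetic calibration data are all hypothetical choices made for this example.

```python
import numpy as np

def quantize_rtn(w, scale):
    """Round to the nearest point on a symmetric 4-bit grid (integers -8..7)."""
    return np.clip(np.round(w / scale), -8, 7) * scale

def gptq_like_quantize(W, X, scale, damp=0.01):
    """Quantize W column by column, pushing each rounding error onto later columns.

    W: (out_features, in_features) weights; X: (n_samples, in_features) calibration inputs.
    """
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    # Layer-wise Hessian proxy from calibration activations, lightly damped for stability.
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(n_in):
        q = quantize_rtn(W[:, i:i + 1], scale)
        Q[:, i:i + 1] = q
        # Rounding error for column i, scaled by its inverse-Hessian diagonal entry.
        err = (W[:, i:i + 1] - q) / Hinv[i, i]
        # Adjust the not-yet-quantized columns to compensate for that error.
        W[:, i + 1:] -= err @ Hinv[i:i + 1, i + 1:]
        # Remove column i from the inverse Hessian before moving on.
        Hinv -= np.outer(Hinv[:, i], Hinv[i, :]) / Hinv[i, i]
    return Q

# Synthetic, correlated calibration data (an assumption made for this demo).
rng = np.random.default_rng(0)
n_samples, n_in, n_out = 512, 32, 64
X = rng.normal(size=(n_samples, n_in)) + 2.0 * rng.normal(size=(n_samples, 1))
W = rng.normal(size=(n_out, n_in))
scale = np.abs(W).max(axis=1, keepdims=True) / 7.0  # one scale per output row

Q_rtn = quantize_rtn(W, scale)
Q_gptq = gptq_like_quantize(W, X, scale)

# Squared error of the layer's output on the calibration inputs.
err_rtn = np.linalg.norm(X @ (W - Q_rtn).T) ** 2
err_gptq = np.linalg.norm(X @ (W - Q_gptq).T) ** 2
print(f"round-to-nearest output error: {err_rtn:.1f}")
print(f"GPTQ-style output error:       {err_gptq:.1f}")
```

Both results use the same 4-bit grid; the only difference is that the second one chooses its roundings with the calibration data in view. Production GPTQ code adds refinements this sketch leaves out, such as a Cholesky-based formulation, batched block updates, and group-wise scales.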

Where It Fits in the Quantization Landscape

GPTQ is one of four major formats builders pick between: GGUF, GPTQ, AWQ, and EXL2. vLLM supports GPTQ alongside AWQ and HuggingFace-native quantization, making it a common choice for production-style serving on CUDA or ROCm stacks. AWQ often produces slightly better quality at the same bit-width, and EXL2 is usually the faster option when the runtime is ExLlamaV2, so GPTQ has lost some ground, but its ecosystem support remains broad and stable. LM Studio and Ollama lean GGUF-first, so in practice GPTQ is mostly a vLLM and text-generation-webui format.

Why It Matters for Local AI

If you're running vLLM on a 16–24GB GPU and want maximum throughput without offloading to system RAM, GPTQ is one of the formats that makes a 13B model viable on a 16GB card and a 34B model viable on a 24GB card (at 4 bits, 34B needs about 17GB for weights alone). It's less flexible than GGUF for mixed CPU/GPU rigs, but on a pure-GPU build, exactly the kind of rig CraftRigs readers are sizing, it delivers fast inference and predictable VRAM use. Pick GPTQ when your runtime is vLLM or ExLlama and the whole model fits in VRAM; pick GGUF when it doesn't.
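The "does it fit in VRAM" question comes down to simple arithmetic, sketched below. The helper names and the 3GB headroom figure (a stand-in for KV cache, activations, and quantization metadata) are assumptions for illustration; real overhead depends on context length, batch size, and runtime.

```python
def weight_gb(params_billion, bits=4):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits / 8

def fits_in_vram(params_billion, vram_gb, bits=4, headroom_gb=3.0):
    """Rough fit check. headroom_gb is a hypothetical allowance, not a measured value."""
    return weight_gb(params_billion, bits) + headroom_gb <= vram_gb

for size in (13, 34, 70):
    for card in (16, 24):
        verdict = "fits" if fits_in_vram(size, card) else "does not fit"
        print(f"{size}B at 4-bit ({weight_gb(size):.1f} GB weights) on a {card}GB card: {verdict}")
```

If the check fails for your card, that is the signal to drop to a smaller model, a lower bit-width, or a GGUF build with partial CPU offload.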