AWQ — Local AI Glossary | CraftRigs

AWQ (Activation-aware Weight Quantization) is a 4-bit quantization method for LLM weights that identifies the salient weight channels driving most of the model's output and protects them by scaling them up before quantization, so they lose less precision than the rest when everything is rounded to 4 bits. For local rig builders, it's one of the main ways to squeeze a 70B-class model into consumer VRAM without the accuracy cliff you get from naive int4 rounding.

How AWQ Differs from Other 4-bit Formats

Unlike GGUF Q4_K_M, which is the usual default for llama.cpp and Ollama workflows, AWQ is built for GPU-only inference engines like vLLM. It uses calibration data to measure which weight channels matter most for activations, then scales those channels before quantizing, which preserves model quality without the mixed-precision storage that keeping those channels in FP16 would require. GPTQ is also calibration-based but corrects rounding error as it goes instead of rescaling channels, while EXL2 targets a different runtime entirely.
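The scale-then-quantize idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the AWQ reference implementation: the helper names (fake_quant_int4, awq_style_search), the group size of 32, the grid of exponents, and the synthetic data with four outlier activation channels are all made up for the example. It searches for a per-input-channel scale s = mean|x|^alpha, quantizes W * s, and folds 1/s back in, then compares the layer's output error against plain round-to-nearest.

```python
import numpy as np

def fake_quant_int4(w, group_size=32):
    """Asymmetric round-to-nearest 4-bit quantization per group of input
    channels, returned already dequantized so errors can be measured."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    step = np.maximum(hi - lo, 1e-8) / 15.0        # 16 levels: 0..15
    q = np.clip(np.round((g - lo) / step), 0, 15)
    return (q * step + lo).reshape(out_f, in_f)

def awq_style_search(w, x_calib, group_size=32, n_grid=20):
    """Grid-search an exponent alpha for per-input-channel scales.
    alpha = 0 gives s = 1, i.e. plain round-to-nearest, so the result
    can never be worse than the baseline on the calibration data."""
    act_mag = np.abs(x_calib).mean(axis=0)         # one value per input channel
    y_ref = x_calib @ w.T                          # full-precision layer output
    best_w, best_alpha, best_err = None, None, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())         # keep scales centered around 1
        # Quantize the scaled weights, then fold 1/s back in. A real engine
        # would fold 1/s into the preceding op instead of the weights.
        w_hat = fake_quant_int4(w * s, group_size) / s
        err = float(np.mean((x_calib @ w_hat.T - y_ref) ** 2))
        if err < best_err:
            best_w, best_alpha, best_err = w_hat, float(alpha), err
    return best_w, best_alpha, best_err

# Synthetic calibration data: a few input channels carry much larger activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 128))
x[:, :4] *= 30.0                                   # hypothetical "salient" channels
w = rng.normal(size=(64, 128))

y_ref = x @ w.T
err_rtn = float(np.mean((x @ fake_quant_int4(w).T - y_ref) ** 2))
w_awq, alpha, err_awq = awq_style_search(w, x)
print(f"round-to-nearest MSE: {err_rtn:.4f}")
print(f"scaled (alpha={alpha:.2f}) MSE: {err_awq:.4f}")
```

Every weight is still stored at 4 bits in this sketch; the only thing that changes is how much of the quantization grid the high-activation channels get to use.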

Tradeoffs and Runtime Support

AWQ shines on dense GPU inference: vLLM, for example, runs Llama 3.3 70B Instruct in AWQ as a well-supported path and treats GGUF as experimental at best, while llama.cpp-based runtimes load GGUF and not AWQ. That means your quantization choice is partially dictated by which inference server you run. AWQ packs weights roughly 4x smaller than FP16 (slightly less once per-group scales are counted) and typically holds within a percent or two of FP16 perplexity on common benchmarks. It does not compress the KV cache: only model weights are quantized, and that happens at the AWQ stage, before inference starts.
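To put numbers on the size claim, here is a back-of-the-envelope calculator. Treat it as a rough sketch: the 70e9 parameter count, the group size of 128 with one FP16 scale and one 4-bit zero point per group, and the Llama-3-70B-style shape used for the KV cache (80 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration. Real checkpoints usually keep embeddings and the output head in FP16, so they come out somewhat larger.

```python
GIB = 2 ** 30

def weight_gib(n_params, bits_per_weight):
    """Size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB

def awq_bits_per_weight(w_bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight once per-group metadata is amortized.
    Assumes one scale and one zero point per group (an assumed layout)."""
    return w_bits + (scale_bits + zero_bits) / group_size

def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """FP16 KV cache: keys and values for every layer, KV head, and token.
    AWQ leaves this untouched; it grows with context length and batch size."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens / GIB

n_params = 70e9                                   # assumed round figure for a 70B model
fp16_gib = weight_gib(n_params, 16)
awq_gib = weight_gib(n_params, awq_bits_per_weight())
kv_gib = kv_cache_gib(n_tokens=8192, n_layers=80, n_kv_heads=8, head_dim=128)

print(f"FP16 weights: {fp16_gib:.1f} GiB")
print(f"AWQ weights:  {awq_gib:.1f} GiB ({fp16_gib / awq_gib:.2f}x smaller)")
print(f"KV cache for 8192 tokens: {kv_gib:.2f} GiB")
```

The KV cache figure is per 8192 tokens in flight, so batched serving multiplies it by however many long requests you hold at once.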

Why It Matters for Local AI

If you're running a pure-GPU rig and want to serve a 70B model with batched throughput, AWQ + vLLM is one of the few combinations that actually fits and stays fast. Budget for it, though: 70B parameters at 4 bits come to roughly 35GB of weights before any KV cache, so a single 24GB card is not enough. Plan on a 48GB-class GPU or two 24GB cards with tensor parallelism. Pick GGUF if you're on llama.cpp, Ollama, or need CPU/VRAM offloading; pick AWQ when your stack is pure GPU and you care about concurrent requests. The format you choose determines which runtimes you can use, not just how much VRAM the weights consume.