CraftRigs
Architecture Guide

GGUF vs GPTQ vs AWQ vs EXL2: Which Quantization Format Should You Use?

By Georgia Thomas · 6 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Quick Summary:

  • GGUF: Runs on everything — CPU, NVIDIA, AMD, Apple Silicon. Best hardware compatibility. Use K-quant variants (Q4_K_M, Q5_K_M). The universal default.
  • EXL2: Fastest tokens/sec on NVIDIA at equivalent quality. NVIDIA-only. Best choice if you have an RTX card and want maximum speed.
  • AWQ: Best format for vLLM multi-user deployments. Better quality than GPTQ at the same bit width. NVIDIA-focused.
  • GPTQ: Older NVIDIA-only format. Still widely available on HuggingFace. Use AWQ or EXL2 instead if you have a choice.

You're on HuggingFace looking at a model page. There are 30 files. Some say .gguf. Some say -GPTQ. Others -AWQ or -EXL2. The filenames look like Q4_K_M and 4bpw and gptq-4bit-128g. It's genuinely confusing if you haven't mapped out what each system is and what it runs on.

This is the practical guide. No academic theory — just which format to download for your hardware and use case.

Why Quantization Exists

Full-precision language models (FP16 or BF16) are large. Llama 3.1 70B at FP16 is roughly 140 GB. Almost nobody has that much VRAM. Quantization reduces the bit-depth of model weights — from 16-bit floats to 8-bit, 4-bit, or even 2-bit integers — which shrinks the model by 2-8x with modest quality loss.
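These size figures are simple arithmetic you can sanity-check yourself: file size ≈ parameter count × bits per weight ÷ 8. A minimal Python sketch (weight storage only; real files add metadata and keep a few tensors at higher precision):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters * bits-per-weight / 8 bytes, in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = model_size_gb(70e9, 16)   # Llama 3.1 70B at FP16
q4 = model_size_gb(70e9, 4)      # same model quantized to 4-bit
print(fp16, q4, fp16 / q4)       # 140.0 35.0 4.0
```

The 4x shrink at 4-bit sits in the middle of the 2-8x range quoted above (8-bit gives 2x, 2-bit gives 8x).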

The four main quantization formats (GGUF, GPTQ, AWQ, EXL2) are different solutions to the same problem: make large models fit in available memory while preserving as much quality as possible and maximizing inference speed.

They differ in: which hardware they run on, how they quantize (affecting quality), and which inference runtimes support them.

GGUF: The Universal Format

GGUF is the file format for llama.cpp — the most widely-used local inference engine. If you use Ollama or LM Studio, you're already running GGUF under the hood.

Hardware support:

  • NVIDIA (CUDA)
  • AMD (ROCm and Vulkan)
  • Apple Silicon (Metal)
  • Intel Arc (SYCL)
  • CPU-only (no GPU at all)
  • Mixed CPU+GPU (split layers across both)

This is GGUF's defining advantage. It's the only format that works on all hardware with a single file.

Quantization variants — the K-quants:

The modern GGUF files use K-quant variants, which are meaningfully better than older linear quantization at the same bit depth:

Variant and quality:

  • Q2_K: Noticeably degraded
  • Q3_K_M: Acceptable for large models
  • Q4_K_M: Good, recommended default
  • Q5_K_M: Very good
  • Q6_K: Excellent, near-lossless
  • Q8_0: Near-lossless

Q4_K_M is the starting point for most users. It fits in the most GPU configurations and degrades quality minimally vs Q8_0.

When to go higher: If your model fits comfortably in VRAM at Q4_K_M with 4+ GB to spare for KV cache, consider Q5_K_M or Q6_K. The quality difference is noticeable on tasks requiring precise reasoning or code generation.

When to go lower: If you're trying to fit a 13B model in 8 GB of VRAM, Q3_K_M gets it close. For 70B models, Q4_K_M (~40 GB) fits on a 48 GB system (dual 24 GB GPUs), while Q3_K_M (~28 GB) can run on a single GPU with 30+ GB of VRAM.

Works with: Ollama, LM Studio, llama.cpp directly. For a step-by-step download workflow, see our HuggingFace GGUF download guide. When choosing between runtimes to load your GGUF file, see Ollama vs LM Studio vs llama.cpp vs vLLM.

EXL2: Maximum NVIDIA Performance

EXL2 is the quantization format for ExLlamaV2, a CUDA-optimized inference library. It uses a different quantization approach — mixed-precision per-layer — that achieves better quality at a given file size compared to uniform quantization.

Hardware support: NVIDIA only (CUDA). No AMD, no Apple, no CPU fallback.

EXL2 naming convention: Files are named by bits-per-weight (bpw) — 4.0bpw, 4.65bpw, 6.0bpw, 8.0bpw. The non-integer values come from mixed-precision: some layers get 3-bit, others 5-bit, optimized per-layer based on sensitivity.
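The bpw number maps directly to an approximate download size: parameters × bpw ÷ 8 bytes. A hedged sketch (weights only; actual EXL2 files run somewhat larger once higher-precision layers and metadata are included):

```python
def exl2_weights_gb(n_params: float, bpw: float) -> float:
    """Approximate weight storage for an EXL2 quant at a given bits-per-weight."""
    return n_params * bpw / 8 / 1e9

# Common bpw points for a 13B-class model
for bpw in (4.0, 4.65, 6.0, 8.0):
    print(f"{bpw}bpw: about {exl2_weights_gb(13e9, bpw):.1f} GB of weights")
```

Note that the 4.65bpw estimate (~7.6 GB of weights) undershoots real file sizes; the worked example later in this guide budgets ~9.5 GB for the same quant once overhead is counted.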

Speed comparison on an RTX 4090 (13B-class model, 4-bit quants):

  • GGUF Q4_K_M: ~90-100 t/s
  • EXL2 4.65bpw: ~110-130 t/s
  • GPTQ 4-bit: ~60-70 t/s
  • AWQ 4-bit: ~75-85 t/s

EXL2 runs on ExLlamaV2 or its front-end TabbyAPI, which provides an OpenAI-compatible API. It's not supported by Ollama or LM Studio natively.

Pick EXL2 if: You have an NVIDIA RTX GPU, want the highest tokens/sec, and are comfortable running TabbyAPI or ExLlamaV2 directly. The setup is more involved than Ollama but the performance gain is real.
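Because TabbyAPI exposes an OpenAI-compatible endpoint, any plain HTTP client can drive it. A sketch using only the standard library; the port (5000 is TabbyAPI's default) and the model name here are assumptions to adapt to your setup:

```python
import json

def build_chat_request(base_url: str, model: str, prompt: str):
    """Assemble an OpenAI-style /v1/chat/completions request (built, not sent)."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,                                   # name is illustrative
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return url, json.dumps(payload).encode("utf-8")

url, body = build_chat_request("http://localhost:5000", "my-exl2-model", "Hello!")
# To actually send it (requires a running TabbyAPI instance):
#   import urllib.request
#   req = urllib.request.Request(url, data=body,
#                                headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```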

Worked example: 13B model on RTX 4060 Ti 16GB — EXL2 4.65bpw (~9.5 GB) gives ~15-20% more tokens/sec than GGUF Q4_K_M (~8 GB) while using similar VRAM.

AWQ: The vLLM Standard

AWQ (Activation-Aware Weight Quantization) is a 4-bit NVIDIA quantization method that determines which weights to quantize most aggressively based on activation magnitudes. The result is better quality than GPTQ at equivalent bit depth, especially on reasoning and instruction-following tasks.

Hardware support: Primarily NVIDIA (CUDA). Experimental AMD support in some frameworks.

Format: HuggingFace Safetensors. AWQ models are labeled -AWQ or awq in their names, often with variant markers like w4-g128 (4-bit weights, group size 128).

Use with: vLLM (best support), HuggingFace transformers with AutoAWQ library, LLaMA-Factory.

Why AWQ for vLLM: vLLM has highly optimized AWQ CUDA kernels. It achieves throughput close to FP16 while cutting weight memory to roughly a quarter. For multi-user deployments where you need to maximize concurrent users per GPU, AWQ in vLLM is the go-to.

Comparison to EXL2: EXL2 is faster for single-user sequential inference. AWQ in vLLM is faster for multi-user batched inference. Different optimization targets.

GPTQ: The Legacy Standard

GPTQ (Generative Pre-trained Transformer Quantization) was the dominant GPU quantization format before AWQ and EXL2 emerged. It's a 4-bit or 3-bit NVIDIA quantization using a layer-wise second-order approach.

Hardware support: NVIDIA (CUDA) only.

Why it still exists: Enormous catalog of pre-quantized models on HuggingFace. Many model families only have GPTQ versions available from the original release period (2023-2024). If the model you want doesn't have AWQ or EXL2 versions, GPTQ is the fallback.

Use with: auto-gptq library, transformers with GPTQ support, some vLLM builds.

Should you choose GPTQ today? Only if AWQ or EXL2 versions aren't available. Both offer better quality-per-bit. GPTQ is the format of last resort.

The Decision Table

Your Situation → Use This Format

  • CPU-only inference: GGUF Q4_K_M
  • AMD GPU: GGUF Q4_K_M
  • Apple Silicon (Mac): GGUF Q4_K_M
  • NVIDIA GPU, single user, maximum speed: EXL2
  • NVIDIA GPU, multi-user vLLM serving: AWQ
  • No AWQ or EXL2 version available: GPTQ
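The same decision logic fits in a few lines of code. A sketch condensing this guide's recommendations (the function and its categories are mine, not a standard API):

```python
def pick_format(hardware: str, multi_user: bool = False,
                need_max_speed: bool = False) -> str:
    """Condense this guide's format recommendations into one lookup (illustrative)."""
    if hardware != "nvidia":          # CPU, AMD, Apple Silicon, Intel Arc
        return "GGUF Q4_K_M"
    if multi_user:                    # batched serving via vLLM
        return "AWQ"
    if need_max_speed:                # single user on TabbyAPI/ExLlamaV2
        return "EXL2"
    return "GGUF Q4_K_M"              # simplest NVIDIA default

print(pick_format("apple"))                    # GGUF Q4_K_M
print(pick_format("nvidia", multi_user=True))  # AWQ
```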

Worked Example: 13B Model on RTX 4060 Ti 16GB

The RTX 4060 Ti 16GB is a popular LLM card. Here's what to run for a 13B-class model:

Option A: GGUF Q4_K_M (Ollama or LM Studio)

  • File size: ~8 GB
  • Remaining VRAM for KV cache: ~7 GB
  • Context safe up to: ~8,192 tokens
  • Speed: ~80-100 t/s on RTX 4060 Ti 16GB
  • Setup: ollama pull qwen2.5:14b (Ollama tags use a colon; the default 14b build is the instruct model at Q4_K_M)

Option B: EXL2 4.65bpw (TabbyAPI)

  • File size: ~9.5 GB
  • Remaining VRAM for KV cache: ~5.5 GB
  • Context safe up to: ~4,096-6,144 tokens
  • Speed: ~100-120 t/s on RTX 4060 Ti 16GB
  • Setup: Install TabbyAPI, download EXL2 model from HuggingFace

For most users, Option A is the right call. The 15-20% speed difference rarely matters in practice, and GGUF's setup is significantly simpler. Option B makes sense if you're doing high-volume generation and every second counts.

Choosing the Right Quantization Level Within a Format

Once you've picked the format, pick the quantization level by asking: "What's my VRAM budget for this model?"

For GGUF on a target model:

  1. Check the Q4_K_M file size
  2. Subtract from your total VRAM
  3. Remaining VRAM ÷ 0.3 GB ≈ max context in thousands of tokens (very rough; KV cache per token varies by model architecture)
  4. If remaining VRAM is less than 2 GB, go to Q3_K_M or reduce context
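Those four steps can be sketched as one small function, assuming ~0.3 GB of KV cache per 1,000 tokens (a very rough figure that varies by model):

```python
def gguf_budget(total_vram_gb: float, q4_file_gb: float):
    """Steps 1-4 above: VRAM left after weights, rough max context, downgrade flag."""
    remaining = total_vram_gb - q4_file_gb          # step 2
    max_context = int(remaining / 0.3) * 1000       # step 3: ~0.3 GB per 1K tokens
    drop_to_q3 = remaining < 2.0                    # step 4
    return remaining, max_context, drop_to_q3

# RTX 4060 Ti 16GB with an ~8 GB Q4_K_M file (the worked example above)
rem, ctx, drop = gguf_budget(16.0, 8.0)
print(rem, ctx, drop)   # 8.0 26000 False
```

Treat the context estimate as an upper bound; models with large KV caches can consume several times more per token.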

For EXL2:

  1. 4.65bpw is the practical default
  2. 6.0bpw if you have headroom and want better quality
  3. 3.0bpw if you're trying to fit a large model in limited VRAM

The format decision comes first. The quantization level decision is tuning within that format.

For understanding how VRAM fills up during inference — especially during long conversations — see our KV cache and VRAM guide.

Tags: gguf, gptq, awq, quantization, local-llm
