CraftRigs
Architecture Guide

GGUF vs GPTQ vs AWQ vs EXL2: Which Quantization Format Should You Use?

By Georgia Thomas · 6 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Quick Summary:

  • GGUF: Runs on everything — CPU, NVIDIA, AMD, Apple Silicon. Best hardware compatibility. Use K-quant variants (Q4_K_M, Q5_K_M). The universal default.
  • EXL2: Fastest tokens/sec on NVIDIA at equivalent quality. NVIDIA-only. Best choice if you have an RTX card and want maximum speed.
  • AWQ: Best format for vLLM multi-user deployments. Better quality than GPTQ at the same bit width. NVIDIA-focused.
  • GPTQ: Older NVIDIA-only format. Still widely available on HuggingFace. Use AWQ or EXL2 instead if you have a choice.

You're on HuggingFace looking at a model page. There are 30 files. Some say .gguf. Some say -GPTQ. Others -AWQ or -EXL2. The filenames look like Q4_K_M and 4bpw and gptq-4bit-128g. It's genuinely confusing if you haven't mapped out what each system is and what it runs on.

This is the practical guide. No academic theory — just which format to download for your hardware and use case.

Why Quantization Exists

Full-precision language models (FP16 or BF16) are large. Llama 3.1 70B at FP16 is roughly 140 GB. Almost nobody has that much VRAM. Quantization reduces the bit-depth of model weights — from 16-bit floats to 8-bit, 4-bit, or even 2-bit integers — which shrinks the model by 2-8x with modest quality loss.
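These size figures are simple arithmetic you can sanity-check yourself: file size ≈ parameter count × bits per weight ÷ 8. A minimal Python sketch (weight storage only; real files add metadata and keep a few tensors at higher precision):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters * bits-per-weight / 8 bytes, in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = model_size_gb(70e9, 16)   # Llama 3.1 70B at FP16
q4 = model_size_gb(70e9, 4)      # same model quantized to 4-bit
print(fp16, q4, fp16 / q4)       # 140.0 35.0 4.0
```

The 4x shrink at 4-bit sits in the middle of the 2-8x range quoted above (8-bit gives 2x, 2-bit gives 8x).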

The four main quantization formats (GGUF, GPTQ, AWQ, EXL2) are different solutions to the same problem: make large models fit in available memory while preserving as much quality as possible and maximizing inference speed.

They differ in: which hardware they run on, how they quantize (affecting quality), and which inference runtimes support them.

GGUF: The Universal Format

GGUF is the file format for llama.cpp — the most widely-used local inference engine. If you use Ollama or LM Studio, you're already running GGUF under the hood.

Hardware support:

  • NVIDIA (CUDA)
  • AMD (ROCm and Vulkan)
  • Apple Silicon (Metal)
  • Intel Arc (SYCL)
  • CPU-only (no GPU at all)
  • Mixed CPU+GPU (split layers across both)

This is GGUF's defining advantage. It's the only format that works on all hardware with a single file.

Quantization variants — the K-quants:

The modern GGUF files use K-quant variants, which are meaningfully better than older linear quantization at the same bit depth:

Variant and quality:

  • Q2_K: Noticeably degraded
  • Q3_K_M: Acceptable for large models
  • Q4_K_M: Good, recommended default
  • Q5_K_M: Very good
  • Q6_K: Excellent, near-lossless
  • Q8_0: Near-lossless

Q4_K_M is the starting point for most users. It fits in the most GPU configurations and degrades quality minimally vs Q8_0.

When to go higher: If your model fits comfortably in VRAM at Q4_K_M with 4+ GB to spare for KV cache, consider Q5_K_M or Q6_K. The quality difference is noticeable on tasks requiring precise reasoning or code generation.

When to go lower: If you're trying to fit a 13B model in 8 GB of VRAM, Q3_K_M gets it close. For 70B models, Q4_K_M (~40 GB) fits on a 48 GB system (dual 24 GB GPUs), while Q3_K_M (~28 GB) can run on a single GPU with 30+ GB of VRAM.

Works with: Ollama, LM Studio, llama.cpp directly. For a step-by-step download workflow, see our HuggingFace GGUF download guide. When choosing between runtimes to load your GGUF file, see Ollama vs LM Studio vs llama.cpp vs vLLM.

EXL2: Maximum NVIDIA Performance

EXL2 is the quantization format for ExLlamaV2, a CUDA-optimized inference library. It uses a different quantization approach — mixed-precision per-layer — that achieves better quality at a given file size compared to uniform quantization.

Hardware support: NVIDIA only (CUDA). No AMD, no Apple, no CPU fallback.

EXL2 naming convention: Files are named by bits-per-weight (bpw) — 4.0bpw, 4.65bpw, 6.0bpw, 8.0bpw. The non-integer values come from mixed-precision: some layers get 3-bit, others 5-bit, optimized per-layer based on sensitivity.
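The bpw number maps directly to an approximate download size: parameters × bpw ÷ 8 bytes. A hedged sketch (weights only; actual EXL2 files run somewhat larger once higher-precision layers and metadata are included):

```python
def exl2_weights_gb(n_params: float, bpw: float) -> float:
    """Approximate weight storage for an EXL2 quant at a given bits-per-weight."""
    return n_params * bpw / 8 / 1e9

# Common bpw points for a 13B-class model
for bpw in (4.0, 4.65, 6.0, 8.0):
    print(f"{bpw}bpw: about {exl2_weights_gb(13e9, bpw):.1f} GB of weights")
```

Note that the 4.65bpw estimate (~7.6 GB of weights) undershoots real file sizes; the worked example later in this guide budgets ~9.5 GB for the same quant once overhead is counted.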

Speed comparison on an RTX 4090 (13B-class model, 4-bit quants):

  • GGUF Q4_K_M: ~90-100 t/s
  • EXL2 4.65bpw: ~110-130 t/s
  • GPTQ 4-bit: ~60-70 t/s
  • AWQ 4-bit: ~75-85 t/s

EXL2 runs on ExLlamaV2 or its front-end TabbyAPI, which provides an OpenAI-compatible API. It's not supported by Ollama or LM Studio natively.

Pick EXL2 if: You have an NVIDIA RTX GPU, want the highest tokens/sec, and are comfortable running TabbyAPI or ExLlamaV2 directly. The setup is more involved than Ollama but the performance gain is real.
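Because TabbyAPI exposes an OpenAI-compatible endpoint, any plain HTTP client can drive it. A sketch using only the standard library; the port (5000 is TabbyAPI's default) and the model name here are assumptions to adapt to your setup:

```python
import json

def build_chat_request(base_url: str, model: str, prompt: str):
    """Assemble an OpenAI-style /v1/chat/completions request (built, not sent)."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,                                   # name is illustrative
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return url, json.dumps(payload).encode("utf-8")

url, body = build_chat_request("http://localhost:5000", "my-exl2-model", "Hello!")
# To actually send it (requires a running TabbyAPI instance):
#   import urllib.request
#   req = urllib.request.Request(url, data=body,
#                                headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
```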

Worked example: 13B model on RTX 4060 Ti 16GB — EXL2 4.65bpw (~9.5 GB) gives ~15-20% more tokens/sec than GGUF Q4_K_M (~8 GB) while using similar VRAM.

AWQ: The vLLM Standard

AWQ (Activation-Aware Weight Quantization) is a 4-bit NVIDIA quantization method that determines which weights to quantize most aggressively based on activation magnitudes. The result is better quality than GPTQ at equivalent bit depth, especially on reasoning and instruction-following tasks.

Hardware support: Primarily NVIDIA (CUDA). Experimental AMD support in some frameworks.

Format: HuggingFace Safetensors. AWQ models are labeled -AWQ or awq in their names, often with variant markers like w4-g128 (4-bit weights, group size 128).

Use with: vLLM (best support), HuggingFace transformers with AutoAWQ library, LLaMA-Factory.

Why AWQ for vLLM: vLLM has highly optimized AWQ CUDA kernels. It achieves throughput close to FP16 while cutting weight memory to roughly a quarter. For multi-user deployments where you need to maximize concurrent users per GPU, AWQ in vLLM is the go-to.

Comparison to EXL2: EXL2 is faster for single-user sequential inference. AWQ in vLLM is faster for multi-user batched inference. Different optimization targets.

GPTQ: The Legacy Standard

GPTQ (Generative Pre-trained Transformer Quantization) was the dominant GPU quantization format before AWQ and EXL2 emerged. It's a 4-bit or 3-bit NVIDIA quantization using a layer-wise second-order approach.

Hardware support: NVIDIA (CUDA) only.

Why it still exists: Enormous catalog of pre-quantized models on HuggingFace. Many model families only have GPTQ versions available from the original release period (2023-2024). If the model you want doesn't have AWQ or EXL2 versions, GPTQ is the fallback.

Use with: auto-gptq library, transformers with GPTQ support, some vLLM builds.

Should you choose GPTQ today? Only if AWQ or EXL2 versions aren't available. Both offer better quality-per-bit. GPTQ is the format of last resort.

The Decision Table

Your Situation → Use This Format

  • CPU-only inference: GGUF Q4_K_M
  • AMD GPU: GGUF Q4_K_M
  • Apple Silicon (Mac): GGUF Q4_K_M
  • NVIDIA GPU, single user, maximum speed: EXL2
  • NVIDIA GPU, multi-user vLLM serving: AWQ
  • No AWQ or EXL2 version available: GPTQ
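The same decision logic fits in a few lines of code. A sketch condensing this guide's recommendations (the function and its categories are mine, not a standard API):

```python
def pick_format(hardware: str, multi_user: bool = False,
                need_max_speed: bool = False) -> str:
    """Condense this guide's format recommendations into one lookup (illustrative)."""
    if hardware != "nvidia":          # CPU, AMD, Apple Silicon, Intel Arc
        return "GGUF Q4_K_M"
    if multi_user:                    # batched serving via vLLM
        return "AWQ"
    if need_max_speed:                # single user on TabbyAPI/ExLlamaV2
        return "EXL2"
    return "GGUF Q4_K_M"              # simplest NVIDIA default

print(pick_format("apple"))                    # GGUF Q4_K_M
print(pick_format("nvidia", multi_user=True))  # AWQ
```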

Worked Example: 13B Model on RTX 4060 Ti 16GB

The RTX 4060 Ti 16GB is a popular LLM card. Here's what to run for a 13B-class model:

Option A: GGUF Q4_K_M (Ollama or LM Studio)

  • File size: ~8 GB
  • Remaining VRAM for KV cache: ~7 GB
  • Context safe up to: ~8,192 tokens
  • Speed: ~80-100 t/s on RTX 4060 Ti 16GB
  • Setup: ollama pull qwen2.5:14b (Ollama tags use a colon; the default 14b build is the instruct model at Q4_K_M)

Option B: EXL2 4.65bpw (TabbyAPI)

  • File size: ~9.5 GB
  • Remaining VRAM for KV cache: ~5.5 GB
  • Context safe up to: ~4,096-6,144 tokens
  • Speed: ~100-120 t/s on RTX 4060 Ti 16GB
  • Setup: Install TabbyAPI, download EXL2 model from HuggingFace

For most users, Option A is the right call. The 15-20% speed difference rarely matters in practice, and GGUF's setup is significantly simpler. Option B makes sense if you're doing high-volume generation and every second counts.

Choosing the Right Quantization Level Within a Format

Once you've picked the format, pick the quantization level by asking: "What's my VRAM budget for this model?"

For GGUF on a target model:

  1. Check the Q4_K_M file size
  2. Subtract from your total VRAM
  3. Remaining VRAM ÷ 0.3 GB ≈ max context in thousands of tokens (very rough; KV cache per token varies by model architecture)
  4. If remaining VRAM is less than 2 GB, go to Q3_K_M or reduce context
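Those four steps can be sketched as one small function, assuming ~0.3 GB of KV cache per 1,000 tokens (a very rough figure that varies by model):

```python
def gguf_budget(total_vram_gb: float, q4_file_gb: float):
    """Steps 1-4 above: VRAM left after weights, rough max context, downgrade flag."""
    remaining = total_vram_gb - q4_file_gb          # step 2
    max_context = int(remaining / 0.3) * 1000       # step 3: ~0.3 GB per 1K tokens
    drop_to_q3 = remaining < 2.0                    # step 4
    return remaining, max_context, drop_to_q3

# RTX 4060 Ti 16GB with an ~8 GB Q4_K_M file (the worked example above)
rem, ctx, drop = gguf_budget(16.0, 8.0)
print(rem, ctx, drop)   # 8.0 26000 False
```

Treat the context estimate as an upper bound; models with large KV caches can consume several times more per token.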

For EXL2:

  1. 4.65bpw is the practical default
  2. 6.0bpw if you have headroom and want better quality
  3. 3.0bpw if you're trying to fit a large model in limited VRAM

The format decision comes first. The quantization level decision is tuning within that format.

For understanding how VRAM fills up during inference — especially during long conversations — see our KV cache and VRAM guide.

Tags: gguf, gptq, awq, quantization, local-llm
