EXL2 — Local AI Glossary | CraftRigs

EXL2 is a GPU-first quantization format produced by the ExLlamaV2 inference engine. It's the format you reach for when the model fits entirely in VRAM and you want maximum tokens per second on an NVIDIA card.

How EXL2 Differs From GGUF and GPTQ

Unlike GGUF, which is built around splitting inference between CPU and GPU and offloading only as many layers as fit in VRAM, EXL2 assumes the whole model lives on the GPU. It's a successor to GPTQ in spirit, with a key twist: EXL2 supports fractional and mixed bits-per-weight (bpw) within a single model. Instead of a flat 4-bit or 8-bit quant, you can target an average like 4.65 bpw, with sensitive layers held at higher precision and tolerant layers pushed lower. The format is produced and consumed by ExLlamaV2, often paired with frontends or API servers such as text-generation-webui or TabbyAPI.
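To make the "average bpw" idea concrete, here is a minimal sketch of how a fractional figure can fall out of mixing precisions. It is illustrative arithmetic only, not how ExLlamaV2 actually chooses per-layer precision; the function name `average_bpw` and the layer sizes are made up for the example.

```python
def average_bpw(layers):
    """Weighted-average bits per weight.

    layers: list of (num_weights, bits_per_weight) pairs.
    Both the function and the data below are hypothetical.
    """
    total_bits = sum(n * bits for n, bits in layers)
    total_weights = sum(n for n, _ in layers)
    return total_bits / total_weights


# One quarter of the weights kept at 6 bits, the rest pushed to 4 bits.
example_layers = [
    (1_000_000, 6.0),  # "sensitive" layers, higher precision
    (3_000_000, 4.0),  # "tolerant" layers, lower precision
]

print(average_bpw(example_layers))
```

The point is that the headline bpw is an average over the whole model, so it does not have to be a whole number.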

Tradeoffs for Local Builds

EXL2's advantage is raw decode speed on CUDA hardware: it's typically the fastest option for single-user inference on a 3090, 4090, or 5090. The cost is portability. There's no meaningful CPU fallback, no Apple Silicon path, and ROCm support lags behind NVIDIA. If your model overflows VRAM by even a few hundred megabytes, you're better off on GGUF with llama.cpp, which can keep the overflow layers in system RAM, than trying to force EXL2 to fit. AWQ and GPTQ remain more common on shared inference stacks; EXL2 lives mostly in the enthusiast local-LLM corner, where the user owns the GPU and tunes the bpw to fit it.
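The rule of thumb above can be written as a tiny decision helper. This is a sketch of the heuristic as described here, not a benchmark-backed selector; `pick_format`, its parameters, and the 2 GB default reserve for KV cache and runtime overhead are all assumptions for illustration.

```python
def pick_format(model_bytes, vram_bytes, nvidia_gpu=True, reserve_bytes=2e9):
    """Suggest a format using the heuristic from the text.

    model_bytes:   size of the quantized weights
    vram_bytes:    total VRAM available
    reserve_bytes: assumed headroom for KV cache and overhead (a guess)
    """
    fits_entirely = model_bytes + reserve_bytes <= vram_bytes
    # EXL2 only makes sense when everything is GPU-resident on NVIDIA.
    if nvidia_gpu and fits_entirely:
        return "EXL2"
    return "GGUF"


print(pick_format(20e9, 24e9))                    # fits with headroom
print(pick_format(23.5e9, 24e9))                  # overflows once headroom is counted
print(pick_format(10e9, 24e9, nvidia_gpu=False))  # non-NVIDIA hardware
```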

Why It Matters for Local AI

EXL2's fractional-bit targeting is what lets a 70B model fit on a single 24GB card at all: you pick the highest bpw that fits your VRAM budget after leaving headroom for the KV cache, which for 70B on 24GB lands in the low-2-bpw range rather than at a fixed 2-bit or 3-bit step. For builders running dual-GPU rigs (e.g., 2×RTX 4090) on dense models, EXL2 squeezes more usable quality out of a fixed VRAM ceiling than fixed-bit formats. If your workflow is GPU-resident, single-user, and speed-sensitive, EXL2 is usually the right format; if you need flexibility across CPU, Mac, or mixed hardware, GGUF wins.
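The sizing logic is simple arithmetic: weight memory is roughly parameters × bpw ÷ 8 bytes. The sketch below works that out in both directions. It assumes decimal gigabytes (1 GB = 1e9 bytes) and a hypothetical 3 GB reserve for KV cache and overhead; real headroom depends on context length and the loader, so treat the numbers as ballpark figures.

```python
def weight_bytes(params, bpw):
    """Approximate size of quantized weights in bytes."""
    return params * bpw / 8


def max_bpw(params, vram_bytes, reserve_bytes):
    """Highest average bpw whose weights fit after reserving headroom."""
    return (vram_bytes - reserve_bytes) * 8 / params


PARAMS_70B = 70e9
GB = 1e9  # decimal gigabytes, an assumption for round numbers

# Single 24 GB card, with an assumed 3 GB kept free for KV cache etc.
print(round(max_bpw(PARAMS_70B, 24 * GB, 3 * GB), 2))

# 70B at 4.65 bpw, to compare against a 48 GB dual-GPU rig.
print(round(weight_bytes(PARAMS_70B, 4.65) / GB, 2))
```

Under these assumptions a 70B model needs about 2.4 bpw to fit a 24GB card, while 4.65 bpw comes to roughly 40.7 GB of weights, which is why that target suits a 48GB dual-GPU setup.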