Software & Tools

ExLlamaV2

High-performance inference library optimized for NVIDIA GPUs, known for fast quantized inference and support for EXL2 quantization format.

ExLlamaV2 is an inference library specifically optimized for NVIDIA GPUs, developed by turboderp. It is known for delivering some of the highest throughput of any local inference software on NVIDIA hardware, particularly for quantized models. ExLlamaV2 introduces its own quantization format (EXL2) that optimizes weight storage for its custom CUDA kernels.

Performance Characteristics

ExLlamaV2 frequently benchmarks faster than llama.cpp on NVIDIA hardware for single-user inference. On an RTX 4090, ExLlamaV2 often shows roughly 10–20% higher tokens/second than an equivalent llama.cpp GGUF model at a similar ~4-bit quantization, thanks to custom CUDA kernels tuned for quantized matrix multiplication on NVIDIA GPUs.

The performance advantage is most pronounced for larger models where memory bandwidth is the bottleneck, and less significant for small models where the overhead difference is minimal.
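The bandwidth bottleneck can be made concrete with a back-of-the-envelope calculation: during single-stream decoding, generating each token requires streaming (roughly) all model weights from VRAM once, so memory bandwidth caps tokens/second. The GPU bandwidth and model-size figures below are illustrative assumptions, not benchmarks:

```python
# Rough upper bound on single-stream decode speed: every generated token
# must stream all weights from VRAM, so tokens/s <= bandwidth / model size.
# Figures below are illustrative assumptions, not measurements.

params = 7e9              # 7B-parameter model (assumed)
bits_per_weight = 4.5     # a typical EXL2 average bit depth
bandwidth_gb_s = 1008     # RTX 4090 rated memory bandwidth (GB/s)

model_gb = params * bits_per_weight / 8 / 1e9   # weight bytes in GB
max_tok_s = bandwidth_gb_s / model_gb           # theoretical ceiling

print(f"model size: {model_gb:.2f} GB, ceiling: {max_tok_s:.0f} tok/s")
```

Real throughput lands well below this ceiling because of kernel launch overhead and KV-cache traffic, but the bound shows why lower average bit depth directly raises attainable tokens/second, and why kernel efficiency matters most for large models.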

EXL2 Format

ExLlamaV2's native quantization format (EXL2) uses a calibration-based approach to assign different bit depths to different weights based on their importance to model quality. This produces better quality at the same average bit depth compared to uniform quantization schemes.
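The "average bit depth" of an EXL2 model is a weighted average across weights that may each be stored at different precisions. A toy sketch of that arithmetic (the layer names, sizes, and bit assignments are hypothetical; real EXL2 assigns rates to weight groups within tensors based on measured quantization error, not per whole layer):

```python
# Toy illustration of a mixed-precision bit budget like EXL2's: sensitive
# tensors get more bits, and the headline figure is the weighted average.
# Layer names/sizes/bit depths here are made up for illustration.

layers = [
    # (name, weight count, assigned bits per weight)
    ("attn_qkv", 1_000, 6.0),   # sensitive: keep higher precision
    ("mlp_up",   3_000, 4.0),   # more tolerant: quantize harder
]

total_bits    = sum(n * b for _, n, b in layers)
total_weights = sum(n for _, n, _ in layers)
avg_bpw = total_bits / total_weights

print(f"average: {avg_bpw:.2f} bpw")   # 4.50 for this mix
```

This is why EXL2 repositories advertise fractional bit rates such as 4.65 bpw or 5.0 bpw: the average can sit anywhere between the discrete precisions used internally.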

EXL2 models are available for download on Hugging Face (look for repos with "EXL2" in the name) and can be quantized locally using ExLlamaV2's conversion tools.
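Local quantization runs through the `convert.py` script in the ExLlamaV2 repository. A hedged sketch of the invocation (paths are placeholders, and the exact flags may differ between versions, so check `python convert.py -h` first):

```shell
# Clone the library and run its conversion script on an FP16 model.
git clone https://github.com/turboderp/exllamav2
cd exllamav2

# -i:  input directory holding the unquantized (FP16) Hugging Face model
# -o:  working directory for the calibration/measurement pass
# -cf: output directory for the finished EXL2 model
# -b:  target average bits per weight (e.g. 4.5)
python convert.py -i /path/to/fp16-model -o /path/to/work -cf /path/to/out-exl2 -b 4.5
```

The calibration pass is the slow part: the script measures quantization error on sample data before committing to a bit allocation, which is what lets EXL2 hit a chosen average bit depth with minimal quality loss.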

TabbyAPI and text-generation-webui Integration

ExLlamaV2 is the inference engine behind TabbyAPI (an OpenAI-compatible API server built on ExLlamaV2) and is available as one of the model loaders in text-generation-webui. For applications served through these front ends, ExLlamaV2's performance advantage translates directly into faster generation. (KoboldCpp, by contrast, is built on llama.cpp and uses GGUF models, not EXL2.)

When to Use ExLlamaV2 vs. llama.cpp

Use ExLlamaV2 when:

  • You have NVIDIA hardware and want maximum throughput
  • You're running TabbyAPI or text-generation-webui
  • You're willing to download or quantize EXL2 models

Use llama.cpp when:

  • You need cross-platform support (AMD, Apple Silicon, CPU)
  • You want the widest model compatibility via GGUF
  • You're using Ollama or LM Studio (which use llama.cpp internally)

ExLlamaV2 is effectively NVIDIA-only. For AMD or Apple Silicon, llama.cpp or MLX are the main alternatives.