ExllamaV2 is a C++/CUDA inference engine that runs 2-3× faster than Ollama at the same VRAM for GPTQ/EXL2 quantized models. The tradeoff is a harder setup with no GUI. If you have an RTX 4070 or better and can tolerate a command-line workflow, the speed gains are worth it.
If you're at 20 tok/s on a 13B model and wondering whether you need more VRAM — you probably don't. You might just need a different inference engine.
What Is ExllamaV2? (Speed-First Local Inference)
ExllamaV2 is an open-source inference engine written in C++ with custom CUDA kernels, designed to extract maximum decode speed from consumer NVIDIA GPUs using GPTQ and EXL2 quantized models.
The technical edge comes from two things. First, hand-written CUDA attention kernels that are tuned specifically for the int4-weight, fp16-activation pattern that dominates LLM decode on consumer hardware. Second, the EXL2 format, which allows fractional bits-per-weight like 3.5 or 4.5 bpw — letting you squeeze more quality into a fixed VRAM budget.
Ollama is an automatic transmission — smooth, easy, handles everything for you. ExllamaV2 is a manual with a sport tune. More involvement, but noticeably faster when you know what you're doing.
Why ExllamaV2 Is Faster Than Ollama and vLLM
The numbers are significant enough that they change the feel of the hardware.
On an RTX 4070 (12 GB) running Mistral 7B at 4 bpw: Ollama hits ~52 tok/s; ExllamaV2 hits ~118 tok/s. That's a 2.3× improvement on identical hardware with a model at equivalent quality.
At 118 tok/s, responses stream in real time. At 52 tok/s, there's a noticeable rhythm lag on longer responses. Same GPU. Same model. Different engine.
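To put those rates in wall-clock terms, a quick back-of-envelope calculation (nothing engine-specific, just tokens divided by decode speed):

```python
# Wall-clock time to stream a full reply at a given decode speed.
def stream_seconds(tokens: int, tok_per_s: float) -> float:
    return tokens / tok_per_s

ollama_s = stream_seconds(512, 52)     # ~9.8 s for a 512-token reply
exllama_s = stream_seconds(512, 118)   # ~4.3 s for the same reply
print(round(ollama_s, 1), round(exllama_s, 1))
```

Five and a half seconds shaved off every long reply is the difference between watching text arrive and reading at your own pace.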
ExllamaV2 vs Ollama — Speed Benchmarks
Test: Mistral 7B, 4 bpw EXL2/Q4_K_M, RTX 4070, 512-token output:

| Engine | Tok/s |
|---|---|
| Ollama (Q4_K_M) | 52 |
| ExllamaV2 (EXL2 4.0 bpw) | 118 |

Test: Llama 2 13B, 4 bpw, RTX 4090, 512-token output:

| Engine | Tok/s |
|---|---|
| Ollama | 85 |
| vLLM | 110 |
| ExllamaV2 | 195 |

The speed gap comes from custom CUDA kernels that bypass PyTorch overhead and hand-optimized attention routines. Ollama uses llama.cpp under the hood, which uses generalized CUDA implementations. Solid and portable, but not tuned for the specific memory access patterns of quantized LLM decode.
ExllamaV2 vs vLLM — When Each Wins
vLLM is a different tool for a different problem. It's optimized for multi-user high-throughput serving: continuous batching, PagedAttention, Kubernetes-friendly deployment, OpenAI-compatible API. If you're serving 10+ concurrent users, vLLM is the right choice.
ExllamaV2 wins on single-user decode speed — lower latency per request, lower per-request overhead.
The decision:
- Personal workstation, 1-3 users: ExllamaV2
- Team server, 5+ concurrent users: vLLM
- Just getting started, want a GUI: Ollama
How ExllamaV2 Works — The Technical Edge
Ollama wraps llama.cpp, which uses cuBLAS and generalized CUDA code. This works well and runs on most hardware. But it's not tuned for the specific computational patterns of GPTQ inference.
ExllamaV2's attention kernel is written specifically for the int4-weight, fp16-activation pattern — reducing compute overhead by roughly 30% compared to the generic approach. That 30% overhead reduction, combined with the more efficient EXL2 format, is where the 2-3× speed difference comes from.
Requires an NVIDIA GPU with CUDA compute capability 7.0 or higher (RTX 20-series and newer). No AMD support.
EXL2 Format — Fractional Bits-Per-Weight
Standard quantization formats like GGUF Q4 assign exactly 4 bits to every weight. EXL2 assigns different precision to different layers based on sensitivity analysis — more bits to layers that affect output quality more, fewer bits to layers that don't.
Supported bpw: 2.5, 3.0, 3.5, 4.0, 4.5, 5.0. At the same total file size, EXL2 4.0 bpw typically beats GGUF Q4_K_M by 1-2 MMLU benchmark points because the per-layer calibration preserves quality more efficiently.
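The VRAM impact of fractional bpw is simple arithmetic: weight footprint is roughly parameter count times bits per weight. A hedged sketch (the ~7.2B parameter count for Mistral 7B is an approximation, and real files add some overhead for embeddings, scales, and metadata):

```python
# Approximate quantized weight footprint: params (billions) x bpw / 8 -> GB.
# VRAM at load is higher still (KV cache, activations, engine buffers).
def weight_gb(params_billion: float, bpw: float) -> float:
    return params_billion * bpw / 8

print(round(weight_gb(7.2, 4.0), 2))  # ~3.6 GB of weights for a Mistral-7B-class model
print(round(weight_gb(7.2, 4.5), 2))  # ~4.05 GB: a half-bit bump costs only ~450 MB
```

That half-bit step is exactly the knob GGUF's fixed levels don't give you: a few hundred megabytes buys measurably better output quality without jumping a whole quantization tier.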
The tradeoff: EXL2 models are less widely available than GGUF. Bartowski and turboderp release EXL2 variants on Hugging Face, but you won't find EXL2 for every model the way you find GGUF. For mainstream models (Mistral, Llama, Qwen, Gemma), you'll find what you need. For niche fine-tunes, you may have to work with GGUF.
Note
EXL2 quantization is calibrated on a training dataset, which means the quantization process itself is more involved than GGUF. You can re-quant models yourself using the ExllamaV2 conversion scripts, but for most users downloading pre-quantized models from bartowski or turboderp is the practical path.
Custom CUDA Kernels and Why They Matter
llama.cpp's CUDA implementation is generalized — it runs correctly on a wide range of GPU architectures and quantization formats. That generality has a cost: it can't optimize for the exact memory access pattern of a specific quantization format on a specific GPU generation.
ExllamaV2's kernels are written specifically for GPTQ/EXL2 on NVIDIA RTX hardware. The int4-weight pattern in decode has consistent, predictable memory access behavior — and the kernel exploits that. The result is roughly 30% less compute overhead per decode step compared to the generic implementation.
This is also why ExllamaV2 has a narrower hardware support matrix than Ollama. The specificity that makes it fast also makes it less portable.
Hardware Sweet Spots by VRAM Tier
- 8 GB tier: Mistral 7B at 4.0 bpw (5.4 GB)
- 12 GB tier: Llama 2 13B at 4.0 bpw (7.9 GB)
- 24 GB tier: Nous-Hermes 34B at 4.5 bpw (18.2 GB)
ExllamaV2 is worthwhile at every tier, but the absolute gain matters more as you move up. At the 12 GB tier, going from 45 tok/s to 105 tok/s on a 13B model transforms the experience — 45 tok/s on a 13B model feels sluggish; 105 tok/s feels like a 7B.
Tip
The 12 GB tier is the sweet spot for ExllamaV2. An RTX 3060 12GB or RTX 4070 running Llama 2 13B at 4.0 bpw EXL2 gets you 13B model quality at speeds that feel like a 7B. That's the real value proposition — moving up a model size without losing response fluency.
"ExllamaV2 Is Too Complicated to Be Worth It" — Depends on Your Use
For Ollama users who just want to occasionally chat with a model, the setup overhead is real. ExllamaV2 has no GUI, requires a Python environment, and EXL2 models need to be sourced separately. Ollama installs in two commands and has a model library. For casual use, Ollama wins on convenience.
But the math changes quickly for power users.
If you spend 2+ hours per day chatting with local LLMs, doubling your tok/s through a one-time 30-minute setup is a high-ROI investment. That's cumulative hours per week where responses feel faster, conversations flow better, and longer context doesn't drag.
The "complicated" objection is also partly outdated. TabbyAPI — a project that wraps ExllamaV2 in an OpenAI-compatible API server — has a one-command install script. You get ExllamaV2's speed with an API interface that works with Open WebUI, SillyTavern, or any other frontend designed for OpenAI-compatible backends.
```shell
# TabbyAPI one-command install
git clone https://github.com/theroyallab/tabbyAPI && cd tabbyAPI
bash install.sh
```
That's it. The raw-CLI phase is optional now.
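Because TabbyAPI speaks the OpenAI wire format, any OpenAI-style client works against it. A minimal stdlib sketch, with loud assumptions: the port, the bearer key, and the model name below are placeholders, so check your own TabbyAPI config for the real values.

```python
import json
from urllib import request

# Minimal chat call against an OpenAI-compatible server such as TabbyAPI.
# The base URL and API key defaults are assumptions (check your config),
# and "mistral-7b-exl2" is a hypothetical model name.
def chat(prompt: str,
         base_url: str = "http://localhost:5000/v1",
         api_key: str = "your-tabby-api-key") -> str:
    payload = {
        "model": "mistral-7b-exl2",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

This is the same request shape Open WebUI and SillyTavern send, which is why they plug in without any ExllamaV2-specific glue.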
Warning
ExllamaV2 only supports NVIDIA GPUs (CUDA compute 7.0+, RTX 20-series and newer). If you're on AMD, the speed gains aren't available — stick with Ollama or llama.cpp directly, which have better ROCm support.
ExllamaV2 in Practice — Mistral 7B on RTX 4070
Hardware: RTX 4070 (12 GB GDDR6X), Ubuntu 22.04, CUDA 12.1, Python 3.11.
Setup path (TabbyAPI route):
- Clone the TabbyAPI repository
- Run install.sh — handles Python dependencies and ExllamaV2 compilation
- Download Mistral 7B EXL2 4.0 bpw from Hugging Face (bartowski or turboderp repos)
- Point Open WebUI at the TabbyAPI endpoint
Results on Mistral 7B:

| | Ollama (Q4_K_M) | ExllamaV2 (EXL2 4.0 bpw) |
|---|---|---|
| Tok/s | 52 | 118 |
| VRAM at load | 4.9 GB | 5.4 GB |
| Quality | Baseline | +1-2 MMLU pts |

ExllamaV2 uses slightly more VRAM (5.4 GB vs 4.9 GB) but delivers 2.3× the speed and marginally better quality. On a 12 GB card, the extra 500 MB is a non-issue.
With 6.6 GB of VRAM remaining after loading the model, you have comfortable room for a nomic-embed-text embedding model alongside it for RAG, plus 4K-8K context.
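The context-length part of that budget can be estimated directly. A hedged sketch of fp16 KV-cache size for a grouped-query-attention model; the dimensions below (32 layers, 8 KV heads, head dim 128) assume a Mistral-7B-class architecture:

```python
# Rough fp16 KV-cache footprint for a GQA model:
# 2 (K and V) x layers x kv_heads x head_dim x 2 bytes, per token.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx_tokens: int, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * ctx_tokens / 2**30

print(round(kv_cache_gb(32, 8, 128, 8192), 2))  # ~1.0 GiB at 8K context
```

Roughly a gigabyte for 8K context leaves several gigabytes of the remaining 6.6 GB free, which is why an embedding model for RAG fits comfortably alongside.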
The recommended path for new ExllamaV2 users: TabbyAPI + Open WebUI. You get the ExllamaV2 backend with an Ollama-like chat interface. The speed improvement is immediate and the workflow isn't meaningfully more complex than Ollama after setup.
Related Concepts for Local AI Builders
- Ollama — the beginner-friendly alternative and the right choice for most new builders. If you're just starting out, start here. Come back to ExllamaV2 when you've outgrown Ollama's performance ceiling.
- GPTQ / EXL2 format — the quantization formats ExllamaV2 uses. Understanding what 4.0 bpw means, and how EXL2 differs from GGUF, is required reading before sourcing models.
- Bits-per-weight and quantization — explains quality tradeoffs across quantization levels. EXL2's fractional bpw makes these tradeoffs more precise than GGUF's fixed levels.
- vLLM setup guide — covers the multi-user serving use case where vLLM outperforms ExllamaV2 on total throughput.
The practical answer: if you're running an RTX 4070 or better and spend serious time with local LLMs daily, ExllamaV2 via TabbyAPI is worth the one-time setup. If you're new, still experimenting, or on a GPU below RTX 4070 (where the absolute speed gains are smaller), Ollama is the better starting point. The best inference engine guide has the full comparison across all major options.