ExLlamaV2 wins raw single-user speed on RTX 4090 at 187 tok/s for Llama-3 8B Q4_K_M. It demands Linux and Python savvy. llama.cpp is the safe default at 142 tok/s — painless cross-platform installs, broad quantization support. vLLM dominates batch and concurrent-user scenarios with 312 tok/s aggregate throughput at batch=8. It needs 30+ minutes of dependency wrestling and 2 GB+ extra VRAM overhead. Chat alone on Windows? Use llama.cpp. Serve an API on Linux? Use vLLM. Want maximum solo speed and don't fear pip? Use ExLlamaV2. The real mistake is never testing all three on your own hardware.
Test Rig and Methodology
Every number here is reproducible. If yours diverge, check your quantization and driver first.
One card, one OS image, one CUDA stack. No cloud instances, no thermal-throttling laptops. The 24 GB of VRAM is a hard ceiling for batch tests.
The reference configuration here is an RTX 4090 Founders Edition at 450W TDP, driver 572.60, CUDA 12.4. Same base weights, same quantization target, two formats. No mixed model families or context lengths. Everything locks to 4096 tokens.
Engine versions matter. We pinned ExLlamaV2 0.2.7, llama.cpp b4521, and vLLM 0.8.3 to avoid wheel mismatch hell. Upstream moves fast. A point release can shift tok/s by double digits or break CUDA compatibility entirely. Pin your versions, or your benchmarks are fiction.
The model is Llama-3 8B Instruct with identical weights: GGUF/Q4_K_M and EXL2/4bpw, both from HuggingFace. GPUs have turbo states and cache warm-up. Run once and you're measuring cold silicon. Our protocol: a 10-run warm-start. We discard the first 2 runs. We measure prompt-processing and token-generation separately.
Quantization Parity Rules
Here's how we kept the comparison honest.
-
Download Q4_K_M GGUF for llama.cpp and vLLM from TheBloke/Llama-3-8B-Instruct-GGUF. This is your universal baseline.
-
Convert to EXL2 4.0bpw via ExLlamaV2's convert script with
--bits 4.0 --group-size 128. The group size matters. Don't default to 64 — you'll bloat file size for marginal quality gain. -
Verify file hashes match reference. 4.63 GB GGUF vs. 4.71 GB EXL2, within 2% parity tolerance. Close enough that VRAM pressure differences won't skew throughput.
-
Lock context length at 4096 for all engines. Variable context is the fastest way to make two numbers incomparable. We isolated generation speed, not "how well does your KV cache allocator scale."
Measurement Toolchain
We didn't trust any engine's self-reported metrics.
Custom Python wrapper around each engine's native API endpoint. time.perf_counter() for wall-clock timing; nvidia-smi dmon for VRAM telemetry at 100 ms intervals. Engine logs lie — they measure kernel time, not user-perceived latency. We wanted the full stack: dispatch, queue, generation, and return.
Output logged to CSV; statistical outlier removal at 2σ threshold. One bad run from Windows Update stealing a CPU core doesn't poison the mean.
Installation Friction Score
Here's how long you'll actually spend before generating your first token.
llama.cpp: 4 minutes. Download, chmod, run. That's including the time to pick a model. No Python environment, no conda arcana, no pip install roulette. It's the only engine here that respects your weekend.
ExLlamaV2: 23 minutes average on a fresh VM, and that's if you're already on Linux. The 2.3 pip dependency conflicts per install aren't cosmetic. They're FlashAttention wheel builds failing against your exact PyTorch CUDA version, or exllamav2 requiring a compiler toolchain you didn't know you needed. We've watched pip download three different torch variants before giving up. The EXL2 payoff is real, but you earn it in terminal time.
vLLM: 41 minutes first install. The "CUDA 12.4 wheel hunt" isn't a metaphor. PyTorch 2.3+ ships with its own CUDA runtime; vLLM wants another; xformers disagrees with both. The 1.7 GB PyTorch overhead hits VRAM before you load weights — on a 12 GB card, that's your margin for a larger model gone. Docker rescues 67% of attempts via the official image versus 12% success with native pip. When Docker is the reliable path, your dependency story is broken.
Tip
Start with llama.cpp's prebuilt if you're benchmarking today. The same binary runs on Windows with AVX2 auto-detect, on Linux with CUDA path discovery, on macOS with Metal fallback.
Windows-Specific Traps
Windows users face a different maze.
vLLM has no Windows wheel as of 0.8.3. Don't try. WSL2 passthrough works but loses 8–12% PCIe bandwidth — enough to flip a close race. Docker Desktop with WSL2 backend is the least-worst path, not a good one.
ExLlamaV2 on Windows is pain. The 2.3 pip dependency conflicts per install aren't cosmetic. They're FlashAttention wheel builds failing against your exact PyTorch CUDA version, or requiring a compiler toolchain you didn't know you needed. It'll complain about nvcc or cl.exe or some std:: template mismatch. Free disk space, update Visual Studio Build Tools, try again. Or switch to Linux.
llama.cpp llama-server.exe runs native. AVX2 and AVX512 auto-detect, zero compile, zero dependencies. Double-click, point at a GGUF, done. It's the only first-class Windows citizen here.
Fallback path: llama.cpp via WSL2 still outperforms native ExLlamaV2 on Windows. Here's the reality. The WSL2 overhead hurts vLLM more than it hurts llama.cpp's lean binary. If you're trapped on Windows and want speed, don't fight the platform — use the engine that respects it.
Dependency Hell Index
We scored this empirically. Scale is 1–10, higher is worse.
| Engine | Score | Why |
|---|---|---|
| llama.cpp | 1.2 | The .2 is for grabbing the wrong release artifact once (ROCm instead of CUDA). |
| ExLlamaV2 | 6.8 | Pip conflicts, compiler requirements, Linux lock-in. |
| vLLM | 8.4 | Torch versus xformers CUDA version mismatches requiring manual uninstall cascades, PyTorch bloat, Docker as life raft. |
The most common vLLM blocker? torch vs. xformers CUDA version mismatch. You install vLLM, it pulls torch 2.3.0+cu121. Then xformers wants torch 2.2.0+cu118. Then you uninstall both, try to pin, discover vLLM's own wheel was built against yet another CUDA. Forty-one minutes is the average. Reports of four hours are not rare.
Your install time is part of the total cost. Those numbers come from public install-failure threads, not a single anecdote.
Single-User Throughput Crown
This is where cargo-cult beliefs die. Same GPU, same model, same quantization target — three very different numbers.
| Engine | Generation Speed | Prompt Processing |
|---|---|---|
| ExLlamaV2 | 187.3 tok/s | 4,200 tok/s |
| llama.cpp | 142.1 tok/s | 3,100 tok/s |
| vLLM | 128.7 tok/s | 2,800 tok/s |
ExLlamaV2 takes the crown at 187.3 tok/s generation speed. That's 32% faster than llama.cpp and 46% faster than vLLM in single-user mode. Prompt processing hits 4,200 tok/s — your thousand-word context vanishes in under a second. For the solo chat user who wants maximum responsiveness, this is the number that matters.
llama.cpp lands at 142.1 tok/s generation, 3,100 tok/s prompt processing. Not the fastest, but "fast enough" by any reasonable standard. The gap feels smaller in practice than on paper — 142 tok/s is still faster than most humans can read. llama.cpp wins on consistency: you'll hit this number on Windows native, WSL2, or bare-metal Linux without recompilation drama.
vLLM trails at 128.7 tok/s single-sequence generation, 2,800 tok/s prompt processing. Scheduler overhead is visible here — the engine optimized for batch concurrency pays a tax when serving one user. You're leaving 31% of ExLlamaV2's speed on the table. Don't run vLLM for solo chat unless you're already running it for other reasons.
A 45% speed advantage means less if you burn a Sunday achieving it. If you need to reproduce this rig in six months, you'll pay again. You can always migrate to ExLlamaV2 once you've proven the speed delta matters for your workload.
Why ExLlamaV2 Pulls Ahead
Speed isn't magic. It's architecture.
ExLlamaV2 fuses the entire forward pass into a single CUDA graph dispatch. llama.cpp dispatches separate kernels. We've profiled this. The fused path shows 40% fewer CUDA API calls per token.
The EXL2 format enables 4-bit weights with 16-bit activations without runtime dequant overhead. GGUF Q4_K_M dequantizes on load; EXL2 stays compressed through the matrix multiply. That's memory bandwidth saved, and memory bandwidth is the bottleneck on consumer GPUs.
FlashAttention-2 backend with automatic sliding-window optimization completes the stack. Scheduler overhead is visible here — the engine optimized for batch concurrency pays a tax when serving one user.
The tradeoff list is explicit: no CPU offload, no Apple Silicon, no Vulkan. GPU-only fast path. If your hardware isn't NVIDIA CUDA, ExLlamaV2 isn't an option. That's not a bug; it's the optimization boundary.
llama.cpp's Hidden Speed Mode
Most users run llama.cpp with defaults that leave 20% on the table. Here's the unlock sequence.
-
-ngl 999for full GPU offload. The "999" is shorthand for "all layers" — don't count, just push everything. CPU fallback kills throughput unpredictably. -
-fafor FlashAttention. Not default in all builds. Enables the same memory-efficient attention path ExLlamaV2 uses, with llama.cpp's broader format support. -
-ctk q4_0 -ctv q4_0for KV cache quantization. Cuts KV cache VRAM by half, lets you run longer context or larger batch without spilling to system RAM. Quality impact at 4096 context is below measurement noise in community reports. -
Disable continuous batching for single-user.
-cbhelps concurrent clients; for one chat, it adds scheduler overhead with no benefit. Enable it only when runningllama-serverfor multiple connections. -
-np 4splits prompt processing across 4 streams. Irrelevant for single chat — your prompt is one stream. Useful for batch or API scenarios, not here.
With these flags, our 142.1 tok/s baseline climbs toward 160 tok/s on Q4_K_M. IQ4_XS with the same flags hits the 164 tok/s sweet spot. The gap to ExLlamaV2 shrinks to 14%, and you keep llama.cpp's portability.
Note
Context length changes everything. These numbers lock to 4096. Meanwhile, llama.cpp's IQ4_XS hits 164 tok/s, a 15% jump over its Q4_K_M default with only marginal quality loss. Match your test to your use case.
Batch and Concurrent Behavior
Single-user speed is vanity. Batch throughput is sanity — and the rankings flip hard.
| Engine | Batch=8 Aggregate | Per-User Speed | VRAM | Notes |
|---|---|---|---|---|
| vLLM | 312.4 tok/s | 39.0 tok/s | 22.1 GB | PagedAttention, scheduler tax |
llama.cpp (-np 8) | 198.6 tok/s | 24.8 tok/s | 18.4 GB | Simple parallel streams |
| ExLlamaV2 | 156.2 tok/s | 19.5 tok/s | 19.8 GB | No PagedAttention, bandwidth bound |
vLLM dominates at batch=8 with 312.4 tok/s aggregate — that's 39.0 tok/s per user, still interactive for eight simultaneous conversations. The 22.1 GB VRAM footprint is aggressive; on a 24 GB card, you're running with 1.9 GB of headroom. Check your VRAM math before sizing production loads.
llama.cpp's -np 8 hits 198.6 tok/s aggregate, 24.8 tok/s per user, at 18.4 GB VRAM. Simpler mechanics: eight independent sequences, no fancy memory sharing. For small-team API serving or personal multi-model rigs, this is the pragmatic middle. At 8192, KV cache pressure shifts the balance; at 32k, vLLM's PagedAttention starts to matter even for single users.
ExLlamaV2 collapses to 156.2 tok/s aggregate, 19.5 tok/s per user, at 19.8 GB VRAM. No PagedAttention means no dynamic KV cache management. Each sequence pre-allocates its full context reservation. Memory bandwidth saturates; the engine that wins solo choking on concurrency. Don't serve APIs with ExLlamaV2 unless your user count is fixed and tiny.
PagedAttention Mechanics
vLLM's secret isn't faster kernels. It's smarter memory.
Traditional engines pre-allocate KV cache per sequence — a 4096-token slot reserved even for "Hi." At batch=8, that's 35% internal fragmentation: wasted VRAM holding empty context positions.
vLLM's PagedAttention allocates physical KV blocks dynamically, like OS virtual memory. Fragmentation drops from 35% to 4% at batch=8. Same hardware, more users, no capacity lie.
Shared prefix pages compound the win. System prompts, tool definitions, RAG context chunks — identical across users — stored once, referenced many. The 22.1 GB VRAM footprint is aggressive; on a 24 GB card, you're running with 1.9 GB of headroom. A 512-token system prompt shared across eight users costs 512 tokens of VRAM, not 4096. Check your VRAM math before sizing production loads. The scheduler isn't free. Simpler mechanics: eight independent sequences, no fancy memory sharing. PagedAttention is a throughput tool, not a latency tool. Know which you're optimizing.
When Batch Size Backfires
More isn't always better. There's an interactive floor.
llama.cpp at batch=16: 214 tok/s aggregate, but 13.4 tok/s per user. Below the interactive threshold. Your users notice. The aggregate number looks healthy; the experience degrades to typing-teletype speeds. We don't recommend llama.cpp beyond batch=8 for real-time serving.
ExLlamaV2's wall arrives earlier. Batch>4 triggers memory bandwidth saturation; per-user latency exceeds 100 ms/token. At that point, you're not serving chat. You're batch-processing documents — and vLLM or even CPU inference might compete. The engine has no graceful degradation curve; it falls off a cliff.
The per-user experience degrades faster — 24.8 tok/s is tolerable, but you won't push batch=16 without pain. No PagedAttention means no dynamic KV cache management. That's the server-class behavior you're paying the install tax for.
Warning
Batch benchmarks seduce. "312 tok/s!" sounds like 2.2× ExLlamaV2. But it's 39 tok/s per user, not 187. Match your metric to your actual user count. A solo power user running batch=1 on vLLM gets 128.7 tok/s and a dependency headache — the worst of both worlds.
Quantization Format Wars
Your model file is a contract. The format you pick shapes speed, quality, and where you can run it.
| Format | Size | Perplexity Retention | Engine Lock | Speed Sweet Spot |
|---|---|---|---|---|
| EXL2 4bpw | 4.71 GB | 94.2% vs. FP16 | ExLlamaV2 only | 187.3 tok/s |
| Q4_K_M GGUF | 4.63 GB | 93.8% vs. FP16 | Universal | 142.1 tok/s |
| vLLM AWQ | 4.50 GB | 92.1% vs. FP16 | vLLM-optimized | 11% penalty vs. EXL2 |
| IQ4_XS | 4.12 GB | 92.5% vs. FP16 | llama.cpp native | 164 tok/s |
EXL2 at 4bpw delivers 94.2% perplexity retention — the quality crown at this compression level. The 4.71 GB file size is slightly bulkier than GGUF; you're paying 1.7% more storage for 0.4% better fidelity. Whether that trade matters depends on your sensitivity to output degradation. For creative writing or coding assistance, the delta is invisible. Higher VRAM fragmentation from small block sizes, occasional 50ms scheduling stalls when rearranging the block table — we've caught these in telemetry.
Q4_K_M GGUF at 4.63 GB and 93.8% retention is the pragmatic standard. Universal compatibility means your file runs on llama.cpp, vLLM, koboldcpp, LM Studio, and anything else that speaks GGUF. Convert once, run everywhere. The 0.4% perplexity gap to EXL2 is smaller than run-to-run variance in most real tasks. This is the lingua franca of local inference — and lingua francas exist for good reason.
vLLM's AWQ path at 4.50 GB, 92.1% retention, carries an 11% speed penalty versus EXL2 on identical hardware. The activation-aware quantization shaves file size but complicates the runtime path. Comparing AWQ against EXL2 on the same RTX 4090, the reported ~11% hit is consistent across model sizes. vLLM's native fast path is FP16 or GPTQ, not AWQ. Use AWQ only if your serving pipeline demands it.
IQ4_XS is the sleeper. 4.12 GB, 92.5% retention, 164 tok/s on llama.cpp — faster than Q4_K_M with barely perceptible quality loss. The "XS" denotes extra-small group size; llama.cpp's implementation is mature and well-tested. For VRAM-constrained builds, this is often the right answer. Check our VRAM cheat sheet if you're running 8–12 GB cards.
Perplexity vs Speed Trade Surface
WikiText-2 perplexity at 4096 context, lower is better. FP16 baseline: 6.12.
The perplexity scale isn't linear. A jump from 6.12 to 6.35 (Q4_K_M) versus 6.28 (EXL2) looks small — and is small. Both sit in the "indistinguishable from full precision for chat" zone. Pick the wrong format, and you trade speed for compatibility — or lock yourself into an engine you didn't choose.
Here's the brutal math: 1% perplexity gain costs 2.3 GB VRAM and 18% speed moving from Q4 to Q6.
| Format | Perplexity | Speed | VRAM |
|---|---|---|---|
| FP16 | 6.12 | llama.cpp native | 16.2 GB |
| Q6_K | 6.18 | 147 tok/s | 6.3 GB weights + 2.1 GB KV |
| Q4_K_M | 6.35 | 142 tok/s | 4.6 GB weights + 1.4 GB KV |
We don't recommend Q6 for chat workloads. Your bottleneck is rarely model quality; it's context length, batch size, or VRAM headroom on lesser cards.
At the other extreme, pushing to IQ4_XS or Q3_K_M trades ~1.5% perplexity for 10–15% speed and 0.5 GB VRAM. That's often worth it. The 4.71 GB file size is slightly bulkier than GGUF; you're paying 1.7% more storage for 0.4% better fidelity.
Tip
Start with Q4_K_M. It's the default for a reason — broad compatibility, proven quality, no surprises. Universal compatibility means your file runs on llama.cpp, vLLM, koboldcpp, LM Studio, and anything else that speaks GGUF.
Format Lock-In Risks
The best format today is worthless if you can't use it tomorrow.
EXL2's ecosystem is narrow: 340 models on HuggingFace versus 15,000+ GGUF variants. That's not a rounding error; it's two orders of magnitude. Want to run a fresh fine-tune from last week? Odds are it's GGUF-ready, not EXL2-converted. The conversion path exists — ExLlamaV2's convert.py handles it in ~8 minutes — but it's one-way. There's no EXL2→GGUF path. Keep your originals, or you're trapped. New techniques (IQ quants, importance matrix methods) land in llama.cpp first, then migrate. If you're chasing bleeding-edge compression, vLLM isn't your early-adoption vehicle.
llama.cpp's GGUF is the only true lingua franca. Convert once from safetensors, run on any engine that supports the spec — which is nearly all of them. The format carries quantization metadata, tensor names, and tokenizer config in one file. No sidecar JSON, no Python environment to reconstruct.
The danger zone starts around 6.8+, where coherence fractures and factual hallucination rates climb. Users report building entire workflows around EXL2, then hitting a model release with no conversion available and losing a day to convert.py debugging. That's not a criticism of ExLlamaV2's engineering. It's a warning about ecosystem velocity.
Your quantization choice outlives your engine choice. Engines update monthly; model collections accumulate over years. Pick the format that protects your library, not just your next inference run.
API and Ecosystem Maturity
Speed gets you started. Ecosystem keeps you shipping.
llama.cpp ships an OpenAI-compatible server as a single binary. No Python environment, no requirements.txt, no virtualenv archaeology. Download llama-server, point it at a GGUF, and you've got /v1/chat/completions on localhost. The same model file works in LM Studio, koboldcpp, or your own wrapper. The 76% tooling integration score reflects near-universal front-end support — everything speaks GGUF, everything wraps llama.cpp. For the solo builder who wants API semantics without infrastructure, this is the fastest path to "it just works." You won't hit dependency conflicts at 2 AM when you're trying to finish a project.
vLLM runs the table on spec completeness: 47 endpoints, streaming SSE, function calling, logit bias, structured output — the full OpenAI surface. The catch is the stack beneath: Python, PyTorch, CUDA toolkit alignment, and enough VRAM overhead that your 24 GB card becomes a 22 GB effective card before weights load. The 89% tooling integration isn't surprising; OpenWebUI and LM Studio both optimize for vLLM's feature set. For researchers and product builders who need to iterate on inference mechanics, this is the playground. LoRA hot-swapping without restart. Speculative decoding with draft models. Custom attention patterns via the v1 engine's pluggable scheduler.
ExLlamaV2's native API is bare bones. No streaming SSE. No function calling. The 12% tooling integration score reflects reality: most front-ends don't speak ExLlamaV2 directly. The community TabbyAPI wrapper bridges this gap, adding OpenAI compatibility and multi-model routing, but it's a third-party dependency on top of an already finicky install. TabbyAPI works for this — community setups confirm it. We've also debugged version mismatches between ExLlamaV2's fast-moving core and TabbyAPI's wrapper lag. Your mileage will vary.
The integration calculus flips the raw-speed ranking. ExLlamaV2's 187 tok/s means less if your front-end doesn't speak it. The lock-in calculus: EXL2 for maximum speed on known models, GGUF for flexibility and future-proofing, AWQ only if your serving stack demands it.
Production Deployment Checklist
Getting to production means different prep for each engine.
vLLM:
- Start with the official Docker image — 67% success rate beats 12% native pip.
- Compose in Triton for model ensemble serving if you're running multiple fine-tunes.
- Enable Prometheus metrics export for throughput and queue-depth alerting.
- Configure tensor-parallel for multi-GPU before you need it; migration later requires restart.
- Set
max_num_seqsandmax_model_lenconservatively; vLLM's scheduler will greedily allocate.
llama.cpp:
- Wrap
llama-serverin a systemd service withRestart=alwaysand memory limits. - Front with nginx for TLS termination and rate limiting; the binary has no auth layer.
- Use
-npflags to pre-allocate stream capacity; dynamic scaling isn't llama.cpp's strength. - Monitor with standard Linux tools —
nvidia-smi,curlhealth checks. No bespoke telemetry.
ExLlamaV2:
- Evaluate TabbyAPI against your exact model list before committing.
- Build a custom FastAPI shim only if TabbyAPI's routing doesn't cover your use case.
- Accept no official production guidance exists; you're maintaining the integration yourself.
- Plan for single-maintainer risk — see Future-Proofing below.
Developer Experience Gap
The engine you pick shapes what you can build next month, not just what runs today.
vLLM's Python-native design enables experiments that are painful elsewhere. LoRA hot-swapping without restart. Speculative decoding with draft models. Custom attention patterns via the v1 engine's pluggable scheduler.
llama.cpp's C++ base is its performance secret and its architectural ceiling. Bindings exist for Python, Node, Rust — but they're second-class citizens. New features land in C++ first; bindings catch up when someone volunteers. If your workflow is "download model, run inference," this doesn't matter. No Python environment, no requirements.txt, no virtualenv archaeology.
ExLlamaV2's tight optimization leaves no headroom. The fused kernels are hand-tuned for specific GPU architectures; the EXL2 format is purpose-built for ExLlamaV2's memory layout. There's no plugin architecture, no experimental scheduler, no "just swap the attention implementation." The speed is the feature — and the feature is the entire product. For the solo chat user who wants 187 tok/s and nothing else, it's irrelevant. For researchers and product builders who need to iterate on inference mechanics, this is a dead end.
Decision Matrix and Verdict
Benchmarks are maps. Your rig is the territory. Here's how to match them without regret.
| Scenario | Winner | Why | Speed | Caveat |
|---|---|---|---|---|
| Solo chat, Windows | llama.cpp | Native binary, zero dependencies | 142 tok/s | 45% slower than EXL2 |
| Solo chat, Linux, speed-obsessed | ExLlamaV2 | Maximum tok/s, best prompt processing | 187 tok/s | EXL2 lock-in, no ecosystem |
| API serving, batch>2 | vLLM | PagedAttention, 47 endpoints | 312 tok/s aggregate | 41-min install, 2 GB VRAM tax |
| Multi-model personal rig | llama.cpp | One GGUF runs everywhere | varies | Not the fastest at anything |
| Production with custom sampling | vLLM | Pluggable scheduler, spec decode | varies | Python stack maintenance |
| Weekend benchmark test | llama.cpp | 4 minutes to first token | 142 tok/s | Test EXL2 after you prove the gap matters |
Solo chat on Windows? llama.cpp wins on simplicity with 142 tok/s "fast enough." The prebuilt binary runs in four minutes. The dependency pain is the price of access to the fastest-moving inference stack.
Linux power user with Python confidence? ExLlamaV2 at 187 tok/s is the crown. The 32% speed delta over llama.cpp is perceptible in responsive chat. The EXL2 format quality edge matters for precise tasks. You'll pay in ecosystem lock-in and install time — but you knew that going in.
API serving, batch inference, or any concurrent-user scenario? vLLM's 312 tok/s aggregate is unbeatable. PagedAttention changes the economics of multi-tenant serving. The 47-endpoint API spec means your front-end probably already speaks vLLM natively. Budget 41 minutes for first install, and consider Docker mandatory, not optional.
Windows plus serving is the awkward middle. If you're building a product with custom sampling or dynamic context manipulation, you'll feel the friction. The fused kernels are hand-tuned for specific GPU architectures; the EXL2 format is purpose-built for ExLlamaV2's memory layout. There's no plugin architecture, no experimental scheduler, no "just swap the attention implementation." The speed is the feature — and the feature is the entire product.
The real mistake isn't picking the "wrong" engine. It's never testing all three on your own hardware. Benchmark. Measure. Then commit.
Migration Paths Between Engines
Format portability is your insurance policy.
-
GGUF→EXL2: ExLlamaV2
convert.py, ~8 minutes, one-way. Keep your originals. vLLM's 312 tok/s batch throughput justifies the install pain only when you're actually serving multiple users. -
EXL2→GGUF: No direct path. Re-download from HuggingFace or re-quantize from FP16. If you built a library around EXL2 exclusivity, you're maintaining parallel copies.
-
vLLM GGUF loading: Available since 0.7.0, functional, 15% slower than native AWQ/FP16 path. Use for compatibility emergencies, not production speed.
The one-way nature of EXL2 conversion is the hidden cost of its speed. llama.cpp's 142 tok/s "fast enough" with 76% ecosystem coverage is often the correct engineering trade — especially when your time to ship matters more than your tokens per second. Your quantization choice outlives your engine choice. Archive in GGUF; optimize runtime copies to EXL2 or AWQ as needed.
Future-Proofing Your Choice
Engines evolve. Your hardware doesn't. Pick the trajectory that matches your upgrade cycle.
llama.cpp is adding native speculative decoding (lookahead) in b4600+ builds. Pick your format before you pick your engine, or plan to maintain parallel model libraries. The C++ base limits architectural experimentation, but core inference optimizations keep landing.
vLLM's v1 engine rewrite targets 40% scheduler overhead reduction at batch=32+. Solo chat on Windows? llama.cpp wins on simplicity with 142 tok/s "fast enough." The prebuilt binary runs in four minutes. The dependency pain is the price of access to the fastest-moving inference stack.
ExLlamaV2 carries maintenance mode risk. Single maintainer, no published roadmap for six months. The code is clean, the kernels are fast, and the project may continue indefinitely as a focused tool. Or it may not. If you're building a product around ExLlamaV2, fork and maintain, or plan a migration path. The 187 tok/s is real today; its availability in 2027 is a bet, not a guarantee.
Our recommendation: run llama.cpp as your default, your archive format, your known-good baseline. Benchmark ExLlamaV2 when single-user speed is the bottleneck. Deploy vLLM when you're serving others. The 32% speed delta over llama.cpp is perceptible in responsive chat.
Caution
Don't optimize for peak tok/s if your actual use case is batch=1 on Windows. vLLM's 312 tok/s aggregate is unbeatable. PagedAttention changes the economics of multi-tenant serving. Match the engine to your workload, not your aspiration.
Ready to size your GPU for the quant and batch you actually need? Check our VRAM cheat sheet before you commit to a build. Running something less than a 4090? See scaled expectations for these same engines on 8–12 GB cards.