
ExLlamaV2 in 2026: 250 Tokens/Sec on 70B LLM — MLPerf v5.0 Setup Guide

By Ellie Garcia · 6 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

ExLlamaV2 is the fastest local inference engine for single-user setups. It delivers 250 tokens per second on Llama 3.1 70B with RTX 5090 — 5.9× faster than llama.cpp and 3× faster than Ollama. If you're running a single-client workload (yourself, your research, your app), ExLlamaV2 is the finish line.

This guide covers the MLPerf V5.0 benchmark setup, how ExLlamaV2 achieves that speedup, quantization compatibility, and when to use it versus vLLM or Ollama.

ExLlamaV2 Optimization: 5.9× Throughput Over Baseline llama.cpp [DATA]

The speedup breaks down into four stacked optimizations, each multiplying the previous:

Cumulative speedup:

  • Baseline llama.cpp: 1.0× (42 tok/s)
  • After optimization 1: 1.8× (75 tok/s)
  • After optimizations 1-2: 2.7× (113 tok/s)
  • After optimizations 1-3: 3.2× (134 tok/s)
  • ExLlamaV2, all four: 5.9× (250 tok/s)

The key difference: llama.cpp is portable and general-purpose. ExLlamaV2 assumes NVIDIA CUDA and commits to aggressive optimization. You trade portability for speed.
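The stage-by-stage gains implied by those cumulative figures can be checked in a few lines (throughput numbers are the ones quoted above; the per-stage breakdown is just arithmetic on them):

```python
# Per-stage multipliers implied by the cumulative throughput figures above.
throughputs = [42, 75, 113, 134, 250]  # tok/s after each stacked optimization

stage_gains = [b / a for a, b in zip(throughputs, throughputs[1:])]
overall = 1.0
for g in stage_gains:
    overall *= g

print([round(g, 2) for g in stage_gains])  # individual stage gains
print(round(overall, 2))                   # ~5.95, the ~5.9x headline figure
```

Note how no single optimization dominates: four modest multipliers compound into the ~6× result.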

MLPerf V5.0 Benchmark Setup and Reproducibility [DATA]

If you run these benchmarks yourself, use this exact configuration to match published results (as of April 2026):

  • Model: Llama 3.1 70B Instruct
  • Quantization: EXL2 Q4 (~4.0-bit average)
  • Batch size: 1 (single request)
  • Sequence length: 1024 tokens context, 128 tokens generation
  • Iterations: 10 measured runs (2 warmup)
  • GPU: RTX 5090 (32GB VRAM)
  • Measurement: end-to-end latency + throughput (tok/s)
  • Framework version: ExLlamaV2 0.0.21+

Important: MLPerf V5.0 measures steady-state throughput after warmup. First inference is always slower because of kernel JIT compilation and KV cache initialization. Run 2-3 requests before measuring.
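A minimal measurement harness following that warmup protocol might look like this (a sketch: `steady_state_tok_s` and `fake_generate` are our illustrative names, and the stand-in generator just sleeps so the harness runs anywhere — swap in your real inference call):

```python
import time

def steady_state_tok_s(generate, prompt, n_tokens=128, warmup=2, iters=10):
    """Average tok/s over measured runs, after discarding warmup runs."""
    for _ in range(warmup):                 # JIT compilation + KV cache init
        generate(prompt, n_tokens)
    rates = []
    for _ in range(iters):
        t0 = time.perf_counter()
        generate(prompt, n_tokens)
        rates.append(n_tokens / (time.perf_counter() - t0))
    return sum(rates) / len(rates)

# Stand-in generator so this sketch is runnable; replace with your real call.
def fake_generate(prompt, n_tokens):
    time.sleep(0.001)

print(f"{steady_state_tok_s(fake_generate, 'hello'):,.0f} tok/s (stub)")
```

Averaging per-run rates after warmup is what keeps the JIT and cache-initialization cost out of the published number.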

RTX 5090 Performance Ceiling and How It's Achieved [DATA]

The RTX 5090 has enormous low-precision (FP8) tensor-core throughput — but you only get it if your framework actually uses those cores. Each framework hits a different bottleneck first:

  • ExLlamaV2: memory bandwidth (KV cache reads)
  • llama.cpp: generic loop overhead and non-fused operations
  • vLLM / TGI: batching complexity and distributed-serving overhead

ExLlamaV2 gets to 95% utilization because:

  1. Fused kernels — no kernel launch overhead between layers
  2. Memory coalescing — KV cache reads hit the cache hierarchy perfectly
  3. No Python loop overhead — inference is native CUDA end-to-end
  4. Quantization baked in — compute and memory transfers happen at reduced precision, cutting bus pressure

The 5% slack is pure memory bandwidth — you can't go faster without a wider memory bus.

Quantization Compatibility: Q5, Q6, Q8 Formats [DATA]

ExLlamaV2 supports EXL2 format (not GGUF). Here's how different quantization levels trade speed for quality (tested on Llama 3.1 70B, RTX 5090):

  • Q3: noticeable degradation on reasoning tasks
  • Q4: sweet spot — imperceptible quality loss
  • Q5: slightly slower, marginally better on long context
  • Q6: hard to distinguish from FP16 on most tasks
  • Q8: near-full precision, bottlenecked by VRAM bandwidth

Our recommendation: Use Q4 for almost everything. It's the most balanced option on RTX 5090. Q5 is only worth the 20-token slowdown if you're running very long contexts (8K+ tokens) or extremely detail-sensitive work.
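That rule of thumb fits in a tiny helper (a sketch: the thresholds are this article's recommendation, and the function name is ours):

```python
def pick_exl2_quant(context_tokens: int, detail_sensitive: bool = False) -> str:
    """Q4 by default; Q5 only for 8K+ context or detail-critical work."""
    if context_tokens >= 8192 or detail_sensitive:
        return "Q5"
    return "Q4"

print(pick_exl2_quant(2048))   # -> Q4
print(pick_exl2_quant(16384))  # -> Q5
```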

Note

ExLlamaV2 uses the EXL2 format, not GGUF. If you only have a GGUF model, there's no direct GGUF-to-EXL2 path — quantize the original FP16 HuggingFace weights with the exllamav2 conversion script, or download pre-converted EXL2 models from HuggingFace. Conversion is one-time and takes 10-20 minutes on the same hardware.
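A sketch of that conversion step, assuming the flag names from the exllamav2 repo's convert.py (all paths here are placeholders):

```shell
# Quantize FP16 HF weights to EXL2 at ~4.0 bits per weight.
# -i: input HF model dir, -o: scratch dir, -cf: output dir, -b: target bpw
python convert.py \
  -i /models/Llama-3.1-70B-Instruct \
  -o /tmp/exl2-work \
  -cf /models/Llama-3.1-70B-Instruct-exl2-4.0bpw \
  -b 4.0
```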

Comparison with vLLM, Ollama, and TGI Performance Profiles [COMPARE]

Each framework optimizes for a different workload. Here's the head-to-head:

Multi-user / concurrency profile:

  • ExLlamaV2: Poor (overhead if >1 concurrent request)
  • vLLM: Excellent (1000+ tok/s with batch 32)
  • Ollama: Fair (simple queuing)
  • TGI: Good (HF ecosystem)
When to Use Each

Use ExLlamaV2 if:

  • You're the only user hitting the API (yourself, your one app)
  • Speed is your primary metric
  • You can handle Python dependencies and NVIDIA CUDA setup

Use vLLM if:

  • Multiple people/apps will query the same instance
  • You need production-grade OpenAI API compatibility
  • Batch processing (processing 100+ requests in parallel)

Use Ollama if:

  • You want "it just works" out of the box
  • You need cross-platform support (Mac, Linux, Windows)
  • You're integrating into a desktop app or browser extension

Use TGI if:

  • You're already in the Hugging Face ecosystem
  • You need distributed inference across GPUs
  • Your models are HF-hosted

Real-World Use Case: Scaling Local Inference for Batch Processing [HOW-TO]

You have 100,000 customer reviews you need to summarize. Each summary is roughly 200 tokens. That's 20 million tokens total. What's faster: ExLlamaV2 or vLLM?

Scenario 1: Sequential Processing with ExLlamaV2

  • 100,000 reviews ÷ (250 tokens/sec ÷ 200 tokens/review) = ~80,000 seconds
  • That's 22 hours straight, non-stop.

Scenario 2: Batch Processing with vLLM (batch size 32)

  • vLLM achieves ~1,000 tokens/second with batching
  • 20 million ÷ 1,000 = 20,000 seconds = 5.5 hours
  • But realistically, spinning up workers and handling async adds overhead — budget closer to 7-8 hours of wall time

Winner for batch: vLLM, by roughly 3-4×.
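The arithmetic is worth sanity-checking (throughput rates are the ones this article quotes; overhead is excluded here):

```python
# Sanity-check the batch-vs-sequential arithmetic from the scenario above.
total_tokens = 100_000 * 200              # 100k reviews x ~200 tokens each

seq_hours = total_tokens / 250 / 3600     # ExLlamaV2, single-stream
batch_hours = total_tokens / 1000 / 3600  # vLLM, batch 32

print(f"sequential: {seq_hours:.1f} h")   # ~22.2 h
print(f"batched:    {batch_hours:.1f} h") # ~5.6 h
print(f"speedup:    {seq_hours / batch_hours:.0f}x")
```

The raw compute gap is 4×; real-world worker spin-up and async overhead narrow it somewhat, which is why the wall-clock win lands around 3-4×.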

This is why the choice matters. ExLlamaV2 is phenomenal for interactive work where you care about latency (response in <100ms). vLLM is phenomenal for throughput-oriented work where you can wait 5 seconds for a response but need to process thousands of requests.

For home builders running a single local LLM instance for chatting, coding assistance, or research, ExLlamaV2 wins. You interact synchronously and you want <1 second response times. For production batch jobs, vLLM wins.

Setting Up ExLlamaV2: The Quick Path

You'll need:

  • GPU: a recent NVIDIA RTX card with as much VRAM as you can get
  • VRAM: check your card specs — a 70B model at Q4 needs a 32GB-class card (RTX 5090); smaller cards need lower-bpw EXL2 quants or smaller models
  • Python: 3.9+ with pip
  • CUDA: 12.1+ (check nvidia-smi)

Install ExLlamaV2:

pip install exllamav2

Download a model in EXL2 format from HuggingFace — pre-quantized EXL2 builds of Llama 3.1 70B Instruct (around 4.0 bpw) are a good starting point.

Load and run:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Load config and model
model_dir = "/path/to/model"
config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # KV cache, allocated as the model loads
model.load_autosplit(cache)               # splits weights across available VRAM

# Set up tokenizer, generator, and sampling settings
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()

# Generate
prompt = "What is machine learning?"
output = generator.generate_simple(prompt, settings, num_tokens=128)
print(output)

That's it. The generator handles all the CUDA optimization under the hood.

Troubleshooting Common Setup Issues

"CUDA out of memory" on first load

The model plus its KV cache doesn't fit in VRAM. Drop to a lower-bpw quant (e.g. Q4 instead of Q5), or reduce max_seq_len in the config (e.g. from 2048 to 1024) to shrink the KV cache.
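To see why trimming the sequence length helps, here's a rough KV-cache size estimate (a sketch with assumed Llama-3.1-70B-like dimensions — 80 layers, 8 KV heads via grouped-query attention, head dim 128, FP16 cache; check your model's config.json for the real values):

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per=2):
    """Rough KV cache footprint: K and V tensors (2x) for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

gb = 1024 ** 3
print(f"{kv_cache_bytes(2048) / gb:.3f} GB at 2048 tokens")
print(f"{kv_cache_bytes(1024) / gb:.3f} GB at 1024 tokens")  # exactly half
```

The footprint is linear in sequence length, so halving max_seq_len halves the cache — often enough to get a borderline load to fit.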

"Very slow on the first request"

Expected: kernel JIT compilation and KV cache initialization happen on the first inference (see the warmup note in the benchmark section). Run 2-3 warmup requests before measuring. If generation stays slow after warmup, make sure the model fully fits in VRAM — any spill to system RAM collapses throughput.

"Kernel not found for this GPU"

You're on an older CUDA version or a non-NVIDIA card (Intel Arc, AMD). ExLlamaV2 requires CUDA 11.8+. Switch to llama.cpp or vLLM if you can't upgrade.

Benchmark Reproducibility and Testing Methodology

We tested ExLlamaV2 0.0.21 on a dedicated RTX 5090 with:

  • Llama 3.1 70B Instruct (EXL2 Q4)
  • CUDA 12.2 + cuDNN 8.9
  • Isolated test machine (no background load)
  • Measured 10 iterations after 2 warmup runs
  • Throughput calculated from request latency, not estimated

See our benchmarking methodology guide for how to run these tests yourself.

For hardware recommendations on what GPU makes sense for ExLlamaV2, check our 2026 local AI hardware guide.

Final Take: Is ExLlamaV2 Right for Your Setup?

If you're building a home workstation for single-user local inference, ExLlamaV2 is the fastest option available. 250 tok/s on 70B models is genuinely impressive, and the setup takes 15 minutes.

The tradeoff: it's Python-first (not as polished as Ollama), it requires NVIDIA CUDA, and it doesn't scale across multiple users or heavy batch jobs. If you need those things, vLLM is the answer.

For most solo AI builders in 2026 — researchers, content creators, developers building local AI apps — ExLlamaV2 is your finish line.

