Does ExLlamaV2 work on macOS?

No. ExLlamaV2 is Windows and Linux only: it requires CUDA and has no Metal/Apple Silicon backend. Mac users should stick with Ollama, llama.cpp (which has a Metal backend), or MLX.

What's the difference between ExLlamaV2 and llama.cpp?

ExLlamaV2 optimizes for batch throughput (multiple requests processed in parallel); llama.cpp optimizes for interactive latency (a fast first token for a single user). Use ExLlamaV2 if you batch requests; use llama.cpp for chatbots.

Can I run Llama 3.1 70B on an RTX 4090?

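Only with compromises. By the VRAM figures later in this article, Q5 and Q6 need 45-58GB, which is more than the RTX 4090's 24GB, so a single card requires a lower-bitrate quantization or a second GPU. At smaller batch sizes (batch ≤ 16), expect roughly 50-70 tokens/sec, depending on quantization method and batch size.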

What's the setup time for ExLlamaV2?

30-45 minutes for first-time install (10-15 minutes is compilation). After that, it's a straightforward Python script.

ExLlamaV2 Setup: 250 tok/s Batch Inference on RTX 5090 [2026]

Is Q4 quantization worth it for more parallelism?

Q4 saves VRAM but introduces 15% more hallucinations on reasoning tasks. Stick with Q5/Q6 unless you're doing pure summarization or classification.

ExLlamaV2: Production-Grade Batch Inference on Your GPU

TL;DR: ExLlamaV2 is a specialized inference engine for batch workloads — collect requests, process them in parallel, maximize throughput. On RTX 5090, it reportedly reaches 250 tokens/second with a 70B model (though independent verification of this figure is limited). If you're running an API backend or document pipeline processing thousands of tokens daily, ExLlamaV2's batch optimization is worth the 30-minute setup. For interactive chat, llama.cpp is faster because it streams immediately. Real throughput scales with GPU: RTX 4090 ≈ 50-70 tok/s depending on quantization, RTX 3090 ≈ 35-50 tok/s.

What ExLlamaV2 Is (and What It Isn't)

ExLlamaV2 is a CUDA-optimized inference engine built for one job: moving maximum tokens per second through a quantized language model. It doesn't stream. It doesn't pretend to be general-purpose. It batches requests and squeezes every ounce of throughput from your GPU.

Unlike llama.cpp (which prioritizes interactive responsiveness), ExLlamaV2 makes a deliberate trade: slow first response time in exchange for massive sustained throughput. You wait 2-4 seconds for the first token, then get flooded with 100+ tokens/second afterward.
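
The trade can be made concrete with a little arithmetic. The sketch below models total response time as first-token delay plus generation time, then solves for the response length at which the higher-throughput engine finishes first. The delays and rates are illustrative values picked from the ranges quoted in this article, not measurements, and the function names are made up for this example.

```python
def total_seconds(first_token_delay: float, tokens_per_second: float, tokens: float) -> float:
    """Time to deliver a full response: wait for the first token, then generate."""
    return first_token_delay + tokens / tokens_per_second


def crossover_tokens(slow_start: tuple[float, float], fast_start: tuple[float, float]) -> float:
    """Response length at which both engines finish at the same time.

    Each argument is (first_token_delay_s, tokens_per_second).
    Solves d1 + n/r1 = d2 + n/r2 for n.
    """
    d1, r1 = slow_start
    d2, r2 = fast_start
    return (d1 - d2) / (1 / r2 - 1 / r1)


# Illustrative values from the ranges in this article (assumptions, not benchmarks):
batch_engine = (3.0, 100.0)   # 3 s to first token, then 100 tok/s
stream_engine = (0.3, 50.0)   # 0.3 s to first token, then 50 tok/s

n = crossover_tokens(batch_engine, stream_engine)
print(f"Break-even response length: {n:.0f} tokens")
```

Below the break-even length the low-latency engine delivers the whole response sooner; above it the high-throughput engine wins, which is why short interactive replies and long batch jobs favor different tools.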

When ExLlamaV2 Makes Sense

API backends: Collect 100+ concurrent requests, batch them, process in parallel.
Document processing: Ingest 1,000 PDFs, generate summaries at scale.
Batch scoring: Rank 10,000 candidates per query using model inference.
Synthetic data generation: Create training data quickly for fine-tuning.

When It Doesn't

Chatbots: Users expect streaming responses, not 2-4 second silence.
Coding assistants: Every second of latency kills the experience.
Real-time moderation: Sub-1-second responses required.
Personal knowledge search: Small requests, interactive use case.

The Hardware Reality: GPU, VRAM, and What Numbers You Can Actually Trust

Let's be honest about benchmarks first.

ExLlamaV2 development is active on GitHub, but independent throughput benchmarks are scarce. You'll see claims of 250+ tokens/second on RTX 5090 floating around, but those figures have not been verified against standardized benchmarks like MLPerf or published by hardware vendors. When choosing hardware or tuning your setup, trust measured results from your own system or citable sources (Tom's Hardware original testing, NVIDIA spec sheets, published llama.cpp benchmarks).

That said, here's what we do know solidly:

GPU specs (verified from NVIDIA):

RTX 5090: 32GB GDDR7, 575W TDP
RTX 4090: 24GB GDDR6X, 450W TDP
RTX 3090: 24GB GDDR6X, 350W TDP

VRAM requirements (verified from community testing and Hugging Face model cards):

Llama 3.1 70B unquantized (FP16): ~140GB of weights, plus KV cache and overhead
Llama 3.1 70B at Q6 (6-bit): ~52-58GB
Llama 3.1 70B at Q5: ~45-50GB
Llama 3.1 70B at Q4: ~35-40GB
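
As a rough cross-check on these figures, weight memory can be approximated as parameter count times bits per weight. The sketch below assumes about 70 billion parameters and ignores KV cache, activations, and quantization metadata, so real footprints run higher. The function name and the GPU table are illustrative and not part of ExLlamaV2.

```python
# Rough weight-only VRAM estimate: parameters x bits per weight / 8 bytes.
# Ignores KV cache, activations, and quantization metadata, so treat results as lower bounds.

GB = 1e9  # decimal gigabytes, matching how model sizes are usually quoted


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB."""
    return params * bits_per_weight / 8 / GB


# Assumption: ~70 billion parameters for a "70B" model.
PARAMS_70B = 70e9

# VRAM per card in GB, from the GPU specs listed above.
GPUS = {"RTX 5090": 32, "RTX 4090": 24, "RTX 3090": 24}

for bits in (16, 6, 5, 4):
    size = weight_gb(PARAMS_70B, bits)
    fits = [name for name, vram in GPUS.items() if size < vram]
    print(f"{bits:>2}-bit: ~{size:.1f} GB of weights; fits on a single: {fits or 'none of these cards'}")
```

By this estimate the weights alone exceed 24GB and 32GB at every bit width listed, so a single-card 70B run needs a bitrate below Q4 or a split across GPUs.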

What we don't know with certainty:

ExLlamaV2's exact throughput on each GPU (no MLPerf submission; limited published benchmarks)
Actual speedup factors compared to llama.cpp at equivalent quantization
Scaling behavior with batch size across different hardware

Proceed with this in mind: ExLlamaV2 genuinely improves batch throughput, but the specific numbers vary with your exact setup (driver version, CUDA version, model quantization format, batch size).

Hardware Recommendations for Batch Inference

RTX 5090 ($1,999 MSRP, ~$3,000+ street price)

Realistic scenario:

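Model: Llama 3.1 70B at Q6 (52-58GB VRAM per the figures above, so on a 32GB card it has to be split with a second GPU or dropped to a lower bitrate)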
Batch size: 32 (maximum parallelism, uses ~24-28GB)
Reported throughput: 250+ tok/s (unverified; verify on your system)
Best for: Production API servers, bulk document processing

Why it matters: The RTX 5090 is the sweet spot for uncompromised 70B inference. You can batch aggressively, use higher quantization quality (Q6 instead of Q4), and still have VRAM headroom.

RTX 4090 ($749-$1,200 street price)

Realistic scenario:

Model: Llama 3.1 70B at Q5/Q6 (45-58GB, which exceeds 24GB, so split it across two cards or use a lower bitrate), with smaller batches
Batch size: 12-16 (VRAM-constrained)
Estimated throughput: 50-70 tok/s (depends heavily on quantization method)
Best for: Small-to-medium API backends, development/testing

Why it matters: Still excellent for batch workloads if you're willing to process requests in smaller parallel groups. Q5 is the sweet spot (nearly identical quality to Q6 with 10% smaller VRAM footprint).

RTX 3090 (Used market, $400-$700)

Realistic scenario:

Model: a 30B-class model at Q6, or 70B at Q4 (Llama 3.1 itself ships in 8B, 70B, and 405B sizes only)
Batch size: 8-12 (tight VRAM margins)
Estimated throughput: 35-50 tok/s
Best for: Learning, hobby inference, or running smaller models (13B-30B)

Why it matters: Viability depends on your actual workload. If you're processing 10,000 tokens/month, even 35 tok/s is production-viable. If you're processing 100,000/month, the extra $400 for an RTX 4090 pays for itself in faster processing.

Step-by-Step Setup: Install and Configure ExLlamaV2

Prerequisites

OS: Linux (Ubuntu 22.04+) or Windows 10/11
GPU: RTX 3090 or newer (CUDA 12.1+, cuDNN 9.0+)
NVIDIA driver: 560 or newer
Python: 3.10 or 3.11
Disk space: 100GB+ (model + system)

macOS users: ExLlamaV2 does not support macOS. Use Ollama, llama.cpp (Metal backend), or MLX instead.

Step 1: Install CUDA and Verify Versions

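# Check NVIDIA driver version and GPU compute capability
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv

# compute_cap should be 8.0 or higher (Ampere or newer)
# and driver_version should be 560 or newer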

If your driver is older than 560, update it from nvidia.com/download.

Check cuDNN:

python -c "import torch; print(torch.backends.cudnn.version())"

Must be 9.0 or higher. If you have 8.9 or earlier, ExLlamaV2 will run but won't benefit from kernel fusion optimizations (10-15% potential speedup loss).
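
Note that torch.backends.cudnn.version() prints a packed integer rather than a dotted version, so the comparison against 9.0 is easy to misread. A minimal sketch of a decoder follows, assuming the encoding NVIDIA uses in its headers (major*1000 + minor*100 + patch before cuDNN 9, major*10000 + minor*100 + patch from 9 onward). The function name is illustrative.

```python
def decode_cudnn_version(packed: int) -> tuple[int, int, int]:
    """Turn the integer from torch.backends.cudnn.version() into (major, minor, patch)."""
    if packed >= 90000:
        # cuDNN 9+: major * 10000 + minor * 100 + patch
        major, rest = divmod(packed, 10000)
    else:
        # cuDNN 8 and earlier: major * 1000 + minor * 100 + patch
        major, rest = divmod(packed, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


for packed in (8902, 90100):
    major, minor, patch = decode_cudnn_version(packed)
    status = "OK" if major >= 9 else "older than 9.0"
    print(f"{packed} -> cuDNN {major}.{minor}.{patch} ({status})")
```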

Step 2: Clone and Build ExLlamaV2

# Clone the official repository
git clone https://github.com/turboderp/exllamav2.git
cd exllamav2

# Install dependencies
pip install -r requirements.txt

# Build and install from source (CUDA kernels are compiled during this step)
pip install .

# Verify installation
python -c "import exllamav2; print(exllamav2.__version__)"

Build time: 10-15 minutes on an 8-core system. This is a one-time cost.

Step 3: Download a Quantized Model

We recommend meta-llama/Llama-3.1-70B-Instruct-GPTQ (Q6 format) as the starting point.

# Accept the Llama 3.1 license at huggingface.co/meta-llama/Llama-3.1-70B-Instruct first

# Download model (40GB)
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct-GPTQ \
  --repo-type model \
  --local-dir ~/models/llama-70b-q6

Alternative: turboderp/Llama-3.1-70B-Instruct-Exl2 (ExL2 format, optimized for ExLlamaV2, faster than GPTQ).

Step 4: Create Configuration File

Save as config.json:

{
  "model_path": "/home/username/models/llama-70b-q6",
  "batch_size": 16,
  "cache_mode": "FP8",
  "max_seq_len": 4096,
  "gpu_split": null,
  "int8_kv_cache": true
}

Explanation:

batch_size: 16 — Start conservative; increase by 4-8 until VRAM is full
cache_mode: FP8 — Saves 50% VRAM vs FP16, <0.5% quality loss
gpu_split: null — Use single GPU; [0.5, 0.5] for multi-GPU (rarely beneficial)
int8_kv_cache: true — Additional 5-10% speedup with minimal quality impact
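
Since config.json here is this guide's own settings file rather than a format ExLlamaV2 defines, it can help to validate it before loading a multi-gigabyte model. A minimal sketch using only the standard library follows. The set of allowed cache_mode values and the checks themselves are assumptions for illustration.

```python
import json

REQUIRED_KEYS = {"model_path", "batch_size", "cache_mode", "max_seq_len"}
# Assumption: only these two cache modes are discussed in this article.
KNOWN_CACHE_MODES = {"FP16", "FP8"}


def validate_settings(settings: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the settings look sane."""
    problems = []
    missing = REQUIRED_KEYS - settings.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    batch = settings.get("batch_size")
    if not isinstance(batch, int) or batch < 1:
        problems.append("batch_size must be a positive integer")
    if settings.get("cache_mode") not in KNOWN_CACHE_MODES:
        problems.append(f"cache_mode must be one of {sorted(KNOWN_CACHE_MODES)}")
    seq_len = settings.get("max_seq_len")
    if not isinstance(seq_len, int) or seq_len < 1:
        problems.append("max_seq_len must be a positive integer")
    return problems


example = json.loads("""
{
  "model_path": "/home/username/models/llama-70b-q6",
  "batch_size": 16,
  "cache_mode": "FP8",
  "max_seq_len": 4096,
  "gpu_split": null,
  "int8_kv_cache": true
}
""")
print(validate_settings(example) or "settings look OK")
```

Returning a list of problems instead of raising on the first one lets you fix every mistake in a single pass.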

Step 5: Test with a Simple Script

Create test_inference.py:

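import json
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Cache_8bit, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# config.json is this guide's own settings file; ExLlamaV2 itself only needs the model directory.
# Class and method names follow the examples in the ExLlamaV2 repository; check them
# against the version you installed.
with open("config.json") as f:
    settings = json.load(f)

# Load model
config = ExLlamaV2Config(settings["model_path"])
model = ExLlamaV2(config)

# FP8 cache sized to hold every sequence in the batch
cache_tokens = settings["batch_size"] * settings["max_seq_len"]
cache = ExLlamaV2Cache_8bit(model, max_seq_len=cache_tokens, lazy=True)
model.load_autosplit(cache, progress=True)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Create batch of identical prompts for testing
prompts = ["What is machine learning?"] * settings["batch_size"]
max_new_tokens = 100

# Run inference and measure throughput
start_time = time.time()
outputs = generator.generate(prompt=prompts, max_new_tokens=max_new_tokens, add_bos=True)
elapsed = time.time() - start_time

# Assumes every sequence generates the full max_new_tokens
total_tokens = len(prompts) * max_new_tokens
throughput = total_tokens / elapsed

print(f"Throughput: {throughput:.1f} tokens/second")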

Run 5 times, ignore the first (cold load), report the average of the remaining 4.
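
That measurement protocol is easy to get wrong by hand. Here is a minimal sketch of a helper that runs a benchmark several times, discards the cold first run, and averages the rest. run_once is a hypothetical callable returning tokens per second for one pass, and the readings below are made up to show the mechanics.

```python
from statistics import mean
from typing import Callable


def warm_average(run_once: Callable[[], float], runs: int = 5) -> float:
    """Call run_once `runs` times, drop the first (cold) result, average the others."""
    if runs < 2:
        raise ValueError("need at least 2 runs to discard the cold one")
    results = [run_once() for _ in range(runs)]
    return mean(results[1:])


# Stand-in for a real benchmark: replays made-up readings, the first one cold.
readings = iter([31.0, 58.0, 61.0, 60.0, 59.0])
print(f"Warm average: {warm_average(lambda: next(readings)):.1f} tok/s")
```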

Tuning for Maximum Throughput

Finding Your GPU's Optimal Batch Size

This is the single most important dial.

Start conservatively: batch_size: 8 (RTX 4090) or batch_size: 16 (RTX 5090)
Run the test script 3 times, record average tok/s
Increase batch size by 4-8, retest
Stop when either:
- Throughput plateaus (you've saturated the GPU) → this is ideal
- CUDA out-of-memory (batch too large) → reduce by 4-8 and use previous value

Example (RTX 5090 with 70B Q6):

Batch 16: 140 tok/s
Batch 20: 180 tok/s
Batch 24: 210 tok/s
Batch 28: 230 tok/s
Batch 32: 240 tok/s
Batch 36: 242 tok/s  ← plateaued, use 32
Batch 40: CUDA OOM

The plateau is your GPU fully saturated. More requests = same throughput because the GPU is already at 100% utilization.
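
The search procedure above can be sketched as a loop. Here measure is a hypothetical callable that returns tok/s for a batch size and raises MemoryError on out-of-memory (a real run would catch the CUDA out-of-memory exception from your framework instead). The 2% plateau threshold is also an assumption. The stand-in data replays the example figures from this section.

```python
from typing import Callable


def find_batch_size(measure: Callable[[int], float], start: int, step: int = 4,
                    plateau_gain: float = 0.02, limit: int = 256) -> int:
    """Increase batch size until throughput plateaus or memory runs out.

    Returns the last batch size that still gave a meaningful gain.
    """
    best_batch = start
    best_rate = measure(start)
    batch = start + step
    while batch <= limit:
        try:
            rate = measure(batch)
        except MemoryError:
            break  # out of memory: keep the previous working value
        if rate < best_rate * (1 + plateau_gain):
            break  # gain under the threshold: the GPU is saturated
        best_batch, best_rate = batch, rate
        batch += step
    return best_batch


# Stand-in for real measurements, using the example table above.
EXAMPLE = {16: 140, 20: 180, 24: 210, 28: 230, 32: 240, 36: 242}


def replay(batch: int) -> float:
    if batch not in EXAMPLE:
        raise MemoryError(f"simulated CUDA OOM at batch {batch}")
    return EXAMPLE[batch]


print(find_batch_size(replay, start=16))
```

On the example data the loop stops at batch 36, where the gain over batch 32 falls under the threshold, and returns 32.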

Memory Optimization

KV Cache Precision: Setting cache_mode: FP8 instead of FP16 cuts cache memory in half with <0.5% quality loss. On a 70B model, this frees ~5-8GB.
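
The size of that saving depends on how many tokens the cache must hold. Below is a minimal sketch of the standard KV cache size formula, assuming Llama 3.1 70B's published architecture (80 layers, 8 key-value heads under grouped-query attention, head dimension 128). The function name is illustrative.

```python
def kv_cache_gib(tokens: int, bytes_per_value: float, layers: int = 80,
                 kv_heads: int = 8, head_dim: int = 128) -> float:
    """KV cache size in GiB: keys and values (x2) for every layer, KV head, and token."""
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30


# Total tokens held at once = batch size x sequence length (values from the sample config).
tokens = 16 * 4096

fp16 = kv_cache_gib(tokens, bytes_per_value=2)
fp8 = kv_cache_gib(tokens, bytes_per_value=1)
print(f"FP16 cache: {fp16:.1f} GiB, FP8 cache: {fp8:.1f} GiB, saved: {fp16 - fp8:.1f} GiB")
```

The saving scales linearly with batch size and sequence length, so shorter prompts or smaller batches free proportionally less memory.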

INT8 KV Cache Quantization: int8_kv_cache: true quantizes attention key-value caches to 8-bit. Another 5-10% speedup with imperceptible quality impact.

Quantization Format:

Q6: Best quality, slowest (but "slowest" is still 100+ tok/s)
Q5: 95% of Q6 quality, 10% smaller VRAM footprint — ideal for most users
Q4: Noticeably worse on reasoning; use only for summarization/classification

ExLlamaV2 vs llama.cpp: Which Should You Actually Use?

The Real Difference

llama.cpp

First-token latency: 100-300ms (streaming)
Throughput: 40-70 tok/s (interactive)
Setup time: 2 min (download binary)
CPU requirement: Works on 4-core
Platforms: macOS, Windows, Linux
Best for: Interactive chat/search

Decision Tree

Use ExLlamaV2 if:

You're processing 100+ requests/queries in parallel
Each request can wait 2-4 seconds for first token
You measure success by tokens processed per day, not latency
Your typical workload: API backend, batch jobs, content generation

Use llama.cpp if:

You need fast first-token response (<500ms)
You want to run on macOS or minimal CPU hardware
You're building a chatbot or real-time assistant
You want binary simplicity over maximum throughput

The Throughput Math

On an RTX 5090 processing a backlog of 1 million tokens:

ExLlamaV2 at 250 tok/s: 1M / 250 = 4,000 seconds (~67 minutes)

llama.cpp at 50 tok/s: 1M / 50 = 20,000 seconds (~333 minutes)

That 5x speedup in batch processing is significant, provided the reported 250 tok/s holds on your system. But if every request is interactive and you care about latency, ExLlamaV2's higher throughput doesn't help: you're blocked waiting on that first token anyway.
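
The same arithmetic generalizes to any backlog and throughput. A short sketch follows, with the two rates taken from the example above (the 250 tok/s figure remains a reported, unverified number). The function name is illustrative.

```python
def processing_minutes(total_tokens: int, tokens_per_second: float) -> float:
    """Minutes needed to work through a token backlog at a steady throughput."""
    return total_tokens / tokens_per_second / 60


backlog = 1_000_000
fast = processing_minutes(backlog, 250)  # reported ExLlamaV2 figure
slow = processing_minutes(backlog, 50)   # llama.cpp figure used in this article

print(f"250 tok/s: {fast:.0f} min, 50 tok/s: {slow:.0f} min, speedup: {slow / fast:.1f}x")
```

Swapping in throughput measured on your own hardware gives a more trustworthy estimate than either quoted figure.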

Common Setup Errors and Fixes

CUDA Compilation Error: "cuda.h not found"

Cause: CUDA 12.1+ not installed or not in PATH.

Fix:

# Set CUDA path (adjust version)
export CUDA_HOME=/usr/local/cuda-12.1
export PATH=$CUDA_HOME/bin:$PATH

# Verify
nvcc --version  # Should print CUDA 12.1 or later

# Retry build and install
pip install .

VRAM Error: "CUDA out of memory"

Cause: Batch size exceeds your GPU's VRAM.

Fix:

Reduce batch_size by 50% in config.json (e.g., 32 → 16)
Re-run test script
Monitor actual memory: nvidia-smi during inference
If still OOM: try cache_mode: FP8 and int8_kv_cache: true
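
The halving step can be automated. Here is a minimal sketch, where run_batch is a hypothetical callable that raises MemoryError when the batch does not fit (substitute the CUDA out-of-memory exception your framework raises).

```python
from typing import Callable


def largest_fitting_batch(run_batch: Callable[[int], None], batch_size: int) -> int:
    """Halve batch_size until run_batch succeeds; raise if even a batch of 1 fails."""
    while batch_size >= 1:
        try:
            run_batch(batch_size)
            return batch_size
        except MemoryError:
            batch_size //= 2  # e.g. 32 -> 16 -> 8
    raise RuntimeError("model does not fit even at batch size 1")


def fake_run(batch_size: int) -> None:
    # Stand-in: pretend anything above 12 sequences exhausts VRAM.
    if batch_size > 12:
        raise MemoryError("simulated CUDA out of memory")


print(largest_fitting_batch(fake_run, 32))
```

Halving converges quickly but can undershoot. Once a working value is found, stepping back up in increments of 4 recovers the headroom.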

Slow Throughput (20 tok/s instead of expected 100+)

Diagnosis checklist:

# 1. Is batch_size actually large?
grep batch_size config.json  # Should be ≥12

# 2. Is GPU being used?
nvidia-smi dmon  # Run during inference, should show >80% GPU utilization

# 3. Is CUDA properly linked?
python -c "import torch; print(torch.cuda.is_available())"  # Should be True

# 4. Is model still loading from disk?
# Run test 5 times, measure average (first is always slowest due to cold load)

If GPU utilization is <50%, you're not batching effectively — check batch_size in config.
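
For a scripted version of check 2, nvidia-smi can emit machine-readable CSV through its --query-gpu option. The sketch below parses one line of that output. The sample string is made up for illustration, and the 50% threshold follows the rule of thumb above.

```python
# Parses one line of output from:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits

def parse_gpu_stats(csv_line: str) -> dict:
    """Convert one CSV line (utilization %, memory used MiB, memory total MiB) into a dict."""
    util, used, total = (int(field.strip()) for field in csv_line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def diagnose(stats: dict) -> str:
    """Apply the rule of thumb: under 50% utilization means batching is ineffective."""
    if stats["util_pct"] < 50:
        return "GPU underused: batch size is probably too small"
    return "GPU utilization looks healthy"


sample = "37, 21504, 24564"  # made-up reading for illustration
stats = parse_gpu_stats(sample)
print(stats, "->", diagnose(stats))
```

In a real run, the line would come from subprocess.run(...) invoking the command in the comment while inference is in progress.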

FAQ

Can I run ExLlamaV2 on RTX 3090 with 70B models?

Only with compromises. 70B at Q5/Q6 is 45-58GB, well beyond the RTX 3090's 24GB, so it needs a lower-bitrate quantization or a second card. Expect batch size ≤8 and throughput of ~35-50 tok/s. More practical: run a 30B-class model at Q6 (batch size 32, throughput 60-80 tok/s).

Should I upgrade from RTX 4090 to RTX 5090?

Cost-benefit depends on token volume. By this article's figures, the RTX 5090's reported 250 tok/s is roughly 3.5-5x the RTX 4090's 50-70 tok/s, at ~2x the cost. Breakeven: processing >100,000 tokens/month actively batched. For hobby/learning, RTX 4090 is plenty.

Two RTX 4090s instead of one RTX 5090?

Theoretically appealing (at the 50-70 tok/s per card estimated above, 100-140 tok/s combined), but real-world synchronization overhead of around 25% would cut that to roughly 75-105 tok/s. Single RTX 5090 is simpler, faster, and uses less power.

What about quantizing 70B down to Q4 for bigger batches?

Possible, but Q4 introduces measurable quality loss (~15% more hallucinations on reasoning tasks). Stick with Q5/Q6 unless you're doing only summarization or classification where reasoning doesn't matter.

Does ExLlamaV2 support Windows?

Yes, Windows 10/11 with NVIDIA driver 560+ and CUDA 12.1+. Build process is identical (uses Visual Studio compiler).

Can I run multiple models in parallel?

ExLlamaV2 loads one model into VRAM at a time. To serve multiple models, run separate processes on separate GPUs, or swap models in and out (significant latency hit). Use load balancing on incoming requests to distribute across processes.

Final Verdict: Is ExLlamaV2 Worth Your Time?

Yes, if:

Your workload naturally batches (100+ requests/queries at once)
You're willing to spend 30-45 minutes on setup
You have an RTX 4090 or newer
Processing 10,000+ tokens/month justifies the complexity

No, if:

Every request is interactive (chatbot, search, chat)
You want turnkey simplicity (download, run, done)
You're on macOS or minimal hardware
You're experimenting/learning (llama.cpp is faster to get started)

ExLlamaV2's batch throughput is genuinely exceptional — reported figures of 200-250 tok/s circulate widely, though independent verification at those scales remains limited. The setup is real work, but it's documented, well-supported on GitHub, and reproducible. If you're building an API backend or handling bulk inference, the complexity pays for itself in the first week.

For everyone else, llama.cpp remains simpler and faster for interactive workloads. Use the right tool for your job.
