When Will TurboQuant Land in Ollama? Current Status and What to Watch (March 2026)

By Charlotte Stewart · 8 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.


**TurboQuant is not what most people think it is.** It doesn't help you fit a 70B model onto a 12GB GPU. It compresses the KV cache — the memory used for context, not weights — and does it well enough (4.9x at TQ3) that it fundamentally changes long-context inference. As of March 2026, it's available today if you're willing to build from source. Ollama integration doesn't have a confirmed date, but the path is clearer than it was two weeks ago.

**TL;DR:** TurboQuant compresses your KV cache 4–5x, not your model weights — meaning 5x more context on the same GPU, not bigger models on smaller hardware. It's usable now via community forks of llama.cpp (CLI-only, build from source). The mainline llama.cpp PR (#21089) is under active review with a realistic weeks-not-months timeline for CPU support. Ollama is downstream and won't move until llama.cpp merges first — no committed date, no assigned maintainer. If you need TurboQuant today, switch to a fork. If you're waiting for Ollama, you're waiting on two merge events, not one.

---

## What TurboQuant Actually Does (It's Not What You've Seen Claimed)

Most coverage of TurboQuant describes it as "2-3x VRAM reduction" or "fit 70B models on 12GB GPUs." That framing is wrong, and it's creating real confusion about who should care.

TurboQuant is an online vector [quantization](/glossary/quantization) algorithm that compresses the **KV cache** — the key/value vectors your model stores during inference for each token it processes. It does not touch model weights. Those are compressed at the Q4_K_M or AWQ stage, before inference starts. The KV cache is different: it grows dynamically as context grows, one vector per token per layer, stored in FP16 by default.

This distinction matters enormously. If you're running Llama 3.1 70B Q4_K_M, your model weights occupy roughly 40GB. TurboQuant doesn't change that number. What it changes is how much [VRAM](/glossary/vram) your context window consumes after the model loads.

At short context (8K tokens), the KV cache on a 70B model is relatively small. At 128K tokens — the kind of context you need for large document analysis, multi-file codebases, or long agentic conversations — the KV cache can exceed the model weights themselves. That's the problem TurboQuant solves.
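The crossover point is easy to see with back-of-envelope arithmetic. The sketch below assumes a Llama 3.1 70B-like GQA configuration (80 layers, 8 KV heads, head dimension 128) — check your model card for the exact values:

```python
# KV-cache size estimate for an assumed Llama-3.1-70B-like GQA config:
# 80 layers, 8 KV heads, head dim 128. One K and one V vector per token
# per layer, stored in FP16 (2 bytes per element) by default.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
BYTES_FP16 = 2

def kv_cache_bytes(context_tokens: int, bytes_per_elem: float = BYTES_FP16) -> float:
    elems_per_token = LAYERS * KV_HEADS * HEAD_DIM * 2  # K + V
    return context_tokens * elems_per_token * bytes_per_elem

GIB = 1024**3
print(kv_cache_bytes(8_192) / GIB)    # 2.5 GiB at 8K context -- small next to 40GB of weights
print(kv_cache_bytes(131_072) / GIB)  # 40.0 GiB at 128K -- rivals the weights themselves
```

At 8K context the cache is a rounding error next to the weights; at 128K, under these assumptions, it matches them.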

### TQ3 and TQ4: Two Modes, Different Trade-offs

Google's ICLR 2026 paper defines two quantization modes:

- **TQ3:** 3.0625 bits per element, 4.9x compression vs FP16 KV cache. At 3.5 bits per element, quality matches FP16 baseline on LongBench.
- **TQ4:** 4.0625 bits per element, 3.8x compression. Quality is essentially indistinguishable from FP16 on models 3B and larger.

The technique uses a randomized Hadamard transform to decorrelate vectors before quantizing — this is why it achieves near-optimal distortion at 3 bits where naive quantization fails. No calibration data required, no preprocessing pass. It runs online, quantizing each K/V vector the moment it's produced.
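To make the decorrelation step concrete, here is a minimal NumPy sketch of a randomized Hadamard rotation — random sign flips followed by an orthonormal Hadamard transform. The dimension and the Sylvester construction are illustrative assumptions, not the paper's exact kernel:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def randomized_hadamard(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    n = x.size
    d = rng.choice([-1.0, 1.0], size=n)          # random sign flips
    return (hadamard(n) @ (d * x)) / np.sqrt(n)  # orthonormal rotation

rng = np.random.default_rng(0)
x = rng.standard_normal(128) * np.linspace(0.1, 3.0, 128)  # uneven per-dim scales
y = randomized_hadamard(x, rng)

# The rotation preserves the vector's norm exactly while spreading its energy
# evenly across dimensions -- that uniformity is what lets a simple low-bit
# scalar quantizer work well where naive per-element quantization fails.
print(np.allclose(np.linalg.norm(x), np.linalg.norm(y)))  # True
```

Because the transform is orthonormal and needs no statistics of the data, it can run online on each K/V vector with no calibration pass, which matches how TurboQuant is described.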

You can stack TurboQuant on top of standard weight quantization. Q4_K_M weights plus TQ3 KV cache is a valid and useful combination, and the VRAM savings compound.

---

## Current Integration Status: Where TurboQuant Lives (March 2026)

There's a fork ecosystem and a mainline PR. They're different things with different timelines.

### Community Forks (Working Today)

Several forks are actively maintained and usable right now:

(Fork comparison table: entries last updated throughout March 2026, most recently March 25–26.)

All require building from source. CLI flags: `--cache-type-k turbo3 --cache-type-v turbo3`. No Modelfile support, no Ollama UI integration, alpha stability. If you hit a bug, you file an issue and wait.

The most accessible path right now is Apple Silicon — TheTom's fork has Metal support working, and unified memory means you're getting the full benefit immediately. CUDA forks are slightly further behind in polish, but functional.
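If you want to try a fork, the steps look like a standard llama.cpp source build. The repo name below comes from the forks listed in this article; the CMake flags and model path are illustrative assumptions — check the fork's README for its actual instructions:

```shell
# Assumed: standard llama.cpp CMake build flow; verify against the fork's README.
git clone https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant
cmake -B build -DGGML_METAL=ON   # Metal on Apple Silicon; -DGGML_CUDA=ON on the CUDA forks
cmake --build build --config Release -j

# Run with the TurboQuant KV-cache flags; model path is a placeholder.
./build/bin/llama-cli -m ~/models/llama-3.1-70b-q4_k_m.gguf \
  -c 131072 \
  --cache-type-k turbo3 --cache-type-v turbo3
```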

> [!NOTE]
> These forks are experimental research-grade code. Don't use them in production pipelines you depend on. For personal experimentation and testing, they're fine — but expect rough edges.

### Mainline llama.cpp: PR #21089

This is the one to watch. PR #21089 by `eluszlik` is the first serious upstream attempt at merging TurboQuant into mainline llama.cpp. It adds `tbq3_0` and `tbq4_0` as new ggml types, wires through the full type system (quantize/dequantize, vec_dot, KV/graph handling, tooling, tests), and is under active review — 26 files, Copilot automated review running.

**Separate from this:** ggml-org/llama.cpp issue #20977 (the feature request) has 136 upvotes and is the community rally point. Discussion #20969 has additional implementation context. Both opened March 25, 2026.

CPU-only support comes first. CUDA GPU support will lag by additional weeks after the CPU merge. This is standard practice for llama.cpp — get the reference implementation right, then optimize for hardware.

Here at CraftRigs, we're tracking this PR as we benchmark our own inference rigs. [Our quantization explainer](/guides/quantization-explained/) has background on how ggml types work if PR #21089 is confusing to read.

### Ollama: Under the Surface, Moving Slowly

Ollama issue #15051 ("native ollama-go-engine: TurboQuant+RotorQuant implementation") was opened March 25, 2026. Current status: 13 upvotes, no PR submitted, no assigned maintainer, no milestone.

This isn't surprising. Ollama is architecturally downstream of llama.cpp — it wraps llama.cpp's inference engine, doesn't rewrite it. Before Ollama can ship TurboQuant, llama.cpp has to merge it first. Then Ollama's team needs to bump their llama.cpp dependency, test across community hardware, and ship a release.

That's two merge events and a full release cycle, not one. Anyone quoting Q2 2026 as a confirmed Ollama timeline is guessing. The honest answer is: after llama.cpp merges PR #21089, and after Ollama updates its dependency, and after testing completes. The "weeks not months" assessment that's circulating online applies to llama.cpp CPU support only — not Ollama.

> [!WARNING]
> Don't make hardware purchasing decisions based on unconfirmed Ollama TurboQuant timelines. The feature request exists. No PR has been submitted. No maintainer has committed to a ship date.

---

## Real-World Numbers: What TurboQuant Enables

Here's what TQ3 actually unlocks in concrete terms, using a 70B model as the example:

On a GPU with 34GB free for KV cache after model loading:

| KV Cache Format | Usable Context |
|---|---|
| FP16 (default) | ~109K tokens |
| Q8 KV cache | ~218K tokens |
| TQ3 KV cache | ~536K tokens |

That's not a small delta. For anyone doing document analysis, running large codebases through local AI, or building long agentic pipelines, this is the difference between possible and practical.
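The table falls out of simple scaling: usable context grows linearly with KV-cache compression. The 109K FP16 baseline comes from the table above and the 2.0x / 4.9x ratios from the paper; exact figures shift slightly with rounding of the baseline:

```python
# Usable context scales linearly with KV-cache compression ratio.
# Baseline (~109K tokens in 34GB free VRAM) is from the table above.
FP16_CONTEXT_K = 109  # thousands of tokens at FP16

for name, ratio in [("FP16", 1.0), ("Q8", 2.0), ("TQ3", 4.9)]:
    print(f"{name}: ~{FP16_CONTEXT_K * ratio:.0f}K tokens")
```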

### Who Actually Benefits

**High-VRAM GPU users (RTX 4090 at 24GB, RTX 5090 at 32GB) running 32B–70B models** — they already fit the model but hit context walls. TurboQuant gives them 4–5x more usable context on the same card.

**Apple Silicon (M3 Max/Ultra, M4 Max/Ultra, 64–128GB unified memory)** — unified memory means KV cache competes directly with model weights for the same pool. TurboQuant eases that competition significantly. And the fork works today.

**Long-context agent workflows** — document analysis, multi-file code review, multi-turn conversations exceeding 32K tokens. This is the use case TurboQuant was designed for.

**Who doesn't benefit much:** if you're running Llama 3.1 8B or Qwen 14B at 8-16K context on a 24GB GPU, the KV cache is tiny relative to available VRAM. TurboQuant adds complexity with minimal gain. For hardware guidance on building for this use case, see [our RTX 5070 vs 4070 comparison](/comparisons/rtx-5070-vs-4070/).

---

## Timeline: When You Can Actually Use It

(Timeline table: the stability column runs Alpha → Beta → Beta → Stable across the integration stages, from today's community forks through an eventual Ollama release.)
The "weeks not months" estimate comes from PR #21089's active review state. It's possible. It's also possible the review surfaces blockers that stretch the timeline. GPU support adds another lag on top of CPU.

### Decision Tree: Switch Now or Wait?

**Switch to a llama.cpp fork today if:**
- You regularly process 50K+ token contexts
- You're on Apple Silicon — TheTom's fork is your smoothest path
- You're comfortable building from source and handling alpha bugs
- Long-context performance is blocking your actual work right now

**Wait for Ollama if:**
- You depend on Ollama's UI and Modelfile workflow
- Your context needs are under 32K tokens (standard Q8 KV cache handles this fine)
- Stability matters more than being first — and you're not currently bottlenecked
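If you land in the "wait" camp, note that Ollama already ships standard KV-cache quantization today, which covers the under-32K case. A minimal sketch, assuming Ollama's documented `OLLAMA_KV_CACHE_TYPE` and `OLLAMA_FLASH_ATTENTION` environment variables — verify against your Ollama version's docs:

```shell
# Assumed from Ollama's documented env vars: KV-cache quantization
# requires flash attention to be enabled.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0   # roughly halves KV-cache VRAM vs the f16 default
ollama serve
```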

**Don't confuse the two problems.** TurboQuant doesn't help if you're trying to fit a larger model onto your GPU. If you're on an RTX 5070 Ti (16GB) hoping TurboQuant will get 70B models running, that's not what this does. That's a weight quantization and CPU-offload conversation. See [our Ollama setup guide](/guides/ollama-setup-complete/) for current options on mid-range hardware.

---

## What Could Speed or Slow Integration

**Speeders:**
- PR #21089 merges without major revision requests → CPU support lands, CUDA forks can rebase cleanly
- A GPU (CUDA) implementation PR gets submitted independently and parallels the CPU track
- Google drops official reference code on schedule (expected Q2 2026), accelerating fork quality
- vLLM merging TurboQuant (PR #38280 is already in Phase 1) creates cross-ecosystem pressure on Ollama

**Slowers:**
- Performance regression discovered on specific model architectures during review
- Community hardware incompatibilities (quantization bugs often surface on edge-case GPU configs)
- Maintainer bandwidth — llama.cpp merges are bottlenecked by a small core team
- Ollama's dependency update cycle — even after llama.cpp merges, Ollama has its own ship schedule

CraftRigs will update this article when PR #21089 status changes or a CUDA merge PR appears. Don't rely on a static snapshot — check the PR directly.

---

## FAQ

**Is TurboQuant available in Ollama?**
No. As of March 29, 2026, Ollama issue #15051 has 13 upvotes, no assigned maintainer, and no submitted PR. Ollama is downstream of llama.cpp — it needs llama.cpp to merge first, then a dependency bump, then testing. No committed timeline exists.

**What does TurboQuant actually compress?**
The KV cache — the memory your model uses to store context during inference. Not model weights. A 70B model at Q4_K_M still occupies roughly 40GB for its weights whether TurboQuant is in use or not. TurboQuant's benefit is dramatically extended context on the same hardware, not the ability to load larger models.

**Can I use TurboQuant today without waiting for Ollama?**
Yes. TheTom/llama-cpp-turboquant works on Apple Silicon. Madreag/turbo3-cuda and spiritbuun/llama-cpp-turboquant-cuda work on NVIDIA GPUs. All require building from source. CLI flags: `--cache-type-k turbo3 --cache-type-v turbo3`. Alpha stability — not for production pipelines.

**Which hardware benefits most from TurboQuant?**
RTX 4090 (24GB) and RTX 5090 (32GB) owners running 32B–70B models who are hitting context walls, and Apple Silicon users with 64GB+ unified memory. If you're running 7B–13B models at under 32K context, the complexity overhead isn't worth it — standard KV quantization handles that range fine.

**Does TurboQuant help fit bigger models on 12GB or 16GB GPUs?**
No. Model weight VRAM is untouched. Llama 3.1 70B Q4_K_M needs ~40GB for weights. TurboQuant won't change that. If you're trying to run 70B on an RTX 5070 Ti (16GB), you're looking at CPU offloading — which is a different and slower approach than TurboQuant.
