CraftRigs
Memory & Storage

TurboQuant

A Google Research KV cache compression technique that extends usable context length 4-5x on consumer GPUs without retraining the model.

TurboQuant is a KV cache compression method from Google Research that lets a 16–24GB consumer GPU hold 4-5x longer context windows than a standard FP16 cache would allow. It targets the exact bottleneck most local LLM builders hit first: running out of VRAM when the conversation or document gets long, not when the model itself loads.

What It Actually Compresses

TurboQuant operates on the KV cache — the per-token key/value tensors attention layers accumulate as a model reads input and generates output. The model weights are untouched, so this is orthogonal to weight quantization like Q4_K_M or AWQ. You can stack TurboQuant on top of an already-quantized model and get compression on both axes: smaller weights and a smaller per-token memory footprint as context grows.
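To make the per-token footprint concrete, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical example rather than any specific model, and the 4x ratio is simply the low end of the 4-5x range quoted above applied as a flat factor. It does not describe TurboQuant's internal bit layout.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_element: float) -> float:
    """Bytes of KV cache one token occupies: one key vector and one value
    vector per KV head, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


# Hypothetical model shape (assumption, not a specific model):
# 32 layers, 8 KV heads, head dimension 128, FP16 = 2 bytes per element.
fp16_per_token = kv_cache_bytes_per_token(32, 8, 128, bytes_per_element=2)

# Assumption: a flat 4x reduction, the low end of the quoted 4-5x range.
compressed_per_token = fp16_per_token / 4

print(f"FP16 cache:       {fp16_per_token / 1024:.0f} KiB per token")
print(f"Compressed cache: {compressed_per_token / 1024:.0f} KiB per token")
```

Because the footprint scales linearly with token count, the weights are a fixed cost while the cache is the part that grows with context. That is why the two kinds of quantization stack.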

Why It's Not a Magic VRAM Reset

The "Already Released, and Not What You Think" framing matters here. TurboQuant doesn't shrink the model, doesn't speed up tokens per second on short prompts, and doesn't eliminate the need for VRAM. It pays off specifically in long-context workloads (RAG pipelines, multi-document summarization, long agent traces) where the KV cache, not the weights, is what forces a 24GB card to offload to system RAM. On short chats the difference is negligible. The memory-stock selloff that followed the paper assumed every workload was long-context, which overstates the impact on memory demand.
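A rough illustration of why short chats see almost no benefit: the absolute saving grows with context length. The per-token size (128 KiB at FP16) and the flat 4x ratio are assumptions chosen for illustration, not measured TurboQuant results.

```python
GIB = 1024 ** 3

# Assumption: 128 KiB of FP16 KV cache per token
# (e.g. 32 layers x 8 KV heads x 128 dims x 2 tensors x 2 bytes).
FP16_BYTES_PER_TOKEN = 131072
# Assumption: flat 4x compression, the low end of the quoted 4-5x range.
COMPRESSION_RATIO = 4.0


def cache_savings_gib(context_tokens: int) -> float:
    """GiB of VRAM freed by compressing the KV cache at a given context length."""
    fp16 = context_tokens * FP16_BYTES_PER_TOKEN
    compressed = fp16 / COMPRESSION_RATIO
    return (fp16 - compressed) / GIB


for tokens in (1_024, 8_192, 32_768, 131_072):
    print(f"{tokens:>7} tokens: saves {cache_savings_gib(tokens):.2f} GiB")
```

Under these assumptions a 1k-token chat frees under 0.1 GiB, while a 32k-token context frees about 3 GiB, which is the difference between fitting on the card and spilling off it.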

Why It Matters for Local AI

For a builder on a single 3090, 4090, or 7900 XTX, TurboQuant is the difference between fitting a 32k-token document into native context and chunking it through a RAG pipeline. It pushes the practical ceiling of consumer hardware closer to what a datacenter GPU offers for context length, without forcing an upgrade to a Blackwell card or a multi-GPU rig. Treat it as a context extender, not a model accelerator, and check whether your runtime (llama.cpp, ExLlamaV2, vLLM) actually supports it before planning a build around the headline numbers.
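For build planning, a simple budget check along these lines can estimate how much context fits on a card. Every number here is a hypothetical input: the 18 GiB of weights, the 2 GiB of runtime overhead, the 256 KiB per-token FP16 cache, and the flat 4x ratio are assumptions for illustration, so substitute figures measured on your own runtime.

```python
GIB = 1024 ** 3


def max_context_tokens(vram_gib: float, weights_gib: float,
                       overhead_gib: float, bytes_per_token: float) -> int:
    """Tokens of KV cache that fit in the VRAM left after weights and overhead."""
    free_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    return max(0, int(free_bytes // bytes_per_token))


# Hypothetical build: 24 GiB card, 18 GiB of quantized weights,
# 2 GiB of activations and runtime overhead.
# Assumption: 256 KiB of FP16 KV cache per token
# (e.g. 64 layers x 8 KV heads x 128 dims x 2 tensors x 2 bytes).
fp16_bytes_per_token = 262144

fp16_limit = max_context_tokens(24, 18, 2, fp16_bytes_per_token)
# Assumption: flat 4x compression, the low end of the quoted 4-5x range.
compressed_limit = max_context_tokens(24, 18, 2, fp16_bytes_per_token / 4)

print(f"FP16 cache:       about {fp16_limit} tokens")
print(f"Compressed cache: about {compressed_limit} tokens")
```

With these inputs the FP16 cache tops out at roughly 16k tokens, so a 32k-token document would not fit, while the compressed cache leaves room for about 64k.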