The part of TurboQuant that matters for your Mac isn't what most posts are saying.
Since Google published the paper on March 24, 2026, the headlines have focused on the 6x memory reduction and the 8x attention speedup on H100 GPUs. What those headlines gloss over: TurboQuant is a KV cache compression algorithm — it doesn't touch model weights. It doesn't help you fit a 70B model where a 34B fit before. That's not what it does.
What it does do is extend how much context you can run before unified memory runs out. On a 36GB M3 Max already running a 34B model, that's the constraint you actually hit. Not the model size — the context wall.
TL;DR: TurboQuant on MLX extends effective context windows by up to 6x on the same hardware, with near-zero quality loss. Community implementations for MLX landed within days of Google's ICLR 2026 paper. It won't help you fit bigger models — but if you're hitting memory walls at 32K+ contexts on your current Mac, this is worth testing before you spend $2K on more RAM.
What Just Landed: TurboQuant Comes to MLX
Google Research published TurboQuant on March 24, 2026 — it's headed to ICLR 2026 and was co-authored with researchers at NYU. The core claim: compress LLM KV caches to 3-4 bits per coordinate with no retraining, no calibration data, and no measurable accuracy loss on standard benchmarks.
The official Google code release is still pending (expected Q2 2026). But the paper was enough. Within 24 hours, three separate community implementations appeared targeting MLX natively on Apple Silicon:
- rachittshah/mlx-turboquant — PolarQuant implementation, 3-5x compression, designed to upstream into ml-explore/mlx-lm
- sharpner/turboquant-mlx — up to 5.5x compression, two modes: hardware-accelerated (V2) and paper-correct Lloyd-Max codebook (V3)
- helgklaizar/turboquant_mlx — asymmetric compression (TurboQuant for keys, lightweight PolarQuant for values), OpenAI-compatible server
None of these are merged into the official mlx-lm repository yet. You're installing from GitHub forks, not pip install mlx-lm. That matters for how you evaluate stability.
Note
The 8x attention speedup Google reports is measured on NVIDIA H100 GPUs with 4-bit attention kernels. Apple Silicon doesn't use the same kernel path. Expect real speedups on long contexts — community reports show a 26% reduction in generation wall time at extended context lengths — but don't assume the H100 numbers translate directly.
Why the Timing Matters
M4 Max and M4 Pro launched in late 2025. A lot of people bought them specifically to run large models. And the feedback in r/LocalLLaMA has been consistent: the models fit, the generation speed is good, but long-context sessions — large codebases, lengthy documents, extended agentic runs — eat through unified memory fast. TurboQuant targets exactly that.
What TurboQuant Actually Does (And What It Doesn't)
This is the part that's getting muddled in most coverage.
When you load a model, two things live in memory: the model weights and the KV cache. The weights are fixed — they're loaded once and stay there. For Llama 3.1 70B at INT4 quantization (AWQ), that's roughly 35 GB. You either have enough unified memory for the weights or you don't. TurboQuant does nothing to that number.
The KV cache is different. It's the memory that stores the keys and values for every token in your context window, and it grows linearly as the conversation or document gets longer. On Llama 3.1 70B, the KV cache consumes roughly 320 KB per token at FP16 precision. That math adds up fast:
| Context length | KV Cache (FP16) | KV Cache (TurboQuant 3-bit) |
|---|---|---|
| 8K | ~2.5 GB | ~0.4 GB |
| 32K | ~10 GB | ~1.7 GB |
| 64K | ~20 GB | ~3.3 GB |
| 128K | ~40 GB | ~6.7 GB |

Based on Llama 3.1 70B architecture (80 layers, 8 KV heads, head_dim 128, FP16). TurboQuant column assumes 6x compression from Google's ICLR 2026 paper.
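The per-token figure falls straight out of the architecture, so you can reproduce the table yourself. A back-of-envelope sketch (the 6x divisor is the paper's headline compression ratio, not something measured here):

```python
# Back-of-envelope KV cache sizing for Llama 3.1 70B:
# 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes per element).
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values each store layers * kv_heads * head_dim elements per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token(80, 8, 128)  # 327,680 bytes, i.e. ~320 KB
for ctx in (8_192, 32_768, 65_536, 131_072):
    fp16_gb = per_token * ctx / 1024**3
    print(f"{ctx:>7} tokens: FP16 ~{fp16_gb:.1f} GB, 6x compressed ~{fp16_gb / 6:.1f} GB")
```

At 64K tokens that's 20 GB uncompressed and ~3.3 GB at 6x, matching the table above.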
The 64K context row is where this lands. Without compression, a 70B model on a 64GB M4 Max has its weights consuming ~35 GB, leaving ~29 GB of headroom. Running 64K context eats 20 GB of that. You have 9 GB left for runtime overhead and nothing else. With TurboQuant, that same 64K context takes 3.3 GB — suddenly you have 26 GB of breathing room, and 128K context becomes feasible.
Warning
If you're on a 36GB M3 Max hoping TurboQuant will let you run 70B models: it won't. Llama 3.1 70B at INT4 requires ~35 GB for weights alone. The 36GB M3 Max's practical ceiling is 34B models, and TurboQuant doesn't change that. What it does change: your 34B model can now run 64K+ context windows that previously forced OOM errors.
The Context Wall Is the Actual Problem
Here's what the memory constraint looks like in practice on a 36GB M3 Max running Llama 3.1 34B.
The 34B model at Q4 uses roughly 18 GB for weights. That leaves ~16 GB for KV cache and runtime overhead. For Llama 3.1 34B (smaller architecture, approximately 48 layers, 8 KV heads), the KV cache runs about 192 KB per token at FP16. Which means:
- At 8K context: ~1.5 GB KV cache — fine, no issue
- At 32K context: ~6 GB KV cache — tight but workable
- At 64K context: ~12 GB KV cache — you're right at the edge, generation slows or crashes
With 6x TurboQuant compression, that 64K context drops to ~2 GB. You go from "barely works or doesn't" to "comfortable," and 128K context becomes realistic on hardware that previously couldn't touch it.
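The same arithmetic can be inverted: given the headroom left after weights, how many tokens of context fit. A rough sketch — the 18 GB weight figure and the ~2 GB runtime slack are this section's estimates, and the 48-layer config is the approximation above:

```python
# Rough headroom-to-context calculator for a ~34B config
# (approx. 48 layers, 8 KV heads, head_dim 128, FP16 KV cache).
def max_context_tokens(free_gb, layers=48, kv_heads=8, head_dim=128,
                       compression=1.0):
    per_token = 2 * layers * kv_heads * head_dim * 2 / compression  # bytes/token
    return int(free_gb * 1024**3 // per_token)

headroom = 36 - 18 - 2  # 36GB M3 Max: ~18 GB Q4 weights, ~2 GB runtime slack
print(max_context_tokens(headroom))                 # FP16 cache: ~87K tokens
print(max_context_tokens(headroom, compression=6))  # ~6x TurboQuant: ~524K
```

The compressed number lands far beyond any model's trained context window — the point is that memory stops being the binding constraint well before the model's own limits do.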
For power users analyzing large codebases, processing lengthy research documents, or running multi-hour agentic sessions with extended memory — this is the actual unlock. Not fitting a bigger model. Keeping a big model in a longer conversation.
Quality and Speed: What the Community Is Reporting
Google's formal benchmarks (on LongBench and needle-in-a-haystack retrieval tasks) show TurboQuant matching or outperforming KIVI baseline at 3-bit compression with near-zero accuracy loss. The paper backs this with formal distortion bounds within roughly 2.7x of the information-theoretic limit.
The first Apple Silicon-specific test I've seen: a community demo using flovflo/turboquant-mlx-qwen35-kv on Qwen 3.5-35B via MLX showed 100% exact match on needle-in-a-haystack at every quantization level, across context lengths from 8,500 to 64,000 tokens. One result. One model. But it's a clean one — that test specifically probes whether the model can retrieve information from arbitrary positions in a long context, which is exactly what KV cache compression can break.
On generation speed: the sharpner/turboquant-mlx V2 hardware-accelerated path reports up to 5.5x KV cache compression, and the flovflo demo showed a 26% reduction in generation wall time at longer context lengths, most likely because lower memory pressure means fewer thermal slowdowns and better memory bandwidth utilization on the GPU.
Tip
Start with the sharpner/turboquant-mlx V2 path for the hardware-accelerated Metal implementation. If quality is your priority over speed (coding, reasoning tasks), try V3 with the Lloyd-Max codebook — it's closer to the paper's intended algorithm. Compare outputs on a few prompts before trusting either for production use.
Base generation speed (tokens/second at short contexts) won't change meaningfully — that's bottlenecked by weight loading bandwidth, which TurboQuant doesn't touch. On M4 Max with 64GB, you're getting roughly 20-30 tok/s for 70B Q4 models via MLX. That stays the same. What changes is how fast those sessions degrade when they get long.
Who Should Actually Install This Right Now
Install it if you're a power user hitting context walls. Specifically: M3 Max 36GB users running 34B models who regularly hit OOM at 32K+ context, and M4 Max 64GB users running 70B models who want to push toward 64K-128K context reliably. If either of those describes your workflow — document analysis, codebase review, multi-session agentic work — the community implementations are worth testing.
Don't install it if you're running 8B-13B models for daily use. Context isn't your bottleneck, and you're adding complexity for zero practical gain.
Don't install it expecting to upgrade your model tier. If you're on 36GB and want to run 70B, TurboQuant won't help. You're still looking at a hardware upgrade or moving to a different approach like speculative decoding.
Not a beginner feature. This requires familiarity with MLX, working with GitHub forks, and some tolerance for "this is a community port of a 5-day-old paper." All three repos are moving fast. Expect API changes.
For a broader guide to what models fit which Apple Silicon chips before adding any compression layer, see our MLX Llama 3.1 benchmark guide.
How to Get Started
The quickest path is helgklaizar/turboquant_mlx if you want OpenAI-compatible server mode (drop-in for tools that already speak that API), or sharpner/turboquant-mlx if you want the closest thing to the paper's original algorithm.
Installation is pip-based from the GitHub repos. You'll need an MLX-compatible model in safetensors format — anything you'd use with standard mlx-lm works. The cache type swaps in at inference time with no model re-download or conversion required.
Full setup walkthroughs are in each repo's README, and the mlx-examples documentation at github.com/ml-explore/mlx-examples covers the base inference API these forks extend.
Once Google releases official code (expected Q2 2026), expect this to land in mlx-lm proper with official benchmark numbers on Apple Silicon. Until then, these are high-quality community ports — solid for testing and exploration, but not something you'd run in production without evaluating on your specific workload.
What This Means for the Mac AI Landscape
Apple Silicon has always had the unified memory efficiency story. What it lacked was tooling parity — the GGUF ecosystem on NVIDIA has had quantization options and KV cache tricks for years. TurboQuant on MLX closes a real gap.
The Mac mini M4 Pro with 36GB (~$1,399) was already a legitimate 34B-model machine. With TurboQuant, it becomes a legitimate 34B-model machine with real long-context capability — which is a different beast. Processing 100-page documents, maintaining long coding sessions, running overnight agentic tasks that accumulate substantial context: all of these become much more viable.
The M4 Max with 64GB stays where it is for model size — you can already run 70B. But you can now run 70B with 100K+ context windows without OOM risk, which changes the class of problems you can actually solve.
And no, it doesn't close the gap with NVIDIA on raw throughput. An RTX 5090 still runs 70B faster. But for Mac-only workflows where silence, energy efficiency, and a shared-memory architecture are features rather than trade-offs, TurboQuant makes the calculus more favorable at every chip tier.
Check our which Mac for local AI guide — some of the upgrade recommendations there shift meaningfully once you factor in TurboQuant's context extension.
FAQ
Does TurboQuant let you run larger models on Apple Silicon? No — and this is the most common misconception in current coverage. TurboQuant compresses the KV cache, not model weights. A Llama 3.1 70B model at INT4 quantization still requires roughly 35 GB for its weights. If your Mac doesn't have the memory for those weights, TurboQuant doesn't change that. The 36GB M3 Max stays a 34B machine. TurboQuant changes how much context that machine can handle, not which models fit.
What does TurboQuant actually improve on Apple Silicon? Context length and memory headroom during long sessions. Without TurboQuant, the KV cache for a 70B model at 64K tokens consumes roughly 20 GB on top of model weights. With 6x compression, that drops to about 3.3 GB — which means a 64GB M4 Max can now comfortably push 128K context windows where it previously ran out of memory around 40-50K tokens.
Is TurboQuant in the official mlx-lm release? Not as of March 2026. Three community implementations exist on GitHub, and at least one PR to upstream into ml-explore/mlx-lm was submitted in late March. Google's official code release is expected around Q2 2026. The community ports are functional but moving fast — treat them as experimental until they merge into the main repo.
Which Apple Silicon chips benefit most from TurboQuant? 36GB M3 Max and 36GB M4 Pro users see the biggest practical change. Those chips have the tightest memory headroom relative to their model ceilings, so context compression buys the most headroom. The 64GB M4 Max also benefits meaningfully, especially if you run 70B models at 32K+ context regularly. If you're on a 96GB or 128GB Mac, you were probably already managing fine — TurboQuant helps at the margins but isn't as transformative.
Does TurboQuant cause quality loss? Minimal, based on available data. Google's ICLR 2026 paper reports near-zero accuracy loss on LongBench and needle-in-a-haystack retrieval, with formal distortion bounds. A community MLX demo on Qwen 3.5-35B showed 100% exact match on needle-in-a-haystack across context lengths from 8,500 to 64,000 tokens. Official Apple Silicon benchmarks are still pending — that demo is the best data point we have right now, and it's one model on one test.