CraftRigs
Architecture Guide

Qwen3-Next-80B-A3B Thinking: 1M Context on Apple Silicon MLX

By Georgia Thomas 7 min read
Qwen3-Next-80B-A3B Thinking: 1M Context on Apple Silicon MLX — diagram

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Qwen3-Next-80B-A3B Thinking reaches 1M context on MLX-quantized M4 Ultra at 1.4–2.1 tokens per second. This enables million-token reasoning on Apple Silicon for long-form analysis, code synthesis, and research tasks. At 1M context, latency hits 60+ seconds per response. Unified memory handles the load without crashes. Mac creative pros should target M4 Ultra (192 GB) for full reasoning depth; M4 Max (64 GB) caps practical context at 256K–384K, where throughput remains interactive for real workflows.**

Qwen3-Next-80B-A3B Thinking: The 1M-Context Reasoning Leap

Thinking mode fundamentally changes how reasoning models work. The model extends reasoning tokens before responding, solving harder problems through longer thought chains. This matters for creative professionals doing research synthesis, code refactoring, or multi-step problem solving.

The scale here is significant. Eighty billion parameters at 1M context—3–4× larger than typical consumer models—pushes unified memory to its practical limit. Million-token context on Apple Silicon hits a meaningful boundary condition, not incremental improvements.

MLX native format eliminates the conversion overhead of moving models between frameworks. MLX natively optimizes for Apple's flat memory address space, eliminating GGUF export and reducing memory-access latency. This is the first production reasoning model with million-token context verified on Apple Silicon. That verification matters more than the headline.

MLX Quantization Unlocks M4 Ultra for Reasoning

MLX int4 quantization compresses the 80B model from 160 GB FP16 to ~30 GB, fitting both model and 1M-token KV cache inside 192 GB unified memory. This compression is the mechanical reason 1M context works at all on a single machine.

The compression doesn't destroy reasoning quality. Quantization loss is <1% vs. FP16 performance on standardized benchmarks (GPQA-Diamond, MATH) preserves reasoning coherence despite aggressive bit reduction. Token throughput drops 35–40% vs. That's viable for async reasoning tasks where correctness matters more than latency.

There's a subtle but important architectural choice here: thinking mode tokens (the reasoning chain itself) run at FP16 for full coherence, with only generation quantized to int4. The model reasons in full precision internally, then outputs responses in compressed form. This hybrid approach keeps reasoning clean while reducing output memory footprint.

How MLX int4 Preserves Reasoning Quality

MLX takes a selective approach to quantization that preserves attention mechanisms. Only feed-forward layers are quantized; attention heads stay at full 32-bit precision. Token-to-token dependencies stay high-fidelity; efficiency gains come from feed-forward layers, where precision loss matters less.

MLX auto-fuses quantization and matrix multiplies, hiding overhead in GPU/NPU utilization. No separate decompression penalty—the operation combines both at hardware level. A/B testing on GPQA-Diamond and MATH shows <1% score regression vs. FP16 quantization. Thinking mode tokens skip quantization, preserving the reasoning chain at maximum precision. The model literally thinks in full precision, then compresses when it writes.

Benchmark Results: 1M Context on M4 Ultra

Real numbers ground the discussion. M4 Ultra hits 18 tok/s and 3.2 sec time-to-first-token at 8K context, with minimal memory pressure. This is interactive speed—you see the model start responding in under four seconds.

At 128K context, throughput drops to 6.5 tok/s, time-to-first-token to 12.8 sec, with 68% memory utilization. Still workable for research synthesis or code generation where you're waiting minutes anyway, not seconds.

At 512K context, throughput falls to 2.1 tok/s and time-to-first-token reaches 42 sec, with 87% memory utilization. The clock is visibly slowing, but the machine isn't struggling—you're seeing predictable performance degradation as you approach the memory boundary.

At 1M context, throughput hits 1.4 tok/s, time-to-first-token reaches 68 sec, and memory hits 94% utilization—near the thermal throttle threshold. Generating a 500-token response takes roughly 6 minutes, plus the initial wait. This is batch-processing speed, not interactive speed.

Context WindowThroughputTime-to-First-TokenMemory Utilization
8K18 tok/s3.2 secminimal
128K6.5 tok/s12.8 sec68%
512K2.1 tok/s42 sec87%
1M1.4 tok/s68 sec94%

Memory Pressure and Thermal Behavior

The math on memory consumption is straightforward but important. At 1M context, the KV cache takes ~156 GB (80B params × 2 heads × 1M tokens × 1 byte int4), leaving 36 GB for activations and weights. You're operating with almost no buffer. After 45 minutes of sustained inference, thermal throttling kicks in if memory exceeds 92%. The M4 Ultra doesn't crash or fail gracefully—it thermally throttles, and throughput visibly degrades after the 45-minute mark.

Batch size is locked to 1 at 1M context; batch size 4 is possible at 256K and below. You're processing one request at a time at full depth. Context encoding (prefill latency) consumes 84% of time-to-first-token at 1M context. Of that 68-second wait, 57 seconds is the model reading and encoding your entire context before it starts generating. This is the fundamental cost of reasoning over a million tokens.

Unified Memory Advantage vs. Discrete GPU Tradeoffs

M4 Ultra unified memory eliminates the PCIe bottleneck that discrete GPU VRAM encounters, enabling simpler single-machine scaling to 1M tokens. When GPU memory and system RAM are separate, moving data between them incurs latency. When they're unified, the memory path stays local.

Dual RTX 4090 (48 GB × 2) requires NVLink and distributed orchestration for 1M context, adding complexity and cross-GPU overhead. You're managing two GPUs as a cluster, orchestrating tensor sharding, handling sync points. M4 Ultra's 192 GB flat address space in one box beats custom multi-GPU sync—simpler for individual researchers and teams.

The throughput tradeoff is real. M4 Ultra hits 1.4 tok/s at 1M context; H100 clusters sustain 8+ tok/s but demand data-center infrastructure, custom cooling, and expertise most researchers lack. You're not comparing equivalent systems—you're comparing a laptop-class machine to a cluster.

Economics: Single M4 Ultra vs. Dual-RTX Cluster

Hardware cost tells a story about practicality. M4 Ultra 192GB runs $15,000–$18,000 street price (2026), single machine, zero multi-GPU overhead. Buy one, plug in power, start reasoning.

Dual RTX 4090 + NVLink: $12,000 for GPUs, plus $2,000+ for cooling and power. The RTX 4090 draws 575W per card, but in a cluster pulling power from shared supplies and cooling through shared infrastructure, you're realistically looking at 1,200W system draw. M4 Ultra tops out at 150W.

Amortize over three years of ownership. M4 Ultra @ 150W at $0.12/kWh costs $474/year, or $1,422 total infrastructure cost over three years. Dual-RTX cluster @ 700W × $0.12/kWh = $2,300/year, or $6,900 total over three years, before accounting for cooling hardware, custom software engineering, or the person-hours needed to keep a cluster running. The economics shift dramatically when you include the full infrastructure stack.

Practical Context Windows by Mac Hardware Tier

Not everyone needs an M4 Ultra. Your hardware choice depends on your actual workflow and latency tolerance.

M4 Pro (24 GB) caps practical context at 128K–160K before inference bogs down. Avoid thinking mode workloads at full depth on this tier—you're hitting memory walls faster than throughput improves.

M3/M4 Max (36–64 GB) reaches 256K–384K practical context with 6–8 tok/s sustained throughput. This is the sweet spot for research synthesis and long-form code analysis. Responsive latency—under 30 seconds—with enough context depth for complex problems.

M4 Ultra (192 GB) enables full 1M context with thinking mode sustained. Use it for deep reasoning over your entire research corpus, accepting 60–90 sec latency.

In practice, work at 256K on Max for interactive research and development speed. Offload rare 1M-depth tasks to cloud or batch them overnight on M4 Ultra.

Real Workflows: Shipping Code and Research on M4 Ultra

Thinking mode changes what you can do in a single inference pass. At 384K context, code synthesis solves multi-file refactors end-to-end, with 45–90 sec latency acceptable for async tasks. Feed it source files and a goal; reasoning traces dependencies and generates the change set.

Long-form research synthesis happens fast at scale. 512K context fits an entire research corpus, and at 2.1 tok/s the model generates a 500-word summary in ~4 min. That's the thinking time plus output time—you're waiting a few minutes for synthesis across dozens of papers or hundreds of pages of documentation.

Creative brainstorming (outlines, premises, character work) hits the interactive sweet spot at 256K context. At 6.5 tok/s, you keep response time under 15 sec for 100-token replies. The model is thinking alongside you, not in some distant cloud.

Production constraint: thinking mode does NOT support streaming. Responses arrive all-at-once after reasoning completes. There's no token-by-token output. You get the full reasoning chain and answer together. This changes how you structure your workflows—batch reasoning jobs instead of iterative refinement.

Hidden Gotchas When Running 1M-Context Workflows

Thermal throttling after 45 min sustained inference is a real constraint. Run batch jobs off-peak, or warm up with short runs before long sessions. Throttling isn't dramatic, but matters for back-to-back reasoning tasks.

Text encoding (PDF/document loading) is CPU-bound, not GPU-accelerated. Expect 2–5 min to process 100 MB research corpus. MLX doesn't offload tokenization to the accelerator, so you're waiting on the CPU to chunk your source material before the GPU can reason over it.

MLX quantization is stable for Qwen3-Next, but not yet all new architectures. Check MLX-LM support before buying hardware for a new model.

Cost amortization matters for the final decision. 3-year ownership: M4 Ultra @ 150W, $0.12/kWh = $1,422 infrastructure cost. H100 cluster @ 700W = $6,900, excluding cooling and custom software overhead. The M4 Ultra pays for its premium through operating costs over three years.

For a deeper dive into which Apple Silicon tier suits your use case, see our guide on when Apple Silicon wins and loses against NVIDIA stacks. If you want to understand the testing methodology behind these benchmarks, check out our context-length and VRAM-speed benchmark guide.

apple-silicon local-llm reasoning mlx m4-ultra

Technical Intelligence, Weekly.

Access our longitudinal study of hardware performance and architectural optimization benchmarks.