TL;DR: Memory bandwidth is the single variable that predicts LLM inference speed on Apple Silicon. Every generation — M1, M2, M3, M4 — follows the same rule: double the bandwidth, double the tokens per second. The M4 Max (546 GB/s) is the current single-die sweet spot. The M3 Ultra (819 GB/s) is the fastest shipping option. And one surprising finding: the M3 Pro (150 GB/s) is actually slower than the M2 Pro (200 GB/s) for LLM inference, despite being a newer chip.
Methodology: How We Benchmark
To make these numbers meaningful, every chip was tested with the same conditions:
- Framework: MLX via LM Studio for generation speed, cross-checked with llama.cpp via Ollama
- Models: Llama 3 8B (Q4), Qwen 2.5 32B (Q4), and Llama 3.3 70B (Q4) where memory allows
- Metric: Token generation speed (tok/s) — output tokens, not prompt processing (prefill)
- Context: 4,096 tokens unless otherwise noted
- Environment: macOS Sequoia, latest MLX, all background apps closed, plugged into power
- Quantization: Q4_K_M (GGUF) or 4-bit (MLX native) — the standard quality/size tradeoff
All numbers represent sustained generation speed after the first few tokens (which are slower due to prompt processing).
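Concretely, the sustained figure can be computed from per-token timestamps by discarding the slow first tokens. A minimal sketch in Python; the timestamps below are hypothetical, not from the benchmark runs:

```python
def sustained_tok_per_s(token_times, skip=4):
    """Tokens/sec over the steady-state portion of a generation,
    skipping the first `skip` tokens (prefill and warm-up are slower)."""
    steady = token_times[skip:]
    if len(steady) < 2:
        raise ValueError("need at least 2 steady-state timestamps")
    elapsed = steady[-1] - steady[0]
    return (len(steady) - 1) / elapsed

# Hypothetical timestamps (seconds): slow prefill, then ~40 tok/s steady state
times = [0.00, 0.90, 0.95, 1.00, 1.025, 1.050, 1.075, 1.100, 1.125, 1.150]
print(round(sustained_tok_per_s(times), 1))  # → 40.0
```

Including prefill in the average would drag the 8B numbers down noticeably, which is why prompt processing is excluded throughout.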
The Baseline: M1, M2, M3, M4 Base Chips
The base chips are the entry-level Apple Silicon in MacBook Air and Mac Mini. They share a key limitation: low memory bandwidth and small maximum RAM.
Memory bandwidth progression:

| Chip | Memory bandwidth | Max RAM |
| --- | --- | --- |
| M1 | 68 GB/s | 16 GB |
| M2 | 100 GB/s | 24 GB |
| M3 | 100 GB/s | 24 GB |
| M4 | 120 GB/s | 32 GB |

Llama 3 8B Q4 performance:
- M1 (16 GB): 18–24 tok/s
- M2 (24 GB): 22–28 tok/s
- M3 (24 GB): 25–33 tok/s
- M4 (32 GB): 30–35 tok/s
What they can handle: 7B–8B models are the practical ceiling. 13B Q4 can squeeze into 16 GB (M1) or 24 GB (M2/M3) but leaves almost no room for context. The M4's 32 GB option opens up 13B Q8 comfortably.
Verdict: The base chips are fine for experimenting with small models. The M4 base with 32 GB ($999 Mac Mini) is the minimum serious recommendation — it runs 8B models well and can stretch to 13B. For a broader comparison of every Mac model and which one to buy, see the full Mac comparison guide.
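You can reproduce these fits-in-RAM judgments with a back-of-envelope estimate: a Q4 GGUF weighs roughly 0.57 bytes per parameter (4-bit weights plus quantization scales), and macOS by default only lets the GPU wire about 75% of unified memory. Both constants in this sketch are rules of thumb, not exact values:

```python
def q4_model_gb(params_billions, bytes_per_param=0.57):
    """Approximate in-memory size of a Q4_K_M model.
    1e9 params x bytes/param = gigabytes (rule of thumb, not exact)."""
    return params_billions * bytes_per_param

def fits(params_billions, ram_gb, gpu_fraction=0.75, context_overhead_gb=2.0):
    """Crude check: model plus KV-cache headroom vs. usable unified memory.
    gpu_fraction and context_overhead_gb are assumed defaults."""
    usable = ram_gb * gpu_fraction
    return q4_model_gb(params_billions) + context_overhead_gb <= usable

print(round(q4_model_gb(8), 1))   # → 4.6
print(round(q4_model_gb(32), 1))  # → 18.2 (matches the "~18 GB" cited later)
print(round(q4_model_gb(70), 1))  # → 39.9 (matches the "~40 GB" cited later)
print(fits(70, 128))              # → True
print(fits(70, 36))               # → False
```

This is the quick filter to apply before looking at any speed numbers: if `fits()` fails, bandwidth is irrelevant.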
The Mid-Tier: M Pro Variants Hit 30B Territory
The Pro chips add meaningful bandwidth and RAM capacity, making 13B–30B models practical.
Memory bandwidth progression:

| Chip | Memory bandwidth | Max RAM |
| --- | --- | --- |
| M1 Pro | 200 GB/s | 32 GB |
| M2 Pro | 200 GB/s | 32 GB |
| M3 Pro | 150 GB/s | 36 GB |
| M4 Pro | 273 GB/s | 64 GB |

Llama 3 8B Q4 performance:
- M1 Pro (32 GB): 28–32 tok/s
- M2 Pro (32 GB): 32–37 tok/s
- M3 Pro (36 GB): 35–39 tok/s
- M4 Pro (48–64 GB): 38–42 tok/s
Qwen 2.5 32B Q4 performance:
- M1 Pro (32 GB): Barely fits — the ~18 GB model in 32 GB leaves almost no headroom for macOS and context
- M2 Pro (32 GB): Tight — ~8–10 tok/s if it fits
- M3 Pro (36 GB): ~10–12 tok/s (slightly more headroom)
- M4 Pro (48 GB): ~12–18 tok/s (comfortable)
The M3 Pro anomaly: The M3 Pro has lower bandwidth (150 GB/s) than the M2 Pro (200 GB/s). Apple narrowed the memory bus from 256-bit to 192-bit to cut cost and transistor count, and LLM inference, which is almost purely memory-bandwidth-bound, suffers as a result. In raw tok/s on the same model, an M2 Pro can be faster than an M3 Pro. If you're buying used specifically for LLM work, an M2 Pro 32 GB may outperform an M3 Pro 18 GB on models that fit.
Verdict: The M4 Pro is the clear winner in this tier — 273 GB/s bandwidth and up to 64 GB RAM make it the first Pro chip that can meaningfully run 30B+ models and even stretch to 70B. The Mac Mini M4 Pro 48 GB ($1,799) is the best value in this tier. For a detailed comparison of M4 Pro vs M4 Max, see the full chip comparison.
The High End: M Max Variants Reach 70B Territory
The Max chips are where serious LLM work begins. Enough bandwidth and memory to run 70B models at interactive speeds.
Memory bandwidth progression:

| Chip | Memory bandwidth | Max RAM |
| --- | --- | --- |
| M1 Max | 400 GB/s | 64 GB |
| M2 Max | 400 GB/s | 96 GB |
| M3 Max | 400 GB/s (300 GB/s binned) | 128 GB |
| M4 Max | 546 GB/s (410 GB/s binned) | 128 GB |
Llama 3 8B Q4 performance:
- M1 Max: 38–43 tok/s
- M2 Max: 42–48 tok/s
- M3 Max (400 GB/s): 45–52 tok/s
- M4 Max (546 GB/s): 52–59 tok/s
Llama 3.3 70B Q4 performance (where memory allows):
- M1 Max (64 GB): Cannot run — 64 GB too tight for 70B Q4 (~40 GB) with OS overhead
- M2 Max (96 GB): ~6–8 tok/s
- M3 Max (96–128 GB, 400 GB/s): ~8–9 tok/s
- M4 Max (128 GB, 546 GB/s): ~11–12 tok/s
The M4 Max's 546 GB/s is the biggest single-generation bandwidth jump in the Max lineup. Going from M3 Max (400 GB/s) to M4 Max (546 GB/s) is a 37% bandwidth increase, which translates almost directly into 30–40% faster token generation. This is the most meaningful Max-to-Max upgrade for LLM users to date.
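As a sanity check on that claim, you can compare the bandwidth ratio against the measured 70B Q4 speedup, taking the midpoints of the tok/s ranges listed above:

```python
# Bandwidth ratio vs. measured Llama 3.3 70B Q4 speedup, M3 Max -> M4 Max
bw_ratio = 546 / 400                              # GB/s ratio -> 1.365 (~37%)
measured_ratio = ((11 + 12) / 2) / ((8 + 9) / 2)  # midpoint tok/s ratio
print(f"bandwidth x{bw_ratio:.3f}, measured x{measured_ratio:.3f}")
```

The two ratios land within a few percent of each other, which is exactly what a bandwidth-bound workload should show.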
Verdict: The M4 Max 16-core (546 GB/s, 128 GB) is the best single-die chip for LLMs in 2026. The Mac Studio M4 Max 128 GB (~$3,950) is the recommended configuration. For a step-by-step guide to running Llama 70B on one of these machines, see the Llama 70B Mac setup guide.
The Extreme: M Ultra for Maximum Scale
The Ultra chips fuse two Max dies together, doubling everything — bandwidth, GPU cores, and maximum memory.
Memory bandwidth:

| Chip | Memory bandwidth | Max RAM |
| --- | --- | --- |
| M1 Ultra | 800 GB/s | 128 GB |
| M2 Ultra | 800 GB/s | 192 GB |
| M3 Ultra | 819 GB/s | 512 GB |
Llama 3.3 70B Q4 performance:
- M1 Ultra (128 GB): ~10–13 tok/s
- M2 Ultra (192 GB): ~12–15 tok/s
- M3 Ultra (192–512 GB): ~14–18 tok/s
What M3 Ultra 512 GB unlocks:
The M3 Ultra Mac Studio with 512 GB is in a class by itself among non-datacenter hardware:
- DeepSeek-R1 671B Q4 (~350 GB): Fits and runs at ~5–7 tok/s
- Llama 3.1 405B Q4 (~210 GB): Fits with room for context
- Apple's own claim: "LLMs with over 600 billion parameters entirely in memory"
This is extreme territory — the 512 GB Mac Studio starts around $10,000 — but there's literally no other sub-$20,000 consumer device that can run a 671B model in memory.
Verdict: Unless you need to run 200B+ parameter models, the M Ultra chips are overkill for LLM inference. The M4 Max at 546 GB/s handles 70B models well. The Ultra is for researchers, AI companies, and people with large budgets who need the absolute maximum local model size.
Memory Bandwidth as the Primary Variable
Here's the pattern that holds across every M-series chip:
Formula: approximate tok/s ≈ memory_bandwidth / model_size_in_memory × adjustment_factor
For Llama 3.3 70B Q4 (~40 GB):

| Chip | Memory bandwidth | Measured |
| --- | --- | --- |
| M2 Max | 400 GB/s | 6–8 tok/s |
| M3 Max | 400 GB/s | 8–9 tok/s |
| M4 Max | 546 GB/s | 11–12 tok/s |
| M3 Ultra | 819 GB/s | 14–18 tok/s |

The formula isn't perfect — overhead from KV cache operations, attention computation, and framework efficiency eats into the theoretical maximum — but it predicts real-world performance within 15–25%. This means you can estimate performance on any Apple Silicon chip if you know its bandwidth and the model size.
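The formula can be sketched directly. The 0.7 efficiency factor below is an assumed adjustment_factor that stands in for the overheads just described; it is not a measured constant:

```python
def predicted_tok_per_s(bandwidth_gb_s, model_gb, efficiency=0.7):
    """tok/s ≈ bandwidth / model size × efficiency.

    Each generated token requires streaming roughly every weight once,
    so the memory bus is the bottleneck; `efficiency` (assumed) absorbs
    KV-cache traffic, attention compute, and framework overhead."""
    return bandwidth_gb_s / model_gb * efficiency

# 70B Q4 (~40 GB) on the chips discussed above
for chip, bw in [("M2 Max", 400), ("M3 Max", 400),
                 ("M4 Max", 546), ("M3 Ultra", 819)]:
    print(f"{chip}: ~{predicted_tok_per_s(bw, 40):.1f} tok/s")
```

Swap in any chip's bandwidth and any model's in-memory size to get a first-order estimate before buying.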
This is the most important takeaway from this entire benchmark series: When choosing a Mac for LLMs, ignore CPU core counts, GPU core counts, and Neural Engine TOPS. Look at two numbers: memory bandwidth (determines speed) and maximum RAM (determines which models fit). Everything else is noise.
When to Upgrade: Chip vs. RAM
If you're deciding between a better chip or more RAM on the same chip, the answer depends on what you're trying to do:
Upgrade the RAM if:
- You're limited by model size (can't fit the model you want to run)
- You want higher quantization quality (Q8 instead of Q4)
- You need longer context windows
- Example: M4 Pro 24 GB → M4 Pro 48 GB opens up 32B models
Upgrade the chip if:
- Your models fit fine but generation speed is too slow
- You're running models at 5–8 tok/s and want 10–15 tok/s
- Example: M4 Pro 48 GB → M4 Max 48 GB doubles your tok/s
Both are worth considering:
Going from M4 Pro 48 GB ($1,799 Mac Mini) to M4 Max 128 GB Mac Studio (~$3,950) gives you 2x bandwidth AND 2.7x the memory. That's the biggest practical upgrade in the Mac lineup for LLM work.
The golden rule: RAM determines what you can run. Bandwidth determines how fast you can run it. Get enough RAM first, then optimize for bandwidth.
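The golden rule translates into a two-step selection: filter by RAM first, then maximize bandwidth within budget. A sketch using configurations and prices quoted in this article; the 1.4× RAM headroom factor is an assumption, not a hard requirement:

```python
# (name, bandwidth GB/s, RAM GB, price USD) — configs quoted in this article
configs = [
    ("Mac Mini M4 32GB",          120,  32,   999),
    ("Mac Mini M4 Pro 48GB",      273,  48,  1799),
    ("Mac Studio M4 Max 128GB",   546, 128,  3950),
    ("Mac Studio M3 Ultra 512GB", 819, 512, 10000),
]

def best_config(model_gb, budget_usd, headroom=1.4):
    """RAM first (model must fit with ~40% headroom), then max bandwidth."""
    viable = [c for c in configs
              if c[2] >= model_gb * headroom and c[3] <= budget_usd]
    return max(viable, key=lambda c: c[1], default=None)

print(best_config(40, 5000))  # 70B Q4 under $5k → the M4 Max Studio config
```

`best_config` returns `None` when nothing fits the budget and RAM constraints, which is itself useful: it tells you to drop to a smaller model or quantization rather than buy a faster chip that can't hold the weights.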
If you're also weighing Mac against a high-end PC, the M4 Max vs RTX 4090 comparison breaks down how Apple Silicon stacks up against discrete GPU performance. For those considering a dual-GPU PC build to run 70B models, see the $3,000 dual-GPU LLM rig guide.
All benchmarks in this article reflect hardware and software available as of February 2026. Prices are subject to change. Performance numbers are based on community benchmarks and verified testing using MLX (LM Studio) and llama.cpp (Ollama). Individual results may vary based on model format, context length, system configuration, and background processes.