TL;DR: DeepSeek R1's distilled variants (7B through 70B) run on standard consumer hardware. The full 671B MoE model is a beast that needs 300GB+ of memory. For most people, the 32B distill on an RTX 3090/4090 is the sweet spot — strong reasoning at a reasonable hardware cost.
The DeepSeek R1 Family
DeepSeek R1 isn't one model. It's a family with distilled variants at different sizes, plus the full MoE (Mixture of Experts) monster. Each hits a different hardware tier:
- DeepSeek R1 1.5B — distilled from Qwen 2.5 Math 1.5B; tiny, runs on anything
- DeepSeek R1 7B — distilled from Qwen 2.5 Math 7B
- DeepSeek R1 14B — distilled from Qwen 2.5 14B
- DeepSeek R1 32B — distilled from Qwen 2.5 32B
- DeepSeek R1 70B — distilled from Llama 3.3 70B
- DeepSeek R1 671B — the full MoE model, 671 billion parameters
The distilled versions are dense models (not MoE), which means they behave like any other model of their size. The 671B is MoE with 256 routed experts, 8 active per token (roughly 37B active parameters) — a fundamentally different beast.
VRAM Requirements Per Variant
All numbers below use GGUF format at Q4_K_M quantization, which gives the best quality-to-size ratio for most users. Benchmarks as of March 2026.
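As a sanity check on these figures, a GGUF file is roughly parameters times effective bits per weight. A quick sketch — the bits-per-weight values are rough assumptions, since K-quants mix precisions per tensor:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameters x effective bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

# Effective bits per weight (assumed averages, not exact per-tensor values)
BPW = {"Q4_K_M": 4.8, "Q8_0": 8.5}

for quant, bpw in BPW.items():
    print(f"32B at {quant}: ~{gguf_size_gb(32, bpw):.0f} GB")
```

Plugging in 32B gives roughly 19GB at Q4_K_M and 34GB at Q8_0, which lines up with the numbers below.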
DeepSeek R1 1.5B
- Q4_K_M: ~1.2GB
- Runs on: literally anything — integrated graphics, Raspberry Pi, your phone
- Speed: 80+ t/s on any modern GPU
- Verdict: Good for testing and lightweight tasks. Not for serious reasoning work.
DeepSeek R1 7B
- Q4_K_M: ~4.5GB VRAM
- Q8_0: ~8GB
- Runs on: Any 8GB GPU — RTX 3060, RTX 4060, even older GTX cards
- Speed: 40-60 t/s on RTX 4060, 25-35 t/s on RTX 3060
- Verdict: Solid entry point for R1-style reasoning on budget hardware. Punches above its weight class thanks to distillation from the full 671B model's reasoning traces.
DeepSeek R1 14B
- Q4_K_M: ~9GB
- Q8_0: ~15GB
- Runs on: 12-16GB GPU — RTX 3060 12GB, RTX 4060 Ti 16GB
- Speed: 30-45 t/s on RTX 4060 Ti 16GB
- Verdict: The quality jump from 7B to 14B is noticeable, especially for math and code reasoning. If you have 16GB of VRAM, this is a no-brainer over the 7B.
DeepSeek R1 32B
- Q4_K_M: ~20GB
- Q8_0: ~34GB
- Runs on: 24GB GPU — RTX 3090, RTX 4090
- Speed: 20-30 t/s on RTX 4090, 12-18 t/s on RTX 3090
- Verdict: This is the one most people should run. The 32B distill retains most of the full R1's reasoning ability and fits on a single 24GB card at Q4. Best bang for your buck in the entire R1 lineup.
DeepSeek R1 70B
- Q4_K_M: ~42GB
- Q8_0: ~74GB
- Runs on: 2x 24GB GPUs, or Mac with 64GB+ unified memory
- Speed: 15-25 t/s on 2x RTX 4090, 8-12 t/s on Mac M4 Max 64GB
- Verdict: Meaningfully better than 32B for complex multi-step reasoning, but the hardware cost doubles. Worth it if you're doing serious research or math work. For everyone else, 32B is close enough.
For multi-GPU builds, see our $3,000 dual-GPU LLM rig guide.
DeepSeek R1 671B (Full MoE)
- Q4_K_M: ~350GB
- Q2_K: ~200GB
- Runs on: Mac Studio M4 Ultra 192GB (only with dynamic quants pushed below Q2 — the ~200GB Q2_K doesn't fit in 192GB), server-grade multi-GPU rigs, or heavy CPU offloading with 512GB+ system RAM
- Speed: 2-5 t/s on Mac Studio M4 Ultra 192GB at sub-Q2 quants, faster on multi-GPU server rigs
- Verdict: Not practical for consumer hardware at any useful quality. If you need the full 671B, use the API. Running it locally is a flex, not a workflow.
Which Variant Should You Actually Run?
Here's the decision tree:
8GB VRAM or less: Run the 7B. It's fast, capable, and fits easily.
12-16GB VRAM: Run the 14B at Q4. Clear upgrade from 7B and you won't need to offload anything.
24GB VRAM (RTX 3090/4090): Run the 32B at Q4_K_M. This is the sweet spot for the entire R1 family on consumer hardware. You get 80-90% of the full model's reasoning ability in a package that fits on one card.
48GB+ (Mac or dual-GPU): Run the 70B. The quality uplift over 32B justifies the hardware if you already own it. Don't buy 48GB+ hardware specifically for R1 70B unless reasoning quality is mission-critical.
128GB+ Mac or server: You can technically run the 671B at aggressive quantization. But honestly, the 70B distill gives you 90% of the quality at a fraction of the resource cost.
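The tree above collapses to a few lines of logic. The thresholds are this guide's Q4_K_M tiers, nothing official:

```python
def pick_r1_variant(vram_gb: float) -> str:
    """Recommended R1 distill at Q4_K_M for a given amount of VRAM
    or unified memory, following the tiers above."""
    if vram_gb < 12:
        return "7B"
    if vram_gb < 24:
        return "14B"
    if vram_gb < 48:
        return "32B"
    return "70B"  # 671B stays API territory regardless

print(pick_r1_variant(24))  # RTX 3090/4090 -> "32B"
```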
Best Quantization for Each Tier
Not all quants are created equal. Here's what to pick:
- If it fits at Q8_0 in your VRAM: Run Q8. Near-lossless, and you won't leave any model quality on the table.
- If Q8 is too tight: Q5_K_M is the next step down. Almost no perceptible quality loss for reasoning tasks.
- Standard recommendation: Q4_K_M. This is where most people land. It's the community default for a reason — good quality, manageable size.
- Tight on VRAM: Q3_K_M works but you'll notice degradation on math-heavy prompts. Acceptable for chat and general coding.
- Don't go below Q3 unless you're just experimenting. Q2 quants lose too much of R1's reasoning precision — the whole point of running this model.
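The ladder above amounts to "highest quant whose weights fit with some headroom." A sketch using the same rough bits-per-weight estimates; the 2GB headroom for KV cache and activations is an assumption, adjust for your context length:

```python
# Quant ladder from best quality to smallest, with assumed effective bits per weight
QUANT_LADDER = [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9)]

def best_quant(params_billion: float, vram_gb: float, headroom_gb: float = 2.0):
    """Return the highest quant whose weights fit alongside headroom for
    KV cache and activations; None if even Q3_K_M won't fit."""
    for name, bpw in QUANT_LADDER:
        if params_billion * bpw / 8 + headroom_gb <= vram_gb:
            return name
    return None

print(best_quant(32, 24))  # 24GB card -> Q4_K_M, as recommended above
```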
Context Length Matters
DeepSeek R1 supports 128K context, but longer context eats more VRAM for the KV cache (key-value cache — the memory the model uses to track your conversation). At 32B Q4_K_M:
- 4K context: ~20GB VRAM
- 8K context: ~21GB VRAM
- 32K context: ~24GB VRAM (maxing out a 24GB card)
- 128K context: won't fit on 24GB
If you need long context on the 32B, either drop to Q3 or use a Mac with 48GB+. For most chat and coding tasks, 8K context is plenty.
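The KV cache itself is easy to estimate: two tensors (K and V) per layer, each kv_heads x head_dim per token. A sketch assuming Qwen 2.5 32B-style geometry for the 32B distill — 64 layers, 8 KV heads under GQA, head dimension 128; these architectural numbers are assumptions, check the base model's config:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) x layers x kv_heads x head_dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1e9

# fp16 cache at 32K context; an 8-bit KV cache roughly halves this
print(f"{kv_cache_gb(64, 8, 128, 32_768):.1f} GB")
```

At fp16 this comes out near 8.6GB for 32K tokens; the ~4GB delta in the table above is roughly consistent with an 8-bit KV cache.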
The Apple Silicon Angle
DeepSeek R1's distilled models run exceptionally well on Apple Silicon thanks to unified memory and llama.cpp optimizations. The M4 family is particularly strong:
- Mac Mini M4 (16GB): Runs 7B at Q8 or 14B at Q4 — 20-30 t/s
- Mac Mini M4 Pro (48GB): Runs 32B at Q8 or 70B at Q3 — 15-25 t/s
- MacBook Pro M4 Max (48-128GB): Runs 32B at Q8 easily, 70B at Q5 on the 64GB model — 20-35 t/s thanks to higher bandwidth
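Those Mac numbers line up with a simple ceiling: decode is memory-bandwidth bound, since every generated token streams the full set of weights once. The 546 GB/s figure for the M4 Max is an assumption here; check your machine's spec:

```python
def decode_tps_bound(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper bound on decode speed: each token reads all weights once."""
    return bandwidth_gbps / weights_gb

# M4 Max (~546 GB/s, assumed) streaming a ~20GB 32B Q4_K_M model
print(f"{decode_tps_bound(546, 20):.0f} t/s ceiling")
```

That gives a ceiling around 27 t/s, consistent with the 20-35 t/s range above once overhead is factored in.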
Check our Apple Silicon LLM benchmarks for exact numbers.
Bottom Line
- Best for most people: DeepSeek R1 32B on an RTX 3090 or RTX 4090 at Q4_K_M
- Budget pick: DeepSeek R1 14B on a 16GB GPU
- Mac pick: DeepSeek R1 32B at Q8 on 48GB unified memory
- Skip: The 671B on consumer hardware — use the API instead
DeepSeek R1's distilled models are some of the best reasoning models you can run locally. The 32B variant in particular hits a rare combination of quality and accessibility. If you're building a local AI rig and care about reasoning, this is the model to optimize for.
For help choosing the right GPU, see our complete VRAM guide and best GPUs for local LLMs.