TL;DR: DeepSeek R1's distilled variants (7B through 70B) run on standard consumer hardware. The full 671B MoE model is a beast that needs 300GB+ of memory. For most people, the 32B distill on an RTX 3090/4090 is the sweet spot — strong reasoning at a reasonable hardware cost.
The DeepSeek R1 Family
DeepSeek R1 isn't one model. It's a family with distilled variants at different sizes, plus the full MoE (Mixture of Experts) monster. Each hits a different hardware tier:
- DeepSeek R1 1.5B — distilled from Qwen 2.5 Math 1.5B; tiny, runs on anything
- DeepSeek R1 7B — distilled from Qwen 2.5 Math 7B
- DeepSeek R1 14B — distilled from Qwen 2.5 14B
- DeepSeek R1 32B — distilled from Qwen 2.5 32B
- DeepSeek R1 70B — distilled from Llama 3.3 70B
- DeepSeek R1 671B — the full MoE model, 671 billion parameters
The distilled versions are dense models (not MoE), which means they behave like any other model of their size. The 671B is MoE with 256 routed experts, 8 active per token (roughly 37B active parameters) — a fundamentally different beast.
VRAM Requirements Per Variant
All numbers below use GGUF format at Q4_K_M quantization, which gives the best quality-to-size ratio for most users. Benchmarks as of March 2026.
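As a sanity check on these figures, a GGUF file is roughly parameters times effective bits per weight. A quick sketch — the bits-per-weight values are rough assumptions, since K-quants mix precisions per tensor:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameters x effective bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

# Effective bits per weight (assumed averages, not exact per-tensor values)
BPW = {"Q4_K_M": 4.8, "Q8_0": 8.5}

for quant, bpw in BPW.items():
    print(f"32B at {quant}: ~{gguf_size_gb(32, bpw):.0f} GB")
```

Plugging in 32B gives roughly 19GB at Q4_K_M and 34GB at Q8_0, which lines up with the numbers below.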
DeepSeek R1 1.5B
- Q4_K_M: ~1.2GB
- Runs on: literally anything — integrated graphics, Raspberry Pi, your phone
- Speed: 80+ t/s on any modern GPU
- Verdict: Good for testing and lightweight tasks. Not for serious reasoning work.
DeepSeek R1 7B
- Q4_K_M: ~4.5GB VRAM
- Q8_0: ~8GB
- Runs on: Any 8GB GPU — RTX 3060, RTX 4060, even older GTX cards
- Speed: 40-60 t/s on RTX 4060, 25-35 t/s on RTX 3060
- Verdict: Solid entry point for R1-style reasoning on budget hardware. Punches above its weight class thanks to distillation from the full 671B model's reasoning traces.
DeepSeek R1 14B
- Q4_K_M: ~9GB
- Q8_0: ~15GB
- Runs on: 12-16GB GPU — RTX 3060 12GB, RTX 4060 Ti 16GB
- Speed: 30-45 t/s on RTX 4060 Ti 16GB
- Verdict: The quality jump from 7B to 14B is noticeable, especially for math and code reasoning. If you have 16GB of VRAM, this is a no-brainer over the 7B.
DeepSeek R1 32B
- Q4_K_M: ~20GB
- Q8_0: ~34GB
- Runs on: 24GB GPU — RTX 3090, RTX 4090
- Speed: 20-30 t/s on RTX 4090, 12-18 t/s on RTX 3090
- Verdict: This is the one most people should run. The 32B distill retains most of the full R1's reasoning ability and fits on a single 24GB card at Q4. Best bang for your buck in the entire R1 lineup.
DeepSeek R1 70B
- Q4_K_M: ~42GB
- Q8_0: ~74GB
- Runs on: 2x 24GB GPUs, or Mac with 64GB+ unified memory
- Speed: 15-25 t/s on 2x RTX 4090, 8-12 t/s on Mac M4 Max 64GB
- Verdict: Meaningfully better than 32B for complex multi-step reasoning, but the hardware cost doubles. Worth it if you're doing serious research or math work. For everyone else, 32B is close enough.
For multi-GPU builds, see our $3,000 dual-GPU LLM rig guide.
DeepSeek R1 671B (Full MoE)
- Q4_K_M: ~350GB
- Q2_K: ~200GB
- Runs on: Mac Studio M4 Ultra 192GB (only with dynamic quants pushed below Q2 — the ~200GB Q2_K doesn't fit in 192GB), server-grade multi-GPU rigs, or heavy CPU offloading with 512GB+ system RAM
- Speed: 2-5 t/s on Mac Studio M4 Ultra 192GB at sub-Q2 quants, faster on multi-GPU server rigs
- Verdict: Not practical for consumer hardware at any useful quality. If you need the full 671B, use the API. Running it locally is a flex, not a workflow.
Which Variant Should You Actually Run?
Here's the decision tree:
8GB VRAM or less: Run the 7B. It's fast, capable, and fits easily.
12-16GB VRAM: Run the 14B at Q4. Clear upgrade from 7B and you won't need to offload anything.
24GB VRAM (RTX 3090/4090): Run the 32B at Q4_K_M. This is the sweet spot for the entire R1 family on consumer hardware. You get 80-90% of the full model's reasoning ability in a package that fits on one card.
48GB+ (Mac or dual-GPU): Run the 70B. The quality uplift over 32B justifies the hardware if you already own it. Don't buy 48GB+ hardware specifically for R1 70B unless reasoning quality is mission-critical.
128GB+ Mac or server: You can technically run the 671B at aggressive quantization. But honestly, the 70B distill gives you 90% of the quality at a fraction of the resource cost.
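The tree above collapses to a few lines of logic. The thresholds are this guide's Q4_K_M tiers, nothing official:

```python
def pick_r1_variant(vram_gb: float) -> str:
    """Recommended R1 distill at Q4_K_M for a given amount of VRAM
    or unified memory, following the tiers above."""
    if vram_gb < 12:
        return "7B"
    if vram_gb < 24:
        return "14B"
    if vram_gb < 48:
        return "32B"
    return "70B"  # 671B stays API territory regardless

print(pick_r1_variant(24))  # RTX 3090/4090 -> "32B"
```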
Best Quantization for Each Tier
Not all quants are created equal. Here's what to pick:
- If it fits at Q8_0 in your VRAM: Run Q8. Near-lossless, and you won't leave any model quality on the table.
- If Q8 is too tight: Q5_K_M is the next step down. Almost no perceptible quality loss for reasoning tasks.
- Standard recommendation: Q4_K_M. This is where most people land. It's the community default for a reason — good quality, manageable size.
- Tight on VRAM: Q3_K_M works but you'll notice degradation on math-heavy prompts. Acceptable for chat and general coding.
- Don't go below Q3 unless you're just experimenting. Q2 quants lose too much of R1's reasoning precision — the whole point of running this model.
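The ladder above amounts to "highest quant whose weights fit with some headroom." A sketch using the same rough bits-per-weight estimates; the 2GB headroom for KV cache and activations is an assumption, adjust for your context length:

```python
# Quant ladder from best quality to smallest, with assumed effective bits per weight
QUANT_LADDER = [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9)]

def best_quant(params_billion: float, vram_gb: float, headroom_gb: float = 2.0):
    """Return the highest quant whose weights fit alongside headroom for
    KV cache and activations; None if even Q3_K_M won't fit."""
    for name, bpw in QUANT_LADDER:
        if params_billion * bpw / 8 + headroom_gb <= vram_gb:
            return name
    return None

print(best_quant(32, 24))  # 24GB card -> Q4_K_M, as recommended above
```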
Context Length Matters
DeepSeek R1 supports 128K context, but longer context eats more VRAM for the KV cache (key-value cache — the memory the model uses to track your conversation). At 32B Q4_K_M:
- 4K context: ~20GB VRAM
- 8K context: ~21GB VRAM
- 32K context: ~24GB VRAM (maxing out a 24GB card)
- 128K context: won't fit on 24GB
If you need long context on the 32B, either drop to Q3 or use a Mac with 48GB+. For most chat and coding tasks, 8K context is plenty.
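The KV cache itself is easy to estimate: two tensors (K and V) per layer, each kv_heads x head_dim per token. A sketch assuming Qwen 2.5 32B-style geometry for the 32B distill — 64 layers, 8 KV heads under GQA, head dimension 128; these architectural numbers are assumptions, check the base model's config:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) x layers x kv_heads x head_dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 1e9

# fp16 cache at 32K context; an 8-bit KV cache roughly halves this
print(f"{kv_cache_gb(64, 8, 128, 32_768):.1f} GB")
```

At fp16 this comes out near 8.6GB for 32K tokens; the ~4GB delta in the table above is roughly consistent with an 8-bit KV cache.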
The Apple Silicon Angle
DeepSeek R1's distilled models run exceptionally well on Apple Silicon thanks to unified memory and llama.cpp optimizations. The M4 family is particularly strong:
- Mac Mini M4 (16GB): Runs 7B at Q8 or 14B at Q4 — 20-30 t/s
- Mac Mini M4 Pro (48GB): Runs 32B at Q8 or 70B at Q3 — 15-25 t/s
- MacBook Pro M4 Max (48-128GB): Runs 32B at Q8 easily, 70B at Q5 on the 64GB model — 20-35 t/s thanks to higher bandwidth
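Those Mac numbers line up with a simple ceiling: decode is memory-bandwidth bound, since every generated token streams the full set of weights once. The 546 GB/s figure for the M4 Max is an assumption here; check your machine's spec:

```python
def decode_tps_bound(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper bound on decode speed: each token reads all weights once."""
    return bandwidth_gbps / weights_gb

# M4 Max (~546 GB/s, assumed) streaming a ~20GB 32B Q4_K_M model
print(f"{decode_tps_bound(546, 20):.0f} t/s ceiling")
```

That gives a ceiling around 27 t/s, consistent with the 20-35 t/s range above once overhead is factored in.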
Check our Apple Silicon LLM benchmarks for exact numbers.
Bottom Line
- Best for most people: DeepSeek R1 32B on an RTX 3090 or RTX 4090 at Q4_K_M
- Budget pick: DeepSeek R1 14B on a 16GB GPU
- Mac pick: DeepSeek R1 32B at Q8 on 48GB unified memory
- Skip: The 671B on consumer hardware — use the API instead
DeepSeek R1's distilled models are some of the best reasoning models you can run locally. The 32B variant in particular hits a rare combination of quality and accessibility. If you're building a local AI rig and care about reasoning, this is the model to optimize for.
For help choosing the right GPU, see our complete VRAM guide and best GPUs for local LLMs.