CraftRigs
articles

DeepSeek V4 Local: Which GPU Do You Actually Need?

By Charlotte Stewart 5 min read
DeepSeek V4 Local: Which GPU Do You Actually Need?

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

DeepSeek V4 confirmed last week: 1 trillion open weights, MIT license, competitive with GPT-5.4 on coding and math benchmarks. The community has been running it on consumer hardware since the weights dropped, and the results confirm what the benchmark scores suggested — this is the most capable open-weight model released to date.

The 1T total parameter count sounds extreme, but DeepSeek V4 is a Mixture-of-Experts model with approximately 37B active parameters per forward pass. The official distilled variants (7B, 14B, 32B, 70B) are dense models trained to replicate the larger model's performance — and they're what most local users will actually be running.

Here's the full breakdown by tier.


Quick Summary

  • DeepSeek V4 full model (1T) requires 10+ high-VRAM GPUs — not practical for most home builders
  • Distilled variants (7B–70B) are where consumer hardware actually runs this model
  • The 32B distill on a single RTX 3090 is the sweet spot: competitive with GPT-5.4 on coding at zero API cost

Understanding the DeepSeek V4 Model Family

DeepSeek V4 ships as a model family, not a single monolithic file:

DeepSeek V4 (full, 1T MoE): The flagship. 1T total parameters, 37B active. Requires massive multi-GPU hardware.

DeepSeek V4 Distill 7B: Dense model distilled from V4. Punches above its weight class compared to other 7B models.

DeepSeek V4 Distill 14B: Fits comfortably on 12GB VRAM at Q4. Strong coding performance.

DeepSeek V4 Distill 32B: The consumer sweet spot. 24GB VRAM handles it at Q4. Competitive with GPT-4.5 on coding tasks.

DeepSeek V4 Distill 70B: Requires 48GB VRAM at Q4. Strongest consumer-runnable variant.

The distills are not the same model as V4 — they're smaller models trained on V4's outputs. But because they're trained on a 1T model's reasoning traces, they outperform models of similar size trained on human data alone.


Tier Breakdown by GPU

8GB VRAM — DeepSeek V4 Distill 7B

Compatible GPUs:

  • RTX 3070, RTX 3060 Ti, RTX 4060, RTX 4060 Ti 8GB
  • RX 6800, RX 7600
  • Any GPU with 8GB+ VRAM

Performance at 8GB:

  • DeepSeek V4 7B at Q4 (~5GB) — fits with room for context
  • DeepSeek V4 7B at Q8 (~8GB) — tight, may need to reduce context length

The 7B distill is genuinely useful for coding autocomplete, simple Q&A, and lightweight agent tasks. It's not GPT-5.4 replacement territory but it outperforms standard 7B models of earlier vintages significantly.

Quantization tradeoffs at 7B:

  • Q4_K_M: Best quality at 8GB, standard choice
  • Q8: Near-lossless quality, requires tight VRAM management
  • Q2: Not recommended — meaningful quality degradation at 7B

16GB VRAM — DeepSeek V4 Distill 14B

Compatible GPUs:

  • RTX 4060 Ti 16GB, RTX 4070, RTX 4070 Super
  • RX 9070 XT (16GB)
  • RTX 5060 Ti 16GB

Performance at 16GB:

  • 14B at Q4 (~10GB) — excellent fit with context headroom
  • 14B at Q8 (~14GB) — fits well, near-lossless quality

The 14B distill is meaningfully better than the 7B on complex reasoning chains. At Q8 quantization in 16GB VRAM, it's one of the most capable models per dollar of hardware at this tier.

Token speed (14B Q4 at 16GB VRAM): ~55–70 tokens/second depending on card.


24GB VRAM — DeepSeek V4 Distill 32B (The Sweet Spot)

Compatible GPUs:

  • RTX 3090 (24GB) — used ~$500
  • RTX 4090 (24GB) — new ~$1,600
  • RTX 5090 (32GB) — runs 32B at Q8 with room to spare

Performance at 24GB:

  • 32B at Q4 (~20GB) — fits cleanly with good context headroom
  • 32B at Q8 (~34GB) — doesn't fit, needs CPU RAM offload (15–20GB to RAM)

The 32B distill is where "competing with GPT-5.4 on coding" becomes a realistic claim. On HumanEval and LiveCodeBench, independent testers have put DeepSeek V4 Distill 32B within 5–8% of GPT-5.4. For debugging, code review, and software architecture questions, the gap is often imperceptible.

Token speed (32B Q4 at 24GB VRAM): ~30–50 tokens/second on RTX 3090, ~45–65 on RTX 4090.

Tip

RTX 3090 at Q4 for the 32B distill is the best dollars-to-performance configuration for local DeepSeek V4 inference. You get $0/token inference on a model that competes with GPT-5.4 on most coding tasks, for a hardware investment of $500–$600.


48GB VRAM — DeepSeek V4 Distill 70B

Compatible GPUs:

  • 2× RTX 3090 (~$1,000 used)
  • 2× RTX 4090 (~$3,200 new)
  • Single NVIDIA A6000 48GB (~$2,500 used)

Performance at 48GB:

  • 70B at Q4 (~40GB) — fits cleanly
  • 70B at Q5 (~50GB) — tight, just over 48GB, needs minimal offload or 64GB VRAM

The 70B distill is the strongest consumer-accessible variant. At Q4, it benchmarks at approximately 90–93% of GPT-5.4 across coding, math, and reasoning categories. For most professional coding workflows, this is indistinguishable from GPT-5.4.

Token speed (70B Q4 at 48GB): ~15–25 tokens/second over PCIe multi-GPU.


The Full 1T Model: What It Actually Takes

For completeness:

  • Q4 quantization: ~500GB VRAM required
  • Minimum practical config: 5× NVIDIA A100 80GB (400GB total, needs significant offload) or 7× A100 (560GB for cleaner inference)
  • Expected cost (used A100 80GB): $10,000–$15,000 per card × 7 = $70K–$105K

This is data center territory. The distilled variants are the practical answer for local inference.


Quantization Tradeoffs

Notes

Near-lossless, high VRAM

Good balance

Standard choice, good quality/VRAM ratio

Noticeable degradation on complex reasoning

Significant degradation — avoid unless necessary For the DeepSeek V4 distills, Q4_K_M is the default recommendation. The 32B at Q4 is better than the 70B at Q2 — always prefer a better quantization on a smaller model over a worse quantization on a larger one.


FAQ

Can you run DeepSeek V4 on a single consumer GPU? Yes, but only smaller distilled variants or heavily quantized versions of the full model. The DeepSeek V4 distill at 32B runs on a single RTX 3090 at Q4. The full 1T model requires 10+ high-VRAM GPUs. Most local users will want the 32B or 70B distilled versions.

What is the minimum GPU for running DeepSeek V4 locally? For the 7B distilled variant: any GPU with 8GB VRAM (RTX 3070, RTX 4060, RX 7600). For the 32B distilled variant: 24GB VRAM (RTX 3090 or RTX 4090). For the 70B distilled variant: 48GB minimum (two RTX 3090s or a single A6000).

How does DeepSeek V4 compare to GPT-5.4 on benchmarks? DeepSeek V4 is competitive with GPT-5.4 on coding tasks and mathematical reasoning, and slightly behind on open-ended reasoning and instruction following. The gap is narrow enough that for coding-focused use cases, the local DeepSeek V4 32B distill is a genuine GPT-5.4 alternative at zero API cost.

Technical Intelligence, Weekly.

Access our longitudinal study of hardware performance and architectural optimization benchmarks.