CraftRigs
Architecture Guide

DeepSeek V4 Hardware Requirements: Running 1T Parameters Locally

By Ellie Garcia · 6 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Running DeepSeek V4's 1 trillion parameters locally requires at least dual RTX 4090s at Q2 quantization, and even then you'll be severely constrained: roughly 8-12 tokens per second (tok/s) and no room for 1M context. For actually usable inference, you need 3×H100 or better. If you're running fewer than 300 long-context inferences per month, the Claude API (about $3 per 1M-token query) costs less than the hardware.


DeepSeek V4: 1T Parameters, 32B Active, 1M Context

DeepSeek V4 is a mixture-of-experts (MoE) model with 1 trillion parameters but only 32 billion active per token, so per-token compute is far lower than the raw parameter count suggests. It was trained on 20 trillion tokens and specializes in reasoning-heavy workloads. The real challenge isn't the model itself; it's the 1M-context window. At full length, the KV cache alone takes 64GB of VRAM, on top of the model weights.
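To make the KV-cache figure concrete, here is a minimal sketch that scales the 64GB-at-1M-tokens number quoted above down to shorter contexts. It assumes the cache grows in direct proportion to context length, and it treats "1M" as 2^20 tokens and "GB" as GiB; both are simplifying assumptions, and `kv_cache_gib` is a hypothetical helper rather than part of any library.

```python
# Assumption: KV cache size is linear in context length, and the guide's
# "64GB at 1M tokens" is read as 64 GiB at 2**20 tokens.
KV_GIB_AT_1M = 64.0
TOKENS_1M = 2**20


def kv_cache_gib(context_tokens: int) -> float:
    """Estimate KV cache size in GiB for a given context length."""
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    return KV_GIB_AT_1M * context_tokens / TOKENS_1M


for ctx in (8 * 1024, 128 * 1024, 256 * 1024, TOKENS_1M):
    print(f"{ctx:>9,} tokens -> {kv_cache_gib(ctx):6.2f} GiB")
```

Under these assumptions a 256K window works out to 16 GiB, which lines up with the KV-cache figure used for the dual RTX 5090 budget later in this guide.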

Model Specifications

Metric | Value
Total Parameters | 1 trillion (1T)
Active Parameters per Token | 32 billion (3.2% utilization)
Context Window | 1 million tokens (1M)
Training Data | 20 trillion tokens
Quantization Support | FP8, Q4, Q3, Q2, Q1
Architecture | Mixture-of-Experts (MoE) with top-2 routing

The MoE architecture is the key design. Unlike dense models where all parameters activate on every token, DeepSeek V4 routes each token through only two expert modules. This dramatically reduces memory bandwidth and compute compared to a hypothetical 1T dense model — but you still have to load all 1T parameters into VRAM before you can route anything.
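The routing step described above can be sketched in a few lines. This is one common formulation of top-2 gating (softmax over router scores, keep the two best experts, renormalize their weights); whether DeepSeek V4 normalizes its gates this way is an assumption, and the scalar "experts" and the names `top2_route` and `moe_forward` are purely illustrative.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[float], float]


def top2_route(router_logits: Sequence[float]) -> List[Tuple[int, float]]:
    """Return the two highest-scoring experts with renormalized gate weights."""
    if len(router_logits) < 2:
        raise ValueError("need at least two experts")
    # Softmax over all experts (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the two most probable experts and renormalize their weights.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    kept = sum(probs[i] for i in top2)
    return [(i, probs[i] / kept) for i in top2]


def moe_forward(x: float, router_logits: Sequence[float],
                experts: Sequence[Expert]) -> float:
    """Run only the two selected experts and mix their outputs by gate weight."""
    return sum(weight * experts[i](x) for i, weight in top2_route(router_logits))


# Four toy experts; a real expert is a feed-forward block, not a scalar function.
experts = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
logits = [0.1, 2.0, -1.0, 2.0]
print(top2_route(logits))
print(moe_forward(3.0, logits, experts))
```

Only two of the four experts are evaluated for the input, which is the compute saving; all four still have to be resident in memory so the router can pick any of them.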


Dual RTX 4090 Minimum Setup with Tensor Parallelism

Dual RTX 4090 is the bare-minimum consumer-grade setup. Each GPU has 24GB VRAM, totaling 48GB. Tensor parallelism (TP-2) splits the model weights across both GPUs, so theoretically you can fit ~48GB of model data. The problem: the math gets tight when you add the KV cache.
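As a rough illustration of what TP-2 does to a single layer, the sketch below splits a weight matrix column-wise into two shards, lets each shard compute its slice of the output, and then gathers the slices. It uses plain Python lists in place of GPU tensors, and `split_columns` and `tp_matvec` are hypothetical names; real frameworks perform the same split with collective operations across devices.

```python
from typing import List

Matrix = List[List[float]]


def matvec(x: List[float], w: Matrix) -> List[float]:
    """Compute x @ w for a row vector x and a matrix w stored as rows."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w column-wise into equal shards, one per GPU."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must divide evenly across shards")
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tp_matvec(x: List[float], w: Matrix, parts: int = 2) -> List[float]:
    """Each shard computes its slice of the output; slices are then gathered."""
    out: List[float] = []
    for shard in split_columns(w, parts):  # on real hardware, one shard per GPU
        out.extend(matvec(x, shard))       # gathering slices is the inter-GPU traffic
    return out


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(tp_matvec(x, w))
```

Each shard holds half of the weights, which is why TP-2 halves the per-GPU weight footprint, while the gather step is where interconnect bandwidth starts to matter.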

VRAM Allocation on Dual RTX 4090

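Quantization | Model weights (total) | Model weights (per GPU) | Activations | KV cache (1M context) | Activations + KV cache
Q1 | 60GB | 30GB | ~20GB | 64GB | 84GB

You cannot fit 1M context on dual RTX 4090 at any quantization. At Q2, you're looking at 100GB total per GPU, and you have 24GB. You'd need to reduce context to around 8,000 tokens to stay within VRAM limits.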

Warning

Marketing materials often cite the dual RTX 4090 as "capable" of running DeepSeek V4. This is technically true at Q2 with tiny context windows (under 16K tokens), but it's not 1M-context capable. Expect inference speeds of 8-12 tok/s at best, and your context window is functionally limited to what fits in the remaining 8-10GB of VRAM after model weights load. This is not a practical setup for the model's intended 1M-context use case.

Practical Setup: Dual RTX 5090

The RTX 5090 (48GB per GPU) is the realistic consumer path to 1M-context capability. With 96GB total and TP-2 parallelism, you can load:

  • Model weights at Q2: 60GB total (30GB per GPU) ✓
  • Activations: ~20GB (10GB per GPU) ✓
  • KV cache for ~256K context: 16GB (8GB per GPU) ✓
  • Remaining headroom: essentially none, since the three items above already total 48GB per GPU

This gives you a ~256K context window at batch size 1, with inference speeds around 15 tok/s. For the full 1M context, you'd still need to add a third GPU such as an H100 or RTX 6000 for KV cache overflow, which defeats the purpose of the consumer setup.
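The budget above can be checked with a small calculator. This is a sketch under this guide's own figures (48GB per GPU, 60GB of weights, ~20GB of activations, 64GB of KV cache at 1M tokens), all taken as given rather than derived; the `Budget` class is hypothetical, and you should plug in your own card's actual VRAM before relying on the result.

```python
from dataclasses import dataclass

TOKENS_1M = 2**20  # assumption: "1M context" is read as 2**20 tokens


@dataclass
class Budget:
    gpus: int
    vram_per_gpu_gb: float
    weights_gb: float          # total across all GPUs
    activations_gb: float      # total across all GPUs
    kv_gb_at_1m: float = 64.0  # guide's figure, assumed linear in context

    def free_for_kv_gb(self) -> float:
        """VRAM left for the KV cache after weights and activations."""
        return self.gpus * self.vram_per_gpu_gb - self.weights_gb - self.activations_gb

    def max_context_tokens(self) -> int:
        """Largest context whose KV cache fits in the remaining VRAM."""
        free = self.free_for_kv_gb()
        if free <= 0:
            return 0
        return int(free / self.kv_gb_at_1m * TOKENS_1M)


dual_gpu = Budget(gpus=2, vram_per_gpu_gb=48, weights_gb=60, activations_gb=20)
print(dual_gpu.free_for_kv_gb(), "GB free for KV cache")
print(dual_gpu.max_context_tokens(), "tokens of context")
```

With these inputs the KV cache takes everything that is left after weights and activations, so any extra batch or context has to come out of one of the other line items.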


Memory Bandwidth Requirements and PCIe Bottleneck Mitigation

The 1M context window isn't just a storage problem — it's a bandwidth problem. Every inference token requires reading the KV cache from VRAM, and at 1M context depth, that's a lot of data moving off the GPU.

Communication Bottleneck Analysis

With tensor parallelism (TP-2), the two GPUs must synchronize model layers and share intermediate activations. The typical pattern:

  • PCIe bandwidth needed: 240 GB/s (for 1M context with batch size 1)
  • PCIe 5.0 peak bidirectional: 256 GB/s
  • Utilization: 93%

You're right at the edge. A single unexpected access pattern or a batch size of 2 and you're bottlenecked. NVLink on datacenter GPUs such as the H100 offers 900 GB/s, which is why enterprise setups don't have this problem.

Tip

If you're building for DeepSeek V4 at scale (300+ inferences/month), prioritize GPUs with NVLink or native multi-GPU interconnect over raw VRAM. PCIe 5.0 is a ceiling, not a floor, for 1M-context workloads.

For consumer setups, the practical workaround is reducing context to 256K or 512K tokens, where PCIe bandwidth stays comfortably under 70% utilization.
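The utilization figures in this section follow from a simple ratio, sketched below. The 240, 256, and 900 GB/s values are the ones quoted above and are used as-is; the linear scaling of required bandwidth with context length is an added assumption, and both function names are hypothetical. Real link throughput depends on PCIe generation and lane count and usually lands below the quoted peak, so measured numbers should replace these constants.

```python
def link_utilization(required_gb_s: float, link_gb_s: float) -> float:
    """Fraction of an interconnect's peak bandwidth that a workload consumes."""
    if link_gb_s <= 0:
        raise ValueError("link bandwidth must be positive")
    return required_gb_s / link_gb_s


def required_at_context(context_tokens: int, gb_s_at_1m: float = 240.0) -> float:
    """Assumption: required bandwidth scales linearly with context length."""
    return gb_s_at_1m * context_tokens / 2**20


PCIE_GB_S = 256.0    # figure quoted in this section, taken as given
NVLINK_GB_S = 900.0  # figure quoted in this section

for label, ctx in (("256K", 256 * 1024), ("512K", 512 * 1024), ("1M", 2**20)):
    need = required_at_context(ctx)
    print(f"{label}: PCIe {link_utilization(need, PCIE_GB_S):.1%}, "
          f"NVLink {link_utilization(need, NVLINK_GB_S):.1%}")
```

Swapping in the bandwidth you actually measure between your two cards is the quickest way to see how much context your own link can sustain.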


Cost Comparison: Local Inference vs Claude API for 1M-Context Tasks

Let's be honest: running DeepSeek V4 locally isn't cheaper than Claude API unless you're running it a lot.

Break-Even Calculation

Winner, in order of increasing monthly inference volume:

  • Claude API (10x cheaper)
  • Claude API (2x cheaper)
  • Local (break-even)
  • Local (5x cheaper)

Assumptions:

  • Dual RTX 4090 power draw: 800W, 24/7 operation, $0.12/kWh electricity
  • DeepSeek V4 processing time: ~2 minutes per 1M-context inference
  • Claude API pricing: $3 per 1M-token inference
  • Hardware amortization (3-year lifespan): not included in monthly cost

The picture changes if you add hardware cost. A dual RTX 4090 setup costs ~$3,600 upfront. Spread over 3 years, that is $100 per month, or about $0.33 per inference at 300 inferences per month, on top of electricity. Amortization therefore raises the break-even volume, and it rises much further for the H100-class hardware that usable 1M-context inference requires.
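The break-even arithmetic can be reproduced with a short script. This is a sketch using only the assumptions listed above (800W around the clock, $0.12/kWh, $3 per API inference, $3,600 of hardware over 36 months) plus one added assumption of a 30-day month; the function names are hypothetical.

```python
import math

HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, always-on rig


def monthly_local_cost(power_w: float, usd_per_kwh: float,
                       hardware_usd: float = 0.0, lifespan_months: int = 36) -> float:
    """Fixed monthly cost of an always-on rig: electricity plus amortization."""
    electricity = power_w / 1000 * HOURS_PER_MONTH * usd_per_kwh
    amortization = hardware_usd / lifespan_months
    return electricity + amortization


def break_even_inferences(monthly_cost: float, api_usd_per_inference: float) -> int:
    """Smallest monthly volume at which the API bill reaches the local cost."""
    return math.ceil(monthly_cost / api_usd_per_inference)


power_only = monthly_local_cost(800, 0.12)
with_hardware = monthly_local_cost(800, 0.12, hardware_usd=3600)
print(f"electricity only:  ${power_only:.2f}/month")
print(f"with amortization: ${with_hardware:.2f}/month")
print(break_even_inferences(with_hardware, 3.0), "inferences/month to break even")
```

The result is only as good as its inputs. A rig that idles most of the day draws less power, and a build closer to the 3×H100 class costs far more upfront, so rerun the numbers with your own hardware price and duty cycle.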


Practical Recommendation: When to Build Local vs Use API

Use Claude API if:

  • You run fewer than 300 long-context inferences per month
  • You value speed (Claude API response time is faster than 2-minute local processing)
  • You need reliability over cost (no hardware failures, no setup complexity)
  • Your contexts are under 100K tokens (smaller context = lower API costs)

Build Local DeepSeek V4 if:

  • You're processing 500+ inferences per month with reasoning-heavy workloads
  • You need privacy (no third-party API calls, offline operation)
  • You have existing compute infrastructure (spare NVIDIA GPUs lying around)
  • You're building a product with razor-thin margins where per-token cost matters

Hybrid approach (our recommendation):

Use Claude API for one-off research and prototyping. Build local DeepSeek V4 only when you've validated that you actually need this specific model at this specific scale. Start with dual RTX 5090 (dual RTX 4090 if budget is tight), accept a 256K context window instead of 1M, and monitor your inference volume for 3 months. If you're hitting 300+ inferences monthly, the hardware investment pays for itself. If you're at 50, you're wasting money on electricity.


FAQ

Can I run DeepSeek V4 on a single RTX 4090?

No. At Q1 quantization (the most aggressive), model weights alone are 60GB. With activations and KV cache, you'd need 80-100GB total. A single RTX 4090 maxes out at 24GB. You need distributed inference (multiple GPUs) or cloud deployment.

What quantization level should I use?

Q2 is the sweet spot. It preserves model quality (benchmarks show <5% accuracy loss vs FP8) while fitting into consumer VRAM. Q3 and above are overkill for this model; Q1 is too aggressive and starts to lose reasoning capability.
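To see why fewer bits cost accuracy, here is a toy example that rounds a handful of values to 2^bits evenly spaced levels and reports the average error. It is a plain symmetric uniform quantizer, not the block-wise scheme that real Q2 or Q4 formats use, and the sample weights are made up for illustration; it shows the trend, not the model's actual accuracy loss.

```python
from typing import List


def quantize(values: List[float], bits: int) -> List[float]:
    """Symmetric uniform quantization: snap each value to one of 2**bits levels."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    levels = 2**bits
    scale = max(abs(v) for v in values) or 1.0  # avoid a zero-width grid
    step = 2 * scale / (levels - 1)
    return [round((v + scale) / step) * step - scale for v in values]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up sample weights, for illustration only.
weights = [-1.0, -0.6, -0.25, 0.0, 0.1, 0.4, 0.75, 1.0]
for bits in (8, 4, 3, 2, 1):
    err = mean_abs_error(weights, quantize(weights, bits))
    print(f"Q{bits}: mean abs error {err:.4f}")
```

On this sample the error grows with every bit removed, and the jump from 2 bits to 1 bit is the largest, which is the intuition behind treating Q1 as too aggressive.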

Will tensor parallelism slow down my inference?

Yes, but not by much. PCIe communication adds ~15-20% overhead on consumer setups with a good GPU-to-GPU connection. With NVLink on datacenter GPUs such as the H100, the overhead drops to ~5%. Inference speed is memory-bandwidth-limited anyway, so the communication cost is secondary.

Is the 1M context window actually useful?

Rarely. Most inference tasks top out at 64K-128K context. The 1M window is marketing—it's technically possible but impractical on consumer hardware. For document analysis or research, 256K context is plenty and dramatically cheaper to run.

What's the timeline for consumer GPUs that can do this comfortably?

There is no firm timeline. The RTX 6000 Ada is a workstation card that already ships with 48GB of VRAM, so it does not raise the per-GPU ceiling assumed for the dual RTX 5090 setup above, and it still communicates over PCIe. For now, dual RTX 5090 is the best consumer option.


Check out the ultimate guide to local LLM hardware in 2026 for broader GPU recommendations across all model sizes. If you want to benchmark your own setup once you build it, learn how to benchmark local LLM performance with standardized tools.
