CraftRigs
Architecture Guide

Run DeepSeek V3.2 Locally: Hardware Tiers and CPU Offload [2026]

By Georgia Thomas · 6 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: DeepSeek V3.2 (671B MoE) runs on consumer hardware — but only with the right quantization. Q2_K fits in ~38 GB combined VRAM, IQ3_S gives better MoE routing quality at a similar footprint, and Q4_K_M requires dual RTX 4090s or better. A 24 GB GPU + 64 GB RAM hybrid gets you 3-5 tok/s on CPU offload — slow but coherent for offline work. Skip this if you need real-time coding assistance; it's for overnight research, long-context summarization, and privacy-critical offline tasks.


What DeepSeek V3.2 Actually Requires (The Real Memory Math)

You saw 671B parameters and closed the tab. That's the wrong number to fixate on.

DeepSeek V3.2 uses a Mixture-of-Experts (MoE) architecture with 256 routed experts and 1 shared expert. At inference time, only ~37B parameters are active per forward pass — not 671B. The total parameter count matters for storage, not for active memory requirements.
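
To see why the active count is the number that matters, here is the arithmetic as a small script. The bytes-per-weight values are rough format averages (an assumption for illustration), not exact GGUF sizes:

```python
# Back-of-envelope memory math for a 671B-total / 37B-active MoE model.
# Bytes-per-weight values are rough format averages, not exact GGUF sizes.

TOTAL_PARAMS = 671e9    # what must live on disk
ACTIVE_PARAMS = 37e9    # what a single forward pass actually touches

def gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate size in GiB at a given precision."""
    return n_params * bytes_per_param / 2**30

fp16_disk = gib(TOTAL_PARAMS, 2.0)     # FP16: 2 bytes/weight
q2k_disk = gib(TOTAL_PARAMS, 0.35)     # Q2_K: ~2.5-3 bits/weight on average
q2k_active = gib(ACTIVE_PARAMS, 0.35)  # active experts only

print(f"FP16 on disk: ~{fp16_disk:,.0f} GiB")
print(f"Q2_K on disk: ~{q2k_disk:,.0f} GiB")
print(f"Q2_K active:  ~{q2k_active:,.0f} GiB + attention weights + KV cache")
```

The ~12 GiB of active expert weights is why a 24 GB card can hold the hot path while the remaining experts sit in system RAM.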

Here's what actually determines whether V3.2 runs on your hardware: the GGUF quantization level and your combined VRAM + RAM capacity.

Quantization | Minimum hardware | Recommended hardware | Expected tok/s
Q2_K | 24 GB GPU + 64 GB system RAM (hybrid) | Dual RTX 3090 (48 GB) | 3-5 (hybrid), 8-12 (dual GPU)
IQ3_S | Dual RTX 3090 (48 GB) + 32-64 GB system RAM | Dual RTX 4090 (48 GB) | 6-10 (dual 3090)
Q4_K_M | Dual RTX 3090 with system RAM assist, or dual RTX 4090 (48 GB) | H100/A100 class | 10-15 (dual 4090), 15-25 (H100/A100)

The "minimum" column assumes you're willing to use CPU offload and accept slower inference. The "recommended" column targets usable speeds for actual work.

MoE Architecture — Why V3.2's Memory Footprint Is Smaller Than 671B Implies

MoE routing works like this: for each token, the model selects a small subset of experts to process it. V3.2 activates 8 experts per token from its pool of 256, plus the shared expert. This means:
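
As a sketch of the idea, here is toy top-k routing. This is a from-scratch illustration, not DeepSeek's implementation, which adds a shared expert, grouped routing, and bias-based load balancing on top of this mechanism:

```python
import math
import random

# Toy top-k expert routing: a gating layer scores every expert and only the
# top-k run for this token. The expert counts match the article; everything
# else here is simplified for illustration.

NUM_EXPERTS = 256
TOP_K = 8

def route(gate_scores):
    """Return indices of the top-k experts and their normalized weights."""
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:TOP_K]
    exps = [math.exp(gate_scores[i]) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts, weights = route(scores)
# Only these 8 expert FFNs execute for this token; their outputs are mixed
# by `weights` (and the always-on shared expert is added in).
print(experts)
```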

  • Storage: You need the full 671B weights on disk (~1.3 TB at FP16, ~190-450 GB quantized)
  • Active memory: Only the active experts, attention weights, and KV cache live in VRAM/RAM during inference
  • Practical result: A Q2_K V3.2 needs roughly the same active memory as a dense 70B model at Q4_K_M

The catch: MoE routing quality degrades faster than dense models at aggressive quantization. Q2_K on V3.2 isn't equivalent to Q2_K on Llama 3.1 70B — the expert selection mechanism suffers more from weight compression. IQ3_S exists specifically to address this, using importance matrix quantization to preserve routing decisions.


The Three Hardware Tiers for V3.2

Tier 1: The Hybrid Build (24 GB GPU + 64 GB RAM, ~$800-1,200 used, as of April 2026)

You've got one RTX 3090 or 4090 and a high-RAM workstation. Every guide tells you to "just use the API" because V3.2 won't fit.

It fits. It's slow. It's completely viable for specific workflows.

Community benchmarks from r/LocalLLaMA show consistent 3-5 tok/s on a single RTX 3090 (24 GB) with 64 GB system RAM, running Q2_K with 40% CPU offload via llama.cpp. LM Studio users report similar numbers with default CPU offload settings.

  • 3-5 tok/s is unusable for interactive coding — you'll wait 30 seconds for a 150-token response
  • Acceptable for: overnight document analysis, batch summarization, privacy-critical offline inference, long-context research where you'd otherwise wait for API rate limits
  • Requires llama.cpp with NUMA-aware CPU offload: -ngl 25 (25 layers on GPU), rest on CPU

The hybrid approach shines for context lengths above 32K tokens. API providers often throttle or charge heavily for 128K+ context; your local hybrid build handles it at the same 3-5 tok/s regardless of context length.

Exact config for hybrid tier:

llama.cpp flags: -m deepseek-v3-2-Q2_K.gguf -ngl 25 -c 32768 --threads 16
Hardware: RTX 3090 24 GB + 64 GB DDR4/DDR5, PCIe 3.0/4.0 x16
Expected: 3.2-4.8 tok/s prompt processing, 3.5-5.2 tok/s generation

Tier 2: The Dual 3090 Build (48 GB VRAM, ~$1,400 used, as of April 2026)

You want usable speeds without datacenter GPU prices. Two used RTX 3090s look tempting but tensor parallelism overhead makes you hesitate.

8-12 tok/s at Q2_K, 6-10 tok/s at IQ3_S — genuinely usable for daily coding and research.

Dual RTX 3090 builds with NVLink (or even without, on PCIe) consistently hit these numbers in community testing. The key is proper layer splitting: llama.cpp's -sm row (split mode row) works better for MoE than layer-wise splitting.

  • No NVLink on 3090s means higher inter-GPU communication overhead — expect 10-15% speed loss vs. ideal
  • Power consumption: 700W+ under load, requires 1000W+ PSU
  • IQ3_S at 48 GB is tight — you'll need 32-64 GB system RAM as buffer for context cache

IQ3_S vs. Q2_K is the decision most builders get wrong. IQ3_S uses ~25% more memory but preserves expert routing quality measurably better. For coding tasks where V3.2's reasoning matters, IQ3_S is worth the speed hit. For summarization and extraction, Q2_K is fine.

Exact config for dual 3090 tier:

llama.cpp flags: -m deepseek-v3-2-IQ3_S.gguf -ngl 999 -sm row -c 65536
Hardware: 2× RTX 3090 24 GB, 64 GB system RAM, 1000W+ PSU
Expected: 6.5-9.5 tok/s generation, 128K context viable

Tier 3: The Dual 4090 Build (48 GB VRAM, ~$3,200 new/used, as of April 2026)

You want Q4_K_M quality without enterprise hardware. Dual RTX 4090s are the ceiling of consumer builds.

10-15 tok/s at Q4_K_M — quality indistinguishable from API inference for most tasks.

Dual RTX 4090 builds with proper cooling hit 12-15 tok/s consistently. The Ada Lovelace architecture's memory bandwidth advantage (1,008 GB/s vs. 936 GB/s) matters less than the larger L2 cache for MoE's scattered memory access patterns.

  • Q4_K_M at 380 GB file size needs 90-100 GB combined memory — 48 GB VRAM + 64 GB RAM minimum
  • 4090s have no NVLink, so tensor parallelism overhead is higher than 3090s with NVLink bridge
  • Power: 900W+ under load, 1200W PSU recommended

This is the tier where you start questioning the economics. At $3,200 for GPUs alone, plus supporting hardware, you're approaching used A100 80 GB territory — which gives you 80 GB of VRAM in a single address space, with no tensor parallelism complexity.


CPU Offload Deep Dive: When 3-5 tok/s Is Worth It

Most guides dismiss CPU offload because it's slow. They're not wrong about the speed — they're wrong about the use cases.

When CPU offload makes sense:

Use case | Real example
Overnight batch analysis | Process 500 research papers, review results in the morning
Book-length summarization | Summarize entire books without per-token pricing anxiety
Privacy-critical inference | Legal/medical documents that can't touch cloud APIs
Pre-purchase evaluation | Verify V3.2's performance on your specific tasks before investing in hardware

The memory hierarchy reality:

CPU offload isn't "falling back to RAM" — it's a deliberate tiered storage strategy. Modern DDR5-5600 provides ~90 GB/s bandwidth, versus PCIe 4.0 x16's ~32 GB/s. The bottleneck isn't RAM speed; it's PCIe latency for weight loading and the CPU's inference throughput.
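
You can sanity-check the reported speeds with a roofline-style bound: decode streams the active weights once per generated token, so bandwidth divided by bytes read per token caps tok/s. The bytes-per-weight figure below is a rough Q2_K average, an assumption for illustration:

```python
# Roofline-style bound on decode speed: each generated token streams the
# active weights once, so tok/s <= bandwidth / bytes_read_per_token.
# Illustrative numbers; real runs also pay PCIe, KV-cache, and routing costs.

ACTIVE_PARAMS = 37e9
Q2K_BYTES_PER_PARAM = 0.35            # rough Q2_K average (~2.8 bits/weight)
active_bytes = ACTIVE_PARAMS * Q2K_BYTES_PER_PARAM   # ~13 GB per token

def tok_s_bound(bandwidth_gb_s: float, fraction_behind: float) -> float:
    """Upper bound when `fraction_behind` of the active bytes sit behind
    this bandwidth tier (the slow tier dominates a hybrid split)."""
    return bandwidth_gb_s * 1e9 / (active_bytes * fraction_behind)

print(f"pure CPU (90 GB/s DDR5): <= {tok_s_bound(90, 1.0):.1f} tok/s")
print(f"hybrid, 60% on RAM side: <= {tok_s_bound(90, 0.6):.1f} tok/s")
```

The measured 3-5 tok/s sits under the ~7 tok/s DDR5 ceiling, which is what you'd expect once PCIe transfers and CPU compute are added back in.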

llama.cpp's CPU backend uses AVX-512 (AMD Zen 4 and newer) and AMX (recent Intel server chips) for respectable throughput. A 16-core Zen 4 or 13th-gen Intel part can hit 2-3 tok/s on pure CPU for smaller models; the hybrid approach keeps the hot path on the GPU, where MoE routing decisions happen.

Tuning for hybrid performance:

Critical llama.cpp flags:
-ngl N           # Layers on GPU — experiment between 20-30 for 24 GB cards
--threads N      # Physical cores, not hyperthreads — typically 8-16
--mlock          # Prevent swapping on RAM-resident weights
--no-mmap        # Force full load into RAM — slower startup, faster inference
-c 32768         # Context size — larger contexts need more RAM for KV cache
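
To see why -c drives RAM use, here is a naive KV-cache sizing function. The formula is for plain grouped-query attention and the dimensions are hypothetical; DeepSeek's V3-family models use multi-head latent attention, which stores a compressed cache, so their real footprint is smaller:

```python
# Naive KV-cache sizing, to show why context length drives RAM use.
# Made-up dimensions for a generic GQA model; MLA models cache far less.

def kv_cache_gib(ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> float:
    """GiB needed to cache K and V (fp16) for `ctx` tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx * per_token / 2**30

# Hypothetical 61-layer model with 8 KV heads of dim 128:
for ctx in (32_768, 131_072):
    print(f"-c {ctx:>6}: ~{kv_cache_gib(ctx, 61, 8, 128):.1f} GiB")
```

The cache grows linearly with context, so quadrupling -c from 32K to 128K quadruples the RAM the KV cache claims.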

IQ3_S vs. Q2_K: The Quantization Decision That Matters

Here's the benchmark nobody shows: MoE routing quality at aggressive quantization.

(Table: quantization quality comparison. Only the Q4_K_M row is recoverable: ~380 GB file size; 8.4, 98%, 78%, 92% on the original quality metrics.)

The practical difference: Q2_K's routing errors compound across 64+ layers. A misrouted token early in the sequence degrades subsequent expert selections. IQ3_S's importance matrix quantization specifically protects the gating network weights that determine expert selection.
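
The compounding claim is just multiplication: if each layer routes slightly wrong some fraction of the time, the chance of a fully clean path through the stack decays exponentially with depth. The per-layer accuracies here are illustrative assumptions, not measured values:

```python
# Small per-layer routing error rates compound multiplicatively over depth.
# Per-layer accuracies below are illustrative, not benchmarked numbers.

N_LAYERS = 64

def clean_path_prob(per_layer_acc: float, n_layers: int = N_LAYERS) -> float:
    """Probability every layer routes correctly, assuming independent errors."""
    return per_layer_acc ** n_layers

for acc in (0.999, 0.99, 0.97):
    print(f"{acc:.3f} per layer -> {clean_path_prob(acc):.2f} end to end")
```

Even 99% per-layer routing accuracy leaves barely half of tokens with a fully clean path over 64 layers, which is why protecting the gating weights pays off disproportionately.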

Our recommendation:

  • Q2_K for summarization, extraction, classification — tasks where exact reasoning matters less than coverage
  • IQ3_S for coding, math, multi-step reasoning — the routing accuracy improvement is worth 25% more memory and ~20% speed cost
  • Q4_K_M only if you've got the VRAM — the quality jump from IQ3_S is smaller than the memory jump suggests

The Bottom Line

DeepSeek V3.2's 671B parameter count is a storage requirement, not a memory sentence. With Q2_K quantization and CPU offload, it runs on hardware you might already own. With dual RTX 3090s, it runs fast enough for daily work. With dual 4090s, it matches API quality at local control.

The guides telling you V3.2 needs $20,000 in datacenter GPUs are selling cloud services or haven't checked the quantization math. The guides dismissing CPU offload haven't tried overnight batch processing at 128K context.

Pick your tier based on speed requirements, not parameter count. The model is more accessible than it looks.

