
DeepSeek V4 on Ascend 950PR: Can CUDA GPUs Still Run It? [2026]

By Georgia Thomas

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

TL;DR: DeepSeek V4 was trained on Huawei's Ascend 950PR cluster — but your CUDA GPU can still run it. Training hardware doesn't dictate inference hardware; quantized weights are format-agnostic. The real blocker is scale: V4's ~1T parameters require roughly 560 GB at Q4_K_M quantization, and still ~280 GB at aggressive 2-bit quants, making V3 the smarter local inference target for now. Your rig isn't obsolete. The headlines are just noisy.


What DeepSeek V4 Is — Ascend 950PR Training and the CUDA-Free Claim (April 2026)

You've been running DeepSeek V3 locally for months. It's been solid — 671B parameters, ~37B active per token, GGUF-quantized down to something your 2×RTX 4090 build (plus a healthy pile of system RAM for offload) handles well enough to live with. Then V4 drops with a headline that stops you cold: trained entirely on Huawei Ascend, no NVIDIA silicon involved.

The immediate panic is understandable. If they trained it on Ascend, do you need Ascend to run it? Did your CUDA stack just become a paperweight for frontier models?

No. Here's what actually happened, and why your GPUs are fine.

DeepSeek V4 is a ~1 trillion parameter Mixture-of-Experts (MoE) model — roughly 50% larger than V3's 671B total. The training cluster used Huawei's Ascend 950PR accelerators, not H100s or B200s. This is genuinely significant: it's the first frontier model with competitive benchmarks to be trained without NVIDIA infrastructure at its core. DeepSeek's stated motivation was export restriction resilience — building capability that doesn't depend on CUDA hardware they might lose access to.

But training and inference are different problems with different constraints. Training requires massive parallel throughput, optimized all-reduce communication across thousands of accelerators, and tight integration with a specific software stack. Inference — especially local inference — is about VRAM capacity, memory bandwidth, and efficient quantization. The weights don't care where they were born.
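To make that concrete, here's a back-of-envelope sketch of the decode-speed ceiling for a memory-bandwidth-bound model. The bandwidth and bits-per-weight figures are illustrative assumptions, not measurements:

# Decode-speed ceiling for a bandwidth-bound MoE model. Assumes every
# active weight is read once per token; ignores KV cache, activations,
# and compute. Illustrative numbers only.

def decode_ceiling_tps(active_params: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# DeepSeek V3: ~37B active params at ~4.5 bits/weight (Q4_K_M-class quant)
print(decode_ceiling_tps(37e9, 4.5, 1008))  # one RTX 4090 (~1 TB/s): ~48 t/s
print(decode_ceiling_tps(37e9, 4.5, 100))   # weights in DDR5 (~100 GB/s): ~5 t/s

Same weights, wildly different ceilings, and training hardware appears nowhere in the formula. That's the whole distinction in three lines of arithmetic.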

Ascend 950PR Specs — What Huawei Built to Replace NVIDIA (Brief)

The Ascend 950PR is Huawei's training-focused flagship. Here's how it stacks up against what NVIDIA's shipping:

H100 SXM5: 80 GB HBM3, ~3.35 TB/s memory bandwidth, ~989 TFLOPS FP16 tensor compute.
The Ascend 950PR lags H100 on raw bandwidth and interconnect speed — TrendForce's analysis suggests training throughput is roughly 70-80% of an equivalent H100 cluster for the same accelerator count. But DeepSeek's training efficiency (their signature MoE optimizations, aggressive data parallelism) closed enough of that gap to produce a competitive model. The point isn't that Ascend beats NVIDIA head-to-head. It's that it's viable for frontier training without CUDA dependency.

That viability matters geopolitically. For your local rig, it matters barely at all.

DeepSeek V4 Architecture — MoE Structure and Active Parameter Count (Brief)

V4 follows V3's MoE blueprint: total parameters are massive, but only a subset "activate" per token. DeepSeek hasn't published final V4 architecture details as of April 2026, but the ~1T total parameter count with similar expert-routing density suggests roughly 50-60B active parameters per forward pass.

This matters for inference speed — active parameters determine compute per token — but not for memory requirements. The full weight file still loads into VRAM (or gets memory-mapped from system RAM). MoE doesn't magically shrink your GGUF file.

For context: V3's Q4_K_M quant weighs in around 380 GB, with aggressive 2-bit-class quants landing at ~180-200 GB depending on implementation. V4 at 1T parameters scales roughly linearly: expect ~1 TB of raw FP8 weights, ~560 GB at Q4_K_M, and ~280 GB at quality-sacrificing 2-bit quants.
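Those sizes are straight arithmetic you can sanity-check yourself. A minimal sketch, where the bits-per-weight values are rough effective averages for each quant family and real GGUF files add a little metadata on top:

def quant_size_gb(total_params: float, bits_per_weight: float) -> float:
    # Weights only; bits_per_weight is an effective average that already
    # folds in per-block quantization scales.
    return total_params * bits_per_weight / 8 / 1e9

for label, params in [("V3 (671B)", 671e9), ("V4 (~1T)", 1000e9)]:
    print(f"{label}: Q4_K_M ~{quant_size_gb(params, 4.5):.0f} GB, "
          f"2-bit class ~{quant_size_gb(params, 2.25):.0f} GB")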


Why Your CUDA GPU Still Runs V4 — The Training/Inference Distinction

The confusion flooding tech media conflates two separate questions: What trained this model? and What can run this model? The first is interesting for supply chain analysts and geopolitical watchers. The second is what you actually care about.

You've invested thousands in a multi-GPU CUDA build. Headlines suggest that investment is stranded on the wrong side of a hardware divide. You're wondering if you need to start pricing Ascend inference cards, or if V4 is simply off-limits for local runs.

Your CUDA stack runs V4 just fine. The weights are released in standard formats — PyTorch checkpoints, Safetensors, and GGUF quantizations through the usual channels (llama.cpp, koboldcpp, text-generation-webui). There's no proprietary Ascend runtime required for inference.
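For illustration, loading a GGUF through llama-cpp-python looks identical no matter what cluster trained the weights. The file name below is a hypothetical placeholder (substitute whatever quant you actually download), and n_gpu_layers is the knob that splits layers between VRAM and system RAM:

from llama_cpp import Llama

llm = Llama(
    # Hypothetical file name: point at the first shard of a split GGUF
    # and llama.cpp loads the remaining shards automatically.
    model_path="deepseek-v4-Q2_K-00001-of-00007.gguf",
    n_gpu_layers=40,  # layers resident in VRAM; the rest stay mmap'd in RAM
    n_ctx=8192,       # context window; the KV cache costs VRAM too
)

out = llm("Explain MoE expert routing in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])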

DeepSeek V3 was trained on NVIDIA H800 clusters, yet it runs identically on CUDA, ROCm, Apple Silicon, and Qualcomm NPUs. The hardware that trained a model has never constrained what can serve it. V4 continues this pattern — the training hardware choice was operational necessity, not technical lock-in.

The actual blocker isn't silicon nationalism. It's arithmetic. 1T parameters at usable quantization needs more VRAM than most personal builds have.

So what does it take to run V4 locally? And when is V3 still the better call?


VRAM Reality Check — What V4 Actually Requires

Let's stop talking about whether your GPU can run V4 and start talking about whether it should.

V4 memory requirements by quant level:

Quant          Size       Notes
Raw FP8        ~1 TB      Theoretical only — no consumer or prosumer build
Q4_K_M         ~560 GB    Data center territory, 8×A100 80 GB minimum
2-bit class    ~280 GB    Multi-GPU prosumer build, heavy system-RAM offload

The RTX 5090's 32 GB of VRAM helps, but four of them still only get you 128 GB. You'll be memory-mapping 150+ GB from system RAM, which tanks token generation speed. Expect 2-5 tokens/second with heavy CPU offload versus V3's 15-30 t/s on the same hardware.
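To see why offload hurts this badly, weight each byte by where it lives: the slow tier dominates. A rough sketch under simple assumptions (uniform access across the weights, no transfer/compute overlap; the bandwidth figures are guesses, and the ~55B active-parameter count comes from the architecture estimate above):

# Rough tokens/sec with partial offload. Per-token weight reads are split
# in proportion to where the weights live (a uniform-access simplification).

def offload_tps(active_params: float, bits_per_weight: float,
                vram_gb: float, model_gb: float,
                vram_bw_gb_s: float = 1000.0,  # assumed effective GPU bandwidth
                ram_bw_gb_s: float = 40.0      # assumed CPU/system-RAM path
                ) -> float:
    gb_per_token = active_params * bits_per_weight / 8 / 1e9
    vram_frac = min(vram_gb / model_gb, 1.0)
    seconds = gb_per_token * (vram_frac / vram_bw_gb_s
                              + (1 - vram_frac) / ram_bw_gb_s)
    return 1 / seconds

# V4 at a ~280 GB 2-bit-class quant, ~55B active, 4x RTX 5090 (128 GB VRAM):
print(offload_tps(55e9, 2.25, 128, 280))  # roughly 4-5 tokens/second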

This isn't a CUDA limitation. It's a parameter count limitation. An Ascend 910B inference card (24 GB HBM2e) would face identical constraints. The MoE architecture helps inference speed — less compute per token — but doesn't reduce memory footprint.


Build Recommendations — V3 vs. V4 for Local Inference

Here's what to actually do with this information.

Keep running V3. At 671B parameters, it's the sweet spot for high-quality local inference. A 2×RTX 4090 build (48 GB total) runs a ~190 GB 2-bit-class quant at livable speeds with the bulk of the weights memory-mapped from system RAM; a 4×RTX 5090 build (128 GB) keeps most of that quant resident in VRAM and spills the remainder. V3's reasoning capabilities are excellent — most users won't hit its ceiling before hitting their patience limit for token generation speed.

Consider V4 only if you have: 4×RTX 5090 or better, 256+ GB system RAM, and tolerance for slower generation. Even then, wait for community-optimized quants. Early V4 GGUF releases are often naive conversions — expect 10-20% efficiency improvements as quantizers adapt to its specific activation patterns.
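When optimized quants do land, you don't need the whole repo; pull only the shards for the quant level you want. A sketch with huggingface_hub, where the repo id and file pattern are hypothetical placeholders:

from huggingface_hub import snapshot_download

# Hypothetical repo id and pattern: substitute the actual community release.
snapshot_download(
    repo_id="someuser/DeepSeek-V4-GGUF",
    allow_patterns=["*Q2_K*"],         # fetch only the 2-bit-class shards
    local_dir="models/deepseek-v4-q2",
)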

Don't buy Ascend hardware for inference. The Ascend 910B and 310P inference cards exist, but their software stack (CANN, MindSpore) lacks the mature quantization and serving tools of CUDA. For local LLM use, they're strictly worse than equivalent NVIDIA cards at equivalent memory capacity.

Watch for distillation. DeepSeek typically releases smaller distilled versions 4-8 weeks after flagship drops. A 70B or 236B V4-distilled model would be the actual upgrade path for most local runners — not the full 1T parameter release.


The Real Story — Why Ascend Training Matters (Just Not for You)

DeepSeek V4's Ascend training is genuinely significant. It's proof that frontier AI development can decouple from NVIDIA's supply chain — not easily, not cheaply, but possibly. For data center operators in export-restricted regions, that's actionable intelligence. For investors tracking semiconductor diversification, it's a signal.

For you, running models in your basement or home office? It's a footnote. The weights are the weights. Quantization is quantization. Your CUDA GPU doesn't check the training cluster's provenance before loading a GGUF file.

The noise around "China breaks CUDA" is category error dressed up as technical analysis. Training infrastructure and inference infrastructure have always been separable. Google trained Gemma on TPUs; you run it on CUDA via Hugging Face downloads. DeepSeek trained V4 on Ascend; you run it the same way.

What is changing is parameter inflation. V3 to V4 is a 50% jump. The next generation may hit 2T parameters. At some point, local inference of full frontier models becomes physically impossible for non-institutional budgets — not because of hardware lock-in, but because you can't fit the weights in addressable memory.

We're not there yet. V3 is still excellent. V4 is runnable but impractical for most. Your build is fine. Keep quantizing, keep testing, and ignore the silicon nationalism panic.


deepseek-v4 huawei-ascend cuda-inference local-llm gguf-quantization multi-gpu 1t-parameter-models ascend-950pr vram-requirements
