The conventional wisdom on VRAM requirements is roughly 18 months out of date. It was built on Llama and Mistral benchmarks, and those two models no longer define the efficiency frontier. If you're sizing a GPU build on the assumption that "24 GB is the minimum for anything serious," there's a real chance you're overspending, or that the math has already shifted under your feet.
TL;DR: Chinese open-source models deliver more capability per GB of VRAM through two separate mechanisms. Dense models like Qwen 2.5 7B outperform Llama 3.1 8B on every benchmark at identical VRAM usage — through better training, not architectural tricks. MoE models like DeepSeek-V2-Lite activate only 2.4B out of 15.7B total parameters per token, running in 9-10 GB at Q4_K_M. On a 24 GB GPU, Qwen 2.5 32B Q4_K_M (19.85 GB) now scores above Llama 3.1 70B on MMLU — a model that needs 40+ GB and won't fit on any single consumer card. The hardware math has changed. Your model selection should too.
The Download Numbers Show the Shift
Qwen 2.5 7B-Instruct pulled 17.9 million HuggingFace downloads in March 2026. Llama 3.1 8B-Instruct pulled 8.4 million in the same window. That's not close. Qwen 2.5 32B alone (4.1 million monthly downloads) matches Llama 3.1 70B despite running in under half the VRAM.
This isn't driven by marketing. It's driven by people running both models side by side and noticing the difference. Chinese labs didn't just release competitive models; they released models that outperform their Western counterparts under the same hardware constraints.
Two Efficiency Stories — And Most Articles Conflate Them
There are two separate reasons Chinese models go further per GB. Conflating them leads to real hardware mistakes:
Dense efficiency: Qwen 2.5 models use standard transformer architecture — same as Llama. The efficiency gain is training quality. Qwen 2.5 was trained on 18 trillion tokens, roughly three times Llama 3.1's training data, with a stronger focus on code and mathematics. More capability per parameter, same parameter count, same VRAM footprint.
MoE efficiency: Models like DeepSeek-V2-Lite use Mixture of Experts architecture. Each token is routed to a small subset of specialized sub-networks — the rest are skipped. Total parameter count tells you how much storage you need. Active parameter count tells you how much compute runs per token. These are very different numbers in a MoE model.
Mixing these up leads to incorrect builds. Qwen 2.5 32B is a dense model — it doesn't activate a fraction of itself. DeepSeek-V3 is a 671B MoE model with 37B active per token, and it doesn't run on any single consumer GPU regardless of quantization. Both are Chinese, both are efficient, and they're efficient in completely different ways.
Dense Models: Better Benchmarks, Same VRAM Budget
The most actionable efficiency story is also the most overlooked: you don't need a bigger GPU, you need smarter model selection.
Qwen 2.5 7B-Instruct and Llama 3.1 8B-Instruct are effectively the same weight at Q4_K_M quantization — both around 4.9 GB. Here's what the benchmarks show (source: Qwen2.5 Technical Report, arxiv:2412.15115, last verified March 2026):
- MMLU-Pro: Qwen 56.3% vs Llama 48.3% (+8 pts)
- HumanEval: Qwen 84.8% vs Llama 72.6% (+12 pts)
- MATH: Qwen 75.5% vs Llama 51.9% (+24 pts)
- Arena-Hard: Qwen 52.0% vs Llama 27.8% (+24 pts)

Same ~4.9 GB footprint. Qwen 2.5 7B wins every row, with the math gap hitting 24 points. On an RTX 4060 Ti 8GB, Qwen 2.5 7B Q4_K_M fits in under 5 GB, leaving 3 GB headroom for context, and produces outputs that Llama 3.1 8B simply can't match on coding or reasoning tasks.
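The footprint figures follow from simple arithmetic. A minimal sketch, assuming Q4_K_M averages roughly 4.85 bits per weight across the whole model (a commonly cited llama.cpp figure; actual GGUF files vary slightly because some tensors stay at higher precision):

```python
def gguf_size_gb(params_billions: float, bits_per_weight: float = 4.85) -> float:
    """Approximate quantized weight size in GB.

    Q4_K_M averages ~4.85 bits per weight model-wide, so this is an
    estimate, not an exact file size.
    """
    return params_billions * bits_per_weight / 8

# Llama 3.1 8B has ~8.0B parameters; Qwen 2.5 7B has ~7.6B
print(f"Llama 3.1 8B @ Q4_K_M: ~{gguf_size_gb(8.0):.1f} GB")
print(f"Qwen 2.5 7B @ Q4_K_M: ~{gguf_size_gb(7.6):.1f} GB")
```

Both land within a few hundred MB of the published file sizes, which is why the two models occupy the same VRAM tier despite the benchmark gap.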
The pattern holds at 32B. Qwen 2.5 32B base model scores 83.3% on MMLU (source: Qwen2.5 Technical Report, Table 3, verified March 2026). Llama 3.1 70B base scores 79.3% (source: Meta Llama 3.1 official model card). Qwen 32B outscores Llama 70B on MMLU — fitting in 19.85 GB Q4_K_M versus a model that needs roughly 40 GB and won't load on a single RTX 4090.
Note
Dense efficiency doesn't mean MoE. Qwen 2.5 models are standard transformers. The benchmark gains come from training on 18 trillion tokens — roughly 3x Llama 3.1 — with stronger emphasis on code and math. There are no routing tricks, no expert activations. Just better training data applied to a similar architecture.
MoE Architecture: When Active Parameters Are the Real Number
For MoE models, the VRAM equation changes. Here's how to read it correctly.
A dense model runs all weights for every token. A MoE model uses a learned routing layer to pick a subset of expert sub-networks per token, skipping the rest. Total parameters describes storage requirements. Active parameters describes compute load — and that's the number that determines inference speed and, in practice with expert offloading, effective VRAM.
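The routing step can be sketched in a few lines. This is a toy illustration, not DeepSeek's implementation: real routers are learned linear layers trained with load balancing, and shared experts run unconditionally. The expert functions below are stand-ins.

```python
import math
import random

def top_k_route(hidden, experts, router, k=6):
    """Toy top-k MoE routing: score every expert cheaply, then
    actually execute only the k best. Skipping the rest is where
    the compute savings come from."""
    # Router scores: one cheap dot product per expert
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in router]
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    # Softmax over the selected experts' scores only
    m = max(logits[i] for i in top)
    gate = {i: math.exp(logits[i] - m) for i in top}
    z = sum(gate.values())
    out = [0.0] * len(hidden)
    for i in top:
        y = experts[i](hidden)  # only k of the experts ever run
        out = [o + (gate[i] / z) * yi for o, yi in zip(out, y)]
    return out, top

random.seed(0)
dim, n_experts = 8, 64
router = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [lambda x, s=i: [v * (1 + s / n_experts) for v in x]
           for i in range(n_experts)]
token = [random.gauss(0, 1) for _ in range(dim)]
out, active = top_k_route(token, experts, router, k=6)
print("active experts:", sorted(active), "of", n_experts)
```

With 64 experts and top-6 routing, fewer than 10% of the expert sub-networks execute for any given token, which is exactly the dense-compute savings the active-parameter count describes.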
DeepSeek-V2-Lite (source: deepseek-ai/DeepSeek-V2-Lite, HuggingFace):
- Total parameters: 15.7B
- Architecture: 64 routed experts + 2 shared experts per MoE layer, top-6 routing
- Active per token: 2.4B
- VRAM at Q4_K_M: approximately 9-10 GB
That's a 15.7B-parameter model running on the same VRAM as a dense 9-10B model, with the compute load of a 2.4B model. On a 16 GB card, you load it in under 10 GB and use the remaining 6 GB for KV cache — which matters a lot for long document work and extended code sessions.
Qwen3-30B-A3B (source: QwenLM/Qwen3 GitHub, released April 2025):
- Total parameters: 30B
- Activated per token: 3B
- Estimated VRAM at Q4_K_M: ~16-17 GB
- Fits on a 24 GB RTX 4090, room for KV cache
Qwen3 was the first Qwen release to put large open-weight MoE models front and center (the earlier Qwen1.5-MoE-A2.7B was a small-scale exception). The 30B-A3B supports both standard inference and a switchable "thinking" mode for step-by-step reasoning, with no separate model download required.
Warning
Do not confuse DeepSeek-V3 (671B total, 37B active per token) with DeepSeek-V2-Lite. V3 requires multi-GPU server infrastructure — even at extreme IQ1_M compression, the model weights alone hit ~126 GB. DeepSeek-V2-Lite is the consumer-accessible MoE model from the DeepSeek family. They are not different versions of the same thing.
Reading Active Parameters from a Model Card
Step 1: Find the "model architecture" section. If it says "Mixture of Experts" — you're in MoE territory and need to keep reading.
Step 2: Find the expert-count fields in config.json. The key names vary by family: DeepSeek-V2 uses n_routed_experts, Qwen3 uses num_experts, and both use num_experts_per_tok. DeepSeek-V2-Lite shows 64 routed experts, 6 activated per token. Qwen3-30B-A3B shows 128 experts, 8 activated.
Step 3: Estimate VRAM. Every expert must stay resident, so scale from total parameters, not active ones. For MoE models at Q4, a working estimate is: total params × ~0.5 GB per billion for weights, then a 1.3-1.5× multiplier for KV cache and inference overhead.
Step 4: Compare against a dense baseline using active-parameter count, not total-parameter count. A 15.7B MoE with 2.4B active competes with a dense 3-4B model on inference speed — not with a 16B dense model.
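The steps above can be sketched as a short script. The key names checked here reflect the published DeepSeek-V2 and Qwen3 config files, but verify against the actual model card; the sample config written below is a stand-in, not the real file.

```python
import json

def moe_vram_estimate(config_path, total_params_b, gb_per_b=0.5, overhead=1.4):
    """Read expert counts from a config.json and apply the step-3
    estimate. All experts stay resident, so weights scale with TOTAL
    parameters; only per-token compute scales with the active count."""
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "experts": cfg.get("n_routed_experts") or cfg.get("num_experts"),
        "active_per_token": cfg.get("num_experts_per_tok"),
        "weights_gb": round(total_params_b * gb_per_b, 1),
        "with_overhead_gb": round(total_params_b * gb_per_b * overhead, 1),
    }

# Stand-in config mimicking DeepSeek-V2-Lite's fields (not the real file)
with open("sample_config.json", "w") as f:
    json.dump({"n_routed_experts": 64, "num_experts_per_tok": 6}, f)

print(moe_vram_estimate("sample_config.json", total_params_b=15.7))
```

For DeepSeek-V2-Lite's 15.7B total, the estimate lands at roughly 8 GB of weights and about 11 GB with overhead, consistent with the 9-10 GB Q4_K_M figure cited above.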
What Each VRAM Tier Gets You in 2026
8-12 GB (Entry: ~$500-$1,200)
Best GPU: RTX 4060 Ti 8GB for the budget build.
Best model: Qwen 2.5 7B Q4_K_M (~4.9 GB).
This is a straight replacement for Llama 3.1 8B. Same hardware, same VRAM usage, better benchmarks across the board — 12 points on HumanEval, 24 points on MATH. For coding assistance, general-purpose chat, and document summarization, Qwen 2.5 7B is where to start in 2026.
On Apple Silicon (M3/M4), Qwen 2.5 7B runs cleanly in Ollama with MLX optimization. Unified memory handles it without breaking a sweat.
Verdict: Start with Qwen 2.5 7B Q4_K_M. Upgrade only when you've specifically identified what it can't do.
16 GB (Mid-Range: ~$1,200-$2,000)
Best GPU: RTX 4070 Ti Super, 16 GB GDDR6X, $799 MSRP (launch price, January 2024 — verify current street price before buying).
Model options:
- DeepSeek-V2-Lite Q4_K_M (~9-10 GB): 15.7B total, 2.4B active per token. 6 GB headroom for long-context KV cache. Best choice if you do extended document work, long code reviews, or multi-turn sessions.
- Qwen 2.5 14B Q4_K_M (~9 GB): Dense 14B, strong on instruction following and code generation.
- Qwen 2.5 32B Q3_K_M (~15.94 GB): Fits, but barely — limited context window, no headroom for longer sessions.
Tip
DeepSeek-V2-Lite's sparse activation means inference runs faster than the 15.7B parameter count implies. With only 2.4B active per token, throughput is closer to a 3B dense model's speed. If you're doing interactive work with many back-and-forth turns, this compounds over a session.
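A rough way to see why: single-stream decode is usually memory-bandwidth bound, so an upper bound on tokens per second is bandwidth divided by the bytes of weights streamed per token. A sketch, assuming ~4.85 bits per weight at Q4_K_M and the 4070 Ti Super's rated ~672 GB/s (check your card's spec; real throughput lands well below this bound once KV cache reads and kernel overhead are counted):

```python
def decode_toks_per_sec_bound(active_params_b, bandwidth_gb_s,
                              bits_per_weight=4.85):
    """Roofline upper bound on decode throughput: every generated
    token must stream the active weights through the GPU once."""
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token

# Dense model reads all 15.7B weights per token; MoE reads ~2.4B
for name, active_b in [("dense, 15.7B read/token", 15.7),
                       ("MoE, 2.4B read/token", 2.4)]:
    print(f"{name}: <= {decode_toks_per_sec_bound(active_b, 672):.0f} tok/s")
```

The bound scales inversely with active parameters, so the MoE ceiling sits roughly 6.5x higher than a dense model of the same total size; actual speedups are smaller but follow the same direction.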
See our complete build guide for this tier for full component recommendations beyond the GPU.
24 GB (High-End: ~$2,000+)
Best GPU: RTX 4090, 24 GB GDDR6X.
Model options:
- Qwen 2.5 32B Q4_K_M (19.85 GB confirmed): 4 GB remaining for KV cache. MMLU base 83.3% — above Llama 3.1 70B's 79.3% at less than half the parameters. This is the headline result of what better training achieves.
- DeepSeek-R1-Distill-Qwen-32B Q4_K_M (~19.85 GB): DeepSeek's reasoning-focused 32B distillation. Strong on multi-step and logical inference problems.
- Qwen3-30B-A3B Q4_K_M (~16-17 GB estimated): MoE, 3B active per token, thinking mode available. Leaves more room for context than the 32B dense options.
For comparison, Llama 3.1 70B at Q4_K_M requires roughly 40 GB — it doesn't fit on an RTX 4090. The 24 GB VRAM tier crossed a threshold when Qwen 2.5 32B landed: you can now run a model that beats 70B Llama on MMLU on a single consumer card.
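The "4 GB remaining for KV cache" figure is worth sanity-checking. KV cache grows linearly with context length; here is a sketch using illustrative GQA dimensions for a 32B-class model (64 layers, 8 KV heads, head_dim 128 are assumed values; read the real ones from the model's config.json):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens,
                bytes_per_elem=2):
    """Both K and V are cached per layer, per KV head, for every
    token in context. fp16 cache assumed (2 bytes per element);
    GQA keeps n_kv_heads small, which is what makes this tractable."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 1e9

# Assumed 32B-class GQA shape: 64 layers, 8 KV heads, head_dim 128
print(f"8k context: ~{kv_cache_gb(64, 8, 128, 8192):.1f} GB of KV cache")
```

Under these assumptions an 8k context needs a bit over 2 GB, which fits the ~4 GB of headroom left after loading the 19.85 GB Q4_K_M weights and explains why ~8,000-token contexts are workable on a 24 GB card.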
All benchmark data last verified March 2026 against official technical reports and model cards.
Common Misconceptions
"MoE models are still experimental." DeepSeek-V2-Lite has over 600,000 monthly downloads and runs in Ollama, llama.cpp, and LM Studio without any special setup. ollama pull deepseek-v2:16b works. The implementation is stable and well-supported.
"Chinese models are Llama clones." Qwen 2.5 uses its own tokenizer, its own GQA head configuration, and was trained from scratch on 18 trillion tokens. DeepSeek-V3 uses Multi-head Latent Attention (MLA), a novel attention architecture not present in Llama. Neither model is derived from Llama weights or architecture.
"Western models still lead on English tasks." Arena-Hard and HumanEval are English-language benchmarks. Qwen 2.5 7B-Instruct scores 52.0% on Arena-Hard versus Llama 3.1 8B-Instruct's 27.8%. The gap isn't a multilingual artifact — it holds on English reasoning.
"I need special tooling to run Qwen locally." Ollama supports Qwen 2.5 natively. ollama pull qwen2.5:7b, qwen2.5:14b, and qwen2.5:32b all work. GGUF variants are available for llama.cpp and LM Studio. Setup instructions are identical to any other model — check our Ollama setup guide if you're starting fresh.
Where This Is Headed
MoE isn't a temporary detour. Qwen3's open-weight MoE release (30B-A3B, 235B-A22B) in April 2025 marked the point where MoE moved from a closed-API feature to standard open-weight architecture for Chinese labs. Dense efficiency through training scale is also accelerating — Qwen 2.5's 18 trillion token training run will likely be exceeded in the next major release cycle.
What this means practically: the capability ceiling at each VRAM tier is a moving target, and it's moving faster than GPU specs are. The VRAM calculator at CraftRigs has been updated to handle MoE active-parameter math. If you're building in the next 90 days, run your model selection through it before locking in GPU specs — the right answer today is probably a tier lower than you'd have picked 12 months ago.
FAQ
Does Qwen 2.5 use MoE architecture?
Open-weight Qwen 2.5 models (0.5B through 72B) are all dense: standard transformer architecture, no expert routing. The MoE variants (Qwen2.5-Turbo, Qwen2.5-Plus) are closed API products only. The first large open-weight Qwen MoE models arrived with Qwen3 in April 2025: the 30B-A3B (3B activated per token) and 235B-A22B (22B activated per token). If you downloaded a Qwen 2.5 model, it's dense. Source: Qwen2.5 Technical Report, arxiv:2412.15115.
What's the best Chinese model for a 16 GB GPU in 2026?
DeepSeek-V2-Lite is the standout MoE option at this tier: 15.7B total parameters, 2.4B activated per token, running in roughly 9-10 GB at Q4_K_M. That leaves 6 GB for KV cache on a 16 GB card, which matters for long-context sessions. For a dense alternative, Qwen 2.5 14B Q4_K_M fits in roughly 9 GB with strong results on coding and instruction tasks. Source: deepseek-ai/DeepSeek-V2-Lite HuggingFace model card.
Can Qwen 2.5 32B run on an RTX 4090?
Yes — Q4_K_M comes in at 19.85 GB (confirmed from bartowski/Qwen2.5-32B-Instruct-GGUF release files), leaving approximately 4 GB for KV cache on a 24 GB GPU. Context lengths up to around 8,000 tokens are workable. At Q3_K_M (15.94 GB), it fits on a 16 GB card but with limited context headroom. These are actual file sizes, not estimates.
Why does Qwen 2.5 7B outperform Llama 3.1 8B at identical VRAM?
Training volume. Qwen 2.5 was trained on 18 trillion tokens versus roughly 6 trillion for Llama 3.1, with heavy emphasis on code and mathematics. HumanEval: Qwen 84.8% versus Llama 72.6%. MATH: 75.5% versus 51.9%. Both models are approximately 4.9 GB at Q4_K_M — the VRAM usage is the same, the training investment isn't. Source: Qwen2.5 Technical Report, arxiv:2412.15115, verified March 2026.
What is MoE and how does it affect VRAM requirements?
Mixture of Experts routes each token to a subset of the model's specialist sub-networks, skipping the rest. DeepSeek-V2-Lite has 64 routed experts plus 2 always-active shared experts, and activates 6 routed experts per token, putting effective compute at 2.4B active parameters despite 15.7B total stored. You still need VRAM or RAM to hold all weights, but active compute shrinks dramatically, inference runs faster than total parameter count implies, and the model fits in the same VRAM tier as a much smaller dense model. Source: deepseek-ai/DeepSeek-V2-Lite model card, HuggingFace.