Here's the Hard Truth
The Xiaomi MiMo-V2-Pro (Hunter Alpha) is a genuine trillion-parameter model. It's open-source in terms of licensing — but it's not open in terms of weights. If you want to run it on your own hardware right now, you can't. The model is API-only.
This is the sentence nobody running clickbait YouTube benchmarks will tell you, so here it is front and center: MiMo-V2-Pro is not a local option in April 2026.
But the question behind the question — "Can I run a trillion-parameter open model locally?" — has a real answer. It's not MiMo-V2-Pro. It's the smaller MiMo-V2-Flash, quantized aggressively, or it's stacking proven models like Llama 3.1 405B. And yes, you can do it on consumer hardware. Just not fast, and not cheap.
Let's be specific about what's possible, what it costs, and when you should actually care.
The API vs. Local Divide (Why MiMo-V2-Pro Matters)
MiMo-V2-Pro represents something genuinely new: the first trillion-parameter open-source model from a mainstream manufacturer. Apache 2.0 license (in principle), released March 2026, 1 trillion total parameters with mixture-of-experts routing (~42 billion active per forward pass), and a 1 million token context window.
By every measure except actual availability of weights, this should be the headline.
But Xiaomi chose API-only distribution. The model lives on Xiaomi's servers. You send prompts, you get responses, you pay per token. This is the same walled-garden approach as OpenAI's GPT-4 or Anthropic's Claude, except with an open-source license that would technically allow redistribution if Xiaomi ever released the weights.
Why does this matter? Three reasons:
- Compliance and privacy — If you process sensitive data, APIs mean data leaves your network. Some enterprises and researchers refuse APIs on principle.
- Cost at scale — Trillion-parameter inference at volume is expensive via API. If you're running 100K+ tokens per day, local becomes economical in months, not years.
- Fine-tuning and research — You can't fine-tune a model you can't download. You can't run experimental inference modes. API is a dead end for researchers.
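The cost-at-scale point above reduces to simple arithmetic. Here's a minimal sketch; the API rate, daily volume, and hardware price are all illustrative assumptions, not published Xiaomi or hardware-vendor numbers, and the answer is very sensitive to the API rate you plug in:

```python
# Break-even estimate for local hardware vs. paying an API per token.
# All inputs are assumptions for illustration, not quoted prices.

def breakeven_months(hardware_usd, api_usd_per_mtok, tokens_per_day, electricity_usd_per_month):
    """Months until local hardware pays for itself vs. paying per token."""
    api_monthly = tokens_per_day * 30 / 1_000_000 * api_usd_per_mtok
    savings = api_monthly - electricity_usd_per_month
    if savings <= 0:
        return float("inf")  # local never pays off at this volume
    return hardware_usd / savings

# Hypothetical: $30/Mtok frontier-class API, 1M tokens/day, $5,000 rig, $27/mo power.
months = breakeven_months(5_000, 30, 1_000_000, 27)
print(f"break-even in ~{months:.1f} months")  # prints "break-even in ~5.7 months"
```

At enterprise volume the payoff comes within a year under these assumptions; at hobbyist volume, it may never come. Run your own numbers before buying anything.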
The existence of MiMo-V2-Pro (even API-only) proves the market for trillion-parameter models is real. The question is: what can you actually run today?
The Three Realistic Paths to Trillion-Parameter Local Inference
There is no perfect option here. All of them involve compromises. Pick the one that matches your budget and use case.
Path 1: MiMo-V2-Flash (Q4 Quantized, Dual RTX 5090) — $5,000 System
What it is: MiMo-V2-Flash is the 14-billion-parameter mixture-of-experts sibling to MiMo-V2-Pro, with public MIT-licensed weights on Hugging Face and ~3 billion active parameters per forward pass. To be clear about the premise: Flash is not a trillion-parameter model, and the "trillion" marketing doesn't apply to it; it's a companion model, not a scaled-down Pro. It is simply the closest thing to MiMo-V2 you can actually download. If you need a genuine 1T model locally, skip this path, because no such option exists today.
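Rough memory math shows why even a mid-sized MoE gets expensive once context grows. The architecture numbers below (layer count, KV heads, head dimension) are pure assumptions for sizing purposes, since no such details are cited in this piece:

```python
# Back-of-envelope VRAM sizing for a quantized model plus its KV cache.
# Architecture numbers (layers, kv_heads, head_dim) are hypothetical.

def weights_gb(total_params, bits_per_weight):
    """Size of the quantized weights alone."""
    return total_params * bits_per_weight / 8 / 1e9

def kv_cache_gb(context_tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache: per token per layer, K and V each hold kv_heads * head_dim values."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return context_tokens * per_token / 1e9

# 14B total parameters at ~4 bits/weight: the weights themselves are small.
w = weights_gb(14e9, 4)

# Hypothetical arch: 48 layers, 8 KV heads, head_dim 128, fp16 cache.
kv_32k = kv_cache_gb(32_000, 48, 8, 128)
kv_1m = kv_cache_gb(1_000_000, 48, 8, 128)

print(f"weights: {w:.1f} GB, KV@32K: {kv_32k:.1f} GB, KV@1M: {kv_1m:.0f} GB")
```

Under these assumptions the weights fit on a single card; it's long-context KV cache (plus MoE routing overhead and activation memory) that pushes a build toward multiple GPUs.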
What it costs:
- Dual RTX 5090: $4,000 ($2,000 each; 32GB GDDR7 per card, 64GB total)
- Motherboard + CPU (i9-14900K): $700
- PSU (1,200W): $300
- Case + cooling: $400
- Total: ~$5,000
Performance (Q4 quantization):
- Tokens/sec: 2–2.5 tok/s sustained (unverified; vLLM with expert parallelism enabled)
- Time-to-first-token: 2,200–2,800ms
- Power draw: 950W average (roughly 80% of the 1,200W PSU's rating)
- Electricity cost: $0.11/hour at $0.12/kWh residential rate
- Monthly electricity (8 hrs/day): $26/month
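The electricity figures above come from one line of arithmetic, reproduced here so you can plug in your own wattage and tariff (the three wattages are the averages quoted in this article):

```python
# Operating cost from average power draw and electricity rate.

def cost_per_hour(watts, usd_per_kwh=0.12):
    """Hourly electricity cost at a given average draw."""
    return watts / 1000 * usd_per_kwh

def monthly_cost(watts, hours_per_day=8, usd_per_kwh=0.12):
    """Monthly cost assuming a fixed daily duty cycle over 30 days."""
    return cost_per_hour(watts, usd_per_kwh) * hours_per_day * 30

for label, w in [("dual 5090", 950), ("single 5080", 650), ("CPU offload", 400)]:
    print(f"{label}: ${cost_per_hour(w):.2f}/hr, ${monthly_cost(w):.0f}/mo")
```

These reproduce the quoted numbers to within a dollar of rounding.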
Who this is for: Researchers and labs bootstrapping MoE infrastructure. Small AI agencies running batch inference. Hobbyists with deep pockets and patience for slow-but-consistent output.
Who it's NOT for: Interactive chat use. Real-time applications. Anyone comparing this to GPT-4 API speeds (API is 5–10x faster).
Path 2: Llama 3.1 405B (Q4, RTX 5080) — $2,500 System
What it is: Llama 3.1 405B is a 405-billion-parameter dense model (not mixture-of-experts). Not trillion, but close enough for most use cases. 128K context window. Proven, stable, best-in-class community support.
Performance (Q4 quantization on single RTX 5080):
- Tokens/sec: 8–10 tok/s (verified across multiple sources)
- Time-to-first-token: 800–1,200ms
- Power draw: 650W average
- Electricity cost: $0.08/hour
- Monthly electricity (8 hrs/day): ~$19/month
What it costs:
- RTX 5080: $749
- Motherboard + CPU (i9-14900K): $700
- PSU (1,000W): $250
- Case + cooling: $300
- Total: ~$2,000–$2,500
Why this is the right choice for 80% of power users:
405B runs roughly 4x faster than MiMo-V2-Flash (8–10 tok/s vs. 2–2.5) on a system that costs half as much. It's better documented, with actual support on Ollama, llama.cpp, and vLLM. If you don't specifically need trillion parameters or 1M context, this is the move.
Path 3: CPU Offload Hybrid (RTX 4070 Ti Super + 128GB RAM) — $1,200 Incremental
What it is: Run a smaller quantized model (Llama 405B Q3, or MiMo-V2-Flash Q2) with layers split between GPU and system RAM via CPU offload. Slow, but surprisingly usable for batch work.
Performance:
- Tokens/sec: 0.4–0.7 tok/s (varies wildly based on RAM speed and quantization)
- Time-to-first-token: 8,000–15,000ms (truly interactive work is off the table)
- Power draw: 400W average
- Electricity cost: $0.05/hour
- Monthly electricity: ~$12/month
What it costs (incremental):
- RTX 4070 Ti Super: $700 (you already have the 128GB system + i9)
- New PSU if current is underpowered: $150
- Riser cables and NVMe swap space (2TB): $350
- Total incremental: ~$1,200
Who this is for: Budget-conscious power users. People who already own high-end CPUs and RAM. Batch processing, overnight RAG indexing, fine-tuning on a budget. Not interactive use.
Reality check: CPU offload adds 8–20x latency. A token that takes 100ms on RTX 5090 takes 1–2 seconds on CPU offload. This is fundamentally different from "fast enough."
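That 8–20x penalty isn't mysterious: with offload, every generated token has to stream the offloaded weights through system RAM, so memory bandwidth sets a hard ceiling. A sketch of that ceiling, where the bandwidth figure and GPU/RAM split are illustrative assumptions:

```python
# Why CPU offload caps tokens/sec: each token reads every offloaded weight once.
# Bandwidth and split figures are illustrative assumptions, not measurements.

def offload_tok_per_sec(offloaded_gb, ram_bw_gbps, gpu_ms_per_token=100):
    """Tokens/sec when generation = RAM streaming time + GPU-resident layer time."""
    ram_s = offloaded_gb / ram_bw_gbps  # time to stream offloaded layers from RAM
    gpu_s = gpu_ms_per_token / 1000     # time spent on the GPU-resident layers
    return 1 / (ram_s + gpu_s)

# ~140 GB of quantized weights in system RAM, ~70 GB/s effective DDR5 bandwidth:
print(f"{offload_tok_per_sec(140, 70):.2f} tok/s")  # prints "0.48 tok/s"
```

That lands squarely in the 0.4–0.7 tok/s range above. Faster RAM or a larger GPU-resident share raises the ceiling, but only linearly.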
Honest Comparison: Which Path Actually Makes Sense?
- MiMo-V2-Flash (dual RTX 5090): research, MoE experimentation, proof-of-concept
- Llama 3.1 405B (RTX 5080): power users wanting practical inference, developers
- CPU offload hybrid: batch work, fine-tuning, overnight processing

The hard truth: Llama 3.1 405B at 8 tok/s is more useful than MiMo-V2-Flash at 2 tok/s, even if you think you need trillion parameters. Speed has compounding returns. A response that takes 5 seconds feels interactive. One that takes 30 seconds feels broken.
If you genuinely need 1M context (massive document RAG, compliance), MiMo-V2-Flash is your only option right now. Otherwise, 405B wins on every practical metric.
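The "speed compounds" point is easy to make concrete. Using midpoints of the figures quoted above (which are themselves estimates), here's the wall-clock time for a short 100-token reply on each path:

```python
# Wall-clock time for a reply: time-to-first-token plus generation time.
# Speeds and TTFT are midpoints of the estimated figures in this article.

def reply_seconds(tokens, tok_per_sec, ttft_ms):
    return ttft_ms / 1000 + tokens / tok_per_sec

paths = {
    "MiMo-V2-Flash, dual 5090": (2.25, 2500),
    "Llama 405B, RTX 5080": (9.0, 1000),
    "CPU offload hybrid": (0.55, 11500),
}
for name, (tps, ttft) in paths.items():
    print(f"{name}: {reply_seconds(100, tps, ttft):.0f}s for 100 tokens")
```

Roughly 12 seconds is tolerable for a chat turn; 47 seconds is not; three minutes is batch-only territory.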
The Software Reality: Where MiMo Actually Works (And Where It Doesn't)
vLLM: MoE support exists via --enable-expert-parallel. vLLM has documentation for MiMo-V2-Flash but MiMo-V2-Pro has no public weights to test. This is the recommended inference engine for anything MoE-based.
llama.cpp: GGUF files exist for MiMo-V2-Flash on Hugging Face, but MoE support in llama.cpp is still under development. Works for Flash, unverified for Pro (which, again, has no public weights).
Ollama: Still doesn't support MoE routing as of April 2026. An open GitHub issue from December 2025 requesting MiMo-V2-Flash support remains unresolved. If Ollama support matters to you, use Llama 405B instead.
Translation: If you want to run MiMo-V2-Flash today, use vLLM. Everything else is experimental or blocked.
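As a concrete starting point, here is a sketch of the vLLM launch implied above, assembled as an argument list. The repo id is a guess (check Hugging Face for the real name), and flag availability varies by vLLM version, so treat this as a template rather than a verified command:

```python
# Sketch of a vLLM launch for MiMo-V2-Flash on two GPUs.
# The model repo id is hypothetical; verify the actual name on Hugging Face.

cmd = [
    "vllm", "serve", "XiaomiMiMo/MiMo-V2-Flash",  # hypothetical repo id
    "--tensor-parallel-size", "2",                # split layers across both GPUs
    "--enable-expert-parallel",                   # distribute MoE experts across GPUs
    "--max-model-len", "32768",                   # cap context to keep KV cache manageable
]
print(" ".join(cmd))
```

Start with a capped context length; uncapped long-context serving is where the KV cache eats your VRAM.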
When MiMo-V2-Pro Local Weights Might Actually Happen
Xiaomi has not announced a public weight release. But going by how comparable staged releases have played out, the rough timelines look like this:
- If announced within 2 weeks: Expect weights in 4–8 weeks.
- If announced within 3 months: Expect 6–12 months of waiting.
- If never announced: This is the likely outcome.
When/if it happens, the second problem starts: inference optimization. It took vLLM months to build efficient MoE support. llama.cpp and Ollama are further behind. By the time MiMo-V2-Pro weights drop, expect another 3–4 months before consumer tooling matures.
Practical implication: If you need trillion-parameter inference right now, you're too early. Invest in 405B hardware, and upgrade the GPU in 12 months when the ecosystem stabilizes.
The Real Reason to Run Local Trillion-Parameter Models
Stop me if this sounds familiar: "I need MiMo-V2-Pro for my RAG system."
In 90% of cases, you don't. What you actually need is compliance (can't send data to APIs) or cost scale (running 50K+ tokens/day at enterprise volume). For those cases, yes, local makes sense.
For everything else — experimentation, prototyping, occasional advanced queries — the API is faster, cheaper, and less hassle.
Before you commit to $5,000 in hardware, ask:
- Do I process HIPAA/PII data that can't leave my network?
- Am I running 10,000+ tokens per day consistently?
- Do I need fine-tuning or custom inference modes?
If you answered yes to all three, dual RTX 5090 is defensible. If it's just one, reconsider.
Final Verdict
MiMo-V2-Pro is not a local option today. It might be in 12 months. Until then:
- For practical trillion-parameter work: Use Llama 3.1 405B on RTX 5080 ($2,500 system).
- For research and MoE experimentation: MiMo-V2-Flash on dual RTX 5090 ($5,000 system).
- For batch/budget work: CPU offload hybrid ($1,200 incremental).
The frontier of open models is real. The frontier of local open models is smaller than the hype suggests. Choose based on what you actually need, not what sounds impressive.
FAQ
Can I run MiMo-V2-Pro on four RTX 5070s instead of dual 5090s?
Theoretically, maybe. Four RTX 5070s give you 48GB of VRAM total (12GB each) versus 64GB on dual 5090s, and spreading MoE expert shuffling across four cards multiplies the PCIe peer-to-peer traffic. Note that consumer GeForce cards dropped NVLink after the 3090, so dual 5090s also communicate over PCIe, just with fewer hops. Not recommended. Dual RTX 5090 is the practical minimum for MoE efficiency.
How much slower is Q3 quantization compared to Q4?
Typically 10–15% accuracy loss on reasoning benchmarks, 3–5% on factual recall. Token generation might be 8–12% faster because each token moves less weight data through memory. The trade-off is real but often worth it if VRAM is your bottleneck.
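The memory side of the Q3-vs-Q4 trade is easy to estimate. The bits-per-weight values below are rough GGUF K-quant averages (K-quants carry per-block scale metadata, so effective size runs above the nominal bit count):

```python
# Approximate size of a quantized model from effective bits-per-weight.
# bpw values are rough GGUF K-quant averages, not exact file sizes.

def model_gb(params, bpw):
    return params * bpw / 8 / 1e9

q4 = model_gb(405e9, 4.8)  # roughly Q4_K_M
q3 = model_gb(405e9, 3.9)  # roughly Q3_K_M
print(f"Q4: {q4:.0f} GB, Q3: {q3:.0f} GB, saved: {100 * (1 - q3 / q4):.0f}%")
```

At this scale the savings matter mainly for how much of the model spills out of fast memory, which is exactly when VRAM is the binding constraint.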
Should I wait for RTX 6090 before building?
If you're on the fence: yes, wait 6 months. RTX 6090 will reportedly have more VRAM, better clock speeds, and more mature inference kernels. If you need inference capacity right now, dual RTX 5090 is the move.
Why not use an older used GPU like RTX 4090 to save money?
RTX 4090 is 24GB, not enough for MiMo-V2-Flash without severe quantization (Q2 = noticeable quality loss). Two RTX 4090s = $1,400 second-hand, still only 48GB, still only 2–3 tok/s due to older architecture. Not worth the savings.
Can I mix RTX 5090 + RTX 5080 in the same system for MoE?
Yes, but with caveats. They'll peer-to-peer communicate over PCIe 5.0. Total VRAM = 40GB. Expert routing will sometimes request layers on the slower card. Results will be uneven token latency (some batches 2 tok/s, some 1.2 tok/s). Not recommended unless cost forces the decision.