The Uncomfortable Truth About MiMo-V2-Flash on Dual RTX 3090s
TL;DR: MiMo-V2-Flash doesn't fit on dual RTX 3090s. The Q4_K_M quantization is 187 GB; you have 48 GB. Before you consider CPU offloading (which cuts speed to 3–5 tok/s), read why other sources get this wrong and which hardware actually delivers the promised 15+ tok/s.
The hype around MiMo-V2-Flash is justified—it's a clever 309B parameter model with only 15B active parameters, making it genuinely useful for reasoning tasks. The MIT license is a bonus. But there's a problem floating around online: claims that it runs well on dual RTX 3090s. Those claims are wrong, and I want to explain why so you don't waste a week chasing an impossible setup.
VRAM Math: Where the Numbers Break Down
Let's start with what people claim versus what the hardware actually supports.
The claim: "Q4_K_M loads to ~40 GB."
Reality: The Q4_K_M GGUF file is 187.14 GB on disk. It loads to approximately 187 GB in VRAM—not 40 GB.
Here's why this matters. A GGUF file is already quantized. When you load it into VRAM, the file size translates almost directly to VRAM usage. The bartowski Q4_K_M release on Hugging Face is 187 GB. That's not negotiable.
Your dual RTX 3090s have:
- GPU 0: 24 GB
- GPU 1: 24 GB
- Total: 48 GB
48 GB < 187 GB. By a factor of 3.9x.
That's not a gap you can close with optimization tricks. It's a hard constraint.
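The arithmetic is simple enough to check yourself. A minimal sketch (the only inputs are the published file size and per-GPU VRAM; a GGUF file loads at roughly its on-disk size):

```python
# Back-of-envelope VRAM check: a GGUF file loads at roughly its on-disk size.
def fits_in_vram(model_gb: float, vram_per_gpu_gb: float, num_gpus: int,
                 overhead_gb: float = 2.0) -> tuple[bool, float]:
    """Return (fits, shortfall_factor). overhead_gb reserves per-GPU room
    for KV cache, activations, and the CUDA context."""
    usable = num_gpus * (vram_per_gpu_gb - overhead_gb)
    return usable >= model_gb, model_gb / usable

# Raw file size vs. raw VRAM (no overhead reserved): the 3.9x gap above.
print(fits_in_vram(187.14, 24, 2, overhead_gb=0.0))
# With realistic per-GPU overhead reserved, the gap only widens.
print(fits_in_vram(187.14, 24, 2))
```

Reserving even 2 GB per GPU for KV cache and runtime state pushes the shortfall past 4x.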
What People Actually Mean by "Dual RTX 3090"
When benchmarks claim MiMo-V2-Flash runs at 20+ tok/s on dual RTX 3090s, one of three things is happening:
1. They're testing a different quantization. Even Q3_K_M is ~141 GB, so anything that truly fits in 48 GB would be an extreme sub-2-bit variant; Q3-class quants already give up 8–12% reasoning accuracy vs. Q4. If that's the case, they should disclose it.
2. They're measuring with CPU offloading and not mentioning the speed hit. You CAN run a 187 GB model across 48 GB of VRAM plus system RAM, but the GPU stalls constantly waiting for PCIe transfers. Real-world throughput drops to 3–5 tok/s, not 20.
3. They're using vLLM or llama.cpp with aggressive context windowing. Running batch size 1 with a 512-token max context instead of 2K reduces KV cache pressure, but that's a workaround for one narrow use case, not a general solution.
None of these scenarios match the headline "MiMo-V2-Flash on dual RTX 3090s at 20+ tok/s."
The Physics of Tensor Parallelism
Here's how tensor parallelism is supposed to work:
- Split model weights evenly across GPUs
- Each GPU holds ~50% of weights
- During inference, GPUs communicate via NVLink or PCIe
- All computation stays on GPU; speed is bottlenecked by arithmetic, not memory transfers
For MiMo-V2-Flash at 187 GB:
- GPU 0 would hold ~94 GB
- GPU 1 would hold ~94 GB
- Both exceed their VRAM by 3.9x
You can't use tensor parallelism. The weights literally don't fit.
The only fallback is CPU offloading, where:
- System RAM holds the overflow weights: about 139 GB of them, so you'd need roughly 160 GB of RAM, far beyond a typical 32–64 GB build
- The GPU pulls weights up from RAM as they're needed
- PCIe bandwidth becomes the bottleneck (x16 is ~32 GB/s on PCIe 4.0, x8 is ~16 GB/s)
- Effective throughput collapses
Real-world result: 3–5 tok/s sustained, with pauses while weights transfer.
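The 3–5 tok/s figure falls out of the arithmetic, not measurement folklore. Each generated token touches the ~15B active parameters, about 8.4 GB at Q4_K_M's ~4.5 bits per weight; whatever fraction isn't resident in VRAM must cross PCIe, and bandwidth divides directly into throughput. A rough upper-bound sketch (the uniform-streaming assumption and nominal bandwidth figures are simplifications):

```python
# Decode-throughput ceiling when non-resident weights stream over PCIe.
# Simplification: the streamed share per token equals the non-resident
# share of the file; GPU compute time and transfer latency are ignored.
def offload_ceiling_tok_s(active_params_b: float, bits_per_weight: float,
                          pcie_gb_s: float, resident_fraction: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    streamed_bytes = bytes_per_token * (1 - resident_fraction)
    return pcie_gb_s * 1e9 / streamed_bytes

# 48 GB of a 187 GB file resident (~26%), PCIe 4.0 x16 at ~32 GB/s:
print(round(offload_ceiling_tok_s(15, 4.5, 32, 48 / 187), 1))
# On x8 lanes (~16 GB/s) the ceiling halves:
print(round(offload_ceiling_tok_s(15, 4.5, 16, 48 / 187), 1))
```

The x16 ceiling works out to about 5 tok/s; real runs land below it, hence the 3–5 tok/s range.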
Which Quantizations Actually Fit?
Let me give you the honest breakdown:
| Quantization | Size on disk | Fits in 48 GB? | Expected tok/s |
|---|---|---|---|
| Q4_K_M | 187 GB | No | — (with CPU offload: 3–5) |
| Q3_K_M | ~141 GB | No | — |

None of them fit. Not even Q3_K_M.
The smallest viable quantization for dual RTX 3090s would need to be around 24 GB per GPU (48 GB total). That's closer to 2-bit or 1.5-bit, which don't exist in published GGUF format and would reduce quality below usability.
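That conclusion is easy to verify: divide the VRAM pool by the parameter count to get a bits-per-weight budget, and compare it to what published quants actually use.

```python
# Bits-per-weight budget for holding an N-billion-parameter model in a pool.
def bits_per_weight_budget(pool_gb: float, params_b: float) -> float:
    return pool_gb * 8 / params_b  # GB -> gigabits, over billions of weights

print(round(bits_per_weight_budget(48, 309), 2))      # dual RTX 3090 budget
print(round(bits_per_weight_budget(187.14, 309), 2))  # what Q4_K_M actually uses
```

A ~1.2-bit budget, before reserving anything for KV cache, is below even the most aggressive quants anyone publishes, while Q4_K_M works out to roughly 4.8 bits per weight.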
What Hardware Actually Runs MiMo-V2-Flash?
If you want 15+ tok/s at Q4_K_M, here's what you need. One caveat up front: holding the entire 187 GB file in VRAM takes a ~190 GB pool, so every tier below keeps as much of the model resident as it can and offloads the rest; that resident fraction is why throughput scales with total VRAM.
Dual RTX 5090 (~$3,600 used market, April 2026)
- 32 GB × 2 = 64 GB combined
- Roughly a third of the Q4 file stays resident; the rest streams from system RAM, with PCIe 5.0 and the higher resident fraction doing the work vs. dual 3090s
- Expected: 18–22 tok/s
- This is the consumer sweet spot
Dual H100 (80 GB) (~$20,000+ used, enterprise)
- 80 GB × 2 = 160 GB combined
- Covers about 85% of the Q4 file; only a small remainder offloads, or drop one quant step for a full in-VRAM fit
- Expected: 25–35 tok/s
- Overkill for consumer use
Single RTX 6000 Ada (~$7,000, professional card)
- 48 GB of VRAM, the same total as dual RTX 3090s, so the full Q4 file doesn't fit here either
- The card buys ECC and single-slot reliability, not capacity; it's still in the offload regime
- Expected: closer to the 3–5 tok/s offload ceiling than to 15
- Better for production; worse for tinkering
Single H100 (~$15,000)
- 80 GB
- Holds under half of the Q4 file; the rest offloads over the host link
- Expected: 22–28 tok/s
- Professional single-GPU alternative
The hard truth: MiMo-V2-Flash is an enterprise-class model. Dual RTX 3090s are not enterprise hardware. They're excellent for 30B–70B models, but 300B+ is out of reach without a significant hardware jump.
The Real Value in MiMo-V2-Flash (And Where It Fits)
This isn't a reason to skip the model entirely. MiMo-V2-Flash is genuinely clever:
- 15B active parameters means per-token inference compute is comparable to a mid-size (~15B) dense model, not a 309B one, for most workloads
- MIT license means you can run it in production without API dependencies
- Reasoning quality rivals much larger dense models on math, coding, and multi-step logic
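The low-cost claim above is quantifiable: decode compute scales with active parameters (the standard approximation is ~2 FLOPs per parameter per token), so the MoE prices like a mid-size dense model despite its 309B total. A sketch under that approximation:

```python
# Approximate decode cost per token: ~2 FLOPs per *active* parameter.
# (Standard approximation; ignores attention FLOPs and bandwidth effects.)
def decode_gflops_per_token(active_params_b: float) -> float:
    return 2.0 * active_params_b  # params in billions -> GFLOPs per token

moe = decode_gflops_per_token(15)     # only the routed experts run
dense = decode_gflops_per_token(309)  # a hypothetical dense 309B peer
print(round(dense / moe, 1))          # per-token compute advantage of the MoE
```

Per token, the MoE does about 1/20th the matrix math of a dense model its size; the catch, as the rest of this article argues, is that all 309B parameters still have to live somewhere.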
If you have the hardware (or can upgrade), it's worth the investment. But the practical deployment tiers are:
- Home lab / hobby: Stick with Llama 3.1 70B, Qwen 72B, or Mixtral 8×7B on dual RTX 3090s. You get 95% of the reasoning capability at 1/4 the hardware cost.
- Professional / small team: Upgrade to dual RTX 5090 and run MiMo-V2-Flash Q4. The 3–5x speed gain vs. CPU offloading justifies the hardware cost in ~6 months of heavy use.
- Production SLA / multi-user: Go H100 dual or better. MiMo-V2-Flash at 25+ tok/s, concurrent requests, fine-tuning support—justify the budget against API costs.
FAQ: Why This Matters
Q: Can I just add more system RAM and use CPU offloading?
A: Yes, technically. But 3–5 tok/s is painful for interactive use: expect roughly 40–70 seconds of generation per 200-token response, plus prompt-processing time on top. At that point, ChatGPT API ($0.003 per 1K tokens) becomes cheaper and faster.
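The latency is just tokens divided by throughput, and you can check it in one line (generation only; prefill pushes the wall-clock time higher still):

```python
def response_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Generation time for a reply, excluding prompt processing (prefill)."""
    return output_tokens / tok_per_s

print(round(response_seconds(200, 5)))  # best case at 5 tok/s
print(round(response_seconds(200, 3)))  # worst case at 3 tok/s
```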
Q: What about using just one RTX 3090 and Q3_K_M?
A: Q3_K_M is still 141 GB. You'd need ~141 GB of RAM plus the GPU, which is CPU offloading again. Realistically, drop to Llama 3.1 8B (~6 GB at Q5) or Qwen 14B (~10 GB at Q5) and get snappier responses.
Q: Is the MIT license actually safe?
A: Yes, MIT is permissive and production-safe. That's not the constraint. The constraint is physics.
Q: Why do YouTube benchmarks show 20+ tok/s?
A: Benchmarks are often:
- Testing older/smaller quantizations
- Using different models (Llama 3.1 70B, not MiMo-V2-Flash)
- Including CPU offload overhead in measurement without saying so
- Measuring prefill speed (tokens/sec while processing input), not generation speed (tokens/sec of output)
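If you benchmark yourself, you can avoid the prefill/generation conflation by starting the clock at the first generated token; everything before it is prompt processing. A minimal sketch that works with any token iterator (the stream object is a stand-in, not any specific client API):

```python
import time

def decode_tok_per_s(stream) -> float:
    """Decode-only throughput: the clock starts at the first yielded token,
    so prompt-processing (prefill) time is excluded from the measurement."""
    start = None
    count = 0
    for _token in stream:
        count += 1
        if start is None:
            start = time.perf_counter()
    if start is None or count < 2:
        return 0.0
    # count - 1 inter-token intervals have elapsed since the first token
    return (count - 1) / (time.perf_counter() - start)

# Demo with a simulated ~20 tok/s stream (0.05 s per token):
def fake_stream(n: int, delay: float):
    for i in range(n):
        time.sleep(delay)
        yield i

rate = decode_tok_per_s(fake_stream(10, 0.05))
```

Comparing this number against the naive total-tokens-over-total-time figure makes the prefill inflation obvious on long prompts.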
Always verify VRAM requirements. A 187 GB model on 48 GB hardware is not viable at production speed.
The Honest Verdict
MiMo-V2-Flash is a powerful model. It's not ready for dual RTX 3090s. If you own that hardware, you already have excellent options—Llama 3.1 70B, Qwen 72B, Mixtral 8×7B—that run beautifully at 15–25 tok/s. They're 95% as smart for 1/4 the hardware complexity.
If you want to run MiMo-V2-Flash, be honest about the upgrade: dual RTX 5090 ($3,600), dual H100 ($20K+), or single H100 ($15K). Each of these unlocks the model's full potential. Anything less is a compromise that isn't worth the effort.
The model is MIT-licensed and real. The hardware requirements are too. Don't let benchmark hype override VRAM math.