CraftRigs
Technical Report

Mistral Small 4 Is Free — But Running It Locally Will Cost You $10,000

By Chloe Smith · 7 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

The model is free. The weights are Apache 2.0. You can download Mistral Small 4, modify it, ship it inside a commercial product, and Mistral won't charge you a cent.

The hardware to run it locally? That's the part of the announcement nobody put in the headline.

Depending on how serious you are about inference quality, you're looking at somewhere between $8,000 and $120,000. A dual RTX 5090 workstation lands right around the $10,000 mark — which gets you into the model, at a quantization level that's decent but not the full picture. If you want clean, production-grade BF16 inference, you're shopping H100 clusters.

Mistral Small 4 dropped March 16, 2026. The announcement was genuinely impressive: 119 billion parameters, one model that handles reasoning, vision, and coding, with a reasoning_effort parameter you can flip between fast and deep modes. Apache 2.0. Open weights. The whole package. Mistral called it "Small." Part irony, part statement about MoE efficiency.

But "small" is doing a lot of work in that name. VRAM doesn't care what the marketing department calls it.

What Mistral Small 4 Actually Is

Quick orientation before the math gets heavy.

Mistral Small 4 is a Mixture-of-Experts model: 128 expert networks in the feed-forward layers, 4 of them activated per token. Total parameters: 119B. Active parameters per forward pass: roughly 6.5B. That's why Mistral's throughput claims hold up — you get the knowledge capacity of a 119B model with inference compute closer to a 6.5B dense model. Mistral claims 40% lower latency and 3x the throughput versus Small 3.

The model replaces four separate releases: Mistral Small (instruct), Magistral (reasoning), Pixtral (vision), and Devstral (coding). One download. One deployment. Configurable reasoning depth per request. 256K context window. The NVIDIA partnership through the Nemotron Coalition is why an NVFP4 quantized checkpoint was ready on launch day — that format is co-developed specifically for H100 deployment.

[!INFO] Model at a Glance
119B total / 6.5B active parameters | 128 experts, 4 active per token | 256K context window | Apache 2.0 | Multimodal (text + image input) | MLA attention architecture | Released March 16, 2026

The catch with MoE is the one most guides gloss over. Yes, only 4 experts fire per token. But all 128 expert weight matrices have to be addressable in memory. The router needs to be able to reach any of them. So all 119 billion parameters need to live in VRAM. The compute savings are real. The memory savings aren't.
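
A rough back-of-envelope makes the asymmetry concrete. This is a sketch using the figures above plus the common ~2 FLOPs-per-active-parameter rule of thumb, not a profiling result:

```python
# Sketch of the MoE trade-off described above, using the article's figures.
TOTAL_PARAMS = 119e9    # all 128 experts must be resident in VRAM -> memory scales with this
ACTIVE_PARAMS = 6.5e9   # only 4 experts fire per token -> compute scales with this

weights_bf16_gb = TOTAL_PARAMS * 2 / 1e9   # 2 bytes per parameter at BF16
flops_per_token = 2 * ACTIVE_PARAMS        # rule of thumb: ~2 FLOPs per active parameter per token

print(f"VRAM for BF16 weights: ~{weights_bf16_gb:.0f} GB")
print(f"Compute per token:     ~{flops_per_token / 1e9:.0f} GFLOPs "
      f"(a dense 119B model would need ~{2 * TOTAL_PARAMS / 1e9:.0f})")
```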

The VRAM Math

Here's the calculation. It's not complicated, just inconvenient.

Parameters × bytes per parameter = VRAM floor. Add 10–15% overhead for the KV cache and runtime buffers. That's it.

For Mistral Small 4 at 119B parameters:

  • BF16 / FP16 — 2 bytes per parameter → 238GB base → ~250GB with overhead
  • Q8_0 — 1 byte per parameter → 119GB base → ~130GB with overhead
  • Q6_K — 0.75 bytes → ~89GB → ~100GB with overhead
  • Q5_K_M — 0.625 bytes → ~74GB → ~83GB with overhead
  • Q4_K_M — 0.5 bytes → ~60GB → ~67GB with overhead
  • NVFP4 / MXFP4 — ~0.53 bytes → ~63GB with overhead (Mistral's official quantization format)
  • IQ3_M — 0.375 bytes → ~45GB → ~51GB with overhead
  • IQ2_M — 0.25 bytes → ~30GB → ~36GB with overhead

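Here's the same arithmetic as a minimal sketch. The bytes-per-parameter values are the approximations used in the list above, not exact GGUF bit widths:

```python
# VRAM floor = parameters x bytes per parameter, plus 10-15% for KV cache and runtime buffers.
# Bytes-per-parameter figures are the article's approximations, not exact GGUF bit widths.
PARAMS = 119e9

BYTES_PER_PARAM = {
    "BF16/FP16": 2.0, "Q8_0": 1.0, "Q6_K": 0.75, "Q5_K_M": 0.625,
    "Q4_K_M": 0.5, "NVFP4": 0.53, "IQ3_M": 0.375, "IQ2_M": 0.25,
}

for name, bpp in BYTES_PER_PARAM.items():
    base_gb = PARAMS * bpp / 1e9
    low, high = base_gb * 1.10, base_gb * 1.15
    print(f"{name:10s} {base_gb:6.0f} GB base -> {low:4.0f}-{high:4.0f} GB with overhead")
```
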
One thing worth knowing about MoE quantization specifically: the model tolerates aggressive quantization better than an equivalent dense model. You lose less at Q4 or IQ3 because the expert routing gives the model more ways to compensate. IQ3_M on a MoE like Small 4 is meaningfully different from IQ3_M on a dense 70B. Still not BF16. But the gap is smaller than people expect.

Hardware Comparison Chart

What each quantization level actually requires, and what that hardware costs in March 2026:

  • BF16 / FP16 — ~250GB VRAM → $100K–$120K
  • Q8_0 — ~130GB VRAM → $50K–$60K
  • Q6_K — ~100GB VRAM → $50K–$60K
  • Q5_K_M — ~83GB VRAM → $28K–$32K
  • Q4_K_M — ~67GB VRAM → $25K–$30K (single H100) or $10K–$16K (dual RTX 5090 workstation)
  • NVFP4 / MXFP4 — ~63GB VRAM → $25K–$30K (single H100)
  • IQ3_M — ~51GB VRAM → $8K–$14K
  • IQ2_M — ~36GB VRAM → $4K–$6K

A few things to note. The NVFP4 checkpoint is Mistral's reference format — designed specifically to slot into a single H100 80GB. That's their definition of "locally deployable." An H100 80GB runs $25,000–$30,000. That's one card.

The IQ2_M numbers technically work. Unsloth has GGUF releases available and people on LocalLLaMA are running it on a 4090 with aggressive CPU RAM offloading. Whether the output quality at that compression level is useful for anything serious is a different question.

The RTX 5090 Case

The RTX 5090 launched at $1,999 MSRP. Retail in early 2026 is running $4,000–$4,200 new, with used cards around $3,500. Korean industry leaks are predicting the card hits $5,000 before year end as AI demand absorbs supply.

At 32GB GDDR7, a single RTX 5090 cannot run Mistral Small 4 at any quantization level above IQ2. Not close. 32GB is less than half of what Q4 needs.

Two RTX 5090s give you 64GB combined. That's workable for Q4_K_M — barely. You'd want to offload the KV cache to system RAM and keep context lengths short. The IQ3_M case is more comfortable on dual 5090s: 51GB needed, 64GB available, some breathing room.
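
The fit check is simple enough to script. A sketch with the numbers from this article:

```python
# Does a given quant fit a dual RTX 5090's combined VRAM? Figures from this article.
GPU_VRAM_GB, NUM_GPUS = 32, 2
budget = GPU_VRAM_GB * NUM_GPUS                  # 64 GB combined

q4_base, q4_with_overhead = 60, 67               # Q4_K_M weights vs. weights + KV cache/buffers
iq3_with_overhead = 51                           # IQ3_M with overhead

print(q4_base <= budget)            # True  -> the weights themselves fit
print(q4_with_overhead <= budget)   # False -> KV cache has to spill to system RAM
print(iq3_with_overhead <= budget)  # True  -> IQ3_M fits with breathing room
```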

The real-world all-in cost on a dual RTX 5090 AI workstation — Threadripper PRO CPU, 256GB DDR5, proper chassis, liquid cooling rated for 1,150W of combined GPU TDP — lands between $9,999 and $16,412 depending on configuration. That's where the headline number comes from. It's not a round number. It's the actual listed price on workstation configurations available right now.

Warning

Multi-GPU Consumer Builds Need Stack Verification
Running Mistral Small 4 across multiple RTX 5090s requires vLLM or llama.cpp with tensor parallelism configured. The MLA attention architecture in Small 4 needs specific backend support — vLLM had confirmed bugs on SM 8.6 GPUs (RTX 3090) at launch that caused failed inference. The RTX 5090 is consumer Blackwell (SM 12.0) and fares better, but verify your exact stack before committing the capital. Mistral's reference vLLM command uses --tensor-parallel-size 2, tuned for two H100s.
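
If you go the vLLM route, tensor parallelism across two cards looks roughly like this in the Python API. This is a minimal sketch: the model path is a placeholder (not Mistral's published repo name), and the context length is kept deliberately short to leave KV-cache room on 64GB of combined VRAM:

```python
from vllm import LLM, SamplingParams

# Minimal sketch of two-way tensor parallelism in vLLM's Python API.
llm = LLM(
    model="path/to/mistral-small-4-checkpoint",  # placeholder -- point at your local weights
    tensor_parallel_size=2,                      # split the weights across both GPUs
    max_model_len=16384,                         # conservative context for a dual-5090 budget
)

outputs = llm.generate(
    ["Summarize why MoE models save compute but not VRAM."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```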

The Enterprise Path

If you want clean BF16 inference with no quality compromise, the math is straightforward and brutal. Four H100 80GB cards handle it comfortably. Two cover Q8 with headroom. NVIDIA's own DGX Spark (GB10, 128GB unified memory) handles Q4 through IQ3 with room to spare — there are already threads on the NVIDIA developer forums confirming Mistral Small 4 at NVFP4 running on the Spark.

A practical production-grade local inference server — 2× H100 SXM with shared NVLink fabric, running Q8 at reasonable throughput — lands around $55,000–$70,000 all-in including server chassis, memory, NVMe, and networking. That's the minimum you'd consider for enterprise deployment.

The DGX B200 systems that serious enterprise AI teams actually run start at $300,000+. Mistral Small 4 is not the model you'd deploy on a B200 — you'd save that for training runs and use Small 4 for inference. But the hardware context matters when you're evaluating whether "free model" means anything for your budget.

Tip

The API Break-Even Point
Mistral's API pricing for Small 4 makes self-hosting a losing bet below a certain volume threshold. If you're processing fewer than roughly 400 million tokens per month, cloud inference almost certainly beats a self-hosted setup on pure cost. The math inverts when you have high-volume production workloads, strict data residency requirements, or latency sensitivity that cloud endpoints can't satisfy. Run the numbers for your specific workload before buying hardware.
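
A quick way to run that math yourself. This is a hedged sketch: the API price and operating costs are placeholders you'd swap for real figures, and the hardware number is the dual RTX 5090 workstation estimate from this article:

```python
# Break-even sketch. API price, amortization horizon, and opex are placeholders/assumptions;
# the hardware cost is this article's mid-range dual RTX 5090 workstation figure.
API_PRICE_PER_M_TOKENS = 1.25      # USD per million tokens (placeholder)
HARDWARE_COST = 13_000             # dual RTX 5090 workstation, mid-range config
AMORTIZATION_MONTHS = 36           # assumption
OPEX_PER_MONTH = 150               # power + maintenance (assumption)

self_hosted_monthly = HARDWARE_COST / AMORTIZATION_MONTHS + OPEX_PER_MONTH
break_even_tokens = self_hosted_monthly / API_PRICE_PER_M_TOKENS * 1e6

print(f"Self-hosted: ~${self_hosted_monthly:,.0f}/month")
print(f"Break-even:  ~{break_even_tokens / 1e6:,.0f}M tokens/month")
```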

Who Should Actually Self-Host This

There are three realistic buyer profiles.

Teams with genuine privacy constraints — healthcare, legal, financial services, anyone where data leaving the building is a regulatory problem. Apache 2.0 means full on-premises deployment with no phone-home, no API logs, no third-party data handling. For those organizations, the hardware cost is the cost of compliance. They were already spending money on solutions that did less.

High-volume production workloads. At several billion tokens per day, a $55K inference server amortizes in 6–12 months against API spend. The threshold moves depending on your token costs, but the crossover is real and calculable.

CraftRigs builders who want the most capable open-weight model on consumer hardware. A dual RTX 5090 rig at Q4_K_M is running something that legitimately competes with GPT-OSS and Qwen 3.5 for coding and reasoning tasks — not as fast as H100-class inference, not at full quality, but yours, with no token bill and no rate limits. For how a similar high-VRAM consumer setup compares against dedicated AI hardware, see the Tenstorrent QuietBox 2 vs. dual RTX 5090 comparison.

What doesn't make sense is the IQ2_M-on-a-single-4090 crowd. Technically possible. Quality-wise, you're running a compromised version of a model that was designed for 63GB of clean VRAM. The output isn't nothing, but it's not Mistral Small 4 either.

The Bottom Line

"Small" is branding. 119 billion parameters is not small by any hardware measure that matters.

The Apache 2.0 license is genuinely valuable — it's the right call from Mistral, and it opens up deployment scenarios that closed-weight models block entirely. But "free model" and "free to run" are different things, and the gap between them is somewhere between $8,000 and $30,000 depending on how much quality you're willing to trade.

For most developers: the API makes more sense until your token volume forces the conversation. For builders who want maximum local capability, a dual RTX 5090 workstation gets you into Q4 territory around $10K. Anything above that and you're shopping H100s.

The model is free. The inference isn't.

mistral-small-4 moe vram local-llm rtx-5090 h100 hardware-cost apache-2 2026
