Mistral Small 4 Hype vs. Reality: It's Ranked #54, Not #1

By Charlotte Stewart

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

Mistral's press release for Small 4 used phrases like "state-of-the-art at its size class" and highlighted comparisons against models it beats. That's how product announcements work. But when BenchLM.ai — an independent benchmarking service — ran Mistral Small 4 through their evaluation suite, it placed 54th out of 70 models tested.

That gap between marketing positioning and independent ranking is worth understanding. Mistral Small 4 has real strengths. It also has real weaknesses that the official benchmarks are carefully constructed to avoid. If you're deciding whether to build hardware to run it locally, you should know both.


Quick Summary

  • Mistral Small 4 ranks #54/70 on BenchLM.ai's independent leaderboard — well below Mistral's "best in class" framing
  • Genuine strengths: document understanding (beats GPT-4.1 in that category), multilingual capability, long-context handling
  • Hardware reality: the 119B MoE model needs a multi-GPU setup; a single 24GB card isn't sufficient for real performance

What "Best Small Model" Actually Means

Mistral defines "small" relative to their own lineup. Compared to Mistral Large (123B dense), Mistral Small 4 at 119B MoE with 6B active is efficient — it runs faster and cheaper at inference time because the active parameter count is low.

But the benchmarking community doesn't use Mistral's definitions. It compares against everything: the Qwen2.5 series, Llama 3.3, Gemma 3, DeepSeek R1, Phi-4, Gemini 2.0 Flash. In that broader field, Mistral Small 4's positioning is more complicated.

BenchLM.ai evaluates across six categories:

  1. Reasoning and logic
  2. Mathematics
  3. Code generation
  4. Document understanding
  5. Instruction following
  6. Multilingual capability

Mistral Small 4's performance across these categories is not uniform, which is exactly what gets obscured when Mistral cherry-picks comparison benchmarks.


Where Mistral Small 4 Actually Excels

Document Understanding

This is the genuine standout. Mistral Small 4 placed ahead of GPT-4.1 on document understanding tasks in independent evaluation — a real result on a real benchmark. Document understanding covers tasks like:

  • Extracting structured information from PDFs and scanned documents
  • Answering questions about long-form documents (contracts, reports, research papers)
  • Summarizing across multi-document corpora
  • Table extraction and interpretation

The strength here likely comes from Mistral's training data focus and the model's long native context window. If you're building local pipelines that process documents — invoices, research papers, legal contracts — Mistral Small 4 is genuinely worth evaluating.

Multilingual Capability

Mistral has historically been stronger than most American AI labs on non-English languages, and Mistral Small 4 continues this. It performs well on French, German, Spanish, Italian, Portuguese, and several other European languages. Its lead narrows on Asian languages, but it still outperforms similarly sized models.

For any application serving non-English speakers, this is a real differentiator.

Long Context

The native 128K context window is functional — not just listed as a spec but usable without significant performance degradation at long context lengths. This is rarer than it sounds. Several models advertise long context but struggle to maintain coherence and retrieval accuracy past 32K tokens. Mistral Small 4 holds up better than average.
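One way to check long-context retrieval on your own hardware is a "needle in a haystack" probe: bury a single fact at a chosen depth in filler text and ask the model to retrieve it. The sketch below only builds the prompt. The function name, filler text, and needle are hypothetical, word count is used as a rough stand-in for token count, and sending the prompt to a model is left to whatever runtime you use.

```python
def build_needle_prompt(filler: str, needle: str, question: str,
                        target_words: int, depth: float) -> str:
    """Bury `needle` at a relative `depth` (0.0 = start, 1.0 = end) in filler text."""
    words = filler.split()
    # Repeat the filler until it reaches the target length, then trim.
    repeats = target_words // len(words) + 1
    haystack = (words * repeats)[:target_words]
    # Insert the needle sentence as one unit at the requested depth.
    position = int(len(haystack) * depth)
    haystack.insert(position, needle)
    return " ".join(haystack) + f"\n\nQuestion: {question}\nAnswer:"


# Hypothetical example: a 2,000-word haystack with the fact buried halfway in.
prompt = build_needle_prompt(
    filler="The quarterly report covers routine operational updates.",
    needle="The vault code is 7319.",
    question="What is the vault code?",
    target_words=2000,
    depth=0.5,
)
```

Sweeping `target_words` and `depth` and scoring whether the answer contains the needle gives a rough picture of where retrieval starts to slip.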


Where It Falls Short

Complex Reasoning

BenchLM.ai's reasoning evaluation is where Mistral Small 4 loses ground. Models like DeepSeek R1 32B and Qwen2.5 72B outperform it on multi-step logical reasoning tasks. The gap is meaningful — approximately 8–12 percentage points on their logic evaluation suite.

This matters for local AI use cases like:

  • Agent-based workflows requiring multi-step planning
  • Mathematical problem solving
  • Code debugging requiring chain-of-thought reasoning

Code Generation

Mistral has never been the coding leader, and Small 4 doesn't change that. Qwen2.5-Coder models, DeepSeek-Coder V2, and even older Codestral beat it consistently on code generation benchmarks. If coding is your primary use case, Mistral Small 4 is not the model you should be sizing hardware for.

The Parameter Efficiency Problem

With 119B total parameters and only 6B active, Mistral Small 4 is an unusually sparse MoE. For comparison, Mixtral 8x22B activates 39B of its 141B total, roughly 6.5x more active parameters per forward pass. More active parameters generally mean more capability per token.

Mistral Small 4's 6B active is lean, which helps inference speed but limits per-query capability on complex tasks. You're essentially getting 6B-dense quality on many tasks, inside a larger model that needs much more VRAM to load.
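The trade-off can be put in numbers with a small sketch. It assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token; the function name and dictionary keys are illustrative and do not come from any library.

```python
def moe_profile(total_params_b: float, active_params_b: float) -> dict:
    """Summarize a mixture-of-experts model's sparsity (parameters in billions)."""
    return {
        # Share of the weights that take part in each forward pass.
        "active_fraction": active_params_b / total_params_b,
        # Assumed rule of thumb: ~2 FLOPs per active parameter per token.
        "gflops_per_token": 2 * active_params_b,
        # Every expert must still be resident somewhere, so memory follows the total.
        "weights_to_load_b": total_params_b,
    }


small4 = moe_profile(total_params_b=119, active_params_b=6)
print(f"{small4['active_fraction']:.1%} of weights active per token")
print(f"~{small4['gflops_per_token']} GFLOPs per token, "
      f"{small4['weights_to_load_b']}B parameters to hold in memory")
```

Per-token compute follows the 6B active figure, while the memory footprint follows the full 119B.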


The Hardware Math for Running It Properly

At 119B parameters, Q4 GGUF is approximately 65–70GB. Here's what can and can't handle that:

Notes

  • Heavy quality penalty
  • Not recommended
  • Needs PCIe bandwidth management
  • Better throughput
  • Single-card ideal
  • More headroom

The minimum viable config for running Mistral Small 4 at a quality level that justifies the hardware is two used RTX 3090s (~$1,000). Even then, 48GB of combined VRAM is short of the 65–70GB Q4 file, so part of the model still spills to system RAM. Below that, you're running at quantizations that erase most of the model's quality advantages.
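The size estimate and the fit check are simple arithmetic, sketched below. Two inputs are assumptions rather than figures from the benchmarks: an average of 4.5 bits per weight for a Q4 file (4-bit values plus per-block scales) and 2GB per card reserved for KV cache and buffers. Both function names are hypothetical.

```python
def quantized_size_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight file size in GB (1 GB = 1e9 bytes)."""
    return total_params_b * bits_per_weight / 8


def offload_fraction(model_gb: float, vram_gb_per_card: float, cards: int,
                     overhead_gb: float = 2.0) -> float:
    """Fraction of the weights that will not fit in VRAM.

    overhead_gb is an assumed per-card reserve for KV cache and buffers.
    """
    usable_gb = max(cards * (vram_gb_per_card - overhead_gb), 0.0)
    return max(0.0, 1 - usable_gb / model_gb)


# 119B parameters at an assumed 4.5 bits per weight.
size_gb = quantized_size_gb(119, 4.5)
print(f"Estimated Q4 size: {size_gb:.1f} GB")

# How much of the model spills out of VRAM on 1, 2, and 4 cards of 24GB each.
for cards in (1, 2, 4):
    spill = offload_fraction(size_gb, vram_gb_per_card=24, cards=cards)
    print(f"{cards} x 24GB: {spill:.0%} of weights offloaded")
```

Under these assumptions the estimate lands inside the 65–70GB range, and a 24GB-per-card setup stops offloading only once there are enough cards to cover the whole file plus overhead.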


What to Do With This Information

If you're building hardware specifically to run the best available open model, Mistral Small 4 is not the answer unless your workload is document understanding or multilingual. For those use cases, it's genuinely excellent.

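For general-purpose use: Qwen2.5 72B outperforms Mistral Small 4 on most tasks and takes about 43GB at Q4, roughly two-thirds the size of Mistral Small 4's Q4 file. That still exceeds the 24GB on a single RTX 4090 (~$1,600), so expect partial CPU offloading or a second card, but the hardware requirement is lower and the benchmark performance is higher.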

Mistral Small 4 is a good model that's been marketed as a great model. Know the difference before you spec a multi-GPU rig around it.


FAQ

What does Mistral Small 4 actually rank on independent benchmarks?

On BenchLM.ai's independent leaderboard, Mistral Small 4 ranks #54 out of 70 models. It performs well on document understanding and multilingual tasks but falls behind on complex reasoning and code generation compared to models like Qwen2.5 72B and DeepSeek R1.

What hardware do you need to run Mistral Small 4?

Mistral Small 4 is a 119B total/6B active MoE model. The Q4 GGUF is approximately 65–70GB. You need a multi-GPU setup to run it without heavy CPU offloading: at minimum two RTX 3090s (48GB total) or equivalent. A single 24GB GPU can run it with aggressive quantization, but quality suffers.

Is Mistral Small 4 worth running locally vs smaller alternatives?

For document understanding tasks, yes. Mistral Small 4's category win over GPT-4.1 on that benchmark is real. For general reasoning and coding, a well-quantized Qwen2.5 32B or DeepSeek R1 32B on a single RTX 3090 will likely outperform Mistral Small 4 at a fraction of the hardware cost.
