MiniMax M2.5 Local: The 230B Model That Demands Multi-GPU

By Charlotte Stewart 5 min read

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.

MiniMax M2.5 dropped last week with three numbers worth noting: 230B total parameters, 10B active parameters (mixture-of-experts), and 80.2% on SWE-Bench Verified. For context, that SWE-Bench score puts it above GPT-4o and well above most open-weight models released to date.

The license is MIT. The weights are publicly available. And the full GGUF file weighs 101GB.

This is not a "run it on your gaming PC" situation. M2.5 is an MoE model, which means only 10B parameters activate per forward pass — but all 230B parameters need to be in memory because you don't know in advance which experts will fire. That 101GB isn't a worst-case estimate. It's the model.

Here's what a practical M2.5 build actually looks like.


Quick Summary

  • MiniMax M2.5 at 230B/10B active requires ~120GB VRAM minimum — single GPU cannot run it
  • Practical configs: 5x RTX 3090, 4x RTX 3090 + CPU offload, or 3x NVIDIA A6000 48GB
  • The MoE architecture means faster inference than a dense 70B when you have the hardware, but the hardware bar is steep

Why MoE Doesn't Help With VRAM

Mixture-of-experts models like M2.5 are often described as "runs like a 10B model" because only 10B parameters activate per token. That's true for compute: the FLOPs per forward pass are comparable to a 10B dense model. Memory is a different story, because every expert's weights must stay resident.

Think of it like a reference library. Even if you only read one page per query, the whole library has to be on the shelf. The 230B parameters are the shelves. The 10B active is how many pages you read at once.
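The library analogy can be sketched with a toy router. This is an illustration only: the expert count, the top-k value, and the seeded random routing are all hypothetical stand-ins and do not reflect M2.5's actual layout or its learned router.

```python
import random

NUM_EXPERTS = 16  # hypothetical expert count, not M2.5's real configuration
TOP_K = 2         # hypothetical number of experts used per token


def route(token_id: int, num_experts: int = NUM_EXPERTS, top_k: int = TOP_K) -> list[int]:
    """Pick top_k experts for a token. A seeded RNG stands in for a learned router."""
    rng = random.Random(token_id)
    return sorted(rng.sample(range(num_experts), top_k))


def experts_touched(token_ids: list[int]) -> set[int]:
    """Collect every expert that any token in the sequence was routed to."""
    touched: set[int] = set()
    for tok in token_ids:
        touched.update(route(tok))
    return touched


tokens = list(range(200))
touched = experts_touched(tokens)
print(f"per token: {TOP_K}/{NUM_EXPERTS} experts active")
print(f"over {len(tokens)} tokens: {len(touched)}/{NUM_EXPERTS} experts touched")
```

Each token reads only a few "pages", but across a realistic sequence the set of experts touched grows well beyond what any single token uses, which is why the whole "library" has to stay loaded.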

VRAM requirement formula for M2.5:

  • Full precision (FP16): 230B × 2 bytes = 460GB — not feasible on consumer hardware
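  • Q8 quantization: ~230GB, which still means roughly ten 24GB cards or five 48GB cards
  • Q4 quantization: ~115GB by the same arithmetic (230B × 0.5 bytes), or ~115–120GB in practice once runtime overhead is counted on top of the 101GB file; achievable with the right multi-GPU config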

The Q4 GGUF on Hugging Face is 101GB. Add approximately 15–20GB for KV cache and runtime overhead, and you're at ~120GB working VRAM requirement.
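The arithmetic above can be reproduced in a few lines. This is a rough sketch: it treats each quantization level as a flat bits-per-weight figure (real GGUF quants mix precisions, which is one reason the actual file is 101GB rather than 115GB), and the function names are illustrative.

```python
PARAMS_BILLION = 230  # total parameters, from the article


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal weight size in GB: parameters x bits / 8, with 1 GB = 1e9 bytes."""
    return params_billion * bits_per_weight / 8


def working_vram_gb(file_gb: float, overhead_gb: float) -> float:
    """Model file plus KV cache and runtime overhead."""
    return file_gb + overhead_gb


for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{name}: ~{weights_gb(PARAMS_BILLION, bits):.0f} GB of weights")

# 101GB file and 15-20GB overhead are the article's figures.
low, high = working_vram_gb(101, 15), working_vram_gb(101, 20)
print(f"Q4 working VRAM: ~{low:.0f}-{high:.0f} GB")
```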


GPU Configurations That Actually Work

Config 1: 5x RTX 3090 (~$2,500–$3,000)

Total VRAM: 120GB (5 × 24GB)

  • This is the budget floor for local M2.5 inference
  • Requires a workstation board with 5 PCIe slots (ASUS Pro WS X570-ACE, Gigabyte TRX50 series)
  • PCIe x4 slots are workable for inference when the model is split by layer: weights load once at startup, and only activations cross between cards per token
  • Expected throughput: 8–15 tokens/second (PCIe bottleneck on multi-GPU inference)

Warning

5 RTX 3090s in one system draw 1,750W under load. You need a 2,000W+ PSU, proper case airflow, and ideally 30A circuit capacity. This is a workstation build, not a gaming rig modification.
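A quick power budget makes the warning concrete. The 350W per card is implied by the 1,750W figure above; the 250W platform allowance (CPU, drives, fans) and the mains voltages are assumptions for illustration, so substitute your own numbers.

```python
GPU_COUNT = 5
GPU_WATTS = 350       # implied by the article's 1,750W total for 5 cards
PLATFORM_WATTS = 250  # assumption: CPU, drives, fans; not from the article


def system_watts(gpu_count: int, gpu_watts: int, platform_watts: int) -> int:
    """Total draw under load: all GPUs plus the rest of the system."""
    return gpu_count * gpu_watts + platform_watts


def circuit_amps(watts: float, volts: float) -> float:
    """Current drawn from the wall at a given mains voltage."""
    return watts / volts


gpu_only = system_watts(GPU_COUNT, GPU_WATTS, 0)
total = system_watts(GPU_COUNT, GPU_WATTS, PLATFORM_WATTS)
print(f"GPUs alone: {gpu_only}W, whole system: {total}W")
for volts in (120, 240):  # assumed mains voltages
    print(f"{volts}V circuit: {circuit_amps(total, volts):.1f}A at full load")
```

Under these assumptions the system pulls more than 15A on a 120V circuit, which is why a standard household outlet is not a safe bet for this build.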

Config 2: 4x RTX 3090 + CPU Offload (~$2,000–$2,400)

Total VRAM: 96GB + RAM offload for the remaining ~24GB

  • Requires 256GB+ system RAM to offload the excess layers without destroying throughput
  • Offloaded layers run at ~5–10x lower throughput than GPU layers
  • Tokens-per-second will be lower than the 5-GPU config, but the hardware cost is lower
  • Best for occasional use rather than heavy workloads
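To get a feel for what offloading costs, here is a back-of-envelope model. It assumes per-token time is proportional to the share of weights on each device, which is a simplification rather than a measurement, and it borrows the 8–15 tokens/second all-GPU range from Config 1 as a baseline. The function name and parameters are illustrative.

```python
def offload_tokens_per_second(gpu_only_tps: float, offload_fraction: float, slowdown: float) -> float:
    """Estimate tokens/second when a fraction of the model runs `slowdown`x slower.

    Assumes per-token time splits in proportion to the share of weights
    held on each device (a rough model, not a benchmark).
    """
    gpu_time = (1 - offload_fraction) / gpu_only_tps
    cpu_time = offload_fraction * slowdown / gpu_only_tps
    return 1 / (gpu_time + cpu_time)


OFFLOAD_FRACTION = 24 / 120  # ~24GB of the ~120GB requirement sits in system RAM

for baseline in (8, 15):      # all-GPU range quoted for Config 1
    for slowdown in (5, 10):  # the 5-10x penalty quoted above
        tps = offload_tokens_per_second(baseline, OFFLOAD_FRACTION, slowdown)
        print(f"baseline {baseline} t/s, {slowdown}x penalty: ~{tps:.1f} t/s")
```

Even with only a fifth of the model offloaded, the slow layers dominate per-token time in this model, which fits the "occasional use" recommendation.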

Config 3: 3x NVIDIA A6000 48GB (~$6,000–$9,000 used)

Total VRAM: 144GB (3 × 48GB)

  • 24GB of headroom over the ~120GB Q4 requirement leaves room for longer context while staying entirely in GPU memory; Q8 at ~230GB still does not fit
  • Professional workstation cards — low power per VRAM GB, excellent driver stability
  • NVLink bridges only work in pairs, so the third card still talks over PCIe
  • Best sustained throughput of the three configs

Config 4: 2x RTX 5090 (~$4,000)

Total VRAM: 64GB (2 × 32GB) — not enough at Q4

This is a trap. Two RTX 5090s give you 64GB, which sounds like a lot until you remember M2.5 needs 120GB. You'd need to offload 56GB to system RAM. The 5090s are faster per GPU but the offloading penalty makes this worse than the 5x 3090 config on throughput.
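The four configs can be checked against the ~120GB requirement with a small helper. The card counts and per-card VRAM come from the sections above; the function and dictionary names are illustrative.

```python
REQUIRED_GB = 120  # working VRAM requirement at Q4, from the article

# (card count, VRAM per card in GB)
CONFIGS = {
    "5x RTX 3090": (5, 24),
    "4x RTX 3090": (4, 24),
    "3x A6000 48GB": (3, 48),
    "2x RTX 5090": (2, 32),
}


def shortfall_gb(count: int, per_card_gb: int, required_gb: int = REQUIRED_GB) -> int:
    """GB that must be offloaded to system RAM (0 if the model fits in VRAM)."""
    return max(0, required_gb - count * per_card_gb)


for name, (count, per_card) in CONFIGS.items():
    total = count * per_card
    gap = shortfall_gb(count, per_card)
    status = "fits in VRAM" if gap == 0 else f"offload {gap}GB to RAM"
    print(f"{name}: {total}GB total, {status}")
```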


NVLink bridges a pair of cards and gives approximately 2x the bandwidth of PCIe x16 for GPU-to-GPU communication. Among consumer GeForce cards, only the RTX 3090 generation has the connector; the RTX 40 and 50 series dropped it. For a 2-GPU setup running a model that fits in 2-GPU VRAM, NVLink meaningfully improves throughput on attention operations that require cross-GPU tensor sync.

For M2.5, which needs 5 GPUs at 24GB each, NVLink helps only within a bridged pair. Traffic between pairs, and to the unpaired fifth card, still goes over PCIe. The performance gain from NVLink in a 5-GPU M2.5 setup is marginal, because the bottleneck moves to the non-NVLink connections.

If you're specifically building for MoE models at this scale, PCIe bandwidth optimization matters more than NVLink:

  • Use x16 slots for as many cards as possible
  • Use PCIe 4.0 or 5.0 (not 3.0) for inter-GPU bandwidth
  • A Threadripper Pro platform gives more PCIe lanes than a consumer AM5 or LGA1700 board
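To compare slot options, the sketch below uses approximate usable per-lane, per-direction bandwidth for each PCIe generation (after encoding overhead). Real-world throughput is lower, and the 1GB transfer size is an arbitrary illustration.

```python
# Approximate usable bandwidth per lane, per direction, in GB/s.
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}


def link_bandwidth_gbps(generation: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe link in GB/s."""
    return GBPS_PER_LANE[generation] * lanes


def transfer_ms(size_gb: float, generation: int, lanes: int) -> float:
    """Milliseconds to move size_gb across the link at theoretical bandwidth."""
    return size_gb / link_bandwidth_gbps(generation, lanes) * 1000


for gen in (3, 4, 5):
    for lanes in (4, 16):
        bw = link_bandwidth_gbps(gen, lanes)
        print(f"PCIe {gen}.0 x{lanes}: ~{bw:.1f} GB/s, "
              f"1GB transfer ~{transfer_ms(1, gen, lanes):.0f} ms")
```

One takeaway from the numbers: a PCIe 4.0 x4 slot has about half the bandwidth of a PCIe 3.0 x16 slot, so both generation and lane count matter when picking a board.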

Is M2.5 Actually Worth the Hardware Investment?

The honest answer depends on your use case.

The case for M2.5: If you specifically need coding agent performance that competes with GPT-4o at inference costs of $0, M2.5 is genuinely impressive. The 80.2% SWE-Bench score is real. The MIT license means you can deploy it commercially without per-token costs.

The case against: A single RTX 4090 running Qwen2.5-Coder 32B at Q4 hits ~45% on SWE-Bench. A dual RTX 3090 setup running DeepSeek-Coder V2 Lite hits ~50%. You get to 80% by going from a $1,600 setup to a $2,500–$4,000+ setup. The incremental gain is real but the cost multiplier is steep.

For most builders, M2.5 makes sense as a future investment — the multi-GPU platform you build today will also run whatever 200B+ models come out in 2027. If you're sizing a workstation platform for long-term local AI infrastructure, the Threadripper-based 4–6 GPU config is a reasonable foundation even if M2.5 isn't your immediate target.


FAQ

What is the minimum hardware to run MiniMax M2.5 locally? At Q4 quantization, M2.5 requires approximately 120GB of VRAM. That means a minimum of 5x RTX 3090 (24GB each = 120GB) or 3x RTX A6000 48GB cards. Single-GPU is not viable — this is a multi-GPU-only model at any reasonable quantization level.

Does MiniMax M2.5 support NVLink for multi-GPU inference? NVLink is a property of the GPUs rather than the model, so inference can benefit from it wherever two cards are bridged. However, each bridge only connects two cards, and M2.5 needs more than two. In 3+ GPU configurations, traffic outside a bridged pair relies on PCIe bandwidth, which is slower. This affects tokens-per-second but doesn't prevent inference.

Is MiniMax M2.5 better than running smaller models on single GPU? On quality, M2.5 beats most 70B models on coding and reasoning benchmarks. But the infrastructure cost is significant. On speed, a well-tuned smaller model that fits entirely in the VRAM you already have will often generate tokens faster than M2.5 on a constrained multi-GPU PCIe setup, due to bandwidth limitations on inter-GPU communication.
