

*By Charlotte Stewart · 9 min read*

Some links on this page may be affiliate links. We disclose it because you deserve to know, not because it changes anything. Every recommendation here comes from benchmarks, not budgets.


# How to Run MiniMax-M1 Locally: The Honest Hardware Requirements (March 2026)

**MiniMax-M1 456B is getting a lot of attention in r/LocalLLaMA right now, and half the posts are wrong about what it takes to run it. The short version: you cannot run MiniMax-M1 on consumer hardware in March 2026. Not dual RTX 3090s. Not an RTX 4090. Not even a Mac Studio M4 Ultra with 192 GB of unified memory. The model requires a minimum of 640 GB VRAM, no [GGUF](/glossary/gguf) format exists yet, and Ollama doesn't support local deployment.**

**If you need to use MiniMax-M1 today, your path is renting GPU clusters on Vast.ai or Lambda Labs — not a local build. This guide explains why, what tools actually work, and what the hardware picture looks like once GGUF support arrives.**

---

## What MiniMax-M1 Actually Is

MiniMax-M1 is a 456B-parameter [MoE (Mixture of Experts)](/glossary/moe-mixture-of-experts) model released in June 2025. The MoE architecture means only ~45.9B parameters activate per inference pass — which sounds like good news for [VRAM](/glossary/vram) requirements, and it would be, if the model were small. It isn't.

The full BF16 model weighs ~912 GB across 413 shards on Hugging Face (verified via the official MiniMaxAI/MiniMax-M1-80k repo). Even activating only 10% of the parameters per token doesn't help you when you still have to load all 456B weights into memory — the routing logic needs to know what's available before it decides what to activate.
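The sizing can be sanity-checked with back-of-envelope arithmetic (our own illustrative math, not official MiniMax figures): weight bytes ≈ parameter count × bits per weight ÷ 8.

```shell
# Back-of-envelope weight size: params (in billions) * bits per weight / 8 = GB.
# Illustrative integer math, not official figures.
weights_gb() {            # usage: weights_gb <params_in_billions> <bits>
  echo $(( $1 * $2 / 8 ))
}
weights_gb 456 16   # BF16  -> 912, matching the ~912 GB Hugging Face total
weights_gb 456 4    # 4-bit -> 228; the community INT4 lands at ~266 GB because
                    # some layers are kept at higher precision
```

The same formula explains why quantization alone can't rescue consumer hardware: even an aggressive 2-bit conversion of 456B parameters is ~114 GB of weights before you account for KV cache.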

> [!NOTE]
> MoE models are often marketed as "efficient" compared to dense models of the same parameter count. That's true for compute and cost-per-token at cloud scale. For local inference, where your bottleneck is loading weights into VRAM, you still pay the full memory price.

The model's context window goes up to 1M tokens (80k variant) with strong performance on long-context benchmarks. That's legitimately impressive. But it doesn't change the hardware math.

### Why the VRAM Math Doesn't Work for Consumer Hardware

Dense 70B models at Q4 quantization need roughly 40 GB VRAM — within reach of dual-GPU consumer builds. The 456B scale of MiniMax-M1 changes things fundamentally.

The only non-BF16 quantization currently available is a community W4A16 INT4 safetensors conversion from user `justinjja` on Hugging Face, spread across 54 shards totaling ~266 GB. That's still more than five times what two RTX 3090s offer combined.

Standard [quantization](/glossary/quantization) tools can't convert MiniMax-M1 to GGUF format yet. When community members tried using `convert_hf_to_gguf.py` from llama.cpp, it failed because the model uses custom architecture code that the converter doesn't support. The MiniMax team acknowledged this in GitHub issue #19:

> "We're currently working on enabling support for the GGUF format in llama.cpp for the MiniMax-M1 model. Once the support is complete, we plan to publish a detailed guide."

No timeline was given. As of March 2026, no GGUF file exists — not from MiniMax, not from bartowski, not from any major community quantizer.

---

## What Hardware Actually Works Today

MiniMax's official deployment documentation is direct about requirements. From their vLLM deployment guide (Tier 1 source, verified March 2026):

- **Production:** 8x H800 GPUs (80 GB each = 640 GB VRAM total), tensor parallelism size 8
- **Long-context:** 8x H20 GPUs (96 GB each = 768 GB VRAM total)

The launch command from official docs requires `--tensor-parallel-size 8`. There is no official single-node configuration for smaller GPU counts.

A MiniMax team member responding to GitHub issue #12 stated the minimum recommendation is "8-card setup with 96 GB VRAM per card" — that's an H20 cluster, not consumer hardware.

**To be explicit:** An 8x H100 80GB setup costs approximately $200,000–$250,000 to purchase. An 8x H20 NVL cluster is comparable in price. Nobody is buying this for a home lab.
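To see why the card counts land where they do, ceiling-divide the weight size by per-card VRAM. This is a sketch that ignores KV cache and activation overhead, so real deployments need headroom on top of these floors:

```shell
# Minimum card count for a given weight size (GB) and per-card VRAM (GB),
# via ceiling division. Ignores KV cache and activations: treat as a floor.
gpus_needed() {   # usage: gpus_needed <model_gb> <vram_per_gpu_gb>
  echo $(( ($1 + $2 - 1) / $2 ))
}
gpus_needed 912 80   # full BF16 on 80 GB cards -> 12 just for the weights
gpus_needed 266 80   # community INT4           -> 4 at absolute minimum
```

The official 8x H800 configuration fits because `experts_int8` shrinks the expert weights well below BF16 size while leaving room for KV cache across the eight cards.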

> [!WARNING]
> If you've read posts claiming dual RTX 3090s or a Mac Studio can run MiniMax-M1 at useful speeds — those numbers were invented. No public benchmark exists for MiniMax-M1 on any consumer GPU configuration because the model cannot physically fit in consumer VRAM, with or without quantization.

---

## What Doesn't Work (And Why People Think It Does)

### Ollama

Ollama lists MiniMax models in its library, which creates confusion. But the available entries are cloud API connectors for MiniMax-M2.1 and M2.5 — not downloadable model files for MiniMax-M1. You can't pull a MiniMax-M1 model file locally via Ollama. GitHub issue #11116 in the Ollama repo confirms local M1 support was requested and never shipped.

### llama.cpp / LM Studio / Jan

All of these tools depend on GGUF format. No GGUF exists. Until MiniMax ships llama.cpp support or a community quantizer cracks the conversion, none of these options work.

### The "CPU offload" workaround

You'll find suggestions to use system RAM as overflow with tools like llama.cpp's `--n-gpu-layers` partial offload. Two problems: first, there's no GGUF file to load. Second, even if there were, CPU inference on a 456B model at usable speeds would require Threadripper-scale hardware with 768 GB+ of RAM, and you'd be lucky to see 0.5–1 tok/s — unusable for any practical task. Read the firsthand reports on r/LocalLLaMA about CPU offload at 400B+ scale before investing in that path.

---

## What Actually Works: vLLM on Server Hardware

If you have access to a multi-GPU server (or are willing to rent one), the setup is straightforward. vLLM 0.9.2 or later is required — versions before 0.9.2 cause precision errors in the LM head and KV cache that degrade output quality. The fix landed in PR #19592, merged June 13, 2025, and was included in the official v0.9.2 release.

**Verified setup command (from official MiniMax documentation):**

```bash
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
  --model <model_path> \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --quantization experts_int8 \
  --max_model_len 4096 \
  --dtype bfloat16
```

The `--quantization experts_int8` flag applies INT8 quantization to the expert layers, reducing memory pressure while preserving quality on the attention layers. This is the recommended production configuration per MiniMax.

For people without server hardware, renting is the practical path. Vast.ai and Lambda Labs both offer 8x H100 or H200 cluster rentals. At current Lambda pricing (verified March 2026), 8x H100 runs approximately $14–18/hour. For occasional deep research tasks or long-context document processing, that's a reasonable hourly spend — and significantly cheaper than the $200K+ hardware cost.
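Once a rented cluster has the server above running, it exposes an OpenAI-compatible API. A minimal smoke test looks like the sketch below; port 8000 is vLLM's default and an assumption here, and `<model_path>` is whatever you passed to `--model`:

```shell
# Build an OpenAI-style chat request, then POST it to the vLLM endpoint.
# Port 8000 is the vLLM default; adjust if you launched with --port.
payload='{"model": "<model_path>", "messages": [{"role": "user", "content": "Summarize tensor parallelism in one sentence."}], "max_tokens": 64}'

# --max-time keeps the check short; the fallback message fires when no
# server is listening (expected on a machine without the cluster).
curl -s --max-time 5 http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "server not reachable"
```

If the request succeeds, the response is standard OpenAI chat-completions JSON, so existing OpenAI-client tooling points at the rented cluster with only a base-URL change.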


## The Future Local Path: What Happens When GGUF Arrives

Once llama.cpp support lands and a proper GGUF quantization exists, what hardware could realistically run MiniMax-M1?

Here's the honest sizing math, working backward from what quantization levels would require:

| Quantization (est.) | Approx. weight size | Feasible hardware |
| --- | --- | --- |
| BF16 (full precision) | ~912 GB | 12x H100 cluster |
| Q4 | ~266 GB | 4x H100 80GB |
| Q3 | ~200 GB | 3x H100 80GB |
| Q2 | ~150 GB | 2x H100 80GB |
| Aggressive ~2-bit | <120 GB | AMD Ryzen AI Max+ 395 (128 GB) |

*Estimates based on standard quantization size ratios. Unverified — GGUF files do not exist as of March 2026.*

The AMD Ryzen AI Max+ 395 — the top Strix Halo chip, released Q1 2025 — has 128 GB of unified memory shared between CPU and GPU (Radeon 8060S with 40 compute units). Memory bandwidth is 256 GB/s, which is respectable for inference. If an aggressive Q2-level quantization ever brings MiniMax-M1 under 120 GB, this chip could theoretically handle it.
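That 256 GB/s figure also sets a hard ceiling on decode speed: every generated token has to stream the active weights through memory, so tokens/s cannot exceed bandwidth ÷ bytes read per token. A rough sketch with our own assumptions (~46B active parameters, integer math):

```shell
# Bandwidth-bound decode ceiling: tok/s <= bandwidth * 8 / (active_B * bits).
# Upper bound only -- real throughput lands well below this.
decode_ceiling() {   # usage: decode_ceiling <bw_GBps> <active_params_B> <bits>
  echo $(( $1 * 8 / ($2 * $3) ))
}
decode_ceiling 256 46 4   # ceiling of ~11 tok/s at 4-bit on Strix Halo
decode_ceiling 256 46 2   # ceiling of ~22 tok/s at 2-bit
```

This is the one genuinely hopeful part of the MoE story for this chip: the ceiling depends on the ~46B active parameters, not the full 456B, so the bottleneck is fitting the weights at all, not streaming them.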

But Q2 quantization on a 456B MoE model is a significant quality tradeoff. For the kind of long-context reasoning and multi-turn work that MiniMax-M1 is actually good at, Q2 degradation may make the whole exercise pointless. We'll need real testing once GGUF support exists. For more on the AMD Strix Halo platform and how its unified memory compares to discrete VRAM for LLM inference, see our Apple Silicon vs NVIDIA comparison guide.

> [!TIP]
> If you're building a rig now for running the largest possible models locally, the AMD Ryzen AI Max+ 395 at 128 GB unified memory is the most future-proof consumer option available — even if MiniMax-M1 is out of reach today. It handles 70B models comfortably at full quality. See our hardware upgrade ladder guide for where it fits in the broader progression.


## Real Performance Data: What We Know

The table below distinguishes verified from estimated. No consumer GPU benchmarks exist for MiniMax-M1 because the model cannot run on consumer hardware. Claims you'll see elsewhere presenting tok/s numbers on dual-3090 setups are not real.

| Hardware | Status |
| --- | --- |
| 8x H800 (640 GB, official config) | No public benchmark |
| 8x H20 (768 GB, long-context config) | No public benchmark |
| Dual RTX 3090 (48 GB) | Hardware insufficient |
| AMD Ryzen AI Max+ 395 (128 GB) | No GGUF exists |

*Last verified: March 29, 2026. Once MiniMax publishes throughput data or GGUF support lands, this table will be updated.*


## Should You Wait or Move On?

MiniMax-M1's real edge is 1M-token context and strong performance on long-document tasks — not raw chat speed. If that use case matters to you, the cloud rental path makes economic sense right now. A $15/hour H100 cluster for an occasional 20-minute session costs $5 — cheaper than a 4090 upgrade that still wouldn't run the model.
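The $5 figure is just prorated hourly pricing; the arithmetic, using the article's $15/hour rate:

```shell
# Prorated rental cost: rate ($/hr) * minutes / 60, in whole dollars.
session_cost_usd() {   # usage: session_cost_usd <rate_per_hr> <minutes>
  echo $(( $1 * $2 / 60 ))
}
session_cost_usd 15 20    # a 20-minute session at $15/hr -> $5
```

Even a weekly hour-long session at that rate runs about $60/month, which takes years to approach the cost of any hardware that could actually host the model.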

If you're drawn to MiniMax-M1 for general local inference and multi-turn chat, you're probably better served by Llama 3.3 70B or Qwen 2.5 72B on hardware you already own. See our dual-GPU local LLM stack guide for the 70B sweet spot setup — that's where the dual-3090 build actually earns its price tag.

The GGUF situation is the real blocker. Watch the MiniMax-AI/MiniMax-M1 GitHub repo. When issue #19 closes with merged llama.cpp support, that's the signal to start planning a local deployment. Until then, the model lives in data center territory.


## FAQ

### Can you run MiniMax-M1 on a dual RTX 3090 setup?

No. Two RTX 3090s provide 48 GB VRAM combined. The minimum verified requirement per official MiniMax documentation is 8x H800 GPUs totaling 640 GB VRAM. The only available non-BF16 quantization (community INT4, ~266 GB) is still roughly 5.5× over dual-3090 capacity. There is no quantization path that makes this feasible on current consumer hardware.

### Does Ollama support MiniMax-M1 for local inference?

No. Ollama's MiniMax entries are cloud API connectors for M2.1 and M2.5, not downloadable files for local M1 inference. GitHub issue #11116 in the Ollama repo confirms local M1 support was requested and not delivered as of March 2026.

### Is there a GGUF version of MiniMax-M1?

Not yet. llama.cpp's conversion script fails on MiniMax-M1's custom architecture. The MiniMax team confirmed they're working on llama.cpp support (GitHub issue #19) with no published timeline. The only alternative quantization is a community W4A16 INT4 in safetensors format (~266 GB) — not GGUF-compatible and still server-scale.

### What is the minimum VRAM to run MiniMax-M1 locally?

640 GB, based on official MiniMax documentation recommending 8x H800 (80 GB each). A MiniMax team member in GitHub issue #12 stated the minimum is 8 cards at 96 GB each (768 GB total). There is no confirmed path below 640 GB VRAM as of March 2026.

### When will MiniMax-M1 be runnable on consumer hardware?

Unknown. Once GGUF support arrives, aggressive Q2 quantization might theoretically bring it under 128 GB — within range of AMD Ryzen AI Max+ 395 systems. But Q2 quality on a task-sensitive model is a steep tradeoff, and no timeline exists for GGUF support. Best estimate: watch the MiniMax GitHub repo for issue #19 to close.


## Verdict

MiniMax-M1 is a legitimate model worth following — the 1M-token context window and MoE efficiency at cloud scale are real advantages. But the local inference story in March 2026 is simply: it's not ready for consumer hardware, the tooling doesn't exist, and anyone telling you otherwise is working from specs, not test results.

**If you need MiniMax-M1 now:** Rent 8x H100 clusters on Vast.ai or Lambda Labs. Budget ~$15/hour. Use vLLM 0.9.2+ with `--tensor-parallel-size 8` and `--quantization experts_int8`.

**If you're building a local rig for future large-model capability:** AMD Ryzen AI Max+ 395 at 128 GB unified memory is the most defensible buy. It won't run MiniMax-M1 today. It might once GGUF support matures, and it runs every 70B model at full quality right now.

Don't buy dual RTX 3090s expecting to run this model. Don't wait on Ollama support. Watch the GitHub issue tracker, and check back here when the GGUF situation changes.

*Last verified: March 29, 2026*

