MoE (Mixture of Experts)
Architecture where a model has many specialized sub-networks (experts) but activates only a subset per token — enabling larger models with lower compute costs.
Mixture of Experts (MoE) is a model architecture where a transformer has multiple parallel feed-forward networks (the "experts") at each layer, but routes each token through only a small subset of them — typically 2 out of 8, 16, or more total experts. This means the model can have many more total parameters than a dense model while using roughly the same compute per token.
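The routing step can be sketched in a few lines. The toy layer below implements top-2 routing over eight tiny numpy "experts"; all dimensions, weights, and names are illustrative, not Mixtral's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FF, N_EXPERTS, TOP_K = 16, 64, 8, 2

# Each "expert" is an independent feed-forward network (two weight matrices).
experts = [
    (rng.standard_normal((D_MODEL, D_FF)) * 0.1,
     rng.standard_normal((D_FF, D_MODEL)) * 0.1)
    for _ in range(N_EXPERTS)
]
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def moe_layer(x):
    """Route one token vector through its top-k experts and mix the outputs."""
    logits = x @ router_w
    top_k = np.argsort(logits)[-TOP_K:]          # indices of the k highest-scoring experts
    weights = np.exp(logits[top_k])
    weights /= weights.sum()                      # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top_k):
        w1, w2 = experts[idx]
        out += w * (np.maximum(x @ w1, 0) @ w2)   # ReLU FFN, weighted by router score
    return out, top_k

token = rng.standard_normal(D_MODEL)
y, chosen = moe_layer(token)
print(chosen)  # only TOP_K of the N_EXPERTS experts ran for this token
```

Only the chosen experts' weights participate in the forward pass, which is exactly why compute tracks active parameters rather than total parameters.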
Why MoE Matters for Local AI
MoE models create a counterintuitive hardware situation: a model with 141 billion total parameters (like Mixtral 8x22B) activates only 39 billion parameters per token. The compute requirement per token is similar to a 39B dense model, but quality is higher because the full 141B parameter space is available.
The catch: all the parameters still need to fit in VRAM. Mixtral 8x22B at Q4 requires ~80GB of VRAM, or significant RAM offloading. MoE efficiency doesn't reduce memory requirements — it reduces compute, not storage.
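The compute/storage split is easy to make concrete with back-of-the-envelope arithmetic (the ~2 FLOPs/parameter and ~0.5 bytes/parameter figures are standard rough estimates, not measurements):

```python
total_params  = 141e9   # Mixtral 8x22B, all experts combined
active_params = 39e9    # parameters actually exercised per token

# Compute per token scales with *active* parameters (~2 FLOPs per parameter).
flops_per_token_moe   = 2 * active_params
flops_per_token_dense = 2 * total_params
ratio = flops_per_token_moe / flops_per_token_dense
print(f"compute vs. an equally sized dense model: {ratio:.0%}")  # roughly 28%

# Memory scales with *total* parameters: at Q4 (~0.5 bytes/parameter),
# all 141B parameters must still be resident.
q4_bytes_per_param = 0.5
vram_gb = total_params * q4_bytes_per_param / 1e9
print(f"Q4 weight memory: ~{vram_gb:.0f} GB")  # ~70 GB for weights alone
```

The weight total lands around 70GB before the KV cache, activations, and framework overhead, which is how the practical requirement reaches ~80GB.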
Common MoE Models for Local Use
- Mixtral 8x7B — 46.7B total, 12.9B active per token. Q4 fits in ~30GB VRAM. The first widely popular local MoE model.
- Mixtral 8x22B — 141B total, 39B active per token. Requires ~80GB of memory at Q4, meaning multiple GPUs or heavy offloading in practice.
- DeepSeek models — DeepSeek uses aggressive fine-grained MoE architectures with many small routed experts plus shared experts per layer; DeepSeek-V3, for example, has 671B total parameters with ~37B active per token.
MoE Inference Characteristics
MoE inference is less uniform than dense model inference. The expert routing step introduces branching — different tokens take different paths through the model. This makes batching less efficient and can create load imbalance, where a few experts receive far more tokens than others (a problem known as expert collapse or routing collapse).
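Load imbalance is straightforward to measure from router decisions. The simulation below (a hypothetical biased router, not any real model's logits) counts how often each expert lands in a token's top-2:

```python
import numpy as np

rng = np.random.default_rng(1)
N_EXPERTS, N_TOKENS, TOP_K = 8, 10_000, 2

# Simulated router logits: a biased router that strongly prefers expert 0.
bias = np.zeros(N_EXPERTS)
bias[0] = 3.0
logits = rng.standard_normal((N_TOKENS, N_EXPERTS)) + bias

# Count how often each expert lands in a token's top-k.
top_k = np.argsort(logits, axis=1)[:, -TOP_K:]
counts = np.bincount(top_k.ravel(), minlength=N_EXPERTS)
load = counts / counts.sum()

print("expert load:", np.round(load, 3))
# A balanced router would give each expert ~1/8 of the assignments; here
# expert 0 dominates. Auxiliary load-balancing losses penalize exactly this.
```

During training, MoE models add an auxiliary balancing loss to push this distribution back toward uniform; at inference time the imbalance simply shows up as uneven work per expert.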
For single-user local inference (batch size 1), this matters less. For serving multiple concurrent users, MoE models can have worse throughput than dense models of equivalent quality.
Memory Bandwidth Implications
Because not all experts are active simultaneously, some inference frameworks can partially offload MoE experts to system RAM and load them on demand. This allows running Mixtral 8x7B on a 16GB GPU with 32GB of fast system RAM, with acceptable (if reduced) speeds compared to full VRAM loading.
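The on-demand loading idea amounts to treating VRAM as a cache over the full expert set in RAM. The sketch below models it with a small LRU cache; the slot count, names, and placeholder "weights" are hypothetical, and real frameworks implement this at the tensor level rather than per-expert Python objects:

```python
from collections import OrderedDict

N_EXPERTS = 8
VRAM_SLOTS = 3          # pretend only 3 experts fit on the GPU at once

vram = OrderedDict()    # expert id -> weights, kept in least-recently-used order
transfers = 0           # how many RAM -> VRAM copies we were forced to make

def fetch_expert(idx):
    """Return expert weights, copying from system RAM on a cache miss."""
    global transfers
    if idx in vram:
        vram.move_to_end(idx)           # mark as recently used
        return vram[idx]
    transfers += 1                       # stands in for the slow PCIe copy
    if len(vram) >= VRAM_SLOTS:
        vram.popitem(last=False)         # evict the least-recently-used expert
    vram[idx] = f"weights[{idx}]"        # placeholder for the real tensor
    return vram[idx]

# A token stream that keeps reusing a "hot" subset of experts rarely misses:
for expert_pair in [(0, 1), (0, 1), (2, 1), (0, 2), (0, 1)]:
    for e in expert_pair:
        fetch_expert(e)

print(f"{transfers} transfers for 10 expert activations")  # prints "3 transfers ..."
```

The scheme works well exactly when routing is skewed toward a hot subset of experts; a perfectly balanced router would thrash the cache and pay the PCIe transfer cost far more often.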