
What is a mixture-of-experts (MoE) model?

A mixture-of-experts (MoE) model is a neural network that routes each input, or part of an input, to only a small subset of specialized submodels (“experts”) instead of using the whole model every time.

Why it matters

MoE models are useful when you want much larger capacity without paying the full compute cost on every token or example. In practice, that means you can grow the total parameter count to improve quality while keeping the compute per token, in both training and inference, well below that of a dense model with the same total parameter count.
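
To make the capacity-versus-compute trade concrete, here is a rough back-of-the-envelope sketch. The function name and every number in it are hypothetical, chosen only for illustration; it assumes each token passes through some shared parameters plus `top_k` equally sized experts.

```python
def moe_param_counts(num_experts, params_per_expert, shared_params, top_k):
    """Return (total parameters, parameters active for one token)."""
    # Every expert counts toward the model's total size...
    total = shared_params + num_experts * params_per_expert
    # ...but a single token only runs through top_k of them.
    active = shared_params + top_k * params_per_expert
    return total, active


# Hypothetical sizes, not taken from any real model.
total, active = moe_param_counts(
    num_experts=8,
    params_per_expert=100_000_000,
    shared_params=50_000_000,
    top_k=2,
)
print(total, active)  # 850000000 250000000
```

With these made-up numbers, the model holds 850M parameters but each token touches only 250M of them.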

You’d reach for MoE when:

How it works

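An MoE model typically has two parts: a router (often called a gating network) that scores the experts for each token, and a set of experts, the specialized submodels that do the actual computation.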

For each token, the router scores the experts and selects a small number of them, often just 1 or 2. Only those chosen experts run, and their outputs are combined, usually as a weighted sum using the router's scores, to produce the final result. The rest of the experts stay idle for that token.
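
The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the names `moe_forward` and `router_weights` are made up here, the router is assumed to be a simple linear scorer followed by a softmax, and the "experts" are toy functions that just scale their input.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weights, experts, k=2):
    """Route one token vector x to its top-k experts and mix their outputs."""
    # Router: one score per expert (assumed here to be a dot product).
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    probs = softmax(scores)
    # Keep only the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)        # only the chosen experts run
        gate = probs[i] / norm   # renormalised router weight
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out, top


# Toy setup: four experts that scale the input by 1, 2, 3 and 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

out, chosen = moe_forward([2.0, 1.0], router_weights, experts, k=2)
print(chosen)  # [0, 1]
```

Here experts 2 and 3 are never called for this token, which is where the compute saving comes from. Renormalising the gates over the selected experts is one common design choice; some models use the raw softmax probabilities instead.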

This is different from a dense model, where every parameter in every layer is used for every token. The key idea is conditional computation: activate only the parts of the network that seem most relevant.

Training usually includes extra balancing or regularization so the router does not send everything to a single expert. Otherwise, a few experts can become overloaded while others are never used.
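
One common form of this balancing term is an auxiliary loss that multiplies, per expert, the fraction of tokens routed to it by the average router probability it receives. The sketch below assumes top-1 routing and that particular formulation; the text above does not say which balancing method is meant, and the function name is illustrative.

```python
def load_balance_loss(router_probs):
    """Auxiliary balancing loss for a batch of tokens (top-1 routing assumed).

    router_probs: one list of per-expert probabilities for each token.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f_i: fraction of tokens whose top choice is expert i.
    counts = [0] * num_experts
    for p in router_probs:
        counts[max(range(num_experts), key=lambda i: p[i])] += 1
    frac_tokens = [c / num_tokens for c in counts]

    # P_i: average router probability assigned to expert i.
    mean_prob = [
        sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)
    ]

    # Scaled so that a perfectly uniform router gives a loss of 1.
    return num_experts * sum(f * m for f, m in zip(frac_tokens, mean_prob))


balanced = load_balance_loss([[0.9, 0.1], [0.1, 0.9]])   # tokens split evenly
collapsed = load_balance_loss([[0.9, 0.1], [0.9, 0.1]])  # all tokens to expert 0
print(balanced < collapsed)  # True
```

Adding a small multiple of this value to the training loss penalizes the collapsed case, nudging the router to spread tokens across experts.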

Tiny concrete example

Suppose a text model sees these two tokens: “def” and “invoice”.

A router might learn that “def” belongs with a code expert and “invoice” belongs with a document expert.

So for “def”, only a code expert runs; for “invoice”, only a document expert runs. The model still has many experts overall, but each token only pays for a small part of the network.
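
The same “def” / “invoice” example can be written as a toy lookup. Everything here is hypothetical: the scores are hand-written rather than learned, the third “math” expert is added only so there is something left idle, and the experts just return a label showing which one ran.

```python
# Hand-written router scores standing in for what a trained router might output.
ROUTER_SCORES = {
    "def": {"code": 3.0, "document": 0.2, "math": 0.5},
    "invoice": {"code": 0.1, "document": 2.5, "math": 0.4},
}

# Toy experts: each one only reports that it handled the token.
EXPERTS = {
    "code": lambda tok: f"code-expert({tok})",
    "document": lambda tok: f"document-expert({tok})",
    "math": lambda tok: f"math-expert({tok})",
}


def route(token, k=1):
    """Run only the k best-scoring experts for this token."""
    scores = ROUTER_SCORES[token]
    chosen = sorted(scores, key=scores.get, reverse=True)[:k]
    return [EXPERTS[name](token) for name in chosen]


print(route("def"))      # ['code-expert(def)']
print(route("invoice"))  # ['document-expert(invoice)']
```

With `k=1`, two of the three experts do no work for each token, even though all three are part of the model.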

Common pitfalls / when NOT to use it

In short: use MoE when you need large capacity with selective compute, not when you just want a simpler baseline.

Related terms

