PaPoo

What is distillation?

Distillation is a way to train a smaller model to imitate a larger, better one, so you keep much of the quality while reducing cost, latency, or memory use.

Why it matters

Distillation solves a very practical problem: the model you want may be too slow, too expensive, or too large to deploy everywhere.

You reach for distillation when you want one or more of these: lower serving cost, lower latency, or a smaller memory footprint.

In practice, teams often use a large “teacher” model to produce training signals (labels, probabilities, or full responses), then train a smaller “student” model to match them.

How it works

The basic idea comes from knowledge distillation, which grew out of model-compression work (Buciluă et al., 2006) and was popularized under that name by Hinton et al. (2015): instead of training only on hard labels, the student also learns from the teacher’s outputs.

A common setup is:

  1. Run the teacher model on training inputs.
  2. Capture its outputs — often probabilities over classes, logits, or generated text.
  3. Train the student to imitate those outputs, sometimes mixed with the original supervised target.
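The three steps above can be sketched as a single loss function. This is a minimal illustration in plain Python, not a reference implementation: the names `distillation_loss`, `alpha`, and `temperature` are assumptions introduced here, and the temperature-softened KL term is one common choice among several.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of a soft-target term and a hard-label term.

    alpha and temperature are illustrative hyperparameters (assumed values).
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Soft term: KL(teacher || student) on the softened distributions.
    soft = sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_soft) if t > 0)
    # Hard term: ordinary cross-entropy against the ground-truth class index.
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft + (1 - alpha) * hard

teacher_logits = [4.0, 2.0, 0.5]  # made-up teacher output for one input
loss_far = distillation_loss([0.0, 0.0, 0.0], teacher_logits, hard_label=0)
loss_close = distillation_loss([3.9, 2.1, 0.4], teacher_logits, hard_label=0)
print(loss_far > loss_close)  # a student that is closer to the teacher scores lower
```

Setting `alpha=1.0` trains on the teacher signal alone, while `alpha=0.0` falls back to ordinary supervised training; values in between give the mixed target mentioned in step 3.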

For large language models, the same pattern is used in a broader sense: the student learns to reproduce the teacher’s behavior on examples, prompts, or preference data. The exact recipe varies by domain, and “distillation” can mean anything from matching soft probabilities to copying full generations.
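For the "copying full generations" end of that range, the data-collection step might look like the sketch below. Everything here is hypothetical: `build_distillation_set` and `toy_teacher` are placeholder names, and the teacher is a stub standing in for whatever model call a real pipeline would make.

```python
from typing import Callable, Dict, List

def build_distillation_set(prompts: List[str],
                           teacher: Callable[[str], str]) -> List[Dict[str, str]]:
    """Pair each prompt with the teacher's generation to form student training data."""
    records = []
    for prompt in prompts:
        completion = teacher(prompt)
        if completion.strip():  # drop empty generations
            records.append({"prompt": prompt, "completion": completion})
    return records

def toy_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a real teacher model call."""
    return "" if not prompt else f"Answer to: {prompt}"

data = build_distillation_set(["What is 2+2?", ""], toy_teacher)
print(data)
```

The resulting prompt and completion pairs would then be used as ordinary supervised fine-tuning data for the student.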

The key intuition is that the teacher’s output carries more information than a single hard label. For example, if the teacher assigns some probability to near-miss answers, the student can learn a smoother decision boundary.

Tiny concrete example

Suppose a teacher model predicts:

Instead of training the student only with the hard label “cat,” you train it to match that full distribution. Over many examples, the student learns a compressed version of the teacher’s behavior.
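To make the idea of matching the full distribution concrete, here is a toy sketch. The class names and probabilities are made-up illustrative values, not figures from this article, and the "student" is reduced to three trainable logits for a single input.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher distribution (assumed values for illustration only).
classes = ["cat", "dog", "fox"]
teacher_probs = [0.7, 0.2, 0.1]

# The student starts out with no preference between classes.
student_logits = [0.0, 0.0, 0.0]
learning_rate = 0.5

for _ in range(500):
    student_probs = softmax(student_logits)
    # The gradient of cross-entropy(teacher, student) with respect to each
    # logit is (student probability - teacher probability).
    student_logits = [z - learning_rate * (p - t)
                      for z, p, t in zip(student_logits, student_probs, teacher_probs)]

student_probs = softmax(student_logits)
for name, p in zip(classes, student_probs):
    print(f"{name}: {p:.3f}")
```

Training against the hard label alone would push the student toward putting all of its probability on "cat"; training against the soft target also preserves the teacher's relative ranking of the other classes.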

A simple workflow:

Common pitfalls / when NOT to use it

Related terms

