Distillation is a way to train a smaller model to imitate a larger, better one, so you keep much of the quality while reducing cost, latency, or memory use.
Distillation solves a very practical problem: the model you want may be too slow, too expensive, or too large to deploy everywhere.
You reach for distillation when you want one or more of these: lower serving cost, lower latency, or a smaller memory footprint.
In practice, teams often use a large “teacher” model to generate training signals, such as labels, probabilities, or full responses, then train a smaller “student” model to match them.
The basic idea comes from knowledge distillation, which grew out of model-compression work (Buciluă et al., 2006) and was formalized under that name by Hinton, Vinyals, and Dean (2015): instead of training only on hard labels, the student also learns from the teacher’s outputs.
A common setup is:
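One widely used setup from the knowledge-distillation literature blends a hard-label loss with a soft-target loss. The sketch below is a minimal plain-Python version under stated assumptions: the function name `distillation_loss`, the `temperature` and `alpha` parameters, and the example logits are all illustrative choices, not something the text above prescribes.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    temperature: float = 2.0,  # assumed value, tune per task
    alpha: float = 0.5,        # assumed weight on the hard-label term
) -> float:
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])

    # Soft term: KL(teacher || student) on temperature-softened distributions.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    # Scaling by T^2 keeps the soft term's gradients comparable across temperatures.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * kl


# A student that already matches the teacher pays only the hard-label term.
matched = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], label=0)
mismatched = distillation_loss([-1.0, 0.5, 2.0], [2.0, 0.5, -1.0], label=0)
```

With `alpha=1.0` this reduces to ordinary supervised training, and with `alpha=0.0` the student learns from the teacher alone, which is useful when no ground-truth labels exist.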
For large language models, the same pattern is used in a broader sense: the student learns to reproduce the teacher’s behavior on examples, prompts, or preference data. The exact recipe varies by domain, and “distillation” can mean anything from matching soft probabilities to copying full generations.
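At the “copying full generations” end of that range, distillation often starts with nothing more than a dataset of teacher responses. The sketch below shows one way to build it. `teacher_generate` is a hypothetical stand-in for whatever call returns the teacher's text, and the `prompt`/`completion` field names are assumptions rather than a standard format.

```python
from typing import Callable, Dict, List


def build_distillation_dataset(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each prompt with the teacher's response for student fine-tuning."""
    records = []
    for prompt in prompts:
        completion = teacher_generate(prompt)
        # Skip empty responses so the student never trains on blank targets.
        if completion.strip():
            records.append({"prompt": prompt, "completion": completion})
    return records


def toy_teacher(prompt: str) -> str:
    # Hypothetical teacher: a real one would query a large model here.
    return prompt.upper() if prompt else ""


dataset = build_distillation_dataset(["hello", "", "distill me"], toy_teacher)
```

The student is then fine-tuned on these records like any supervised dataset. Filtering or scoring the teacher's responses before training is a common extension of this step.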
The key intuition is that the teacher’s output carries more information than a single hard label. For example, if the teacher assigns some probability to near-miss answers, the student can learn a smoother decision boundary.
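One common way to surface that extra information is to soften the teacher's logits with a temperature before converting them to probabilities. Temperature is not mentioned above; it is a standard technique from the distillation literature, and the logits and class names here are made up for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for the classes (cat, dog, car).
teacher_logits = [5.0, 3.0, -2.0]

sharp = softmax(teacher_logits, temperature=1.0)  # nearly one-hot
soft = softmax(teacher_logits, temperature=4.0)   # near misses become visible
```

At temperature 1 almost all probability sits on the top class. At a higher temperature the ranking is unchanged, but the runner-up class receives a much larger share, which gives the student a clearer signal about which classes the teacher considers similar.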
Suppose a teacher model predicts:
Instead of training the student only with the hard label “cat,” you train it to match that full distribution. Over many examples, the student learns a compressed version of the teacher’s behavior.
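To make the difference concrete, the sketch below compares a hard-label loss with a soft-target loss for two hypothetical students. All probabilities are illustrative assumptions chosen for this example, including the teacher distribution.

```python
import math
from typing import List


def cross_entropy(target: List[float], predicted: List[float]) -> float:
    """Cross-entropy H(target, predicted) in nats."""
    eps = 1e-12  # guard against log(0)
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# Class order: cat, dog, car.
hard_label = [1.0, 0.0, 0.0]        # one-hot "cat"
teacher_probs = [0.70, 0.25, 0.05]  # illustrative teacher distribution

# Two hypothetical students, equally confident about "cat",
# that spread the remaining probability differently.
student_a = [0.70, 0.25, 0.05]  # agrees with the teacher on the near miss
student_b = [0.70, 0.05, 0.25]  # treats "car" as the near miss instead

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(teacher_probs, student_a)
soft_b = cross_entropy(teacher_probs, student_b)
```

The hard-label loss cannot tell the two students apart, because it only looks at the probability assigned to “cat”. The soft-target loss is lower for the student whose full distribution matches the teacher's.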
A simple workflow:
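One possible version of such a workflow, sketched in plain Python under heavy simplifying assumptions: collect inputs, label them once with the teacher, train the student on the teacher's distributions, then measure agreement on held-out inputs. `teacher_predict` is a hypothetical stand-in for an expensive model, and the student is a tiny linear classifier chosen only to keep the example short.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_predict(weights: List[List[float]], x: List[float]) -> List[float]:
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def teacher_predict(x: List[float]) -> List[float]:
    # Hypothetical teacher: fixed weights stand in for a large trained model.
    return linear_predict([[2.0, -1.0], [-1.0, 2.0], [-1.0, -1.0]], x)


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


rng = random.Random(0)

# Step 1: collect unlabeled inputs (random 2-D points here).
inputs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]

# Step 2: run the teacher once and cache its full output distributions.
soft_targets = [teacher_predict(x) for x in inputs]

# Step 3: train the student with SGD on cross-entropy against the soft targets.
weights = [[0.0, 0.0] for _ in range(3)]
learning_rate = 0.5
for _ in range(100):
    for x, target in zip(inputs, soft_targets):
        pred = linear_predict(weights, x)
        for k in range(3):
            grad_logit = pred[k] - target[k]  # d(cross-entropy) / d(logit_k)
            for j in range(2):
                weights[k][j] -= learning_rate * grad_logit * x[j]

# Step 4: check how often the student's top class matches the teacher's.
held_out = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
agreement = sum(
    argmax(linear_predict(weights, x)) == argmax(teacher_predict(x))
    for x in held_out
) / len(held_out)
```

In a real project the teacher call in step 2 is usually the expensive part, so its outputs are typically computed once and stored, and step 4 would also compare the student against ground-truth labels where they exist.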