PaPoo

What is top-p / nucleus sampling?

Top-p sampling, also called nucleus sampling, is a way to make a language model pick the next token from only the smallest set of likely options whose combined probability reaches a chosen threshold.

Why it matters

If you always choose the single most likely next token, model outputs can become repetitive and brittle. If you sample from all tokens, outputs can get noisy and incoherent.

Top-p gives you a middle ground: it keeps the model flexible, but trims away the long tail of very unlikely tokens before sampling. In practice, teams use it to control creativity and randomness without hard-coding a fixed number of candidates.

How it works

  1. The model assigns a probability to every possible next token.
  2. You sort tokens from most likely to least likely.
  3. You keep adding tokens, in that order, until their cumulative probability reaches or exceeds p (for example, 0.9).
  4. You discard the rest, renormalize the remaining probabilities, and sample one token from that smaller set.
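The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's implementation: the function names `top_p_filter` and `sample_top_p`, the dictionary input, and the example tokens are all assumptions made for this sketch.

```python
import random


def top_p_filter(probs, p):
    """Return the nucleus as a renormalized {token: probability} dict.

    `probs` maps each candidate token to its probability (step 1 is
    assumed to have been done by the model already).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the range (0, 1]")

    # Step 2: sort tokens from most likely to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    # Step 3: keep adding tokens until the running total reaches p.
    nucleus = []
    total = 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        total += prob
        if total >= p:
            break

    # Step 4: drop everything else and renormalize what is left.
    return {token: prob / total for token, prob in nucleus}


def sample_top_p(probs, p, rng=random):
    """Sample one token from the nucleus (hypothetical helper)."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = list(nucleus.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up distribution, purely for illustration.
example = {"the": 0.5, "a": 0.25, "an": 0.125, "this": 0.125}
print(sorted(top_p_filter(example, 0.75)))  # tokens that survive the cut
print(sample_top_p(example, 0.75))          # one sampled token
```

Real inference code usually works on tensors of logits rather than dictionaries, but the logic is the same: sort, accumulate, cut, renormalize, sample.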

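The key idea is that the set size is dynamic. On a predictable prompt, where a few tokens carry most of the probability, the nucleus may be small. On a more open-ended prompt, where probability is spread across many tokens, it may be larger. That is why top-p is often preferred over top-k sampling, which always keeps a fixed number of candidates, when you want the model to adapt to context.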
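To see the dynamic set size concretely, the sketch below counts how many tokens end up in the nucleus for a peaked distribution and for a flat one. The helper `nucleus_size` and both distributions are invented for this illustration.

```python
def nucleus_size(probs, p):
    """Count how many of the most likely tokens are needed to reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)


# Hypothetical distributions over four tokens.
peaked = [0.875, 0.0625, 0.03125, 0.03125]  # one token dominates
flat = [0.25, 0.25, 0.25, 0.25]             # every token equally likely

print(nucleus_size(peaked, 0.75))  # 1
print(nucleus_size(flat, 0.75))    # 3
```

With the same threshold, the nucleus holds one token in the peaked case and three in the flat case. A fixed cutoff of k = 2 would keep two tokens both times, regardless of how confident the model is.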

Top-p is usually paired with temperature. Temperature changes how peaked or flat the distribution is; top-p changes which tokens are even eligible to be sampled.
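The interaction between the two knobs can be sketched as follows. This assumes the common convention of dividing raw scores (logits) by the temperature before the softmax and applying the top-p cut afterwards; the function names and the logit values are hypothetical, and specific libraries may order or implement these steps differently.

```python
import math


def apply_temperature(logits, temperature):
    """Turn logits into probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def nucleus_indices(probs, p):
    """Return the indices of the tokens that stay eligible under top-p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four tokens

for temperature in (0.5, 2.0):
    probs = apply_temperature(logits, temperature)
    kept = nucleus_indices(probs, 0.9)
    print(temperature, [round(q, 3) for q in probs], kept)
```

A low temperature sharpens the distribution, so fewer tokens are needed to reach the same threshold; a high temperature flattens it, so more tokens become eligible.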

Tiny concrete example

Suppose the next-token probabilities are:

If top_p = 0.80, you keep A + B + C = 0.80 and drop D and E.

Then the model samples from just A, B, and C, after renormalizing their probabilities.
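The renormalization step can be checked with a short script. The individual probabilities for A through E are not listed above, so the values below are hypothetical, chosen only so that A + B + C = 0.80. Exact fractions are used to avoid floating-point rounding at the threshold.

```python
from fractions import Fraction

# Hypothetical probabilities (assumed, not from the original example).
probs = {
    "A": Fraction(40, 100),
    "B": Fraction(25, 100),
    "C": Fraction(15, 100),
    "D": Fraction(12, 100),
    "E": Fraction(8, 100),
}
top_p = Fraction(80, 100)

# Walk the tokens from most to least likely until the total reaches top_p.
kept = {}
total = Fraction(0)
for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    kept[token] = prob
    total += prob
    if total >= top_p:
        break

# Renormalize so the surviving probabilities sum to 1 again.
renormalized = {token: prob / total for token, prob in kept.items()}

for token, prob in renormalized.items():
    print(token, float(prob))
# A 0.5
# B 0.3125
# C 0.1875
```

Under these assumed values, D and E can never be chosen, and the relative odds among A, B, and C stay the same as before the cut.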

Common pitfalls / when NOT to use it

Related terms

