Top-p sampling, also called nucleus sampling, is a way to make a language model pick the next token from only the smallest set of likely options whose combined probability reaches a chosen threshold.
If you always choose the single most likely next token, model outputs can become repetitive and brittle. If you sample from all tokens, outputs can get noisy and incoherent.
Top-p gives you a middle ground: it keeps the model flexible, but trims away the long tail of very unlikely tokens before sampling. In practice, teams use it to control creativity and randomness without hard-coding a fixed number of candidates.
The threshold is the parameter p, for example 0.9.
The key idea is that the set size is dynamic. On a predictable prompt, where a few tokens hold most of the probability, the nucleus may be small. On a more open-ended prompt, where probability is spread across many tokens, it may be larger. That is why top-p is often preferred over fixed-k sampling when you want the model to adapt to context.
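To make the dynamic set size concrete, here is a minimal sketch that counts how many tokens land in the nucleus for two distributions. The helper name `nucleus_size` and both probability lists are made up for illustration; they do not come from any particular model or library.

```python
def nucleus_size(probs, top_p):
    """Count how many of the most likely tokens are needed to reach top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)  # top_p exceeds the total mass: keep everything


# Hypothetical next-token distributions, invented for this example.
peaked = [0.92, 0.04, 0.02, 0.01, 0.01]  # predictable prompt
flat = [0.20, 0.20, 0.20, 0.20, 0.20]    # open-ended prompt

peaked_size = nucleus_size(peaked, top_p=0.9)
flat_size = nucleus_size(flat, top_p=0.9)
print(peaked_size, flat_size)
```

With the same threshold, the peaked distribution needs a single token while the flat one needs all five. A fixed k would keep the same number of candidates in both cases.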
Top-p is usually paired with temperature. Temperature changes how peaked or flat the distribution is; top-p changes which tokens are even eligible to be sampled.
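One way the two settings can be combined is sketched below: temperature rescales the logits before the softmax, and top-p then decides which tokens stay eligible. This is an illustrative ordering under simple assumptions, not the implementation of any specific library; the function names and the logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature gives a more peaked result."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_p(logits, temperature, top_p, rng):
    """Return (sampled token index, list of indices that were eligible)."""
    probs = softmax_with_temperature(logits, temperature)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices accepts relative weights, so this is equivalent
    # to sampling from the renormalized nucleus.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept


# Hypothetical logits for five candidate tokens.
logits = [2.0, 1.0, 0.5, 0.0, -1.0]
rng = random.Random(0)
for temperature in (0.5, 1.0):
    token, kept = sample_top_p(logits, temperature, top_p=0.9, rng=rng)
    print(temperature, kept, token)
```

With these logits, the lower temperature concentrates probability on the top tokens, so fewer of them are needed to reach the same threshold.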
Suppose the next-token probabilities are:
A: 0.40
B: 0.25
C: 0.15
D: 0.10
E: 0.10
If top_p = 0.80, you keep A + B + C = 0.80 and drop D and E.
Then the model samples from just A, B, and C, after renormalizing their probabilities.
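The worked example can be reproduced with a short sketch. The function name `top_p_filter` is hypothetical, and the `math.isclose` check is an added safeguard so that floating-point rounding does not push the cumulative sum just under a threshold it should meet exactly.

```python
import math


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens reaching top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        # Tolerate rounding error when the sum lands exactly on the threshold.
        if cumulative >= top_p or math.isclose(cumulative, top_p):
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"A": 0.40, "B": 0.25, "C": 0.15, "D": 0.10, "E": 0.10}
nucleus = top_p_filter(probs, top_p=0.80)
print(nucleus)
```

After renormalizing, A, B, and C carry roughly 0.50, 0.31, and 0.19 of the probability, which is each original value divided by 0.80.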
Do not treat p as a quality score. A higher p is not automatically “better”; it just allows a larger candidate set.