PaPoo

What is speculative decoding?

Speculative decoding is a text-generation technique in which a fast “draft” model proposes several next tokens and a larger “target” model verifies them, often reducing end-to-end latency without changing the final output distribution.

Why it matters

Autoregressive LLMs normally generate one token at a time, and each token requires a full forward pass through the big model. That keeps the output faithful to the big model, but it is slow because the passes have to run one after another.

Speculative decoding matters when you want lower latency or higher throughput without switching to a smaller, lower-quality model. In practice, it is most useful when:

It is not a model change so much as a decoding strategy: you keep the same target model, but you change how tokens are proposed and checked.

How it works

The core idea is simple:

  1. A small draft model generates a short candidate continuation, for example a few tokens ahead.
  2. The large target model evaluates those candidates in parallel.
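  3. Drafted tokens are accepted from left to right for as long as they are consistent with the target model: under greedy decoding, a token is accepted if it is the one the target model would have picked; under sampling, it is accepted with a probability that depends on how the two models score it.
  4. At the first rejected token, the rest of the draft is discarded, the target model supplies the next token itself, and the process repeats.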
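
The four steps can be sketched in a few lines of Python. This is a minimal illustration of the greedy case only, not a production implementation: the two "models" are hypothetical toy functions that map a token list to the next token, and the verification in step 2 is written as a plain loop, whereas a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical interface: a "model" is a function from context to next token.
NextToken = Callable[[List[int]], int]


def plain_greedy(target_next: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def speculative_greedy(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy draft-and-verify loop; returns the prompt plus generated tokens."""
    if k < 1:
        raise ValueError("k must be at least 1")
    out = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(out) < goal:
        # Step 1: the draft model proposes up to k tokens.
        draft: List[int] = []
        for _ in range(min(k, goal - len(out))):
            draft.append(draft_next(out + draft))
        # Steps 2-4: check each drafted token against the target's own choice.
        # On agreement the token is kept; on the first disagreement the
        # target's token is emitted instead and the rest of the draft is dropped.
        for tok in draft:
            expected = target_next(out)
            out.append(expected)
            if expected != tok:
                break
    return out


# Toy models (assumptions for illustration only).
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 4.
    guess = toy_target(ctx)
    return (guess + 1) % 10 if len(ctx) % 4 == 0 else guess
```

With these toy functions, `speculative_greedy(toy_target, toy_draft, [1], 12)` returns the same sequence as `plain_greedy(toy_target, [1], 12)`, which is the point of the method. The sketch leaves out one common refinement: when the whole draft is accepted, the same target pass also yields one extra token for free.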

This works because the target model can often verify multiple tokens in one pass more cheaply than generating them one by one. If the draft model is reasonably aligned with the target model, many proposed tokens get accepted, so you save time.
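
To get a feel for how much the acceptance rate matters, here is a rough back-of-the-envelope helper. It rests on a simplifying assumption that is not stated in this article: each drafted token is accepted independently with the same probability `alpha`, and the target model always contributes one token per pass. Under that assumption, a pass with `k` drafted tokens yields 1 + alpha + alpha^2 + ... + alpha^k tokens on average. The function name and parameters are illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Average tokens produced per target pass under an i.i.d. acceptance model.

    alpha: assumed probability that a single drafted token is accepted.
    k:     number of tokens the draft model proposes per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    # Geometric sum: 1 + alpha + alpha^2 + ... + alpha^k
    return sum(alpha ** i for i in range(k + 1))


for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 3))
```

This counts target passes only. It ignores the cost of running the draft model, so the real speedup is smaller than the number it prints.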

The important guarantee from the original method is that, when implemented correctly, the final token distribution matches the target model’s distribution. In other words, it speeds up decoding without turning the result into “whatever the small model guessed.”
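
For the sampling case, the guarantee comes from a specific accept-or-resample rule. The sketch below shows that rule for a single token position, assuming `p` is the target model's probability vector and `q` is the draft model's (both are made-up lists here, and the function names are illustrative). A drafted token `x` is kept with probability min(1, p[x] / q[x]); otherwise a replacement is drawn from the normalised positive part of `p - q`. The helper `output_distribution` computes the resulting distribution exactly so the claim can be checked rather than taken on faith.

```python
import random
from typing import List


def residual(p: List[float], q: List[float]) -> List[float]:
    """Normalised max(0, p - q): the distribution sampled after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # total is 0 only when p equals q, in which case nothing is ever rejected.
    return [d / total for d in diff] if total > 0 else list(p)


def verify_token(x: int, p: List[float], q: List[float], rng: random.Random) -> int:
    """Keep drafted token x (assumed sampled from q) or replace it."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    return rng.choices(range(len(p)), weights=residual(p, q))[0]


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of verify_token's result when x is drawn from q."""
    # Probability of drafting x and accepting it: q[x] * min(1, p[x]/q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    rejected_mass = 1.0 - sum(accepted)
    res = residual(p, q)
    return [a + rejected_mass * r for a, r in zip(accepted, res)]


p = [0.5, 0.3, 0.2]  # made-up target probabilities
q = [0.2, 0.2, 0.6]  # made-up draft probabilities
print(output_distribution(p, q))
```

For these example vectors the computed distribution equals `p` up to floating-point rounding, even though every candidate token came from `q`.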

Tiny concrete example

Suppose the target model would likely continue:

“The capital of France is Paris.”

A draft model proposes:

“The capital of France is Paris and”

The target model checks those proposed tokens. If it agrees through “Paris” but not “and,” it accepts the matching prefix and then samples the next token itself, continuing generation from there.
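
The same round can be written out with word-level tokens. The split below is an assumption made for illustration: the prompt is taken to be "The capital of France is", the draft proposes "Paris" and "and", and the target's own choices at those two positions are "Paris" and ".". Real tokenizers split text differently.

```python
from typing import List


def accepted_prefix(draft: List[str], target_choices: List[str]) -> int:
    """Number of leading drafted tokens that match the target's own choices."""
    n = 0
    for d, t in zip(draft, target_choices):
        if d != t:
            break
        n += 1
    return n


prompt = ["The", "capital", "of", "France", "is"]
draft = ["Paris", "and"]        # proposed by the draft model
target_choices = ["Paris", "."]  # what the target would emit at each position

n = accepted_prefix(draft, target_choices)
# Keep the accepted prefix, then take the target's token at the first mismatch.
result = prompt + draft[:n] + target_choices[n:n + 1]
print(" ".join(result))
```

Here one drafted token is accepted, "and" is rejected, and the target's "." takes its place, so this round produced two tokens from a single verification pass.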

So the user sees the same kind of output they would have gotten from the target model, but some of the work was done by a cheaper proposer.

Common pitfalls / when NOT to use it

In practice, teams usually try it when they already have a strong serving setup and want more tokens per second without sacrificing output quality.

Related terms
