2026-06-17

What is speculative decoding?

Speculative decoding is a text-generation trick that lets a fast “draft” model propose several next tokens and then uses a larger “target” model to verify them, often reducing end-to-end latency without changing the final output distribution.

Why it matters

Autoregressive LLMs normally generate one token at a time, and each token requires a full forward pass through the large model. The output is exactly what the model would produce, but generation is slow because each pass has to wait for the previous token.

Speculative decoding matters when you want lower latency or higher throughput without switching to a smaller, lower-quality model. In practice, it is most useful when:

the target model is expensive to run,
you can afford an extra smaller model alongside it,
and your workload is generation-heavy, such as chat, autocomplete, or code synthesis.

It is not a model change so much as a decoding strategy: you keep the same target model, but you change how tokens are proposed and checked.

How it works

The core idea is simple:

A small draft model generates a short candidate continuation, for example a few tokens ahead.
The large target model evaluates those candidates in parallel.
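Drafted tokens are accepted from left to right while they stay consistent with the target model: with greedy decoding, a token is accepted if it is the one the target would have chosen; with sampling, it is accepted with a probability that depends on how the target and draft probabilities compare.
At the first rejected token, the remaining drafted tokens are discarded, the target model supplies the next token itself, and the process repeats from there.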
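
The steps above can be sketched for the simplest case, greedy decoding. This is a minimal illustration, not a production implementation: `target_next` and `draft_next` are toy stand-ins invented here for real models, and the per-position loop over the target stands in for what a real system would do in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy "next token" function


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[Token]) -> Token:
    """Toy stand-in for the draft model: agrees with the target except
    when the context length is a multiple of 4."""
    guess = target_next(context)
    return (guess + 1) % 50 if len(context) % 4 == 0 else guess


def plain_greedy(prompt: List[Token], target: NextTokenFn, num_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out


def speculative_greedy(prompt: List[Token], draft: NextTokenFn, target: NextTokenFn,
                       num_new: int, k: int = 4) -> List[Token]:
    out = list(prompt)
    goal = len(prompt) + num_new
    while len(out) < goal:
        # 1. The draft model proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target model gives its own choice at each of the k + 1 positions
        #    (looped here for clarity; batched into one pass in a real system).
        target_choices = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        n = 0
        while n < k and proposal[n] == target_choices[n]:
            n += 1
        # 4. Keep the accepted tokens, then append the target's own token:
        #    a correction if n < k, or a free extra token if everything matched.
        out.extend(proposal[:n])
        out.append(target_choices[n])
    return out[:goal]


prompt = [1, 2, 3]
assert speculative_greedy(prompt, draft_next, target_next, 20) == plain_greedy(prompt, target_next, 20)
```

The final assertion is the point of the sketch: under greedy decoding, the speculative loop returns the same tokens as running the target model alone, however often the draft is wrong.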

This works because the target model can often verify multiple tokens in one pass more cheaply than generating them one-by-one. If the draft model is reasonably aligned with the target model, many proposed tokens get accepted, so you save time.
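
To get a rough feel for how acceptance translates into savings, here is a back-of-the-envelope estimate. It rests on simplifying assumptions that are not guaranteed in practice: each drafted token is accepted independently with the same probability `alpha`, verifying the drafted positions costs about the same as one ordinary target step, and `cost_ratio` (draft step time divided by target step time) is constant. The function and parameter names are illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass when k tokens are drafted and
    each is accepted independently with probability alpha (an assumption)."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus one token from the target
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Rough speedup over plain decoding.
    One round costs k draft steps plus one target pass, in units of target steps."""
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1.0)


# A well-aligned, cheap draft model versus a weak, relatively costly one.
good = estimated_speedup(alpha=0.8, k=4, cost_ratio=0.05)
poor = estimated_speedup(alpha=0.2, k=4, cost_ratio=0.3)
print(f"good draft: {good:.2f}x, poor draft: {poor:.2f}x")
```

Under these assumptions the first configuration comes out at about 2.8x and the second at about 0.57x, that is, slower than not speculating at all. Real numbers depend on hardware, batch size, and how acceptance varies across positions.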

The important guarantee from the original method is that, when implemented correctly, the final token distribution matches the target model’s distribution. In other words, it speeds up decoding without turning the result into “whatever the small model guessed.”
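
For sampling, the guarantee comes from a specific accept/reject rule: a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution; on rejection, a replacement is drawn from the normalised positive part of p - q. The sketch below shows one such step and checks with exact fractions that the emitted token is distributed as p. The function names and the three-token distributions are made up for illustration.

```python
import random
from fractions import Fraction as F


def accept_or_resample(x, p, q, rng):
    """One verification step. x is a token index already drawn from q.
    Returns the token index that is actually emitted."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # accepted
    # Rejected: sample from the residual distribution max(0, p - q), normalised.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]


def emitted_distribution(p, q):
    """Exact distribution of the emitted token for one step."""
    # P(x drafted and accepted) = q[x] * min(1, p[x] / q[x]) = min(p[x], q[x])
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1 - sum(accepted)
    residual = [max(F(0), pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0:
        return accepted  # p equals q, so nothing is ever rejected
    return [a + reject_prob * r / total for a, r in zip(accepted, residual)]


p = [F(1, 2), F(3, 10), F(1, 5)]   # hypothetical target distribution
q = [F(1, 5), F(1, 2), F(3, 10)]   # hypothetical draft distribution
assert emitted_distribution(p, q) == p
```

The draft distribution here is deliberately a poor match for the target, and the emitted distribution still equals p exactly. A poor draft lowers the acceptance rate, not the output quality.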

Tiny concrete example

Suppose the target model would likely continue:

“The capital of France is Paris.”

A draft model proposes:

“The capital of France is Paris and”

The target model checks those proposed tokens. If it agrees through “Paris” but not “and,” it accepts the matching prefix and then samples the next token itself, continuing generation from there.

So the user sees output drawn from the same distribution the target model alone would have produced, but some of the work was done by a cheaper proposer.
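
The same example can be written as a single verification step. The token strings and the target's per-position choices below are hypothetical values chosen to match the example, and greedy decoding is assumed for simplicity.

```python
def verify(proposed, target_choices):
    """Return (accepted_prefix, next_token) for one verification step.
    target_choices[i] is the target's pick after the prompt plus proposed[:i],
    so it has one more entry than proposed."""
    n = 0
    while n < len(proposed) and proposed[n] == target_choices[n]:
        n += 1
    return proposed[:n], target_choices[n]


# Prompt: "The capital of France is"
proposed = [" Paris", " and"]              # drafted continuation
target_choices = [" Paris", ".", " the"]   # hypothetical target picks per position

accepted, next_token = verify(proposed, target_choices)
assert accepted == [" Paris"]   # matching prefix is kept
assert next_token == "."        # target replaces the rejected " and"
```

One target pass therefore yields two tokens here (" Paris" and "."), instead of the single token a plain decoding step would have produced.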

Common pitfalls / when NOT to use it

If the draft model is too weak, acceptance rates drop. Then you pay overhead for little benefit.
It adds system complexity. You need two models or two decoding paths, plus orchestration logic.
It is not a universal speedup. For very small target models or very short outputs, the drafting and verification overhead can outweigh the gains.
It does not fix memory bottlenecks by itself, and the draft model adds its own weights and cache on top. If your main problem is KV-cache size or batch fragmentation, speculative decoding is not the first lever to pull.
It is sometimes confused with beam search or distillation. Those are different: beam search changes the search strategy; speculative decoding changes how you propose and verify tokens; distillation changes the model itself.

In practice, teams usually try it when they already have a strong serving setup and want more tokens per second without sacrificing output quality.