Streaming, or token streaming, means an AI model sends its output incrementally as it is generated, instead of holding it back until the whole answer is finished.
This solves the “dead air” problem: users see the first words almost immediately, which makes apps feel faster and more interactive. It is especially useful for chatbots, copilots, and any UI where a short time-to-first-token matters more than the total time to complete the response.
In practice, teams use streaming when they want:
Large language models generate text one token at a time. A token is a chunk of text, often a word piece rather than a whole word. In streaming mode, the server exposes each newly generated token (or small group of tokens) as soon as it is available.
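As a rough illustration of the paragraph above, the sketch below imitates a streaming server with a Python generator. The name `stream_tokens`, the space-based splitting, and the `delay` parameter are assumptions made for this example. Real models use subword tokenizers, and real APIs deliver chunks over a network connection.

```python
import time
from typing import Iterator


def stream_tokens(text: str, delay: float = 0.0) -> Iterator[str]:
    """Yield `text` piece by piece to mimic token streaming.

    Hypothetical stand-in for a model: it splits on spaces instead of
    using a real tokenizer, and `delay` imitates generation time.
    """
    words = text.split(" ")
    for i, word in enumerate(words):
        if delay:
            time.sleep(delay)
        # Re-attach the space that split() removed, except before the first piece.
        yield word if i == 0 else " " + word


if __name__ == "__main__":
    for piece in stream_tokens("Paris is the capital of France.", delay=0.1):
        # Each piece is printed as soon as the generator yields it.
        print(piece, end="", flush=True)
    print()
```

Because the generator yields each piece before producing the next one, the caller can display output while later pieces are still being produced, which is the property streaming relies on.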
Typically:
The important point is that streaming changes delivery, not the model’s core task. The model still generates the same kind of output; you just receive it progressively instead of in one final payload.
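One way to see that only delivery changes is to compare the two modes over the same generated pieces. This is a minimal sketch: `CHUNKS`, `respond_streaming`, and `respond_non_streaming` are hypothetical names, and the chunk boundaries are invented for illustration.

```python
from typing import Dict, Iterator, List

# Hypothetical pieces a model might emit; the boundaries are made up.
CHUNKS: List[str] = ["Paris", " is", " the capital", " of France."]


def respond_streaming() -> Iterator[str]:
    """Hand each piece to the caller as soon as it is 'generated'."""
    for chunk in CHUNKS:
        yield chunk


def respond_non_streaming() -> Dict[str, str]:
    """Buffer every piece, then return one final payload."""
    return {"text": "".join(CHUNKS)}


streamed_text = "".join(respond_streaming())
full_text = respond_non_streaming()["text"]
print(streamed_text == full_text)  # True: same content, different delivery
```

The streaming path exposes intermediate pieces and the buffered path hides them, but joining the streamed pieces gives the same text as the single payload.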
Non-streaming response:
{
  "text": "Paris is the capital of France."
}
Streaming response, shown piece by piece:
Paris
is
the capital
of France.
A chat UI might render this as:
Paris is the capital of France.
with the text appearing gradually as the model generates it.
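On the client side, rendering usually comes down to appending each incoming piece to the text already on screen. The sketch below shows that accumulation. `render_states` is a hypothetical helper, and the leading spaces in the chunks are an assumption about how the pieces above would arrive.

```python
from typing import Iterable, Iterator


def render_states(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the text a chat bubble would show after each chunk arrives."""
    shown = ""
    for chunk in chunks:
        shown += chunk  # append the new piece to what is already displayed
        yield shown


# Chunk boundaries follow the example above; the leading spaces are assumed.
chunks = ["Paris", " is", " the capital", " of France."]
for state in render_states(chunks):
    print(state)
```

Each printed line is one intermediate state of the message, and the last one is the complete sentence. A real UI would redraw a single element in place instead of printing a new line per state.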