PaPoo

What is streaming (token streaming)?

Streaming, or token streaming, is when an AI model sends you its output incrementally as it is generated instead of waiting until the whole answer is finished.

Why it matters

Streaming solves the “dead air” problem: users see the first words almost immediately, which makes apps feel faster and more interactive. It is especially useful for chatbots, copilots, and any UI where a short time-to-first-token matters more than how long the full response takes.

In practice, teams use streaming when they want:

How it works

Large language models generate text one token at a time. A token is a chunk of text, often a word piece rather than a whole word. In streaming mode, the server exposes each newly generated token (or small group of tokens) as soon as it is available.

Typically:

  1. The client sends a generation request.
  2. The model starts decoding tokens.
  3. The server pushes partial output back over a streaming channel, such as Server-Sent Events (SSE) or HTTP chunked transfer encoding.
  4. The client appends those chunks to the visible answer until the stream ends.
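The client side of these steps can be sketched in a few lines. This is a minimal illustration, not any particular provider's API: the stream is simulated with an in-memory list of Server-Sent Events lines, and both the `{"text": ...}` payload shape and the `[DONE]` end marker are assumptions made for the example.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of every `data:` line in an SSE-style stream."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other fields
        value = line[len("data:"):]
        # SSE allows one optional space after the colon; drop it if present.
        if value.startswith(" "):
            value = value[1:]
        yield value


def collect_answer(lines: Iterable[str]) -> str:
    """Append each streamed chunk to the answer until the stream ends."""
    answer = ""
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":  # hypothetical end-of-stream marker
            break
        answer += json.loads(payload)["text"]  # hypothetical payload shape
    return answer


# Simulated server output; a real client would read these lines from a socket.
fake_stream = [
    'data: {"text": "Paris"}\n',
    "\n",
    'data: {"text": " is"}\n',
    "\n",
    'data: {"text": " the capital"}\n',
    "\n",
    'data: {"text": " of France."}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]

print(collect_answer(fake_stream))
```

In a real UI, the `answer += ...` step is where the visible text would be updated, once per chunk rather than once per response.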

The important point is that streaming changes delivery, not the model’s core task. The model still generates the same kind of output; you just receive it progressively instead of in one final payload.
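To make the delivery-versus-content distinction concrete, here is a small sketch in which one token source backs both modes. The function names and the canned token list are hypothetical stand-ins for a real model; the only point is that joining the streamed chunks reproduces the non-streaming payload.

```python
from typing import Dict, Iterator, List

# Hypothetical stand-in for what a model would decode for some prompt.
CANNED_TOKENS: List[str] = ["Paris", " is", " the capital", " of France."]


def generate_tokens(prompt: str) -> Iterator[str]:
    """Pretend decoder: yields one chunk of text at a time."""
    for token in CANNED_TOKENS:
        yield token


def respond(prompt: str) -> Dict[str, str]:
    """Non-streaming delivery: wait for every token, return one payload."""
    return {"text": "".join(generate_tokens(prompt))}


def respond_streaming(prompt: str) -> Iterator[str]:
    """Streaming delivery: hand each chunk to the caller as it is produced."""
    yield from generate_tokens(prompt)


prompt = "What is the capital of France?"
full = respond(prompt)["text"]
streamed = "".join(respond_streaming(prompt))
print(full == streamed)
```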

Tiny concrete example

Non-streaming response:

{
  "text": "Paris is the capital of France."
}

Streaming response, shown piece by piece:

Paris
 is
 the capital
 of France.

A chat UI might render this as:

Paris is the capital of France.

with the text appearing gradually as the model generates it.
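A terminal version of that gradual rendering might look like the sketch below. The `render_stream` helper and the fixed `delay_s` pause are assumptions for illustration; a real client would receive chunks at whatever pace the model produces them.

```python
import sys
import time
from typing import Iterable, TextIO


def render_stream(
    chunks: Iterable[str], out: TextIO = sys.stdout, delay_s: float = 0.0
) -> str:
    """Write each chunk as soon as it arrives and return the full text."""
    shown = ""
    for chunk in chunks:
        out.write(chunk)
        out.flush()  # push the partial text to the screen immediately
        shown += chunk
        time.sleep(delay_s)  # simulated gap between chunks
    out.write("\n")
    return shown


final_text = render_stream(
    ["Paris", " is", " the capital", " of France."], delay_s=0.2
)
```

Flushing after every write matters here: without it, output buffering can hold the chunks back and show them all at once, which defeats the purpose of streaming.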

Common pitfalls / when NOT to use it

Related terms
