PaPoo

What is time to first token (TTFT)?

Time to first token (TTFT) is the time between sending a generative AI request and receiving the model’s first output token.

Why it matters

TTFT is the part of latency users feel first. If it is high, the system can seem “stuck” even if the full answer eventually streams quickly.

You usually care about TTFT when you want:

In practice, teams often optimize TTFT before chasing total response time, because a fast first token makes the product feel much more interactive.

How it works

A model does not usually begin emitting text the instant your request arrives. Before the first token appears, several steps may happen:

  1. Request handling: the server receives the prompt, validates it, and queues it if needed.
  2. Prompt processing: the model runs the full input through a forward pass (often called prefill) and builds the internal state it needs for generation.
  3. First decode step: the model generates the first output token.
  4. Streaming delivery: the token is sent back to the client, which records the elapsed time.

TTFT is typically measured from the moment the request is sent, or from when the server accepts it, until the first token is received by the client. The exact measurement point can vary by system, so compare TTFT numbers only when the measurement method is the same.
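As a rough illustration of client-side measurement, here is a minimal Python sketch. It assumes the model client exposes its response as an iterable of tokens or chunks. The names measure_ttft and fake_stream are hypothetical, and fake_stream only simulates delays with sleep calls rather than calling a real model.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttft(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for one streamed request."""
    sent_at = clock()                 # measurement starts when the request is sent
    ttft: Optional[float] = None
    count = 0
    for _token in start_stream():     # each item is one streamed token (or chunk)
        if ttft is None:
            ttft = clock() - sent_at  # first token received by the client
        count += 1
    total = clock() - sent_at         # time until the stream is exhausted
    return ttft, total, count


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming model client."""
    time.sleep(0.05)                  # simulated queueing, prompt processing, first decode
    yield "Here"
    for tok in [" is", " the", " async", " version"]:
        time.sleep(0.01)              # simulated per-token decode time
        yield tok


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_stream)
    print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms, tokens: {n}")
```

Because the timer starts before the stream is opened, this version includes network and queueing time. A server-side measurement would start the clock when the request is accepted, which is why numbers from the two methods should not be compared directly.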

TTFT is different from:

Tiny concrete example

A user asks a coding assistant:

User: "Rewrite this function to be async."

The assistant starts streaming, and the first token reaches the user 180 ms after the request was sent.

Here, the TTFT is 180 ms. The rest of the answer may still take much longer, but the user already sees that the system is alive.

Common pitfalls / when NOT to use it

If you are building a streaming chat UX, TTFT is usually one of the best metrics to watch. If you are generating long documents or running offline inference, it may be less important than overall throughput.

Related terms
