Time to first token (TTFT) is the time between sending a generative AI request and receiving the model’s first output token.
TTFT is the part of latency users feel first. If it is high, the system can seem “stuck” even if the full answer eventually streams quickly.
You usually care about TTFT when you want an application to feel responsive, as in a streaming chat interface.
In practice, teams often optimize TTFT before chasing total response time, because a fast first token makes the product feel much more interactive.
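A model does not usually begin emitting text the instant your request arrives. Before the first token appears, several steps may happen, such as the request waiting in a queue, the prompt being tokenized, and the model processing the full prompt.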
TTFT is typically measured from the moment the request is sent, or from when the server accepts it, until the first token is received by the client. The exact measurement point can vary by system, so compare TTFT numbers only when the measurement method is the same.
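One way to make the measurement point explicit is to time the stream on the client. The sketch below is a minimal illustration, not a reference implementation: `measure_ttft` and `fake_stream` are hypothetical names, and `fake_stream` only simulates a streaming model client with `time.sleep`. The timer starts just before the request is issued, so the result includes everything that happens before the first token reaches the client.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttft(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for one request.

    Client-side measurement: the clock starts just before the request
    is issued and TTFT is recorded when the first token is received.
    """
    sent_at = clock()
    ttft: Optional[float] = None
    count = 0
    for _token in start_stream():
        if ttft is None:
            # First token received: this elapsed time is the TTFT.
            ttft = clock() - sent_at
        count += 1
    total = clock() - sent_at
    return ttft, total, count


def fake_stream(first_delay: float = 0.05, gap: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model client (assumption)."""
    time.sleep(first_delay)  # simulated wait before the first token
    for token in ["Sure,", " here", " is"]:
        yield token
        time.sleep(gap)  # simulated time between tokens


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_stream)
    print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms, tokens: {n}")
```

If the stream yields no tokens, `ttft` stays `None`, which keeps a failed request from being recorded as a fast one. To measure from server acceptance instead, the same logic would need a server-side timestamp, which this sketch does not model.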
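TTFT is different from total response time, which covers the whole answer, and from throughput, which describes how quickly tokens are produced.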
A user asks a coding assistant:
User: "Rewrite this function to be async."
The assistant starts streaming:
"Sure," "here" "is"
Here, the TTFT is 180 ms. The rest of the answer may still take much longer, but the user already sees that the system is alive.
If you are building a streaming chat UX, TTFT is usually one of the best metrics to watch. If you are generating long documents or running offline inference, it may be less important than overall throughput.
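To see how TTFT sits alongside the other numbers, it can help to compute them from the same set of token arrival times. The sketch below is illustrative only: `summarize` and `LatencySummary` are hypothetical names, and the arrival times are made up, except that the first token lands at 180 ms as in the coding-assistant example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatencySummary:
    ttft_ms: float      # time until the first token arrived
    total_ms: float     # time until the last token arrived
    mean_gap_ms: float  # average time between consecutive tokens


def summarize(arrivals_ms: List[float]) -> LatencySummary:
    """Summarize token arrival times, in ms since the request was sent."""
    if not arrivals_ms:
        raise ValueError("no tokens were received")
    ttft = arrivals_ms[0]
    total = arrivals_ms[-1]
    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [b - a for a, b in zip(arrivals_ms, arrivals_ms[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return LatencySummary(ttft_ms=ttft, total_ms=total, mean_gap_ms=mean_gap)


# Hypothetical arrival times; only the 180 ms first token comes from the example.
arrivals = [180.0, 210.0, 240.0, 270.0]
summary = summarize(arrivals)
print(summary)
```

With these made-up numbers the TTFT is 180 ms, the last token arrives at 270 ms, and the mean gap between tokens is 30 ms. A long answer would push the total far higher while leaving the TTFT unchanged.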