Tokens per second, or throughput, is the rate at which a language model generates or processes tokens over time.
Throughput is a practical way to describe model speed. If you care about user experience, cost efficiency, or server capacity, tokens per second tells you how much text the system can handle in a given time.
You’d use it when comparing models, sizing infrastructure, or setting expectations for streaming responses. In practice, teams often care about both throughput and latency: a model can have high overall throughput and still feel slow if the first token takes a long time to arrive.
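A token is a chunk of text the model reads or writes, such as a word, part of a word, or a punctuation mark. Throughput is usually measured as the number of tokens divided by the elapsed time in seconds: tokens per second = tokens / seconds.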
The exact number depends on what you measure. For a single request, you might divide the number of output tokens by the time it took to generate them. For a server, you might measure aggregate throughput across many concurrent requests.
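The difference between per-request and aggregate throughput can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RequestStats` record, both function names, and the sample timings are hypothetical and do not come from any particular serving framework.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    """Hypothetical record of one completed request (times in seconds)."""
    output_tokens: int
    start_s: float
    end_s: float


def request_throughput(r: RequestStats) -> float:
    """Output tokens per second for a single request."""
    duration = r.end_s - r.start_s
    if duration <= 0:
        raise ValueError("request duration must be positive")
    return r.output_tokens / duration


def aggregate_throughput(requests: list[RequestStats]) -> float:
    """Total output tokens divided by the wall-clock window that
    spans all requests (earliest start to latest end)."""
    if not requests:
        raise ValueError("need at least one request")
    window = max(r.end_s for r in requests) - min(r.start_s for r in requests)
    if window <= 0:
        raise ValueError("measurement window must be positive")
    return sum(r.output_tokens for r in requests) / window


# Made-up example: three overlapping requests of 100 output tokens each.
requests = [
    RequestStats(output_tokens=100, start_s=0.0, end_s=4.0),
    RequestStats(output_tokens=100, start_s=0.5, end_s=4.5),
    RequestStats(output_tokens=100, start_s=1.0, end_s=5.0),
]

per_request = [request_throughput(r) for r in requests]
aggregate = aggregate_throughput(requests)
print(per_request)  # each request sees 25 tokens per second
print(aggregate)    # the server as a whole produces 60 tokens per second
```

In this made-up example each request runs at 25 tokens per second, while the server produces 60 tokens per second overall because the requests overlap in time. The two numbers describe the same system from different points of view, which is why it helps to state which one is being reported.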
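In real systems, throughput is affected by model size, hardware, batch size, prompt length, context length, decoding strategy, and concurrency. Official model docs and benchmarking papers often report throughput separately from latency, because the two metrics answer different questions: throughput describes how much work gets done per unit of time, while latency describes how long a single request waits.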
If a model returns 300 output tokens in 6 seconds, its output throughput is 300 / 6 = 50 tokens per second.
That does not mean the user waited 6 seconds to see the first word. A streamed response might show the first token much earlier, while total throughput is still measured over the full generation.
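To make the distinction concrete, here is a minimal Python sketch that computes both the time to the first token and the overall throughput from a list of token arrival times. The function name `stream_metrics` and the simulated timestamps are assumptions made for illustration; a real client would record a timestamp as each streamed token arrives.

```python
def stream_metrics(arrival_times_s, request_start_s=0.0):
    """Return (time_to_first_token, tokens_per_second) for one streamed
    response, given the arrival time of each token in seconds."""
    if not arrival_times_s:
        raise ValueError("need at least one token")
    time_to_first_token = arrival_times_s[0] - request_start_s
    total_time = arrival_times_s[-1] - request_start_s
    if total_time <= 0:
        raise ValueError("total generation time must be positive")
    return time_to_first_token, len(arrival_times_s) / total_time


# Simulated stream (assumed numbers): 300 tokens, the first arriving
# 0.5 s after the request and the last arriving at 6.0 s.
step = 5.5 / 299
arrivals = [0.5 + i * step for i in range(299)] + [6.0]

ttft, tokens_per_second = stream_metrics(arrivals)
print(f"time to first token: {ttft:.2f} s")
print(f"output throughput:   {tokens_per_second:.2f} tokens/s")
```

With these simulated numbers the user sees the first token after half a second, yet the throughput over the full generation is still 50 tokens per second, matching the 300-token, 6-second example.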