The KV cache, or key-value cache, is a memory buffer that stores past attention keys and values so a transformer can generate the next token without recomputing everything from scratch.
In autoregressive generation, each new token depends on all previous tokens. Without a KV cache, the model would repeatedly re-run attention over the entire prefix at every step, which is much slower than necessary.
You reach for a KV cache when you want faster text generation, lower latency, or higher throughput in decoder-only transformers and similar LLM serving setups. In practice, it is one of the main reasons incremental decoding is feasible at useful speeds.
A transformer attention layer turns each token into three things: queries, keys, and values. During generation, the model only needs to compute a new query for the latest token, but it still needs access to the keys and values for all previous tokens to attend over them.
The KV cache stores those previous keys and values layer by layer. When the next token is generated, the model appends the new token’s key and value to the cache, then uses the cached history plus the new query to compute attention. This avoids recomputing keys and values for the whole prefix on every step.
Most implementations keep a separate cache for each attention layer, and the cache grows as the sequence grows. That speedup comes with a memory cost: longer contexts and larger batch sizes require more GPU memory, so serving systems often balance cache size, precision, and batching strategy.
Suppose a model has already processed:
“The cat sat”
It has cached the key/value tensors for those three tokens in each layer. Now it wants to generate the next token.