2026-06-17

What is batching / continuous batching?

Batching is the practice of grouping multiple requests together so a model can process them more efficiently, and continuous batching is the version where new requests are added to that group while others are still running.

Why it matters

Large models are expensive to run, and a GPU is often underused when it serves one request at a time. Batching improves throughput by spreading the fixed cost of each forward pass, especially reading the model weights from memory, across many requests. Continuous batching goes a step further: it keeps the GPU busy by letting the server admit new requests at the next decode step instead of waiting for a whole batch to finish.

You reach for batching when you care about throughput, cost, and GPU utilization more than the absolute lowest latency for a single request. In practice, most production inference systems use some form of batching because it is one of the simplest ways to serve more traffic per GPU.

How it works

In plain batching, the server collects several requests, pads or groups them so they share a common shape, and sends them through the model in a single forward pass. This works well when the requests are similar in shape, such as same-length inputs or a fixed-size image batch.
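As a rough illustration of the padding step, here is a minimal sketch in plain Python. The names `pad_batch`, `padding_fraction`, and `PAD_ID` are hypothetical, and right-padding is just one choice made for the example; real serving stacks do this on tensors and may pad on the left.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id, chosen for illustration


def pad_batch(seqs: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token-id sequences to a common length and build a 0/1 mask."""
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, mask


def padding_fraction(mask: List[List[int]]) -> float:
    """Share of batch positions that are padding rather than real tokens."""
    total = sum(len(row) for row in mask)
    real = sum(sum(row) for row in mask)
    return (total - real) / total


requests = [[11, 12, 13, 14], [21, 22], [31]]
batch, mask = pad_batch(requests)
wasted = padding_fraction(mask)
```

With these three uneven inputs, 5 of the 12 batch positions are padding, which is why plain batching works best when request shapes are similar.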

For LLM inference, there are usually two phases: prefill and decode. Prefill processes the input prompt, and decode generates tokens one step at a time. The tricky part is that decode is iterative: different requests finish at different times, so a fixed batch becomes inefficient as some sequences end early and their slots sit idle until the whole batch is done.

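Continuous batching solves that by treating generation as a moving queue. After each decoding step, the server removes finished requests and inserts newly arrived ones (once their prompts have been prefilled) into the active batch. This keeps the batch full without forcing new traffic to wait for the entire previous batch to complete. Serving systems also call this “in-flight batching” or “iteration-level scheduling”. “Dynamic batching” is sometimes used for it as well, although that term often refers to the request-level approach of collecting requests for a short window and then running them as one fixed batch. The exact implementation details vary, but the core idea is the same.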
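The admit, decode, remove cycle described above can be sketched as a small simulation. This is a toy model under stated assumptions, not how any particular server implements it: `Request`, `continuous_batching`, and `max_batch` are hypothetical names, every request is assumed to emit exactly one token per step, and prefill cost and KV-cache memory are ignored.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    name: str
    arrival_step: int   # first decode step at which the request is available
    tokens_needed: int  # tokens to generate before the request is done
    tokens_done: int = 0


def continuous_batching(requests: List[Request], max_batch: int) -> List[List[str]]:
    """Return the names in the active batch at each decode step."""
    pending: Deque[Request] = deque(sorted(requests, key=lambda r: r.arrival_step))
    active: List[Request] = []
    trace: List[List[str]] = []
    step = 1
    while pending or active:
        # Admission: fill free slots with requests that have already arrived.
        while pending and len(active) < max_batch and pending[0].arrival_step <= step:
            active.append(pending.popleft())
        # One decode step: every active request produces one token.
        for r in active:
            r.tokens_done += 1
        trace.append([r.name for r in active])
        # Removal: drop finished requests so their slots free up for the next step.
        active = [r for r in active if r.tokens_done < r.tokens_needed]
        step += 1
    return trace


trace = continuous_batching(
    [Request("r1", 1, 3), Request("r2", 1, 1), Request("r3", 2, 2)],
    max_batch=2,
)
for i, names in enumerate(trace, start=1):
    print(f"step {i}: {names}")
```

In this run the batch holds at most two requests, so `r3` takes over the slot that `r2` frees after the first step rather than waiting for `r1` to finish.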

Tiny concrete example

Suppose a server is generating text for three users:

User A needs 20 tokens
User B needs 5 tokens
User C arrives while A and B are already decoding

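With fixed batching, C waits until the current batch ends, which here means until A has produced all 20 tokens. With continuous batching, the server can admit C at the next decode step if it has a free slot, drop B as soon as B finishes, and keep A running throughout, so the GPU stays occupied.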

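A simplified view, assuming for illustration that C arrives in time for step 2 and needs 10 tokens:

Step 1:       [A, B]
Steps 2-5:    [A, B, C]   <- C joins at the next decode step because there is room
Steps 6-11:   [A, C]      <- B finished after 5 tokens
Steps 12-20:  [A]         <- C finished after 10 tokens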
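To put numbers on the difference, the sketch below computes the decode step at which each user finishes under both policies. The token counts for A and B come from the example; C's arrival at step 2 and its 10-token length are assumptions made only for illustration, as are the function names. It also assumes one token per request per step and that a free slot is always available.

```python
from typing import Dict, Tuple

# name -> (arrival_step, tokens_needed); C's values are illustrative assumptions.
USERS: Dict[str, Tuple[int, int]] = {"A": (1, 20), "B": (1, 5), "C": (2, 10)}


def finish_steps_continuous(users: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Each request starts decoding at its arrival step."""
    return {name: arrival + tokens - 1 for name, (arrival, tokens) in users.items()}


def finish_steps_fixed(users: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Requests present at step 1 form one batch; later arrivals wait for it to drain."""
    first = {n: v for n, v in users.items() if v[0] == 1}
    late = {n: v for n, v in users.items() if v[0] > 1}
    finish = {n: tokens for n, (_, tokens) in first.items()}
    batch_end = max(finish.values())  # the batch is held until its longest member ends
    for n, (_, tokens) in late.items():
        finish[n] = batch_end + tokens  # the second batch starts right after the first
    return finish


continuous = finish_steps_continuous(USERS)
fixed = finish_steps_fixed(USERS)
print("continuous:", continuous)
print("fixed:     ", fixed)
```

Under these assumptions A and B finish at the same step either way, while C finishes at step 11 with continuous batching and step 30 with fixed batching, because it has to wait for A's 20 steps before it can start.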

Common pitfalls / when NOT to use it

Not every workload benefits equally. If your requests are tiny, sparse, or strictly latency-sensitive, batching can add queueing delay that hurts the user experience.
Variable-length outputs complicate batching. Sequences finish decoding at different times, so the server needs logic for admitting new requests, removing finished ones, and managing memory.
Padding waste can still happen. Plain batching on very uneven inputs may waste compute on padding; continuous batching reduces some waste but does not eliminate all inefficiency.
Memory can become the bottleneck. For LLMs, batching is not just about compute; every active generation holds a KV cache that grows with each token, so available GPU memory often caps the batch size before compute does.
Terminology is inconsistent. Some vendors say batching, dynamic batching, or continuous batching for closely related ideas. The name matters less than whether the server can add and remove sequences while decoding.
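For the memory pitfall above, a back-of-the-envelope estimate shows how quickly the KV cache limits batch size. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte values), the 16 GiB cache budget, and the function names are all hypothetical figures picked for the arithmetic, not measurements of any particular system.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(budget_bytes: int, tokens_per_seq: int, per_token: int) -> int:
    """How many sequences of a given length fit in the cache budget."""
    return budget_bytes // (tokens_per_seq * per_token)


# Hypothetical model shape and memory budget, for illustration only.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32,
                                     head_dim=128, bytes_per_value=2)
budget = 16 * 1024**3  # 16 GiB set aside for the KV cache
max_seqs = max_concurrent_sequences(budget, tokens_per_seq=2048, per_token=per_token)

print(f"{per_token / 1024**2:.2f} MiB per token, {max_seqs} sequences of 2048 tokens")
```

With these figures each token costs 0.5 MiB, so only 16 sequences of 2048 tokens fit in the budget regardless of how much compute is left over.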