PaPoo

What is batching / continuous batching?

Batching is the practice of grouping multiple requests together so a model can process them more efficiently, and continuous batching is the version where new requests are added to that group while others are still running.

Why it matters

Large models are expensive to run, and a GPU is often underused when it serves one request at a time. Batching improves throughput by sharing compute across many requests. Continuous batching goes a step further: it keeps the GPU busy by letting the server admit new requests at the next available step instead of waiting for a whole batch to finish.

You reach for batching when you care about throughput, cost, and GPU utilization more than the absolute lowest latency for a single request. In practice, most production inference systems use some form of batching because it is one of the simplest ways to serve more traffic per GPU.

How it works

In plain batching, the server collects several requests, pads or groups them so they can be run together, and sends them through the model in one go. This works well when the requests are similar in shape, such as same-length inputs or a fixed-size image batch.
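As a rough illustration of the "pads or groups" step, here is a minimal sketch that right-pads variable-length token sequences to a common length and builds a mask marking the real tokens. The `pad_batch` helper and the `PAD_ID` value are hypothetical names chosen for this example, not part of any particular serving library.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id; real tokenizers define their own


def pad_batch(
    seqs: List[List[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token-id sequences to one length and build a 0/1 mask."""
    max_len = max(len(s) for s in seqs)
    # Every row is extended with pad_id until it reaches max_len.
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # The mask is 1 for real tokens and 0 for padding.
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, mask


padded, mask = pad_batch([[5, 6, 7], [8], [9, 10]])
print(padded)  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

The padded positions are wasted compute, which is why plain batching works best when inputs have similar shapes.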

For LLM inference, there are usually two phases: prefill and decode. Prefill processes the input prompt, and decode generates tokens one step at a time. The tricky part is that decode is iterative: different requests finish at different times, so a fixed batch can become inefficient as some sequences end early.
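To put a number on that inefficiency, the sketch below estimates how many batch slots do useful work when a fixed batch must run until its longest sequence finishes. It is a simplified model that assumes one token per request per decode step and ignores prefill; `static_batch_utilization` is a hypothetical helper written for this illustration.

```python
from typing import List


def static_batch_utilization(decode_lengths: List[int]) -> float:
    """Fraction of slot-steps that produce a token in a fixed batch."""
    # The fixed batch runs until its longest request is done.
    steps = max(decode_lengths)
    total_slot_steps = steps * len(decode_lengths)
    # Each request only uses its slot for as many steps as it has tokens.
    useful_slot_steps = sum(decode_lengths)
    return useful_slot_steps / total_slot_steps


# Four requests that need 4, 2, 8 and 2 generated tokens.
print(static_batch_utilization([4, 2, 8, 2]))  # 0.5
```

In this example half of the slot-steps sit idle, because three of the four requests finish well before the longest one.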

Continuous batching solves that by treating generation as a moving queue. After each decoding step, the server removes finished requests and inserts newly arrived ones into the active batch. This keeps the batch full without forcing new traffic to wait for the entire previous batch to complete. Some LLM serving systems call this “in-flight batching” or “iteration-level scheduling”; “dynamic batching” is sometimes used too, although that term more often means grouping requests that arrive within a short time window. The exact implementation details vary, but the core idea is the same.
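The scheduling loop described above can be sketched as a small simulation. This is a toy model under stated assumptions, not how any specific serving system is implemented: each active request produces exactly one token per step, prefill cost is ignored, and `simulate_continuous_batching`, its parameters, and the request names are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# (name, arrival_step, tokens_to_generate); steps are numbered from 1.
Request = Tuple[str, int, int]


def simulate_continuous_batching(
    arrivals: List[Request], max_batch_size: int
) -> List[List[str]]:
    """Return the names of the active requests at each decode step."""
    pending: Deque[Request] = deque(sorted(arrivals, key=lambda r: r[1]))
    waiting: Deque[Tuple[str, int]] = deque()
    remaining: Dict[str, int] = {}  # active request -> tokens still to generate
    trace: List[List[str]] = []
    step = 1
    while pending or waiting or remaining:
        # Requests that have arrived by this step join the waiting queue.
        while pending and pending[0][1] <= step:
            name, _, tokens = pending.popleft()
            waiting.append((name, tokens))
        # Admit waiting requests while the batch has free slots.
        while waiting and len(remaining) < max_batch_size:
            name, tokens = waiting.popleft()
            remaining[name] = tokens
        trace.append(list(remaining))
        # One decode step: every active request emits one token,
        # and finished requests leave the batch immediately.
        for name in list(remaining):
            remaining[name] -= 1
            if remaining[name] <= 0:
                del remaining[name]
        step += 1
    return trace


trace = simulate_continuous_batching(
    [("r1", 1, 4), ("r2", 1, 1), ("r3", 2, 2)], max_batch_size=2
)
for i, active in enumerate(trace, start=1):
    print(f"Step {i}: {active}")
# Step 1: ['r1', 'r2']
# Step 2: ['r1', 'r3']
# Step 3: ['r1', 'r3']
# Step 4: ['r1']
```

Here `r3` takes the slot that `r2` frees after step 1, rather than waiting for `r1` to finish. A real scheduler also has to account for memory limits and prefill work, which this sketch leaves out.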

Tiny concrete example

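Suppose a server is generating text for three users, A, B, and C. A and B are already in the batch, with B needing far fewer tokens than A, and C's request arrives while they are still running.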

With fixed batching, C may wait until the current batch ends. With continuous batching, the server can finish B, keep A running, and insert C into the next decode step so the GPU stays occupied.

A simplified view:

Step 1: [A, B]
Step 2: [A, B, C]   <- C joins as soon as there is room
Step 3: [A, C]
Step 4: [A]

Common pitfalls / when NOT to use it

Related terms
