Quantization is the process of representing numbers with fewer bits, so a model or computation uses less memory and can often run faster.
Large neural networks store weights, activations, and sometimes key/value caches as floating-point numbers. That is accurate, but expensive. Quantization reduces that cost by converting some of those values to lower-precision formats, such as 8-bit integers or even 4-bit representations.
You usually reach for quantization when you want one or more of these:
In practice, it is one of the most common ways to make LLMs easier to serve without retraining from scratch.
The basic idea is to map a wide numeric range into a smaller set of representable values.
A simple version works like this:
There are different kinds of quantization:
The tradeoff is always the same: less precision for less cost. The art is choosing a scheme that saves enough memory and compute while keeping model quality acceptable.
Suppose a model weight is 0.83.
Instead of storing it exactly as a 32-bit float, a quantized system might store a nearby 8-bit value plus a scale like:
830.0183 * 0.01 = 0.83That is a simplified example, but it shows the idea: store a compact proxy, then recover an approximate original value when used.
If you are serving a model, teams often start with modest post-training quantization before trying more aggressive schemes.