2026-06-17

What is quantization?

Quantization is the process of representing numbers with fewer bits, so a model or computation uses less memory and can often run faster.

Why it matters

Large neural networks store weights, activations, and sometimes key/value caches as floating-point numbers, typically 16 or 32 bits each. That is accurate but expensive. Quantization reduces that cost by converting some of those values to lower-precision formats, such as 8-bit integers or even 4-bit representations.

You usually reach for quantization when you want one or more of these:

lower memory use
lower bandwidth use when moving tensors around
faster inference on supported hardware
cheaper deployment of large models
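
To make the memory point concrete, here is a rough back-of-the-envelope sketch. The 7-billion-parameter model size is an assumption chosen for illustration, and the estimate covers weights only: it ignores activations, key/value caches, and the small overhead of storing scales and zero points.

```python
# Rough weight-memory estimate for a hypothetical model.
# Ignores activations, KV cache, and scale/zero-point overhead.

def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Return the approximate size of the weights in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)

NUM_PARAMS = 7_000_000_000  # assumed model size, for illustration only

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Each halving of the bit width halves the weight footprint, which is why going from 16-bit to 4-bit weights can decide whether a model fits on a given device.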

In practice, it is one of the most common ways to make LLMs easier to serve without retraining from scratch.

How it works

The basic idea is to map a wide numeric range into a smaller set of representable values.

A simple version works like this:

Choose a low-precision format, such as int8 or int4.
Compute a scale factor, and sometimes a zero point, that lets the low-precision values approximate the original floating-point values.
Store or run parts of the model in that lower-precision form.
During inference, dequantize back to higher precision when needed, or do arithmetic directly in the quantized format if the backend supports it.
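
The steps above can be sketched in plain Python. This is a minimal illustration of one common scheme (affine quantization to unsigned 8-bit with a scale and a zero point), not how any particular library implements it. The function names and sample values are made up for this example.

```python
def choose_params(values, num_bits=8):
    """Pick a scale and zero point that map [min, max] onto 0..2^bits-1."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax

def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to integers, clamping anything outside the range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Recover approximate floats from the stored integers."""
    return [(q - zero_point) * scale for q in q_values]

values = [-1.0, -0.25, 0.0, 0.83, 1.5]  # made-up example tensor
scale, zero_point, qmin, qmax = choose_params(values)
q = quantize(values, scale, zero_point, qmin, qmax)
restored = dequantize(q, scale, zero_point)

for original, stored, approx in zip(values, q, restored):
    print(f"{original:+.4f} -> {stored:3d} -> {approx:+.4f}")
```

For values inside the chosen range, the round-trip error is at most half the scale, so a tighter range or more bits gives a finer approximation.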

There are different kinds of quantization:

Post-training quantization: apply quantization after training.
Quantization-aware training: train the model while simulating quantization effects so it learns to be more robust.
Weight-only quantization: quantize only the weights, leaving activations in higher precision.
Activation quantization: also quantize intermediate activations.
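
The difference between weight-only and weight-plus-activation quantization is easiest to see on a single dot product. The sketch below assumes symmetric int8 quantization (one scale, zero point fixed at 0) applied after training. The numbers are made up, and a real backend would run the integer path in hardware rather than in a Python loop.

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric quantization: one scale, zero point fixed at 0."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

weights = [0.83, -0.41, 0.07, -1.20]   # made-up example weights
activations = [0.5, 2.0, -1.5, 0.25]   # made-up example activations

q_w, w_scale = quantize_symmetric(weights)
q_a, a_scale = quantize_symmetric(activations)

# Reference result in full precision.
exact = sum(w * a for w, a in zip(weights, activations))

# Weight-only: weights are stored as integers, dequantized on the fly,
# and multiplied with full-precision activations.
weight_only = sum((qw * w_scale) * a for qw, a in zip(q_w, activations))

# Weights and activations both quantized: accumulate in integers,
# then apply the combined scale once at the end.
int_acc = sum(qw * qa for qw, qa in zip(q_w, q_a))
both_quantized = int_acc * w_scale * a_scale

print(f"exact:          {exact:.4f}")
print(f"weight-only:    {weight_only:.4f}")
print(f"both quantized: {both_quantized:.4f}")
```

Weight-only quantization saves memory but still does floating-point math. Quantizing activations as well allows integer arithmetic, at the cost of a second source of rounding error.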

The tradeoff is always the same: less precision for less cost. The art is choosing a scheme that saves enough memory and compute while keeping model quality acceptable.

Tiny concrete example

Suppose a model weight is 0.83.

Instead of storing it exactly as a 32-bit float, a quantized system might store a nearby 8-bit integer, plus a scale that is typically shared by many weights:

stored value: 83
scale: 0.01
reconstructed value: 83 * 0.01 = 0.83

That is a simplified example, but it shows the idea: store a compact proxy, then recover an approximate original value when used.
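
The same example as a short sketch, using the scale of 0.01 from above and assuming signed 8-bit storage (range -128 to 127). The helper names are made up for illustration. Two extra inputs show where the approximation comes from: 0.834 rounds to the same stored value as 0.83, and 2.5 falls outside the representable range and is clamped.

```python
SCALE = 0.01  # scale from the example above

def quantize_int8(x: float, scale: float) -> int:
    """Round to the nearest step, then clamp to the signed 8-bit range."""
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize_int8(q: int, scale: float) -> float:
    """Recover the approximate original value."""
    return q * scale

for w in (0.83, 0.834, 2.5):
    q = quantize_int8(w, SCALE)
    approx = dequantize_int8(q, SCALE)
    print(f"weight {w} -> stored {q} -> reconstructed {approx:.2f}")
```

With this scale, the largest value that can be stored is 127 * 0.01 = 1.27, which is why real schemes derive the scale from the actual range of the tensor instead of fixing it in advance.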

Common pitfalls / when NOT to use it

Quantization is not a general-purpose fix. It is a lossy reduction in numeric precision; it does not automatically solve every storage or latency problem.
Quality can drop. Some models, tasks, or layers are more sensitive than others, especially with aggressive low-bit quantization.
Hardware support matters. A quantized model may be smaller, but it is not necessarily faster unless the runtime and hardware can exploit the low-precision format.
It is not the same as pruning. Pruning removes parameters; quantization changes how parameters are represented.
For training, be careful. Very low precision can make optimization unstable unless the method is designed for it.
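
To get a feel for why aggressive low-bit quantization is riskier, the sketch below compares the worst-case round-trip error of the same made-up weights at 8 and 4 bits, assuming simple symmetric quantization with a single scale. Per-weight error is only a rough proxy; the effect on model quality has to be measured on the actual task.

```python
def max_round_trip_error(values, num_bits):
    """Largest |original - reconstructed| under symmetric quantization."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    worst = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        worst = max(worst, abs(v - q * scale))
    return worst

weights = [0.83, -0.41, 0.07, -1.20, 0.55, -0.02]  # made-up example weights

for bits in (8, 4):
    print(f"{bits}-bit max error: {max_round_trip_error(weights, bits):.4f}")
```

At 4 bits there are only 15 usable levels in this scheme, so small weights such as 0.07 collapse to zero.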

Teams that serve models often start with modest post-training quantization before trying more aggressive schemes.