What is quantization?

Quantization is the process of representing numbers with fewer bits, so a model or computation uses less memory and can often run faster.

Why it matters

Large neural networks store weights, activations, and sometimes key/value caches as floating-point numbers. That is accurate, but expensive. Quantization reduces that cost by converting some of those values to lower-precision formats, such as 8-bit integers or even 4-bit representations.
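To put rough numbers on that cost, here is a minimal sketch that estimates weight storage at a few precisions. The 7-billion parameter count is a hypothetical model size picked for illustration, and the estimate ignores the small overhead of scales and zero points that real quantized formats also store.

```python
# Rough weight-memory estimate at different precisions.
# NUM_PARAMS is an assumed, illustrative model size, not a figure from the text.
BITS_PER_FORMAT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gib(num_params: int, bits_per_value: int) -> float:
    """Return the storage needed for the weights alone, in GiB."""
    return num_params * bits_per_value / 8 / 2**30


NUM_PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

for name, bits in BITS_PER_FORMAT.items():
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Halving the bits per value halves the weight memory, which is why moving from 16-bit floats to 4-bit values can decide whether a model fits on a given device.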

You usually reach for quantization when you want a smaller memory footprint, faster inference, or both.

In practice, it is one of the most common ways to make LLMs easier to serve without retraining from scratch.

How it works

The basic idea is to map a wide numeric range into a smaller set of representable values.

A simple version works like this:

  1. Choose a low-precision format, such as int8 or int4.
  2. Compute a scale factor, and sometimes a zero point, that lets the low-precision values approximate the original floating-point values.
  3. Store or run parts of the model in that lower-precision form.
  4. During inference, dequantize back to higher precision when needed, or do arithmetic directly in the quantized format if the backend supports it.
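The steps above can be sketched in plain Python. This is a minimal illustration of one common scheme (asymmetric quantization with a scale and a zero point onto unsigned integers), not any particular library's implementation; the function names `compute_qparams`, `quantize`, and `dequantize` are made up for this example.

```python
def compute_qparams(values, num_bits=8):
    """Step 2: derive a scale and zero point for an unsigned integer range."""
    qmin, qmax = 0, 2**num_bits - 1
    # Widen the range to include 0.0 so that zero maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are 0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Step 3: map floats to integers, clamping to the representable range."""
    qmin, qmax = 0, 2**num_bits - 1
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Step 4: recover approximate floats from the stored integers."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, 0.0, 0.5, 2.0]  # made-up sample values
scale, zero_point = compute_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(q, restored)
```

The restored values differ from the originals by at most about one quantization step (`scale`), which is the precision given up in exchange for storing one byte per value.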

There are different kinds of quantization, but the tradeoff is always the same: less precision for less cost. The art is choosing a scheme that saves enough memory and compute while keeping model quality acceptable.
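One way to see the tradeoff is to measure how much error a round trip through fewer bits introduces. The sketch below assumes a simple symmetric scheme (one scale, no zero point) and a handful of made-up weights; real models are usually judged on task quality rather than raw rounding error, so treat this only as an intuition aid.

```python
def roundtrip_error(values, num_bits):
    """Mean absolute error after symmetric quantize-then-dequantize.

    Assumes at least one value is non-zero, so the scale is non-zero.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax
    restored = [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


weights = [0.83, -0.41, 0.07, -0.92, 0.55, 0.18]  # made-up sample weights

err_int8 = roundtrip_error(weights, 8)
err_int4 = roundtrip_error(weights, 4)
print(f"8-bit mean abs error: {err_int8:.5f}")
print(f"4-bit mean abs error: {err_int4:.5f}")
```

On these sample values the 4-bit error is noticeably larger than the 8-bit error, while the storage per value is half as large.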

Tiny concrete example

Suppose a model weight is 0.83.

Instead of storing it exactly as a 32-bit float, a quantized system might store a nearby 8-bit integer plus a scale factor shared by many weights.

That is a simplified example, but it shows the idea: store a compact proxy, then recover an approximate original value when used.
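As a worked version of this example, the sketch below quantizes 0.83 with a symmetric signed 8-bit scheme. The scale of 1/127 is an assumption made for illustration (it corresponds to weights lying in the range -1 to 1); the original example does not specify one.

```python
weight = 0.83
scale = 1.0 / 127  # hypothetical scale, assuming weights fall within [-1, 1]

q = round(weight / scale)      # the integer that gets stored
q = max(-128, min(127, q))     # clamp to the signed 8-bit range
restored = q * scale           # approximate value recovered at use time

print(q, restored, abs(restored - weight))
```

Here the stored integer is 105 and the recovered value is roughly 0.827, so the error is a few thousandths while the weight itself takes 8 bits instead of 32.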

Common pitfalls / when NOT to use it

When serving a model, teams often start with modest post-training quantization before trying more aggressive schemes.

Related terms
