A token is a chunk of text that a language model reads and predicts, and tokenization is the process of splitting text into those chunks.
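LLMs do not operate directly on raw characters or words; they operate on tokens. That matters because limits and costs are counted in tokens: context windows are measured in tokens, and usage is typically billed per token.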
In practice, if you are building with LLMs, tokenization is one of the first things you should check before debugging “why did this get truncated?” or “why was this so expensive?”
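A minimal sketch of that kind of check, assuming a model with a fixed context limit and a provider that bills input and output tokens per 1,000 tokens. The function names, the limit, and the prices are hypothetical and chosen only for illustration; real limits and prices depend on the model and provider.

```python
def check_budget(prompt_tokens, max_output_tokens, context_limit):
    """Return the tokens left over after the prompt and the reserved output.

    A negative result means the request does not fit in the context window.
    """
    return context_limit - (prompt_tokens + max_output_tokens)


def estimate_cost(prompt_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Estimate cost when input and output tokens are billed per 1,000 tokens."""
    input_cost = (prompt_tokens / 1000) * price_in_per_1k
    output_cost = (output_tokens / 1000) * price_out_per_1k
    return input_cost + output_cost


# Hypothetical numbers, not taken from any real model or price list.
remaining = check_budget(prompt_tokens=7500, max_output_tokens=1000, context_limit=8000)
if remaining < 0:
    print(f"Over budget by {-remaining} tokens; expect truncation or an error.")

cost = estimate_cost(7500, 1000, price_in_per_1k=0.5, price_out_per_1k=1.5)
print(f"Estimated cost: {cost:.2f}")
```

Both checks take token counts as input, so they are only as accurate as the tokenizer used to produce those counts.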
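Most modern tokenizers do not split text only by spaces. They usually break text into subword units using a learned vocabulary. This is a practical compromise between whole-word vocabularies, which cannot cover rare or new words, and character-level splitting, which makes sequences very long.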
A token can be a whole word, a piece of a word, or a punctuation mark.
This is why “one word = one token” is only sometimes true. In many tokenizers, "unbelievable" may be one token or several pieces, while "hello" and "!" might each be separate tokens. Exact behavior depends on the tokenizer used by the model.
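To make the "one token or several pieces" point concrete, here is a toy greedy longest-match splitter. It is a simplified sketch, not the algorithm of any particular model: the function name and both vocabularies are hypothetical, and real tokenizers learn vocabularies with tens of thousands of entries.

```python
def tokenize_word(word, vocab):
    """Split one word into the longest pieces found in vocab, left to right.

    Characters not covered by any vocab entry become single-character tokens.
    """
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                break
        else:
            # Nothing in the vocabulary matched; fall back to a single character.
            end = start + 1
            piece = word[start:end]
        tokens.append(piece)
        start = end
    return tokens


# Two hypothetical vocabularies: the second also contains the full word.
SMALL_VOCAB = {"un", "believ", "able", "hello", "!"}
LARGER_VOCAB = SMALL_VOCAB | {"unbelievable"}

print(tokenize_word("unbelievable", SMALL_VOCAB))   # ['un', 'believ', 'able']
print(tokenize_word("unbelievable", LARGER_VOCAB))  # ['unbelievable']
print(tokenize_word("hello", SMALL_VOCAB))          # ['hello']
```

The same word yields three tokens or one, depending only on what the vocabulary contains.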
Text:
I love tokenization!
A tokenizer might split it roughly like this:
I | love | token | ization | !
Another tokenizer might split it differently, for example:
I | love | tok | en | ization | !
Same sentence, different tokenization. The model then works with token IDs rather than the original text.
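The step from tokens to token IDs can be sketched as a dictionary lookup. Everything here is an assumption made for illustration: the two vocabularies, their ID numbers, and the convention of attaching a leading space to the following token (which some tokenizers use, but not all).

```python
def encode(tokens, vocab):
    """Map each token string to its integer ID."""
    return [vocab[token] for token in tokens]


def decode(ids, vocab):
    """Map IDs back to token strings and join them into the original text."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return "".join(id_to_token[token_id] for token_id in ids)


# Two hypothetical vocabularies that split the same sentence differently.
vocab_a = {"I": 0, " love": 1, " token": 2, "ization": 3, "!": 4}
vocab_b = {"I": 10, " love": 11, " tok": 12, "en": 13, "ization": 14, "!": 15}

ids_a = encode(["I", " love", " token", "ization", "!"], vocab_a)
ids_b = encode(["I", " love", " tok", "en", "ization", "!"], vocab_b)

print(ids_a)  # [0, 1, 2, 3, 4]
print(ids_b)  # [10, 11, 12, 13, 14, 15]
print(decode(ids_a, vocab_a) == decode(ids_b, vocab_b))  # True
```

The two ID sequences differ in both length and values, yet each decodes to the same text, which is why IDs from one tokenizer are meaningless to a model trained with another.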
If you need exact behavior, use the tokenizer associated with the specific model you plan to call.