A transformer is a neural network architecture that processes sequences by using attention to compare each token with the others, making it a flexible foundation for modern language models and many other sequence tasks.
Before transformers, many sequence models had to read text step by step, which made long-range dependencies harder to handle and training slower to parallelize. Transformers solve this by letting the model look at many positions in the input at once.
In practice, you reach for a transformer when you want a model that can:
Most modern LLMs are built on transformer variants because the architecture works well for language, code, and other token-based data.
A transformer first turns the input into tokens, then converts those tokens into vectors. The key mechanism is attention: for each token, the model computes how much it should “pay attention” to other tokens in the sequence.
In the original transformer design, there are two major parts:
Many language models use only the decoder side, while many classification or translation systems use encoder-only or encoder-decoder variants.
Instead of processing strictly left-to-right through recurrence, a transformer uses attention plus feed-forward layers, with positional information added so the model still knows token order. This lets it learn relationships like “this pronoun refers to that noun” or “this symbol depends on something earlier in the sequence.”
Input:
The cat sat on the mat because it was tired.
A transformer can use attention to connect it back to cat and to earlier words like tired, rather than relying only on nearby tokens. That makes it better at modeling context across a sentence.