What is attention (self-attention)?

Self-attention, often just called attention in the context of Transformers, is a way for a model to decide which other words or tokens in the same input matter most when interpreting each token.

Why it matters

Self-attention solves a basic language problem: the meaning of a word often depends on other words nearby or far away. It lets a model build a context-aware representation of each token instead of treating tokens mostly independently or relying only on fixed-size windows.

In practice, this is why Transformers work so well for text, code, and other sequence data. If you need a model to connect “it” to the right noun, or to link a later instruction to an earlier constraint, self-attention is the mechanism doing that work.

How it works

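At a high level, each token produces three vectors: a query (what this token is looking for), a key (what this token offers to be matched against), and a value (the information it passes along when attended to).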

For a given token, the model compares its query to the keys of all tokens in the sequence. Those comparisons become weights: tokens that seem more relevant get higher weight. The model then takes a weighted sum of the corresponding values to produce a new representation for that token.
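The compare, weight, and sum steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation. It assumes the common scaled dot-product form (dot-product comparison, division by the square root of the key size, softmax to get weights), which the text above does not spell out. The names `self_attention`, `w_q`, `w_k`, `w_v` and the random toy inputs are all hypothetical.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over a list of token vectors (assumed scaled dot-product form)."""
    queries = [project(w_q, t) for t in tokens]
    keys = [project(w_k, t) for t in tokens]
    values = [project(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))

    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's query to the key of every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # New representation: weighted sum of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

def rand_matrix(rows, cols):
    """Random toy matrix; real models learn these weights during training."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, d_k = 4, 3                   # toy sizes, chosen arbitrarily
tokens = rand_matrix(5, d_model)      # 5 made-up token vectors
w_q, w_k, w_v = (rand_matrix(d_k, d_model) for _ in range(3))
outputs, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` holds one token's attention over all five tokens and sums to 1; each entry of `outputs` is that token's new, context-aware vector.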

Because every token can attend to every other token, self-attention is good at modeling long-range dependencies. In Transformers, this is usually done in parallel across many tokens, and often with multi-head attention, where several attention “heads” learn different kinds of relationships at once.
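As a rough sketch of the multi-head idea, the code below runs several independent heads over the same tokens and concatenates their per-token outputs. It assumes scaled dot-product attention inside each head and, for brevity, leaves out the final output projection that Transformer layers usually apply after concatenation. The function names, sizes, and random weights are hypothetical.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def attention_head(tokens, w_q, w_k, w_v):
    """One head: returns one output vector per token."""
    qs = [project(w_q, t) for t in tokens]
    ks = [project(w_k, t) for t in tokens]
    vs = [project(w_v, t) for t in tokens]
    scale = math.sqrt(len(ks[0]))
    outputs = []
    for q in qs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
        outputs.append([sum(w * v[i] for w, v in zip(weights, vs))
                        for i in range(len(vs[0]))])
    return outputs

def multi_head_attention(tokens, heads):
    """Run every head on the same tokens, then concatenate outputs per token."""
    per_head = [attention_head(tokens, w_q, w_k, w_v) for (w_q, w_k, w_v) in heads]
    return [[x for head_out in per_head for x in head_out[i]]
            for i in range(len(tokens))]

def rand_matrix(rows, cols):
    """Random toy matrix; real models learn these weights during training."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(1)
d_model, num_heads = 8, 2             # toy sizes, chosen arbitrarily
d_head = d_model // num_heads
tokens = rand_matrix(3, d_model)      # 3 made-up token vectors
# Each head gets its own query/key/value projections.
heads = [tuple(rand_matrix(d_head, d_model) for _ in range(3))
         for _ in range(num_heads)]
combined = multi_head_attention(tokens, heads)
```

Because each head has its own projections, each can weight the sequence differently; the concatenated vector for a token carries all of those views side by side.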

A key point: self-attention is not a single rule for “importance.” It is a learned mechanism. The model figures out, during training, which patterns of attention help it do the task.

Tiny concrete example

Sentence:
“The trophy didn’t fit in the suitcase because it was too big.”

When processing “it”, a self-attention layer can put more weight on “trophy” than on “suitcase” if that helps the model infer that “it” refers to the trophy. In another sentence, the weights could shift to a different word depending on context.

Common pitfalls / when NOT to use it

In practice, if you need a model to connect parts of the same input flexibly, self-attention is the right idea. If you need durable state across interactions, look elsewhere.
