2026-06-14

What is attention (self-attention)?

Attention, or self-attention, is a way for a model to decide which other words or tokens in the same input matter most when interpreting each token.

Why it matters

Self-attention solves a basic language problem: the meaning of a word often depends on other words nearby or far away. It lets a model build a context-aware representation of each token instead of treating tokens mostly independently or relying only on fixed-size windows.

In practice, this is why Transformers work so well for text, code, and other sequence data. If you need a model to connect “it” to the right noun, or to link a later instruction to an earlier constraint, self-attention is the mechanism doing that work.

How it works

At a high level, each token produces three vectors:

Query: what this token is looking for
Key: what this token offers
Value: the information to pass along

For a given token, the model compares its query to the keys of all tokens in the sequence, typically with a scaled dot product. A softmax turns those scores into weights that sum to 1: tokens that seem more relevant get higher weight. The model then takes a weighted sum of the corresponding values to produce a new representation for that token.
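The query, key, and value steps can be sketched in a few lines of plain Python. This is a minimal illustration, assuming scaled dot-product scores and a softmax (the standard Transformer formulation); the 2-dimensional vectors are made up for the example, and a real model would produce them with learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Return (weights, output) for one query over all keys and values."""
    scale = math.sqrt(len(query))
    # Compare the query to every key.
    scores = [dot(query, k) / scale for k in keys]
    # Turn the comparisons into weights.
    weights = softmax(scores)
    # Weighted sum of the values, one output dimension at a time.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical vectors for a 3-token sequence (not from a trained model).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The query lines up best with the first and third keys, so those two tokens receive equal, higher weights than the second, and the output leans toward their values.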

Because every token can attend to every other token (or, in decoder-only models, to every earlier token), self-attention is good at modeling long-range dependencies. In Transformers, the computation runs in parallel for all tokens in the sequence, usually with multi-head attention, where several attention “heads” learn different kinds of relationships at once.
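A rough sketch of the multi-head idea: each vector is split into slices, every slice gets its own attention pass, and the per-head results are concatenated. This is a simplification under stated assumptions. Real Transformer layers also apply learned projection matrices before the split and after the concatenation, which are omitted here, and the helper names (`split_heads`, `multi_head_attend`) and input values are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention output for a single query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def split_heads(vector, num_heads):
    """Cut one vector into num_heads equal slices."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head_attend(queries, keys, values, num_heads):
    """Run attention separately per head, then concatenate the head outputs."""
    outputs = []
    for q in queries:
        q_heads = split_heads(q, num_heads)
        row = []
        for h in range(num_heads):
            head_keys = [split_heads(k, num_heads)[h] for k in keys]
            head_values = [split_heads(v, num_heads)[h] for v in values]
            row.extend(attend(q_heads[h], head_keys, head_values))
        outputs.append(row)
    return outputs

# Hypothetical 4-dimensional token vectors, reused as queries, keys, and
# values (a trained model would derive each with its own learned projection).
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]

result = multi_head_attend(x, x, x, num_heads=2)
print(result)
```

Because each head sees only its own slice of the vectors, the two heads can assign different weights to the same pair of tokens.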

A key point: self-attention is not a single rule for “importance.” It is a learned mechanism. The model figures out, during training, which patterns of attention help it do the task.

Tiny concrete example

Sentence:
“The trophy didn’t fit in the suitcase because it was too big.”

When processing “it”, a self-attention layer can put more weight on “trophy” than on “suitcase” if that helps the model infer that “it” refers to the trophy. In another sentence, the weights could shift to a different word depending on context.
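To make the example concrete, here is a toy calculation for the query token “it”. The raw scores are invented for illustration and do not come from a trained model; the point is only how a higher score for “trophy” turns into a higher attention weight after the softmax.

```python
import math

tokens = ["The", "trophy", "didn't", "fit", "in", "the",
          "suitcase", "because", "it", "was", "too", "big"]

# Hypothetical raw scores for the query token "it" against every token.
scores = [0.0, 3.0, 0.0, 0.5, 0.0, 0.0, 1.5, 0.0, 0.5, 0.0, 0.0, 1.0]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(scores)

# Pair each token with its weight and pick the most attended one.
weight_by_token = list(zip(tokens, weights))
top_token, top_weight = max(weight_by_token, key=lambda pair: pair[1])

for token, weight in weight_by_token:
    print(f"{token:10s} {weight:.3f}")
print("most attended:", top_token)
```

With these made-up scores, “trophy” gets the largest weight and “suitcase” the second largest. A trained model learns scores like these from data rather than having them written by hand.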

Common pitfalls / when NOT to use it

It is not human-like reasoning. Attention weights show where a layer placed its weight, but they do not by themselves explain the model’s full decision.
It is not the same as “memory.” Self-attention helps relate tokens inside the current context; it does not automatically create persistent long-term memory across sessions.
It can be expensive on long inputs. In naive self-attention, the compute and memory for the attention scores grow with the square of the sequence length, so very long documents may need optimized variants or chunking.
Don’t assume a high attention weight means causation. Attention is useful, but it is not a guaranteed explanation of why the output happened.
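The cost pitfall is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes naive attention that stores one score for every query-key pair, counted for a single head in a single layer, with 4 bytes per score (float32). Optimized implementations avoid materializing this full matrix, so treat the numbers as an upper-bound illustration.

```python
def score_count(num_tokens):
    """Number of query-key scores in one naive attention matrix."""
    return num_tokens * num_tokens

def score_bytes(num_tokens, bytes_per_score=4):
    """Bytes needed to store that matrix; 4 bytes per score assumes float32."""
    return score_count(num_tokens) * bytes_per_score

for n in (1_000, 10_000):
    print(n, score_count(n), score_bytes(n))
# prints:
# 1000 1000000 4000000
# 10000 100000000 400000000
```

A 10x longer input needs 100x as many scores, which is why long documents push people toward optimized variants or chunking.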

In practice, if you need a model to connect parts of the same input flexibly, self-attention is the right idea. If you need durable state across interactions, look elsewhere.

Related terms

What is a token (and tokenization)?
What is a transformer?
What is temperature in an LLM?
What is top-p / nucleus sampling?
What is a mixture-of-experts (MoE) model?
What is a parameter (model size)?
What is the KV cache?
What is a hallucination?
What is grounding?
What is a logit?
What is softmax in an LLM?
What is greedy decoding?
What is beam search?
What is byte-pair encoding (BPE)?
What is positional encoding?
What is a decoder-only model?
What is perplexity?

More articles by the same author

What is meta-prompting?

Meta-prompting is the practice of using one prompt to design, improve, or control another prompt, rather than asking the model to do the task directly. It helps when a plain prompt is too vague, too brittle, or too hard to maintain. Instead of hand-tuning a long instruction by trial and error, you can ask an LLM to generate, refine, critique, or transform prompts for you. In practice, teams reach for meta-prompting when they want: better prompt quality with less manual iteration, reusable prompt

What is tree-of-thought prompting?

Tree-of-thought prompting, often abbreviated ToT, is a prompting method where an LLM explores multiple reasoning branches instead of committing to a single straight-line answer. Tree-of-thought prompting is useful when a problem benefits from deliberate search: puzzle solving, planning, multi-step reasoning, or tasks where an early mistake can derail the whole answer. In practice, it gives the model a way to "think in branches," then compare candidates before choosing one. You'd reach for To

What is self-consistency?

Self-consistency is a way to get a language model to answer by generating several reasoning paths and then choosing the answer that appears most often, instead of trusting a single chain of thought. A single reasoning trace from an LLM can be brittle: one unlucky step can derail the final answer. Self-consistency helps when the task needs multi-step reasoning, such as math word problems, logic puzzles, or multi-hop question answering. In practice, you reach for it when you want a cheap reliabili

What is using XML tags / delimiters in prompts?

Using XML tags or other delimiters in prompts means wrapping parts of your prompt in clear markers like `<context>...</context>` so the model can separate instructions, data, and examples more reliably. This pattern helps when a prompt contains multiple kinds of content at once: instructions, source text, user input, examples, or fields you want the model to treat differently. Delimiters reduce ambiguity and make prompts easier to read, debug, and reuse. In practice, teams reach for XML-style ta

What is a prompt template?

A prompt template is a reusable prompt with placeholders for variable details, so you can generate consistent LLM inputs without rewriting the whole prompt each time. Prompt templates solve a simple but common problem: most prompts are partly fixed and partly changing. In practice, they help you: keep instructions consistent across many requests swap in user-specific data, documents, or task details reduce copy-paste errors when building apps, workflows, or evals make prompts easier to version,

What is structured output (JSON mode)?

Structured output, often called JSON mode, is a way to ask an LLM to return data in a fixed machine-readable format instead of free-form text. When you want an answer that another program can safely parse, plain chat text is fragile. A model might add extra words, change field names, or wrap the result in an explanation. Structured output helps when you need: predictable API responses reliable extraction into downstream systems cleaner automation for forms, workflows, or tools outputs that match

What is prompt engineering?

Prompt engineering is the practice of designing and refining the text you give to a language model so it produces more useful, reliable, and specific outputs. Large language models are sensitive to how you ask. A vague request can produce a vague answer; a well-structured prompt can improve format, scope, tone, and task performance. You reach for prompt engineering when you want to: get a model to follow instructions more consistently, steer it toward a role, style, or output format, reduce back

What is a system prompt?

A system prompt is the instruction text that sets the assistant’s role, behavior, and boundaries before the user’s message is processed. System prompts help you control how an LLM behaves without changing the model itself. They’re useful when you want consistent tone, safety rules, formatting, or task-specific behavior across many requests. In practice, teams use system prompts to: keep outputs on-brand, enforce policies or guardrails, define what the assistant should and should not do, reduce a

What is few-shot prompting?

Few-shot prompting, also called in-context learning, is a way of asking a language model to do a task by giving it a few example inputs and outputs in the prompt. Few-shot prompting helps when you want the model to follow a pattern, format, or classification rule without training or fine-tuning a new model. You’d reach for it when: the task is simple but the output needs to be consistent, you can show the model 1–5 good examples, you want to prototype quickly before investing in fine-tuning

What is zero-shot prompting?

Zero-shot prompting is asking an AI model to do a task without giving it any worked examples first. It is the simplest way to use a large language model: you describe the task in plain language and let the model infer the pattern from its pretraining. That makes it useful when you want: a fast baseline before writing better prompts simple classification, extraction, or drafting tasks lower prompt length and less example curation a test of whether the model already “gets” the task well enough In