Attention Mechanism

The revolutionary idea behind Transformers: "Attention Is All You Need"

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Q (Query)

"What am I looking for?" - The token asking for information

K (Key)

"What do I contain?" - Labels that queries match against

V (Value)

"Here's my content" - The actual information to retrieve

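To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; the function name, the softmax helper, and the toy shapes are illustrative assumptions rather than anything prescribed by the paper.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # similarity of every query with every key
    weights = softmax(scores, axis=-1)           # each row becomes a probability distribution
    return weights @ V, weights                  # attention-weighted mix of the value vectors

# Toy example: 3 tokens with d_k = d_v = 4 (random illustrative inputs)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))                          # 3x3 attention matrix; rows sum to 1
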
Interactive Visualization

Click on any input token to see how it attends to all other tokens. The beam thickness and color intensity represent attention weights. The matrix on the right shows all attention patterns.

How Attention Works

  1. Project to Q, K, V: each token is transformed into Query, Key, and Value vectors.

  2. Compute Attention Scores: the dot product between queries and keys measures their similarity.

  3. Apply Softmax: the scores are converted into probabilities that sum to 1.

  4. Weighted Sum of Values: the output is an attention-weighted combination of the value vectors.
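
The four steps above can be written out directly in NumPy; the embedding matrix X, the projection matrices W_Q, W_K, W_V, and all sizes below are made-up toy values for illustration.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))        # toy token embeddings

# Step 1: project each token to Query, Key, and Value vectors
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: compute attention scores (scaled dot products of queries with keys)
scores = Q @ K.T / np.sqrt(d_k)                # shape (seq_len, seq_len)

# Step 3: apply softmax so every row becomes a probability distribution
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 4: the output is the attention-weighted sum of the value vectors
output = weights @ V                           # shape (seq_len, d_k)
print(weights.sum(axis=-1))                    # each row sums to 1.0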

Types of Attention

Self-Attention

Each token attends to all tokens in the same sequence. Used in encoders like BERT.

Causal (Masked) Attention

Tokens can only attend to previous tokens. Used in decoders like GPT for autoregressive generation.

Cross-Attention

Queries from one sequence, Keys/Values from another. Used in encoder-decoder models for translation.
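
The only mechanical difference between self-attention and causal attention is a mask applied to the scores before the softmax, as in this sketch (toy sizes and random scores, purely illustrative).

import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))   # stand-in for QKᵀ/√d_k scores

# Causal mask: True above the diagonal marks "future" positions
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(mask, -np.inf, scores)

weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))                        # lower-triangular: zero weight on future tokens

Cross-attention reuses the same computation, except that Q is projected from one sequence while K and V come from the other (for example, decoder queries attending to encoder outputs).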

Multi-Head Attention

Instead of one attention function, Transformers use multiple attention heads in parallel.

Each head learns different relationship patterns:

  • Syntactic relationships (subject-verb)
  • Semantic similarity
  • Positional patterns
  • Coreference (pronouns to nouns)
# Multi-Head Attention
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
MultiHead = Concat(head_1, ..., head_h)W^O
# Typically h = 8 or h = 12 heads, with d_k = d_model / h
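
Putting these formulas together, here is one possible NumPy sketch of multi-head attention; splitting a single large projection into h heads is a common implementation shortcut equivalent to using separate W_i^Q, W_i^K, W_i^V per head, and all names and sizes below are illustrative.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    seq_len, d_model = X.shape
    d_k = d_model // h                                    # per-head dimension
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                   # (seq_len, d_model) each
    # Reshape into h heads of size d_k; equivalent to separate per-head projections
    split = lambda M: M.reshape(seq_len, h, d_k).transpose(1, 0, 2)   # (h, seq_len, d_k)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, seq_len, seq_len)
    heads = softmax(scores) @ Vh                          # (h, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)       # Concat(head_1, ..., head_h)
    return concat @ W_O                                   # final output projection

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)           # (5, 16)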

Why Attention Revolutionized AI

🚀 Parallelization

Unlike RNNs, all positions can be computed simultaneously, enabling massive speedups on GPUs.

🔗 Long-Range Dependencies

Direct connections between any two positions, regardless of distance in the sequence.

🔍 Interpretability

Attention weights show what the model "looks at", offering some insight into how it relates tokens to one another.

📈 Scalability

Powers models from BERT (110M parameters) to GPT-4 (reportedly over a trillion parameters) with the same underlying architecture.