Attention Mechanism
The revolutionary idea behind Transformers: "Attention Is All You Need"
Scaled Dot-Product Attention
"What am I looking for?" - The token asking for information
"What do I contain?" - Labels that queries match against
"Here's my content" - The actual information to retrieve
Interactive Visualization
Click on any input token to see how it attends to all other tokens. The beam thickness and color intensity represent attention weights. The matrix on the right shows all attention patterns.
How Attention Works
1. Project to Q, K, V
Each token is transformed into Query, Key, and Value vectors.
2. Compute Attention Scores
The dot product between queries and keys measures their similarity.
3. Apply Softmax
Scores are converted into probabilities that sum to 1.
4. Weighted Sum of Values
The output is the attention-weighted combination of value vectors.
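A minimal NumPy sketch of these four steps for a single head with no masking; the sizes and random weight matrices below are illustrative stand-ins for learned parameters, not taken from any particular model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Steps 2-4: score, softmax, and weighted sum for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # step 2: (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # step 3: each row sums to 1
    return weights @ V                              # step 4: (seq_len, d_v)

# Step 1: project token embeddings into Q, K, V (random stand-in weights)
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))             # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                                    # (4, 8)
```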
Types of Attention
Self-Attention
Each token attends to all tokens in the same sequence. Used in encoders like BERT.
Causal (Masked) Attention
Tokens can only attend to previous tokens. Used in decoders like GPT for autoregressive generation (a mask sketch follows below).
Cross-Attention
Queries from one sequence, Keys/Values from another. Used in encoder-decoder models for translation.
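Of the three, the causal variant is the easiest to show concretely. In this minimal sketch (illustrative scores and sizes), positions above the diagonal are set to -inf before the softmax, so no token receives weight from the future:

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(1)
scores = rng.normal(size=(seq_len, seq_len))        # raw attention scores

# Causal mask: position i may only attend to positions j <= i
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))                         # upper triangle is all zeros
```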
Multi-Head Attention
Instead of one attention function, Transformers use multiple attention heads in parallel.
Each head learns different relationship patterns:
- Syntactic relationships (subject-verb)
- Semantic similarity
- Positional patterns
- Coreference (pronouns to nouns)
Each head works in a lower-dimensional subspace: d_k = d_model / h, where h is the number of heads.
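A minimal sketch of this head split, with illustrative sizes (h = 2 heads and d_model = 8, so d_k = 4 per head); the projection matrices are random stand-ins for learned weights:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, batched over the leading head axis."""
    w = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(2)
seq_len, d_model, h = 4, 8, 2
d_k = d_model // h                                  # d_k = d_model / h = 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(M):
    # (seq_len, d_model) -> (h, seq_len, d_k): each head sees its own slice
    return M.reshape(seq_len, h, d_k).transpose(1, 0, 2)

heads = attention(split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v))
# Concatenate the heads back to (seq_len, d_model) and mix them with W_o
out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ W_o
print(out.shape)                                    # (4, 8)
```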
Why Attention Revolutionized AI
Parallelization
Unlike RNNs, which process tokens sequentially, all positions can be computed simultaneously, enabling massive speedups on GPUs.
Long-Range Dependencies
Direct connections between any two positions, regardless of distance in the sequence.
Interpretability
Attention weights show what the model "looks at", providing insights into its reasoning.
Scalability
Powers models from BERT (110M parameters) to GPT-4 (reportedly over a trillion parameters) with the same core architecture.