KV Cache
The optimization that makes LLM inference fast: caching Key and Value vectors
The Problem
During autoregressive generation, each new token needs to attend to all previous tokens. Without caching, we would recompute K and V for every previous token at each step — extremely wasteful!
Without KV Cache
Recompute Q, K, V for all tokens at every generation step
Total: O(n³) for n tokens
With KV Cache
Cache K, V from previous tokens, only compute Q for new token
Total: O(n²) for n tokens
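As a quick sanity check on those complexities, here is a short Python sketch (not from the original page) that counts how many attention scores, i.e. query-key token pairs, get computed over a whole generation with and without caching; it ignores projection cost and the model dimension, and the function names are purely illustrative.

# Count attention-score computations (token pairs) across a full generation.
def pairs_without_cache(n: int) -> int:
    # At step t the full t x t causal attention is recomputed:
    # sum of t*(t+1)/2 over t = 1..n  ->  O(n^3)
    return sum(t * (t + 1) // 2 for t in range(1, n + 1))

def pairs_with_cache(n: int) -> int:
    # At step t only the new query attends to t cached keys:
    # sum of t over t = 1..n  ->  O(n^2)
    return sum(range(1, n + 1))

for n in (100, 1_000, 10_000):
    no_cache, cache = pairs_without_cache(n), pairs_with_cache(n)
    print(f"n={n:>6}: {no_cache:>18,} without cache vs {cache:>12,} with cache ({no_cache / cache:.0f}x)")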
Interactive Visualization
Watch how the KV cache grows as tokens are generated. The left panel shows the Key cache, the right shows the Value cache. Notice how computation savings increase dramatically as the sequence grows.
How KV Cache Works
1. Initial Prompt: Compute K, V for all prompt tokens and store them in the cache.
2. Generate New Token: Compute Q, K, V only for the new token.
3. Attention with Cache: The new Q attends to all cached K, V plus the new K, V.
4. Update Cache: Append the new K, V to the cache and repeat for the next token.
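To make the four steps concrete, here is a minimal, self-contained NumPy sketch of a single attention layer with a growing KV cache. The weights and inputs are random, and the dimensions (d_model = 8, a 5-token prompt, 3 decode steps) are made up purely for illustration.

import numpy as np

d_model = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    # Scaled dot-product attention of one query against all keys/values.
    scores = q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Step 1 (prefill): compute K, V for every prompt token and store them.
prompt = rng.normal(size=(5, d_model))      # stand-in for 5 prompt token embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv

x = prompt[-1]                              # representation of the newest token
for step in range(3):
    q, k, v = x @ Wq, x @ Wk, x @ Wv        # Step 2: project only the new token
    # Step 3: the new q attends to all cached K, V plus the new k, v
    x = attend(q, np.vstack([K_cache, k]), np.vstack([V_cache, v]))
    # Step 4: append the new k, v to the cache for the next iteration
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    print(f"step {step}: cache now holds {K_cache.shape[0]} tokens")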
Memory vs Compute Trade-off
Memory Cost
KV cache size = 2 × n_layers × seq_len × d_model × bytes_per_element (per sequence)
For a 7B model (32 layers, d_model = 4096) with 4K context in fp16: ~2GB of KV cache per sequence, or ~8GB for a batch of 4
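A quick back-of-the-envelope script makes that estimate reproducible. The layer count and hidden size below are Llama-2-7B-like values (32 layers, d_model = 4096) assumed for illustration; real deployments also scale with batch size and the width of the cache dtype.

def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_elem=2, batch_size=1):
    # 2 = one K tensor and one V tensor per layer
    return 2 * n_layers * seq_len * d_model * bytes_per_elem * batch_size

per_seq = kv_cache_bytes(n_layers=32, seq_len=4096, d_model=4096)    # fp16, batch 1
print(f"{per_seq / 2**30:.1f} GiB per sequence")                     # ~2.0 GiB
batch4 = kv_cache_bytes(n_layers=32, seq_len=4096, d_model=4096, batch_size=4)
print(f"{batch4 / 2**30:.1f} GiB for a batch of 4")                  # ~8.0 GiB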
Compute Savings
Reduces generation from O(n³) to O(n²)
10x-100x faster for long sequences
The Trade-off
Trade memory for speed — essential for real-time LLM inference
Pseudocode: Generation with KV Cache
# Phase 1: prompt processing (prefill)
prompt_tokens = tokenize(prompt)
hidden = embed(prompt_tokens)               # embedding lookup for all prompt tokens
kv_cache = {}
for layer in model.layers:
    Q, K, V = layer.qkv_proj(hidden)        # project every prompt token once
    kv_cache[layer] = {"K": K, "V": V}      # store K, V for reuse
    hidden = attention(Q, K, V)             # layer output feeds the next layer

# Phase 2: autoregressive generation (decode)
next_token = sample(hidden[-1])             # first generated token
for step in range(max_new_tokens):
    hidden = embed(next_token)              # only process the NEWEST token
    for layer in model.layers:
        q, k, v = layer.qkv_proj(hidden)    # just 1 token!
        # Append the new k, v to the cache
        kv_cache[layer]["K"] = concat(kv_cache[layer]["K"], k)
        kv_cache[layer]["V"] = concat(kv_cache[layer]["V"], v)
        # Attend to the full cache (all previous tokens + the new one)
        hidden = attention(q, kv_cache[layer]["K"], kv_cache[layer]["V"])
    next_token = sample(hidden)
Advanced KV Cache Techniques
Sliding Window
Only cache the last N tokens. Used in Mistral to handle long sequences with bounded memory.
Paged Attention
Manage cache like virtual memory pages. Powers vLLM for efficient batching.
Multi-Query Attention
Share a single K,V head across all query heads to shrink the cache. Used in PaLM and Falcon.
Grouped-Query Attention
Middle ground: groups of heads share K,V. Used in Llama 2 70B, Mixtral.
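The sketch below compares roughly how much cache these schemes keep around for a long sequence. The dimensions (32 layers, 128-dim heads, fp16, a 4096-token window) are assumptions for illustration and do not come from any specific released model.

def kv_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
             bytes_per_elem=2, window=None):
    # With a sliding window only the most recent `window` tokens stay cached.
    cached_tokens = min(seq_len, window) if window else seq_len
    return 2 * n_layers * cached_tokens * n_kv_heads * head_dim * bytes_per_elem

seq_len = 32_768
configs = {
    "MHA (32 KV heads)":           kv_bytes(seq_len, n_kv_heads=32),
    "GQA (8 KV heads)":            kv_bytes(seq_len, n_kv_heads=8),
    "MQA (1 KV head)":             kv_bytes(seq_len, n_kv_heads=1),
    "MHA + sliding window (4096)": kv_bytes(seq_len, window=4096),
}
for name, size in configs.items():
    print(f"{name:<30} {size / 2**30:5.2f} GiB")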
Why KV Cache Matters
Without KV caching, generating a 1000-token response means recomputing the full causal attention at every step: on the order of 167 million query-key token pairs in total, versus roughly 500 thousand with the cache, a difference of over 300x!
This is why KV cache is essential for:
- ChatGPT, Claude — Real-time conversation
- Code completion — Low-latency suggestions
- Streaming responses — Token-by-token output
- Long-context models — 100K+ token windows