KV Cache

The optimization that makes LLM inference fast: caching Key and Value vectors

The Problem

During autoregressive generation, each new token needs to attend to all previous tokens. Without caching, we would recompute K and V for every previous token at each step — extremely wasteful!

Without KV Cache

Recompute Q, K, V for all tokens at every generation step

Step n: O(n²) attention computation
Total: O(n³) for n tokens

With KV Cache

Reuse cached K, V from previous tokens; compute Q, K, V only for the new token

Step n: O(n) attention computation
Total: O(n²) for n tokens
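To make the asymptotics concrete, here is a toy count of query/key dot products over a full generation run, with and without caching. The helpers scores_without_cache and scores_with_cache are illustrative, not from any library, and the counting model is a simplification rather than a benchmark.

def scores_without_cache(n):
    # Step i recomputes attention over all i tokens: i * i dot products
    return sum(i * i for i in range(1, n + 1))   # grows like n^3 / 3

def scores_with_cache(n):
    # Step i computes one new query against i cached keys: i dot products
    return sum(i for i in range(1, n + 1))       # grows like n^2 / 2

for n in (100, 1000, 4000):
    print(n, scores_without_cache(n), scores_with_cache(n))

At n = 1000 this is roughly 334 million versus about 500 thousand dot products, and the gap widens roughly linearly with sequence length.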

Interactive Visualization

As tokens are generated, the Key cache and the Value cache each grow by one entry per token, and the computation savings increase dramatically as the sequence grows.

How KV Cache Works

  1. Initial Prompt: compute K and V for all prompt tokens and store them in the cache.

  2. Generate New Token: compute Q, K, and V for the new token only.

  3. Attention with Cache: the new token's Q attends to all cached K, V plus its own new K, V.

  4. Update Cache: append the new K and V to the cache, then repeat for the next token.
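A concrete sketch of the cache update (step 4) in PyTorch. The [seq_len, n_heads, d_head] layout, the sizes, and the append_kv helper are illustrative assumptions, not any particular framework's cache format.

import torch

n_heads, d_head = 32, 128                      # example head configuration (illustrative)
K_cache = torch.empty(0, n_heads, d_head)      # [seq_len, n_heads, d_head], starts empty
V_cache = torch.empty(0, n_heads, d_head)

def append_kv(K_cache, V_cache, k_new, v_new):
    # Step 4: grow the sequence dimension by one with the new token's K, V
    K_cache = torch.cat([K_cache, k_new], dim=0)
    V_cache = torch.cat([V_cache, v_new], dim=0)
    return K_cache, V_cache

# One decode step produces K, V for exactly one token
k_new = torch.randn(1, n_heads, d_head)
v_new = torch.randn(1, n_heads, d_head)
K_cache, V_cache = append_kv(K_cache, V_cache, k_new, v_new)   # seq_len is now 1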

Memory vs Compute Trade-off

Memory Cost

KV cache size per sequence = 2 × n_layers × seq_len × d_model × bytes_per_element

For a 7B-class model (32 layers, d_model = 4096) with a 4K context in FP16, that is roughly 2 GB per sequence, and it grows linearly with batch size.
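A back-of-the-envelope check of that figure, assuming 32 layers, d_model = 4096, a 4K context, and FP16 (2 bytes per element); the numbers are illustrative of a 7B-class model.

n_layers, d_model, seq_len, bytes_per_elem = 32, 4096, 4096, 2   # FP16
# 2 accounts for storing both K and V at every layer and position
cache_bytes = 2 * n_layers * seq_len * d_model * bytes_per_elem
print(cache_bytes / 2**30, "GiB per sequence")   # 2.0 GiB; multiply by batch size for a server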

Compute Savings

Reduces generation from O(n³) to O(n²)

10x-100x faster for long sequences

The Trade-off

Trade memory for speed — essential for real-time LLM inference

Pseudocode: Generation with KV Cache

# Prefill: process the whole prompt once
prompt_tokens = tokenize(prompt)
kv_cache = {}

hidden = embed(prompt_tokens)
for layer in model.layers:
    Q, K, V = layer.qkv_proj(hidden)
    kv_cache[layer] = {"K": K, "V": V}     # cache K, V for every prompt token
    hidden = attention(Q, K, V)            # each layer feeds the next

next_token = sample(hidden[-1])            # first generated token

# Decode: autoregressive generation, one token at a time
for step in range(max_new_tokens):
    hidden = embed(next_token)             # only the newest token

    for layer in model.layers:
        q, k, v = layer.qkv_proj(hidden)   # just 1 token!

        # Append the new K, V to the cache
        kv_cache[layer]["K"] = concat(kv_cache[layer]["K"], k)
        kv_cache[layer]["V"] = concat(kv_cache[layer]["V"], v)

        # The new query attends to the full cache
        hidden = attention(q, kv_cache[layer]["K"], kv_cache[layer]["V"])

    next_token = sample(hidden)
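In practice the prompt pass is usually called the prefill phase and the token-by-token loop the decode phase; real implementations also tend to preallocate the cache up to a maximum length rather than concatenating on every step.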

Advanced KV Cache Techniques

Sliding Window

Only cache the last N tokens. Used in Mistral to keep cache memory bounded for long sequences.
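A minimal sketch of the idea: append the new token's K, V, then keep only the most recent window_size entries. The window size and the concatenate-then-slice approach are illustrative; real implementations typically use a rolling buffer index to avoid reallocation.

import torch

def sliding_window_append(K_cache, V_cache, k_new, v_new, window_size=4096):
    # Append, then drop everything older than the window
    K_cache = torch.cat([K_cache, k_new], dim=0)[-window_size:]
    V_cache = torch.cat([V_cache, v_new], dim=0)[-window_size:]
    return K_cache, V_cache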

Paged Attention

Manage the cache in fixed-size blocks, like virtual memory pages, so memory is not wasted on fragmentation. Powers vLLM for efficient batching.
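Conceptually, the cache is split into fixed-size blocks and a per-sequence page table maps token positions to blocks. The sketch below shows only this indexing idea; the names and block pool are hypothetical and this is not vLLM's actual API.

BLOCK_SIZE = 16                       # tokens per cache block (illustrative)
free_blocks = list(range(1024))       # pool of physical cache blocks
page_table = []                       # logical block index -> physical block id for one sequence

def slot_for_token(pos):
    # Allocate a new physical block whenever a logical block fills up
    if pos % BLOCK_SIZE == 0:
        page_table.append(free_blocks.pop())
    block_id = page_table[pos // BLOCK_SIZE]
    offset = pos % BLOCK_SIZE
    return block_id, offset           # where this token's K, V would be written

for pos in range(40):                 # 40 tokens span 3 blocks
    slot_for_token(pos)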

Multi-Query Attention

Share a single K, V head across all query heads to shrink the cache. Used in PaLM and Falcon.

Grouped-Query Attention

Middle ground: groups of heads share K,V. Used in Llama 2 70B, Mixtral.
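The effect on cache size follows directly from the number of K, V heads: only n_kv_heads are cached, so the cache shrinks by a factor of n_heads / n_kv_heads compared with full multi-head attention. The configuration below (80 layers, 64 query heads, d_head = 128, 4K context, FP16) is an illustrative 70B-class setup, and kv_cache_bytes is a hypothetical helper.

def kv_cache_bytes(n_layers, seq_len, n_kv_heads, d_head, bytes_per_elem=2):
    # Only the key/value heads are cached; query heads never are
    return 2 * n_layers * seq_len * n_kv_heads * d_head * bytes_per_elem

mha = kv_cache_bytes(80, 4096, n_kv_heads=64, d_head=128)   # every query head has its own K, V
gqa = kv_cache_bytes(80, 4096, n_kv_heads=8,  d_head=128)   # 8 shared K, V groups (GQA)
mqa = kv_cache_bytes(80, 4096, n_kv_heads=1,  d_head=128)   # one shared K, V head (MQA)
print(mha / 2**30, gqa / 2**30, mqa / 2**30)                # 10.0, 1.25, ~0.16 GiB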

Why KV Cache Matters

Without KV caching, generating a 1,000-token response means recomputing attention over roughly 330 million token pairs in total, instead of about 500 thousand with the cache, a difference of several hundred times.

This is why KV cache is essential for:

  • ChatGPT, Claude — Real-time conversation
  • Code completion — Low-latency suggestions
  • Streaming responses — Token-by-token output
  • Long-context models — 100K+ token windows