KL Divergence
Kullback-Leibler Divergence: Measuring the difference between probability distributions
Interactive Visualization
Adjust the mean and standard deviation of two Gaussian distributions to see how KL Divergence changes. Notice how the divergence is asymmetric: D(P||Q) ≠ D(Q||P).
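For two univariate Gaussians the divergence has a closed form, so the behavior of the sliders can be reproduced directly in code. A minimal NumPy sketch (the example parameters are arbitrary and chosen only to show the asymmetry):

```python
import numpy as np

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """D(P || Q) for P = N(mu1, sigma1^2) and Q = N(mu2, sigma2^2), in nats."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Example parameters (arbitrary, chosen only to show the asymmetry)
print(kl_gaussian(0.0, 1.0, 1.0, 2.0))  # D(P||Q) ~= 0.44
print(kl_gaussian(1.0, 2.0, 0.0, 1.0))  # D(Q||P) ~= 1.31 -- a different value
```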
Key Properties
- Non-negative: D(P||Q) ≥ 0, with equality only when P = Q
- Asymmetric: D(P||Q) ≠ D(Q||P) in general
- Not a metric: doesn't satisfy the triangle inequality
- Can be infinite: if Q(x) = 0 where P(x) > 0 (see the definition below)
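For reference, all of these properties follow from the definition (written here for discrete distributions; the continuous case replaces the sum with an integral):

D(P||Q) = Σₓ P(x) · log( P(x) / Q(x) )

With base-2 logarithms the value is measured in bits, with natural logarithms in nats; the terms where Q(x) = 0 but P(x) > 0 are what make the divergence infinite.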
Intuition
KL Divergence measures the expected extra bits needed to encode samples from P using a code optimized for Q.
Think of it as the "information cost" of using the wrong distribution Q when the true distribution is P.
The asymmetry matters: using Q to approximate P has a different cost than using P to approximate Q.
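This reading can be checked directly: the average code length when coding for Q is the cross-entropy H(P, Q), and the overhead over the optimal length H(P) is exactly D(P||Q). A small NumPy sketch with made-up distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # true distribution P (illustrative values)
q = np.array([0.4, 0.4, 0.2])   # distribution Q the code was designed for

optimal_bits = -np.sum(p * np.log2(p))   # average code length with a code built for P
actual_bits  = -np.sum(p * np.log2(q))   # average code length with a code built for Q
kl_pq        =  np.sum(p * np.log2(p / q))

print(actual_bits - optimal_bits)  # the overhead in bits per symbol...
print(kl_pq)                       # ...equals D(P||Q) exactly
```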
Applications in Machine Learning
Variational Autoencoders (VAE)
KL Divergence is used to regularize the latent space, ensuring the encoded distribution stays close to a prior (usually standard normal).
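With a Gaussian encoder q(z|x) = N(μ, σ²) and a standard-normal prior, this KL term has a closed form. A minimal sketch of that term (the mu/logvar naming follows common convention and is not tied to any particular library):

```python
import torch

def vae_kl_term(mu, logvar):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal-Gaussian
    encoder, summed over latent dimensions and averaged over the batch.
    `logvar` is log(sigma^2), the encoder's usual second output."""
    kl_per_example = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return kl_per_example.mean()

# A batch of 4 encodings of a 2-dimensional latent (illustrative values):
# an encoder that already outputs the prior incurs zero KL penalty.
mu, logvar = torch.zeros(4, 2), torch.zeros(4, 2)
print(vae_kl_term(mu, logvar))  # tensor(0.)
```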
Cross-Entropy Loss
Minimizing cross-entropy is equivalent to minimizing the KL divergence between the true label distribution and the model's predicted distribution: the two differ only by the entropy of the labels, which is constant with respect to the model.
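For hard labels the equivalence is exact: the label distribution P is one-hot, so its entropy is zero and the cross-entropy loss is the KL divergence itself. A small PyTorch sketch (the logits and label below are made up):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # model outputs for one example (illustrative)
labels = torch.tensor([0])                   # true class index

# Standard cross-entropy loss, as used to train classifiers
ce = F.cross_entropy(logits, labels)

# The same number written as D(P || Q), with P the one-hot label distribution
# and Q = softmax(logits): since H(P) = 0 for one-hot labels, the KL reduces
# to -log Q(true class), which is exactly the cross-entropy loss.
log_q = F.log_softmax(logits, dim=-1)
kl = -log_q[0, labels[0]]

print(ce.item(), kl.item())  # identical values
```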
Policy Optimization (RL)
Algorithms like TRPO and PPO constrain or penalize the KL divergence between the updated policy and the old one, so that a single update can't change the behavior too drastically.
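For a discrete action space, the per-state KL between the old and new policies can be computed directly from their logits. A minimal sketch of how such a check might look (the threshold value and tensor shapes are illustrative, not taken from any particular implementation):

```python
import torch
import torch.nn.functional as F

def mean_kl(old_logits, new_logits):
    """Mean KL( pi_old || pi_new ) over a batch of states, for discrete actions."""
    old_log_p = F.log_softmax(old_logits, dim=-1)
    new_log_p = F.log_softmax(new_logits, dim=-1)
    return (old_log_p.exp() * (old_log_p - new_log_p)).sum(dim=-1).mean()

# Illustrative check during an update: stop early (or add a KL penalty to the
# loss) once the new policy drifts too far from the one that collected the data.
old_logits = torch.randn(8, 4)                      # 8 states, 4 discrete actions
new_logits = old_logits + 0.1 * torch.randn(8, 4)   # slightly updated policy
if mean_kl(old_logits, new_logits) > 0.01:          # hypothetical target-KL threshold
    print("policy moved too far -- stop this round of updates")
```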
Knowledge Distillation
KL Divergence measures how well a smaller "student" model matches the output distribution of a larger "teacher" model.
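A common form of the distillation objective is a temperature-softened KL between the teacher's and student's outputs, scaled by T² to keep gradient magnitudes comparable across temperatures. A minimal sketch (the temperature and logits below are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL( teacher || student ) on temperature-softened distributions.
    The T*T factor keeps gradient magnitudes comparable across temperatures."""
    student_log_p = F.log_softmax(student_logits / T, dim=-1)
    teacher_p = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(student_log_p, teacher_p, reduction="batchmean") * (T * T)

# Illustrative logits for a batch of 2 examples over 5 classes
teacher_logits = torch.randn(2, 5)
student_logits = torch.randn(2, 5)
print(distillation_loss(student_logits, teacher_logits))
```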
Forward KL vs Reverse KL
Forward KL: D(P || Q)
Minimizing this makes Q cover all modes of P, even if it means putting probability mass where P is low.
Reverse KL: D(Q || P)
Minimizing this typically makes Q concentrate on a single mode of P, avoiding regions where P has little mass; the sketch below shows both behaviors on a bimodal target.
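One way to see the difference concretely is to fit a single Gaussian to a bimodal target by brute force under each objective. A rough numerical sketch (the mixture, grid, and search ranges are arbitrary choices made for illustration):

```python
import numpy as np

x = np.linspace(-10, 10, 2001)          # discretize so the KL integrals become sums
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target P: a mixture of two well-separated Gaussians (illustrative)
p = 0.5 * gaussian(x, -3.0, 1.0) + 0.5 * gaussian(x, 3.0, 1.0)

def kl(a, b):
    """Discretized KL(a || b); b is floored so near-zero values give a huge
    (rather than infinite) penalty and the comparisons below still work."""
    b = np.maximum(b, 1e-300)
    mask = a > 1e-15
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# Brute-force search over single-Gaussian approximations Q = N(mu, sigma^2)
best_fwd = best_rev = (np.inf, None, None)
for mu in np.linspace(-4, 4, 81):
    for sigma in np.linspace(0.5, 5, 46):
        q = gaussian(x, mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)
        if fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("forward KL picks mu=%.1f, sigma=%.1f" % best_fwd[1:])  # broad, centered: covers both modes
print("reverse KL picks mu=%.1f, sigma=%.1f" % best_rev[1:])  # narrow, sitting on one mode
```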