KL Divergence

Kullback-Leibler Divergence: Measuring the difference between probability distributions

For discrete distributions:

D_KL(P || Q) = Σ_x P(x) log( P(x) / Q(x) )

For continuous distributions:

D_KL(P || Q) = ∫ p(x) log( p(x) / q(x) ) dx
  • P(x): the true distribution, the actual probability distribution we want to model
  • Q(x): the approximate distribution, our model's estimate
  • log(P(x) / Q(x)): the information ratio, measuring the difference in "surprise" between the two distributions at each point

Interactive Visualization

Adjusting the mean and standard deviation of two Gaussian distributions shows how the KL divergence changes. Notice that the divergence is asymmetric: D(P||Q) ≠ D(Q||P).
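The Gaussian case can also be checked numerically, since the KL divergence between two univariate Gaussians has a closed form. A minimal sketch (the function name `kl_gaussian` and the example parameters are illustrative, not from the original):

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form D(P || Q) for univariate Gaussians P and Q (in nats)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

# Swapping the arguments gives a different value: the divergence is asymmetric.
forward = kl_gaussian(0.0, 1.0, 1.0, 2.0)   # D(P || Q) ≈ 0.443
reverse = kl_gaussian(1.0, 2.0, 0.0, 1.0)   # D(Q || P) ≈ 1.307
```

Both values are non-negative, but they are clearly not equal, which is the asymmetry the visualization is meant to highlight.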

Key Properties

  • Non-negative: D(P||Q) ≥ 0, with equality if and only if P = Q
  • Asymmetric: D(P||Q) ≠ D(Q||P) in general
  • Not a metric: it is asymmetric and does not satisfy the triangle inequality
  • Can be infinite: if Q(x) = 0 at any point where P(x) > 0

Intuition

KL Divergence measures the expected number of extra bits (when using base-2 logarithms; nats for the natural log) needed to encode samples from P using a code optimized for Q.

Think of it as the "information cost" of using the wrong distribution Q when the true distribution is P.

The asymmetry matters: using Q to approximate P has a different cost than using P to approximate Q.
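The "extra bits" reading can be made concrete with a small discrete example (the distributions chosen here are illustrative): the cross-entropy H(P, Q) is the expected code length when the code assumes Q, the entropy H(P) is the best achievable length, and the gap between them is exactly D(P || Q).

```python
import math

P = [0.5, 0.25, 0.25]   # true distribution
Q = [1/3, 1/3, 1/3]     # assumed (wrong) distribution

# Expected code length in bits under each assumption:
h_p = -sum(p * math.log2(p) for p in P)               # entropy H(P)
cross = -sum(p * math.log2(q) for p, q in zip(P, Q))  # cross-entropy H(P, Q)

kl_bits = cross - h_p  # extra bits per symbol = D(P || Q) in base 2
```

Here H(P) = 1.5 bits but a code built for the uniform Q costs log2(3) ≈ 1.585 bits per symbol, so the information cost of the wrong distribution is about 0.085 bits per symbol.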

Applications in Machine Learning

Variational Autoencoders (VAE)

KL Divergence is used to regularize the latent space, ensuring the encoded distribution stays close to a prior (usually standard normal).
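For a diagonal-Gaussian encoder and a standard normal prior, this regularizer has a well-known closed form. A minimal sketch in plain Python (the function name `vae_kl` and the log-variance parameterization are illustrative; real VAE code computes this per-sample over tensors):

```python
import math

def vae_kl(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, I) ), summed over latent dimensions.

    Closed form: 0.5 * sum(exp(log_var) + mu^2 - 1 - log_var).
    """
    return 0.5 * sum(math.exp(lv) + m**2 - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# When the encoder outputs the prior itself (mu = 0, log_var = 0),
# the penalty vanishes; any deviation from the prior is penalized.
```

This term is added to the reconstruction loss, pulling every encoded distribution toward the prior.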

Cross-Entropy Loss

Minimizing cross-entropy with respect to the model is equivalent to minimizing the KL divergence between the true label distribution and the model's predictions, because H(P, Q) = H(P) + D(P||Q) and the entropy of the labels is constant.
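The identity H(P, Q) = H(P) + D(P||Q) can be verified directly on a single example (the one-hot label and prediction below are illustrative):

```python
import math

P = [1.0, 0.0, 0.0]   # one-hot true label
Q = [0.7, 0.2, 0.1]   # model's predicted probabilities

eps = 1e-12  # guard against log(0)
cross_entropy = -sum(p * math.log(q + eps) for p, q in zip(P, Q))
entropy_p = -sum(p * math.log(p + eps) for p in P if p > 0)
kl = sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(P, Q) if p > 0)

# H(P, Q) = H(P) + D(P || Q); with a one-hot label H(P) = 0, so the
# cross-entropy loss equals the KL divergence exactly.
```

Since H(P) does not depend on the model, the two losses have the same gradients and the same minimizer.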

Policy Optimization (RL)

Algorithms like TRPO and PPO use KL constraints to ensure policy updates don't change the behavior too drastically.
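For discrete action spaces, the constraint amounts to checking the KL divergence between the old and new action distributions at each state. A simplified sketch (the policies and the threshold `max_kl` are hypothetical; TRPO/PPO average this over sampled states rather than checking one state):

```python
import math

def kl_categorical(p, q):
    """Exact D(p || q) for two discrete action distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

old_policy = [0.6, 0.3, 0.1]    # action probabilities before the update
new_policy = [0.5, 0.35, 0.15]  # action probabilities after the update

max_kl = 0.01  # hypothetical trust-region threshold
step_ok = kl_categorical(old_policy, new_policy) <= max_kl
```

If the divergence exceeds the threshold, the update is shrunk or rejected, which keeps the new policy's behavior close to the old one.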

Knowledge Distillation

KL Divergence measures how well a smaller "student" model matches the output distribution of a larger "teacher" model.
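In practice both output distributions are usually softened with a temperature before comparing them, so the student also learns from the teacher's relative rankings of unlikely classes. A minimal sketch (the logits and the temperature `T` are illustrative assumptions):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [4.0, 1.0, 0.5]
student_logits = [3.0, 2.0, 0.0]

T = 2.0  # hypothetical temperature; higher T softens both distributions
loss = kl(softmax(teacher_logits, T), softmax(student_logits, T))
```

The distillation loss is zero only when the student reproduces the teacher's softened distribution exactly.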

Forward KL vs Reverse KL

Forward KL: D(P || Q)

Minimizing this makes Q cover all modes of P, even if it means putting probability mass where P is low.

Use when: You want Q to be "mean-seeking" (also called mass-covering or zero-avoiding)

Reverse KL: D(Q || P)

Minimizing this makes Q focus on a single mode of P, avoiding regions where P is low.

Use when: You want Q to be "mode-seeking" (also called zero-forcing)
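The two behaviors can be demonstrated by fitting a single narrow Gaussian Q to a bimodal target P on a grid and minimizing each direction of the KL by brute-force search over the mean (the grid, widths, and search ranges below are all illustrative choices):

```python
import math

xs = [i / 10 for i in range(-60, 61)]  # evaluation grid on [-6, 6]

def normalize(w):
    s = sum(w)
    return [v / s for v in w]

def gauss(grid, mu, sigma=0.5):
    """Discretized, normalized Gaussian bump on the grid."""
    return normalize([math.exp(-(x - mu)**2 / (2 * sigma**2)) for x in grid])

# Bimodal target: two equal modes, at -2 and +2.
P = normalize([a + b for a, b in zip(gauss(xs, -2.0), gauss(xs, 2.0))])

def kl(p, q):
    return sum(pi * math.log(pi / (qi + 1e-300)) for pi, qi in zip(p, q) if pi > 0)

mus = [i / 10 for i in range(-40, 41)]
mu_forward = min(mus, key=lambda m: kl(P, gauss(xs, m)))  # minimize D(P || Q)
mu_reverse = min(mus, key=lambda m: kl(gauss(xs, m), P))  # minimize D(Q || P)
```

Forward KL places Q between the two modes (near 0, covering both, even though P is low there), while reverse KL locks onto a single mode (near -2 or +2), matching the mean-seeking vs mode-seeking descriptions above.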