Backpropagation

The algorithm that powers deep learning: computing gradients through neural networks

The Chain Rule in Action

∂Loss/∂w = ∂Loss/∂a × ∂a/∂w
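For a single sigmoid neuron with a squared-error loss, the two factors can be computed directly. A minimal numeric sketch in Python; the input, target, and weight values, and the choice of sigmoid and squared error, are illustrative assumptions rather than anything specified above:

import numpy as np

# One neuron: a = σ(w·x), Loss = ½(a − y)². Values are illustrative.
x, y = 1.5, 0.0                  # input and target
w = 0.8                          # weight

z = w * x
a = 1.0 / (1.0 + np.exp(-z))     # activation a = σ(z)

dLoss_da = a - y                 # ∂Loss/∂a for squared error
da_dw = a * (1.0 - a) * x        # ∂a/∂w = σ'(z) · x
dLoss_dw = dLoss_da * da_dw      # chain rule: ∂Loss/∂w
print(dLoss_dw)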

Forward Pass

Input data flows through the network layer by layer. Each neuron computes a weighted sum of inputs, applies an activation function, and passes the result to the next layer.

a^(l) = σ(W^(l) · a^(l-1) + b^(l))
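A minimal NumPy sketch of this layer-by-layer computation; the layer sizes, the random weights, and the choice of sigmoid for σ are assumptions for illustration (later sketches reuse forward, weights, and biases from here):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activations of every layer, input included."""
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b      # pre-activation: z^(l) = W^(l) · a^(l-1) + b^(l)
        activations.append(sigmoid(z))   # activation:     a^(l) = σ(z^(l))
    return activations

# Tiny 2-3-1 network with random parameters (illustrative only).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [np.zeros(3), np.zeros(1)]
acts = forward(np.array([0.5, -1.0]), weights, biases)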

Backward Pass

Gradients flow backward from the loss function. Using the chain rule, we compute how much each weight contributed to the error, enabling precise weight updates.

δ^(l) = (W^(l+1))^T · δ^(l+1) ⊙ σ'(z^(l))
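Continuing the forward-pass sketch above, a sketch of propagating the error backward and turning it into gradients, assuming sigmoid layers and a squared-error loss:

# Target for the tiny network above (illustrative).
y = np.array([1.0])
deltas = [None] * len(weights)

# Output layer: δ^(L) = ∇_a Loss ⊙ σ'(z^(L)); for sigmoid, σ'(z) = a(1 − a).
a_out = acts[-1]
deltas[-1] = (a_out - y) * a_out * (1 - a_out)

# Hidden layers: δ^(l) = (W^(l+1))^T · δ^(l+1) ⊙ σ'(z^(l)).
for l in range(len(weights) - 2, -1, -1):
    a = acts[l + 1]
    deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * a * (1 - a)

# Weight and bias gradients follow directly from the deltas.
grads_W = [np.outer(deltas[l], acts[l]) for l in range(len(weights))]
grads_b = [deltas[l] for l in range(len(weights))]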

Interactive Visualization

[Interactive demo: signals propagate forward through the network (blue) while gradients flow backward (red).]

Algorithm Steps

  1. Forward Propagation: compute activations layer by layer from input to output.
  2. Compute Loss: calculate the error between the prediction and the target.
  3. Backward Propagation: compute gradients using the chain rule, layer by layer.
  4. Update Weights: adjust weights in the direction that reduces the loss (one full iteration is sketched below).
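Putting the four steps together, a minimal sketch of one training iteration, reusing forward from the earlier sketch; the squared-error loss and the default learning rate are illustrative choices:

def train_step(x, y, weights, biases, lr=0.1):
    # 1. Forward propagation
    acts = forward(x, weights, biases)

    # 2. Compute loss (squared error, illustrative choice)
    loss = 0.5 * np.sum((acts[-1] - y) ** 2)

    # 3. Backward propagation (sigmoid layers, squared-error loss)
    deltas = [None] * len(weights)
    deltas[-1] = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
    for l in range(len(weights) - 2, -1, -1):
        a = acts[l + 1]
        deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * a * (1 - a)

    # 4. Update weights with gradient descent: w = w - η · ∂Loss/∂w
    for l in range(len(weights)):
        weights[l] -= lr * np.outer(deltas[l], acts[l])
        biases[l] -= lr * deltas[l]
    return loss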

Key Concepts

Gradient Descent

w = w - η · ∂Loss/∂w
Weights are updated in the opposite direction of the gradient.

Learning Rate (η)

Controls the step size of each update. Too high a rate causes overshooting; too low a rate makes convergence slow.
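One way to see both failure modes is to rerun the training-step sketch above with different learning rates; the values below are arbitrary, and the exact behavior depends on the network and data:

# Compare learning rates on the tiny network from the sketches above.
for lr in (5.0, 0.5, 0.001):                  # likely too high, moderate, too low
    W = [w.copy() for w in weights]
    b = [v.copy() for v in biases]
    for _ in range(200):
        loss = train_step(np.array([0.5, -1.0]), np.array([1.0]), W, b, lr=lr)
    print(f"lr={lr}: loss after 200 steps = {loss:.6f}")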

Vanishing Gradients

Gradients can shrink exponentially as they propagate backward through deep networks; ReLU activations, residual connections, and normalization layers mitigate this.
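A back-of-the-envelope illustration: the sigmoid derivative never exceeds 0.25, so the activation-derivative factor alone shrinks the back-propagated signal geometrically with depth (weight matrices are ignored here for simplicity):

# σ'(z) = σ(z)(1 − σ(z)) ≤ 0.25, so after 20 sigmoid layers the
# activation-derivative factor alone is at most 0.25**20 ≈ 9e-13.
depth = 20
print(0.25 ** depth)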

Mathematical Foundation

Forward Pass Equations

Pre-activation:
z^(l) = W^(l) · a^(l-1) + b^(l)
Activation:
a^(l) = σ(z^(l))

Backward Pass Equations

Output layer error:
δ^(L) = ∇_a Loss ⊙ σ'(z^(L))
Weight gradient:
∂Loss/∂W^(l) = δ^(l) · (a^(l-1))^T
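A common sanity check for these equations is to compare the analytic gradient with a finite-difference estimate. A minimal sketch, reusing the network and forward function from the earlier sketches; the perturbation size and the checked index are arbitrary:

# Check ∂Loss/∂W^(1) from backprop against a central finite difference.
def loss_fn(x, y, weights, biases):
    return 0.5 * np.sum((forward(x, weights, biases)[-1] - y) ** 2)

x, y = np.array([0.5, -1.0]), np.array([1.0])
acts = forward(x, weights, biases)
delta_out = (acts[-1] - y) * acts[-1] * (1 - acts[-1])
delta_hid = (weights[1].T @ delta_out) * acts[1] * (1 - acts[1])
analytic = np.outer(delta_hid, acts[0])        # δ^(1) · (a^(0))^T

eps, i, j = 1e-6, 0, 0
W_plus = [w.copy() for w in weights]
W_plus[0][i, j] += eps
W_minus = [w.copy() for w in weights]
W_minus[0][i, j] -= eps
numeric = (loss_fn(x, y, W_plus, biases) - loss_fn(x, y, W_minus, biases)) / (2 * eps)
print(analytic[i, j], numeric)                 # the two values should agree closely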

Why Backpropagation Matters

Backpropagation, formalized by Rumelhart, Hinton, and Williams in 1986, is the cornerstone of modern deep learning. It computes the gradient with respect to every parameter in a single backward sweep whose cost is proportional to that of the forward pass, i.e. O(n) for n parameters, whereas naively estimating each gradient by finite differences would require a separate forward pass per parameter.

Without backpropagation, training neural networks with millions or billions of parameters would be computationally infeasible. It enables:

  • Large Language Models like GPT and Claude
  • Computer Vision networks like ResNet and Vision Transformers
  • Generative AI including diffusion models and GANs
  • Reinforcement Learning algorithms like PPO and SAC