Understanding Backpropagation Through Computational Graphs
Backpropagation is the cornerstone algorithm for training neural networks. This interactive visualization demonstrates how gradients flow backward through a computational graph using the chain rule.
Interactive Demo
This example computes L = (a × b + c) × f with values a = 2, b = 3, c = -10, f = 5.
Forward Pass
Computes values from inputs to output. Each node calculates its value based on its inputs.
Backward Pass
Computes gradients from output to inputs using the chain rule. Shows how much each input affects the output.
How It Works
Forward Pass
The forward pass computes values from inputs to output in topological order:
e = a × b = 2 × 3 = 6
d = e + c = 6 + (-10) = -4
L = d × f = (-4) × 5 = -20
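A minimal sketch of this forward pass in Python (the variable names mirror the nodes in the graph above):

```python
# Forward pass: evaluate each node in topological order.
a, b, c, f = 2.0, 3.0, -10.0, 5.0

e = a * b   # e = 2 * 3 = 6
d = e + c   # d = 6 + (-10) = -4
L = d * f   # L = (-4) * 5 = -20

print(L)  # -20.0
```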
Backward Pass (Backpropagation)
The backward pass computes gradients from output to inputs using the chain rule:
- Initialize: ∂L/∂L = 1 (the gradient of the output with respect to itself)
- Gradient at d: ∂L/∂d = ∂L/∂L × ∂L/∂d = 1 × f = 5. Since L = d × f, the local derivative ∂L/∂d = f.
- Gradient at f: ∂L/∂f = ∂L/∂L × ∂L/∂f = 1 × d = -4. Since L = d × f, the local derivative ∂L/∂f = d.
- Gradient at e: ∂L/∂e = ∂L/∂d × ∂d/∂e = 5 × 1 = 5. Since d = e + c, the local derivative ∂d/∂e = 1.
- Gradient at c: ∂L/∂c = ∂L/∂d × ∂d/∂c = 5 × 1 = 5. Since d = e + c, the local derivative ∂d/∂c = 1.
- Gradient at a: ∂L/∂a = ∂L/∂e × ∂e/∂a = 5 × b = 5 × 3 = 15. Since e = a × b, the local derivative ∂e/∂a = b.
- Gradient at b: ∂L/∂b = ∂L/∂e × ∂e/∂b = 5 × a = 5 × 2 = 10. Since e = a × b, the local derivative ∂e/∂b = a.
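The same walk can be written out by hand, propagating each upstream gradient through the local derivatives (a sketch continuing the forward-pass snippet above):

```python
# Backward pass: start at the output and apply the chain rule node by node.
dL_dL = 1.0          # gradient of the output w.r.t. itself

# L = d * f: local derivatives are f and d
dL_dd = dL_dL * f    # 1 * 5  = 5
dL_df = dL_dL * d    # 1 * -4 = -4

# d = e + c: addition passes the gradient through unchanged
dL_de = dL_dd * 1.0  # 5
dL_dc = dL_dd * 1.0  # 5

# e = a * b: local derivatives are b and a
dL_da = dL_de * b    # 5 * 3 = 15
dL_db = dL_de * a    # 5 * 2 = 10

print(dL_da, dL_db, dL_dc, dL_df)  # 15.0 10.0 5.0 -4.0
```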
The Chain Rule
The chain rule is the mathematical foundation of backpropagation.
For a function composed as y = g(f(x)), the chain rule states:
dy/dx = (dy/du) × (du/dx)
where u = f(x). In our computational graph, we apply this rule repeatedly to compute how the output L changes with respect to each input variable.
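To make the rule concrete, here is a small numerical check with hypothetical choices f(x) = x² and g(u) = 3u + 1 (both picked purely for illustration); the finite-difference slope of the composition should match the product of the two analytic derivatives:

```python
# Chain rule check for y = g(f(x)) with f(x) = x**2 and g(u) = 3u + 1.
def f(x):
    return x ** 2

def g(u):
    return 3 * u + 1

x = 2.0

analytic = 3 * (2 * x)  # (dy/du) * (du/dx) = 3 * 2x = 12

h = 1e-6
numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)  # central difference

print(analytic, round(numeric, 4))  # 12.0 12.0
```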
Local Derivatives
Each operation in the graph has local derivatives that describe how its output changes with respect to its inputs:
- Multiplication: For c = a × b, we have ∂c/∂a = b and ∂c/∂b = a
- Addition: For c = a + b, we have ∂c/∂a = 1 and ∂c/∂b = 1
These local derivatives are combined via the chain rule during backpropagation to compute the gradients of the output with respect to all inputs.
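These two rules are enough to build a toy reverse-mode autodiff. Below is a minimal sketch (loosely in the style of micrograd; the Value class and its fields are illustrative, not from any particular library) that records the local derivatives at each node during the forward pass and replays them during the backward pass:

```python
class Value:
    """A scalar node in a computational graph with reverse-mode autodiff."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents  # (parent Value, local derivative) pairs

    def __mul__(self, other):
        # c = a * b: local derivatives are dc/da = b and dc/db = a
        return Value(self.data * other.data,
                     [(self, other.data), (other, self.data)])

    def __add__(self, other):
        # c = a + b: local derivatives are both 1
        return Value(self.data + other.data, [(self, 1.0), (other, 1.0)])

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream gradient times local derivative.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

a, b, c, f = Value(2.0), Value(3.0), Value(-10.0), Value(5.0)
L = (a * b + c) * f
L.backward()
print(a.grad, b.grad, c.grad, f.grad)  # 15.0 10.0 5.0 -4.0
```

This recursive version visits each path separately, which is fine for a small graph like ours; real frameworks topologically sort the graph so each node is processed exactly once, which is what makes a single backward pass efficient.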
Why This Matters
In neural networks, the computational graph is much larger (with thousands to billions of nodes), but the principle remains the same. Backpropagation efficiently computes gradients for all parameters in a single backward pass, enabling gradient descent optimization to train the network.
The gradients tell us how to adjust each parameter to reduce the loss function. A large gradient (like ∂L/∂a = 15) means that variable has a strong influence on the output, while a small gradient means less influence.
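For comparison, a framework such as PyTorch builds this graph automatically as you compute. A quick check of the same example (assuming torch is installed):

```python
import torch

# Leaf tensors with requires_grad=True become inputs to the graph.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(-10.0, requires_grad=True)
f = torch.tensor(5.0, requires_grad=True)

L = (a * b + c) * f  # forward pass records the graph
L.backward()         # one backward pass fills every .grad

print(a.grad, b.grad, c.grad, f.grad)
# tensor(15.) tensor(10.) tensor(5.) tensor(-4.)
```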
Key Takeaways
- Forward pass: Compute values from inputs to output
- Backward pass: Compute gradients from output to inputs
- Chain rule: Multiply gradients along each path
- Local derivatives: Simple rules for each operation type
- Efficiency: One backward pass computes all gradients
- Applications: Training neural networks via gradient descent
Further Reading
To dive deeper into the mathematics of backpropagation: