Understanding Backpropagation Through Computational Graphs
Backpropagation is the cornerstone algorithm for training neural networks. This interactive visualization demonstrates how gradients flow backward through a computational graph using the chain rule.
Interactive Demo
This example computes L = (a × b + c) × f with values a = 2, b = 3, c = -10, f = 5.
Forward Pass
Computes values from inputs to output. Each node calculates its value based on its inputs.
Backward Pass
Computes gradients from output to inputs using the chain rule. Shows how much each input affects the output.
How It Works
Forward Pass
The forward pass computes values from inputs to output in topological order:
e = a × b = 2 × 3 = 6
d = e + c = 6 + (-10) = -4
L = d × f = (-4) × 5 = -20
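A minimal sketch of this forward pass in Python (the variable names mirror the nodes in the graph above):

```python
# Forward pass: evaluate each node in topological order.
a, b, c, f = 2.0, 3.0, -10.0, 5.0

e = a * b   # e = 2 * 3 = 6
d = e + c   # d = 6 + (-10) = -4
L = d * f   # L = (-4) * 5 = -20

print(L)  # -20.0
```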
Backward Pass (Backpropagation)
The backward pass computes gradients from output to inputs using the chain rule:
- Initialize: ∂L/∂L = 1 (the gradient of the output with respect to itself)
- Gradient at d: ∂L/∂d = ∂L/∂L × ∂L/∂d = 1 × f = 5. Since L = d × f, the local derivative ∂L/∂d = f.
- Gradient at f: ∂L/∂f = ∂L/∂L × ∂L/∂f = 1 × d = -4. Since L = d × f, the local derivative ∂L/∂f = d.
- Gradient at e: ∂L/∂e = ∂L/∂d × ∂d/∂e = 5 × 1 = 5. Since d = e + c, the local derivative ∂d/∂e = 1.
- Gradient at c: ∂L/∂c = ∂L/∂d × ∂d/∂c = 5 × 1 = 5. Since d = e + c, the local derivative ∂d/∂c = 1.
- Gradient at a: ∂L/∂a = ∂L/∂e × ∂e/∂a = 5 × b = 5 × 3 = 15. Since e = a × b, the local derivative ∂e/∂a = b.
- Gradient at b: ∂L/∂b = ∂L/∂e × ∂e/∂b = 5 × a = 5 × 2 = 10. Since e = a × b, the local derivative ∂e/∂b = a.
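The same walk can be written out by hand, propagating each upstream gradient through the local derivatives (a sketch continuing the forward-pass snippet above):

```python
# Backward pass: start at the output and apply the chain rule node by node.
dL_dL = 1.0          # gradient of the output w.r.t. itself

# L = d * f: local derivatives are f and d
dL_dd = dL_dL * f    # 1 * 5  = 5
dL_df = dL_dL * d    # 1 * -4 = -4

# d = e + c: addition passes the gradient through unchanged
dL_de = dL_dd * 1.0  # 5
dL_dc = dL_dd * 1.0  # 5

# e = a * b: local derivatives are b and a
dL_da = dL_de * b    # 5 * 3 = 15
dL_db = dL_de * a    # 5 * 2 = 10

print(dL_da, dL_db, dL_dc, dL_df)  # 15.0 10.0 5.0 -4.0
```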
The Chain Rule
The chain rule is the mathematical foundation of backpropagation.
For a function composed as y = g(f(x)), the chain rule states:
dy/dx = (dy/du) × (du/dx)
where u = f(x). In our computational graph, we apply this rule repeatedly to compute how the output L changes with respect to each input variable.
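To make the rule concrete, here is a small numerical check with hypothetical choices f(x) = x² and g(u) = 3u + 1 (both picked purely for illustration); the finite-difference slope of the composition should match the product of the two analytic derivatives:

```python
# Chain rule check for y = g(f(x)) with f(x) = x**2 and g(u) = 3u + 1.
def f(x):
    return x ** 2

def g(u):
    return 3 * u + 1

x = 2.0

analytic = 3 * (2 * x)  # (dy/du) * (du/dx) = 3 * 2x = 12

h = 1e-6
numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)  # central difference

print(analytic, round(numeric, 4))  # 12.0 12.0
```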
Local Derivatives
Each operation in the graph has local derivatives that describe how its output changes with respect to its inputs:
- Multiplication: For c = a × b, we have ∂c/∂a = b and ∂c/∂b = a
- Addition: For c = a + b, we have ∂c/∂a = 1 and ∂c/∂b = 1
These local derivatives are combined via the chain rule during backpropagation to compute the gradients of the output with respect to all inputs.
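These two rules are enough to build a toy reverse-mode autodiff. Below is a minimal sketch (loosely in the style of micrograd; the Value class and its fields are illustrative, not from any particular library) that records the local derivatives at each node during the forward pass and replays them during the backward pass:

```python
class Value:
    """A scalar node in a computational graph with reverse-mode autodiff."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents  # (parent Value, local derivative) pairs

    def __mul__(self, other):
        # c = a * b: local derivatives are dc/da = b and dc/db = a
        return Value(self.data * other.data,
                     [(self, other.data), (other, self.data)])

    def __add__(self, other):
        # c = a + b: local derivatives are both 1
        return Value(self.data + other.data, [(self, 1.0), (other, 1.0)])

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream gradient times local derivative.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

a, b, c, f = Value(2.0), Value(3.0), Value(-10.0), Value(5.0)
L = (a * b + c) * f
L.backward()
print(a.grad, b.grad, c.grad, f.grad)  # 15.0 10.0 5.0 -4.0
```

This recursive version visits each path separately, which is fine for a small graph like ours; real frameworks topologically sort the graph so each node is processed exactly once, which is what makes a single backward pass efficient.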
Why This Matters
In neural networks, the computational graph is much larger (with thousands to billions of nodes), but the principle remains the same. Backpropagation efficiently computes gradients for all parameters in a single backward pass, enabling gradient descent optimization to train the network.
The gradients tell us how to adjust each parameter to reduce the loss function. A large gradient (like ∂L/∂a = 15) means that variable has a strong influence on the output, while a small gradient means less influence.
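For comparison, a framework such as PyTorch builds this graph automatically as you compute. A quick check of the same example (assuming torch is installed):

```python
import torch

# Leaf tensors with requires_grad=True become inputs to the graph.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(-10.0, requires_grad=True)
f = torch.tensor(5.0, requires_grad=True)

L = (a * b + c) * f  # forward pass records the graph
L.backward()         # one backward pass fills every .grad

print(a.grad, b.grad, c.grad, f.grad)
# tensor(15.) tensor(10.) tensor(5.) tensor(-4.)
```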
Key Takeaways
- Forward pass: Compute values from inputs to output
- Backward pass: Compute gradients from output to inputs
- Chain rule: Multiply gradients along each path
- Local derivatives: Simple rules for each operation type
- Efficiency: One backward pass computes all gradients
- Applications: Training neural networks via gradient descent
Further Reading
To dive deeper into the mathematics of backpropagation: