The Chain Rule is All You Need
Backpropagation sounds scary. It's just the chain rule from calculus applied systematically to a computation graph. If you understand function composition, you can understand backpropagation.
The chain rule: if y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx).
That's it. Backprop chains this rule through every operation in your neural network.
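To see the chain rule in action before writing any autograd machinery, here's a quick standalone check: differentiate y = sin(x²) by hand with the chain rule, then compare against a central finite difference as ground truth.

```python
import math

def f(x):
    # y = f(g(x)) with g(x) = x**2 and f(g) = sin(g)
    return math.sin(x ** 2)

def f_prime(x):
    # Chain rule: dy/dx = (dy/dg) * (dg/dx) = cos(x**2) * 2x
    return math.cos(x ** 2) * 2 * x

x = 1.5
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)  # central difference
print(numeric, f_prime(x))  # both approximately -1.88
```

The two numbers agree to many decimal places; every gradient backpropagation produces can be sanity-checked the same way.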
Build a Miniature Autograd Engine
The best way to understand backpropagation is to implement it. Let's build a tiny version of PyTorch's autograd.
```python
import math

class Value:
    """A scalar value with automatic gradient computation."""

    def __init__(self, data: float, _children=(), _op=""):
        self.data = data
        self.grad = 0.0                # Gradient of loss w.r.t. this value
        self._backward = lambda: None  # Function to compute gradients
        self._prev = set(_children)
        self._op = _op

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
```
Defining Operations with Backward Passes
Each operation needs a forward computation AND a backward computation (how to compute input gradients given output gradient):
```python
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        result = Value(self.data + other.data, (self, other), "+")

        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            # Chain rule: grad_a += result.grad * 1
            self.grad += result.grad
            other.grad += result.grad
        result._backward = _backward
        return result
```
```python
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        result = Value(self.data * other.data, (self, other), "*")

        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * result.grad
            other.grad += self.data * result.grad
        result._backward = _backward
        return result
```
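Note the `+=` rather than `=` in every backward function: when a value feeds into more than one operation, the chain rule says its total gradient is the sum of the contributions along each path. A standalone numeric check of the multi-path case y = x · x, where x is used twice:

```python
# y = x * x uses x along two paths; each path contributes x to dy/dx,
# so the accumulated total is 2x -- matching the analytic derivative.
def y(x):
    return x * x

x, h = 3.0, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)  # central difference, ~6.0
per_path = x                               # one path's contribution
print(numeric, per_path + per_path)
```

Overwriting with `=` instead of accumulating with `+=` would silently drop one path's contribution, a classic autograd bug.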
```python
    def __pow__(self, power):
        assert isinstance(power, (int, float))
        result = Value(self.data ** power, (self,), f"**{power}")

        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += power * (self.data ** (power - 1)) * result.grad
        result._backward = _backward
        return result
```
```python
    def relu(self):
        result = Value(max(0, self.data), (self,), "relu")

        def _backward():
            # d(relu(x))/dx = 1 if x > 0 else 0
            self.grad += (result.data > 0) * result.grad
        result._backward = _backward
        return result
```
```python
    def exp(self):
        result = Value(math.exp(self.data), (self,), "exp")

        def _backward():
            # d(e^x)/dx = e^x
            self.grad += result.data * result.grad
        result._backward = _backward
        return result
```
```python
    # Convenience methods, built from the primitives above
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
    def __truediv__(self, other): return self * other ** -1
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
    def __rsub__(self, other): return other + (-self)
```
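Division needs no backward pass of its own: `a / b` is rewritten as `a * b**-1`, so the existing `__pow__` rule handles d(b⁻¹)/db = -b⁻². A quick standalone finite-difference check that this composition gives the right derivative:

```python
# d(a/b)/db = -a / b**2; verify with a central difference
a, b, h = 6.0, 2.0, 1e-6
numeric = (a / (b + h) - a / (b - h)) / (2 * h)
analytic = -a / b ** 2
print(numeric, analytic)  # both approximately -1.5
```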
The Backward Pass: Topological Sort
To compute gradients correctly, we need to call each node's _backward() in reverse topological order (outputs before inputs):
```python
    def backward(self):
        """Compute gradients for all nodes via reverse-mode autodiff."""
        # Build topological order
        topo = []
        visited = set()

        def build_topo(node):
            if node not in visited:
                visited.add(node)
                for child in node._prev:
                    build_topo(child)
                topo.append(node)
        build_topo(self)

        # Gradient of loss w.r.t. itself = 1
        self.grad = 1.0

        # Propagate gradients in reverse topological order
        for node in reversed(topo):
            node._backward()
```
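Whenever you implement a backward pass, it pays to validate it against finite differences. A generic standalone checker (independent of the `Value` class, using a central difference) applied to the same expression we test below:

```python
def numerical_grad(f, x, h=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: f(x) = (x + 2) * x, analytic df/dx = 2x + 2
f = lambda x: (x + 2) * x
approx = numerical_grad(f, 3.0)
exact = 2 * 3.0 + 2
print(approx, exact)  # both approximately 8.0
```

If your engine's `.grad` disagrees with the numerical estimate by more than a tiny tolerance, the backward function for some operation is wrong.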
Testing It
```python
# Simple example: f(x) = x^2
x = Value(3.0)
y = x ** 2
y.backward()
print(x.grad)  # 6.0 ✓ (dy/dx = 2x at x=3)

# Chain rule: f(x) = (x + 2) * x
x = Value(3.0)
y = (x + Value(2.0)) * x
y.backward()
print(x.grad)  # 8.0 ✓ (y = x^2 + 2x, dy/dx = 2x + 2 = 8 at x=3)

# Verify against PyTorch
import torch
x_torch = torch.tensor(3.0, requires_grad=True)
y_torch = (x_torch + 2) * x_torch
y_torch.backward()
print(x_torch.grad)  # tensor(8.) ✓
```
Building a Neural Network with Our Engine
```python
import random

class Neuron:
    def __init__(self, n_inputs: int, nonlin: bool = True):
        self.weights = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.bias = Value(0.0)
        self.nonlin = nonlin

    def __call__(self, inputs: list[Value]) -> Value:
        # Weighted sum + bias
        activation = sum((w * x for w, x in zip(self.weights, inputs)), self.bias)
        return activation.relu() if self.nonlin else activation

    def parameters(self) -> list[Value]:
        return self.weights + [self.bias]

class Layer:
    def __init__(self, n_inputs: int, n_outputs: int, nonlin: bool = True):
        self.neurons = [Neuron(n_inputs, nonlin) for _ in range(n_outputs)]

    def __call__(self, x: list[Value]) -> list[Value]:
        return [neuron(x) for neuron in self.neurons]

    def parameters(self) -> list[Value]:
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    def __init__(self, n_inputs: int, layer_sizes: list[int]):
        sizes = [n_inputs] + layer_sizes
        # ReLU on hidden layers only; the output layer stays linear so it can
        # produce any real value and its gradient never dies at zero
        self.layers = [
            Layer(sizes[i], sizes[i + 1], nonlin=(i < len(layer_sizes) - 1))
            for i in range(len(sizes) - 1)
        ]

    def __call__(self, x: list[float]) -> Value | list[Value]:
        out = [Value(xi) for xi in x]
        for layer in self.layers:
            out = layer(out)
        return out[0] if len(out) == 1 else out

    def parameters(self) -> list[Value]:
        return [p for layer in self.layers for p in layer.parameters()]

    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.0
```
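A quick sanity check before training: you can predict a network's parameter count from its architecture, since each layer of `n_out` neurons over `n_in` inputs holds `n_out * (n_in + 1)` values (weights plus one bias per neuron). A small standalone helper (hypothetical, mirroring the MLP above):

```python
def mlp_param_count(n_inputs: int, layer_sizes: list[int]) -> int:
    """Count weights and biases for an MLP with the given layer sizes."""
    sizes = [n_inputs] + layer_sizes
    # each neuron: one weight per input, plus a bias
    return sum(sizes[i + 1] * (sizes[i] + 1) for i in range(len(sizes) - 1))

print(mlp_param_count(2, [4, 4, 1]))  # 12 + 20 + 5 = 37
```

Comparing this against `len(model.parameters())` catches wiring mistakes early.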
Training the Network
```python
# XOR problem: not linearly separable
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

model = MLP(n_inputs=2, layer_sizes=[4, 4, 1])
learning_rate = 0.1

for step in range(1000):
    # Forward pass
    predictions = [model(x) for x in X]

    # MSE loss
    loss = sum((pred - target) ** 2 for pred, target in zip(predictions, y))
    loss = loss * (1 / len(y))

    # Backward pass
    model.zero_grad()
    loss.backward()

    # Gradient descent
    for param in model.parameters():
        param.data -= learning_rate * param.grad

    if step % 100 == 0:
        print(f"Step {step}: loss={loss.data:.6f}")

# Test (exact numbers vary with the random initialization)
for x, target in zip(X, y):
    pred = model(x)
    print(f"Input: {x}, Target: {target}, Predicted: {pred.data:.4f}")
# Input: [0, 0], Target: 0, Predicted: 0.0123
# Input: [0, 1], Target: 1, Predicted: 0.9876
# Input: [1, 0], Target: 1, Predicted: 0.9854
# Input: [1, 1], Target: 0, Predicted: 0.0234 ✓
```
How PyTorch Does This at Scale
Our engine computes on scalars. PyTorch operates on tensors (n-dimensional arrays) using the same principles:
- Every tensor operation creates a node in the computation graph
- Each node stores a `grad_fn` (equivalent to our `_backward`)
- `.backward()` performs a topological sort and applies the chain rule at each node
- Gradients flow from the scalar loss back to all leaf tensors with `requires_grad=True`
The math is identical. PyTorch's implementation:
- Operates on batches of data simultaneously (vectorized)
- Runs the computation graph on GPU
- Handles edge cases (in-place operations, non-differentiable ops)
- Uses efficient C++ and CUDA kernels for performance
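To see what a tensor-level backward rule looks like, here is the closed-form backward pass for a linear map y = W x, written in plain Python lists as a standalone sketch (PyTorch applies the same rule inside fused C++/CUDA kernels, one rule per tensor op instead of one graph node per scalar):

```python
def linear_forward(W, x):
    # y[i] = sum_j W[i][j] * x[j]
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def linear_backward(W, x, dy):
    # dL/dW[i][j] = dy[i] * x[j]  (outer product of upstream grad and input)
    dW = [[dy[i] * x[j] for j in range(len(x))] for i in range(len(W))]
    # dL/dx = W^T dy
    dx = [sum(W[i][j] * dy[i] for i in range(len(W))) for j in range(len(x))]
    return dW, dx

W = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, 6.0]
dy = [1.0, 1.0]  # pretend upstream gradient from the loss
dW, dx = linear_backward(W, x, dy)
print(dW)  # [[5.0, 6.0], [5.0, 6.0]]
print(dx)  # [4.0, 6.0]
```

One matrix rule replaces thousands of scalar `_backward` calls, which is where most of the speedup comes from.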
Common Backpropagation Bugs
Vanishing gradients: Gradients shrink as they propagate through many layers.
```python
# Problem: sigmoid activation at every layer
# The sigmoid's gradient is at most 0.25, so through 10 layers: 0.25**10 ≈ 1e-6
# Fix: ReLU (gradient = 1 when active), residual connections (gradient highway)
```
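You can verify the shrinkage directly: the derivative of the sigmoid σ(x) is σ(x)(1 − σ(x)), which peaks at exactly 0.25 at x = 0. Even in this best case, ten stacked layers multiply ten such factors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value 0.25, attained at x = 0

# Best case: every layer sits at its maximum gradient of 0.25
factor = sigmoid_grad(0.0)
print(factor)        # 0.25
print(factor ** 10)  # 9.5367431640625e-07, about one millionth
```

In practice the per-layer factors are usually well below 0.25, so real networks decay even faster.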
Exploding gradients: Gradients grow exponentially.
```python
# Fix: gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
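What `clip_grad_norm_` does under the hood is straightforward: compute the global L2 norm across all gradients and, if it exceeds `max_norm`, scale every gradient down by the same factor so the direction is preserved. A minimal standalone sketch over a flat list of gradient values (hypothetical `grads` input rather than real parameters):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale a flat list of gradients so their global L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # ~[0.6, 0.8]: direction preserved, magnitude capped at 1.0
```

Because all gradients share one scale factor, the update direction is unchanged; only the step size is bounded.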
Gradients not flowing to some parameters: A layer is disconnected from the loss.
```python
# Debug: check which parameters have None gradients after backward
for name, param in model.named_parameters():
    if param.grad is None and param.requires_grad:
        print(f"No gradient: {name}")
```
Now that you understand how gradients work, see how they're used at scale in our guide to distributed training.