The Chain Rule is All You Need
Backpropagation sounds scary. It's just the chain rule from calculus applied systematically to a computation graph. If you understand function composition, you can understand backpropagation.
The chain rule: if y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx).
That's it. Backprop chains this rule through every operation in your neural network.
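To see the chain rule in action before writing any autograd machinery, here's a quick standalone check: differentiate y = sin(x²) by hand with the chain rule, then compare against a central finite difference as ground truth.

```python
import math

def f(x):
    # y = f(g(x)) with g(x) = x**2 and f(g) = sin(g)
    return math.sin(x ** 2)

def f_prime(x):
    # Chain rule: dy/dx = (dy/dg) * (dg/dx) = cos(x**2) * 2x
    return math.cos(x ** 2) * 2 * x

x = 1.5
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)  # central difference
print(numeric, f_prime(x))  # both approximately -1.88
```

The two numbers agree to many decimal places; every gradient backpropagation produces can be sanity-checked the same way.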
Build a Miniature Autograd Engine
The best way to understand backpropagation is to implement it. Let's build a tiny version of PyTorch's autograd.
```python
import math

class Value:
    """A scalar value with automatic gradient computation."""

    def __init__(self, data: float, _children=(), _op=""):
        self.data = data
        self.grad = 0.0                # Gradient of loss w.r.t. this value
        self._backward = lambda: None  # Function to compute gradients
        self._prev = set(_children)
        self._op = _op

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
```
Defining Operations with Backward Passes
Each operation needs a forward computation AND a backward computation (how to compute input gradients given output gradient):
```python
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        result = Value(self.data + other.data, (self, other), "+")

        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            # Chain rule: grad_a += result.grad * 1
            self.grad += result.grad
            other.grad += result.grad
        result._backward = _backward
        return result
```
```python
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        result = Value(self.data * other.data, (self, other), "*")

        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * result.grad
            other.grad += self.data * result.grad
        result._backward = _backward
        return result
```
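Note the `+=` rather than `=` in every backward function: when a value feeds into more than one operation, the chain rule says its total gradient is the sum of the contributions along each path. A standalone numeric check of the multi-path case y = x · x, where x is used twice:

```python
# y = x * x uses x along two paths; each path contributes x to dy/dx,
# so the accumulated total is 2x -- matching the analytic derivative.
def y(x):
    return x * x

x, h = 3.0, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)  # central difference, ~6.0
per_path = x                               # one path's contribution
print(numeric, per_path + per_path)
```

Overwriting with `=` instead of accumulating with `+=` would silently drop one path's contribution, a classic autograd bug.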
```python
    def __pow__(self, power):
        assert isinstance(power, (int, float))
        result = Value(self.data ** power, (self,), f"**{power}")

        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += power * (self.data ** (power - 1)) * result.grad
        result._backward = _backward
        return result
```
```python
    def relu(self):
        result = Value(max(0, self.data), (self,), "relu")

        def _backward():
            # d(relu(x))/dx = 1 if x > 0 else 0
            self.grad += (result.data > 0) * result.grad
        result._backward = _backward
        return result
```
```python
    def exp(self):
        result = Value(math.exp(self.data), (self,), "exp")

        def _backward():
            # d(e^x)/dx = e^x
            self.grad += result.data * result.grad
        result._backward = _backward
        return result
```
```python
    # Convenience methods, built from the primitives above
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
    def __truediv__(self, other): return self * other ** -1
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
    def __rsub__(self, other): return other + (-self)
```
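Division needs no backward pass of its own: `a / b` is rewritten as `a * b**-1`, so the existing `__pow__` rule handles d(b⁻¹)/db = -b⁻². A quick standalone finite-difference check that this composition gives the right derivative:

```python
# d(a/b)/db = -a / b**2; verify with a central difference
a, b, h = 6.0, 2.0, 1e-6
numeric = (a / (b + h) - a / (b - h)) / (2 * h)
analytic = -a / b ** 2
print(numeric, analytic)  # both approximately -1.5
```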
The Backward Pass: Topological Sort
To compute gradients correctly, we need to call each node's _backward() in reverse topological order (outputs before inputs):
```python
    def backward(self):
        """Compute gradients for all nodes via reverse-mode autodiff."""
        # Build topological order
        topo = []
        visited = set()

        def build_topo(node):
            if node not in visited:
                visited.add(node)
                for child in node._prev:
                    build_topo(child)
                topo.append(node)
        build_topo(self)

        # Gradient of loss w.r.t. itself = 1
        self.grad = 1.0

        # Propagate gradients in reverse topological order
        for node in reversed(topo):
            node._backward()
```
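Whenever you implement a backward pass, it pays to validate it against finite differences. A generic standalone checker (independent of the `Value` class, using a central difference) applied to the same expression we test below:

```python
def numerical_grad(f, x, h=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: f(x) = (x + 2) * x, analytic df/dx = 2x + 2
f = lambda x: (x + 2) * x
approx = numerical_grad(f, 3.0)
exact = 2 * 3.0 + 2
print(approx, exact)  # both approximately 8.0
```

If your engine's `.grad` disagrees with the numerical estimate by more than a tiny tolerance, the backward function for some operation is wrong.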
Testing It
```python
# Simple example: f(x) = x^2
x = Value(3.0)
y = x ** 2
y.backward()
print(x.grad)  # 6.0 ✓ (dy/dx = 2x at x=3)

# Chain rule: f(x) = (x + 2) * x
x = Value(3.0)
y = (x + Value(2.0)) * x
y.backward()
print(x.grad)  # 8.0 ✓ (y = x^2 + 2x, dy/dx = 2x + 2 = 8 at x=3)

# Verify against PyTorch
import torch
x_torch = torch.tensor(3.0, requires_grad=True)
y_torch = (x_torch + 2) * x_torch
y_torch.backward()
print(x_torch.grad)  # tensor(8.) ✓
```
Building a Neural Network with Our Engine
```python
import random

class Neuron:
    def __init__(self, n_inputs: int, nonlin: bool = True):
        self.weights = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.bias = Value(0.0)
        self.nonlin = nonlin

    def __call__(self, inputs: list[Value]) -> Value:
        # Weighted sum + bias
        activation = sum((w * x for w, x in zip(self.weights, inputs)), self.bias)
        return activation.relu() if self.nonlin else activation

    def parameters(self) -> list[Value]:
        return self.weights + [self.bias]

class Layer:
    def __init__(self, n_inputs: int, n_outputs: int, nonlin: bool = True):
        self.neurons = [Neuron(n_inputs, nonlin) for _ in range(n_outputs)]

    def __call__(self, x: list[Value]) -> list[Value]:
        return [neuron(x) for neuron in self.neurons]

    def parameters(self) -> list[Value]:
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    def __init__(self, n_inputs: int, layer_sizes: list[int]):
        sizes = [n_inputs] + layer_sizes
        # ReLU on hidden layers only; the output layer stays linear so it can
        # produce any real value and its gradient never dies at zero
        self.layers = [
            Layer(sizes[i], sizes[i + 1], nonlin=(i < len(layer_sizes) - 1))
            for i in range(len(sizes) - 1)
        ]

    def __call__(self, x: list[float]) -> Value | list[Value]:
        out = [Value(xi) for xi in x]
        for layer in self.layers:
            out = layer(out)
        return out[0] if len(out) == 1 else out

    def parameters(self) -> list[Value]:
        return [p for layer in self.layers for p in layer.parameters()]

    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.0
```
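A quick sanity check before training: you can predict a network's parameter count from its architecture, since each layer of `n_out` neurons over `n_in` inputs holds `n_out * (n_in + 1)` values (weights plus one bias per neuron). A small standalone helper (hypothetical, mirroring the MLP above):

```python
def mlp_param_count(n_inputs: int, layer_sizes: list[int]) -> int:
    """Count weights and biases for an MLP with the given layer sizes."""
    sizes = [n_inputs] + layer_sizes
    # each neuron: one weight per input, plus a bias
    return sum(sizes[i + 1] * (sizes[i] + 1) for i in range(len(sizes) - 1))

print(mlp_param_count(2, [4, 4, 1]))  # 12 + 20 + 5 = 37
```

Comparing this against `len(model.parameters())` catches wiring mistakes early.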
Training the Network
```python
# XOR problem: not linearly separable
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

model = MLP(n_inputs=2, layer_sizes=[4, 4, 1])
learning_rate = 0.1

for step in range(1000):
    # Forward pass
    predictions = [model(x) for x in X]

    # MSE loss
    loss = sum((pred - target) ** 2 for pred, target in zip(predictions, y))
    loss = loss * (1 / len(y))

    # Backward pass
    model.zero_grad()
    loss.backward()

    # Gradient descent
    for param in model.parameters():
        param.data -= learning_rate * param.grad

    if step % 100 == 0:
        print(f"Step {step}: loss={loss.data:.6f}")

# Test (exact numbers vary with the random initialization)
for x, target in zip(X, y):
    pred = model(x)
    print(f"Input: {x}, Target: {target}, Predicted: {pred.data:.4f}")
# Input: [0, 0], Target: 0, Predicted: 0.0123
# Input: [0, 1], Target: 1, Predicted: 0.9876
# Input: [1, 0], Target: 1, Predicted: 0.9854
# Input: [1, 1], Target: 0, Predicted: 0.0234 ✓
```
How PyTorch Does This at Scale
Our engine computes on scalars. PyTorch operates on tensors (n-dimensional arrays) using the same principles:
- Every tensor operation creates a node in the computation graph
- Each node stores a `grad_fn` (equivalent to our `_backward`)
- `.backward()` performs a topological sort and applies the chain rule at each node
- Gradients flow from the scalar loss back to all leaf tensors with `requires_grad=True`
The math is identical. PyTorch's implementation:
- Operates on batches of data simultaneously (vectorized)
- Runs the computation graph on GPU
- Handles edge cases (in-place operations, non-differentiable ops)
- Uses efficient C++ and CUDA kernels for performance
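To see what a tensor-level backward rule looks like, here is the closed-form backward pass for a linear map y = W x, written in plain Python lists as a standalone sketch (PyTorch applies the same rule inside fused C++/CUDA kernels, one rule per tensor op instead of one graph node per scalar):

```python
def linear_forward(W, x):
    # y[i] = sum_j W[i][j] * x[j]
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def linear_backward(W, x, dy):
    # dL/dW[i][j] = dy[i] * x[j]  (outer product of upstream grad and input)
    dW = [[dy[i] * x[j] for j in range(len(x))] for i in range(len(W))]
    # dL/dx = W^T dy
    dx = [sum(W[i][j] * dy[i] for i in range(len(W))) for j in range(len(x))]
    return dW, dx

W = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, 6.0]
dy = [1.0, 1.0]  # pretend upstream gradient from the loss
dW, dx = linear_backward(W, x, dy)
print(dW)  # [[5.0, 6.0], [5.0, 6.0]]
print(dx)  # [4.0, 6.0]
```

One matrix rule replaces thousands of scalar `_backward` calls, which is where most of the speedup comes from.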
Common Backpropagation Bugs
Vanishing gradients: Gradients shrink as they propagate through many layers.
```python
# Problem: sigmoid activation at every layer
# The sigmoid's gradient is at most 0.25, so through 10 layers: 0.25**10 ≈ 1e-6
# Fix: ReLU (gradient = 1 when active), residual connections (gradient highway)
```
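You can verify the shrinkage directly: the derivative of the sigmoid σ(x) is σ(x)(1 − σ(x)), which peaks at exactly 0.25 at x = 0. Even in this best case, ten stacked layers multiply ten such factors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value 0.25, attained at x = 0

# Best case: every layer sits at its maximum gradient of 0.25
factor = sigmoid_grad(0.0)
print(factor)        # 0.25
print(factor ** 10)  # 9.5367431640625e-07, about one millionth
```

In practice the per-layer factors are usually well below 0.25, so real networks decay even faster.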
Exploding gradients: Gradients grow exponentially.
```python
# Fix: gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
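What `clip_grad_norm_` does under the hood is straightforward: compute the global L2 norm across all gradients and, if it exceeds `max_norm`, scale every gradient down by the same factor so the direction is preserved. A minimal standalone sketch over a flat list of gradient values (hypothetical `grads` input rather than real parameters):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Scale a flat list of gradients so their global L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # ~[0.6, 0.8]: direction preserved, magnitude capped at 1.0
```

Because all gradients share one scale factor, the update direction is unchanged; only the step size is bounded.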
Gradients not flowing to some parameters: A layer is disconnected from the loss.
```python
# Debug: check which parameters have None gradients after backward
for name, param in model.named_parameters():
    if param.grad is None and param.requires_grad:
        print(f"No gradient: {name}")
```
Now that you understand how gradients work, see how they're used at scale in our guide to distributed training.