career 2025-03-17 11 min read

Why ML Debugging is Different (And How to Do It)

ML bugs fail silently, produce wrong outputs without errors, and are often caused by data, not code. Learn systematic ML debugging strategies for engineers coming from software development.


The Silent Failure Problem

In software engineering, bugs usually announce themselves. A null pointer dereference crashes the program. A type error fails at compile time. An off-by-one error produces wrong output you can often directly observe.

ML bugs are quieter. Your training loop runs to completion. The loss decreases. The model produces predictions. And yet it's fundamentally broken — the model learned the wrong thing, or nothing at all.

This guide is about developing the debugging instincts for this new failure mode.

The Bug Taxonomy

ML bugs fall into roughly five categories:

1. Data bugs (most common): Wrong labels, leakage, wrong preprocessing, distribution mismatch between train and test.

2. Implementation bugs: Gradients not flowing, wrong loss function, incorrect masking, shape errors.

3. Optimization bugs: Learning rate too high or too low, insufficient model capacity, poor architecture choice.

4. Evaluation bugs: Metric computed incorrectly, test set contaminated, wrong baseline comparison.

5. Deployment bugs: Preprocessing at inference differs from training, input distribution shifted.

Debugging Strategy

1. Establish a Sanity Check Baseline

Before debugging a complex model, verify the framework works:

# Can the model overfit a tiny dataset?
# If not, something is fundamentally broken.
from sklearn.metrics import accuracy_score

X_tiny = X_train[:10]
y_tiny = y_train[:10]

model.fit(X_tiny, y_tiny)
train_acc = accuracy_score(y_tiny, model.predict(X_tiny))

# Should be very close to 1.0 for most models
print(f"Tiny dataset train accuracy: {train_acc:.4f}")
assert train_acc > 0.99, "Model can't overfit 10 samples — implementation bug"

If a model can't memorize 10 training samples, something is broken at a fundamental level — an implementation bug, not a data problem. Passing this check lets you rule out the model code before you spend hours collecting more data or tuning hyperparameters.

2. Verify Gradient Flow

# Check gradients are flowing to all parameters
model.zero_grad()
loss = compute_loss(model, batch)
loss.backward()

for name, param in model.named_parameters():
    if param.grad is None:
        print(f"WARNING: No gradient for {name}")
    elif param.grad.abs().max() < 1e-7:
        print(f"WARNING: Near-zero gradient for {name} — possible vanishing gradient")
    elif param.grad.abs().max() > 1000:
        print(f"WARNING: Exploding gradient for {name}: {param.grad.abs().max():.2f}")

3. Visualize the Loss Curve

The shape of the loss curve tells you a lot:

Loss
 │
 │\ <-- Good: loss decreasing smoothly
 │ \
 │  \____
 └──────────── Epochs

Loss
 │
 │~~ <-- Bad: loss oscillating — learning rate too high
 │~~~
 └──────────── Epochs

Loss
 │
 │────────── <-- Bad: loss not moving — gradient not flowing or lr too low
 └──────────── Epochs

Loss
 │
 │\   /\ <-- Bad: overfitting — train loss down, val loss up
 │ \ /  \
 │  V    \___
 └──────────── Epochs
Plotting both loss and metric curves makes these shapes easy to spot:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(train_losses, label="Train")
plt.plot(val_losses, label="Val")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.title("Loss Curves")

plt.subplot(1, 2, 2)
plt.plot(train_metrics, label="Train")
plt.plot(val_metrics, label="Val")
plt.xlabel("Epoch")
plt.ylabel("Metric")
plt.legend()
plt.title("Metric Curves")

plt.tight_layout()
plt.show()

4. Inspect Predictions

Don't just look at aggregate metrics. Look at individual predictions:

# What does the model predict?
predictions = model.predict_proba(X_val)[:, 1]
print(f"Prediction range: [{predictions.min():.3f}, {predictions.max():.3f}]")
print(f"Prediction mean: {predictions.mean():.3f}")

# If all predictions cluster around 0.5 — model learned nothing
# If all predictions are near 0 or 1 — check class imbalance

# Inspect the worst mistakes
import numpy as np

errors = np.abs(predictions - y_val)
worst_idx = errors.argsort()[-20:]  # 20 largest errors

for idx in worst_idx:
    print(f"True: {y_val[idx]}, Pred: {predictions[idx]:.3f}")
    print(f"Features: {X_val.iloc[idx]}")
    print()

5. Check for Data Leakage

Data leakage is the sneakiest bug in ML. The model achieves suspiciously high performance because test information leaked into training.

# Red flags:
# - Train and val accuracy are both suspiciously high
# - Model dramatically outperforms published baselines
# - Performance drops sharply on truly new data

# Common leakage sources:
# 1. Fitting scalers/encoders on entire dataset before split
bad_code = """
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # BUG: uses test data statistics
X_train, X_test = train_test_split(X_scaled, y)
"""

good_code = """
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit ONLY on train
X_test_scaled = scaler.transform(X_test)         # Transform test separately
"""

# 2. Target encoding before split
# 3. Using future data to predict past events (time series)
# 4. Features that contain the label (e.g., "diagnosis_code" in a disease prediction task)
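Leakage source #1 can be prevented structurally by putting the scaler inside a pipeline, so it is refit on the training portion of every split. A minimal sketch using scikit-learn's Pipeline with synthetic data (the dataset here is made up for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: label depends on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # refit inside each CV fold — never sees held-out data
    ("clf", LogisticRegression()),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because the scaler is a pipeline step, `cross_val_score` fits it only on each fold's training data, making the leaky pattern in `bad_code` impossible to write by accident.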

6. Validate Input/Output Shapes

import torch.nn as nn

class DebugModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical layers — substitute your own
        self.layer1 = nn.Linear(32, 16)
        self.layer2 = nn.Linear(16, 2)

    def forward(self, x):
        print(f"Input shape: {x.shape}")

        x = self.layer1(x)
        print(f"After layer1: {x.shape}")

        x = self.layer2(x)
        print(f"After layer2: {x.shape}")

        return x

This is the ML equivalent of console.log debugging. Ugly, but often the fastest way to find shape mismatches.
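If editing forward() feels too invasive, PyTorch forward hooks can log shapes without touching the model code. A minimal sketch, assuming a small stand-in nn.Sequential model:

```python
import torch
import torch.nn as nn

# Hypothetical model for demonstration — substitute your own
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))

shapes = []

def log_shape(module, inputs, output):
    # Record each module's output shape as data flows through
    shapes.append((module.__class__.__name__, tuple(output.shape)))

handles = [m.register_forward_hook(log_shape) for m in model]

model(torch.randn(4, 16))
for name, shape in shapes:
    print(f"{name}: {shape}")

# Remove hooks once done debugging
for h in handles:
    h.remove()
```

Hooks compose nicely: the same function can check for NaNs or log activation statistics, and removing the handles restores the model untouched.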

Common Bug Patterns and Fixes

Accuracy stuck at chance level (about 50% for balanced binary, 33% for balanced 3-class)

  • Check if labels are correct: y_train.value_counts()
  • Check if features are being passed correctly
  • Try predicting on training data — if train accuracy is also at chance, it's an implementation bug
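The third bullet can be made concrete by comparing train accuracy against a majority-class baseline. A sketch with synthetic data and scikit-learn's DummyClassifier (the dataset and model here are stand-ins for your own):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: label is the sign of the first feature
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Chance level = accuracy of always predicting the majority class
chance = DummyClassifier(strategy="most_frequent").fit(X_train, y_train).score(X_train, y_train)
train_acc = model.score(X_train, y_train)

print(f"chance: {chance:.3f}, train accuracy: {train_acc:.3f}")
# Train accuracy near chance points to an implementation bug, not a data problem
```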

Loss is NaN

# Debug: find where NaN is introduced
import torch

for name, param in model.named_parameters():
    if torch.isnan(param).any():
        print(f"NaN in {name}")

# Fixes: gradient clipping, lower learning rate, check for log(0) in loss
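Of those fixes, gradient clipping is the most mechanical to apply. A minimal sketch of clipping before the optimizer step, using torch.nn.utils.clip_grad_norm_ on a toy model (the model and data are placeholders):

```python
import torch
import torch.nn as nn

# Toy model and data — substitute your own training loop pieces
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()

# Rescale gradients so their global norm is at most 1.0
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()

grad_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
print(f"pre-clip norm: {total_norm:.3f}, post-clip norm: {grad_norm:.3f}")
```

The returned pre-clip norm is worth logging every step — a sudden spike often precedes the first NaN by a few batches.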

Model predicts only one class

# Check class imbalance
print(y_train.value_counts(normalize=True))

# Fixes: class weights, oversampling, threshold tuning
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight="balanced")

High train accuracy, low val accuracy (overfitting)

# Fixes: dropout, regularization, early stopping, more data
from tensorflow.keras.callbacks import EarlyStopping  # Keras example

early_stopping = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

The Mental Model Shift

Software debugging: "Find the line of code that's wrong."

ML debugging: "Find the incorrect assumption — about the data, the model, or the evaluation."

Before writing any code, always ask: what would need to be true for this model to work? Then systematically verify each assumption.
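One way to make that verification systematic is a small checklist function run before any training. A sketch of a few such checks — the function name and the specific checks are illustrative, not exhaustive:

```python
import numpy as np

def sanity_check_data(X_train, y_train, X_test, y_test):
    """Verify common data assumptions before blaming the model.

    A minimal sketch — extend with checks specific to your pipeline.
    """
    problems = []
    if np.isnan(X_train).any() or np.isnan(X_test).any():
        problems.append("NaNs in features")
    if len(np.unique(y_train)) < 2:
        problems.append("training labels contain a single class")
    # Exact duplicate rows shared between train and test suggest contamination
    train_rows = {row.tobytes() for row in np.asarray(X_train)}
    overlap = sum(row.tobytes() in train_rows for row in np.asarray(X_test))
    if overlap > 0:
        problems.append(f"{overlap} test row(s) also appear in train")
    return problems

# Tiny illustrative example: one test row leaks from train
X_train = np.array([[1.0, 2.0], [3.0, 4.0]])
y_train = np.array([0, 1])
X_test = np.array([[1.0, 2.0], [5.0, 6.0]])
y_test = np.array([0, 1])

print(sanity_check_data(X_train, y_train, X_test, y_test))
```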


For production ML debugging, see our guide to model monitoring and drift detection.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.