Why ML Production Failures Are Different
In software, most bugs are detectable: crashes, error logs, failing tests. ML failures are often silent. The model keeps running, keeps returning predictions, keeps logging metrics — but has quietly become worse. By the time you notice, you've served bad predictions for weeks.
This guide documents the most common production ML anti-patterns and how to avoid them.
Anti-Pattern 1: Training-Serving Skew
What it is: The preprocessing applied at serving time differs from what was applied during training. The model receives inputs that don't match the distribution it learned from.
Example:
# Training code
mean = X_train["tenure_days"].mean() # = 365.0
std = X_train["tenure_days"].std() # = 180.0
X_train["tenure_days_scaled"] = (X_train["tenure_days"] - mean) / std
# Serving code — different implementation, subtle bug
def preprocess_request(request):
    # Bug: different normalization values hardcoded, or not normalized at all
    return {
        "tenure_days_scaled": request["tenure_days"] / 365  # WRONG
    }
Fix: Serialize the preprocessing logic with the model, not separately.
# Training
import joblib
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier()),
])
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "model.joblib")  # Scaler params included
# Serving — preprocessing is part of the artifact
pipeline = joblib.load("model.joblib")
prediction = pipeline.predict(new_data) # Scaler applied automatically
For deep learning models, save preprocessing params explicitly:
import json

preprocessing_config = {
    "feature_means": X_train.mean().to_dict(),
    "feature_stds": X_train.std().to_dict(),
    "categorical_vocab": {col: list(X_train[col].unique()) for col in cat_cols},
}
with open("preprocessing_config.json", "w") as f:
    json.dump(preprocessing_config, f)
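The serving side then rebuilds the exact training-time transformation from that file. A minimal sketch, assuming the JSON layout above (the `load_preprocessor` helper is hypothetical, not part of any library):

```python
import json

def load_preprocessor(path="preprocessing_config.json"):
    # Rebuild a normalization function from the saved training-time config
    with open(path) as f:
        config = json.load(f)

    def preprocess(request: dict) -> dict:
        out = {}
        for feat, mean in config["feature_means"].items():
            std = config["feature_stds"][feat]
            # Apply the same mean/std fitted on the training data
            out[feat] = (request[feat] - mean) / std
        return out

    return preprocess
```

Because both sides read the same artifact, there is no second implementation of the normalization to drift out of sync.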
Anti-Pattern 2: Label Leakage
What it is: Features used in training contain information about the label that won't be available at inference time.
Classic examples:
- Predicting loan default using "collection_calls_received" (only exists for defaulted loans)
- Predicting disease using a diagnosis code that's set at the same time as the label
- Predicting user churn using activity features computed after the churn date
How to catch it:
# Red flag 1: suspiciously high AUC (>0.99 for a hard problem)
# Red flag 2: feature importance dominated by one unexpected feature
# Red flag 3: model degrades catastrophically on new time period data
# Systematic check: for each feature, ask "would I have this at prediction time?"
def check_temporal_leakage(df, feature_cols, label_col, event_date_col,
                           computed_at_col="feature_computed_at"):
    for col in feature_cols:
        # Flag rows where the feature was computed after the event it predicts
        leaked = df[col].notna() & (df[computed_at_col] > df[event_date_col])
        if leaked.any():
            print(f"WARNING: {col} may have temporal leakage "
                  f"({leaked.sum()} rows computed after {event_date_col})")
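Another probe worth running is an ablation: retrain without the suspect feature and compare AUC. A collapse from near-perfect to mediocre is strong evidence the feature was leaking the label. A sketch (the `ablation_auc_drop` helper and its logistic-regression probe model are illustrative choices, not from the original):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def ablation_auc_drop(X, y, suspect_idx):
    """Train with and without one feature; a large AUC drop suggests leakage."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc_full = roc_auc_score(y_te, full.predict_proba(X_te)[:, 1])

    # Drop the suspect column and retrain
    keep = [i for i in range(X.shape[1]) if i != suspect_idx]
    ablated = LogisticRegression(max_iter=1000).fit(X_tr[:, keep], y_tr)
    auc_abl = roc_auc_score(y_te, ablated.predict_proba(X_te[:, keep])[:, 1])
    return auc_full, auc_abl
```

If `auc_full` is 0.99+ and `auc_abl` falls to near 0.5, the suspect feature is doing all the work — usually because it encodes the label.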
Anti-Pattern 3: Ignoring Class Imbalance
What it is: Training on imbalanced classes without accounting for it, then evaluating with accuracy.
# 99% negative, 1% positive
# A model that always predicts negative achieves 99% accuracy
# But it's completely useless
# Wrong metric for imbalanced data:
accuracy_score(y_test, predictions) # 99% — misleading
# Right metrics:
from sklearn.metrics import average_precision_score, classification_report, roc_auc_score
print(classification_report(y_test, predictions))
# Shows precision, recall, F1 per class
print(f"ROC-AUC: {roc_auc_score(y_test, probabilities):.4f}")
print(f"PR-AUC: {average_precision_score(y_test, probabilities):.4f}")
# PR-AUC is especially informative for severe imbalance
Fixes:
# Class weights (easiest)
model = LogisticRegression(class_weight="balanced")
model = XGBClassifier(scale_pos_weight=(neg_count / pos_count))
# Threshold tuning (don't always use 0.5)
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_val, probabilities)
# Choose the threshold that maximizes F1, or meets your business constraint.
# Note: thresholds has one fewer element than precision/recall, hence [:-1];
# the epsilon guards against division by zero.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
optimal_threshold = thresholds[np.argmax(f1)]
y_pred = (probabilities >= optimal_threshold).astype(int)
Anti-Pattern 4: Evaluation on Shuffled Time-Series Data
What it is: Randomly splitting time-series data into train/test, which allows the model to "see" future data.
# WRONG for time-series data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
# Test set includes data from 6 months ago, train set includes data from today
# RIGHT: temporal split
cutoff_date = "2024-10-01"
train = df[df["date"] < cutoff_date]
test = df[df["date"] >= cutoff_date]
# Even better: include a gap to account for label delay
train = df[df["date"] < "2024-09-01"]
test = df[(df["date"] >= "2024-10-01") & (df["date"] < "2024-12-01")]
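When you need cross-validation rather than a single split, scikit-learn's `TimeSeriesSplit` keeps every fold in temporal order: each fold trains on the past and validates on the immediate future. A minimal sketch on toy data (assuming rows are already sorted by date):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered samples (toy data)
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every test index comes strictly after every train index
    assert train_idx.max() < test_idx.min()
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

Unlike `KFold`, no fold ever validates on data older than its training set, so the cross-validated score reflects how the model will actually be used.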
Anti-Pattern 5: No Monitoring in Production
What it is: Deploying a model and only checking if it's still running, not if it's still accurate.
Models degrade because:
- Input distribution shifts (user behavior changes, new markets)
- Label distribution shifts (concept drift — what "fraud" looks like changes)
- Data pipeline issues (upstream schema changes, nulls, wrong values)
Minimum viable monitoring:
import pandas as pd
from scipy import stats
class ModelMonitor:
    def __init__(self, reference_data: pd.DataFrame):
        # Keep the raw reference data for the KS test, plus summary stats
        self.reference_data = reference_data
        self.reference_stats = {
            col: {
                "mean": reference_data[col].mean(),
                "std": reference_data[col].std(),
                "p25": reference_data[col].quantile(0.25),
                "p75": reference_data[col].quantile(0.75),
            }
            for col in reference_data.select_dtypes("number").columns
        }

    def check_drift(self, current_data: pd.DataFrame, alpha: float = 0.05) -> dict:
        results = {}
        for col in self.reference_stats:
            if col not in current_data.columns:
                results[col] = {"status": "MISSING"}
                continue
            # Kolmogorov-Smirnov test: are the distributions the same?
            stat, p_value = stats.ks_2samp(
                self.reference_data[col].dropna(),
                current_data[col].dropna(),
            )
            results[col] = {
                "drifted": p_value < alpha,
                "p_value": p_value,
                "current_mean": current_data[col].mean(),
                "reference_mean": self.reference_stats[col]["mean"],
            }
        return results
# Alert if more than 20% of features have drifted
monitor = ModelMonitor(training_data)
drift_results = monitor.check_drift(this_weeks_data)
drifted_features = [k for k, v in drift_results.items() if v.get("drifted")]
if len(drifted_features) / len(drift_results) > 0.2:
    send_alert(f"Significant drift detected in: {drifted_features}")
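One caveat with the KS test: at production sample sizes it flags tiny, harmless shifts as significant. A common complement (not used above, and an addition here) is the Population Stability Index, which measures shift magnitude rather than statistical significance. A sketch, with the conventional rule of thumb that below 0.1 is stable and above 0.25 is a significant shift:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between two 1-D numeric samples."""
    # Interior bin edges from the reference distribution's quantiles
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_frac = np.bincount(np.digitize(reference, edges), minlength=bins) / len(reference)
    cur_frac = np.bincount(np.digitize(current, edges), minlength=bins) / len(current)
    # Small floor avoids log(0) for empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```

Because PSI is a magnitude, it stays near zero for two samples from the same distribution no matter how large they are, which makes its alerts easier to act on than raw p-values.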
Anti-Pattern 6: Optimizing the Wrong Metric
What it is: The ML metric improves but the business metric doesn't, because the proxy metric doesn't capture what you actually care about.
Examples:
- Optimizing click-through rate → users click but don't convert (misleading signal)
- Optimizing completion rate → model recommends short, easy content (but not valuable)
- Optimizing for accuracy on balanced test set → real-world imbalanced population performs poorly
Fix: before any modeling, align on the business metric. Define success criteria that go beyond ML metrics:
# Define evaluation protocol that matches deployment
# If you're A/B testing, define exactly what you'll measure
evaluation_criteria = {
    "primary_metric": "7_day_retention_lift",
    "guardrail_metrics": {
        "complaint_rate": {"threshold": "must_not_increase", "tolerance": 0.001},
        "p99_latency_ms": {"threshold": "must_be_below", "value": 200},
    },
    "minimum_sample_size": 10000,
    "minimum_duration_days": 7,
    "success_threshold": 0.02,  # Must improve retention by 2%+ to ship
}
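A criteria dict like this is only useful if something enforces it. One way to sketch that (the `passes_guardrails` helper, the `results` layout, and the `_baseline` naming convention are all assumptions for illustration):

```python
def passes_guardrails(results: dict, criteria: dict) -> bool:
    """Ship/no-ship check: results maps metric names to measured values."""
    # Primary metric must clear the success threshold
    if results[criteria["primary_metric"]] < criteria["success_threshold"]:
        return False
    for metric, rule in criteria["guardrail_metrics"].items():
        value = results[metric]
        if rule["threshold"] == "must_be_below" and value >= rule["value"]:
            return False
        if rule["threshold"] == "must_not_increase":
            # Assumes a "<metric>_baseline" entry holds the control-group value
            if value > results[f"{metric}_baseline"] + rule["tolerance"]:
                return False
    return True
```

Running this in CI against the experiment's measured results makes the ship decision mechanical instead of a judgment call made under deadline pressure.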
Anti-Pattern 7: No Rollback Plan
What it is: Deploying a new model version without the ability to quickly revert.
# Bad: hard-coded model path
model = load_model("models/latest_model.pt")
# Good: versioned model registry with rollback
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Production model is always in the registry with explicit version
model = mlflow.pyfunc.load_model("models:/ChurnModel/Production")

# Rollback: one command (current_version comes from your deployment tooling)
client.transition_model_version_stage(
    name="ChurnModel",
    version=current_version - 1,
    stage="Production",
    archive_existing_versions=True,
)
Keep the previous model version deployed in shadow mode during rollout. If metrics degrade, flip back in minutes.
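The shadow-mode comparison itself can be very simple: log both models' predictions on the same traffic and track how often they disagree. A sketch (the `shadow_compare` helper and the 5% disagreement threshold are illustrative assumptions):

```python
import numpy as np

def shadow_compare(primary_preds, shadow_preds, disagreement_threshold=0.05):
    """Compare live-model and shadow-model predictions on identical traffic.

    The shadow model's outputs are only logged, never served; a high
    disagreement rate blocks promotion pending investigation.
    """
    primary = np.asarray(primary_preds)
    shadow = np.asarray(shadow_preds)
    disagreement = float(np.mean(primary != shadow))
    return {
        "disagreement_rate": disagreement,
        "safe_to_promote": disagreement <= disagreement_threshold,
    }
```

A low disagreement rate doesn't prove the new model is better, but a high one is a cheap early warning that something (features, preprocessing, the model itself) changed more than intended.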
Build production-grade ML systems with our guides to MLOps pipelines and ML system design.