Why ML Production Failures Are Different
In software, most bugs are detectable: crashes, error logs, failing tests. ML failures are often silent. The model keeps running, keeps returning predictions, keeps logging metrics — but has quietly become worse. By the time you notice, you've served bad predictions for weeks.
This guide documents the most common production ML anti-patterns and how to avoid them.
Anti-Pattern 1: Training-Serving Skew
What it is: The preprocessing applied at serving time differs from what was applied during training. The model receives inputs that don't match the distribution it learned from.
Example:
# Training code
mean = X_train["tenure_days"].mean() # = 365.0
std = X_train["tenure_days"].std() # = 180.0
X_train["tenure_days_scaled"] = (X_train["tenure_days"] - mean) / std
# Serving code — different implementation, subtle bug
def preprocess_request(request):
    # Bug: different normalization values hardcoded, or not normalized at all
    return {
        "tenure_days_scaled": request["tenure_days"] / 365  # WRONG
    }
Fix: Serialize the preprocessing logic with the model, not separately.
# Training
import joblib
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier()),
])
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "model.joblib")  # Scaler params included
# Serving — preprocessing is part of the artifact
pipeline = joblib.load("model.joblib")
prediction = pipeline.predict(new_data) # Scaler applied automatically
For deep learning models, save preprocessing params explicitly:
import json

preprocessing_config = {
    "feature_means": X_train.mean().to_dict(),
    "feature_stds": X_train.std().to_dict(),
    "categorical_vocab": {col: list(X_train[col].unique()) for col in cat_cols},
}
with open("preprocessing_config.json", "w") as f:
    json.dump(preprocessing_config, f)
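The serving side then rebuilds the exact training-time transformation from that file. A minimal sketch, assuming the JSON layout above (the `load_preprocessor` helper is hypothetical, not part of any library):

```python
import json

def load_preprocessor(path="preprocessing_config.json"):
    # Rebuild a normalization function from the saved training-time config
    with open(path) as f:
        config = json.load(f)

    def preprocess(request: dict) -> dict:
        out = {}
        for feat, mean in config["feature_means"].items():
            std = config["feature_stds"][feat]
            # Apply the same mean/std fitted on the training data
            out[feat] = (request[feat] - mean) / std
        return out

    return preprocess
```

Because both sides read the same artifact, there is no second implementation of the normalization to drift out of sync.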
Anti-Pattern 2: Label Leakage
What it is: Features used in training contain information about the label that won't be available at inference time.
Classic examples:
- Predicting loan default using "collection_calls_received" (only exists for defaulted loans)
- Predicting disease using a diagnosis code that's set at the same time as the label
- Predicting user churn using activity features computed after the churn date
How to catch it:
# Red flag 1: suspiciously high AUC (>0.99 for a hard problem)
# Red flag 2: feature importance dominated by one unexpected feature
# Red flag 3: model degrades catastrophically on new time period data
# Systematic check: for each feature, ask "would I have this at prediction time?"
def check_temporal_leakage(df, feature_cols, label_col, event_date_col,
                           computed_at_col="feature_computed_at"):
    for col in feature_cols:
        # Flag rows where the feature was computed after the event it predicts
        leaked = df[col].notna() & (df[computed_at_col] > df[event_date_col])
        if leaked.any():
            print(f"WARNING: {col} may have temporal leakage "
                  f"({leaked.sum()} rows computed after {event_date_col})")
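Another probe worth running is an ablation: retrain without the suspect feature and compare AUC. A collapse from near-perfect to mediocre is strong evidence the feature was leaking the label. A sketch (the `ablation_auc_drop` helper and its logistic-regression probe model are illustrative choices, not from the original):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def ablation_auc_drop(X, y, suspect_idx):
    """Train with and without one feature; a large AUC drop suggests leakage."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc_full = roc_auc_score(y_te, full.predict_proba(X_te)[:, 1])

    # Drop the suspect column and retrain
    keep = [i for i in range(X.shape[1]) if i != suspect_idx]
    ablated = LogisticRegression(max_iter=1000).fit(X_tr[:, keep], y_tr)
    auc_abl = roc_auc_score(y_te, ablated.predict_proba(X_te[:, keep])[:, 1])
    return auc_full, auc_abl
```

If `auc_full` is 0.99+ and `auc_abl` falls to near 0.5, the suspect feature is doing all the work — usually because it encodes the label.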
Anti-Pattern 3: Ignoring Class Imbalance
What it is: Training on imbalanced classes without accounting for it, then evaluating with accuracy.
# 99% negative, 1% positive
# A model that always predicts negative achieves 99% accuracy
# But it's completely useless
# Wrong metric for imbalanced data:
accuracy_score(y_test, predictions) # 99% — misleading
# Right metrics:
from sklearn.metrics import average_precision_score, classification_report, roc_auc_score
print(classification_report(y_test, predictions))
# Shows precision, recall, F1 per class
print(f"ROC-AUC: {roc_auc_score(y_test, probabilities):.4f}")
print(f"PR-AUC: {average_precision_score(y_test, probabilities):.4f}")
# PR-AUC is especially informative for severe imbalance
Fixes:
# Class weights (easiest)
model = LogisticRegression(class_weight="balanced")
model = XGBClassifier(scale_pos_weight=(neg_count / pos_count))
# Threshold tuning (don't always use 0.5)
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_val, probabilities)
# Choose the threshold that maximizes F1, or meets your business constraint.
# Note: thresholds has one fewer element than precision/recall, hence [:-1];
# the epsilon guards against division by zero.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
optimal_threshold = thresholds[np.argmax(f1)]
y_pred = (probabilities >= optimal_threshold).astype(int)
Anti-Pattern 4: Evaluation on Shuffled Time-Series Data
What it is: Randomly splitting time-series data into train/test, which allows the model to "see" future data.
# WRONG for time-series data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
# Test set includes data from 6 months ago, train set includes data from today
# RIGHT: temporal split
cutoff_date = "2024-10-01"
train = df[df["date"] < cutoff_date]
test = df[df["date"] >= cutoff_date]
# Even better: include a gap to account for label delay
train = df[df["date"] < "2024-09-01"]
test = df[(df["date"] >= "2024-10-01") & (df["date"] < "2024-12-01")]
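When you need cross-validation rather than a single split, scikit-learn's `TimeSeriesSplit` keeps every fold in temporal order: each fold trains on the past and validates on the immediate future. A minimal sketch on toy data (assuming rows are already sorted by date):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered samples (toy data)
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every test index comes strictly after every train index
    assert train_idx.max() < test_idx.min()
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

Unlike `KFold`, no fold ever validates on data older than its training set, so the cross-validated score reflects how the model will actually be used.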
Anti-Pattern 5: No Monitoring in Production
What it is: Deploying a model and only checking if it's still running, not if it's still accurate.
Models degrade because:
- Input distribution shifts (user behavior changes, new markets)
- Label distribution shifts (concept drift — what "fraud" looks like changes)
- Data pipeline issues (upstream schema changes, nulls, wrong values)
Minimum viable monitoring:
import pandas as pd
from scipy import stats
class ModelMonitor:
    def __init__(self, reference_data: pd.DataFrame):
        # Keep the raw reference data for the KS test, plus summary stats
        self.reference_data = reference_data
        self.reference_stats = {
            col: {
                "mean": reference_data[col].mean(),
                "std": reference_data[col].std(),
                "p25": reference_data[col].quantile(0.25),
                "p75": reference_data[col].quantile(0.75),
            }
            for col in reference_data.select_dtypes("number").columns
        }

    def check_drift(self, current_data: pd.DataFrame, alpha: float = 0.05) -> dict:
        results = {}
        for col in self.reference_stats:
            if col not in current_data.columns:
                results[col] = {"status": "MISSING"}
                continue
            # Kolmogorov-Smirnov test: are the distributions the same?
            stat, p_value = stats.ks_2samp(
                self.reference_data[col].dropna(),
                current_data[col].dropna(),
            )
            results[col] = {
                "drifted": p_value < alpha,
                "p_value": p_value,
                "current_mean": current_data[col].mean(),
                "reference_mean": self.reference_stats[col]["mean"],
            }
        return results
# Alert if more than 20% of features have drifted
monitor = ModelMonitor(training_data)
drift_results = monitor.check_drift(this_weeks_data)
drifted_features = [k for k, v in drift_results.items() if v.get("drifted")]
if len(drifted_features) / len(drift_results) > 0.2:
    send_alert(f"Significant drift detected in: {drifted_features}")
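One caveat with the KS test: at production sample sizes it flags tiny, harmless shifts as significant. A common complement (not used above, and an addition here) is the Population Stability Index, which measures shift magnitude rather than statistical significance. A sketch, with the conventional rule of thumb that below 0.1 is stable and above 0.25 is a significant shift:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between two 1-D numeric samples."""
    # Interior bin edges from the reference distribution's quantiles
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_frac = np.bincount(np.digitize(reference, edges), minlength=bins) / len(reference)
    cur_frac = np.bincount(np.digitize(current, edges), minlength=bins) / len(current)
    # Small floor avoids log(0) for empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```

Because PSI is a magnitude, it stays near zero for two samples from the same distribution no matter how large they are, which makes its alerts easier to act on than raw p-values.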
Anti-Pattern 6: Optimizing the Wrong Metric
What it is: The ML metric improves but the business metric doesn't, because the proxy metric doesn't capture what you actually care about.
Examples:
- Optimizing click-through rate → users click but don't convert (misleading signal)
- Optimizing completion rate → model recommends short, easy content (but not valuable)
- Optimizing for accuracy on balanced test set → real-world imbalanced population performs poorly
Fix: before any modeling, align on the business metric. Define success criteria that go beyond ML metrics:
# Define evaluation protocol that matches deployment
# If you're A/B testing, define exactly what you'll measure
evaluation_criteria = {
    "primary_metric": "7_day_retention_lift",
    "guardrail_metrics": {
        "complaint_rate": {"threshold": "must_not_increase", "tolerance": 0.001},
        "p99_latency_ms": {"threshold": "must_be_below", "value": 200},
    },
    "minimum_sample_size": 10000,
    "minimum_duration_days": 7,
    "success_threshold": 0.02,  # Must improve retention by 2%+ to ship
}
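A criteria dict like this is only useful if something enforces it. One way to sketch that (the `passes_guardrails` helper, the `results` layout, and the `_baseline` naming convention are all assumptions for illustration):

```python
def passes_guardrails(results: dict, criteria: dict) -> bool:
    """Ship/no-ship check: results maps metric names to measured values."""
    # Primary metric must clear the success threshold
    if results[criteria["primary_metric"]] < criteria["success_threshold"]:
        return False
    for metric, rule in criteria["guardrail_metrics"].items():
        value = results[metric]
        if rule["threshold"] == "must_be_below" and value >= rule["value"]:
            return False
        if rule["threshold"] == "must_not_increase":
            # Assumes a "<metric>_baseline" entry holds the control-group value
            if value > results[f"{metric}_baseline"] + rule["tolerance"]:
                return False
    return True
```

Running this in CI against the experiment's measured results makes the ship decision mechanical instead of a judgment call made under deadline pressure.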
Anti-Pattern 7: No Rollback Plan
What it is: Deploying a new model version without the ability to quickly revert.
# Bad: hard-coded model path
model = load_model("models/latest_model.pt")
# Good: versioned model registry with rollback
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Production model is always in the registry with explicit version
model = mlflow.pyfunc.load_model("models:/ChurnModel/Production")

# Rollback: one command (current_version comes from your deployment tooling)
client.transition_model_version_stage(
    name="ChurnModel",
    version=current_version - 1,
    stage="Production",
    archive_existing_versions=True,
)
Keep the previous model version deployed in shadow mode during rollout. If metrics degrade, flip back in minutes.
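The shadow-mode comparison itself can be very simple: log both models' predictions on the same traffic and track how often they disagree. A sketch (the `shadow_compare` helper and the 5% disagreement threshold are illustrative assumptions):

```python
import numpy as np

def shadow_compare(primary_preds, shadow_preds, disagreement_threshold=0.05):
    """Compare live-model and shadow-model predictions on identical traffic.

    The shadow model's outputs are only logged, never served; a high
    disagreement rate blocks promotion pending investigation.
    """
    primary = np.asarray(primary_preds)
    shadow = np.asarray(shadow_preds)
    disagreement = float(np.mean(primary != shadow))
    return {
        "disagreement_rate": disagreement,
        "safe_to_promote": disagreement <= disagreement_threshold,
    }
```

A low disagreement rate doesn't prove the new model is better, but a high one is a cheap early warning that something (features, preprocessing, the model itself) changed more than intended.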
Build production-grade ML systems with our guides to MLOps pipelines and ML system design.