Monitoring, Debugging, and Closing the Loop

How to monitor production systems, detect issues, and continuously improve.

Metrics to Monitor in Production

Revenue Metrics

  • RPM (Revenue Per Mille): Overall revenue per 1000 impressions
  • Revenue per query: Average revenue per user request
  • Fill rate: Percentage of requests that result in served ads
  • eCPM: Effective cost per mille (what advertisers effectively pay per 1000 impressions)
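The revenue metrics above can be sketched as simple ratios over aggregate serving counters. This is a minimal illustration; the counter names (requests, served, impressions, revenue) are assumptions, not a real schema.

```python
# Minimal sketch of revenue-metric computation from raw serving counters.
# Counter names are illustrative; a real system would aggregate these
# from serving logs over a time window.

def revenue_metrics(requests: int, served: int,
                    impressions: int, revenue: float) -> dict:
    """Compute RPM, fill rate, and revenue per query from aggregate counters."""
    return {
        # Revenue per 1000 impressions
        "rpm": 1000.0 * revenue / impressions if impressions else 0.0,
        # Share of ad requests that resulted in at least one served ad
        "fill_rate": served / requests if requests else 0.0,
        # Average revenue per user request
        "revenue_per_query": revenue / requests if requests else 0.0,
    }
```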

User Experience Metrics

  • CTR: Click-through rate (engagement indicator)
  • Ad load: Number of ads per page
  • User satisfaction: Surveys, negative feedback rates
  • Page load time: Impact of ads on page performance

Advertiser Metrics

  • ROAS: Return on ad spend for advertisers
  • Conversion rates: Clicks to conversions
  • Budget delivery: How smoothly budgets are spent
  • Campaign performance: Overall advertiser satisfaction

System Health Metrics

  • Latency: P50, P95, P99 response times
  • Error rates: Failed requests, timeouts
  • Throughput: Requests per second
  • Resource utilization: CPU, memory, network
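The latency percentiles above (P50, P95, P99) can be computed from a window of samples with the nearest-rank method, as a rough sketch. Production systems typically use streaming sketches (e.g. t-digest) rather than sorting raw samples; the sample values here are invented.

```python
# Nearest-rank percentile over a window of latency samples (milliseconds).
# A sketch only: real monitoring pipelines use streaming approximations.
import math

def percentile(samples: list[float], p: float) -> float:
    """Smallest sample such that at least p% of samples are <= it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies = [12.0, 15.0, 11.0, 90.0, 14.0, 13.0, 250.0, 16.0, 12.0, 18.0]
p50 = percentile(latencies, 50)   # typical request
p99 = percentile(latencies, 99)   # tail latency, dominated by outliers
```

Note how a single slow outlier dominates P99 while leaving P50 untouched, which is why tail percentiles are monitored separately.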

Detecting Model Degradation and Drift

Model Degradation

Performance decline over time:

  • Accuracy: Predictions become less accurate
  • Calibration: Probabilities drift from actual rates
  • Revenue impact: System generates less revenue
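Calibration drift in particular is cheap to monitor: compare the mean predicted probability against the observed rate. The sketch below assumes logged (predicted CTR, clicked) pairs; the 10% tolerance is an illustrative threshold, not a standard.

```python
# Minimal calibration check over logged predictions and outcomes.
# A ratio near 1.0 means the model's average predicted CTR matches
# the observed click rate; drift away from 1.0 signals miscalibration.

def calibration_ratio(predicted: list[float], clicked: list[int]) -> float:
    """Mean predicted CTR divided by observed CTR (~1.0 = well calibrated)."""
    observed = sum(clicked) / len(clicked)
    mean_pred = sum(predicted) / len(predicted)
    return mean_pred / observed

def is_miscalibrated(ratio: float, tolerance: float = 0.1) -> bool:
    """Flag when predictions are more than `tolerance` away from observed rates."""
    return abs(ratio - 1.0) > tolerance
```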

Drift Detection

Data Drift

  • Feature distributions: User behavior changes
  • Ad inventory: New ads, new advertisers
  • Market conditions: Economic changes affect behavior

Concept Drift

  • CTR patterns: User clicking behavior changes
  • Conversion patterns: What drives conversions shifts
  • Quality signals: Relevance standards evolve

Detection Methods

  • Statistical tests: Compare current vs. historical distributions
  • Model performance: Track accuracy on holdout data
  • A/B testing: Compare new models to current
  • Anomaly detection: Identify unusual patterns
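One common statistical test for the "compare current vs. historical distributions" method is the Population Stability Index (PSI) over binned feature values. This is a sketch; the conventional alert threshold of 0.2 is a rule of thumb, not a universal constant.

```python
# Population Stability Index between a historical (expected) and a
# current (actual) feature distribution, given pre-binned counts.
# PSI > 0.2 is often treated as significant drift; this threshold
# is a common heuristic, not a guarantee.
import math

def psi(expected: list[int], actual: list[int], eps: float = 1e-6) -> float:
    """PSI over aligned histogram bins; 0.0 means identical distributions."""
    e_total, a_total = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        e_pct = max(e / e_total, eps)  # clamp to avoid log(0)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score
```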

Diagnosing Revenue Drops: Model, Market, or Bug?

Model Issues

  • Stale models: Not retrained with recent data
  • Overfitting: Model doesn't generalize
  • Feature bugs: Incorrect feature computation
  • Calibration drift: Predictions no longer calibrated

Market Changes

  • Advertiser behavior: Bids change, budgets shift
  • User behavior: Clicking patterns change
  • Competition: New platforms, market saturation
  • Seasonality: Expected patterns (holidays, events)

Bugs

  • Code bugs: Logic errors in serving pipeline
  • Data bugs: Incorrect data in features or logs
  • Infrastructure bugs: System failures, network issues
  • Configuration bugs: Wrong settings, thresholds

Diagnosis Process

  1. Check system health: Is infrastructure working?
  2. Review recent changes: What was deployed recently?
  3. Analyze metrics: Which metrics changed and when?
  4. Compare segments: Is issue global or specific?
  5. Trace examples: Follow specific requests through system
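Step 4 of the process above can be sketched as a per-segment comparison: if only some segments regressed against baseline, the issue is likely localized (a feature or targeting bug) rather than global (a market shift). The segment names and the 10% drop threshold here are illustrative.

```python
# Compare a metric (e.g. RPM) per segment against a baseline window
# to decide whether a revenue drop is global or segment-specific.

def regressed_segments(baseline: dict[str, float],
                       current: dict[str, float],
                       drop_threshold: float = 0.10) -> list[str]:
    """Return segments whose metric dropped by more than drop_threshold."""
    return [
        seg for seg, base in baseline.items()
        if base > 0 and (base - current.get(seg, 0.0)) / base > drop_threshold
    ]
```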

Tracing a Bad Ad Through the System

The Problem

An ad that shouldn't have been shown (low quality, wrong targeting, etc.) was served. Why?

Tracing Steps

  1. Retrieval: Was ad in candidate set? Why?
  2. Filtering: Did it pass all filters? Should it have?
  3. Prediction: What were model predictions? Were they correct?
  4. Ranking: What was the score? Why did it rank high?
  5. Auction: Did it win fairly? Was price correct?
  6. Serving: Was correct ad served? Any last-minute changes?

Tools Needed

  • Request IDs: Track single request through entire pipeline
  • Distributed tracing: See all service calls for a request
  • Feature logs: See exact features used in predictions
  • Decision logs: See all filtering and ranking decisions
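The decision logs above can be sketched as structured records keyed by request ID, which is what makes the six tracing steps answerable after the fact. The field names are a hypothetical schema, not a real logging format.

```python
# Hypothetical decision log keyed by request ID: every pipeline stage
# (retrieval, filtering, prediction, ranking, auction, serving) appends
# one record, so a single request can be replayed end to end.
import json

def log_decision(log: list[str], request_id: str, stage: str,
                 ad_id: str, outcome: str, **details) -> None:
    """Append one pipeline decision as a JSON line."""
    log.append(json.dumps({"request_id": request_id, "stage": stage,
                           "ad_id": ad_id, "outcome": outcome,
                           "details": details}))

def trace(log: list[str], request_id: str) -> list[dict]:
    """Reconstruct every decision made for one request, in order."""
    return [r for r in map(json.loads, log) if r["request_id"] == request_id]
```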

Case Studies: Real Production Incidents

Case Study 1: Model Calibration Drift

Symptom: Revenue dropped 5% over 2 weeks
Investigation: Found CTR predictions were overconfident
Root cause: Model not retrained, user behavior shifted
Fix: Retrained model with recent data, improved calibration
Prevention: Automated retraining pipeline, calibration monitoring

Case Study 2: Feature Bug

Symptom: Certain user segments had unusually low CTR
Investigation: Traced to user feature computation
Root cause: Bug in feature engineering pipeline
Fix: Corrected feature computation, backfilled historical data
Prevention: Feature validation tests, monitoring feature distributions

Case Study 3: Auction Mechanism Issue

Symptom: Fill rate dropped, many auctions had no winners
Investigation: Found reserve prices too high
Root cause: Recent change to reserve price algorithm
Fix: Rolled back change, fixed algorithm
Prevention: Gradual rollouts, A/B testing for revenue changes

These case studies illustrate the importance of comprehensive monitoring and debugging capabilities.
