design pattern · 2024-07-05 · 12 min read

Evaluating Ranking Models: Offline and Online Metrics

A complete guide to evaluating ranking models: offline metrics, online experiments, and how to bridge the gap between them.

Tags: ranking · evaluation metrics · A/B testing · offline evaluation

Introduction

Evaluating ranking models is tricky: offline metrics don't always correlate with online success. This guide covers both kinds of evaluation and how to bridge the gap between them.

Offline Metrics

Precision and Recall @ K

def precision_at_k(relevant, retrieved, k):
    """Fraction of the top-k retrieved items that are relevant."""
    retrieved_k = retrieved[:k]
    relevant_retrieved = len(set(retrieved_k) & set(relevant))
    return relevant_retrieved / k

def recall_at_k(relevant, retrieved, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:  # guard: no relevant items means recall is undefined
        return 0.0
    retrieved_k = retrieved[:k]
    relevant_retrieved = len(set(retrieved_k) & set(relevant))
    return relevant_retrieved / len(relevant)
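
For example, with a toy query where items 1 and 3 are relevant:

relevant = [1, 3]
retrieved = [3, 2, 1, 5]
print(precision_at_k(relevant, retrieved, 2))  # 0.5: only item 3 in the top 2 is relevant
print(recall_at_k(relevant, retrieved, 2))     # 0.5: 1 of the 2 relevant items is in the top 2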

NDCG (Normalized Discounted Cumulative Gain)

import numpy as np

def ndcg_at_k(relevance_scores, k):
    """relevance_scores: graded relevance of each result, in ranked order."""
    # DCG: gain (2^rel - 1) discounted by log2 of the position
    dcg = sum(
        (2**rel - 1) / np.log2(i + 2)
        for i, rel in enumerate(relevance_scores[:k])
    )
    # IDCG: DCG of the ideal (descending-relevance) ordering
    ideal = sorted(relevance_scores, reverse=True)
    idcg = sum(
        (2**rel - 1) / np.log2(i + 2)
        for i, rel in enumerate(ideal[:k])
    )
    return dcg / idcg if idcg > 0 else 0.0
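
For example, with graded relevances (3 = perfect, 0 = irrelevant) for a five-result page, in ranked order:

print(ndcg_at_k([3, 2, 0, 1, 2], 5))  # ~0.97: the ideal ordering would be [3, 2, 2, 1, 0]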

MRR (Mean Reciprocal Rank)

def mrr(queries_results):
    """queries_results: per query, a ranked list of (item, is_relevant) pairs."""
    rr_sum = 0.0
    for results in queries_results:
        # Add the reciprocal rank of the first relevant result (0 if none)
        for i, (_, relevant) in enumerate(results):
            if relevant:
                rr_sum += 1 / (i + 1)
                break
    return rr_sum / len(queries_results)

Online Metrics

Engagement Metrics

  • Click-through rate (CTR): Clicks / Impressions
  • Session duration: Time spent on platform
  • Items consumed: Videos watched, articles read

Business Metrics

  • Conversion rate: Purchases / Sessions
  • Revenue per user: Total revenue / Users
  • Retention: Share of users who return within a fixed window (e.g., day 7)
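
The engagement and business metrics above are all simple ratios over logged events. A minimal sketch of how they might be computed, assuming a hypothetical list of per-session records (the field names are illustrative, not a standard schema):

def online_metrics(sessions):
    """sessions: hypothetical dicts with 'impressions', 'clicks',
    'converted' (bool), 'revenue', and 'user_id' keys."""
    impressions = sum(s["impressions"] for s in sessions)
    clicks = sum(s["clicks"] for s in sessions)
    users = {s["user_id"] for s in sessions}
    return {
        "ctr": clicks / impressions if impressions else 0.0,
        "conversion_rate": sum(s["converted"] for s in sessions) / len(sessions),
        "revenue_per_user": sum(s["revenue"] for s in sessions) / len(users),
    }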

Quality Metrics

  • Diversity: Unique categories in results
  • Novelty: New items surfaced
  • Fairness: Equal exposure across groups
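
Diversity and novelty can be scored directly on a result list. A minimal sketch, assuming a hypothetical item-to-category lookup and a set of items the user has already seen:

def diversity_at_k(results, item_categories, k):
    """Fraction of distinct categories among the top-k results."""
    return len({item_categories[item] for item in results[:k]}) / k

def novelty_at_k(results, seen_items, k):
    """Fraction of the top-k results the user has not interacted with before."""
    return sum(item not in seen_items for item in results[:k]) / k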

The Offline-Online Gap

Why Metrics Don't Align

  1. Position bias: Users click top results more, regardless of relevance
  2. Selection bias: Feedback exists only for items the old policy actually showed
  3. Delayed feedback: Long-term effects (churn, repeat purchases) are missed
  4. Missing context: Offline data rarely captures device, time of day, or user intent

Bridging Strategies

Counterfactual Evaluation

def ips_estimator(new_policy, logged_data):
    """Inverse propensity scoring: estimate the value of new_policy
    from data logged under a different policy."""
    total = 0.0
    for (context, action, reward, propensity) in logged_data:
        # propensity: probability the logging policy chose this action (> 0)
        new_prob = new_policy(context, action)
        # Reweight the logged reward by how much more (or less) likely
        # the new policy is to take the same action
        total += (new_prob / propensity) * reward
    return total / len(logged_data)
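
Plain IPS is unbiased but can have very high variance when logged propensities are small. One standard mitigation, not shown in the snippet above, is the self-normalized IPS estimator, which divides by the sum of importance weights rather than the sample count; a minimal sketch:

def snips_estimator(new_policy, logged_data):
    """Self-normalized IPS: slightly biased, but much lower variance."""
    weighted_reward, weight_sum = 0.0, 0.0
    for (context, action, reward, propensity) in logged_data:
        w = new_policy(context, action) / propensity
        weighted_reward += w * reward
        weight_sum += w
    return weighted_reward / weight_sum if weight_sum else 0.0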

Interleaving

Mix results from the two rankers into a single list and credit clicks to whichever ranker contributed the clicked item:

  • Team-draft interleaving (sketched below)
  • Balanced interleaving
  • Either detects a ranker preference with far less traffic than a full A/B test
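
A minimal sketch of team-draft interleaving over two ranked lists of item IDs; the returned team assignment is what click attribution uses:

import random

def team_draft_interleave(ranking_a, ranking_b, k):
    """Rankers alternately 'draft' their next unused item.
    Returns the interleaved list plus each item's team."""
    interleaved, team = [], {}
    iters = {"A": iter(ranking_a), "B": iter(ranking_b)}
    while len(interleaved) < k:
        # Coin flip decides which ranker drafts first this round
        order = ["A", "B"] if random.random() < 0.5 else ["B", "A"]
        progressed = False
        for name in order:
            # Draft this ranker's highest-ranked item not already picked
            for item in iters[name]:
                if item not in team:
                    interleaved.append(item)
                    team[item] = name
                    progressed = True
                    break
            if len(interleaved) >= k:
                break
        if not progressed:
            break  # both rankings exhausted
    return interleaved, team

Aggregated over many queries, clicks on each team's items form a paired preference test between the two rankers.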

A/B Testing Best Practices

Experiment Design

  • Clear hypothesis stated before launch
  • Primary and guardrail metrics chosen up front
  • Sufficient sample size (see the power-analysis sketch below)
  • Appropriate duration, covering at least one full weekly cycle
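
A minimal power-analysis sketch using the standard two-proportion formula; the baseline CTR and minimum detectable effect below are illustrative:

from scipy.stats import norm

def sample_size_per_arm(p_base, mde, alpha=0.05, power=0.8):
    """Users per arm to detect an absolute lift of `mde` over
    baseline rate `p_base` with a two-sided z-test."""
    p_new = p_base + mde
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return int((z_alpha + z_beta) ** 2 * variance / mde ** 2) + 1

# e.g. detecting a 0.5pp lift on a 10% baseline CTR
print(sample_size_per_arm(0.10, 0.005))  # roughly 58,000 users per arm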

Common Pitfalls

  1. Peeking: Checking results repeatedly and stopping early inflates false positives
  2. Multiple comparisons: Testing many variants or metrics without correction (see below)
  3. Network effects: Users influence each other, breaking independence assumptions
  4. Novelty effects: New != better long-term; early lifts often decay
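
The simplest multiple-comparisons correction is Bonferroni: divide the significance threshold by the number of tests. A minimal sketch with illustrative p-values:

def bonferroni(p_values, alpha=0.05):
    """Flag which tests survive the Bonferroni-corrected threshold."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Three variants tested against control
print(bonferroni([0.04, 0.012, 0.30]))  # [False, True, False] at threshold ~0.0167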

Evaluation Framework

Idea -> Offline Eval -> Interleaving -> A/B Test -> Ship
   |         |              |              |          |
(fast)    (medium)      (quick sig)    (full sig)  (monitor)

Master evaluation in our Recommendation Systems at Scale course.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.