career 2025-03-23 16 min read

ML System Design Interview Guide for Software Engineers

Ace ML system design interviews. Learn the framework for designing recommendation systems, search ranking, fraud detection, and other ML systems with a structured, engineering-first approach.


What ML System Design Interviews Test

Unlike algorithm interviews (which test problem-solving), ML system design interviews test your ability to:

  1. Translate a vague business problem into a tractable ML problem
  2. Design the data, training, and serving infrastructure
  3. Identify failure modes and mitigations
  4. Make principled trade-offs under constraints

This guide gives you a reusable framework and applies it to three common problem types.

The Framework (6 Steps)

Step 1: Clarify the Problem

Before touching ML, understand the business:

  • What's the exact prediction task?
  • What's the success metric (business) and proxy metric (ML)?
  • What are the latency and throughput requirements?
  • How much data do we have? What labels exist?
  • What's the cost of false positives vs false negatives?

Example: "Design a spam filter"

  • Task: binary classification — spam or not spam
  • Business metric: user satisfaction, reduction in spam complaints
  • ML metric: precision/recall (precision matters more — false positives anger users)
  • Latency: <100ms (blocks email delivery)
  • Labels: historical spam reports from users
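To make the precision/recall trade-off concrete, here is a minimal sketch (pure Python, with toy label arrays invented for illustration) of how the two metrics are computed for the spam task:

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: one legit email flagged as spam (a false positive)
# drags precision down — exactly the error that angers users.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]
p, r = precision_recall(y_true, y_pred)  # p = 2/3, r = 2/3
```

With these labels, one false positive and one false negative each pull their metric to 2/3; moving the decision threshold trades one for the other.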

Step 2: Frame as an ML Problem

Map the business task to a specific ML task:

Business Problem           | ML Formulation
---------------------------|------------------------------------------
Recommend content          | Learn user-item affinity, retrieve top-K
Detect fraud               | Binary classification on transactions
Rank search results        | Learn-to-rank (pairwise or listwise)
Predict churn              | Binary classification on user features
Estimate delivery time     | Regression on order + context features
Extract entities from text | Sequence labeling (NER)

Step 3: Data

  • What data exists? (events, logs, explicit feedback, third-party)
  • How is it labeled? (explicit ratings, implicit clicks, manual annotation)
  • What's the data volume and velocity?
  • What are the data quality issues?

Step 4: Features

  • What raw signals are available?
  • How do you represent entities (users, items, text)?
  • What are the most important features based on domain knowledge?
  • How do you handle real-time vs. precomputed features?

Step 5: Model

  • What's the training objective?
  • What model architecture fits the data/constraints?
  • How do you train at scale?
  • How do you handle cold start?

Step 6: Serving, Evaluation, Monitoring

  • What's the serving architecture? (batch vs. online, latency budget)
  • How do you A/B test?
  • What metrics do you monitor?
  • When do you retrain?
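One concrete monitoring check is distribution drift on input features. As a sketch (a hand-rolled Population Stability Index; the samples and the 0.2 rule of thumb are illustrative, not mandated by the framework), compare a feature's live serving distribution against its training distribution:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample (e.g. training
    data) and a live sample of the same feature.
    Rule of thumb: PSI > 0.2 suggests drift worth investigating."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def frac(sample, i):
        left = lo + i * width
        right = lo + (i + 1) * width
        n = sum(1 for x in sample
                if left <= x < right or (i == bins - 1 and x == hi))
        return max(n / len(sample), 1e-6)  # floor avoids log(0)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

train_sample = [0.1 * i for i in range(100)]   # reference distribution
shifted = [x + 5.0 for x in train_sample]      # live data has drifted

no_drift = psi(train_sample, train_sample)     # identical samples → ≈ 0
drift = psi(train_sample, shifted)             # shifted sample → large PSI
```

A check like this runs on a schedule per feature; sustained high PSI on important features is a common retraining trigger.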

Applied Example 1: News Feed Ranking

Problem: Design a feed ranking system for a social network.

Clarify:

  • Rank N candidate posts for each user at feed load time
  • Success metric: engagement (likes, comments, shares), time spent
  • Latency: <200ms for feed generation
  • Data: user graph, post metadata, historical engagement

Frame:

  • For each (user, post) pair, predict engagement probability
  • Rank posts by predicted probability

Data:

Training examples:
- Positive: (user, post) pairs where user engaged
- Negative: (user, post) pairs where the post was shown but the user did not engage

Features:
User: age of account, activity level, interests, friend count
Post: author affinity to user, recency, past engagement rate, content type
Context: time of day, device, session length
Interaction: has user interacted with this author before?
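As a sketch, a single (user, post, context) feature row might be assembled like this; every field name here is hypothetical, standing in for the signals listed above:

```python
def build_features(user, post, context, interactions):
    """Assemble one feature row for a (user, post) candidate pair.
    All field names are illustrative, not from a real system."""
    return {
        # User features
        "account_age_days": user["account_age_days"],
        "friend_count": user["friend_count"],
        # Post features
        "post_age_minutes": context["now_minutes"] - post["created_minutes"],
        "post_engagement_rate": post["engagement_rate"],
        # Context features
        "hour_of_day": context["hour_of_day"],
        # Interaction features
        "has_engaged_with_author": int(post["author_id"] in interactions["engaged_authors"]),
    }

row = build_features(
    user={"account_age_days": 420, "friend_count": 310},
    post={"created_minutes": 90, "engagement_rate": 0.07, "author_id": "a42"},
    context={"now_minutes": 120, "hour_of_day": 9},
    interactions={"engaged_authors": {"a42", "a7"}},
)
```

In production the user and interaction fields would come precomputed from a feature store, while the context fields are computed at request time.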

Model:

Two-stage: Candidate Retrieval → Ranking

Stage 1 - Retrieval (must be fast, ~milliseconds):
- Collaborative filtering: posts liked by similar users
- Content-based: posts matching user interests
- Social: posts from friends/follows
- Output: ~500 candidates

Stage 2 - Ranking (can be slower, has all candidates):
- Gradient boosted trees or neural net
- ~hundreds of features
- Multi-task: separate heads predicting likes, comments, shares, and hides
- Output: ranked list of 20-50 posts
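The two-stage flow can be sketched end to end; the toy source and scoring functions below are stand-ins for the real retrieval sources and ranking model:

```python
import heapq

def retrieve_candidates(user_id, sources, k=500):
    """Stage 1: pull candidates from several sources, deduplicate
    (keeping the best coarse score), and cap at k.
    Each source returns (post_id, coarse_score) pairs."""
    seen = {}
    for source in sources:
        for post_id, score in source(user_id):
            seen[post_id] = max(score, seen.get(post_id, float("-inf")))
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

def rank(candidates, score_fn, n=20):
    """Stage 2: re-score every candidate with the expensive ranking model."""
    scored = ((pid, score_fn(pid)) for pid, _ in candidates)
    return heapq.nlargest(n, scored, key=lambda kv: kv[1])

# Hypothetical stand-ins for retrieval sources and the ranking model
social = lambda uid: [("p1", 0.9), ("p2", 0.4)]
content = lambda uid: [("p2", 0.7), ("p3", 0.5)]
ranking_model = lambda pid: {"p1": 0.2, "p2": 0.8, "p3": 0.6}[pid]

candidates = retrieve_candidates("u1", [social, content])
feed = rank(candidates, ranking_model, n=2)  # → [("p2", 0.8), ("p3", 0.6)]
```

Note how the ranking model can reverse the retrieval order: p1 had the best coarse score but falls out of the final feed once the full feature set is scored.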

Serving:

User request
    │
    ▼
[Candidate Retrieval]  ← Pulls from multiple sources in parallel
    │ ~500 candidates
    ▼
[Ranking Model]        ← Scores each candidate
    │ Ranked list
    ▼
[Post-processing]      ← Diversity, freshness, hard rules
    │
    ▼
Feed response

Applied Example 2: Fraud Detection

Problem: Detect fraudulent credit card transactions in real time.

Clarify:

  • Binary classification: fraud or not
  • Business metric: fraud loss prevented, false positive rate
  • Latency: <100ms (blocks transaction approval)
  • Class imbalance: ~0.1% of transactions are fraud

Key considerations:

  • Cost asymmetry: false negative (miss fraud) is expensive; false positive (block legit transaction) is a bad user experience
  • Set decision threshold to reflect this asymmetry
  • Real-time requirement means per-request features must be cheap to compute at serving time; anything expensive is precomputed offline and served from a feature store
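The decision threshold follows directly from the cost asymmetry. Under a simple expected-cost model (the 50:1 cost ratio below is illustrative), block the transaction when the expected loss from approving exceeds the expected loss from blocking:

```python
def decision_threshold(cost_fp, cost_fn):
    """Block when p_fraud * cost_fn > (1 - p_fraud) * cost_fp,
    i.e. when p_fraud exceeds cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

def should_block(p_fraud, threshold):
    return p_fraud > threshold

# Illustrative: missing fraud costs 50x more than blocking a legit transaction,
# so we block even at quite low predicted fraud probability.
threshold = decision_threshold(cost_fp=1.0, cost_fn=50.0)  # = 1/51 ≈ 0.0196
```

In practice the threshold is then tuned on a validation set against the business's actual fraud-loss and customer-friction numbers, but the asymmetry argument sets the starting point.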

Features:

# Transaction-level features
{
    "amount": 250.00,
    "merchant_category": "online_retail",
    "time_of_day": 2,  # 2am — unusual
    "transaction_velocity_1h": 3,  # 3 transactions in 1 hour
}

# User-level features (precomputed, stored in feature store)
{
    "avg_transaction_amount_30d": 75.00,  # 250 is unusually high
    "fraction_international_30d": 0.02,
    "days_since_account_creation": 1200,
}

# Merchant features
{
    "merchant_fraud_rate_30d": 0.002,
    "merchant_new_account_flag": False,
}

Model choice: Gradient boosted trees (XGBoost/LightGBM) work well here. They handle tabular data well, are fast at inference, and are interpretable (feature importance).

Handling imbalance:

from xgboost import XGBClassifier

model = XGBClassifier(
    scale_pos_weight=999,  # ratio of negative to positive samples (~999:1 at 0.1% fraud)
    eval_metric="aucpr",   # use PR-AUC, not ROC-AUC, for imbalanced data
)

Applied Example 3: Query-Document Search Ranking

Problem: Rank search results for a document search engine.

Frame: learn-to-rank — given a query and a set of candidate documents, rank them by relevance.

Training data: click logs. If a user queries "machine learning tutorial" and clicks result 3 but not 1 or 2, that's a weak signal that result 3 is more relevant.
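Turning such click logs into pairwise training examples is mechanical. A common heuristic (assumed here, not the only option) is that a clicked document is preferred over the skipped documents ranked above it:

```python
def pairwise_examples(ranked_doc_ids, clicked_ids):
    """Generate (preferred, less_preferred) pairs from one query's click log.
    A clicked doc is assumed more relevant than skipped docs shown above it."""
    clicked = set(clicked_ids)
    pairs = []
    for pos, doc in enumerate(ranked_doc_ids):
        if doc in clicked:
            pairs.extend(
                (doc, skipped)
                for skipped in ranked_doc_ids[:pos]
                if skipped not in clicked
            )
    return pairs

# User clicked result 3 but skipped results 1 and 2
pairs = pairwise_examples(["d1", "d2", "d3"], clicked_ids=["d3"])
# → [("d3", "d1"), ("d3", "d2")]
```

This "skip-above" assumption corrects for position bias in a crude way: documents the user never scrolled past tell us nothing, so they generate no pairs.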

Features:

Query-Document features:
- BM25 score (keyword match)
- Semantic similarity (cosine of query/doc embeddings)
- Query term coverage
- Document freshness

Document features:
- PageRank / authority score
- Historical click-through rate
- Document length

Query features:
- Query length
- Is it a navigational query? (single entity)
- Query frequency (popular vs. rare)
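Of the query-document features above, BM25 is simple enough to sketch in full. This is the standard Okapi formula with the usual k1 and b defaults; the three-document corpus is a toy example:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """BM25 score of one document for a query, given the full corpus
    (a list of token lists) for IDF and average-length statistics."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)       # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = tf[term]                                     # term frequency in doc
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        )
    return score

docs = [["machine", "learning", "tutorial"],
        ["deep", "learning"],
        ["cooking", "tutorial"]]
s = bm25_score(["machine", "learning"], docs[0], docs)  # matches both terms
```

The first document matches both query terms, so it outscores the second (which matches only "learning"); a term absent from the document contributes zero.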

Model: LambdaMART (gradient-boosted trees for ranking). Its lambda gradients are weighted by the change in ranking metrics like NDCG, so it optimizes list quality rather than pointwise accuracy.

import lightgbm as lgb

train_data = lgb.Dataset(
    X_train, label=y_train,
    group=query_group_sizes,  # Number of docs per query
)

model = lgb.train(
    params={"objective": "lambdarank", "metric": "ndcg", "ndcg_eval_at": [5, 10]},
    train_set=train_data,
    num_boost_round=200,
)

Common Interview Mistakes

Jumping to model before understanding data: Interviewers notice. Always ask about data first.

Ignoring the serving architecture: A model that takes 5 seconds to score is useless for real-time systems. Always think about the latency budget.

Forgetting evaluation: How do you know the model is working? Define metrics, baselines, and A/B testing strategy.

Over-complicating the first iteration: Start with the simplest reasonable approach. Add complexity only when you can justify it.

Not discussing failure modes: What happens when the model is wrong? How do you catch it? This shows production awareness.


Build the system design foundations you need with our deep dives on recommendation systems and LLM inference.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.