Machine Learning System Design

Master the art of designing and implementing robust machine learning systems. This comprehensive guide covers everything from problem framing to production deployment.

It features real-world case studies, practical MLOps techniques, and industry best practices to help you build scalable, efficient ML systems.

Explore our table of contents below for a preview of the topics covered in this extensive guide.

Ready to dive deeper? Check out premium options for full access.

Full PDF (coming soon).

Interested in a live high-end course? Let me know!

Table of Contents

3. Key metrics for different problems:

  1. Offline metrics (tech-oriented):
    • Classification metrics:
      • Precision
      • Recall
      • F1 Score
      • Accuracy
      • ROC-AUC
      • PR-AUC
      • Confusion Matrix
    • Regression metrics:
      • MSE
      • MAE
      • RMSE
    • Ranking metrics:
      • Precision@k
      • Recall@k
      • MRR
      • mAP
      • nDCG
    • Natural language metrics:
      • BLEU
      • METEOR
      • ROUGE
  2. Online metrics (business-oriented):
    • Click-through Rate (CTR)
    • Revenue lift
    • Impression increase
    • Prevalence
    • Valid appeals
    • Total watch time
    • MAU
    • Number of completed videos
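
As a taste of the offline ranking metrics listed above, here is a minimal sketch of Precision@k, Recall@k, and the reciprocal rank that MRR averages over queries. The function names and the toy item IDs are illustrative assumptions, not material taken from the guide, and the sketch assumes binary relevance judgments.

```python
from typing import Hashable, Sequence, Set


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # convention chosen for this sketch: no relevant items means recall 0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Sequence[tuple]) -> float:
    """Average reciprocal rank over (ranked_list, relevant_set) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Toy data (hypothetical item IDs).
ranked_items = ["a", "b", "c", "d", "e"]
relevant_items = {"b", "d", "f"}

p_at_4 = precision_at_k(ranked_items, relevant_items, 4)  # 2 hits in top 4
r_at_4 = recall_at_k(ranked_items, relevant_items, 4)     # 2 of 3 relevant found
rr = reciprocal_rank(ranked_items, relevant_items)        # first hit at rank 2
```

With this toy data, two of the top four items are relevant and the first relevant item sits at rank 2. Graded-relevance metrics such as nDCG need per-item gain values and are not covered by this sketch.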

4. Data collection and preparation

  1. Data collection and labelling
  2. Feature engineering
  3. Data processing
  4. Feature stores:
    • Standard features
    • Add embeddings to the mix!

5. Model development

  1. The role of baseline models
  2. Choosing the right model for the right case
  3. Testing models on specific slices of data
  4. Measuring the impact of potential improvements

6. Model serving

  1. Online model execution
  2. Testing and evaluation
  3. Model rollback strategies
  4. Monitoring model behavior in production: inputs, operational metrics, predictions, user feedback
  5. Data drift vs. concept drift, and adversarial validation
  6. Keep your models working in the face of distribution shifts: retraining models, stateless vs stateful training and more

7. Ad systems

  1. Ads targeting on YouTube
  2. Ad click prediction
  3. Different Ad systems:
    • Two-stage prediction framework
    • Ad system for a feed-based product
    • Ad system for a search-based product

8. Anti-abuse systems

  1. Harmful content removal
  2. Detect abusive accounts
  3. Detect hijackers
  4. Email spam classifier
  5. Yelp anti-abuse system

9. Recommendation systems

  1. Netflix recommendation system
  2. YouTube recommendation system
  3. Spotify songs recommendation system
  4. Recommendation system to suggest replacement items
  5. Twitter followers recommendation system
  6. YouTube watch next recommendation system
  7. Pinterest recommendation system based on graphs
  8. Real-time personalization using embeddings for search ranking at Airbnb
  9. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba
  10. Deep Reinforcement Learning for Online Advertising in Recommender Systems (TikTok)

10. Ranking systems

  • News feed ranking
  • Search ranking
  • Rental search ranking
  • System to suggest trending hashtags on Twitter
  • Ranking answers on Quora
  • Search ranking on Airbnb

11. NLP-based systems

  1. Extract positive and negative reviews from a platform
  2. Build a similar question recommendation system for StackOverflow
  3. Measure difficulty level of stories for language learning on Duolingo
  4. Edit stories to adjust difficulty level for language learners
  5. Implement Uber's QueryGPT system
  6. Generate related searches for Google search queries
  7. Design a chatbot for hotel bookings

12. Vision systems

  1. In-video search by Netflix
  2. Design an ML system to extract opening-hours information from a single storefront image
  3. Image search
  4. Image segmentation for self-driving cars

13. Infrastructure ML systems

  • Throughput optimized model inference system
  • Latency sensitive model serving system with horizontal auto scaling
  • Real-time approximate nearest neighbour systems (e.g. Annoy, Faiss)
  • Text indexing systems (Lucene, Elasticsearch)
  • Database and distributed data systems like Spark
  • Serving systems (TFX)
  • Model tracking and management systems (Kubeflow, MLflow)
  • Distributed training platform

14. Bonus systems

  1. ML to improve streaming quality
  2. Design a system to match pool riders for Lyft and Uber
  3. Design a system that estimates the month and day of people's birthdays