Introduction
Meta operates one of the world's largest ML platforms, supporting everything from content ranking to AR/VR experiences. This case study explores the company's infrastructure decisions and the lessons learned.
Platform Scale
Numbers
- Trillions of predictions daily
- Exabytes of training data
- Thousands of production models
Use Cases
- News Feed ranking
- Ads prediction
- Content moderation
- Recommendation systems
- AR/VR experiences
Architecture
Training Infrastructure
    Data Lake -> Feature Extraction -> Training Cluster -> Model Store
        |                |                    |                 |
     (HDFS)           (Spark)           (PyTorch/GPU)       (versioned)
Serving Infrastructure
    Request -> Feature Serving -> Prediction Serving -> Response
       |              |                   |
     (edge)      (real-time)        (low latency)
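The serving path above can be sketched as three plain functions. This is a minimal illustration under stated assumptions, not Meta's implementation: the in-memory `FEATURE_STORE`, the feature names, and the linear model in `WEIGHTS` are all hypothetical stand-ins for a real-time feature service and a model loaded from the model store.

```python
import math

# Hypothetical in-memory feature store keyed by user id
# (stand-in for a real-time feature service).
FEATURE_STORE = {
    "user_1": {"clicks_last_hour": 3.0, "follows": 120.0},
    "user_2": {"clicks_last_hour": 0.0, "follows": 5.0},
}

# Hypothetical linear model (stand-in for a trained, versioned model).
WEIGHTS = {"clicks_last_hour": 0.8, "follows": 0.01}
BIAS = -1.0


def fetch_features(user_id):
    """Feature serving: return stored features, or zeros for unknown users."""
    return FEATURE_STORE.get(user_id, {name: 0.0 for name in WEIGHTS})


def predict(features):
    """Prediction serving: logistic score from a weighted sum of features."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def handle_request(user_id):
    """Request -> Feature Serving -> Prediction Serving -> Response."""
    features = fetch_features(user_id)
    return {"user_id": user_id, "score": predict(features)}
```

Keeping feature lookup and prediction as separate steps mirrors the diagram: each stage can be scaled, cached, or replaced independently of the other.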
Key Components
Data Platform
- Unified data lake for all ML teams
- Feature engineering at scale
- Privacy-preserving data access
Training Platform
- Purpose-built training clusters and hardware (for example, the AI Research SuperCluster, RSC)
- Distributed training frameworks
- Experiment management
Serving Platform
- Heterogeneous compute (CPU, GPU, custom)
- Multi-tenant serving infrastructure
- Continuous deployment
Technical Innovations
Model Efficiency
Meta makes heavy use of several efficiency techniques:
- Quantization-aware training
- Knowledge distillation
- Neural architecture search
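To make the first technique concrete, here is a minimal sketch of the "fake quantization" step used in quantization-aware training: values are rounded to a low-bit integer grid and mapped back to floats, so the model sees quantization error during the forward pass. The function name `fake_quantize` and the symmetric per-tensor scheme are assumptions chosen for illustration. A real setup would use a framework's quantization tooling and pass gradients through the rounding step with a straight-through estimator.

```python
def fake_quantize(values, num_bits=8):
    """Quantize floats to signed integers and map them back to floats.

    Symmetric per-tensor scheme (an assumption): scale = max|x| / qmax.
    Returns (dequantized_values, scale).
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        # Nothing to scale; all values are already exactly representable.
        return list(values), 1.0
    scale = max_abs / qmax
    # Round to the nearest integer level and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale


weights = [0.5, -1.27, 0.003, 1.0]
dequantized, scale = fake_quantize(weights)
```

Each dequantized value differs from the original by at most half a quantization step (`scale / 2`), which is the error the model learns to tolerate during training.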
Embedding Tables
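    # Handling trillion-parameter embeddings (illustrative single-process sketch)
    import numpy as np

    class DistributedEmbedding:
        def __init__(self, num_embeddings, dim, num_shards):
            self.num_shards = num_shards
            # Ceiling division so no rows are lost when the split is uneven
            rows_per_shard = -(-num_embeddings // num_shards)
            # Each array stands in for a table that would live on its own host
            self.tables = [
                np.zeros((rows_per_shard, dim), dtype=np.float32)
                for _ in range(num_shards)
            ]

        def forward(self, indices):
            # Distributed lookup: index i lives at row i // num_shards of shard i % num_shards
            indices = np.asarray(indices)
            shards, rows = indices % self.num_shards, indices // self.num_shards
            return np.stack([self.tables[s][r] for s, r in zip(shards, rows)])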
Real-Time ML
- Sub-second feature updates
- Online learning for freshness
- Real-time A/B testing
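As an illustration of online learning for freshness, the sketch below updates a logistic regression model one event at a time with stochastic gradient descent, so each new observation influences the next prediction. The class name, learning rate, and toy event stream are assumptions for the example and do not describe Meta's systems.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time with SGD."""

    def __init__(self, num_features, learning_rate=0.1):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        """Apply one gradient step on the log loss for a single event."""
        error = self.predict(features) - label
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(num_features=2)
# Hypothetical event stream: the label is 1 when the first feature is active.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 200
for features, label in stream:
    model.update(features, label)
```

Because the model is never retrained from scratch, the cost per event is constant, which is what makes sub-second freshness plausible at high event rates.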
Operational Excellence
Monitoring
- Model quality metrics
- Serving latency distributions
- Resource utilization
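Latency is usually tracked as a distribution because averages hide tail behavior. The sketch below computes nearest-rank percentiles over a batch of latency samples. The helper names `percentile` and `latency_summary` and the choice of p50/p95/p99 are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """Summarize a batch of request latencies (milliseconds) by percentile."""
    return {
        name: percentile(latencies_ms, pct)
        for name, pct in (("p50", 50), ("p95", 95), ("p99", 99))
    }


# Hypothetical samples: most requests are fast, two are very slow.
samples = [10] * 98 + [500, 900]
summary = latency_summary(samples)
```

In this toy sample the median and p95 are both 10 ms while p99 is 500 ms, so the slow tail is only visible in the highest percentile.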
Incident Response
- Automated rollback
- Gradual deployments
- Clear ownership
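Gradual deployment and automated rollback fit together naturally: traffic to a new model grows in stages, and any unhealthy stage sends it back to zero. The following sketch assumes a hypothetical `error_rate_at` callback that reports the observed error rate at a given traffic percentage. The stage list and the 1% threshold are illustrative values, not figures from Meta.

```python
def gradual_rollout(stages, error_rate_at, max_error_rate=0.01):
    """Walk through traffic percentages, rolling back if a stage is unhealthy.

    stages: increasing traffic percentages, e.g. [1, 10, 50, 100].
    error_rate_at: callable returning the observed error rate at a percentage.
    Returns (final_traffic_percent, history).
    """
    history = []
    for percent in stages:
        observed = error_rate_at(percent)
        healthy = observed <= max_error_rate
        history.append((percent, observed, healthy))
        if not healthy:
            # Automated rollback: send no traffic to the new model.
            return 0, history
    return (stages[-1] if stages else 0), history


# A healthy rollout reaches full traffic.
final_ok, _ = gradual_rollout([1, 10, 50, 100], lambda percent: 0.001)

# A model that degrades under load is rolled back at the 50% stage.
final_bad, bad_history = gradual_rollout(
    [1, 10, 50, 100], lambda percent: 0.05 if percent >= 50 else 0.001
)
```

The history of stages and health checks is returned alongside the result so the owning team can see exactly where a rollout stopped.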
Lessons Learned
- Invest heavily in infrastructure
- Standardization enables velocity
- Efficiency at scale matters enormously
- People and process are as important as technology