Introduction
User behavior sequences contain rich information for recommendations, but modeling long sequences efficiently remains challenging. This deep dive explores techniques for scaling sequence models to thousands of interactions.
The Value of Long Sequences
Why Longer is Better
- Comprehensive user understanding
- Capture evolving interests
- Identify long-term patterns
Real-World Evidence
Reported industry results suggest:
- 10x more history has been reported to improve prediction quality by 15-20%
- Long-term interests often differ from short-term
- Seasonal patterns require months of data
Challenges
Computational Complexity
Standard self-attention: O(n²) time and memory in the sequence length n
For a 10,000-item sequence:
- Memory: ~400MB for a single float32 attention matrix (10,000² entries × 4 bytes)
- Compute: billions of multiply-adds for Q @ K.T alone
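The memory figure above is simple arithmetic: a dense n × n attention matrix in float32 costs 4 bytes per entry (a single head, before batching):

```python
# Memory for one dense n x n attention matrix in float32 (4 bytes/entry).
n = 10_000
attn_bytes = n * n * 4
print(attn_bytes / 1e6, "MB")   # 400.0 MB

# Compute for Q @ K.T alone: n^2 * d multiply-adds (d = head dim, e.g. 64).
d = 64
print(n * n * d)                # 6,400,000,000 multiply-adds
```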
Information Density
- Not all history is equally relevant
- Recent items often most predictive
- Need to separate signal from noise
Solutions
Efficient Attention Mechanisms
Linear Attention
# Standard: O(n²) — materializes the full n x n attention matrix
attention = softmax(Q @ K.T) @ V
# Linear: O(n) — associativity lets us compute phi(K).T @ V first, a small d x d matrix
attention = phi(Q) @ (phi(K).T @ V)
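A runnable sketch of the linear form, using phi(x) = elu(x) + 1 as the feature map (one common choice; the function and variable names here are illustrative). The key point is that phi(K).T @ V is a d × d summary built once, so cost grows linearly in n:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention sketch with phi(x) = elu(x) + 1 (keeps features positive).

    Cost is O(n * d^2): phi(K).T @ V is a d x d matrix, computed once and
    reused for every query row, instead of an n x n attention matrix.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                       # (d, d) summary, independent of n
    Z = Qp @ Kp.sum(axis=0)             # per-row softmax-style normalizer, (n,)
    return (Qp @ KV) / (Z[:, None] + eps)

rng = np.random.default_rng(0)
n, d = 10_000, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)                        # (10000, 64)
```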
Sparse Attention
- Local attention windows
- Global tokens for long-range
- Block-sparse patterns
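The local-window and global-token ideas above can be combined in a single boolean mask; a minimal sketch (the function name and interface are illustrative):

```python
import numpy as np

def local_attention_mask(n, window, global_tokens=()):
    """Boolean mask: True where position i is allowed to attend to position j.

    Each token sees only a local window around itself; designated global
    tokens attend to, and are attended by, every position.
    """
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window   # local band
    for g in global_tokens:
        mask[g, :] = True    # global token attends everywhere
        mask[:, g] = True    # everyone attends to the global token
    return mask

mask = local_attention_mask(n=8, window=1, global_tokens=(0,))
print(int(mask.sum()))   # 34 allowed pairs, versus 64 for dense attention
```

Scoring only the True entries is what makes the pattern sub-quadratic: allowed pairs grow as O(n · window) rather than O(n²).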
Hierarchical Modeling
Raw Items  ->  Session Summary  ->  User Summary  ->  Prediction
 (items)         (sessions)         (long-term)
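The pipeline above can be sketched with simple mean-pooling as the summarizer (real systems use learned encoders; names and boundaries here are illustrative):

```python
import numpy as np

def hierarchical_summary(item_embs, session_bounds):
    """Progressive summarization sketch.

    Mean-pools item embeddings into per-session vectors, then pools
    sessions into one long-term user vector, so the final predictor
    never attends over the raw item sequence.
    """
    sessions = np.stack([
        item_embs[s:e].mean(axis=0) for s, e in session_bounds
    ])                                   # (num_sessions, d)
    user = sessions.mean(axis=0)         # (d,)
    return sessions, user

rng = np.random.default_rng(1)
items = rng.standard_normal((100, 16))       # 100 raw items, d = 16
bounds = [(0, 30), (30, 70), (70, 100)]      # three sessions
sessions, user = hierarchical_summary(items, bounds)
print(sessions.shape, user.shape)            # (3, 16) (16,)
```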
Memory-Augmented Models
- External memory banks
- Retrievable user states
- Compressed representations
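One way the three bullets fit together, as a hypothetical sketch (the class, slot count, and EMA update are all assumptions, not a specific production design): keep a few compressed slots per user, write new states into them, and retrieve the most relevant slots at inference time.

```python
import numpy as np

class UserMemoryBank:
    """Hypothetical external memory bank: a handful of compressed slots
    per user instead of the raw interaction history."""

    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.slots = rng.standard_normal((num_slots, dim))

    def write(self, slot, state, alpha=0.1):
        # Exponential-moving-average compression of new states into a slot.
        self.slots[slot] = (1 - alpha) * self.slots[slot] + alpha * state

    def read(self, query, k=2):
        # Retrieve the k slots most similar (by dot product) to the query.
        scores = self.slots @ query
        top = np.argsort(-scores)[:k]
        return self.slots[top]

bank = UserMemoryBank(num_slots=8, dim=16)
bank.write(slot=0, state=np.ones(16))
retrieved = bank.read(query=np.ones(16))
print(retrieved.shape)   # (2, 16)
```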
Industry Implementations
Alibaba SIM
- Search-based Interest Model
- Two-stage: search then attend
- Handles millions of behaviors
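A toy sketch of the two-stage idea in the spirit of SIM's hard search: stage one filters the long history to behaviors sharing the target item's category, stage two runs ordinary attention over the much shorter subset. Field names and the category-match rule are illustrative, not SIM's exact implementation:

```python
import numpy as np

def two_stage_interest(history_cat, history_emb, target_cat, target_emb):
    """Two-stage sketch: search, then attend.

    1) Search: keep only behaviors in the target item's category.
    2) Attend: softmax attention over the retrieved sub-sequence.
    """
    keep = history_cat == target_cat               # stage 1: cheap hard search
    sub = history_emb[keep]
    if len(sub) == 0:
        return np.zeros_like(target_emb)
    scores = sub @ target_emb                      # stage 2: attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ sub                           # attended interest vector

rng = np.random.default_rng(2)
cats = rng.integers(0, 50, size=100_000)           # 100k behaviors, 50 categories
embs = rng.standard_normal((100_000, 8))
interest = two_stage_interest(cats, embs, target_cat=7, target_emb=np.ones(8))
print(interest.shape)   # (8,)
```

The search stage is what makes million-scale histories tractable: attention cost depends only on the retrieved subset, not on the full sequence length.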
Meta HSTU
- Hierarchical Sequential Transduction Units
- Progressive summarization
- Production deployment
Best Practices
- Start with strong baselines: Simple models often competitive
- Profile memory usage: Long sequences can OOM
- Consider inference cost: Training is one-time, inference is forever
- A/B test carefully: Offline gains may not transfer
Master sequence modeling in our Recommendation Systems at Scale course.