Introduction
User behavior sequences contain rich information for recommendations, but modeling long sequences efficiently remains challenging. This deep dive explores techniques for scaling sequence models to thousands of interactions.
The Value of Long Sequences
Why Longer is Better
- Comprehensive user understanding
- Capture evolving interests
- Identify long-term patterns
Real-World Evidence
Reported industry results suggest:
- 10x more history has been reported to improve prediction quality by 15-20%
- Long-term interests often differ from short-term
- Seasonal patterns require months of data
Challenges
Computational Complexity
Standard self-attention: O(n²) time and memory in the sequence length n
For a 10,000-item sequence:
- Memory: ~400MB for a single float32 attention matrix (10,000² entries × 4 bytes)
- Compute: billions of multiply-adds for Q @ K.T alone
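The memory figure above is simple arithmetic: a dense n × n attention matrix in float32 costs 4 bytes per entry (a single head, before batching):

```python
# Memory for one dense n x n attention matrix in float32 (4 bytes/entry).
n = 10_000
attn_bytes = n * n * 4
print(attn_bytes / 1e6, "MB")   # 400.0 MB

# Compute for Q @ K.T alone: n^2 * d multiply-adds (d = head dim, e.g. 64).
d = 64
print(n * n * d)                # 6,400,000,000 multiply-adds
```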
Information Density
- Not all history is equally relevant
- Recent items often most predictive
- Need to separate signal from noise
Solutions
Efficient Attention Mechanisms
Linear Attention
# Standard: O(n²) — materializes the full n x n attention matrix
attention = softmax(Q @ K.T) @ V
# Linear: O(n) — associativity lets us compute phi(K).T @ V first, a small d x d matrix
attention = phi(Q) @ (phi(K).T @ V)
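A runnable sketch of the linear form, using phi(x) = elu(x) + 1 as the feature map (one common choice; the function and variable names here are illustrative). The key point is that phi(K).T @ V is a d × d summary built once, so cost grows linearly in n:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Linear attention sketch with phi(x) = elu(x) + 1 (keeps features positive).

    Cost is O(n * d^2): phi(K).T @ V is a d x d matrix, computed once and
    reused for every query row, instead of an n x n attention matrix.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                       # (d, d) summary, independent of n
    Z = Qp @ Kp.sum(axis=0)             # per-row softmax-style normalizer, (n,)
    return (Qp @ KV) / (Z[:, None] + eps)

rng = np.random.default_rng(0)
n, d = 10_000, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)                        # (10000, 64)
```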
Sparse Attention
- Local attention windows
- Global tokens for long-range
- Block-sparse patterns
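The local-window and global-token ideas above can be combined in a single boolean mask; a minimal sketch (the function name and interface are illustrative):

```python
import numpy as np

def local_attention_mask(n, window, global_tokens=()):
    """Boolean mask: True where position i is allowed to attend to position j.

    Each token sees only a local window around itself; designated global
    tokens attend to, and are attended by, every position.
    """
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window   # local band
    for g in global_tokens:
        mask[g, :] = True    # global token attends everywhere
        mask[:, g] = True    # everyone attends to the global token
    return mask

mask = local_attention_mask(n=8, window=1, global_tokens=(0,))
print(int(mask.sum()))   # 34 allowed pairs, versus 64 for dense attention
```

Scoring only the True entries is what makes the pattern sub-quadratic: allowed pairs grow as O(n · window) rather than O(n²).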
Hierarchical Modeling
Raw Items  ->  Session Summary  ->  User Summary  ->  Prediction
 (items)         (sessions)         (long-term)
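The pipeline above can be sketched with simple mean-pooling as the summarizer (real systems use learned encoders; names and boundaries here are illustrative):

```python
import numpy as np

def hierarchical_summary(item_embs, session_bounds):
    """Progressive summarization sketch.

    Mean-pools item embeddings into per-session vectors, then pools
    sessions into one long-term user vector, so the final predictor
    never attends over the raw item sequence.
    """
    sessions = np.stack([
        item_embs[s:e].mean(axis=0) for s, e in session_bounds
    ])                                   # (num_sessions, d)
    user = sessions.mean(axis=0)         # (d,)
    return sessions, user

rng = np.random.default_rng(1)
items = rng.standard_normal((100, 16))       # 100 raw items, d = 16
bounds = [(0, 30), (30, 70), (70, 100)]      # three sessions
sessions, user = hierarchical_summary(items, bounds)
print(sessions.shape, user.shape)            # (3, 16) (16,)
```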
Memory-Augmented Models
- External memory banks
- Retrievable user states
- Compressed representations
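One way the three bullets fit together, as a hypothetical sketch (the class, slot count, and EMA update are all assumptions, not a specific production design): keep a few compressed slots per user, write new states into them, and retrieve the most relevant slots at inference time.

```python
import numpy as np

class UserMemoryBank:
    """Hypothetical external memory bank: a handful of compressed slots
    per user instead of the raw interaction history."""

    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.slots = rng.standard_normal((num_slots, dim))

    def write(self, slot, state, alpha=0.1):
        # Exponential-moving-average compression of new states into a slot.
        self.slots[slot] = (1 - alpha) * self.slots[slot] + alpha * state

    def read(self, query, k=2):
        # Retrieve the k slots most similar (by dot product) to the query.
        scores = self.slots @ query
        top = np.argsort(-scores)[:k]
        return self.slots[top]

bank = UserMemoryBank(num_slots=8, dim=16)
bank.write(slot=0, state=np.ones(16))
retrieved = bank.read(query=np.ones(16))
print(retrieved.shape)   # (2, 16)
```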
Industry Implementations
Alibaba SIM
- Search-based Interest Model
- Two-stage: search then attend
- Handles millions of behaviors
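A toy sketch of the two-stage idea in the spirit of SIM's hard search: stage one filters the long history to behaviors sharing the target item's category, stage two runs ordinary attention over the much shorter subset. Field names and the category-match rule are illustrative, not SIM's exact implementation:

```python
import numpy as np

def two_stage_interest(history_cat, history_emb, target_cat, target_emb):
    """Two-stage sketch: search, then attend.

    1) Search: keep only behaviors in the target item's category.
    2) Attend: softmax attention over the retrieved sub-sequence.
    """
    keep = history_cat == target_cat               # stage 1: cheap hard search
    sub = history_emb[keep]
    if len(sub) == 0:
        return np.zeros_like(target_emb)
    scores = sub @ target_emb                      # stage 2: attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ sub                           # attended interest vector

rng = np.random.default_rng(2)
cats = rng.integers(0, 50, size=100_000)           # 100k behaviors, 50 categories
embs = rng.standard_normal((100_000, 8))
interest = two_stage_interest(cats, embs, target_cat=7, target_emb=np.ones(8))
print(interest.shape)   # (8,)
```

The search stage is what makes million-scale histories tractable: attention cost depends only on the retrieved subset, not on the full sequence length.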
Meta HSTU
- Hierarchical Sequential Transduction Units
- Progressive summarization
- Production deployment
Best Practices
- Start with strong baselines: Simple models often competitive
- Profile memory usage: Long sequences can OOM
- Consider inference cost: Training is one-time, inference is forever
- A/B test carefully: Offline gains may not transfer
Master sequence modeling in our Recommendation Systems at Scale course.