Machine learning at scale 🤖

Filtering — What Gets Cut and When

Understanding the filtering hierarchy and why order matters for system efficiency.

The Filtering Hierarchy: Why Order Matters

Three rules of thumb determine the order in which filters should run:

Cheapest filters first: Eliminate candidates early to save compute
Highest rejection rate first: Apply the filters that remove the most candidates before the rest
Deterministic before ML: Use rules before expensive model inference

The wrong order can waste significant compute on ads that will eventually be filtered out.
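The first two rules can pull in different directions, so it helps to combine them into one score. The sketch below is one common heuristic, not a prescribed method: assuming filters reject candidates independently of each other, sorting by cost per rejected candidate (cost divided by rejection rate) minimizes the expected work per candidate. The filter names, costs, and rejection rates are made-up illustrative numbers.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class FilterSpec:
    name: str
    cost_us: float      # assumed cost per candidate, in microseconds
    reject_rate: float  # assumed fraction of candidates this filter rejects


def expected_cost(order):
    """Expected per-candidate cost when a rejected candidate skips later filters."""
    total, surviving = 0.0, 1.0
    for f in order:
        total += surviving * f.cost_us      # only survivors pay for this filter
        surviving *= 1.0 - f.reject_rate
    return total


def order_filters(filters):
    """Sort by cost per rejection; cheap, highly selective filters go first."""
    return sorted(filters, key=lambda f: f.cost_us / max(f.reject_rate, 1e-9))


# Hypothetical numbers: the ML filter is listed first to mimic a bad ordering.
filters = [
    FilterSpec("ml_quality", cost_us=500.0, reject_rate=0.30),
    FilterSpec("geo_targeting", cost_us=1.0, reject_rate=0.90),
    FilterSpec("budget", cost_us=2.0, reject_rate=0.20),
]

best = order_filters(filters)
# Brute-force check over all orderings (feasible only for a handful of filters).
cheapest = min(expected_cost(p) for p in permutations(filters))

print([f.name for f in best])
print(round(expected_cost(filters), 2), "vs", round(expected_cost(best), 2))
```

With these assumed numbers, running the ML filter first costs roughly twelve times more per candidate than running it last, even though all three orderings reject exactly the same ads.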

Hard Constraints: Targeting, Eligibility, Policy Compliance

Targeting Constraints

Geographic restrictions
Demographic targeting
Device type requirements
Time-based restrictions

These are typically checked first using inverted indexes.
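As a rough illustration of the inverted-index idea, the sketch below maps each (attribute, value) pair to the set of ads that target it, then intersects the posting sets for a request. The class name `TargetingIndex`, the `ANY` wildcard, and the attribute names are hypothetical; a production index would use compressed posting lists rather than Python sets.

```python
from collections import defaultdict

ANY = "*"  # hypothetical wildcard: the ad places no restriction on this attribute


class TargetingIndex:
    def __init__(self, attributes):
        self.attributes = list(attributes)
        self.postings = defaultdict(set)  # (attribute, value) -> set of ad ids

    def add(self, ad_id, targeting):
        """Index an ad; a missing or empty attribute means 'match any value'."""
        for attr in self.attributes:
            for value in targeting.get(attr) or [ANY]:
                self.postings[(attr, value)].add(ad_id)

    def match(self, request):
        """Return ids of ads whose targeting accepts every request attribute."""
        result = None
        for attr in self.attributes:
            exact = self.postings.get((attr, request.get(attr)), set())
            unrestricted = self.postings.get((attr, ANY), set())
            ids = exact | unrestricted
            result = ids if result is None else result & ids
            if not result:
                return set()  # stop early once nothing can match
        return result if result is not None else set()


index = TargetingIndex(["geo", "device"])
index.add("ad1", {"geo": ["US", "CA"], "device": ["mobile"]})
index.add("ad2", {"geo": ["DE"]})  # any device
index.add("ad3", {})               # untargeted

matched = index.match({"geo": "US", "device": "mobile"})
print(sorted(matched))
```

The lookup cost depends on the size of the posting sets touched, not on the total number of ads, which is why this kind of check is cheap enough to run first.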

Eligibility Checks

Advertiser account status (active, suspended, etc.)
Campaign status (running, paused, exhausted)
Ad creative approval status
Budget availability

Policy Compliance

Content policies (prohibited content, brand safety)
Ad format requirements
Legal restrictions (age-gated products, etc.)

All of these hard constraints are deterministic and fast to evaluate, making them ideal for early-stage filtering.
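A minimal sketch of such a deterministic check follows. The `AdCandidate` fields, status strings, and reason codes are assumptions made for illustration, not a real schema. Returning the first failing reason, rather than a bare boolean, makes it easier to count rejections per rule later.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AdCandidate:
    # Hypothetical fields; a real system would read these from its own stores.
    ad_id: str
    account_status: str      # e.g. "active", "suspended"
    campaign_status: str     # e.g. "running", "paused", "exhausted"
    creative_approved: bool
    remaining_budget: float
    bid: float
    age_gated: bool = False


def rejection_reason(ad: AdCandidate, user_is_adult: bool) -> Optional[str]:
    """Return the first failed hard constraint, or None if the ad is eligible."""
    if ad.account_status != "active":
        return "account_not_active"
    if ad.campaign_status != "running":
        return "campaign_not_running"
    if not ad.creative_approved:
        return "creative_not_approved"
    if ad.remaining_budget < ad.bid:
        return "insufficient_budget"
    if ad.age_gated and not user_is_adult:
        return "age_gated"
    return None


ok = AdCandidate("ad1", "active", "running", True, remaining_budget=10.0, bid=1.0)
paused = replace(ok, campaign_status="paused")
broke = replace(ok, remaining_budget=0.5)
gated = replace(ok, age_gated=True)

print(rejection_reason(ok, user_is_adult=True))
print(rejection_reason(paused, user_is_adult=True))
```

Every branch is a field comparison with no model call or network hop, so the check stays cheap enough to apply to the full candidate set.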

Brand Safety Filtering: Advertiser and Publisher Controls

Advertiser Controls

Advertisers can specify:

Block lists: Categories or sites to avoid
Allow lists: Only show on specific sites
Content categories: Avoid certain content types

Publisher Controls

Publishers can specify:

Ad quality standards: Minimum quality scores
Content restrictions: What types of ads are acceptable
Brand safety requirements: Protect their brand reputation

Implementation

Pre-computed lists: Fast lookup tables
Content classification: ML models for content categorization
Real-time checks: Verify against current policies
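One way the pre-computed lists could be combined at request time is sketched below. It assumes the page categories were already produced offline by a content classifier, so the request path only does set lookups. The class names, field names, and category labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdvertiserControls:
    blocked_sites: frozenset = frozenset()
    blocked_categories: frozenset = frozenset()
    allowed_sites: frozenset = frozenset()  # empty means "no allow list"


@dataclass(frozen=True)
class PublisherControls:
    min_quality_score: float = 0.0
    blocked_ad_categories: frozenset = frozenset()


def brand_safe(site, page_categories, ad_category, ad_quality, adv, pub):
    """True if both the advertiser's and the publisher's controls accept the pairing."""
    if adv.allowed_sites and site not in adv.allowed_sites:
        return False  # an allow list exists and this site is not on it
    if site in adv.blocked_sites:
        return False
    if not adv.blocked_categories.isdisjoint(page_categories):
        return False  # the page carries a category the advertiser avoids
    if ad_category in pub.blocked_ad_categories:
        return False
    return ad_quality >= pub.min_quality_score


adv = AdvertiserControls(
    blocked_sites=frozenset({"tabloid.example"}),
    blocked_categories=frozenset({"violence"}),
)
pub = PublisherControls(
    min_quality_score=0.5,
    blocked_ad_categories=frozenset({"gambling"}),
)

print(brand_safe("news.example", {"sports"}, "retail", 0.8, adv, pub))
print(brand_safe("news.example", {"sports", "violence"}, "retail", 0.8, adv, pub))
```

Keeping the classifier off the request path is the design choice that lets brand safety run as an early filter; only pages without a cached classification would need the slower real-time check.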

Why Most Filtering Belongs Early (and What Doesn't)

Early Filtering Benefits

Saves compute: Don't run expensive ML on filtered ads
Reduces latency: Fewer candidates to process downstream
Lowers costs: Less infrastructure needed

What Shouldn't Be Filtered Early

Quality-based filtering: Requires ML predictions
Diversity requirements: Need to see full candidate set
Exploration: New ads need evaluation before filtering

The Cost of Late-Stage Filtering: Wasted Compute and Lost Revenue

Wasted Compute

If filtering happens after ML inference:

Models run on ads that will be filtered anyway
Feature computation for those ads is wasted
Ranking computation for those ads is unnecessary
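The waste is easy to see by counting model calls. The toy sketch below runs the same filter either before or after a stand-in for inference. The 10% pass rate and the function name `count_model_calls` are illustrative assumptions.

```python
def count_model_calls(candidates, passes_filter, filter_first):
    """Count inference calls for a pipeline that filters before or after the model."""
    calls = 0
    survivors = []
    for ad in candidates:
        if filter_first and not passes_filter(ad):
            continue      # rejected before any model work is done
        calls += 1        # stand-in for feature computation plus model inference
        if not filter_first and not passes_filter(ad):
            continue      # rejected only after the model already ran
        survivors.append(ad)
    return calls, survivors


def passes(ad):
    # Hypothetical filter that keeps 10% of candidates.
    return ad % 10 == 0


candidates = list(range(1000))
early_calls, early_survivors = count_model_calls(candidates, passes, filter_first=True)
late_calls, late_survivors = count_model_calls(candidates, passes, filter_first=False)

print(early_calls, late_calls, late_calls - early_calls)
```

Both orderings return the same survivors; the only difference is how many inference calls were spent on ads that were never going to be served.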

Lost Revenue

Late filtering can also hurt revenue:

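Budget exhaustion: An ad that passes the budget check but is filtered afterwards can tie up budget it never spends
Frequency caps: An ad that passes the frequency check but is filtered afterwards can use up an impression slot it never fills
Opportunity cost: Time spent on ads that end up filtered could have gone to better candidates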

Best Practices

Filter as early as possible
Use approximate checks when exact checks are expensive
Cache filtering results when possible
Monitor filtering rates at each stage
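Monitoring can start as simple per-stage counters. The sketch below runs candidates through named stages and records how many each stage saw and rejected. The stage names and predicates are placeholders, and a real system would export these numbers to its metrics backend instead of printing them.

```python
def run_stages(candidates, stages):
    """Apply (name, predicate) stages in order and collect per-stage rejection stats."""
    stats = []
    current = list(candidates)
    for name, predicate in stages:
        kept = [c for c in current if predicate(c)]
        seen = len(current)
        rejected = seen - len(kept)
        stats.append({
            "stage": name,
            "seen": seen,
            "rejected": rejected,
            "reject_rate": rejected / seen if seen else 0.0,
        })
        current = kept
    return current, stats


# Placeholder predicates standing in for real targeting and budget checks.
stages = [
    ("targeting", lambda ad: ad % 2 == 0),
    ("budget", lambda ad: ad % 4 == 0),
]

survivors, stats = run_stages(range(100), stages)
for row in stats:
    print(row)
```

A stage whose rejection rate drifts toward zero is a candidate for moving later or removing, while a sudden spike often points to a misconfigured policy or a stale list.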
