Introduction
Labeling data is expensive. Active learning helps you get more model performance per labeled example by selecting the most informative samples to label.
The Core Idea
Instead of random sampling:
Random: Label 1000 random samples -> Train -> 85% accuracy
Active: Label 300 selected samples -> Train -> 85% accuracy
Selection Strategies
Uncertainty Sampling
Label samples where the model is most uncertain:
import numpy as np

def uncertainty_sampling(model, unlabeled_pool, n_samples):
    probs = model.predict_proba(unlabeled_pool)
    # Entropy as the uncertainty measure; the epsilon avoids log(0)
    entropy = -np.sum(probs * np.log(probs + 1e-10), axis=1)
    # Indices of the n_samples highest-entropy (most uncertain) points
    return np.argsort(entropy)[-n_samples:]
Diversity Sampling
Ensure selected samples cover the feature space:
import numpy as np
from scipy.spatial.distance import cdist

def diversity_sampling(embeddings, n_samples):
    # K-means++-style farthest-point selection
    selected = [np.random.randint(len(embeddings))]  # random first pick
    while len(selected) < n_samples:
        # Distance from each point to its nearest already-selected point
        distances = cdist(embeddings, embeddings[selected]).min(axis=1)
        selected.append(int(np.argmax(distances)))   # farthest point next
    return selected
Combined Approach
Balance uncertainty and diversity:
def balanced_sampling(model, unlabeled, embeddings, n_samples):
    # get_uncertainty, get_diversity, and normalize are helpers: e.g. the
    # entropy and farthest-point scorers above, rescaled to [0, 1]
    uncertainty_scores = get_uncertainty(model, unlabeled)
    diversity_scores = get_diversity(embeddings)
    combined = 0.5 * normalize(uncertainty_scores) + 0.5 * normalize(diversity_scores)
    return np.argsort(combined)[-n_samples:]
Active Learning Loop
Initialize model with small labeled set
Repeat:
1. Train model on labeled data
2. Score unlabeled pool
3. Select top-k samples
4. Get labels from annotators
5. Add to labeled set
Until: Budget exhausted or performance goal met
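The loop above can be sketched in code. Here `train`, `score`, and `annotate` are hypothetical callbacks standing in for your own training, pool-scoring, and annotation steps; the budget, batch size, and target are illustrative parameters.

```python
import numpy as np

def active_learning_loop(train, score, annotate, labeled, unlabeled,
                         batch_size=10, budget=100, target=0.95):
    """Pool-based active learning loop (sketch).

    train(labeled)     -> (model, validation_accuracy)
    score(model, pool) -> informativeness score per pool item
    annotate(items)    -> labels for the selected items
    """
    spent = 0
    model, acc = train(labeled)                      # 1. initial training
    while spent < budget and acc < target and len(unlabeled) > 0:
        scores = score(model, unlabeled)             # 2. score the pool
        top = set(np.argsort(scores)[-batch_size:])  # 3. select top-k
        batch = [unlabeled[i] for i in top]
        labeled.extend(zip(batch, annotate(batch)))  # 4-5. label and add
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in top]
        spent += len(batch)
        model, acc = train(labeled)                  # 1. retrain
    return model
```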
Practical Considerations
Cold Start
Need initial labels to start:
- Random sample to bootstrap
- Use pretrained model for initial scoring
- Expert-selected seed set
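The simplest bootstrap is a uniform random seed set; a minimal sketch (the function name and seeding are illustrative):

```python
import numpy as np

def random_seed_set(n_pool, n_seed, seed=0):
    # Cold start: uniform random sample of pool indices to label first
    rng = np.random.default_rng(seed)
    return rng.choice(n_pool, size=n_seed, replace=False)
```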
Batch Selection
Real-world constraints:
- Can't label one at a time
- Batch queries to annotators
- Consider intra-batch diversity
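One way to get intra-batch diversity is a greedy pick: take the most uncertain point, then repeatedly take the point with the best product of uncertainty and distance to the batch so far. A sketch, assuming non-negative uncertainty scores:

```python
import numpy as np
from scipy.spatial.distance import cdist

def diverse_batch(uncertainty, embeddings, k):
    # Seed the batch with the single most uncertain point
    selected = [int(np.argmax(uncertainty))]
    while len(selected) < k:
        # Distance from each point to its nearest already-selected point
        dist = cdist(embeddings, embeddings[selected]).min(axis=1)
        # Already-selected points score 0 (zero distance), so they
        # are never picked again
        score = uncertainty * dist
        selected.append(int(np.argmax(score)))
    return selected
```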
Stopping Criteria
When to stop labeling:
- Budget exhausted
- Performance plateau
- Diminishing returns
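A plateau check can be as simple as comparing accuracy across the last few labeling rounds; the window and threshold below are illustrative:

```python
def performance_plateaued(history, window=3, min_gain=0.005):
    """Stop when accuracy gain over the last `window` rounds
    falls below `min_gain` (sketch)."""
    if len(history) <= window:
        return False
    return history[-1] - history[-1 - window] < min_gain
```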
Production Implementation
Infrastructure
Unlabeled Data -> Selection Service -> Annotation Queue -> Labeled Data
                  (model inference)    (human annotators)     (retrain)
Monitoring
Track:
- Selection distribution
- Annotation quality
- Model improvement per batch
- Cost per accuracy point
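The last metric can be computed directly from the accuracy history; a sketch, assuming a flat per-label cost and fixed batch size:

```python
def cost_per_point(acc_history, cost_per_label, batch_size):
    """Labeling cost per accuracy point gained, per batch (sketch)."""
    costs = []
    for prev, curr in zip(acc_history, acc_history[1:]):
        gain = (curr - prev) * 100              # accuracy points gained
        spend = cost_per_label * batch_size     # cost of this batch
        costs.append(float('inf') if gain <= 0 else spend / gain)
    return costs
```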
Results
Typical improvements:
- 3-10x reduction in labeling cost
- Faster time to target accuracy
- Better coverage of edge cases
Apply active learning in your projects with our ML courses.