Design Pattern · 2024-07-10 · 10 min read

Active Learning in Machine Learning: Efficient Data Labeling

How to use active learning to reduce labeling costs while maintaining model quality through intelligent sample selection.

Tags: active learning, labeling, data efficiency, annotation

Introduction

Labeling data is expensive. Active learning helps you get more model performance per labeled example by selecting the most informative samples to label.

The Core Idea

Instead of labeling a random sample of your data, label the examples the model will learn the most from. An illustrative comparison:

Random: Label 1000 random samples -> Train -> 85% accuracy
Active: Label 300 selected samples -> Train -> 85% accuracy

Selection Strategies

Uncertainty Sampling

Label samples where the model is most uncertain:

import numpy as np

def uncertainty_sampling(model, unlabeled_pool, n_samples):
    """Pick the n_samples the model is least confident about."""
    predictions = model.predict_proba(unlabeled_pool)
    # Entropy as the uncertainty measure; the epsilon avoids log(0)
    entropy = -np.sum(predictions * np.log(predictions + 1e-10), axis=1)
    # Highest-entropy samples are the most uncertain
    return np.argsort(entropy)[-n_samples:]

Diversity Sampling

Ensure selected samples cover the feature space:

from scipy.spatial.distance import cdist

def diversity_sampling(embeddings, n_samples):
    """K-means++-style farthest-point selection over the embedding space."""
    selected = []
    for _ in range(n_samples):
        if not selected:
            # Seed with a random point
            selected.append(np.random.randint(len(embeddings)))
        else:
            # Distance from each point to its nearest already-selected point
            distances = cdist(embeddings, embeddings[selected]).min(axis=1)
            # Pick the point farthest from everything chosen so far
            selected.append(int(np.argmax(distances)))
    return selected

Combined Approach

Balance uncertainty and diversity:

def balanced_sampling(model, unlabeled, embeddings, n_samples):
    # get_uncertainty and get_diversity stand in for per-sample scorers,
    # e.g. prediction entropy and distance to the nearest labeled neighbor
    uncertainty_scores = get_uncertainty(model, unlabeled)
    diversity_scores = get_diversity(embeddings)

    def normalize(x):
        # Rescale scores to [0, 1] so the two terms are comparable
        return (x - x.min()) / (x.max() - x.min() + 1e-10)

    combined = 0.5 * normalize(uncertainty_scores) + 0.5 * normalize(diversity_scores)
    return np.argsort(combined)[-n_samples:]

Active Learning Loop

Initialize model with small labeled set
Repeat:
    1. Train model on labeled data
    2. Score unlabeled pool
    3. Select top-k samples
    4. Get labels from annotators
    5. Add to labeled set
Until: Budget exhausted or performance goal met
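The loop above can be sketched in a few lines of Python. This is a minimal sketch, not a production harness: the `oracle(i)` callable (returns a label for pool index `i`) and the `select_fn(model, X, k)` signature (returns `k` indices into `X`) are assumptions introduced here, not part of the article.

```python
import numpy as np

def active_learning_loop(model, X_seed, y_seed, X_pool, oracle, select_fn,
                         batch_size=10, budget=50):
    """Train, score the pool, select, annotate, repeat until budget runs out."""
    X_lab, y_lab = list(X_seed), list(y_seed)
    pool_idx = list(range(len(X_pool)))
    while budget > 0 and pool_idx:
        model.fit(np.array(X_lab), np.array(y_lab))     # 1. train on labeled data
        k = min(batch_size, budget, len(pool_idx))
        chosen = select_fn(model, X_pool[pool_idx], k)  # 2-3. score pool, pick top-k
        picked = [pool_idx[i] for i in chosen]
        for idx in picked:                              # 4. get labels from oracle
            X_lab.append(X_pool[idx])
            y_lab.append(oracle(idx))                   # 5. add to labeled set
        picked_set = set(picked)
        pool_idx = [i for i in pool_idx if i not in picked_set]
        budget -= k
    model.fit(np.array(X_lab), np.array(y_lab))
    return model, len(X_lab)
```

Plug in `uncertainty_sampling` (wrapped to match `select_fn`) or `balanced_sampling` as the selection strategy.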

Practical Considerations

Cold Start

The model needs some initial labels before it can score anything:

  • Random sample to bootstrap
  • Use pretrained model for initial scoring
  • Expert-selected seed set

Batch Selection

Real-world constraints:

  • Can't label one at a time
  • Batch queries to annotators
  • Consider intra-batch diversity
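One way to get intra-batch diversity is a greedy trade-off: pick the most uncertain sample first, then repeatedly pick the sample that balances uncertainty against distance to the batch chosen so far. A sketch, where the `alpha` weighting is an assumption, not something the article prescribes:

```python
import numpy as np

def batch_select(uncertainty, embeddings, batch_size, alpha=0.5):
    """Greedily build a batch that is both uncertain and spread out."""
    selected = [int(np.argmax(uncertainty))]  # start with the most uncertain
    while len(selected) < batch_size:
        # Distance of every candidate to its nearest already-selected sample
        diffs = embeddings[:, None, :] - embeddings[selected][None, :, :]
        min_dist = np.sqrt((diffs ** 2).sum(-1)).min(axis=1)
        # Blend uncertainty with normalized distance; never re-pick a sample
        score = alpha * uncertainty + (1 - alpha) * min_dist / (min_dist.max() + 1e-10)
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected
```

Without the distance term, a batch tends to cluster around one confusing region and the annotations are partly redundant.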

Stopping Criteria

When to stop labeling:

  • Budget exhausted
  • Performance plateau
  • Diminishing returns
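These criteria can be combined into one check that runs after each batch. A sketch; the `patience` and `min_gain` thresholds are illustrative defaults, not recommendations from the article:

```python
def should_stop(accuracy_history, budget_left, patience=3, min_gain=0.005):
    """Stop when the budget is gone, or when each of the last `patience`
    batches improved accuracy by less than `min_gain` (a plateau)."""
    if budget_left <= 0:
        return True
    if len(accuracy_history) <= patience:
        return False  # not enough history to judge a plateau
    recent = accuracy_history[-(patience + 1):]
    gains = [b - a for a, b in zip(recent, recent[1:])]
    return all(g < min_gain for g in gains)
```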

Production Implementation

Infrastructure

Unlabeled Data -> Selection Service -> Annotation Queue -> Labeled Data
                        |                    |                  |
                   (model inference)    (human annotators)   (retrain)

Monitoring

Track:

  • Selection distribution
  • Annotation quality
  • Model improvement per batch
  • Cost per accuracy point
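The last metric is the most decision-relevant: how much each additional point of accuracy costs as batches accumulate. A minimal sketch of that computation, assuming you log cumulative annotation spend and model accuracy after each batch:

```python
def cost_per_accuracy_point(costs, accuracies):
    """Marginal labeling cost per percentage point of accuracy gained,
    batch by batch. Inputs are cumulative per-batch logs."""
    out = []
    for i in range(1, len(costs)):
        gain = (accuracies[i] - accuracies[i - 1]) * 100  # points of accuracy
        spend = costs[i] - costs[i - 1]
        # Infinite cost when a batch bought no improvement
        out.append(spend / gain if gain > 0 else float("inf"))
    return out
```

A rising trend in this series is a concrete signal of the diminishing returns mentioned under Stopping Criteria.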

Results

Typical improvements:

  • 3-10x reduction in labeling cost
  • Faster time to target accuracy
  • Better coverage of edge cases

Apply active learning in your projects with our ML courses.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.