Introduction
Labeling data is expensive. Active learning helps you get more model performance per labeled example by selecting the most informative samples to label.
The Core Idea
Instead of random sampling:
Random: Label 1000 random samples -> Train -> 85% accuracy
Active: Label 300 selected samples -> Train -> 85% accuracy
Selection Strategies
Uncertainty Sampling
Label samples where the model is most uncertain:
import numpy as np

def uncertainty_sampling(model, unlabeled_pool, n_samples):
    probs = model.predict_proba(unlabeled_pool)
    # Entropy as the uncertainty measure; the epsilon avoids log(0)
    entropy = -np.sum(probs * np.log(probs + 1e-10), axis=1)
    # Indices of the n_samples highest-entropy (most uncertain) points
    return np.argsort(entropy)[-n_samples:]
Diversity Sampling
Ensure selected samples cover the feature space:
import numpy as np
from scipy.spatial.distance import cdist

def diversity_sampling(embeddings, n_samples):
    # K-means++-style farthest-point selection
    selected = [np.random.randint(len(embeddings))]  # random first pick
    while len(selected) < n_samples:
        # Distance from each point to its nearest already-selected point
        distances = cdist(embeddings, embeddings[selected]).min(axis=1)
        selected.append(int(np.argmax(distances)))   # farthest point next
    return selected
Combined Approach
Balance uncertainty and diversity:
def balanced_sampling(model, unlabeled, embeddings, n_samples):
    # get_uncertainty, get_diversity, and normalize are helpers: e.g. the
    # entropy and farthest-point scorers above, rescaled to [0, 1]
    uncertainty_scores = get_uncertainty(model, unlabeled)
    diversity_scores = get_diversity(embeddings)
    combined = 0.5 * normalize(uncertainty_scores) + 0.5 * normalize(diversity_scores)
    return np.argsort(combined)[-n_samples:]
Active Learning Loop
Initialize model with small labeled set
Repeat:
1. Train model on labeled data
2. Score unlabeled pool
3. Select top-k samples
4. Get labels from annotators
5. Add to labeled set
Until: Budget exhausted or performance goal met
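The loop above can be sketched in code. Here `train`, `score`, and `annotate` are hypothetical callbacks standing in for your own training, pool-scoring, and annotation steps; the budget, batch size, and target are illustrative parameters.

```python
import numpy as np

def active_learning_loop(train, score, annotate, labeled, unlabeled,
                         batch_size=10, budget=100, target=0.95):
    """Pool-based active learning loop (sketch).

    train(labeled)     -> (model, validation_accuracy)
    score(model, pool) -> informativeness score per pool item
    annotate(items)    -> labels for the selected items
    """
    spent = 0
    model, acc = train(labeled)                      # 1. initial training
    while spent < budget and acc < target and len(unlabeled) > 0:
        scores = score(model, unlabeled)             # 2. score the pool
        top = set(np.argsort(scores)[-batch_size:])  # 3. select top-k
        batch = [unlabeled[i] for i in top]
        labeled.extend(zip(batch, annotate(batch)))  # 4-5. label and add
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in top]
        spent += len(batch)
        model, acc = train(labeled)                  # 1. retrain
    return model
```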
Practical Considerations
Cold Start
Need initial labels to start:
- Random sample to bootstrap
- Use pretrained model for initial scoring
- Expert-selected seed set
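The simplest bootstrap is a uniform random seed set; a minimal sketch (the function name and seeding are illustrative):

```python
import numpy as np

def random_seed_set(n_pool, n_seed, seed=0):
    # Cold start: uniform random sample of pool indices to label first
    rng = np.random.default_rng(seed)
    return rng.choice(n_pool, size=n_seed, replace=False)
```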
Batch Selection
Real-world constraints:
- Can't label one at a time
- Batch queries to annotators
- Consider intra-batch diversity
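One way to get intra-batch diversity is a greedy pick: take the most uncertain point, then repeatedly take the point with the best product of uncertainty and distance to the batch so far. A sketch, assuming non-negative uncertainty scores:

```python
import numpy as np
from scipy.spatial.distance import cdist

def diverse_batch(uncertainty, embeddings, k):
    # Seed the batch with the single most uncertain point
    selected = [int(np.argmax(uncertainty))]
    while len(selected) < k:
        # Distance from each point to its nearest already-selected point
        dist = cdist(embeddings, embeddings[selected]).min(axis=1)
        # Already-selected points score 0 (zero distance), so they
        # are never picked again
        score = uncertainty * dist
        selected.append(int(np.argmax(score)))
    return selected
```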
Stopping Criteria
When to stop labeling:
- Budget exhausted
- Performance plateau
- Diminishing returns
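A plateau check can be as simple as comparing accuracy across the last few labeling rounds; the window and threshold below are illustrative:

```python
def performance_plateaued(history, window=3, min_gain=0.005):
    """Stop when accuracy gain over the last `window` rounds
    falls below `min_gain` (sketch)."""
    if len(history) <= window:
        return False
    return history[-1] - history[-1 - window] < min_gain
```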
Production Implementation
Infrastructure
Unlabeled Data -> Selection Service -> Annotation Queue -> Labeled Data
                  (model inference)    (human annotators)     (retrain)
Monitoring
Track:
- Selection distribution
- Annotation quality
- Model improvement per batch
- Cost per accuracy point
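The last metric can be computed directly from the accuracy history; a sketch, assuming a flat per-label cost and fixed batch size:

```python
def cost_per_point(acc_history, cost_per_label, batch_size):
    """Labeling cost per accuracy point gained, per batch (sketch)."""
    costs = []
    for prev, curr in zip(acc_history, acc_history[1:]):
        gain = (curr - prev) * 100              # accuracy points gained
        spend = cost_per_label * batch_size     # cost of this batch
        costs.append(float('inf') if gain <= 0 else spend / gain)
    return costs
```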
Results
Typical improvements:
- 3-10x reduction in labeling cost
- Faster time to target accuracy
- Better coverage of edge cases
Apply active learning in your projects with our ML courses.