Tutorial · 2025-03-25 · 10 min read

Scaling Laws for Practitioners: What They Mean for Your ML Work

Understand what neural network scaling laws actually mean for day-to-day ML engineering decisions. Learn how to apply Chinchilla and inference-scaling insights without needing a research background.

Tags: scaling laws · LLM training · compute · model size

Why Scaling Laws Matter Beyond Research Labs

Scaling laws sound like an academic topic. They're not. They have direct implications for decisions you make in production ML:

  • Should I train a larger model with less data, or a smaller model with more?
  • Is my model undertrained or oversized for my compute budget?
  • When does throwing more compute at a problem actually help?
  • Which model should I use as my base for fine-tuning?

Understanding scaling laws gives you intuition for answering these questions.

The Core Finding

Performance on language tasks follows a power law with respect to three variables:

  1. Model parameters (N): more parameters → lower loss
  2. Training tokens (D): more data → lower loss
  3. Compute (C ≈ 6ND): more compute → lower loss

Crucially: these variables trade off against each other. Given a fixed compute budget, there's an optimal split between model size and training data.
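The compute approximation C ≈ 6ND turns directly into a quick estimator (a sketch; the factor 6 is the usual forward-plus-backward FLOPs per parameter per token):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

# A 7B-parameter model trained on 140B tokens:
print(f"{training_flops(7e9, 140e9):.2e}")  # 5.88e+21
```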

Chinchilla: The Compute-Optimal Training Point

The 2022 DeepMind paper "Training Compute-Optimal Large Language Models" (the "Chinchilla paper") established a simple rule:

For compute-optimal training, scale model parameters and training tokens equally.

Chinchilla optimal: ~20 training tokens per parameter.

So for a 7B parameter model:

  • Chinchilla-optimal training: 7B × 20 = 140B tokens
  • GPT-3 (175B params) was trained on 300B tokens — undertrained by Chinchilla standards

These rules of thumb translate directly into code:
def chinchilla_optimal_tokens(model_params: int) -> int:
    """Estimate compute-optimal training tokens for a given model size."""
    return 20 * model_params

def chinchilla_optimal_model_size(training_tokens: int) -> int:
    """Estimate optimal model size for a given data budget."""
    return training_tokens // 20

# Examples
print(chinchilla_optimal_tokens(7_000_000_000))    # 140B tokens
print(chinchilla_optimal_tokens(70_000_000_000))   # 1.4T tokens
print(chinchilla_optimal_model_size(1_000_000_000_000))  # 50B params

The Inference-Optimal Shift

Chinchilla optimizes for training compute. But in production, you pay for inference continuously while training is one-time.

LLaMA's insight: train a smaller model on far more tokens than Chinchilla suggests. The model is slightly worse during training than a Chinchilla-optimal model of the same compute, but it's smaller and therefore faster and cheaper to serve.

Training-optimal:  70B params, 1.4T tokens (Chinchilla)
Inference-optimal: 8B params, 15T tokens (LLaMA-3 style)

Both use total training compute of the same order of magnitude.
The 8B model is roughly 9x cheaper to serve and nearly as good.

This is why LLaMA-3-8B can outperform GPT-3 (175B) on many benchmarks despite having 20x fewer parameters — it's been trained on far more data.
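As a rough sanity check on this trade-off, here is a sketch comparing training compute for the two styles of configuration, using Chinchilla's 70B / 1.4T setup and LLaMA-3-8B's reported 15T tokens, and approximating per-token serving cost as proportional to parameter count:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute: C ~= 6 * N * D."""
    return 6 * n_params * n_tokens

chinchilla = training_flops(70e9, 1.4e12)  # ~5.9e23 FLOPs
llama3_8b = training_flops(8e9, 15e12)     # ~7.2e23 FLOPs
print(f"compute ratio: {llama3_8b / chinchilla:.2f}")  # compute ratio: 1.22

# Per-token inference cost scales roughly with parameter count:
print(f"serving cost: {70e9 / 8e9:.1f}x cheaper")  # serving cost: 8.8x cheaper
```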

What This Means for Your Work

Choosing a Base Model for Fine-tuning

Prefer heavily trained smaller models over undertrained larger models:

If you're fine-tuning for a specific task:

Good: LLaMA-3-8B (trained on 15T tokens) — well-initialized weights, efficient to serve
Risky: A 30B model trained on 200B tokens — large and still undertrained

Rule of thumb: the tokens-per-parameter ratio matters. Higher is usually better, up to a point.
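One way to operationalize the ratio when screening candidate base models (a sketch; the 30B / 200B figures are the hypothetical example from above):

```python
def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens per parameter. ~20 is Chinchilla-optimal;
    much higher suggests a heavily trained, inference-friendly model."""
    return n_tokens / n_params

# LLaMA-3-8B: 8B params, ~15T reported training tokens
print(tokens_per_param(8e9, 15e12))   # 1875.0 -- far past Chinchilla-optimal
# Hypothetical 30B model trained on only 200B tokens
print(tokens_per_param(30e9, 200e9))  # ~6.7 -- undertrained
```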

Compute Budget Allocation

If you have a fixed compute budget for a new ML project:

import math

def compute_budget_split(total_flops: float):
    """
    Given a total compute budget, return the compute-optimal
    model size and training-token count.

    Based on Chinchilla scaling laws:
    C = 6 * N * D, with D = 20 * N
    => N = (C / 120)^0.5, D = 20 * N
    """
    N = math.sqrt(total_flops / 120)
    D = 20 * N
    return {"model_params": N, "training_tokens": D}

# Example: 1e23 FLOPs (roughly GPT-3 scale)
split = compute_budget_split(1e23)
print(f"Optimal model: {split['model_params']/1e9:.1f}B params")
print(f"Optimal data: {split['training_tokens']/1e12:.1f}T tokens")

When Scaling Doesn't Help

Scaling laws break down at:

  • Data quality issues: More bad data doesn't help, it hurts
  • Distribution mismatch: Scaling doesn't fix training-serving distribution gap
  • Task-specific needs: A 100B model trained on web text may underperform a 1B model fine-tuned on your domain
  • Emergent capabilities: Some capabilities appear only above certain scale thresholds — unpredictable

Reading Loss Curves Through a Scaling Lens

                    Scaling Curve
Val Loss │
         │\
         │ \
         │  \ ← Steep region: data/compute underspend
         │   \
         │    \___
         │        \___
         │             \_____ ← Flatter region: diminishing returns
         └──────────────────── Compute (log scale)

If your validation loss is still in the steep region of this curve, adding more data or compute will help a lot. If you're in the flat region, you may need architectural changes or better data.
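One rough way to check which region you are in: fit the local slope of log validation loss against log compute across recent checkpoints (a sketch; the checkpoint numbers below are invented for illustration):

```python
import math

def loglog_slope(compute: list, loss: list) -> float:
    """Least-squares slope of log(loss) vs log(compute).
    Strongly negative -> steep region (more compute/data will help);
    near zero -> diminishing returns."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Illustrative checkpoints: compute (FLOPs) vs validation loss
print(loglog_slope([1e20, 1e21, 1e22], [3.2, 2.8, 2.5]))
```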

Test-Time Compute Scaling

A newer dimension not in Chinchilla: inference-time compute. The insight from o1, DeepSeek-R1, and similar models: letting the model "think longer" at inference time can match or beat training more parameters.

Traditional scaling: more parameters → better
Test-time scaling:   more inference steps (chain-of-thought, search) → better

This is why reasoning models (o1, Gemini Thinking) often outperform larger base models on hard problems — they spend compute at inference time rather than encoding more knowledge in weights.

For practitioners: if you're working on a task where correctness matters more than latency (code generation, math, planning), test-time compute (sampling multiple solutions, verifying, selecting best) is often more cost-effective than a larger model.
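A minimal best-of-n sketch of this idea; `generate` and `score` below are hypothetical stand-ins for your model call and a task-specific verifier:

```python
import random

def best_of_n(generate, score, prompt: str, n: int = 8):
    """Test-time compute scaling: sample n candidate solutions,
    score each with a verifier, and return the best one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a "model" that guesses and a verifier that
# rewards closeness to a known target.
random.seed(0)
guess = lambda _prompt: random.randint(0, 100)
closeness = lambda x: -abs(x - 42)

print(best_of_n(guess, closeness, "find 42", n=16))
```

The same pattern extends to real setups: swap in an LLM sampling call for `generate` and a unit-test runner or checker for `score`.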


See how scaling insights apply to production LLM deployment in our LLM inference guide.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.