design pattern · 2025-01-25 · 13 min read

Speculative Decoding: How to Get 3-5x LLM Throughput Without Changing the Model

A practical guide to speculative decoding — the inference optimization technique used by Google, Meta, and others to dramatically accelerate LLM serving without any model quality loss.

Tags: speculative decoding, LLM inference, throughput optimization, serving, latency

Introduction

Serving large language models at scale is expensive. Each token requires a full forward pass through the model — for a 70B parameter model, that's billions of multiply-accumulate operations per token. Generating a 500-token response means 500 serial forward passes.

Speculative decoding breaks this bottleneck. By using a small "draft" model to propose tokens that a large "verifier" model accepts or rejects, it achieves 3-5x throughput improvements with zero quality degradation.

This post explains the algorithm, its variants, and how to implement it in production.

The Core Algorithm

The Problem with Autoregressive Decoding

Standard LLM generation is inherently serial:

token_1 = model(prompt)
token_2 = model(prompt + token_1)
token_3 = model(prompt + token_1 + token_2)
...

Each step depends on the previous output. You can't parallelize across the token dimension.

Speculative Decoding's Key Insight

The insight: what if we could check many candidate tokens at once?

LLM decoding forward passes are memory-bandwidth bound, not compute bound: wall-clock time is dominated by streaming model weights from GPU memory, not by arithmetic. As a result, a single forward pass over a sequence of length K takes roughly the same wall-clock time as a forward pass over length 1, up to a point.

Speculative decoding exploits this:

1. Draft: small model autoregressively proposes K candidate tokens
2. Verify: large model processes all K candidates in ONE forward pass
3. Accept/reject: accept tokens where draft distribution ≈ target distribution
4. Correction: sample the first rejected token from a corrected distribution
5. Repeat from accepted position

Because the expected number of tokens accepted per large-model forward pass is greater than one, overall throughput increases.
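To make the loop concrete, here is a minimal pure-Python sketch over a toy three-token vocabulary. The draft_model and target_model functions are made-up stand-ins that return fixed distributions rather than real networks, and the per-position verification loop stands in for what would be a single batched forward pass in a real system:

```python
import random

random.seed(0)
VOCAB = [0, 1, 2]  # toy 3-token vocabulary

# Hypothetical stand-ins for real models: each maps a context
# (list of token ids) to a distribution over VOCAB.
def draft_model(ctx):
    return [0.6, 0.3, 0.1]

def target_model(ctx):
    return [0.5, 0.4, 0.1]

def sample(dist):
    return random.choices(VOCAB, weights=dist, k=1)[0]

def speculative_step(ctx, K=4):
    """One draft-then-verify round; returns the tokens to append."""
    # 1. Draft: the small model proposes K tokens autoregressively.
    drafted, draft_dists = [], []
    for _ in range(K):
        d = draft_model(ctx + drafted)
        drafted.append(sample(d))
        draft_dists.append(d)
    # 2. Verify: a real system scores all positions in ONE target
    #    forward pass; here we just query the toy model per position.
    accepted = []
    for i, tok in enumerate(drafted):
        p = target_model(ctx + accepted)  # accepted == drafted[:i] here
        q = draft_dists[i]
        # 3. Accept with probability min(1, p(tok)/q(tok)).
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # 4. Rejected: sample from the corrected residual distribution.
            residual = [max(0.0, p[v] - q[v]) for v in VOCAB]
            z = sum(residual)
            accepted.append(sample([r / z for r in residual]))
            return accepted  # 5. restart drafting from this position
    # All K accepted: the verifier's pass also yields one bonus token.
    accepted.append(sample(target_model(ctx + accepted)))
    return accepted

print(speculative_step([], K=4))
```

Each round therefore emits between 1 token (immediate rejection plus correction) and K+1 tokens (all drafts accepted plus the bonus token).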

The Mathematics of Acceptance

The acceptance criterion ensures the output distribution is identical to what the large model would have generated:

For position i, if draft token x_i is sampled:

Accept with probability: min(1, p_target(x_i) / p_draft(x_i))

If rejected, sample from a "corrected" distribution:

p_corrected(x) = max(0, p_target(x) - p_draft(x)) / Σ_x' max(0, p_target(x') - p_draft(x'))

This is a modified rejection sampling scheme that guarantees the accepted token comes from p_target.

Key property: The output distribution is provably identical to greedy/temperature sampling from the large model alone. No quality tradeoff.
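This guarantee can be checked numerically. With made-up draft and target distributions over a three-token vocabulary, the total probability of emitting each token (the accept path plus the rejection-and-correction path) works out to exactly p_target:

```python
# Exact check (no sampling) that accept-or-correct reproduces p_target.
p_draft  = [0.6, 0.3, 0.1]   # q: draft distribution (made-up numbers)
p_target = [0.5, 0.4, 0.1]   # p: target distribution (made-up numbers)

# Overall probability that the drafted token is rejected.
reject_prob = sum(q * (1 - min(1.0, p / q))
                  for p, q in zip(p_target, p_draft))

residual = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
z = sum(residual)  # note: z equals reject_prob

# P(emit x) = P(draft x and accept) + P(reject) * p_corrected(x)
out = [q * min(1.0, p / q) + reject_prob * r / z
       for p, q, r in zip(p_target, p_draft, residual)]

print(out)  # matches p_target up to floating-point error
```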

Factors Affecting Speedup

1. Draft acceptance rate (α)

If each draft token is accepted with probability α (treated as independent across positions), the expected number of tokens generated per large-model call is:

E[tokens] = (1 - α^(K+1)) / (1 - α)

With α=0.8 and K=5: ~3.7 tokens per call → ~3.7x speedup before overhead.
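As a quick sanity check of the formula:

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected accepted tokens per target-model call.

    Sum of the geometric series 1 + alpha + ... + alpha^k: the round
    yields i+1 tokens when exactly the first i drafts are accepted.
    """
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens(0.8, 5))  # → 3.68928
```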

2. Draft model efficiency

The draft model must be:

  • Fast: Much smaller than the target model (typically 5-15x fewer parameters)
  • Aligned: Trained on similar data and objectives as the target model
  • Available: Often a smaller model from the same family (e.g., Llama-3-8B drafting for Llama-3-70B)

3. Sequence length K

Longer drafts amortize the verification cost more but increase the probability of rejection:

  • Too short: underutilizes the parallel verification
  • Too long: many tokens wasted on rejection

K=5-8 is typically optimal for code and structured outputs; K=3-4 for more open-ended generation.
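A back-of-envelope way to pick K is to divide the expected tokens per round by the round's cost. The sketch below assumes a simple cost model in which one draft token costs a fraction c of a target forward pass; the c=0.1 value is illustrative, not measured, and real optima depend on hardware and batching:

```python
def speedup(alpha: float, k: int, c: float) -> float:
    """Expected speedup per round: tokens generated divided by round cost,
    where the round costs k draft passes (at c each) plus 1 target pass."""
    e_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return e_tokens / (k * c + 1)

# Sweep K for a code-like workload (alpha=0.85) and a more open-ended
# one (alpha=0.65), with the draft at 10% of a target pass (c=0.1).
for alpha in (0.85, 0.65):
    best = max(range(1, 13), key=lambda k: speedup(alpha, k, 0.1))
    print(alpha, best, round(speedup(alpha, best, 0.1), 2))
# → 0.85 7 2.85
# → 0.65 3 1.81
```

Under these illustrative assumptions, the optimum shifts from K≈7 at high acceptance to K≈3 at low acceptance, matching the intuition that longer drafts only pay off when rejections are rare.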

Implementation Variants

Self-Speculative Decoding

No separate draft model needed: use early exit from the same model.

The large model runs a subset of its layers to produce draft logits, then runs all layers to verify. Saves the operational complexity of managing two models.

Tradeoff: The "draft" quality depends on how well early layers approximate the full model — typically lower acceptance rates than a well-matched external draft model.

Medusa

Instead of a separate draft model, Medusa adds multiple "heads" to the existing model, each predicting tokens K steps ahead:

LLM backbone → [head_0: next token]
             → [head_1: token+1]
             → [head_2: token+2]
             → ...

All heads run in parallel during verification. Training is cheap (only the heads are trained). Used in production by several inference providers.

Lookahead Decoding

Uses Jacobi iteration to generate candidate tokens without a separate draft model at all. Particularly useful when no good small model is available in the same family.

Tree-Based Speculation (SpecTree)

Instead of a single draft sequence, generate a tree of candidates:

              token_A
             /       \
       token_B       token_C
          |             |
       token_D       token_E

The verifier evaluates the entire tree in one pass using tree attention masks. Accepts the longest valid path. Higher acceptance probability but more complex implementation.
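One way to sketch the tree attention mask is from parent pointers: each candidate may attend only to itself and its ancestors, which is what lets the verifier score every root-to-leaf path in a single forward pass. A minimal pure-Python version (the indices and tree shape are illustrative):

```python
def tree_attention_mask(parent):
    """parent[i] is the index of node i's parent, or -1 for the root.

    Returns mask where mask[i][j] is True iff position i may attend to
    position j, i.e. j is i itself or one of i's ancestors.
    """
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, enabling each ancestor
            mask[i][j] = True
            j = parent[j]
    return mask

# A tree like the one above: A=0 is the root with children B=1 and
# C=2; D=3 extends B, and E=4 extends C.
mask = tree_attention_mask([-1, 0, 0, 1, 2])
print(mask[3])  # D sees A, B, D → [True, True, False, True, False]
```

A real implementation would turn this boolean matrix into an additive attention bias; the per-node ancestor walk here is just the clearest way to state the rule.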

Production Considerations

Batching Complications

Speculative decoding is most naturally suited to single-sequence generation. Batched inference complicates things:

  • Different sequences in the batch may have different acceptance rates
  • Verification advances the batch in lockstep, so sequences with long accepted prefixes wait on those with short ones
  • Solutions: separate queues for single vs. batch requests, dynamic speculation depth

Memory Overhead

  • Draft model KV cache: adds memory pressure
  • Two sets of KV caches to manage
  • Solution: offload draft KV cache to CPU, or use a much smaller draft model
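To estimate the extra memory, the draft model's KV cache can be sized with the usual formula (2 tensors for K and V × layers × KV heads × head dim × tokens × bytes per value). The shape below is loosely modeled on a Llama-3-8B-style draft with grouped-query attention; the numbers are illustrative:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V), one per layer, each holding
    kv_heads * head_dim values per token, for seq_len * batch tokens."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative: 32 layers, 8 KV heads (GQA), head_dim 128, 4k context,
# batch 8, fp16 — roughly a Llama-3-8B-shaped draft model.
draft_gib = kv_cache_bytes(32, 8, 128, 4096, 8) / 2**30
print(draft_gib)  # → 4.0
```

Even a modest draft model can add several GiB of KV cache at realistic batch sizes, which is why offloading or shrinking the draft matters.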

Latency vs. Throughput Tradeoff

Speculative decoding optimizes throughput (tokens/second across all requests). For latency-sensitive applications (time to first token), it doesn't help and may hurt slightly due to draft overhead.

Use speculative decoding for:

  • Batch offline inference
  • High-throughput serving with queuing
  • Code generation (high acceptance rates)

Avoid for:

  • Streaming where latency matters most
  • Highly creative tasks with diverse outputs (low acceptance rates)

Acceptance Rates by Task Type

From production data across several deployments:

Task                  Typical α    Approx. speedup
Code completion       0.85-0.92    4-6x
Text summarization    0.75-0.85    3-4x
Translation           0.80-0.88    3.5-5x
Creative writing      0.60-0.75    2-3x
Math/reasoning        0.65-0.80    2.5-3.5x

Frameworks Supporting Speculative Decoding

  • vLLM: Built-in speculative decoding with draft model support
  • TensorRT-LLM: NVIDIA's implementation optimized for GPU batching
  • HuggingFace TGI: Supports Medusa and standard speculation
  • SGLang: Fast serving with speculative decoding support

Conclusion

Speculative decoding is one of the most impactful inference optimizations available today: free quality, 3-5x throughput, and production-ready in major frameworks. The main cost is operational complexity — managing two models and tuning K for your workload.

For high-volume LLM serving, it's one of the first optimizations you should reach for.


Want to go deeper on LLM infrastructure? Explore our full LLM Inference at Scale curriculum.
