Introduction
Serving large language models at scale is expensive. Each token requires a full forward pass through the model — for a 70B parameter model, that's billions of multiply-accumulate operations per token. Generating a 500-token response means 500 serial forward passes.
Speculative decoding breaks this bottleneck. By using a small "draft" model to propose tokens that a large "verifier" model accepts or rejects, it achieves 3-5x throughput improvements with zero quality degradation.
This post explains the algorithm, its variants, and how to implement it in production.
The Core Algorithm
The Problem with Autoregressive Decoding
Standard LLM generation is inherently serial:
token_1 = model(prompt)
token_2 = model(prompt + token_1)
token_3 = model(prompt + token_1 + token_2)
...
Each step depends on the previous output. You can't parallelize across the token dimension.
Speculative Decoding's Key Insight
The insight: what if we could check many candidate tokens at once?
LLM forward passes are memory-bandwidth bound, not compute bound. A single forward pass with a sequence of length K takes roughly the same wall-clock time as a forward pass with length 1, up to a point.
Speculative decoding exploits this:
1. Draft: the small model samples K candidate tokens autoregressively
2. Verify: the large model scores all K candidates in ONE forward pass
3. Accept/reject: each draft token is accepted with probability min(1, p_target/p_draft)
4. Correction: the first rejected token is resampled from a corrected distribution
5. Repeat from the last accepted position
The expected number of tokens accepted per large-model forward pass > 1, so overall throughput increases.
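The loop above can be sketched end to end with toy stand-in models. The two probability functions below are hypothetical placeholders for real LLMs; everything else follows the five steps directly:

```python
import random

VOCAB = ["a", "b", "c", "d"]

def target_probs(context):
    # Toy "large model": prefers repeating the last token (a made-up rule).
    last = context[-1] if context else "a"
    return {t: (0.7 if t == last else 0.1) for t in VOCAB}

def draft_probs(context):
    # Toy "small model": similar but less peaked, so most drafts are accepted.
    last = context[-1] if context else "a"
    return {t: (0.55 if t == last else 0.15) for t in VOCAB}

def sample(probs, rng):
    return rng.choices(list(probs), weights=list(probs.values()))[0]

def speculative_step(context, K, rng):
    """One draft-verify-accept round; returns the tokens emitted this round."""
    # 1. Draft: the small model proposes K tokens autoregressively.
    drafted, ctx = [], list(context)
    for _ in range(K):
        tok = sample(draft_probs(ctx), rng)
        drafted.append(tok)
        ctx.append(tok)
    # 2. Verify: a real system scores all K positions in ONE forward pass;
    #    here we just call the toy function per position.
    emitted, ctx = [], list(context)
    for tok in drafted:
        p, q = target_probs(ctx), draft_probs(ctx)
        # 3. Accept with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            ctx.append(tok)
        else:
            # 4. Rejected: resample from the corrected residual distribution.
            resid = {t: max(0.0, p[t] - q[t]) for t in VOCAB}
            z = sum(resid.values())
            emitted.append(sample({t: v / z for t, v in resid.items()}, rng))
            return emitted
    # All K accepted: the verifier's pass also yields one bonus token.
    emitted.append(sample(target_probs(ctx), rng))
    return emitted
```

Each round therefore emits between 1 token (immediate rejection plus correction) and K+1 tokens (all accepted plus the verifier's bonus token).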
The Mathematics of Acceptance
The acceptance criterion ensures the output distribution is identical to what the large model would have generated:
For position i, if draft token x_i is sampled:
Accept with probability: min(1, p_target(x_i) / p_draft(x_i))
If the token is rejected, sample a replacement from a "corrected" residual distribution:
p_corrected(x) = max(0, p_target(x) - p_draft(x)) / Z, where Z = Σ_x max(0, p_target(x) - p_draft(x))
This is a modified rejection sampling scheme that guarantees the accepted token comes from p_target.
Key property: The output distribution is provably identical to greedy/temperature sampling from the large model alone. No quality tradeoff.
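A quick way to convince yourself of this property is to simulate the accept/correct step on fixed toy distributions and check that the empirical output frequencies match p_target, not p_draft (the numbers below are illustrative, chosen so the draft disagrees with the target):

```python
import random
from collections import Counter

# Fixed toy distributions over a 3-token vocabulary (illustrative numbers).
p_target = {"x": 0.5, "y": 0.3, "z": 0.2}
p_draft  = {"x": 0.2, "y": 0.3, "z": 0.5}

def one_token(rng):
    """Draft one token, then accept/correct it against the target."""
    tok = rng.choices(list(p_draft), weights=list(p_draft.values()))[0]
    if rng.random() < min(1.0, p_target[tok] / p_draft[tok]):
        return tok
    # Rejected: sample from the normalized residual max(0, p - q) / Z.
    resid = {t: max(0.0, p_target[t] - p_draft[t]) for t in p_target}
    z = sum(resid.values())
    return rng.choices(list(resid), weights=[v / z for v in resid.values()])[0]

rng = random.Random(0)
counts = Counter(one_token(rng) for _ in range(100_000))
# Empirical frequencies recover p_target even though we drafted from p_draft.
for t, p in p_target.items():
    assert abs(counts[t] / 100_000 - p) < 0.01
```

Working through these numbers by hand: "x" is always accepted (contributing 0.2), "z" is accepted with probability 0.4 (contributing 0.2), and all rejection mass flows to "x" via the residual, landing exactly on p_target.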
Factors Affecting Speedup
1. Draft acceptance rate (α)
If each draft token is accepted with probability α (treated as independent), the expected number of tokens emitted per large-model call is:
E[tokens] = (1 - α^(K+1)) / (1 - α)
With α=0.8 and K=5: ~3.7 tokens per call → ~3.7x speedup before overhead.
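The formula is easy to sanity-check in code:

```python
def expected_tokens(alpha, K):
    """Expected tokens emitted per target-model forward pass:
    (1 - alpha^(K+1)) / (1 - alpha), a truncated geometric series (alpha < 1)."""
    return (1 - alpha ** (K + 1)) / (1 - alpha)

# alpha=0.8, K=5 from the text: (1 - 0.8^6) / 0.2 ≈ 3.69
assert round(expected_tokens(0.8, 5), 1) == 3.7
```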
2. Draft model efficiency
The draft model must be:
- Fast: Much smaller than the target model (typically 5-15x fewer parameters)
- Aligned: Trained on similar data and objectives as the target model
- Available: Often a smaller model from the same family (e.g., Llama-3-8B drafting for Llama-3-70B)
3. Sequence length K
Longer drafts amortize the verification cost more but increase the probability of rejection:
- Too short: underutilizes the parallel verification
- Too long: many tokens wasted on rejection
K=5-8 is typically optimal for code and structured outputs; K=3-4 for more open-ended generation.
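Under a simple cost model, one verify pass costs 1 unit of time and each draft step costs a small fraction of that (both figures below are assumptions, not measurements), you can sweep K to see where the optimum lands:

```python
def expected_tokens(alpha, K):
    # Expected tokens per target-model call: (1 - alpha^(K+1)) / (1 - alpha).
    return (1 - alpha ** (K + 1)) / (1 - alpha)

def speedup(alpha, K, draft_cost=0.05):
    """Tokens per unit of wall-clock time, assuming one verify pass costs 1
    and each draft step costs `draft_cost` (assumed figures)."""
    return expected_tokens(alpha, K) / (K * draft_cost + 1)

# Sweep K for alpha=0.8 to find the best speculation depth under this model.
best_K = max(range(1, 13), key=lambda K: speedup(0.8, K))
```

With these assumed costs, the sweep lands inside the K=5-8 range quoted above; lowering α or raising the draft cost pushes the optimum toward smaller K.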
Implementation Variants
Self-Speculative Decoding
No separate draft model needed: use early exit from the same model.
The large model runs a subset of its layers to produce draft logits, then runs all layers to verify. Saves the operational complexity of managing two models.
Tradeoff: The "draft" quality depends on how well early layers approximate the full model — typically lower acceptance rates than a well-matched external draft model.
Medusa
Instead of a separate draft model, Medusa adds multiple decoding "heads" to the existing model, each predicting a token at a fixed offset ahead:
LLM backbone → [head_0: next token]
             → [head_1: token+1]
             → [head_2: token+2]
             → ...
All heads read the same backbone hidden state and run in parallel, so drafting adds almost no latency on top of a single forward pass. Training is cheap (only the heads are trained). Used in production by several inference providers.
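A minimal numpy sketch of the head structure, with toy dimensions and random weights standing in for trained ones (everything here is a hypothetical stand-in, not the actual Medusa implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_heads = 16, 32, 3  # toy sizes, chosen arbitrarily

# One shared backbone hidden state per step; each Medusa head is its own
# linear projection predicting the token at a different offset ahead.
hidden = rng.standard_normal(d_model)
head_weights = [rng.standard_normal((vocab, d_model)) for _ in range(n_heads)]

# All heads read the SAME hidden state, so the K candidates come out of a
# single backbone forward pass instead of K serial draft steps.
candidates = [int(np.argmax(W @ hidden)) for W in head_weights]
```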
Lookahead Decoding
Uses Jacobi iteration to generate candidate tokens without a separate draft model at all. Particularly useful when no good small model is available in the same family.
Tree-Based Speculation (SpecTree)
Instead of a single draft sequence, generate a tree of candidates:
            token_A
           /       \
      token_B     token_C
                 /       \
            token_D     token_E
The verifier evaluates the entire tree in one pass using tree attention masks. Accepts the longest valid path. Higher acceptance probability but more complex implementation.
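The tree attention mask encodes "each position attends to itself and its ancestors", so sibling branches never see each other. A sketch for a small five-node candidate tree like the one above (the parent layout here is an assumption for illustration):

```python
import numpy as np

# Candidate tree: A is the root; B and C are alternative next tokens;
# D and E extend C. parent[i] gives each node's parent index (-1 = root).
nodes = ["A", "B", "C", "D", "E"]
parent = [-1, 0, 0, 2, 2]

n = len(nodes)
mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    j = i
    while j != -1:          # each node attends to itself and its ancestors
        mask[i, j] = True
        j = parent[j]

# Row D attends to D, C, A but NOT to the sibling branch B:
assert mask[3].tolist() == [True, False, True, True, False]
```

The verifier applies this boolean mask in place of the usual causal mask, which is what lets all branches share one forward pass.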
Production Considerations
Batching Complications
Speculative decoding is most natural for single-sequence generation. Batched inference complicates things:
- Different sequences in the batch may have different acceptance rates
- A fixed-shape batch can only advance as far as the shortest accepted prefix, wasting work on the rest
- Solutions: separate queues for single vs. batch requests, dynamic speculation depth
Memory Overhead
- Draft model KV cache: adds memory pressure
- Two sets of KV caches to manage
- Solution: offload draft KV cache to CPU, or use a much smaller draft model
Latency vs. Throughput Tradeoff
Speculative decoding optimizes throughput (tokens/second across all requests). For latency-sensitive applications (time to first token), it doesn't help and may hurt slightly due to draft overhead.
Use speculative decoding for:
- Batch offline inference
- High-throughput serving with queuing
- Code generation (high acceptance rates)
Avoid for:
- Streaming where latency matters most
- Highly creative tasks with diverse outputs (low acceptance rates)
Acceptance Rates by Task Type
From production data across several deployments:
| Task | Typical α | Approx speedup |
|---|---|---|
| Code completion | 0.85-0.92 | 4-6x |
| Text summarization | 0.75-0.85 | 3-4x |
| Translation | 0.80-0.88 | 3.5-5x |
| Creative writing | 0.60-0.75 | 2-3x |
| Math/reasoning | 0.65-0.80 | 2.5-3.5x |
Frameworks Supporting Speculative Decoding
- vLLM: Built-in speculative decoding with draft model support
- TensorRT-LLM: NVIDIA's implementation optimized for GPU batching
- HuggingFace TGI: Supports Medusa and standard speculation
- SGLang: Fast serving with speculative decoding support
Conclusion
Speculative decoding is one of the most impactful inference optimizations available today: free quality, 3-5x throughput, and production-ready in major frameworks. The main cost is operational complexity — managing two models and tuning K for your workload.
For high-volume LLM serving, it's one of the first optimizations you should reach for.
Want to go deeper on LLM infrastructure? Explore our full LLM Inference at Scale curriculum.