LLM Optimizations

Wondering how to go beyond the basics of Transformers and RNNs? This guide will help you master LLM optimizations, covering cutting-edge techniques and strategies used to improve model performance for both training and inference.

Whether you're just getting started or looking to optimize your existing LLMs, this comprehensive guide has you covered with real-world code samples and hands-on training.

Below you can find the table of contents, with some additional information for each topic.

Looking to access the whole guide? Explore the premium option.

Full PDF (coming soon). Interested in a live, high-end course? Let me know!

Table of Contents

II. Training Optimization Techniques

  1. Attention Optimizations
    1. Flash Attention v1
    2. Flash Attention v2
    3. Flash Attention v3
  2. Distributed Training
    1. Distributed Flash Attention (LightSeq)
  3. Memory Efficiency (see the sketch right after this list)
    1. Gradient Checkpointing
    2. Mixed Precision Training

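To give a feel for the memory-efficiency items above, here is a minimal PyTorch sketch combining gradient checkpointing with mixed precision training. The toy model, layer sizes, and hyperparameters are placeholders of mine, not code from the guide.

```python
# Gradient checkpointing (recompute activations in backward instead of storing them)
# plus mixed precision training (low-precision forward pass with loss scaling).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class ToyModel(nn.Module):
    def __init__(self, dim=512, n_layers=4):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_layers))

    def forward(self, x):
        for block in self.blocks:
            # Gradient checkpointing: activations inside `block` are not kept;
            # they are recomputed during backward, trading extra FLOPs for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
model = ToyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Loss scaling guards fp16 gradients against underflow (disabled on CPU here).
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 128, 512, device=device)
target = torch.randn_like(x)

# Mixed precision: run the forward pass in a lower-precision dtype while the
# master weights and optimizer state stay in fp32.
with torch.autocast(device_type=device, dtype=amp_dtype):
    out = model(x)
loss = nn.functional.mse_loss(out.float(), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

The two techniques compose naturally: checkpointing cuts activation memory roughly in proportion to the number of checkpointed blocks, while mixed precision halves the footprint of the activations that do get stored.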
III. Inference Optimization Techniques

  1. KV Cache: Basics
    1. Introduction to KV Caching (see the toy sketch at the end of this table of contents)
    2. PagedAttention
  2. Attention Optimizations
    1. Multi-Query Attention
    2. PagedAttention
    3. Hybrid Attention: local + global
    4. Ring Attention
    5. Infini-Attention
    6. Block Transformer
    7. Longformer Attention
    8. RadixAttention
  3. Decoding Strategies
    1. Speculative Decoding
    2. Staged Speculative Decoding
    3. Lookahead Decoding
    4. Non-autoregressive Decoding
  4. Memory Optimizations
    1. KV Cache Optimizations
      1. Predictive Caching
      2. Parallel KV Cache Generation
      3. Cross-layer KV Sharing
      4. RadixAttention (KV cache reuse)
    2. Model Compression
      1. AWQ
      2. SqueezeLLM
      3. 8-bit Optimizers via Block-wise Quantization
  5. Computation Reduction
    1. Early Exiting
    2. Cascade Inference
  6. Batching Techniques
    1. Static Batching
    2. Dynamic Batching
    3. Continuous Batching
    4. Speculative Batching
    5. Adaptive Batching
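As a taste of the KV cache basics listed above, here is a toy, single-head sketch of the core idea: during autoregressive decoding, the keys and values of past tokens are cached and reused, so each new token only attends against the cache instead of re-encoding the whole prefix. The weight matrices and dimensions are arbitrary placeholders, not code from the guide.

```python
import torch

torch.manual_seed(0)
d_model = 64
W_q = torch.randn(d_model, d_model) / d_model**0.5
W_k = torch.randn(d_model, d_model) / d_model**0.5
W_v = torch.randn(d_model, d_model) / d_model**0.5

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = (q @ K.T) / d_model**0.5        # (1, t)
    weights = torch.softmax(scores, dim=-1)  # (1, t)
    return weights @ V                       # (1, d_model)

k_cache = torch.empty(0, d_model)  # grows by one row per decoded token
v_cache = torch.empty(0, d_model)

def decode_step(x_new):
    """x_new: (1, d_model) embedding of the newest token."""
    global k_cache, v_cache
    q = x_new @ W_q
    # Append this token's key/value once; older entries are never recomputed.
    k_cache = torch.cat([k_cache, x_new @ W_k], dim=0)
    v_cache = torch.cat([v_cache, x_new @ W_v], dim=0)
    return attend(q, k_cache, v_cache)

# Decode 5 dummy tokens; each step costs O(t) against the cache rather than
# recomputing attention over the full prefix from scratch.
for t in range(5):
    out = decode_step(torch.randn(1, d_model))
    print(t, out.shape)
```

Techniques such as PagedAttention, cross-layer KV sharing, and RadixAttention in the sections above are, at their core, different strategies for storing, sharing, or reusing exactly this cache.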