LLM Optimizations

Wondering how to go beyond the basics of Transformers and RNNs? This guide will help you master LLM optimizations, covering cutting-edge techniques and strategies used to improve model performance for both training and inference.

Whether you're just getting started or looking to optimize your existing LLMs, this comprehensive guide has you covered with real-world code samples and hands-on training.

Below you can find the table of contents, with some additional information for each topic.

Looking to access the whole guide? Explore the premium option.

Full PDF (coming soon). Interested in a live, high-end course? Let me know!

Table of Contents

II. Training Optimization Techniques

  1. Attention Optimizations
    1. Flash Attention v1
    2. Flash Attention v2
    3. Flash Attention v3
  2. Distributed Training
    1. Distributed Flash Attention (LightSeq)
  3. Memory Efficiency (see the sketch right after this list)
    1. Gradient Checkpointing
    2. Mixed Precision Training

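To give a feel for the memory-efficiency items above, here is a minimal PyTorch sketch combining gradient checkpointing with mixed precision training. The toy model, layer sizes, and hyperparameters are placeholders of mine, not code from the guide.

```python
# Gradient checkpointing (recompute activations in backward instead of storing them)
# plus mixed precision training (low-precision forward pass with loss scaling).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class ToyModel(nn.Module):
    def __init__(self, dim=512, n_layers=4):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_layers))

    def forward(self, x):
        for block in self.blocks:
            # Gradient checkpointing: activations inside `block` are not kept;
            # they are recomputed during backward, trading extra FLOPs for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
model = ToyModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Loss scaling guards fp16 gradients against underflow (disabled on CPU here).
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 128, 512, device=device)
target = torch.randn_like(x)

# Mixed precision: run the forward pass in a lower-precision dtype while the
# master weights and optimizer state stay in fp32.
with torch.autocast(device_type=device, dtype=amp_dtype):
    out = model(x)
loss = nn.functional.mse_loss(out.float(), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

The two techniques compose naturally: checkpointing cuts activation memory roughly in proportion to the number of checkpointed blocks, while mixed precision halves the footprint of the activations that do get stored.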
III. Inference Optimization Techniques

  1. KV Cache: Basics
    1. Introduction to KV Caching (see the toy sketch at the end of this table of contents)
    2. PagedAttention
  2. Attention Optimizations
    1. Multi-Query Attention
    2. PagedAttention
    3. Hybrid Attention: local + global
    4. Ring Attention
    5. Infini-Attention
    6. Block Transformer
    7. Longformer Attention
    8. RadixAttention
  3. Decoding Strategies
    1. Speculative Decoding
    2. Staged Speculative Decoding
    3. Lookahead Decoding
    4. Non-autoregressive Decoding
  4. Memory Optimizations
    1. KV Cache Optimizations
      1. Predictive Caching
      2. Parallel KV Cache Generation
      3. Cross-layer KV Sharing
      4. RadixAttention (KV cache reuse)
    2. Model Compression
      1. AWQ
      2. SqueezeLLM
      3. 8-bit Optimizers via Block-wise Quantization
  5. Computation Reduction
    1. Early Exiting
    2. Cascade Inference
  6. Batching Techniques
    1. Static Batching
    2. Dynamic Batching
    3. Continuous Batching
    4. Speculative Batching
    5. Adaptive Batching
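As a taste of the KV cache basics listed above, here is a toy, single-head sketch of the core idea: during autoregressive decoding, the keys and values of past tokens are cached and reused, so each new token only attends against the cache instead of re-encoding the whole prefix. The weight matrices and dimensions are arbitrary placeholders, not code from the guide.

```python
import torch

torch.manual_seed(0)
d_model = 64
W_q = torch.randn(d_model, d_model) / d_model**0.5
W_k = torch.randn(d_model, d_model) / d_model**0.5
W_v = torch.randn(d_model, d_model) / d_model**0.5

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = (q @ K.T) / d_model**0.5        # (1, t)
    weights = torch.softmax(scores, dim=-1)  # (1, t)
    return weights @ V                       # (1, d_model)

k_cache = torch.empty(0, d_model)  # grows by one row per decoded token
v_cache = torch.empty(0, d_model)

def decode_step(x_new):
    """x_new: (1, d_model) embedding of the newest token."""
    global k_cache, v_cache
    q = x_new @ W_q
    # Append this token's key/value once; older entries are never recomputed.
    k_cache = torch.cat([k_cache, x_new @ W_k], dim=0)
    v_cache = torch.cat([v_cache, x_new @ W_v], dim=0)
    return attend(q, k_cache, v_cache)

# Decode 5 dummy tokens; each step costs O(t) against the cache rather than
# recomputing attention over the full prefix from scratch.
for t in range(5):
    out = decode_step(torch.randn(1, d_model))
    print(t, out.shape)
```

Techniques such as PagedAttention, cross-layer KV sharing, and RadixAttention in the sections above are, at their core, different strategies for storing, sharing, or reusing exactly this cache.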