Pipeline parallelism speeds up the training of large AI models by splitting the model across multiple GPUs and passing data through them like an assembly line, so no single device has to hold the entire model in memory.
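To make the "splitting" concrete, here is a minimal sketch, assuming PyTorch and an arbitrary two-stage cut of a small MLP (the layer sizes and split point are illustrative, not taken from the course):

```python
import torch
import torch.nn as nn

# A small MLP and an illustrative two-stage split. In a real pipeline each
# stage would live on its own GPU, e.g. stage0.to("cuda:0"), stage1.to("cuda:1").
full_model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
layers = list(full_model)
stage0 = nn.Sequential(*layers[:2])   # first half of the layers
stage1 = nn.Sequential(*layers[2:])   # second half of the layers

x = torch.randn(32, 784)              # a dummy batch
activations = stage0(x)               # in a pipeline, this tensor is sent to the next device
logits = stage1(activations)
print(logits.shape)                   # torch.Size([32, 10])
```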
This course teaches pipeline parallelism from scratch, building a distributed training system step by step. Starting with a simple monolithic MLP, you'll learn to manually partition models, implement distributed communication primitives, and progressively build three pipeline schedules: naive stop-and-wait, GPipe with micro-batching, and the interleaved 1F1B algorithm. Kian Kyars created this course.
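As a taste of the communication-primitives work, here is a minimal two-process "ping pong" sketch, assuming PyTorch's torch.distributed with the gloo backend on CPU (the port number and message contents are arbitrary placeholders, not the course's code):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def ping_pong(rank, world_size):
    # Minimal rendezvous on localhost; the address and port are placeholders.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1.0
        dist.send(tensor, dst=1)      # ping
        dist.recv(tensor, src=1)      # wait for the pong
        print(f"rank 0 got back {tensor.item()}")
    else:
        dist.recv(tensor, src=0)      # receive the ping
        tensor += 1.0
        dist.send(tensor, dst=0)      # pong it back

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(ping_pong, args=(2,), nprocs=2)
```

The same blocking send/recv pattern is what lets one pipeline stage hand its activations to the next.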
Here are the sections in this course:
Introduction, Repository Setup & Syllabus
Step 0: The Monolith Baseline
Step 1: Manual Model Partitioning
Step 2: Distributed Communication Primitives
Step 3: Distributed Ping Pong Lab
Step 4: Building the Sharded Model
Step 5: The Main Training Orchestrator
Step 6a: Naive Pipeline Parallelism
Step 6b: GPipe & Micro-batching
Step 6c: 1F1B Theory & Spreadsheet Derivation
Step 6c: Implementing 1F1B & Async Sends
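The last two steps above build up to 1F1B ("one forward, one backward"). As a rough, non-interleaved illustration of that schedule, here is a small sketch (my own, not the course's code) that prints the order in which a single stage runs forward and backward passes over its micro-batches: a warm-up of forwards, a steady state alternating one forward with one backward, then a drain of the remaining backwards.

```python
def one_f_one_b_order(stage: int, num_stages: int, num_microbatches: int):
    """Rough 1F1B ordering for a single stage (illustrative, not the course's code)."""
    order = []
    warmup = min(num_stages - stage - 1, num_microbatches)
    fwd = bwd = 0

    # Warm-up: forward-only steps so downstream stages have work to do.
    for _ in range(warmup):
        order.append(("F", fwd)); fwd += 1

    # Steady state: one forward, then one backward (hence "1F1B").
    while fwd < num_microbatches:
        order.append(("F", fwd)); fwd += 1
        order.append(("B", bwd)); bwd += 1

    # Drain: finish the remaining backwards.
    while bwd < num_microbatches:
        order.append(("B", bwd)); bwd += 1

    return order

# Example: 4 stages, 8 micro-batches; stage 0 warms up with 3 forwards.
print(one_f_one_b_order(stage=0, num_stages=4, num_microbatches=8))
```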
Watch the full course on the freeCodeCamp.org YouTube channel (3-hour watch).