Title: SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile

Nov 01, 2024

Ruisi Zhang, Tianyu Liu, Will Feng, Andrew Gu, Sanket Purandare, Wanchao Liang, Francisco Massa

Abstract: Distributed training of large models consumes enormous computational resources and requires substantial engineering effort to compose various training techniques. This paper presents SimpleFSDP, a PyTorch-native, compiler-based Fully Sharded Data Parallel (FSDP) framework that has a simple implementation for ease of maintenance and composability, allows full computation-communication graph tracing, and improves performance through compiler backend optimizations. SimpleFSDP's novelty lies in its torch.compile-friendly implementation of collective communications using existing PyTorch primitives, namely parametrizations, selective activation checkpointing, and DTensor. It also features first-of-its-kind bucketing and reordering of intermediate representation (IR) nodes in the TorchInductor backend for effective computation-communication overlap. As a result, users can employ these optimizations to wrap model components automatically or manually for minimal communication exposure. Extensive evaluations of SimpleFSDP on Llama 3 models (including the ultra-large 405B) using TorchTitan demonstrate up to 28.54% memory reduction and 68.67% throughput improvement compared to the most widely adopted FSDP2 eager framework, when composed with other distributed training techniques.
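For readers unfamiliar with FSDP, the collective communications mentioned in the abstract follow a general pattern: each rank stores only a shard of every parameter, all-gathers the full parameter before it is used, and reduce-scatters the gradient afterwards. The sketch below is a toy, single-process simulation of that data flow in plain Python. It is not SimpleFSDP's implementation, which per the abstract expresses these steps with PyTorch parametrizations, selective activation checkpointing, and DTensor. The helper names `shard`, `all_gather`, and `reduce_scatter`, the zero-padding scheme, and the gradient averaging are assumptions made for illustration.

```python
from typing import List


def shard(param: List[float], world_size: int) -> List[List[float]]:
    """Split a flat parameter into world_size equal shards, zero-padding the tail."""
    shard_len = -(-len(param) // world_size)  # ceiling division
    padded = param + [0.0] * (shard_len * world_size - len(param))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Simulate all-gather: concatenate every rank's shard and drop the padding."""
    full = [x for s in shards for x in s]
    return full[:numel]


def reduce_scatter(grads_per_rank: List[List[float]], world_size: int) -> List[List[float]]:
    """Simulate reduce-scatter: average full gradients, then give each rank its shard."""
    numel = len(grads_per_rank[0])
    mean = [sum(g[i] for g in grads_per_rank) / world_size for i in range(numel)]
    return shard(mean, world_size)


world_size = 2
weight = [1.0, 2.0, 3.0]

# Between uses, each simulated rank holds only its own shard of the weight.
shards = shard(weight, world_size)

# Before the forward pass, the full weight is rebuilt from the shards.
full = all_gather(shards, len(weight))

# Toy per-rank loss: sum(w_i * x_i), so the local gradient w.r.t. w is the input x.
local_grads = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]

# After the backward pass, gradients are averaged and re-sharded.
grad_shards = reduce_scatter(local_grads, world_size)
```

In a real multi-GPU run the gather and scatter steps are network collectives, which is why overlapping them with computation (the bucketing and reordering the abstract describes) matters for throughput.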
