Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Scale up with Order: Finding Good Data Permutations for Distributed Training

Feb 02, 2023

Wentao Guo, Khiem Pham, Yucheng Lu, Tiancheng Yuan, Charlie F. Ruan, Christopher De Sa

Figure 1 for Scale up with Order: Finding Good Data Permutations for Distributed Training

Figure 2 for Scale up with Order: Finding Good Data Permutations for Distributed Training

Figure 3 for Scale up with Order: Finding Good Data Permutations for Distributed Training

Figure 4 for Scale up with Order: Finding Good Data Permutations for Distributed Training

Share this with someone who'll enjoy it:

Abstract:Gradient Balancing (GraB) is a recently proposed technique that finds provably better data permutations when training models with multiple epochs over a finite dataset. It converges at a faster rate than the widely adopted Random Reshuffling, by minimizing the discrepancy of the gradients on adjacently selected examples. However, GraB only operates under critical assumptions such as small batch sizes and centralized data, leaving open the question of how to order examples at large scale -- i.e. distributed learning with decentralized data. To alleviate the limitation, in this paper we propose D-GraB that involves two novel designs: (1) $\textsf{PairBalance}$ that eliminates the requirement to use stale gradient mean in GraB which critically relies on small learning rates; (2) an ordering protocol that runs $\textsf{PairBalance}$ in a distributed environment with negligible overhead, which benefits from both data ordering and parallelism. We prove D-GraB enjoys linear speed up at rate $\tilde{O}((mnT)^{-2/3})$ on smooth non-convex objectives and $\tilde{O}((mnT)^{-2})$ under PL condition, where $n$ denotes the number of parallel workers, $m$ denotes the number of examples per worker and $T$ denotes the number of epochs. Empirically, we show on various applications including GLUE, CIFAR10 and WikiText-2 that D-GraB outperforms naive parallel GraB and Distributed Random Reshuffling in terms of both training and validation performance.

View paper on

Share this with someone who'll enjoy it:

Title:Scale up with Order: Finding Good Data Permutations for Distributed Training

Paper and Code