T. S. Eugene Ng

Zen: Near-Optimal Sparse Tensor Synchronization for Distributed DNN Training

Sep 23, 2023

Zhuang Wang, Zhaozhuo Xu, Anshumali Shrivastava, T. S. Eugene Ng

Abstract: Distributed training is the de facto standard to scale up the training of Deep Neural Networks (DNNs) with multiple GPUs. The performance bottleneck of distributed training lies in communications for gradient synchronization. Recently, practitioners have observed sparsity in gradient tensors, suggesting the potential to reduce the traffic volume in communication and improve end-to-end training efficiency. Yet, the optimal communication scheme to fully leverage sparsity is still missing. This paper aims to address this gap. We first analyze the characteristics of sparse tensors in popular DNN models to understand the fundamentals of sparsity. We then systematically explore the design space of communication schemes for sparse tensors and find the optimal one. We also develop a gradient synchronization system called Zen that approximately realizes it for sparse tensors. We demonstrate that Zen can achieve up to 5.09x speedup in communication time and up to 2.48x speedup in training throughput compared to the state-of-the-art methods.
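To make the idea of exploiting gradient sparsity concrete, here is a minimal sketch of sending only the non-zero entries of each worker's gradient and summing them. It is not Zen's communication scheme; the function names `to_sparse` and `aggregate_sparse`, the (index, value) pair format, and the example gradients are all assumptions chosen for illustration.

```python
from collections import defaultdict


def to_sparse(dense):
    """Keep only the non-zero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v != 0.0]


def aggregate_sparse(sparse_grads, length):
    """Sum per-worker sparse gradients into one dense gradient of the given length."""
    total = defaultdict(float)
    for grad in sparse_grads:
        for i, v in grad:
            total[i] += v
    return [total.get(i, 0.0) for i in range(length)]


# Hypothetical gradients from two workers (mostly zeros).
workers = [
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0],
    [0.0, 0.25, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0],
]

sparse = [to_sparse(g) for g in workers]
summed = aggregate_sparse(sparse, len(workers[0]))

# Elements sent if every worker ships its full dense tensor.
dense_elems = sum(len(g) for g in workers)
# Elements sent in sparse form: one index plus one value per non-zero entry.
sparse_elems = sum(2 * len(s) for s in sparse)

print(summed, dense_elems, sparse_elems)
```

In this toy case the sparse form halves the number of elements sent. Note that the summed gradient has more non-zero entries than either input, because the non-zero positions of different workers only partly overlap.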

ByteComp: Revisiting Gradient Compression in Distributed Training

Jun 06, 2022

Zhuang Wang, Haibin Lin, Yibo Zhu, T. S. Eugene Ng

Abstract: Gradient compression (GC) is a promising approach to addressing the communication bottleneck in distributed deep learning (DDL). However, it is challenging to find the optimal compression strategy for applying GC to DDL because of the intricate interactions among tensors. To fully unleash the benefits of GC, two questions must be addressed: 1) How to express all compression strategies and the corresponding interactions among tensors of any DDL training job? 2) How to quickly select a near-optimal compression strategy? In this paper, we propose ByteComp to answer these questions. It first designs a decision tree abstraction to express all the compression strategies and develops empirical models to timeline tensor computation, communication, and compression to enable ByteComp to derive the intricate interactions among tensors. It then designs a compression decision algorithm that analyzes tensor interactions to eliminate and prioritize strategies and optimally offloads compression to CPUs. Experimental evaluations show that ByteComp can improve the training throughput over the state-of-the-art compression-enabled system by up to 77% for representative DDL training jobs. Moreover, the computational time needed to select the compression strategy is measured in milliseconds, and the selected strategy is only a few percent from optimal.
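The strategy-selection problem can be illustrated with a toy brute-force search. This is not ByteComp's decision tree or its decision algorithm: the three per-tensor options, the cost constants, and the assumption that GPU compression, CPU compression, and the network overlap perfectly (so the estimated time is the busiest resource's load) are all hypothetical.

```python
from itertools import product

# Hypothetical per-tensor choices: send uncompressed, compress on GPU, compress on CPU.
OPTIONS = ("none", "gpu", "cpu")


def estimated_time(sizes, choices, bandwidth=1.0, ratio=0.25,
                   gpu_cost=0.5, cpu_cost=1.0):
    """Toy cost model: each resource processes its work serially, and the
    resources overlap perfectly, so the estimate is the largest load."""
    gpu_load = cpu_load = net_load = 0.0
    for size, choice in zip(sizes, choices):
        if choice == "none":
            net_load += size / bandwidth
        else:
            if choice == "gpu":
                gpu_load += size * gpu_cost
            else:
                cpu_load += size * cpu_cost
            # A compressed tensor shrinks to `ratio` of its original size.
            net_load += size * ratio / bandwidth
    return max(gpu_load, cpu_load, net_load)


def best_strategy(sizes):
    """Enumerate all 3^n assignments and return the cheapest one."""
    return min(product(OPTIONS, repeat=len(sizes)),
               key=lambda choices: estimated_time(sizes, choices))


sizes = [8.0, 4.0, 2.0]  # hypothetical tensor sizes
best = best_strategy(sizes)
best_time = estimated_time(sizes, best)
baseline_time = estimated_time(sizes, ("none",) * len(sizes))

print(best, best_time, baseline_time)
```

Because the tensors share the GPU, the CPU, and the network, the best choice for one tensor depends on what the others do. Exhaustive search grows as 3^n in the number of tensors, which is why a method that eliminates and prioritizes strategies is needed for real models.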

MergeComp: A Compression Scheduler for Scalable Communication-Efficient Distributed Training

Mar 28, 2021

Zhuang Wang, Xinyu Wu, T. S. Eugene Ng

Abstract: Large-scale distributed training is increasingly becoming communication bound. Many gradient compression algorithms have been proposed to reduce the communication overhead and improve scalability. However, it has been observed that in some cases gradient compression may even harm the performance of distributed training. In this paper, we propose MergeComp, a compression scheduler to optimize the scalability of communication-efficient distributed training. It automatically schedules the compression operations to optimize the performance of compression algorithms without the knowledge of model architectures or system parameters. We have applied MergeComp to nine popular compression algorithms. Our evaluations show that MergeComp can improve the performance of compression algorithms by up to 3.83x without losing accuracy. It can even achieve a scaling factor of distributed training up to 99% over high-speed networks.

* 8 pages
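For readers unfamiliar with the term, a scaling factor is commonly computed as the measured throughput on n GPUs divided by n times the single-GPU throughput. The sketch below assumes that common definition, which the abstract itself does not spell out; the function name `scaling_factor` and the throughput numbers are made up for illustration.

```python
def scaling_factor(cluster_throughput, single_gpu_throughput, num_gpus):
    """Fraction of ideal linear speedup achieved by the cluster.

    Assumes the common definition: measured throughput divided by
    num_gpus times the single-GPU throughput.
    """
    if num_gpus <= 0 or single_gpu_throughput <= 0:
        raise ValueError("num_gpus and single_gpu_throughput must be positive")
    return cluster_throughput / (single_gpu_throughput * num_gpus)


# Hypothetical numbers: 32 GPUs reaching 3168 samples/s, 100 samples/s on one GPU.
factor = scaling_factor(cluster_throughput=3168.0,
                        single_gpu_throughput=100.0,
                        num_gpus=32)
print(f"{factor:.0%}")
```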
