Daniele De Sensi

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Aug 26, 2024

Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, Tiziano De Matteis, Zebin Ren, Luigi Fusco, Matteo Turisini, Daniele Cesarini, Kurt Lust, Animesh Trivedi(+4 more)

Abstract: Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency are challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing their limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.

* Published in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '24) (2024)

Swing: Short-cutting Rings for Higher Bandwidth Allreduce

Jan 17, 2024

Daniele De Sensi, Tommaso Bonato, David Saam, Torsten Hoefler

Abstract: The allreduce collective operation accounts for a significant fraction of the runtime of workloads running on distributed systems. One factor determining its performance is the distance between communicating nodes, especially on networks like torus, where a higher distance implies multiple messages being forwarded on the same link, thus reducing the allreduce bandwidth. Torus networks are widely used on systems optimized for machine learning workloads (e.g., Google TPUs and Amazon Trainium devices), as well as on some of the Top500 supercomputers. To improve allreduce performance on torus networks, we introduce Swing, a new algorithm that keeps a low distance between communicating nodes by swinging between torus directions. Our analysis and experimental evaluation show that Swing outperforms existing allreduce algorithms by up to 3x for vectors ranging from 32B to 128MiB, on different types of torus and torus-like topologies, regardless of their shape and size.

* NSDI 2024

HammingMesh: A Network Topology for Large-Scale Deep Learning

Sep 03, 2022

Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, Salvatore Di Girolamo, Shigang Li, Marco Heddes, Jon Belk, Deepak Goel, Miguel Castro, Steve Scott

Abstract: Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks, which in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.

* to appear at ACM/IEEE Supercomputing (SC22)
