Microsoft Research India
Abstract: The widespread adoption of language models (LMs) across multiple industries has caused a huge surge in demand for GPUs. Training LMs requires tens of thousands of GPUs, and housing them all in a single datacenter (DC) is becoming challenging. We focus on training such models across multiple DCs connected via a Wide-Area Network (WAN). We build ATLAS, which speeds up such training using novel temporal bandwidth sharing and several other design choices. While ATLAS improves training time, it does not eliminate bubbles (idle GPU cycles). We therefore built BUBBLETEA, which runs prefill-as-a-service (part of LM inference) during the bubbles, substantially improving GPU utilization without any impact on training. Together, ATLAS and BUBBLETEA improve training time by up to 17X and achieve GPU utilization of up to 94%.
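To make the BUBBLETEA idea concrete, below is a minimal sketch of packing prefill work into predicted pipeline bubbles. It is not the authors' implementation; the names (`Bubble`, `PrefillRequest`, `run_prefill_chunk`), the chunk size, and the per-chunk cost are all hypothetical illustrations of the scheduling constraint that inference work must fit inside idle GPU windows so training is never delayed.

```python
# Sketch only: fill a pipeline bubble with prefill chunks that fit
# within the predicted idle budget. All names and constants are
# assumptions, not the BUBBLETEA system's actual API.
from dataclasses import dataclass
from collections import deque

@dataclass
class Bubble:
    start_ms: float      # when the GPU goes idle in the pipeline schedule
    duration_ms: float   # predicted idle time before the next training stage

@dataclass
class PrefillRequest:
    request_id: str
    tokens_left: int     # prompt tokens still to be prefilled

CHUNK_TOKENS = 256       # prefill granularity (assumed)
MS_PER_CHUNK = 5.0       # measured cost of one prefill chunk (assumed)

def run_prefill_chunk(request_id: str, n_tokens: int) -> None:
    # Placeholder for the actual prefill kernel launch.
    print(f"prefill {n_tokens} tokens of {request_id}")

def fill_bubble(bubble: Bubble, queue: deque) -> None:
    """Pack as many prefill chunks as fit into one bubble, so
    inference work never spills into training time."""
    budget = bubble.duration_ms
    while queue and budget >= MS_PER_CHUNK:
        req = queue[0]
        chunk = min(CHUNK_TOKENS, req.tokens_left)
        run_prefill_chunk(req.request_id, chunk)
        req.tokens_left -= chunk
        budget -= MS_PER_CHUNK
        if req.tokens_left == 0:
            queue.popleft()  # prefill done; decode runs elsewhere

# Example: a 12 ms bubble admits two 5 ms chunks, then stops.
queue = deque([PrefillRequest("r1", tokens_left=600)])
fill_bubble(Bubble(start_ms=0.0, duration_ms=12.0), queue)
```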
Abstract: Simulating physical network paths (e.g., the Internet) is a cornerstone research problem in the emerging sub-field of AI-for-networking. We seek a model that generates end-to-end packet delay values in response to the time-varying load offered by a sender, which is typically a function of the previously output delays. We formulate an ML problem at the intersection of dynamical systems, sequential decision making, and time-series generative modeling. We propose a novel grey-box approach to network simulation that embeds the semantics of a physical network path in a new RNN-style architecture called the Recurrent Buffering Unit. It provides the interpretability of standard network simulator tools, the expressive power of neural models, and the efficiency of SGD-based learning, and it yields promising results on synthetic and real-world network traces.
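The abstract does not specify the Recurrent Buffering Unit's internals, so the sketch below only illustrates the grey-box idea it describes: an RNN-style cell whose hidden state has a physical meaning (queue backlog), with learned neural components for the unknown quantities. The fluid-queue update, the `rate_net` predictor, and all constants are assumptions for illustration, not the paper's architecture.

```python
# Toy grey-box recurrent cell: hidden state = queue backlog (interpretable),
# service rate = learned neural function (trainable via SGD).
import torch
import torch.nn as nn

class ToyBufferingCell(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        # Learned ("grey") part: service rate predicted from (backlog, load).
        self.rate_net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus())
        self.base_delay = nn.Parameter(torch.tensor(0.01))  # propagation delay (s)

    def forward(self, backlog, load, dt=0.001):
        # Physical ("white") part: fluid-queue dynamics,
        #   backlog' = max(0, backlog + (arrivals - service) * dt)
        x = torch.stack([backlog, load], dim=-1)
        service_rate = self.rate_net(x).squeeze(-1)
        new_backlog = torch.relu(backlog + (load - service_rate) * dt)
        # End-to-end delay = queueing delay + propagation delay.
        delay = new_backlog / (service_rate + 1e-6) + self.base_delay
        return new_backlog, delay

# Roll the cell over a time series of offered load (packets/s); in a
# closed-loop simulation the sender's load would depend on past delays.
cell = ToyBufferingCell()
backlog = torch.zeros(1)
for load in torch.tensor([100.0, 500.0, 900.0, 200.0]):
    backlog, delay = cell(backlog, load.expand(1))
    print(f"backlog={backlog.item():.3f}, delay={delay.item():.4f}s")
```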