Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhisheng Ye

Characterization of Large Language Model Development in the Datacenter

Mar 12, 2024

Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo(+2 more)

Abstract:Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs, often riddled with numerous challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. In this paper, we present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme. Specifically, we investigate discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explore resource utilization patterns, and identify the impact of various job failures. Our analysis summarizes hurdles we encountered and uncovers potential opportunities to optimize systems tailored for LLMs. Furthermore, we introduce our system efforts: (1) fault-tolerant pretraining, which enhances fault tolerance through LLM-involved failure diagnosis and automatic recovery. (2) decoupled scheduling for evaluation, which achieves timely performance feedback via trial decomposition and scheduling optimization.

Via

Access Paper or Ask Questions

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Jun 01, 2022

Wei Gao, Qinghao Hu, Zhisheng Ye, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen

Figure 1 for Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Figure 2 for Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Figure 3 for Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Figure 4 for Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

Abstract:Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU datacenter. An efficient scheduler design for such GPU datacenter is crucially important to reduce the operational cost and improve resource utilization. However, traditional approaches designed for big data or high performance computing workloads can not support DL workloads to fully utilize the GPU resources. Recently, substantial schedulers are proposed to tailor for DL workloads in GPU datacenters. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads from the scheduling objectives and resource consumption features. Finally, we prospect several promising future research directions. More detailed summary with the surveyed paper and code links can be found at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers

* Submitted to ACM Computing Surveys

Via

Access Paper or Ask Questions

The Internet of Federated Things : A Vision for the Future and In-depth Survey of Data-driven Approaches for Federated Learning

Nov 09, 2021

Raed Kontar, Naichen Shi, Xubo Yue, Seokhyun Chung, Eunshin Byon, Mosharaf Chowdhury, Judy Jin, Wissam Kontar, Neda Masoud, Maher Noueihed(+5 more)

Figure 1 for The Internet of Federated Things : A Vision for the Future and In-depth Survey of Data-driven Approaches for Federated Learning

Figure 2 for The Internet of Federated Things : A Vision for the Future and In-depth Survey of Data-driven Approaches for Federated Learning

Figure 3 for The Internet of Federated Things : A Vision for the Future and In-depth Survey of Data-driven Approaches for Federated Learning

Figure 4 for The Internet of Federated Things : A Vision for the Future and In-depth Survey of Data-driven Approaches for Federated Learning

Abstract:The Internet of Things (IoT) is on the verge of a major paradigm shift. In the IoT system of the future, IoFT, the cloud will be substituted by the crowd where model training is brought to the edge, allowing IoT devices to collaboratively extract knowledge and build smart analytics/models while keeping their personal data stored locally. This paradigm shift was set into motion by the tremendous increase in computational power on IoT devices and the recent advances in decentralized and privacy-preserving model training, coined as federated learning (FL). This article provides a vision for IoFT and a systematic overview of current efforts towards realizing this vision. Specifically, we first introduce the defining characteristics of IoFT and discuss FL data-driven approaches, opportunities, and challenges that allow decentralized inference within three dimensions: (i) a global model that maximizes utility across all IoT devices, (ii) a personalized model that borrows strengths across all devices yet retains its own model, (iii) a meta-learning model that quickly adapts to new devices or learning tasks. We end by describing the vision and challenges of IoFT in reshaping different industries through the lens of domain experts. Those industries include manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing.

* Accepted at IEEE

Via

Access Paper or Ask Questions

A Unifying Framework for Variance Reduction Algorithms for Finding Zeroes of Monotone Operators

Jun 22, 2019

Xun Zhang, William B. Haskell, Zhisheng Ye

Figure 1 for A Unifying Framework for Variance Reduction Algorithms for Finding Zeroes of Monotone Operators

Abstract:A wide range of optimization problems can be recast as monotone inclusion problems. We propose a unifying framework for solving the monotone inclusion problem with randomized Forward-Backward algorithms. Our framework covers many existing deterministic and stochastic algorithms. Under various conditions, we can establish both sublinear and linear convergence rates in expectation for the algorithms covered by this framework. In addition, we consider algorithm design as well as asynchronous randomized Forward algorithms. Numerical experiments demonstrate the worth of the new algorithms that emerge from our framework

Via

Access Paper or Ask Questions