Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Emanuele Carlini

Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

Dec 23, 2024

Francesco Lettich, Emanuele Carlini, Franco Maria Nardini, Raffaele Perego, Salvatore Trani

Figure 1 for Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

Figure 2 for Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

Figure 3 for Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

Figure 4 for Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

Abstract:The rise of Artificial Intelligence and Large Language Models is driving increased GPU usage in data centers for complex training and inference tasks, impacting operational costs, energy demands, and the environmental footprint of large-scale computing infrastructures. This work addresses the online scheduling problem in GPU datacenters, which involves scheduling tasks without knowledge of their future arrivals. We focus on two objectives: minimizing GPU fragmentation and reducing power consumption. GPU fragmentation occurs when partial GPU allocations hinder the efficient use of remaining resources, especially as the datacenter nears full capacity. A recent scheduling policy, Fragmentation Gradient Descent (FGD), leverages a fragmentation metric to address this issue. Reducing power consumption is also crucial due to the significant power demands of GPUs. To this end, we propose PWR, a novel scheduling policy to minimize power usage by selecting power-efficient GPU and CPU combinations. This involves a simplified model for measuring power consumption integrated into a Kubernetes score plugin. Through an extensive experimental evaluation in a simulated cluster, we show how PWR, when combined with FGD, achieves a balanced trade-off between reducing power consumption and minimizing GPU fragmentation.

* This work has been submitted to the IEEE for possible publication

Via

Access Paper or Ask Questions

TEACHING -- Trustworthy autonomous cyber-physical applications through human-centred intelligence

Jul 14, 2021

Davide Bacciu, Siranush Akarmazyan, Eric Armengaud, Manlio Bacco, George Bravos, Calogero Calandra, Emanuele Carlini, Antonio Carta, Pietro Cassara, Massimo Coppola(+25 more)

Figure 1 for TEACHING -- Trustworthy autonomous cyber-physical applications through human-centred intelligence

Figure 2 for TEACHING -- Trustworthy autonomous cyber-physical applications through human-centred intelligence

Figure 3 for TEACHING -- Trustworthy autonomous cyber-physical applications through human-centred intelligence

Abstract:This paper discusses the perspective of the H2020 TEACHING project on the next generation of autonomous applications running in a distributed and highly heterogeneous environment comprising both virtual and physical resources spanning the edge-cloud continuum. TEACHING puts forward a human-centred vision leveraging the physiological, emotional, and cognitive state of the users as a driver for the adaptation and optimization of the autonomous applications. It does so by building a distributed, embedded and federated learning system complemented by methods and tools to enforce its dependability, security and privacy preservation. The paper discusses the main concepts of the TEACHING approach and singles out the main AI-related research challenges associated with it. Further, we provide a discussion of the design choices for the TEACHING system to tackle the aforementioned challenges

Via

Access Paper or Ask Questions