Abstract:Energy consumption has become a critical design metric and a limiting factor in the development of future computing architectures, from small wearable devices to large-scale leadership computing facilities. The predominant methods in energy management optimization are focused on CPUs. However, GPUs are increasingly significant and account for the majority of energy consumption in heterogeneous high performance computing (HPC) systems. Moreover, they typically rely on either purely offline training or a hybrid of offline and online training, which are impractical and lead to energy loss during data collection. Therefore, this paper studies a novel and practical online energy optimization problem for GPUs in HPC scenarios. The problem is challenging due to the inherent performance-energy trade-offs of GPUs, the exploration & exploitation dilemma across frequencies, and the lack of explicit performance counters in GPUs. To address these challenges, we formulate the online energy consumption optimization problem as a multi-armed bandit framework and develop a novel bandit based framework EnergyUCB. EnergyUCB is designed to dynamically adjust GPU core frequencies in real-time, reducing energy consumption with minimal impact on performance. Specifically, the proposed framework EnergyUCB (1) balances the performance-energy trade-off in the reward function, (2) effectively navigates the exploration & exploitation dilemma when adjusting GPU core frequencies online, and (3) leverages the ratio of GPU core utilization to uncore utilization as a real-time GPU performance metric. Experiments on a wide range of real-world HPC benchmarks demonstrate that EnergyUCB can achieve substantial energy savings. The code of EnergyUCB is available at https://github.com/XiongxiaoXu/EnergyUCB-Bandit.