Abstract:This paper presents a novel approach to sourcing large-scale training data for LLMs focused on science and engineering. In particular, a crucial challenge is sourcing parallel scientific codes at the scale of millions to billions of codes. To tackle this problem, we propose an automated pipeline framework, called LASSI, designed to translate between parallel programming languages by bootstrapping existing closed- or open-source LLMs. LASSI incorporates autonomous enhancement through self-correcting loops, in which errors encountered during compilation and execution of generated code are fed back to the LLM through guided prompting for debugging and refactoring. We highlight the bi-directional translation of existing GPU benchmarks between OpenMP target offload and CUDA to validate LASSI. Evaluating LASSI with different application codes across four LLMs demonstrates its effectiveness in generating executable parallel codes, with 80% of OpenMP to CUDA translations and 85% of CUDA to OpenMP translations producing the expected output. We also observe that approximately 78% of OpenMP to CUDA translations and 62% of CUDA to OpenMP translations execute within 10% of, or faster than, the runtime of the original benchmark code in the same language.
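A minimal sketch of the self-correcting loop idea described above: compiler diagnostics from a failed build are fed back to the LLM through a follow-up prompt. The names query_llm and compiler_cmd are hypothetical placeholders, not the actual LASSI pipeline API.

```python
import os
import subprocess
import tempfile

def compile_candidate(src: str, compiler_cmd):
    """Write the generated source to a temp file and attempt to compile it."""
    fd, path = tempfile.mkstemp(suffix=".cu")
    with os.fdopen(fd, "w") as f:
        f.write(src)
    proc = subprocess.run(list(compiler_cmd) + [path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr

def self_correcting_translate(source_code, query_llm, compiler_cmd, max_rounds=3):
    """Translate, then iteratively repair the result using compiler feedback (hypothetical API)."""
    prompt = f"Translate this OpenMP target offload code to CUDA:\n{source_code}"
    candidate = query_llm(prompt)
    for _ in range(max_rounds):
        ok, diagnostics = compile_candidate(candidate, compiler_cmd)
        if ok:
            return candidate
        # Guided prompting: hand the compiler errors back to the LLM for debugging.
        prompt = (
            "The previous translation failed to compile.\n"
            f"Compiler output:\n{diagnostics}\n"
            f"Please fix the code:\n{candidate}"
        )
        candidate = query_llm(prompt)
    return candidate  # best effort after max_rounds
```

In the full pipeline, an analogous loop would also run the compiled binary and feed execution errors or incorrect outputs back to the LLM.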
Abstract:Time series forecasting is an important problem and plays a key role in a variety of applications, including weather forecasting, stock markets, and scientific simulations. Although Transformers have proven effective at capturing dependencies, the quadratic complexity of their attention mechanism hinders their adoption in long-range time series forecasting, limiting them to attending over short ranges. Recent progress on state space models (SSMs) has shown impressive performance in modeling long-range dependencies thanks to their subquadratic complexity. Mamba, as a representative SSM, enjoys linear time complexity and has achieved strong scalability on tasks that require scaling to long sequences, such as language, audio, and genomics. In this paper, we propose a hybrid framework, Mambaformer, that internally combines Mamba for long-range dependencies and the Transformer for short-range dependencies, targeting long-short range forecasting. To the best of our knowledge, this is the first work to combine the Mamba and Transformer architectures for time series data. We investigate possible hybrid architectures that combine Mamba layers and attention layers for long-short range time series forecasting. A comparative study shows that the Mambaformer family can outperform both Mamba and the Transformer on the long-short range time series forecasting problem. The code is available at https://github.com/XiongxiaoXu/Mambaformerin-Time-Series.
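As an illustration of the hybrid idea, one possible layer interleaves a Mamba block for long-range structure with self-attention for short-range refinement. This is a sketch of one candidate arrangement, not the exact Mambaformer layout; the Mamba block is assumed to come from the mamba_ssm package.

```python
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm

class HybridLayer(nn.Module):
    """One Mamba block (long-range, linear time) followed by self-attention (short-range)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.mamba = Mamba(d_model=d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        x = x + self.mamba(self.norm1(x))      # capture long-range dependencies
        h = self.norm2(x)
        attn_out, _ = self.attn(h, h, h)       # refine short-range dependencies
        return x + attn_out
```

Stacking several such layers, or swapping the order of the two sub-blocks, yields the kind of hybrid variants the comparative study explores.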
Abstract:In the field of high-performance computing (HPC), there has been recent exploration into the use of deep reinforcement learning for cluster scheduling (DRL scheduling), which has demonstrated promising outcomes. However, a significant challenge arises from the lack of interpretability of deep neural networks (DNNs), rendering them black-box models to system managers. This lack of model interpretability hinders the practical deployment of DRL scheduling. In this work, we present a framework called IRL (Interpretable Reinforcement Learning) to address the interpretability issue of DRL scheduling. The core idea is to interpret the DNN (i.e., the DRL policy) as a decision tree by utilizing imitation learning. Unlike DNNs, decision tree models are non-parametric and easily comprehensible to humans. To extract an effective and efficient decision tree, IRL incorporates the Dataset Aggregation (DAgger) algorithm and introduces the notion of critical states to prune the derived decision tree. Through trace-based experiments, we demonstrate that IRL is capable of converting a black-box DNN policy into an interpretable rule-based decision tree while maintaining comparable scheduling performance. Additionally, IRL can contribute to the setting of rewards in DRL scheduling.
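The sketch below illustrates the DAgger-style distillation step: the expert DNN policy labels every visited state, while rollouts gradually switch to the student tree so that the tree sees the states it will itself induce. The environment and policy interfaces are hypothetical placeholders; only the sklearn decision tree is a concrete dependency, and the critical-state pruning step is omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def dagger_distill(env, expert_policy, n_iters=5, episodes_per_iter=20, max_depth=8):
    """Distill a DNN scheduling policy into a decision tree via DAgger-style imitation."""
    states, actions = [], []
    student = None
    for _ in range(n_iters):
        for _ in range(episodes_per_iter):
            state, done = env.reset(), False       # placeholder environment interface
            while not done:
                expert_action = expert_policy(state)  # DNN policy labels the visited state
                states.append(state)
                actions.append(expert_action)
                # Roll out with the student once it exists, so it visits its own states.
                action = expert_action if student is None else student.predict([state])[0]
                state, done = env.step(action)
        student = DecisionTreeClassifier(max_depth=max_depth)
        student.fit(np.array(states), np.array(actions))
    return student
```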
Abstract:For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high performance computing (HPC) systems. However, increasingly complex HPC systems combined with highly diverse workloads make such manual processes challenging, time-consuming, and error-prone. We present a reinforcement learning-based HPC scheduling framework named DRAS-CQSim to automatically learn optimal scheduling policies. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, allowing system administrators to quickly obtain customized scheduling policies.
Abstract:The cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics. More aggressive optimization and automation are needed for cluster scheduling in HPC. In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) that leverages deep reinforcement learning. DRAS is built on a novel hierarchical neural network incorporating special HPC scheduling features such as resource reservation and backfilling. A unique training strategy is presented to enable DRAS to rapidly learn the target environment. Once provided with a specific scheduling objective by the system manager, DRAS automatically learns to improve its policy through interaction with the scheduling environment and dynamically adjusts its policy as workloads change. Experiments with different production workloads demonstrate that DRAS outperforms existing heuristic and optimization approaches by up to 45%.
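For intuition, the sketch below shows a generic policy-gradient interaction loop for DRL scheduling: the agent scores a window of waiting jobs and picks which one to start next. It is not the DRAS hierarchical network or its training strategy; the scheduling environment interface is a placeholder.

```python
import torch
import torch.nn as nn

class SchedPolicy(nn.Module):
    """Score each job in a fixed-size window of waiting jobs."""
    def __init__(self, n_features: int, window: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features * window, 128), nn.ReLU(),
            nn.Linear(128, window),              # one logit per candidate job
        )

    def forward(self, obs):                      # obs: (batch, window * n_features)
        return torch.distributions.Categorical(logits=self.net(obs))

def train_episode(policy, optimizer, env, gamma=0.99):
    """One REINFORCE update against the chosen scheduling objective (e.g., job slowdown)."""
    obs, done, log_probs, rewards = env.reset(), False, [], []
    while not done:
        dist = policy(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done = env.step(action.item())   # placeholder env interface
        rewards.append(reward)
    returns, g = [], 0.0
    for r in reversed(rewards):                       # discounted returns
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs).squeeze(-1) * torch.as_tensor(returns)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```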
Abstract:In this paper, we use the modeling and prediction tool MuMMI (Multiple Metrics Modeling Infrastructure) and ten machine learning methods to model and predict performance and power, and we compare their prediction error rates. We use a fault-tolerant linear algebra code and a fault-tolerant heat distribution code to conduct our modeling and prediction study on the Cray XC40 Theta and IBM BG/Q Mira at Argonne National Laboratory and the Intel Haswell cluster Shepard at Sandia National Laboratories. Our experimental results show that the prediction error rates in performance and power using MuMMI are less than 10% for most cases. Based on the models for runtime, node power, CPU power, and memory power, we identify the most significant performance counters for potential optimization efforts associated with the application characteristics and the target architectures, and we predict theoretical outcomes of the potential optimizations. When we compare the prediction accuracy of MuMMI with that of the ten machine learning methods, we observe that MuMMI not only results in more accurate predictions of both performance and power but also reveals how performance counters impact the performance and power models. This provides insights into how to fine-tune the applications and/or systems for energy efficiency.
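The model-comparison step can be sketched as follows: fit several regressors on performance-counter features and compare relative prediction error for runtime or power. The feature and target arrays are placeholders for MuMMI's counter data, and the three regressors shown merely stand in for the ten methods compared in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

def compare_models(X, y):
    """X: performance-counter features; y: measured runtime (or node/CPU/memory power)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    models = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
        "svr": SVR(),
    }
    errors = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        errors[name] = np.mean(np.abs(pred - y_te) / y_te) * 100   # mean % error
    return errors
```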