Abstract:Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance.
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs like BERT and DeBERTa, generative LLMs like GPT series and Llama series are currently the main focus due to their superior algorithmic performance. The advancements in generative LLMs are closely intertwined with the development of hardware capabilities. Various hardware platforms exhibit distinct hardware characteristics, which can help improve LLM inference performance. Therefore, this paper comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize different optimization methods for different platforms such as CPU, GPU, FPGA, ASIC, and PIM/NDP, and provide inference results for generative LLMs. Furthermore, we perform a qualitative and quantitative comparison of inference performance with batch sizes 1 and 8 on different hardware platforms by considering hardware power consumption, absolute inference speed (tokens/s), and energy efficiency (tokens/J). We compare the performance of the same optimization methods across different hardware platforms, the performance across different hardware platforms, and the performance of different methods on the same hardware platform. This provides a systematic and comprehensive summary of existing inference acceleration work by integrating software optimization methods and hardware platforms, which can point to the future trends and potential developments of generative LLMs and hardware technology for edge-side scenarios.
Abstract:We propose a Mamba accelerator with reconfigurable architecture, MARCA.We propose three novel approaches in this paper. (1) Reduction alternative PE array architecture for both linear and element-wise operations. For linear operations, the reduction tree connected to PE arrays is enabled and executes the reduction operation. For element-wise operations, the reduction tree is disabled and the output bypasses. (2) Reusable nonlinear function unit based on the reconfigurable PE. We decompose the exponential function into element-wise operations and a shift operation by a fast biased exponential algorithm, and the activation function (SiLU) into a range detection and element-wise operations by a piecewise approximation algorithm. Thus, the reconfigurable PEs are reused to execute nonlinear functions with negligible accuracy loss.(3) Intra-operation and inter-operation buffer management strategy. We propose intra-operation buffer management strategy to maximize input data sharing for linear operations within operations, and inter-operation strategy for element-wise operations between operations. We conduct extensive experiments on Mamba model families with different sizes.MARCA achieves up to 463.22$\times$/11.66$\times$ speedup and up to 9761.42$\times$/242.52$\times$ energy efficiency compared to Intel Xeon 8358P CPU and NVIDIA Tesla A100 GPU implementations, respectively.
Abstract:It has recently been discovered that using a pre-trained vision-language model (VLM), e.g., CLIP, to align a whole query image with several finer text descriptions generated by a large language model can significantly enhance zero-shot performance. However, in this paper, we empirically find that the finer descriptions tend to align more effectively with local areas of the query image rather than the whole image, and then we theoretically validate this finding. Thus, we present a method called weighted visual-text cross alignment (WCA). This method begins with a localized visual prompting technique, designed to identify local visual areas within the query image. The local visual areas are then cross-aligned with the finer descriptions by creating a similarity matrix using the pre-trained VLM. To determine how well a query image aligns with each category, we develop a score function based on the weighted similarities in this matrix. Extensive experiments demonstrate that our method significantly improves zero-shot performance across various datasets, achieving results that are even comparable to few-shot learning methods.
Abstract:The battery energy storage system (BESS) has immense potential for enhancing grid reliability and security through its participation in the electricity market. BESS often seeks various revenue streams by taking part in multiple markets to unlock its full potential, but effective algorithms for joint-market participation under price uncertainties are insufficiently explored in the existing research. To bridge this gap, we develop a novel BESS joint bidding strategy that utilizes deep reinforcement learning (DRL) to bid in the spot and contingency frequency control ancillary services (FCAS) markets. Our approach leverages a transformer-based temporal feature extractor to effectively respond to price fluctuations in seven markets simultaneously and helps DRL learn the best BESS bidding strategy in joint-market participation. Additionally, unlike conventional "black-box" DRL model, our approach is more interpretable and provides valuable insights into the temporal bidding behavior of BESS in the dynamic electricity market. We validate our method using realistic market prices from the Australian National Electricity Market. The results show that our strategy outperforms benchmarks, including both optimization-based and other DRL-based strategies, by substantial margins. Our findings further suggest that effective temporal-aware bidding can significantly increase profits in the spot and contingency FCAS markets compared to individual market participation.
Abstract:This paper studies the synergy of solar-battery energy storage system (BESS) and develops a viable strategy for the BESS to unlock its economic potential by serving as a backup to reduce solar curtailments while also participating in the electricity market. We model the real-time bidding of the solar-battery system as two Markov decision processes for the solar farm and the BESS, respectively. We develop a novel deep reinforcement learning (DRL) algorithm to solve the problem by leveraging attention mechanism (AC) and multi-grained feature convolution to process DRL input for better bidding decisions. Simulation results demonstrate that our AC-DRL outperforms two optimization-based and one DRL-based benchmarks by generating 23%, 20%, and 11% higher revenue, as well as improving curtailment responses. The excess solar generation can effectively charge the BESS to bid in the market, significantly reducing solar curtailments by 76% and creating synergy for the solar-battery system to be more viable.
Abstract:Large language models (LLMs) have demonstrated impressive abilities in various domains while the inference cost is expensive. The state-of-the-art methods use 2-bit quantization for mainstream LLMs. However, challenges still exist: (1) Nonnegligible accuracy loss for 2-bit quantization. Weights are quantized by groups, while the ranges of weights are large in some groups, resulting in large quantization errors and nonnegligible accuracy loss (e.g. >3% for Llama2-7b with 2-bit quantization in GPTQ and Greenbit). (2) Limited accuracy improvement by adding 4-bit weights. Increasing 10% extra average bit more 4-bit weights only leads to <0.5% accuracy improvement on a quantized Llama2-7b. (3) Time-consuming dequantization operations on GPUs. The dequantization operations lead to >50% execution time, hindering the potential of reducing LLM inference cost. To tackle these challenges, we propose the following techniques: (1) We only quantize a small fraction of groups with the larger range using 4-bit with memory alignment consideration on GPUs. (2) We point out that the distribution of the sparse outliers with larger weights is different in 2-bit and 4-bit groups, and only a small fraction of outliers require 16-bit quantization. Such design leads to >0.5% accuracy improvement with <3% average increased bit for Llama2-7b. (3) We design the asynchronous dequantization on GPUs, leading to up to 3.92X speedup. We conduct extensive experiments on different model families and model sizes. We achieve 2.85-bit for each weight and the end-to-end speedup for Llama2-7b is 1.74X over the original model, and we reduce both runtime cost and hardware cost by up to 2.70X and 2.81X with less GPU requirements.
Abstract:Wind energy has been rapidly gaining popularity as a means for combating climate change. However, the variable nature of wind generation can undermine system reliability and lead to wind curtailment, causing substantial economic losses to wind power producers. Battery energy storage systems (BESS) that serve as onsite backup sources are among the solutions to mitigate wind curtailment. However, such an auxiliary role of the BESS might severely weaken its economic viability. This paper addresses the issue by proposing joint wind curtailment reduction and energy arbitrage for the BESS. We decouple the market participation of the co-located wind-battery system and develop a joint-bidding framework for the wind farm and BESS. It is challenging to optimize the joint-bidding because of the stochasticity of energy prices and wind generation. Therefore, we leverage deep reinforcement learning to maximize the overall revenue from the spot market while unlocking the BESS's potential in concurrently reducing wind curtailment and conducting energy arbitrage. We validate the proposed strategy using realistic wind farm data and demonstrate that our joint-bidding strategy responds better to wind curtailment and generates higher revenues than the optimization-based benchmark. Our simulations also reveal that the extra wind generation used to be curtailed can be an effective power source to charge the BESS, resulting in additional financial returns.
Abstract:Global power systems are increasingly reliant on wind energy as a mitigation strategy for climate change. However, the variability of wind energy causes system reliability to erode, resulting in the wind being curtailed and, ultimately, leading to substantial economic losses for wind farm owners. Wind curtailment can be reduced using battery energy storage systems (BESS) that serve as onsite backup sources. Yet, this auxiliary role may significantly hamper the BESS's capacity to generate revenues from the electricity market, particularly in conducting energy arbitrage in the Spot market and providing frequency control ancillary services (FCAS) in the FCAS markets. Ideal BESS scheduling should effectively balance the BESS's role in absorbing onsite wind curtailment and trading in the electricity market, but it is difficult in practice because of the underlying coordination complexity and the stochastic nature of energy prices and wind generation. In this study, we investigate the bidding strategy of a wind-battery system co-located and participating simultaneously in both the Spot and Regulation FCAS markets. We propose a deep reinforcement learning (DRL)-based approach that decouples the market participation of the wind-battery system into two related Markov decision processes for each facility, enabling the BESS to absorb onsite wind curtailment while simultaneously bidding in the wholesale Spot and FCAS markets to maximize overall operational revenues. Using realistic wind farm data, we validated the coordinated bidding strategy for the wind-battery system and find that our strategy generates significantly higher revenue and responds better to wind curtailment compared to an optimization-based benchmark. Our results show that joint-market bidding can significantly improve the financial performance of wind-battery systems compared to individual market participation.
Abstract:The rapid adoption of residential solar photovoltaics (PV) has resulted in regular overvoltage events, due to correlated reverse power flows. Currently, PV inverters prevent damage to electronics by curtailing energy production in response to overvoltage. However, this disproportionately affects households at the far end of the feeder, leading to an unfair allocation of the potential value of energy produced. Globally optimizing for fair curtailment requires accurate feeder parameters, which are often unknown. This paper investigates reinforcement learning, which gradually optimizes a fair PV curtailment strategy by interacting with the system. We evaluate six fairness metrics on how well they can be learned compared to an optimal solution oracle. We show that all definitions permit efficient learning, suggesting that reinforcement learning is a promising approach to achieving both safe and fair PV coordination.