Abstract:Training large models is plagued by intense compute cost and limited hardware memory. A practical solution is low-precision representation, but this suffers from loss of numerical accuracy and unstable training, rendering the model less useful. We argue that low-precision floating points can perform well provided the error is properly compensated at critical locations in the training process. We propose Collage, which utilizes a multi-component float representation in low precision to perform operations accurately with numerical errors accounted for. To understand the impact of imprecision on training, we propose a simple and novel metric that tracks the information lost during training and differentiates various precision strategies. Our method works with commonly used low-precision formats such as half precision ($16$-bit floating points) and can be naturally extended to work with even lower precision, such as $8$-bit. Experimental results show that pre-training using Collage removes the requirement of keeping $32$-bit floating-point copies of the model and attains similar or better training performance compared to the $(16, 32)$-bit mixed-precision strategy, with up to $3.7\times$ speedup and $\sim 15\%$ to $23\%$ less memory usage in practice.
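The abstract does not spell out Collage's kernels; the sketch below only illustrates the general idea behind a multi-component float, namely representing a value as an unevaluated sum of low-precision components and recovering rounding error with an error-free transform. The function names, the two-component layout, and the compensated-summation update rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fast_two_sum(a, b):
    """Error-free transform: returns (s, err) with a + b == s + err exactly,
    assuming |a| >= |b|. Every operation is rounded to float16."""
    s = np.float16(a + b)
    err = np.float16(b - np.float16(s - a))
    return s, err

def compensated_sum_fp16(values):
    """Accumulate float16 values as a two-component float (sum, error)."""
    acc = np.float16(0.0)
    comp = np.float16(0.0)                                # running error component
    for v in values:
        v = np.float16(np.float16(v) + comp)              # fold stored error into the next addend
        a, b = (acc, v) if abs(acc) >= abs(v) else (v, acc)
        acc, comp = fast_two_sum(a, b)
    return acc, comp

# Plain float16 accumulation stalls once the running sum dwarfs the addend;
# the two-component representation keeps the lost low-order bits.
xs = np.full(10000, 1e-3, dtype=np.float16)               # true sum is ~10.0
naive = np.float16(0.0)
for x in xs:
    naive = np.float16(naive + x)
acc, comp = compensated_sum_fp16(xs)
print("naive fp16:", float(naive))
print("two-component fp16:", float(acc) + float(comp))
```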
Abstract:Multi-objective optimization is a class of decision-making problems in which multiple conflicting objectives are optimized. We study offline optimization of multi-objective policies from data collected by an existing policy. We propose a pessimistic estimator for multi-objective policy values that can be easily plugged into existing formulas for hypervolume computation and optimized. The estimator is based on inverse propensity scores (IPS) and improves upon a naive IPS estimator in both theory and experiments. Our analysis is general and applies beyond our IPS estimators and the methods for optimizing them. The pessimistic estimator can be optimized by policy gradients and performs well in all of our experiments.
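As a rough illustration of pessimistic IPS estimation, the sketch below computes a naive IPS value per objective from logged data and subtracts a confidence width. The Hoeffding-style penalty scaled by the largest importance weight is an assumed choice for illustration; the paper's exact estimator and penalty may differ.

```python
import numpy as np

def pessimistic_ips(rewards, logging_probs, target_probs, delta=0.05):
    """Lower-confidence-bound IPS estimate of a policy's value for each objective.

    rewards:       (n, k) observed reward vectors for k objectives, assumed in [0, 1]
    logging_probs: (n,) propensities of the logged actions under the logging policy
    target_probs:  (n,) probabilities of the same actions under the target policy
    """
    n, k = rewards.shape
    w = target_probs / logging_probs                       # importance weights
    ips = (w[:, None] * rewards).mean(axis=0)              # (k,) naive IPS estimate per objective
    # Illustrative Hoeffding-style confidence width, scaled by the largest weight.
    width = w.max() * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return ips - width                                     # pessimistic value per objective

# Toy usage: the per-objective pessimistic values could then be plugged into a
# hypervolume formula or a gradient-based policy optimizer.
rng = np.random.default_rng(0)
n, k = 1000, 2
rewards = rng.uniform(size=(n, k))
logging_probs = rng.uniform(0.2, 0.8, size=n)
target_probs = rng.uniform(0.2, 0.8, size=n)
print(pessimistic_ips(rewards, logging_probs, target_probs))
```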
Abstract:Meta-forecasting is a newly emerging field that combines meta-learning and time series forecasting. The goal of meta-forecasting is to train on a collection of source time series and generalize to new time series one at a time. Previous approaches to meta-forecasting achieve competitive performance, but with the restriction of training a separate model for each sampling frequency. In this work, we investigate meta-forecasting across different sampling frequencies and introduce a new model, the Continuous Frequency Adapter (CFA), specifically designed to learn frequency-invariant representations. We find that CFA greatly improves performance when generalizing to unseen frequencies, providing a first step towards forecasting over larger multi-frequency datasets.
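The abstract does not describe CFA's architecture, so the sketch below is purely hypothetical: it shows one generic way a single forecaster could be conditioned on sampling frequency, by embedding the frequency and modulating a shared encoder FiLM-style. All module and parameter names are invented for illustration and should not be read as the paper's design.

```python
import torch
import torch.nn as nn

class FrequencyAdapterSketch(nn.Module):
    """Hypothetical sketch: condition a shared forecaster on sampling frequency
    via feature-wise modulation. Not the architecture proposed in the paper."""
    def __init__(self, input_len, pred_len, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_len, hidden), nn.GELU())
        self.film = nn.Linear(1, 2 * hidden)      # log-frequency -> (scale, shift)
        self.head = nn.Linear(hidden, pred_len)

    def forward(self, x, freq_hz):                # x: (batch, input_len), freq_hz: (batch,)
        h = self.encoder(x)
        scale, shift = self.film(torch.log(freq_hz)[:, None]).chunk(2, dim=-1)
        h = h * (1 + scale) + shift               # frequency-conditioned representation
        return self.head(h)

# e.g. hourly series (1/3600 Hz) passed through the shared model
model = FrequencyAdapterSketch(input_len=48, pred_len=12)
y = model(torch.randn(8, 48), torch.full((8,), 1.0 / 3600))
print(y.shape)                                    # torch.Size([8, 12])
```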
Abstract:Transformer-based models have gained considerable popularity and demonstrated promising results in long-term time-series forecasting in recent years. In addition to learning attention in the time domain, recent works also explore learning attention in frequency domains (e.g., the Fourier domain, the wavelet domain), given that seasonal patterns can be better captured in these domains. In this work, we seek to understand the relationships between attention models in different time and frequency domains. Theoretically, we show that attention models in different domains are equivalent under linear conditions (i.e., a linear kernel applied to attention scores). Empirically, we analyze how attention models in different domains behave differently through various synthetic experiments with seasonality, trend, and noise, with emphasis on the role of the softmax operation therein. Both the theoretical and empirical analyses motivate us to propose a new method, TDformer (Trend Decomposition Transformer), which first applies seasonal-trend decomposition and then additively combines an MLP that predicts the trend component with Fourier attention that predicts the seasonal component to obtain the final prediction. Extensive experiments on benchmark time-series forecasting datasets demonstrate that TDformer achieves state-of-the-art performance against existing attention-based models.
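A minimal sketch of the decompose-then-combine idea, assuming a moving-average seasonal-trend decomposition, an MLP over the trend, and a much-simplified stand-in for Fourier attention (standard self-attention applied to rFFT coefficients of the seasonal part). Layer sizes, the decomposition kernel, and the attention-in-frequency formulation are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class SeriesDecomp(nn.Module):
    """Moving-average seasonal-trend decomposition."""
    def __init__(self, kernel_size=25):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2,
                                count_include_pad=False)
    def forward(self, x):                          # x: (batch, length, channels)
        trend = self.avg(x.transpose(1, 2)).transpose(1, 2)
        return x - trend, trend                    # seasonal, trend

class FourierAttention(nn.Module):
    """Simplified stand-in: self-attention over rFFT coefficients of the input."""
    def __init__(self, channels, heads=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(2 * channels, heads, batch_first=True)
    def forward(self, x):                          # x: (batch, length, channels)
        spec = torch.fft.rfft(x, dim=1)            # complex spectrum along time
        z = torch.cat([spec.real, spec.imag], dim=-1)
        z, _ = self.attn(z, z, z)                  # attention in the frequency domain
        c = z.shape[-1] // 2
        spec = torch.complex(z[..., :c], z[..., c:])
        return torch.fft.irfft(spec, n=x.shape[1], dim=1)

class TDformerSketch(nn.Module):
    """Trend (MLP) + seasonal (Fourier attention), combined additively."""
    def __init__(self, input_len, pred_len, channels, kernel_size=25):
        super().__init__()
        self.decomp = SeriesDecomp(kernel_size)
        self.trend_mlp = nn.Sequential(nn.Linear(input_len, pred_len), nn.GELU(),
                                       nn.Linear(pred_len, pred_len))
        self.seasonal_attn = FourierAttention(channels)
        self.seasonal_proj = nn.Linear(input_len, pred_len)
    def forward(self, x):                          # x: (batch, input_len, channels)
        seasonal, trend = self.decomp(x)
        trend_out = self.trend_mlp(trend.transpose(1, 2)).transpose(1, 2)
        seasonal_out = self.seasonal_proj(
            self.seasonal_attn(seasonal).transpose(1, 2)).transpose(1, 2)
        return trend_out + seasonal_out            # (batch, pred_len, channels)

model = TDformerSketch(input_len=96, pred_len=24, channels=7)
print(model(torch.randn(4, 96, 7)).shape)          # torch.Size([4, 24, 7])
```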