Ziqing Wen

Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training

Jan 13, 2025

Ziqing Wen, Ping Luo, Jiahuan Wang, Xiaoge Deng, Jinping Zou, Kun Yuan, Tao Sun, Dongsheng Li

Abstract: Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate memory-efficient methods beyond low-rank training and propose a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory required to maintain optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training without sacrificing performance. Through extensive experiments on both pre-training and fine-tuning tasks, we show that GWT achieves state-of-the-art performance compared with advanced memory-efficient optimizers and full-rank approaches in terms of both memory usage and training performance.
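As a rough illustration of why a wavelet transform can shrink optimizer state, the sketch below applies a one-level Haar transform to a toy gradient vector. The abstract does not name the wavelet or say how GWT plugs into the optimizer, so the choice of Haar, the function names, and the toy values are all assumptions made for illustration only.

```python
import math

def haar_forward(x):
    """One-level Haar transform of an even-length list: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

# Hypothetical toy gradient; a real gradient would be a large tensor.
grad = [0.5, 0.3, -0.2, -0.4, 1.0, 0.8, 0.0, 0.2]
approx, detail = haar_forward(grad)

# The approximation band has half as many entries as the gradient, so an
# optimizer state kept only on this band would be half the size (assumption:
# this is one plausible way to use the transform, not the paper's recipe).
reconstructed = haar_inverse(approx, detail)
```

With both bands kept, the transform is exactly invertible; any memory saving comes from tracking state on fewer coefficients than the original gradient has.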

Federated Prediction-Powered Inference from Decentralized Data

Sep 03, 2024

Ping Luo, Xiaoge Deng, Ziqing Wen, Tao Sun, Dongsheng Li

Abstract: In various domains, the increasing application of machine learning allows researchers to access inexpensive predictive data, which can be utilized as auxiliary data for statistical inference. Although such data are often unreliable compared to gold-standard datasets, Prediction-Powered Inference (PPI) has been proposed to ensure statistical validity despite the unreliability. However, the challenge of 'data silos' arises when the private gold-standard datasets are non-shareable for model training, leading to less accurate predictive models and invalid inferences. In this paper, we introduce the Federated Prediction-Powered Inference (Fed-PPI) framework, which addresses this challenge by enabling decentralized experimental data to contribute to statistically valid conclusions without sharing private information. The Fed-PPI framework involves training local models on private data, aggregating them through Federated Learning (FL), and deriving confidence intervals using PPI computation. The proposed framework is evaluated through experiments, demonstrating its effectiveness in producing valid confidence intervals.
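The final step, deriving a confidence interval with PPI, can be sketched for the simplest case of estimating a mean. This follows the commonly used prediction-powered mean estimator (predictions on unlabeled data plus a correction measured on labeled data); whether Fed-PPI uses exactly this form is not stated in the abstract, and the function name and toy numbers are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def ppi_mean_ci(y_labeled, pred_labeled, pred_unlabeled, alpha=0.05):
    """Prediction-powered confidence interval for a mean (illustrative sketch)."""
    n, big_n = len(y_labeled), len(pred_unlabeled)
    # Rectifier: how far the model's predictions are from the gold-standard labels.
    rectifier = [y - p for y, p in zip(y_labeled, pred_labeled)]
    theta = mean(pred_unlabeled) + mean(rectifier)
    # Standard error combines the two independent sample means.
    se = math.sqrt(variance(pred_unlabeled) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta - z * se, theta, theta + z * se

# Hypothetical toy data: a small labeled set and a larger unlabeled set.
y_labeled = [1.0, 2.0, 3.0, 4.0]
pred_labeled = [1.1, 1.9, 3.2, 3.8]
pred_unlabeled = [2.0, 2.5, 3.0, 2.2, 2.8, 2.6]

lower, estimate, upper = ppi_mean_ci(y_labeled, pred_labeled, pred_unlabeled)
```

In a federated setting the predictions would come from the aggregated model rather than a centrally trained one; that aggregation step is not shown here.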

Score-based Generative Models with Adaptive Momentum

May 22, 2024

Ziqing Wen, Xiaoge Deng, Ping Luo, Tao Sun, Dongsheng Li

Abstract: Score-based generative models have demonstrated significant practical success in data-generating tasks. The models establish a diffusion process that perturbs the ground truth data to Gaussian noise and then learn the reverse process to transform noise into data. However, existing denoising methods such as Langevin dynamics and numerical stochastic differential equation solvers benefit from randomness but generate data slowly, requiring a large number of score function evaluations, whereas ordinary differential equation solvers sample faster but lack randomness, which may degrade sample quality. To this end, motivated by Stochastic Gradient Descent (SGD) optimization methods and the close connection between the model sampling process and SGD, we propose adaptive momentum sampling to accelerate the transforming process without introducing additional hyperparameters. Theoretically, we prove that our method converges under given conditions. In addition, we empirically show that our sampler can produce more faithful images/graphs in a small number of sampling steps with a 2 to 5 times speed-up and obtain competitive scores compared to the baselines on image and graph generation tasks.
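For context, the Langevin dynamics baseline mentioned above can be sketched in one dimension. This is plain unadjusted Langevin sampling, not the adaptive momentum sampler the paper proposes (the abstract does not give its update rule). The analytic Gaussian score, the step size, and all names are assumptions chosen so the example runs without a trained model.

```python
import math
import random

def langevin_sample(score, x0, step, n_steps, rng):
    """Unadjusted Langevin dynamics: drift along the score plus Gaussian noise."""
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * score(x) + math.sqrt(2.0 * step) * noise
    return x

def gaussian_score(x, mean=3.0):
    """Score (gradient of log density) of a unit-variance Gaussian; stands in for a learned model."""
    return -(x - mean)

rng = random.Random(0)
# Each chain needs one score evaluation per step, which is the cost the abstract refers to.
samples = [langevin_sample(gaussian_score, 0.0, 0.01, 500, rng) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
```

Every chain here spends 500 score evaluations to reach the target distribution, which illustrates why reducing the number of sampling steps matters.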

Accelerating Federated Learning by Selecting Beneficial Herd of Local Gradients

Mar 25, 2024

Ping Luo, Xiaoge Deng, Ziqing Wen, Tao Sun, Dongsheng Li

Abstract: Federated Learning (FL) is a distributed machine learning framework in communication network systems. However, the systems' Non-Independent and Identically Distributed (Non-IID) data negatively affect the convergence efficiency of the global model, since only a subset of these data samples is beneficial for model convergence. In pursuit of this subset, a reliable approach involves determining a measure of validity to rank the samples within the dataset. In this paper, we propose the BHerd strategy, which selects a beneficial herd of local gradients to accelerate the convergence of the FL model. Specifically, we map the distribution of the local dataset to the local gradients and use the Herding strategy to obtain a permutation of the set of gradients, where gradients earlier in the permutation are closer to the average of the set of gradients. The top portion of the gradients is then selected and sent to the server for global aggregation. We conduct experiments on different datasets, models and scenarios by building a prototype system, and experimental results demonstrate that our BHerd strategy is effective in selecting beneficial local gradients, mitigating the effects of Non-IID data and thus accelerating model convergence.
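The selection step can be sketched with a greedy herding pass: center the gradients on their mean, then repeatedly pick the gradient that keeps the running sum of selected (centered) gradients closest to zero. This is one common form of herding and is an assumption; the abstract does not spell out BHerd's exact procedure, and the function names and toy gradients are hypothetical.

```python
def herd_order(grads):
    """Return indices of grads in greedy herding order (illustrative sketch)."""
    dim = len(grads[0])
    mu = [sum(g[k] for g in grads) / len(grads) for k in range(dim)]
    centered = [[g[k] - mu[k] for k in range(dim)] for g in grads]
    remaining = list(range(len(grads)))
    running = [0.0] * dim
    order = []
    while remaining:
        # Pick the gradient that keeps the running centered sum smallest in norm.
        best = min(
            remaining,
            key=lambda i: sum((running[k] + centered[i][k]) ** 2 for k in range(dim)),
        )
        order.append(best)
        remaining.remove(best)
        running = [running[k] + centered[best][k] for k in range(dim)]
    return order

def select_top(grads, fraction):
    """Keep the leading fraction of the herding permutation."""
    keep = max(1, int(len(grads) * fraction))
    return [grads[i] for i in herd_order(grads)[:keep]]

# Hypothetical per-sample gradients; the last one is an outlier.
grads = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [5.0, 5.0]]
order = herd_order(grads)
selected = select_top(grads, 0.5)
```

In this toy example the first gradient chosen is the one nearest the mean of the set, matching the abstract's description that earlier entries in the permutation lie closer to the average.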
