Linbo Qiao

A Survey on Memory-Efficient Large-Scale Model Training in AI for Science

Jan 21, 2025

Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li

Abstract: Scientific research faces high costs and inefficiencies with traditional methods, but the rise of deep learning and large language models (LLMs) offers innovative solutions. This survey reviews LLM applications across scientific fields such as biology, medicine, chemistry, and meteorology, underscoring their role in advancing research. However, the continuous expansion of model size has led to significant memory demands, hindering further development and application of LLMs for science. To address this, we review memory-efficient training techniques for LLMs based on the transformer architecture, including distributed training, mixed precision training, and gradient checkpointing. Using AlphaFold 2 as an example, we demonstrate how tailored memory optimization methods can reduce storage needs while preserving prediction accuracy. We also discuss the challenges of memory optimization in practice and potential future directions, hoping to provide valuable insights for researchers and engineers.

Two-stage Generative Question Answering on Temporal Knowledge Graph Using Large Language Models

Feb 26, 2024

Yifu Gao, Linbo Qiao, Zhigang Kan, Zhihua Wen, Yongquan He, Dongsheng Li

Abstract: Temporal knowledge graph question answering (TKGQA) is a challenging task because of the temporal constraints hidden in questions and the answers sought from dynamic structured knowledge. Although large language models (LLMs) have made considerable progress in their reasoning ability over structured data, their application to the TKGQA task is relatively unexplored. This paper proposes a novel generative temporal knowledge graph question answering framework, GenTKGQA, which guides LLMs to answer temporal questions through two phases: Subgraph Retrieval and Answer Generation. First, we exploit the LLM's intrinsic knowledge to mine temporal constraints and structural links in the questions without extra training, thus narrowing down the subgraph search space in both temporal and structural dimensions. Next, we design virtual knowledge indicators to fuse the graph neural network signals of the subgraph and the text representations of the LLM in a non-shallow way, which helps the open-source LLM deeply understand the temporal order and structural dependencies among the retrieved facts through instruction tuning. Experimental results demonstrate that our model outperforms state-of-the-art baselines, even achieving 100% on the metrics for the simple question type.

TFDMNet: A Novel Network Structure Combines the Time Domain and Frequency Domain Features

Jan 29, 2024

Hengyue Pan, Yixin Chen, Zhiliang Tian, Peng Qiao, Linbo Qiao, Dongsheng Li

Abstract: Convolutional neural networks (CNNs) have achieved impressive success in computer vision during the past few decades. The image convolution operation helps CNNs perform well on image-related tasks. However, it also has high computational complexity and is hard to parallelize. This paper proposes a novel Element-wise Multiplication Layer (EML) to replace convolution layers, which can be trained in the frequency domain. Theoretical analyses show that EMLs lower the computational complexity and are easier to parallelize. Moreover, we introduce a Weight Fixation mechanism to alleviate over-fitting, and analyze the behavior of Batch Normalization and Dropout in the frequency domain. To balance computational complexity against memory usage, we propose a new network structure, the Time-Frequency Domain Mixture Network (TFDMNet), which combines the advantages of both convolution layers and EMLs. Experimental results imply that TFDMNet achieves good performance on the MNIST, CIFAR-10 and ImageNet databases with fewer operations than the corresponding CNNs.

* This paper is the updated edition of our paper Learning Convolutional Neural Networks in the Frequency Domain (arXiv:2204.06718). Compared with the previous edition, we design a mixture model to balance computational complexity against memory usage.
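
The idea of replacing convolution with element-wise multiplication rests on the convolution theorem: a circular convolution in the time domain equals a pointwise product in the frequency domain. The sketch below checks this for a 1-D signal using a naive DFT. It is only an illustration of the underlying identity, not the paper's EML layer; the function names and the sample signal and kernel are made up for the example.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(
            values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)
        )
        # The inverse transform carries the 1/n normalisation.
        result.append(total / n if inverse else total)
    return result


def circular_conv(signal, kernel):
    """Direct circular convolution in the time domain."""
    n = len(signal)
    return [
        sum(signal[j] * kernel[(i - j) % n] for j in range(n))
        for i in range(n)
    ]


def frequency_domain_conv(signal, kernel):
    """Same result via transform, element-wise multiply, inverse transform."""
    spectrum = [a * b for a, b in zip(dft(signal), dft(kernel))]
    return [value.real for value in dft(spectrum, inverse=True)]


# Hypothetical toy inputs; the kernel is zero-padded to the signal length.
signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.0, 0.0]
direct = circular_conv(signal, kernel)
via_frequency = frequency_domain_conv(signal, kernel)
```

Both paths agree up to floating-point error. With a fast transform in place of the naive DFT, the pointwise product costs O(n) per filter, which is where the claimed reduction in operations would come from.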

A Unified Generative Framework based on Prompt Learning for Various Information Extraction Tasks

Sep 23, 2022

Zhigang Kan, Linhui Feng, Zhangyue Yin, Linbo Qiao, Xipeng Qiu, Dongsheng Li

Abstract: Prompt learning is an effective paradigm that bridges gaps between the pre-training tasks and the corresponding downstream applications. Approaches based on this paradigm have achieved outstanding results in various applications. However, how to design a unified framework based on the prompt learning paradigm for various information extraction tasks remains an open question. In this paper, we propose a novel composable prompt-based generative framework, which can be applied to a wide range of tasks in the field of Information Extraction. Specifically, we reformulate information extraction tasks as filling slots in pre-designed type-specific prompts, which consist of one or multiple sub-prompts. A strategy of constructing composable prompts is proposed to enhance the generalization ability to extract events in data-scarce scenarios. Furthermore, to fit this framework, we transform Relation Extraction into the task of determining semantic consistency in prompts. The experimental results demonstrate that our approach surpasses the compared baselines on real-world datasets in data-abundant and data-scarce scenarios. Further analysis of the proposed framework is presented, along with numerical experiments that investigate the factors affecting performance on various tasks.

Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models

Jun 21, 2022

Zhiquan Lai, Shengwei Li, Xudong Tang, Keshi Ge, Weijie Liu, Yabo Duan, Linbo Qiao, Dongsheng Li

Abstract: Foundation models are becoming the dominant deep learning technologies. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being computing-intensive, the training process is extremely memory-intensive and communication-intensive. These features make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism and tensor model parallelism, to achieve high training efficiency. To achieve this goal, custom software frameworks such as Megatron-LM and DeepSpeed have been developed. However, current 3D parallelism frameworks still face two issues: i) they are not transparent to model developers, who need to manually modify the model to parallelize training; ii) their utilization of computation, GPU memory and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak automatically deploys with an automatic model partitioner, which uses a graph sharding algorithm on a proxy representation of the model. Merak also presents a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit available training resources, including a shifted critical path pipeline schedule that brings higher computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show that Merak can speed up training over state-of-the-art 3D parallelism frameworks for models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
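
To make the 3D parallelism layout concrete, the sketch below enumerates the ways 64 GPUs can be split into data-parallel, pipeline-parallel and tensor-parallel degrees, and maps a global rank to its coordinate in one such grid. This is a generic illustration, not Merak's API or partitioner. The function names are hypothetical, and the rank ordering (tensor index varying fastest, then pipeline, then data) is an assumption made for the example.

```python
def parallel_layouts(world_size):
    """All (data, pipeline, tensor) degree triples whose product is world_size."""
    layouts = []
    for tensor in range(1, world_size + 1):
        if world_size % tensor:
            continue
        remaining = world_size // tensor
        for pipeline in range(1, remaining + 1):
            if remaining % pipeline == 0:
                layouts.append((remaining // pipeline, pipeline, tensor))
    return layouts


def rank_to_coords(rank, data, pipeline, tensor):
    """Map a global rank to (data_idx, pipeline_idx, tensor_idx).

    Assumed ordering: tensor index varies fastest, data index slowest.
    """
    if not 0 <= rank < data * pipeline * tensor:
        raise ValueError("rank outside the device grid")
    tensor_idx = rank % tensor
    pipeline_idx = (rank // tensor) % pipeline
    data_idx = rank // (tensor * pipeline)
    return data_idx, pipeline_idx, tensor_idx


layouts_for_64 = parallel_layouts(64)
# Every rank of a 4 x 4 x 4 grid gets a distinct coordinate.
coords_4x4x4 = [rank_to_coords(rank, 4, 4, 4) for rank in range(64)]
```

Under the assumed ordering, ranks that share a tensor-parallel group are adjacent, which is a common choice when those ranks communicate most often.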

DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation

Mar 30, 2022

Yu Tang, Chenyu Wang, Yufan Zhang, Yuliang Liu, Xingcheng Zhang, Linbo Qiao, Zhiquan Lai, Dongsheng Li

Abstract: The further development of deep neural networks is hampered by limited GPU memory. Therefore, optimizing GPU memory resources is in high demand. Swapping and recomputation are commonly applied to make better use of GPU memory in deep learning. However, as an emerging domain, several challenges remain: 1) The efficiency of recomputation is limited for both static and dynamic methods. 2) Swapping requires offloading parameters manually, which incurs a great time cost. 3) There is currently no dynamic, fine-grained method that combines tensor swapping with tensor recomputation. To remedy the above issues, we propose a novel scheduler manager named DELTA (Dynamic tEnsor offLoad and recompuTAtion). To the best of our knowledge, we are the first to build a reasonable dynamic runtime scheduler on the combination of tensor swapping and tensor recomputation without user oversight. In DELTA, we propose a filter algorithm to select the optimal tensors to be released from GPU memory and present a director algorithm to select a proper action for each of these tensors. Furthermore, prefetching and overlapping are deliberately considered to overcome the time cost caused by swapping and recomputing tensors. Experimental results show that DELTA not only saves 40%-70% of GPU memory, surpassing the state-of-the-art method to a great extent, but also achieves convergence results comparable to the baseline with acceptable time delay. Also, DELTA achieves a 2.04× larger maximum batch size when training ResNet-50 and 2.25× when training ResNet-101 compared with the baseline. Besides, comparisons between the swapping cost and recomputation cost in our experiments demonstrate the importance of a reasonable dynamic scheduler for tensor swapping and tensor recomputation, which refutes the arguments in some related work that swapping should be the first and best choice.

* 12 pages, 6 figures

Inertial Proximal Deep Learning Alternating Minimization for Efficient Neutral Network Training

Jan 30, 2021

Linbo Qiao, Tao Sun, Hengyue Pan, Dongsheng Li

Abstract: In recent years, Deep Learning Alternating Minimization (DLAM), which is alternating minimization applied to the penalty form of deep neural network training, has been developed as an alternative algorithm to overcome several drawbacks of Stochastic Gradient Descent (SGD) algorithms. This work develops an improved DLAM using the well-known inertial technique, namely iPDLAM, which predicts a point by linearization of the current and last iterates. To further speed up training, we apply a warm-up technique to the penalty parameter, that is, starting with a small initial value and increasing it over the iterations. Numerical results on real-world datasets are reported to demonstrate the efficiency of our proposed algorithm.
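
The two ingredients named above can be sketched in a few lines: an inertial step that extrapolates from the current and last iterates, and a penalty parameter that starts small and grows. The code is a minimal illustration under stated assumptions, not the iPDLAM algorithm. The inertial step is applied to plain gradient descent on a toy quadratic rather than to the DLAM subproblems, the geometric warm-up schedule and its cap are assumptions, and all names and default values are hypothetical.

```python
def inertial_point(current, previous, beta):
    """Extrapolate along the direction from the last iterate to the current one."""
    return [c + beta * (c - p) for c, p in zip(current, previous)]


def warmup_penalty(step, rho_init=0.1, growth=1.5, rho_max=10.0):
    """Assumed geometric warm-up: start small, grow each step, stop at a cap."""
    return min(rho_max, rho_init * growth ** step)


def minimize_quadratic(target, steps=200, lr=0.1, beta=0.5):
    """Minimise sum((x - target)^2) with an inertial gradient step."""
    previous = [0.0] * len(target)
    current = [0.0] * len(target)
    for _ in range(steps):
        predicted = inertial_point(current, previous, beta)
        gradient = [2.0 * (p - t) for p, t in zip(predicted, target)]
        updated = [p - lr * g for p, g in zip(predicted, gradient)]
        previous, current = current, updated
    return current


solution = minimize_quadratic([3.0, -1.0])
penalties = [warmup_penalty(step) for step in range(20)]
```

The gradient is evaluated at the predicted point rather than at the current iterate, which is what distinguishes the inertial step from a plain gradient step.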

ADMMiRNN: Training RNN with Stable Convergence via An Efficient ADMM Approach

Jun 17, 2020

Yu Tang, Zhigang Kan, Dequan Sun, Linbo Qiao, Jingjing Xiao, Zhiquan Lai, Dongsheng Li

Abstract: It is hard to train a Recurrent Neural Network (RNN) with stable convergence while avoiding vanishing and exploding gradients, as the weights in the recurrent unit are repeated from iteration to iteration. Moreover, RNNs are sensitive to the initialization of weights and biases, which makes the training phase difficult. With its gradient-free nature and immunity to poor conditioning, the Alternating Direction Method of Multipliers (ADMM) has become a promising algorithm for training neural networks beyond traditional stochastic gradient algorithms. However, ADMM cannot be applied to train RNNs directly, since the state in the recurrent unit is repeatedly updated over timesteps. Therefore, this work builds a new framework named ADMMiRNN upon the unfolded form of the RNN to address the above challenges simultaneously, and provides novel update rules and a theoretical convergence analysis. We explicitly specify key update rules in the iterations of ADMMiRNN with deliberately constructed approximation techniques and solutions to each subproblem instead of vanilla ADMM. Numerical experiments are conducted on MNIST and text classification tasks, where ADMMiRNN achieves convergent results and outperforms the compared baselines. Furthermore, ADMMiRNN trains RNNs in a more stable way, without vanishing or exploding gradients, compared to stochastic gradient algorithms. Source code is available at https://github.com/TonyTangYu/ADMMiRNN.

* 17 pages, 11 figures

Event Arguments Extraction via Dilate Gated Convolutional Neural Network with Enhanced Local Features

Jun 02, 2020

Zhigang Kan, Linbo Qiao, Sen Yang, Feng Liu, Feng Huang

Abstract: Event extraction plays an important role in information extraction for understanding the world. Event extraction can be split into two subtasks: event trigger extraction and event arguments extraction. However, the F-score of event arguments extraction is much lower than that of event trigger extraction: in the most recent work, event trigger extraction achieves 80.7%, while event arguments extraction achieves only 58%. In pipelined structures, the difficulty of event arguments extraction lies in its lack of classification features and its much higher computational cost. In this work, we propose a novel event extraction approach based on a multi-layer Dilate Gated Convolutional Neural Network (EE-DGCNN), which has fewer parameters. In addition, enhanced local information is incorporated into word features to assign event argument roles for triggers predicted by the first subtask. The numerical experiments demonstrate significant performance improvement over state-of-the-art event extraction approaches on real-world datasets. Further analysis of the extraction procedure is presented, and experiments are conducted to analyze the factors behind the performance improvement.

An Efficient ADMM-Based Algorithm to Nonconvex Penalized Support Vector Machines

Sep 11, 2018

Lei Guan, Linbo Qiao, Dongsheng Li, Tao Sun, Keshi Ge, Xicheng Lu

Abstract: Support vector machines (SVMs) with sparsity-inducing nonconvex penalties have received considerable attention for their ability to perform classification and variable selection automatically. However, it is quite challenging to solve nonconvex penalized SVMs due to their nondifferentiability, nonsmoothness and nonconvexity. In this paper, we propose an efficient ADMM-based algorithm for nonconvex penalized SVMs. The proposed algorithm covers a large class of commonly used nonconvex regularization terms, including the smooth clipped absolute deviation (SCAD) penalty, minimax concave penalty (MCP), log-sum penalty (LSP) and capped-$\ell_1$ penalty. The computational complexity analysis shows that the proposed algorithm enjoys low computational cost. Moreover, the convergence of the proposed algorithm is guaranteed. Extensive experimental evaluations on five benchmark datasets demonstrate the superior performance of the proposed algorithm over three other state-of-the-art approaches.
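
For reference, the four penalties named above can be written out per coordinate in their commonly used textbook forms. The sketch below follows those standard definitions; the parameter names and default values (for example a = 3.7 for SCAD) are conventional choices assumed here and may differ from the parameterization used in the paper.

```python
import math


def scad(t, lam=1.0, a=3.7):
    """Smoothly clipped absolute deviation penalty (requires a > 2)."""
    x = abs(t)
    if x <= lam:
        return lam * x
    if x <= a * lam:
        return (2 * a * lam * x - x * x - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def mcp(t, lam=1.0, gamma=3.0):
    """Minimax concave penalty (requires gamma > 0)."""
    x = abs(t)
    if x <= gamma * lam:
        return lam * x - x * x / (2 * gamma)
    return gamma * lam * lam / 2


def lsp(t, lam=1.0, theta=1.0):
    """Log-sum penalty."""
    return lam * math.log(1 + abs(t) / theta)


def capped_l1(t, lam=1.0, theta=2.0):
    """Capped l1 penalty: grows like lam * |t| and then stays flat."""
    return lam * min(abs(t), theta)


# All four stop growing linearly for large |t|, unlike the convex l1 penalty;
# SCAD, MCP and capped-l1 become exactly flat.
large_value_penalties = [scad(10.0), mcp(10.0), capped_l1(10.0)]
```

Because large coefficients are penalized less than under the l1 norm, these penalties shrink strong features less, at the price of a nonconvex objective.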
