Yiwu Yao

Dynamic Low-Rank Sparse Adaptation for Large Language Models

Feb 20, 2025

Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Yang Liu, Jing Lin, Yiwu Yao, Rongrong Ji

Abstract: Although network sparsity eases the deployment burden of Large Language Models (LLMs), it causes significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune sparse LLMs is an intuitive way to counter this, but it has two shortcomings: 1) the LoRA weights cannot be integrated into the sparse LLM after training, and 2) performance recovery is insufficient at high sparsity ratios. In this paper, we introduce dynamic Low-rank Sparse Adaptation (LoSA), a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, thus guaranteeing that the LoRA module can be integrated into the sparse LLM after training. In addition, LoSA uses Representation Mutual Information (RMI) as an indicator of layer importance, thereby efficiently determining the layer-wise sparsity rates during fine-tuning. Building on this, LoSA adjusts the rank of each LoRA module based on the variability in layer-wise reconstruction errors, allocating an appropriate amount of fine-tuning to each layer to reduce the output discrepancies between dense and sparse LLMs. Extensive experiments show that LoSA can efficiently boost the performance of sparse LLMs within a few hours, without introducing any additional inference overhead. For example, LoSA reduced the perplexity of sparse LLaMA-2-7B by 68.73 and increased zero-shot accuracy by 16.32%, achieving a 2.60× speedup on CPU and a 2.23× speedup on GPU, while requiring only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.

* Accepted to ICLR 2025
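
The idea of sparsifying the LoRA update so it can be merged into a sparse weight matrix can be illustrated with a minimal sketch. It is not taken from the LoSA repository: the function names, the explicit boolean mask, and the toy matrices are all assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_masked_lora(w_sparse, mask, lora_b, lora_a):
    """Merge a low-rank update into a sparse weight matrix.

    The update B @ A is applied only where mask is True, so the merged
    matrix keeps exactly the sparsity pattern of w_sparse.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + d if keep else 0.0 for w, d, keep in zip(w_row, d_row, m_row)]
        for w_row, d_row, m_row in zip(w_sparse, delta, mask)
    ]


# Toy 2x2 example (hypothetical values): rank-1 update, 50% sparsity.
w_sparse = [[1.0, 0.0], [0.0, 2.0]]
mask = [[True, False], [False, True]]
lora_b = [[1.0], [2.0]]  # shape (2, 1)
lora_a = [[0.5, 0.5]]    # shape (1, 2)

merged = merge_masked_lora(w_sparse, mask, lora_b, lora_a)
print(merged)
```

Without the mask, B @ A would be dense and the merged matrix would lose its sparsity, which is the first shortcoming the abstract names.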

KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

Feb 06, 2025

Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan

Abstract: KV cache quantization can improve Large Language Model (LLM) inference throughput and latency in long-context and large-batch-size scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues: they overlook layer-wise sensitivity to KV cache quantization, incur high overhead from online fine-grained decision-making, and offer low flexibility across different LLMs and constraints. Therefore, we thoroughly analyze the inherent correlation between layer-wise transformer attention patterns and KV cache quantization errors, and study why the key cache is more important than the value cache for quantization error reduction. We further propose a simple yet effective framework, KVTuner, that uses multi-objective optimization to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache, and directly uses the offline-searched configurations during online inference. To reduce the computational cost of offline calibration, we use intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed-precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 38.3% compared with KV8 quantization over various context lengths.
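
To show how layer-wise precision pairs lead to a fractional average bit-width, here is a minimal sketch. The min-max quantizer, the function names, and the two-layer configuration are illustrative assumptions, not KVTuner's search procedure or its reported configurations.

```python
def quantize_dequantize(values, bits):
    """Uniform min-max quantization of a list, followed by dequantization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def average_bits(layer_pairs):
    """Mean bit-width over every key cache and value cache in the model."""
    total = sum(k_bits + v_bits for k_bits, v_bits in layer_pairs)
    return total / (2 * len(layer_pairs))


# Hypothetical (key_bits, value_bits) pair per layer; keys get more bits
# because the abstract reports that the key cache is the more sensitive one.
layer_pairs = [(8, 4), (4, 2)]
avg = average_bits(layer_pairs)

# Coarser precision collapses values onto fewer levels.
coarse = quantize_dequantize([0.0, 1.0, 2.0, 3.0], bits=1)
fine = quantize_dequantize([0.0, 1.0, 2.0, 3.0], bits=2)
print(avg, coarse, fine)
```

Because the pairs are fixed offline, inference only needs a per-layer table lookup, which avoids the online decision overhead the abstract mentions.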

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads

Jul 22, 2024

Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, Gongyi Wang

Abstract: The memory and computational demands of the Key-Value (KV) cache present significant challenges for deploying long-context language models. Previous approaches attempt to mitigate this issue by selectively dropping tokens, which irreversibly erases critical information that might be needed for future queries. In this paper, we propose a novel compression technique for the KV cache that preserves all token information. Our investigation reveals that: i) most attention heads primarily focus on the local context; ii) only a few heads, denoted as retrieval heads, can essentially pay attention to all input tokens. These key observations motivate us to use separate caching strategies for different attention heads. We therefore propose RazorAttention, a training-free KV cache compression algorithm that maintains a full cache for these crucial retrieval heads and discards the remote tokens in non-retrieval heads. Furthermore, we introduce a novel mechanism involving a "compensation token" to further recover the information in the dropped tokens. Extensive evaluations across a diverse set of large language models (LLMs) demonstrate that RazorAttention reduces KV cache size by over 70% without noticeable impact on performance. Additionally, RazorAttention is compatible with FlashAttention, making it an efficient, plug-and-play solution that enhances LLM inference efficiency without overhead or retraining of the original model.
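
The per-head caching policy can be sketched as follows. This is a rough illustration rather than the authors' implementation: the function name is hypothetical, the head classification is taken as given, and forming the compensation token as the mean of the dropped entries is an assumption, since the abstract does not say how it is built.

```python
def compress_head_cache(entries, is_retrieval_head, window):
    """Return the cache entries kept for one attention head.

    Retrieval heads keep everything. Other heads keep only the most recent
    `window` entries, plus one compensation entry that summarizes the
    dropped ones (assumed here to be their element-wise mean).
    """
    if is_retrieval_head or len(entries) <= window:
        return list(entries)
    split = len(entries) - window
    dropped, recent = entries[:split], entries[split:]
    dim = len(dropped[0])
    compensation = [sum(e[i] for e in dropped) / len(dropped) for i in range(dim)]
    return [compensation] + recent


# Four cached key vectors for one head (toy values).
keys = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]

full = compress_head_cache(keys, is_retrieval_head=True, window=2)
local = compress_head_cache(keys, is_retrieval_head=False, window=2)
print(len(full), local)
```

Since only a few heads are retrieval heads, most heads store `window + 1` entries instead of the full sequence, which is where the cache savings come from.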

Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs

Oct 17, 2023

Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, Rongrong Ji

Abstract: Ever-larger large language models (LLMs), though opening a potential path toward artificial general intelligence, pose a daunting obstacle to on-device deployment. As one of the most well-established pre-LLM approaches to reducing model complexity, network pruning appears to lag behind in the era of LLMs, mostly because of the costly fine-tuning (or re-training) it requires given the massive volumes of model parameters and training data. To close this industry-academia gap, we introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach that slightly updates sparse LLMs without expensive backpropagation or any weight updates. Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs by performing iterative weight pruning-and-growing on top of sparse LLMs. To this end, DSnoT takes into account the anticipated reduction in reconstruction error for pruning and growing, as well as the variance w.r.t. different input data for growing each weight. This practice can be executed efficiently in linear time since it obviates the need for backpropagation when fine-tuning LLMs. Extensive experiments on LLaMA-V1/V2, Vicuna, and OPT across various benchmarks demonstrate the effectiveness of DSnoT in enhancing the performance of sparse LLMs, especially at high sparsity levels. For instance, DSnoT is able to outperform the state-of-the-art Wanda by 26.79 perplexity at 70% sparsity with LLaMA-7B. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and opens new avenues to scale the great potential of sparsity to LLMs. Code is available at https://github.com/zyxxmu/DSnoT.
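
The pruning-and-growing step can be illustrated for a single output neuron. The sketch below is a simplified stand-in and not DSnoT's criterion: it searches all swaps by brute force instead of using the paper's linear-time rule, it ignores the input-variance term, and all names and values are hypothetical. What it shares with the description above is that only the mask changes, never the weight values.

```python
def reconstruction_error(weights, mask, mean_inputs):
    """Dense output minus sparse output for one neuron on the mean input."""
    return sum(w * x for w, keep, x in zip(weights, mask, mean_inputs) if not keep)


def prune_and_grow_once(weights, mask, mean_inputs):
    """Swap one pruned weight for one kept weight if that lowers |error|.

    The number of kept weights, and therefore the sparsity, is unchanged.
    """
    best_err = abs(reconstruction_error(weights, mask, mean_inputs))
    best_mask = list(mask)
    for i, kept_i in enumerate(mask):
        if kept_i:
            continue  # i must currently be pruned (candidate to grow)
        for j, kept_j in enumerate(mask):
            if not kept_j:
                continue  # j must currently be kept (candidate to prune)
            candidate = list(mask)
            candidate[i], candidate[j] = True, False
            err = abs(reconstruction_error(weights, candidate, mean_inputs))
            if err < best_err:
                best_err, best_mask = err, candidate
    return best_mask


weights = [2.0, 0.5, -1.0, 0.1]
mean_inputs = [1.0, 1.0, 1.0, 1.0]
mask = [False, True, True, False]  # True = weight is kept

new_mask = prune_and_grow_once(weights, mask, mean_inputs)
print(new_mask)
```

In this toy case the large weight at index 0 is grown back and the smaller weight at index 1 is pruned, which shrinks the reconstruction error while keeping two of four weights.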

Extremely Low Footprint End-to-End ASR System for Smart Device

Apr 26, 2021

Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin

Abstract: Recently, end-to-end (E2E) speech recognition has become popular, since it can integrate the acoustic, pronunciation and language models into a single neural network and can outperform conventional models. Among E2E approaches, attention-based models, e.g. the Transformer, have emerged as superior. E2E models have opened the door to deploying ASR on smart devices; however, they still suffer from a large number of model parameters. This work proposes an extremely low footprint E2E ASR system for smart devices, with the goal of satisfying resource constraints without sacrificing recognition accuracy. We adopt cross-layer weight sharing to improve parameter efficiency. We further exploit model compression methods, including sparsification and quantization, to reduce memory storage and boost decoding efficiency on smart devices. We have evaluated our approach on the public AISHELL-1 and AISHELL-2 benchmarks. On the AISHELL-2 task, the proposed method achieves more than 10x compression (model size from 248MB to 24MB) while suffering only a small performance loss (CER from 6.49% to 6.92%).

* 5 pages, 2 figures, submitted to INTERSPEECH2021
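
The headline figures can be checked with a few lines of arithmetic. The helper names below are illustrative; the numbers are the ones quoted in the abstract.

```python
def compression_ratio(original_mb, compressed_mb):
    """How many times smaller the compressed model is."""
    return original_mb / compressed_mb


def relative_increase(before, after):
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


ratio = compression_ratio(248, 24)          # model size, MB
cer_change = relative_increase(6.49, 6.92)  # character error rate, %

print(f"compression: {ratio:.1f}x, relative CER increase: {cer_change:.1%}")
```

So the absolute CER increase of 0.43 points corresponds to a relative increase of under 7% for a model roughly one tenth the size.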

INT8 Winograd Acceleration for Conv1D Equipped ASR Models Deployed on Mobile Devices

Oct 28, 2020

Yiwu Yao, Yuchao Li, Chengyu Wang, Tianhang Yu, Houjiang Chen, Xiaotang Jiang, Jun Yang, Jun Huang, Wei Lin, Hui Shu (+1 more)

Abstract: The intensive computation of Automatic Speech Recognition (ASR) models hinders their deployment on mobile devices. In this paper, we present a novel quantized Winograd optimization pipeline, which combines quantization and fast convolution to achieve efficient inference acceleration on mobile devices for ASR models. To avoid the information loss caused by combining quantization with Winograd convolution, a Range-Scaled Quantization (RSQ) training method is proposed to expand the quantized numerical range and to distill knowledge from high-precision values. Moreover, an improved Conv1D-equipped DFSMN (ConvDFSMN) model is designed for mobile deployment. We conduct extensive experiments on both ConvDFSMN and Wav2letter models. Results demonstrate that the models can be effectively optimized with the proposed pipeline. In particular, Wav2letter achieves a 1.48× speedup with an approximate 0.07% WER decrease on ARMv7-based mobile devices.
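
For readers unfamiliar with the fast convolution involved, the standard Winograd F(2,3) algorithm for 1D convolution is sketched below in floating point. It is a generic textbook illustration and does not include the INT8 quantization or the RSQ training described in the abstract; the function names are illustrative.

```python
def direct_conv_2x3(d, g):
    """Two outputs of a 3-tap 1D convolution, using 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs with 4 multiplications.

    The filter-side terms (g0 + g1 + g2) / 2 and (g0 - g1 + g2) / 2 depend
    only on the filter, so in practice they are precomputed once.
    """
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * ((g[0] + g[1] + g[2]) / 2)
    m3 = (d[2] - d[1]) * ((g[0] - g[1] + g[2]) / 2)
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]


d = [1.0, 2.0, 3.0, 4.0]  # four input samples
g = [1.0, 2.0, 3.0]       # three filter taps

direct = direct_conv_2x3(d, g)
fast = winograd_f23(d, g)
print(direct, fast)
```

The input and filter transforms involve additions and halving, which widen the numerical range of intermediate values; this is one reason combining Winograd with low-bit quantization can lose information.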

Fully Parallel Architecture for Semi-global Stereo Matching with Refined Rank Method

May 07, 2019

Yiwu Yao, Yuhua Cheng

Abstract: A fully parallel, disparity-level architecture for efficient semi-global matching (SGM) with a refined rank method is presented. The improved SGM algorithm is implemented with a non-parametric unified rank model that combines Rank filter/AD and Rank SAD. Rank SAD is a novel definition that introduces the constraints of local image structure into the rank method. As a result, the unified rank model with Rank SAD can make up for the defects of Rank filter/AD. Experimental results show both excellent subjective quality and objective performance for the refined SGM algorithm. The fully parallel hardware implementation of SGM is designed with reasonable strategies at the disparity level. The parallelism of the data stream allows adequate throughput for specific applications at an acceptable maximum frequency. The results of RTL emulation and synthesis confirm that the proposed parallel architecture is suitable for VLSI implementation.

* stereo matching; SGM; Rank SAD; fully parallel architecture

Creating Lightweight Object Detectors with Model Compression for Deployment on Edge Devices

May 06, 2019

Yiwu Yao, Weiqiang Yang, Haoqi Zhu

Abstract: To achieve lightweight object detectors for deployment on edge devices, an effective model compression pipeline is proposed in this paper. The compression pipeline consists of automatic channel pruning for the backbone, fixed channel deletion for the branch layers, and knowledge distillation for guidance learning. As a result, the Resnet50-v1d is auto-pruned and fine-tuned on ImageNet to attain a compact base model as the backbone of the object detector. Lightweight object detectors are then implemented with the proposed compression pipeline. For instance, an SSD-300 with model size=16.3MB, FLOPS=2.31G, and mAP=71.2 is created, a better result than SSD-300-MobileNet.

* lightweight detector, automatic channel pruning, fixed channel deletion, knowledge distillation
