Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wang Xu

Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

May 29, 2025

Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu

Abstract:Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at https://github.com/AI9Stars/SpecMQuant.

* 12 pages, 5 figures

Via

Access Paper or Ask Questions

Enabling Real-Time Conversations with Minimal Training Costs

Sep 18, 2024

Wang Xu, Shuo Wang, Weilin Zhao, Xu Han, Yukun Yan, Yudi Zhang, Zhe Tao, Zhiyuan Liu, Wanxiang Che

Figure 1 for Enabling Real-Time Conversations with Minimal Training Costs

Figure 2 for Enabling Real-Time Conversations with Minimal Training Costs

Figure 3 for Enabling Real-Time Conversations with Minimal Training Costs

Figure 4 for Enabling Real-Time Conversations with Minimal Training Costs

Abstract:Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating on a turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models. These models can dynamically adapt to user input, facilitating real-time interactive feedback. However, these methods typically require substantial computational resources to acquire the ability. To reduce overhead, this paper presents a new duplex decoding approach that enhances LLMs with duplex ability, requiring minimal additional training. Specifically, our method employs parallel decoding of queries and responses in conversations, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.

* 7pages, 6 figures, 1 table

Via

Access Paper or Ask Questions

PVF (Parameter Vulnerability Factor): A Quantitative Metric Measuring AI Vulnerability and Resilience Against Parameter Corruptions

May 02, 2024

Xun Jiao, Fred Lin, Harish D. Dixit, Joel Coburn, Abhinav Pandey, Han Wang, Jianyu Huang, Venkat Ramesh, Wang Xu, Daniel Moore(+1 more)

Figure 1 for PVF (Parameter Vulnerability Factor): A Quantitative Metric Measuring AI Vulnerability and Resilience Against Parameter Corruptions

Figure 2 for PVF (Parameter Vulnerability Factor): A Quantitative Metric Measuring AI Vulnerability and Resilience Against Parameter Corruptions

Figure 3 for PVF (Parameter Vulnerability Factor): A Quantitative Metric Measuring AI Vulnerability and Resilience Against Parameter Corruptions

Figure 4 for PVF (Parameter Vulnerability Factor): A Quantitative Metric Measuring AI Vulnerability and Resilience Against Parameter Corruptions

Abstract:Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them inevitably and increasingly susceptible to hardware faults (e.g., bit flips) that can potentially corrupt model parameters. Given this challenge, this paper aims to answer a critical question: How likely is a parameter corruption to result in an incorrect model output? To systematically answer this question, we propose a novel quantitative metric, Parameter Vulnerability Factor (PVF), inspired by architectural vulnerability factor (AVF) in computer architecture community, aiming to standardize the quantification of AI model resilience/vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. Similar to AVF, this statistical concept can be derived from statistically extensive and meaningful fault injection (FI) experiments. In this paper, we present several use cases on applying PVF to three types of tasks/models during inference -- recommendation (DLRM), vision classification (CNN), and text classification (BERT). PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency such as mapping vulnerable AI parameter components to well-protected hardware modules. PVF metric is applicable to any AI model and has a potential to help unify and standardize AI vulnerability/resilience evaluation practice.

Via

Access Paper or Ask Questions

LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation

Feb 12, 2024

Hongyun Zhou, Xiangyu Lu, Wang Xu, Conghui Zhu, Tiejun Zhao

Abstract:Low-Rank Adaptation (LoRA) introduces auxiliary parameters for each layer to fine-tune the pre-trained model under limited computing resources. But it still faces challenges of resource consumption when scaling up to larger models. Previous studies employ pruning techniques by evaluating the importance of LoRA parameters for different layers to address the problem. However, these efforts only analyzed parameter features to evaluate their importance. Indeed, the output of LoRA related to the parameters and data is the factor that directly impacts the frozen model. To this end, we propose LoRA-drop which evaluates the importance of the parameters by analyzing the LoRA output. We retain LoRA for important layers and the LoRA of the other layers share the same parameters. Abundant experiments on NLU and NLG tasks demonstrate the effectiveness of LoRA-drop.

* 12 pages, 11 figures

Via

Access Paper or Ask Questions

Chinese Fine-Grained Financial Sentiment Analysis with Large Language Models

Jul 06, 2023

Yinyu Lan, Yanru Wu, Wang Xu, Weiqiang Feng, Youhao Zhang

Figure 1 for Chinese Fine-Grained Financial Sentiment Analysis with Large Language Models

Figure 2 for Chinese Fine-Grained Financial Sentiment Analysis with Large Language Models

Figure 3 for Chinese Fine-Grained Financial Sentiment Analysis with Large Language Models

Figure 4 for Chinese Fine-Grained Financial Sentiment Analysis with Large Language Models

Abstract:Entity-level fine-grained sentiment analysis in the financial domain is a crucial subtask of sentiment analysis and currently faces numerous challenges. The primary challenge stems from the lack of high-quality and large-scale annotated corpora specifically designed for financial text sentiment analysis, which in turn limits the availability of data necessary for developing effective text processing techniques. Recent advancements in large language models (LLMs) have yielded remarkable performance in natural language processing tasks, primarily centered around language pattern matching. In this paper, we propose a novel and extensive Chinese fine-grained financial sentiment analysis dataset, FinChina SA, for enterprise early warning. We thoroughly evaluate and experiment with well-known existing open-source LLMs using our dataset. We firmly believe that our dataset will serve as a valuable resource to advance the exploration of real-world financial sentiment analysis tasks, which should be the focus of future research. Our dataset and all code to replicate the experimental results will be released.

* FinLLM Symposium at IJCAI 2023

Via

Access Paper or Ask Questions

Document-Level Relation Extraction with Sentences Importance Estimation and Focusing

Apr 27, 2022

Wang Xu, Kehai Chen, Lili Mou, Tiejun Zhao

Figure 1 for Document-Level Relation Extraction with Sentences Importance Estimation and Focusing

Figure 2 for Document-Level Relation Extraction with Sentences Importance Estimation and Focusing

Figure 3 for Document-Level Relation Extraction with Sentences Importance Estimation and Focusing

Figure 4 for Document-Level Relation Extraction with Sentences Importance Estimation and Focusing

Abstract:Document-level relation extraction (DocRE) aims to determine the relation between two entities from a document of multiple sentences. Recent studies typically represent the entire document by sequence- or graph-based models to predict the relations of all entity pairs. However, we find that such a model is not robust and exhibits bizarre behaviors: it predicts correctly when an entire test document is fed as input, but errs when non-evidence sentences are removed. To this end, we propose a Sentence Importance Estimation and Focusing (SIEF) framework for DocRE, where we design a sentence importance score and a sentence focusing loss, encouraging DocRE models to focus on evidence sentences. Experimental results on two domains show that our SIEF not only improves overall performance, but also makes DocRE models more robust. Moreover, SIEF is a general framework, shown to be effective when combined with a variety of base DocRE models.

* Accepted to NAACL 2022

Via

Access Paper or Ask Questions

Discriminative Reasoning for Document-level Relation Extraction

Jun 03, 2021

Wang Xu, Kehai Chen, Tiejun Zhao

Figure 1 for Discriminative Reasoning for Document-level Relation Extraction

Figure 2 for Discriminative Reasoning for Document-level Relation Extraction

Figure 3 for Discriminative Reasoning for Document-level Relation Extraction

Figure 4 for Discriminative Reasoning for Document-level Relation Extraction

Abstract:Document-level relation extraction (DocRE) models generally use graph networks to implicitly model the reasoning skill (i.e., pattern recognition, logical reasoning, coreference reasoning, etc.) related to the relation between one entity pair in a document. In this paper, we propose a novel discriminative reasoning framework to explicitly model the paths of these reasoning skills between each entity pair in this document. Thus, a discriminative reasoning network is designed to estimate the relation probability distribution of different reasoning paths based on the constructed graph and vectorized document contexts for each entity pair, thereby recognizing their relation. Experimental results show that our method outperforms the previous state-of-the-art performance on the large-scale DocRE dataset. The code is publicly available at https://github.com/xwjim/DRN.

* 11 pages, 4 figures, 5 tables. Findings of ACL 2021

Via

Access Paper or Ask Questions

Document-Level Relation Extraction with Reconstruction

Dec 21, 2020

Wang Xu, Kehai Chen, Tiejun Zhao

Figure 1 for Document-Level Relation Extraction with Reconstruction

Figure 2 for Document-Level Relation Extraction with Reconstruction

Figure 3 for Document-Level Relation Extraction with Reconstruction

Figure 4 for Document-Level Relation Extraction with Reconstruction

Abstract:In document-level relation extraction (DocRE), graph structure is generally used to encode relation information in the input document to classify the relation category between each entity pair, and has greatly advanced the DocRE task over the past several years. However, the learned graph representation universally models relation information between all entity pairs regardless of whether there are relationships between these entity pairs. Thus, those entity pairs without relationships disperse the attention of the encoder-classifier DocRE for ones with relationships, which may further hind the improvement of DocRE. To alleviate this issue, we propose a novel encoder-classifier-reconstructor model for DocRE. The reconstructor manages to reconstruct the ground-truth path dependencies from the graph representation, to ensure that the proposed DocRE model pays more attention to encode entity pairs with relationships in the training. Furthermore, the reconstructor is regarded as a relationship indicator to assist relation classification in the inference, which can further improve the performance of DocRE model. Experimental results on a large-scale DocRE dataset show that the proposed model can significantly improve the accuracy of relation extraction on a strong heterogeneous graph-based baseline.

* 9 pages, 5 figures, 6 tables. Accepted by AAAI 2021 (Long Paper)

Via

Access Paper or Ask Questions