Abstract: Striking an optimal balance between minimal drafting latency and high speculation accuracy to accelerate the inference of Large Language Models remains a significant challenge in speculative decoding. In this paper, we introduce Falcon, a semi-autoregressive speculative decoding framework designed to improve both the drafter's parallelism and its output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which strengthens inter-token dependencies within the same block and thereby increases speculation accuracy; we provide a theoretical analysis of the underlying mechanism. In addition, we introduce a Custom-Designed Decoding Tree that allows the drafter to generate multiple tokens in a single forward pass and supports multiple forward passes when needed, increasing the number of drafted tokens and significantly improving the overall acceptance rate. Comprehensive evaluations on benchmark datasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon's superior acceleration: it achieves a lossless speedup of 2.91x to 3.51x on the Vicuna and LLaMA2-Chat model series, outperforming existing speculative decoding methods for LLMs, including Eagle, Medusa, Lookahead, SPS, and PLD, while maintaining a compact drafter architecture equivalent to only two Transformer layers.
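The sketch below illustrates the generic draft-then-verify loop that block-wise speculative decoding builds on, under stated assumptions: the `draft_block` and `verify` callables, the block size, and the toy stand-ins are hypothetical placeholders, and Falcon's actual semi-autoregressive drafter, Coupled Sequential Glancing Distillation, and Custom-Designed Decoding Tree are not reproduced here.

```python
# Minimal sketch of a generic draft-then-verify speculative decoding loop.
# `draft_block` and `verify` are hypothetical stand-ins for a small drafter
# and the target LLM; this is not Falcon's actual implementation.
from typing import Callable, List, Tuple


def speculative_decode(
    prompt: List[int],
    draft_block: Callable[[List[int], int], List[int]],      # drafter: proposes k tokens in one pass
    verify: Callable[[List[int], List[int]], Tuple[int, int]],  # target: (#accepted, correction token)
    block_size: int = 4,
    max_new_tokens: int = 64,
) -> List[int]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1) Drafter proposes a block of candidate tokens in a single forward pass.
        candidates = draft_block(tokens, block_size)
        # 2) Target model verifies the block in one pass: it accepts the longest
        #    valid prefix and supplies its own next token after that prefix, so the
        #    final output matches plain autoregressive decoding (lossless).
        n_accepted, correction = verify(tokens, candidates)
        tokens.extend(candidates[:n_accepted])
        tokens.append(correction)
        generated += n_accepted + 1
    return tokens


if __name__ == "__main__":
    # Toy stand-ins: the drafter repeats the last token; the verifier accepts one
    # candidate and emits an incremented token as its correction.
    draft = lambda ctx, k: [ctx[-1]] * k
    check = lambda ctx, cand: (1, cand[0] + 1)
    print(speculative_decode([1, 2, 3], draft, check, max_new_tokens=8))
```

The speedup of such a scheme grows with the average number of accepted draft tokens per verification pass, which is why improving the drafter's speculation accuracy and the number of drafted tokens, as Falcon's abstract emphasizes, directly raises the acceptance rate and throughput.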
Abstract: Computational complexity and overthinking have become bottlenecks for pre-trained language models (PLMs) with millions or even trillions of parameters. This paper proposes a Flexible-Patience-Based Early Exiting method (F-PABEE) to alleviate these problems for both single-label classification (SLC) and multi-label classification (MLC) tasks. F-PABEE makes predictions at each layer's internal classifier and exits early once the predicted distributions of successive layers remain similar for a given number of consecutive layers. It is more flexible than the previous state-of-the-art (SOTA) early-exiting method PABEE because it can adjust both the similarity score threshold and the patience parameter. Extensive experiments show that: (1) F-PABEE achieves a better speedup-accuracy trade-off than existing early-exiting strategies on both SLC and MLC tasks; (2) F-PABEE achieves faster inference and better performance on different PLMs such as BERT and ALBERT; and (3) among F-PABEE variants with different similarity measures, F-PABEE-JSKD performs best.
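The sketch below illustrates a flexible patience-based early-exit rule of the kind the abstract describes: exit once the predicted distributions of successive classifiers stay similar for a set number of consecutive layers. The Jensen-Shannon divergence used as the similarity score, the threshold value, and the toy inputs are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a flexible patience-based early-exit rule: compare the
# predicted distributions of consecutive internal classifiers and exit once
# they stay within a similarity threshold for `patience` layers in a row.
import numpy as np


def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + 1e-12) / (b + 1e-12))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def early_exit_predict(layer_probs, threshold=0.05, patience=2):
    """layer_probs: per-layer predicted distributions, ordered shallow to deep.
    Returns (predicted class, number of layers actually used)."""
    similar_count = 0
    prev = None
    for depth, probs in enumerate(layer_probs, start=1):
        if prev is not None:
            # Count how many consecutive layer pairs agree within the threshold.
            similar_count = similar_count + 1 if js_divergence(prev, probs) < threshold else 0
            if similar_count >= patience:
                return int(np.argmax(probs)), depth          # exit early
        prev = probs
    return int(np.argmax(prev)), len(layer_probs)            # no early exit: use the last layer


if __name__ == "__main__":
    # Hypothetical per-layer distributions for a 2-class problem: once the
    # layers start to agree, the model exits before reaching the final layer.
    probs = [np.array([0.5, 0.5]), np.array([0.7, 0.3]),
             np.array([0.72, 0.28]), np.array([0.71, 0.29])]
    print(early_exit_predict(probs))
```

Tightening the threshold or raising the patience trades speed for accuracy, which is the flexibility the abstract contrasts with PABEE's single patience parameter.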