Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Jan 19, 2024

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao

Figure 1 for Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Figure 2 for Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Figure 3 for Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Figure 4 for Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Share this with someone who'll enjoy it:

Abstract:The inference process in Large Language Models (LLMs) is often limited due to the absence of parallelism in the auto-regressive decoding process, resulting in most operations being restricted by the memory bandwidth of accelerators. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa introduces only minimal overhead in terms of single-step latency while substantially reducing the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of Medusa heads and higher speedup but needing a special training recipe that preserves the backbone model's capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.

* The code for this implementation is available at https://github.com/FasterDecoding/Medusa

View paper on

Share this with someone who'll enjoy it:

Title:Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Paper and Code