Title: Decoding Speculative Decoding

Feb 02, 2024

Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman

Abstract: Speculative decoding is a widely used technique for speeding up inference in Large Language Models (LLMs) without modifying their output. During inference, speculative decoding uses a smaller draft model to generate speculative tokens and then uses the target LLM to verify those draft tokens. The speedup it provides depends heavily on the choice of draft model. It has been widely suggested that, to achieve the highest throughput, one should select a draft model whose generated tokens have a high probability of being accepted by the LLM. However, our experiments indicate the contrary: throughput diminishes as the probability of generated tokens being accepted by the target model increases. To understand this phenomenon, we perform extensive experiments to characterize the factors that affect speculative decoding, how those factors interact, and how they affect speedup. Based on our experiments, we describe an analytical model that can be used to choose the right draft model for a given workload. Further, using our insights, we design a new draft model for LLaMA-65B that provides 30% higher throughput than existing draft models.
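The draft-then-verify loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for real models, decoding is greedy (production systems typically verify sampled tokens against probabilities), and verification is done one position at a time, whereas a real target LLM would score all draft positions in a single forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical toy 'target model': a deterministic next-token rule."""
    return (sum(ctx) * 31 + len(ctx)) % 10


def draft_next(ctx: List[int]) -> int:
    """Hypothetical toy 'draft model': agrees with the target except
    when the context length is a multiple of 4."""
    tok = target_next(ctx)
    return (tok + 1) % 10 if len(ctx) % 4 == 0 else tok


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per new token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    n_new: int,
    k: int = 4,
) -> Tuple[List[int], float]:
    """Greedy speculative decoding.

    Returns the generated sequence and the fraction of draft tokens
    that the target accepted.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    accepted = 0
    proposed = 0
    while len(out) < goal:
        # Draft phase: the small model proposes k tokens autoregressively.
        spec: List[int] = []
        for _ in range(k):
            spec.append(draft(out + spec))
        proposed += len(spec)

        # Verify phase: keep the longest prefix the target agrees with.
        for tok in spec:
            expected = target(out)
            if tok == expected:
                out.append(tok)
                accepted += 1
            else:
                # First mismatch: take the target's token and discard
                # the remaining draft tokens.
                out.append(expected)
                break
        else:
            # Every draft token was accepted, so the target's own next
            # token comes for free from the same verification step.
            out.append(target(out))
    return out[:goal], accepted / proposed


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, rate = speculative_decode(target_next, draft_next, prompt, n_new=20)
    # The speculative output matches ordinary greedy decoding of the target.
    assert tokens == greedy_decode(target_next, prompt, 20)
    print(tokens, f"acceptance rate: {rate:.2f}")
```

Because every token that enters the output is either confirmed or supplied by the target, the result is identical to decoding with the target alone; the draft model changes only how many target calls are needed. The sketch does not model the draft model's own latency, which is one of the factors that determines whether a higher acceptance rate translates into higher throughput.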
