Title: PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation

Jul 16, 2024

Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari

Abstract: Inference of Large Language Models (LLMs) across computer clusters has recently become a focal point of research, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce bottlenecks associated with memory bandwidth, but they also increase end-to-end latency per inference run, so they require high speculation acceptance rates to improve performance. Because the acceptance rate also varies across tasks, speculative inference techniques can result in reduced performance. Additionally, pipeline-parallel designs require many user requests to maintain maximum utilization. As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique that reduces inter-token latency and improves system utilization for single-request scenarios, while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects. PipeInfer exhibits up to a 2.15$\times$ improvement in generation speed over standard speculative inference. PipeInfer achieves its improvement through Continuous Asynchronous Speculation and Early Inference Cancellation. The former improves latency and generation speed by running single-token inference simultaneously with several speculative runs, while the latter improves speed and latency by skipping the computation of invalidated runs, even in the middle of inference.
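
The abstract's point that standard speculative inference needs a high acceptance rate to pay off can be illustrated with a small back-of-the-envelope model. The sketch below is not PipeInfer's method; it is the usual simplified analysis of sequential (non-pipelined) speculative decoding. It assumes each speculated token is accepted independently with probability `alpha`, and that verifying a batch of speculated tokens costs the same as one target-model run. The names `alpha`, `k`, and `draft_cost` are illustrative and do not come from the paper.

```python
def expected_tokens_per_run(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model verification run.

    Assumption: each of the k speculated tokens is accepted independently
    with probability alpha, and the target model always contributes one
    extra token (a correction, or a bonus token if all k are accepted).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every speculated token is accepted, plus the bonus token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain one-token-at-a-time decoding.

    draft_cost is the (hypothetical) cost of one draft-model step relative
    to one target-model run. Each speculative round pays for k draft steps
    plus one target-model verification run.
    """
    round_cost = k * draft_cost + 1.0
    return expected_tokens_per_run(alpha, k) / round_cost


if __name__ == "__main__":
    for alpha in (0.2, 0.5, 0.8):
        s = estimated_speedup(alpha, k=4, draft_cost=0.1)
        print(f"alpha={alpha:.1f}  estimated speedup={s:.2f}x")
```

Under these assumptions, with `k = 4` and `draft_cost = 0.1`, an acceptance rate of 0.2 gives an estimate below 1 (slower than plain decoding), while 0.8 gives an estimate above 2. This is the sensitivity to acceptance rate that the abstract says PipeInfer is designed to tolerate.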

* 11 pages, submitted to SC24 conference
