Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

May 22, 2023

Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, Yang You

Figure 1 for Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Figure 2 for Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Figure 3 for Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Figure 4 for Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Share this with someone who'll enjoy it:

Abstract:Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. However, the inference process for LLMs comes with significant computational costs. In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs. Our approach begins by tapping into the potential of LLMs to accurately perceive and predict the response length with minimal overhead. By leveraging this information, we introduce an efficient sequence scheduling technique that groups queries with similar response lengths into micro-batches. We evaluate our approach on real-world instruction datasets using the LLaMA-based model, and our results demonstrate an impressive 86% improvement in inference throughput without compromising effectiveness. Notably, our method is orthogonal to other inference acceleration techniques, making it a valuable addition to many existing toolkits (e.g., FlashAttention, Quantization) for LLM inference.

View paper on

Share this with someone who'll enjoy it:

Title:Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Paper and Code