
Title: MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

Aug 21, 2024

Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, Dan Alistarh



Abstract: As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet it remains open whether speedups are also achievable in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound while supporting the substantially increased compute requirements of batched workloads. This paper resolves this question positively by describing the design of Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes up to 16-32 can be supported with close to maximum ($4\times$) quantization speedup, and larger batch sizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to end-to-end LLM inference speedups (of up to $2.8\times$) when integrated with the popular vLLM serving engine. Finally, MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
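The memory-bound argument in the abstract can be illustrated with a simple roofline estimate. The sketch below is not taken from the paper: the `Gpu` numbers are illustrative assumptions for a hypothetical accelerator, the 8192 x 8192 layer shape is arbitrary, the helper names are made up for this example, and only weight traffic is counted (activation traffic is ignored). It estimates the best-case speedup of 4-bit over 16-bit weights for one linear layer as the batch size grows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical accelerator; both numbers are illustrative assumptions."""
    peak_flops: float     # dense FP16 throughput in FLOP/s
    mem_bandwidth: float  # global memory bandwidth in bytes/s


def layer_time(gpu: Gpu, batch: int, k: int, n: int, weight_bits: int) -> float:
    """Roofline lower bound for one (batch x k) @ (k x n) matrix multiply.

    Simplification: only weight bytes are counted as memory traffic.
    """
    flops = 2.0 * batch * k * n                # one multiply and one add per term
    weight_bytes = k * n * weight_bits / 8.0   # packed weight storage
    compute_time = flops / gpu.peak_flops
    memory_time = weight_bytes / gpu.mem_bandwidth
    # The slower of the two resources bounds the layer time.
    return max(compute_time, memory_time)


def quantization_speedup(gpu: Gpu, batch: int, k: int, n: int,
                         bits: int = 4, baseline_bits: int = 16) -> float:
    """Ideal speedup of `bits`-bit weights over `baseline_bits`-bit weights."""
    baseline = layer_time(gpu, batch, k, n, baseline_bits)
    quantized = layer_time(gpu, batch, k, n, bits)
    return baseline / quantized


# Assumed hardware figures, chosen only to make the example concrete.
GPU = Gpu(peak_flops=125e12, mem_bandwidth=768e9)

if __name__ == "__main__":
    for batch in (1, 16, 32, 64, 128, 256):
        speedup = quantization_speedup(GPU, batch, 8192, 8192)
        print(f"batch={batch:4d}  ideal speedup={speedup:.2f}x")
```

Under these assumed numbers the ideal speedup stays at 4x while the 4-bit layer is still limited by weight loading, then falls toward 1x once both variants become compute-bound. This is an upper bound, not a measurement; the kernel design question raised in the abstract is how closely a real implementation can track it.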
