David Tarjan

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

Mar 14, 2024

Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

Abstract: Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key-value cache compression at inference time. Most importantly, the model learns to apply different compression rates in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to ~3.7x throughput increase in auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA). GQA and DMC can even be combined to obtain compounded gains. As a result, DMC fits longer contexts and larger batches within any given memory budget.
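
As a rough illustration of the linear scaling described in the abstract above, the following sketch estimates the size of a key-value cache. The helper name `kv_cache_bytes` and the example configuration (32 layers, 32 key-value heads, head dimension 128, 16-bit values, roughly the shape of Llama 2 7B) are assumptions made for illustration and are not taken from the paper. The compression ratio is modeled as a uniform reduction in cached tokens, which simplifies DMC's learned per-head, per-layer rates.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, compression_ratio=1.0):
    """Estimate KV cache size in bytes (hypothetical helper, not from the paper)."""
    # Factor 2: both a key and a value vector are stored per token, head, and layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    # Simplifying assumption: compression uniformly keeps seq_len / ratio entries.
    cached_tokens = seq_len / compression_ratio
    return batch_size * cached_tokens * bytes_per_token


# Assumed example configuration: batch of 8 sequences, 4096 tokens each.
baseline = kv_cache_bytes(8, 4096, n_layers=32, n_kv_heads=32, head_dim=128)
compressed = kv_cache_bytes(8, 4096, n_layers=32, n_kv_heads=32, head_dim=128,
                            compression_ratio=4.0)
print(f"baseline:   {baseline / 2**30:.1f} GiB")
print(f"compressed: {compressed / 2**30:.1f} GiB")
```

Under these assumptions, doubling either the batch size or the sequence length doubles the estimate, so a 4x smaller cache leaves room for roughly 4x more tokens or sequences in the same memory budget.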

SDCNet: Video Prediction Using Spatially-Displaced Convolution

Nov 02, 2018

Fitsum A. Reda, Guilin Liu, Kevin J. Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, Bryan Catanzaro

Abstract: We present an approach for high-resolution video frame prediction by conditioning on both past frames and past optical flows. Previous approaches rely on resampling past frames, guided by a learned future optical flow, or on direct generation of pixels. Resampling based on flow is insufficient because it cannot deal with disocclusions. Generative models currently lead to blurry results. Recent approaches synthesize a pixel by convolving input patches with a predicted kernel. However, their memory requirement increases with kernel size. Here, we propose a spatially-displaced convolution (SDC) module for video frame prediction. We learn a motion vector and a kernel for each pixel and synthesize a pixel by applying the kernel at a displaced location in the source image, defined by the predicted motion vector. Our approach inherits the merits of both vector-based and kernel-based approaches, while ameliorating their respective disadvantages. We train our model on 428K unlabelled 1080p video game frames. Our approach produces state-of-the-art results, achieving an SSIM score of 0.904 on high-definition YouTube-8M videos and 0.918 on Caltech Pedestrian videos. Our model handles large motion effectively and synthesizes crisp frames with consistent motion.

* Published in ECCV 2018
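
The pixel synthesis step described in the SDCNet abstract above (apply a per-pixel kernel at a location displaced by a per-pixel motion vector) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the image is a grayscale list of rows, the motion vectors and kernels are passed in directly rather than predicted by a network, and bilinear sampling with border clamping is an assumed choice for reading sub-pixel locations.

```python
import math


def bilinear_sample(image, y, x):
    """Read a grayscale image (list of rows) at a real-valued location.

    Coordinates are clamped to the image border (an assumed boundary rule).
    """
    h, w = len(image), len(image[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * image[y0][x0] + wx * image[y0][x1]
    bottom = (1 - wx) * image[y1][x0] + wx * image[y1][x1]
    return (1 - wy) * top + wy * bottom


def sdc_pixel(image, y, x, motion, kernel):
    """Synthesize one pixel by applying `kernel` (k x k, k odd) to the patch
    centred at (y, x) shifted by the motion vector (dy, dx)."""
    dy, dx = motion
    k = len(kernel)
    r = k // 2
    out = 0.0
    for i in range(k):
        for j in range(k):
            out += kernel[i][j] * bilinear_sample(
                image, y + dy + i - r, x + dx + j - r)
    return out


def sdc_frame(image, motions, kernels):
    """Apply sdc_pixel at every location; motions[y][x] and kernels[y][x]
    hold the per-pixel motion vector and kernel."""
    h, w = len(image), len(image[0])
    return [[sdc_pixel(image, y, x, motions[y][x], kernels[y][x])
             for x in range(w)] for y in range(h)]
```

With a 1x1 kernel `[[1.0]]` this reduces to purely vector-based resampling, and with zero motion it reduces to a purely kernel-based approach, which is one way to read the abstract's claim that SDC combines both.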
