Title: Beyond Autoregression: Fast LLMs via Self-Distillation Through Time

Oct 28, 2024

Justin Deschenaux, Caglar Gulcehre

Abstract: Autoregressive (AR) Large Language Models (LLMs) have demonstrated significant success across numerous tasks. However, the AR modeling paradigm presents certain limitations; for instance, contemporary autoregressive LLMs are trained to generate one token at a time, which can result in noticeable latency. Recent advances have indicated that search and repeated sampling can enhance performance in various applications, such as theorem proving, code generation, and alignment, by utilizing greater computational resources during inference. In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the number of inference steps by a factor of 32-64. Practically, our models, even without caching, can generate tokens at a rate that is up to 8 times faster than AR models employing KV caching, and we anticipate further improvements with the inclusion of caching. Moreover, we demonstrate the efficacy of our approach for diffusion language models with up to 860M parameters.
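To make the step-count claim concrete, the following is a minimal back-of-the-envelope sketch, not the authors' code. It only counts sequential network calls: one per token for an AR model, versus one per group of tokens for a sampler that emits several tokens per step. The function names and the sequence length of 1024 are hypothetical, chosen for illustration; only the figure of 32 tokens per step comes from the abstract.

```python
import math


def sequential_passes_autoregressive(num_tokens: int) -> int:
    """An AR model emits one token per forward pass."""
    return num_tokens


def sequential_passes_parallel(num_tokens: int, tokens_per_step: int) -> int:
    """A sampler that emits `tokens_per_step` tokens per forward pass."""
    return math.ceil(num_tokens / tokens_per_step)


if __name__ == "__main__":
    seq_len = 1024        # hypothetical sequence length (assumption)
    tokens_per_step = 32  # "at least 32 tokens simultaneously" (abstract)

    ar_passes = sequential_passes_autoregressive(seq_len)
    parallel_passes = sequential_passes_parallel(seq_len, tokens_per_step)

    print(f"AR passes:       {ar_passes}")
    print(f"Parallel passes: {parallel_passes}")
    print(f"Reduction:       {ar_passes / parallel_passes:.0f}x fewer sequential passes")
```

Fewer sequential passes does not translate one-to-one into wall-clock speedup: the abstract reports up to 8 times faster generation than AR models with KV caching, not 32 times, and notes that the diffusion models were measured without caching.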
