Abstract:Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.
Abstract:Analog, low-voltage electronics show great promise in producing silicon neurons (SiNs) with unprecedented levels of energy efficiency. Yet, their inherently high susceptibility to process, voltage and temperature (PVT) variations, and noise has long been recognised as a major bottleneck in developing effective neuromorphic solutions. Inspired by spike transmission studies in biophysical, neocortical neurons, we demonstrate that the inherent noise and variability can coexist with reliable spike transmission in analog SiNs, similarly to biological neurons. We illustrate this property on a recent neuromorphic model of a bursting neuron by showcasing three different relevant types of reliable event transmission: single spike transmission, burst transmission, and the on-off control of a half-centre oscillator (HCO) network.