Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sebastian Siegel

IMSSA: Deploying modern state-space models on memristive in-memory compute hardware

Dec 28, 2024

Sebastian Siegel, Ming-Jay Yang, John-Paul Strachan

Abstract:Processing long temporal sequences is a key challenge in deep learning. In recent years, Transformers have become state-of-the-art for this task, but suffer from excessive memory requirements due to the need to explicitly store the sequences. To address this issue, structured state-space sequential (S4) models recently emerged, offering a fixed memory state while still enabling the processing of very long sequence contexts. The recurrent linear update of the state in these models makes them highly efficient on modern graphics processing units (GPU) by unrolling the recurrence into a convolution. However, this approach demands significant memory and massively parallel computation, which is only available on the latest GPUs. In this work, we aim to bring the power of S4 models to edge hardware by significantly reducing the size and computational demand of an S4D model through quantization-aware training, even achieving ternary weights for a simple real-world task. To this end, we extend conventional quantization-aware training to tailor it for analog in-memory compute hardware. We then demonstrate the deployment of recurrent S4D kernels on memrisitve crossbar arrays, enabling their computation in an in-memory compute fashion. To our knowledge, this is the first implementation of S4 kernels on in-memory compute hardware.

* 5 pages, 4 figures, submitted to IEEE ISCAS 2025

Via

Access Paper or Ask Questions

Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Sep 28, 2024

Nathan Leroux, Paul-Philipp Manea, Chirag Sudarshan, Jan Finkbeiner, Sebastian Siegel, John Paul Strachan, Emre Neftci

Figure 1 for Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Figure 2 for Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Figure 3 for Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Figure 4 for Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Abstract:Transformer neural networks, driven by self-attention mechanisms, are core components of foundational and Large Language Models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks for long sequences. In this work, we propose a fast and energy-efficient hardware implementation of self-attention using analog in-memory computing based on gain cell memories. Volatile gain cell memories can be efficiently written to store new tokens during sequence generation, while performing analog signed weight multiplications to compute the dot-products required for self-attention. We implement Sliding Window Attention, which keeps memory of a finite set of past steps. A charge-to-pulse converter for array readout eliminates the need for analog-to-digital conversion between self-attention stages. Using a co-designed initialization algorithm to adapt pre-trained weights to gain cell non-idealities, we achieve NLP performance comparable to ChatGPT-2 with minimal training iterations, despite hardware constraints. Our end-to-end hardware design includes digital controls, estimating area, latency, and energy. The system reduces attention latency by up to two orders of magnitude and energy consumption by up to five orders compared to GPUs, marking a significant step toward ultra-fast, low-power sequence generation in Large Language Models.

* 25 pages, 6 figures, 1 table

Via

Access Paper or Ask Questions

Synaptogen: A cross-domain generative device model for large-scale neuromorphic circuit design

Apr 09, 2024

Tyler Hennen, Leon Brackmann, Tobias Ziegler, Sebastian Siegel, Stephan Menzel, Rainer Waser, Dirk J. Wouters, Daniel Bedau

Abstract:We present a fast generative modeling approach for resistive memories that reproduces the complex statistical properties of real-world devices. To enable efficient modeling of analog circuits, the model is implemented in Verilog-A. By training on extensive measurement data of integrated 1T1R arrays (6,000 cycles of 512 devices), an autoregressive stochastic process accurately accounts for the cross-correlations between the switching parameters, while non-linear transformations ensure agreement with both cycle-to-cycle (C2C) and device-to-device (D2D) variability. Benchmarks show that this statistically comprehensive model achieves read/write throughputs exceeding those of even highly simplified and deterministic compact models.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Code is available at https://zenodo.org/doi/10.5281/zenodo.10942560

Via

Access Paper or Ask Questions

Sequence learning in a spiking neuronal network with memristive synapses

Nov 29, 2022

Younes Bouhadjar, Sebastian Siegel, Tom Tetzlaff, Markus Diesmann, Rainer Waser, Dirk J. Wouters

Abstract:Brain-inspired computing proposes a set of algorithmic principles that hold promise for advancing artificial intelligence. They endow systems with self learning capabilities, efficient energy usage, and high storage capacity. A core concept that lies at the heart of brain computation is sequence learning and prediction. This form of computation is essential for almost all our daily tasks such as movement generation, perception, and language. Understanding how the brain performs such a computation is not only important to advance neuroscience but also to pave the way to new technological brain-inspired applications. A previously developed spiking neural network implementation of sequence prediction and recall learns complex, high-order sequences in an unsupervised manner by local, biologically inspired plasticity rules. An emerging type of hardware that holds promise for efficiently running this type of algorithm is neuromorphic hardware. It emulates the way the brain processes information and maps neurons and synapses directly into a physical substrate. Memristive devices have been identified as potential synaptic elements in neuromorphic hardware. In particular, redox-induced resistive random access memories (ReRAM) devices stand out at many aspects. They permit scalability, are energy efficient and fast, and can implement biological plasticity rules. In this work, we study the feasibility of using ReRAM devices as a replacement of the biological synapses in the sequence learning model. We implement and simulate the model including the ReRAM plasticity using the neural simulator NEST. We investigate the effect of different device properties on the performance characteristics of the sequence learning model, and demonstrate resilience with respect to different on-off ratios, conductance resolutions, device variability, and synaptic failure.

* 23 pages, 13 Figures

Via

Access Paper or Ask Questions