Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Marta Andronic

NeuraLUT-Assemble: Hardware-aware Assembling of Sub-Neural Networks for Efficient LUT Inference

Apr 01, 2025

Marta Andronic, George A. Constantinides

Abstract:Efficient neural networks (NNs) leveraging lookup tables (LUTs) have demonstrated significant potential for emerging AI applications, particularly when deployed on field-programmable gate arrays (FPGAs) for edge computing. These architectures promise ultra-low latency and reduced resource utilization, broadening neural network adoption in fields such as particle physics. However, existing LUT-based designs suffer from accuracy degradation due to the large fan-in required by neurons being limited by the exponential scaling of LUT resources with input width. In practice, in prior work this tension has resulted in the reliance on extremely sparse models. We present NeuraLUT-Assemble, a novel framework that addresses these limitations by combining mixed-precision techniques with the assembly of larger neurons from smaller units, thereby increasing connectivity while keeping the number of inputs of any given LUT manageable. Additionally, we introduce skip-connections across entire LUT structures to improve gradient flow. NeuraLUT-Assemble closes the accuracy gap between LUT-based methods and (fully-connected) MLP-based models, achieving competitive accuracy on tasks such as network intrusion detection, digit classification, and jet classification, demonstrating up to $8.42\times$ reduction in the area-delay product compared to the state-of-the-art at the time of the publication.

Via

Access Paper or Ask Questions

PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured Pruning

Jan 14, 2025

Marta Andronic, Jiawen Li, George A. Constantinides

Figure 1 for PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured Pruning

Figure 2 for PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured Pruning

Figure 3 for PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured Pruning

Figure 4 for PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured Pruning

Abstract:Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded these operations inside FPGA lookup tables (LUTs). However, FPGA LUTs can implement a much greater variety of functions. In this paper, we propose a novel approach to training DNNs for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. By using polynomial building blocks, we achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. LUT-based implementations also face a significant challenge: the LUT size grows exponentially with the number of inputs. Prior work relies on a priori fixed sparsity, with results heavily dependent on seed selection. To address this, we propose a structured pruning strategy using a bespoke hardware-aware group regularizer that encourages a particular sparsity pattern that leads to a small number of inputs per neuron. We demonstrate the effectiveness of PolyLUT on three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and MNIST.

* arXiv admin note: text overlap with arXiv:2309.02334

Via

Access Paper or Ask Questions

ReducedLUT: Table Decomposition with "Don't Care" Conditions

Dec 24, 2024

Oliver Cassidy, Marta Andronic, Samuel Coward, George A. Constantinides

Abstract:Lookup tables (LUTs) are frequently used to efficiently store arrays of precomputed values for complex mathematical computations. When used in the context of neural networks, these functions exhibit a lack of recognizable patterns which presents an unusual challenge for conventional logic synthesis techniques. Several approaches are known to break down a single large lookup table into multiple smaller ones that can be recombined. Traditional methods, such as plain tabulation, piecewise linear approximation, and multipartite table methods, often yield inefficient hardware solutions when applied to LUT-based NNs. This paper introduces ReducedLUT, a novel method to reduce the footprint of the LUTs by injecting don't cares into the compression process. This additional freedom introduces more self-similarities which can be exploited using known decomposition techniques. We then demonstrate a particular application to machine learning; by replacing unobserved patterns within the training data of neural network models with don't cares, we enable greater compression with minimal model accuracy degradation. In practice, we achieve up to $1.63\times$ reduction in Physical LUT utilization, with a test accuracy drop of no more than $0.01$ accuracy points.

* Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '25), February 27--March 1, 2025, Monterey, CA, USA

Via

Access Paper or Ask Questions

BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

Nov 18, 2024

Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah

Figure 1 for BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

Figure 2 for BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

Figure 3 for BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

Figure 4 for BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

Abstract:Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks. Yet the substantial memory footprint of LLMs significantly hinders their deployment. In this paper, we improve the accessibility of LLMs through BitMoD, an algorithm-hardware co-design solution that enables efficient LLM acceleration at low weight precision. On the algorithm side, BitMoD introduces fine-grained data type adaptation that uses a different numerical data type to quantize a group of (e.g., 128) weights. Through the careful design of these new data types, BitMoD is able to quantize LLM weights to very low precision (e.g., 4 bits and 3 bits) while maintaining high accuracy. On the hardware side, BitMoD employs a bit-serial processing element to easily support multiple numerical precisions and data types; our hardware design includes two key innovations: First, it employs a unified representation to process different weight data types, thus reducing the hardware cost. Second, it adopts a bit-serial dequantization unit to rescale the per-group partial sum with minimal hardware overhead. Our evaluation on six representative LLMs demonstrates that BitMoD significantly outperforms state-of-the-art LLM quantization and acceleration methods. For discriminative tasks, BitMoD can quantize LLM weights to 4-bit with $<\!0.5\%$ accuracy loss on average. For generative tasks, BitMoD is able to quantize LLM weights to 3-bit while achieving better perplexity than prior LLM quantization scheme. Combining the superior model performance with an efficient accelerator design, BitMoD achieves an average of $1.69\times$ and $1.48\times$ speedups compared to prior LLM accelerators ANT and OliVe, respectively.

* HPCA 2025

Via

Access Paper or Ask Questions

NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions

Feb 29, 2024

Marta Andronic, George A. Constantinides

Abstract:Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks. Among the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Thus, some previous FPGA acceleration works have proposed mapping neurons with quantized inputs and outputs directly to lookup tables (LUTs) for hardware implementation. In these works, the boundaries of the neurons coincide with the boundaries of the LUTs. We propose relaxing these boundaries and mapping entire sub-networks to a single LUT. As the sub-networks are absorbed within the LUT, the NN topology and precision within a partition do not affect the size of the lookup tables generated. Therefore, we utilize fully connected layers with floating-point precision inside each partition, which benefit from being universal function approximators, with rigid sparsity and quantization enforced only between partitions, where the NN topology becomes exposed to the circuit topology. Although cheap to implement, this approach can lead to very deep NNs, and so to tackle challenges like vanishing gradients, we also introduce skip connections inside the partitions. The resulting methodology can be seen as training DNNs with a specific sparsity pattern that allows them to be mapped to much shallower circuit-level networks, thereby significantly improving latency. We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, the digit classification using MNIST. Our approach allows for greater function expressivity within the LUTs compared to existing work, leading to lower latency NNs for the same accuracy.

Via

Access Paper or Ask Questions

PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference

Sep 05, 2023

Marta Andronic, George A. Constantinides

Figure 1 for PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference

Figure 2 for PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference

Figure 3 for PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference

Figure 4 for PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference

Abstract:Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the idea that the LUTs in an FPGA can be used to implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with zero overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.

Via

Access Paper or Ask Questions