Shivam Aggarwal

HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration

Feb 27, 2025

Rohan Juneja, Shivam Aggarwal, Safeen Huda, Tulika Mitra, Li-Shiuan Peh

Abstract: Quantization is critical for realizing efficient inference of LLMs. Traditional quantization methods are hardware-agnostic, limited to bit-width constraints, and lack circuit-level insights, such as the timing and energy characteristics of Multiply-Accumulate (MAC) units. We introduce HALO, a versatile framework that adapts to various hardware through a Hardware-Aware Post-Training Quantization (PTQ) approach. By leveraging MAC unit properties, HALO minimizes critical-path delays and enables dynamic frequency scaling. Deployed on LLM accelerators like TPUs and GPUs, HALO achieves on average 270% performance gains and 51% energy savings, all with minimal accuracy drop.

Condensed Sample-Guided Model Inversion for Knowledge Distillation

Aug 25, 2024

Kuluhan Binici, Shivam Aggarwal, Cihan Acar, Nam Trung Pham, Karianto Leman, Gim Hee Lee, Tulika Mitra

Abstract: Knowledge distillation (KD) is a key element in neural network compression that allows knowledge transfer from a pre-trained teacher model to a more compact student model. KD relies on access to the training dataset, which may not always be fully available due to privacy concerns or logistical issues related to the size of the data. To address this, "data-free" KD methods use synthetic data, generated through model inversion, to mimic the target data distribution. However, conventional model inversion methods are not designed to utilize supplementary information from the target dataset, and thus, cannot leverage it to improve performance, even when it is available. In this paper, we consider condensed samples, as a form of supplementary information, and introduce a method for using them to better approximate the target data distribution, thereby enhancing the KD performance. Our approach is versatile, evidenced by improvements of up to 11.4% in KD accuracy across various datasets and model inversion-based methods. Importantly, it remains effective even when using as few as one condensed sample per class, and can also enhance performance in few-shot scenarios where only limited real data samples are available.
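
As a rough illustration of the teacher-to-student transfer that KD refers to, the sketch below computes the commonly used temperature-softened distillation loss on a single sample. It is not taken from the paper: the function names `softmax` and `kd_loss` and the default temperature are assumptions, and the paper's contribution (using condensed samples to guide model inversion) is not shown here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened outputs.

    Scaled by temperature**2, a common convention; the default
    temperature of 4.0 is an illustrative assumption.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    print(kd_loss(teacher, teacher))          # identical outputs give zero loss
    print(kd_loss([3.0, 2.0, 1.0], teacher))  # mismatched outputs give a positive loss
```

In a data-free setting, the inputs that produce these logits would be synthetic samples rather than real training data.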

CRISP: Hybrid Structured Sparsity for Class-aware Model Pruning

Nov 24, 2023

Shivam Aggarwal, Kuluhan Binici, Tulika Mitra

Abstract: Machine learning pipelines for classification tasks often train a universal model to achieve accuracy across a broad range of classes. However, a typical user encounters only a limited selection of classes regularly. This disparity provides an opportunity to enhance computational efficiency by tailoring models to focus on user-specific classes. Existing works rely on unstructured pruning, which introduces randomly distributed non-zero values in the model, making it unsuitable for hardware acceleration. Alternatively, some approaches employ structured pruning, such as channel pruning, but these tend to provide only minimal compression and may lead to reduced model accuracy. In this work, we propose CRISP, a novel pruning framework leveraging a hybrid structured sparsity pattern that combines both fine-grained N:M structured sparsity and coarse-grained block sparsity. Our pruning strategy is guided by a gradient-based class-aware saliency score, allowing us to retain weights crucial for user-specific classes. CRISP achieves high accuracy with minimal memory consumption for popular models like ResNet-50, VGG-16, and MobileNetV2 on ImageNet and CIFAR-100 datasets. Moreover, CRISP delivers up to 14× reduction in latency and energy consumption compared to existing pruning methods while maintaining comparable accuracy. Our code is available at https://github.com/shivmgg/CRISP/.

* 6 pages, accepted in Design, Automation & Test in Europe Conference & Exhibition (DATE) 2024
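
To make the N:M pattern mentioned in the abstract concrete, here is a minimal sketch that keeps the N most salient weights in every group of M consecutive weights and zeroes the rest. It is not the CRISP implementation: the function name `nm_prune` is hypothetical, weight magnitude stands in for the paper's gradient-based class-aware saliency score, and the coarse-grained block sparsity component is omitted.

```python
def nm_prune(weights, n=2, m=4):
    """Apply N:M structured sparsity to a flat list of weights.

    In every group of m consecutive weights, the n entries with the
    largest saliency are kept and the others are set to zero.
    Saliency here is plain magnitude (an assumption for illustration).
    """
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Rank positions in the group by saliency, highest first.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


if __name__ == "__main__":
    row = [0.1, -0.9, 0.5, 0.2, 1.0, 0.0, -0.3, 0.4]
    print(nm_prune(row, n=2, m=4))
```

Because every group of M holds exactly N non-zeros, the positions can be stored with a small fixed-size index per group, which is what makes this pattern friendlier to hardware than unstructured pruning.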

Post-Training Quantization with Low-precision Minifloats and Integers on FPGAs

Nov 21, 2023

Shivam Aggarwal, Alessandro Pappalardo, Hans Jakob Damsgaard, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra

Abstract: Post-Training Quantization (PTQ) is a powerful technique for model compression, reducing the precision of neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point quantization (FP8) in the context of PTQ for model inference. However, the exploration of floating-point formats smaller than 8 bits and their comparison with integer quantization remains relatively limited. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. Our work presents a novel PTQ design-space exploration, comparing minifloat and integer quantization schemes across a range of 3 to 8 bits for both weights and activations. We examine the applicability of various PTQ techniques to minifloats, including weight equalization, bias correction, SmoothQuant, gradient-based learned rounding, and the GPTQ method. Our experiments validate the effectiveness of low-precision minifloats when compared to their integer counterparts across a spectrum of accuracy-precision trade-offs on a set of reference deep learning vision workloads. Finally, we evaluate our results against an FPGA-based hardware cost model, showing that integer quantization often remains the Pareto-optimal option, given its relatively smaller hardware resource footprint.
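
The idea of a minifloat can be illustrated by enumerating every value a tiny floating-point format can represent and rounding to the nearest one. The sketch below is an illustration under stated assumptions, not the paper's method: it assumes one sign bit, an IEEE-style exponent bias of 2^(e-1) - 1, subnormals, and no encodings reserved for infinity or NaN; the names `minifloat_values` and `quantize` are hypothetical.

```python
def minifloat_values(exp_bits, man_bits):
    """Return the sorted set of values representable in a sign/exponent/mantissa format.

    Assumes an IEEE-style bias, subnormals at exponent code 0, and
    no infinity/NaN encodings (all codes map to finite values).
    """
    bias = 2 ** (exp_bits - 1) - 1
    values = set()
    for e in range(2 ** exp_bits):
        for f in range(2 ** man_bits):
            frac = f / 2 ** man_bits
            if e == 0:
                magnitude = frac * 2.0 ** (1 - bias)          # subnormal range
            else:
                magnitude = (1.0 + frac) * 2.0 ** (e - bias)  # normal range
            values.add(magnitude)
            values.add(-magnitude)
    return sorted(values)


def quantize(x, grid):
    """Round x to the nearest representable value; out-of-range inputs saturate."""
    return min(grid, key=lambda v: abs(v - x))


if __name__ == "__main__":
    grid = minifloat_values(exp_bits=2, man_bits=1)  # a 4-bit format including sign
    print(grid)
    print(quantize(2.4, grid), quantize(-0.7, grid), quantize(100.0, grid))
```

Unlike an integer grid, the spacing between representable values grows with magnitude, which is the trade-off a comparison between minifloat and integer quantization has to weigh.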

Robust and Resource-Efficient Data-Free Knowledge Distillation by Generative Pseudo Replay

Jan 09, 2022

Kuluhan Binici, Shivam Aggarwal, Nam Trung Pham, Karianto Leman, Tulika Mitra

Abstract: Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of original training data. Existing works use a validation set to monitor the accuracy of the student over real data and report the highest performance throughout the entire process. However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy. Therefore, a practical data-free KD method should be robust and ideally provide monotonically increasing student accuracy during distillation. This is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data. A straightforward approach to overcome this issue is to store and rehearse the generated samples periodically, which increases the memory footprint and creates privacy concerns. We propose to model the distribution of the previously observed synthetic samples with a generative network. In particular, we design a Variational Autoencoder (VAE) with a training objective that is customized to learn the synthetic data representations optimally. The student is rehearsed by the generative pseudo replay technique, with samples produced by the VAE. Hence, knowledge degradation can be prevented without storing any samples. Experiments on image classification benchmarks show that our method optimizes the expected value of the distilled model accuracy while eliminating the large memory overhead incurred by the sample-storing methods.

* Accepted by the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

Synthetic Video Generation for Robust Hand Gesture Recognition in Augmented Reality Applications

Dec 06, 2019

Varun Jain, Shivam Aggarwal, Suril Mehta, Ramya Hebbalaguppe

Abstract: Hand gestures are a natural means of interaction in Augmented Reality and Virtual Reality (AR/VR) applications. Recently, there has been an increased focus on removing the dependence of accurate hand gesture recognition on the complex sensor setups found in expensive proprietary devices such as the Microsoft HoloLens, Daqri and Meta Glasses. Most such solutions rely either on multi-modal sensor data or on deep neural networks that can benefit greatly from an abundance of labelled data. Datasets are an integral part of any deep-learning-based research. They have been the principal reason for the substantial progress in this field, both in providing enough data to train these models and in benchmarking competing algorithms. However, it is becoming increasingly difficult to generate enough labelled data for complex tasks such as hand gesture recognition. The goal of this work is to introduce a framework capable of generating photo-realistic videos with labelled hand bounding boxes and fingertips that can help in designing, training, and benchmarking models for hand-gesture recognition in AR/VR applications. We demonstrate the efficacy of our framework in generating videos with diverse backgrounds.

* Presented at the ICCV 2019 Workshop: The 5th International Workshop on Observing And Understanding Hands In Action
