Ian Colbert

Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization

May 16, 2025

Shihao Zhang, Haoyu Zhang, Ian Colbert, Rayan Saab

Abstract: We introduce Qronos -- a new state-of-the-art post-training quantization algorithm that sequentially rounds and updates neural network weights. Qronos explicitly corrects not only errors due to both weight and activation quantization, but also errors resulting from quantizing previous layers. Our iterative algorithm is based on an interpretable and disciplined optimization framework that subsumes and surpasses existing data-driven approaches. At each step, Qronos alternates between error correction and diffusion via optimal update rules. Importantly, we prove that Qronos admits an efficient implementation that uses the Cholesky decomposition for solving least-squares problems. We also demonstrate that Qronos is compatible with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization, among others. We evaluate Qronos using recent autoregressive language generation models in the Llama3 family; Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing the weights, activations, and/or KV caches.
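
As a generic illustration of solving a least-squares problem with the Cholesky decomposition, the sketch below forms the normal equations and solves them with two triangular systems. It is not the Qronos update rule; the function name and the small `damp` ridge term are assumptions made for this example.

```python
import numpy as np

def solve_least_squares_cholesky(X, y, damp=1e-8):
    """Solve min_w ||X w - y||^2 via the normal equations and a Cholesky factor."""
    d = X.shape[1]
    # Gram matrix; `damp` is a hypothetical ridge term that keeps it positive definite.
    G = X.T @ X + damp * np.eye(d)
    L = np.linalg.cholesky(G)        # G = L @ L.T, with L lower triangular
    z = np.linalg.solve(L, X.T @ y)  # forward step: L z = X^T y
    return np.linalg.solve(L.T, z)   # backward step: L^T w = z

# Recover known weights from noiseless synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
w_true = rng.standard_normal(8)
w_hat = solve_least_squares_cholesky(X, X @ w_true)
```

`np.linalg.solve` is a general solver; a dedicated triangular solver such as `scipy.linalg.solve_triangular` would exploit the structure of `L`.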

Exploiting Unstructured Sparsity in Fully Homomorphic Encrypted DNNs

Mar 12, 2025

Aidan Ferguson, Perry Gibson, Lara D'Agata, Parker McLeod, Ferhat Yaman, Amitabh Das, Ian Colbert, José Cano

Abstract: The deployment of deep neural networks (DNNs) in privacy-sensitive environments is constrained by computational overheads in fully homomorphic encryption (FHE). This paper explores unstructured sparsity in FHE matrix multiplication schemes as a means of reducing this burden while maintaining model accuracy requirements. We demonstrate that sparsity can be exploited in arbitrary matrix multiplication, providing runtime benefits compared to a baseline naive algorithm at all sparsity levels. This is a notable departure from the plaintext domain, where there is a trade-off between sparsity and the overhead of the sparse multiplication algorithm. In addition, we propose three sparse multiplication schemes in FHE based on common plaintext sparse encodings. We demonstrate the performance gain is scheme-invariant; however, some sparse schemes vastly reduce the memory storage requirements of the encrypted matrix at high sparsity values. Our proposed sparse schemes yield an average performance gain of 2.5x at 50% unstructured sparsity, with our multi-threading scheme providing a 32.5x performance increase over the equivalent single-threaded sparse computation when utilizing 64 cores.
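
To make the idea of a plaintext sparse encoding concrete, the sketch below stores a matrix in compressed sparse row (CSR) form and multiplies it by a vector while skipping zero entries. CSR is chosen here as one common encoding; the abstract does not name the three encodings used, and this example works on plaintext only, with no encryption involved. The helper names are hypothetical.

```python
import numpy as np

def to_csr(dense):
    """Encode a dense 2-D array as CSR: nonzero values, their columns, and row offsets."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row's nonzeros
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x, touching only the stored nonzero entries of A."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

A = np.array([[0.0, 2.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 3.0]])
x = np.array([1.0, 2.0, 3.0])
values, col_idx, row_ptr = to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, x)
```

Only three multiplications are performed here instead of nine, which is the kind of saving a sparse scheme aims for.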

* Accepted to 5th Workshop on Machine Learning and Systems (EuroMLSys) co-located with EuroSys '25

Accumulator-Aware Post-Training Quantization

Sep 25, 2024

Ian Colbert, Fabian Grob, Giuseppe Franco, Jinjie Zhang, Rayan Saab

Abstract: Several recent studies have investigated low-precision accumulation, reporting improvements in throughput, power, and area across various platforms. However, the accompanying proposals have only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models continue to grow in size, QAT techniques become increasingly expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To the best of our knowledge, ours marks the first formal study of accumulator-aware quantization in the PTQ setting. To bridge this gap, we introduce AXE, a practical framework of accumulator-aware extensions designed to endow overflow avoidance guarantees to existing layer-wise PTQ algorithms. We theoretically motivate AXE and demonstrate its flexibility by implementing it on top of two state-of-the-art PTQ algorithms: GPFQ and OPTQ. We further generalize AXE to support multi-stage accumulation for the first time, opening the door for full datapath optimization and scaling to large language models (LLMs). We evaluate AXE across image classification and language generation models, and observe significant improvements in the trade-off between accumulator bit width and model accuracy over baseline methods.

A2Q+: Improving Accumulator-Aware Weight Quantization

Jan 19, 2024

Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig, Yaman Umuroglu

Abstract: Quantization techniques commonly reduce the inference costs of neural networks by restricting the precision of weights and activations. Recent studies show that reducing the precision of the accumulator as well can further improve hardware efficiency at the risk of numerical overflow, which introduces arithmetic errors that can degrade model accuracy. To avoid numerical overflow while maintaining accuracy, recent work proposed accumulator-aware quantization (A2Q), a quantization-aware training method that constrains model weights during training to safely use a target accumulator bit width during inference. Although this shows promise, we demonstrate that A2Q relies on an overly restrictive constraint and a sub-optimal weight initialization strategy that each introduce superfluous quantization error. To address these shortcomings, we introduce: (1) an improved bound that alleviates accumulator constraints without compromising overflow avoidance; and (2) a new strategy for initializing quantized weights from pre-trained floating-point checkpoints. We combine these contributions with weight normalization to introduce A2Q+. We support our analysis with experiments that show A2Q+ significantly improves the trade-off between accumulator bit width and model accuracy and characterize new trade-offs that arise as a consequence of accumulator constraints.

A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance

Aug 25, 2023

Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig

Abstract: We present accumulator-aware quantization (A2Q), a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow when using low-precision accumulators during inference. A2Q introduces a unique formulation inspired by weight normalization that constrains the L1-norm of model weights according to accumulator bit width bounds that we derive. Thus, in training QNNs for low-precision accumulation, A2Q also inherently promotes unstructured weight sparsity to guarantee overflow avoidance. We apply our method to deep learning-based computer vision tasks to show that A2Q can train QNNs for low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline. In our evaluations, we consider the impact of A2Q on both general-purpose platforms and programmable hardware. However, we primarily target model deployment on FPGAs because they can be programmed to fully exploit custom accumulator bit widths. Our experimentation shows accumulator bit width significantly impacts the resource efficiency of FPGA-based accelerators. On average across our benchmarks, A2Q offers up to a 2.3x reduction in resource utilization over 32-bit accumulator counterparts with 99.2% of the floating-point model accuracy.
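
An L1-norm constraint of this kind can be motivated by a worst-case argument: if every integer input has magnitude at most x_max, then |sum_i w_i x_i| <= ||w||_1 * x_max, so keeping ||w||_1 <= (2^(P-1) - 1) / x_max keeps the dot product inside a signed P-bit accumulator. The sketch below implements this simple bound. It is an illustrative derivation and not necessarily the exact bound derived in the paper; the function names are hypothetical.

```python
import numpy as np

def l1_budget(acc_bits, input_bits, signed_input=False):
    """Largest L1 norm of integer weights that cannot overflow a signed accumulator."""
    # Largest input magnitude: 2^(N-1) for two's complement, 2^N - 1 for unsigned.
    max_abs_input = 2 ** (input_bits - 1) if signed_input else 2 ** input_bits - 1
    # Largest positive value of a signed accumulator with `acc_bits` bits.
    max_acc = 2 ** (acc_bits - 1) - 1
    return max_acc / max_abs_input

def fits_accumulator(q_weights, acc_bits, input_bits, signed_input=False):
    """Check whether integer weights satisfy the worst-case L1 budget."""
    l1_norm = np.abs(np.asarray(q_weights)).sum()
    return bool(l1_norm <= l1_budget(acc_bits, input_bits, signed_input))

# 16-bit accumulator with unsigned 8-bit inputs: budget is 32767 / 255.
safe = fits_accumulator([64, -64], acc_bits=16, input_bits=8)
unsafe = fits_accumulator([64, -65], acc_bits=16, input_bits=8)
```

Because the budget shrinks as the accumulator narrows, meeting it pushes many weights toward zero, which is consistent with the sparsity effect the abstract describes.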

* arXiv admin note: substantial text overlap with arXiv:2301.13376

Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance

Jan 31, 2023

Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig

Abstract: We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference. We leverage weight normalization as a means of constraining parameters during training using accumulator bit width bounds that we derive. We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline. We then show that this reduction translates to increased design efficiency for custom FPGA-based accelerators. Finally, we show that our algorithm not only constrains weights to fit into an accumulator of user-defined bit width, but also increases the sparsity and compressibility of the resulting weights. Across all of our benchmark models trained with 8-bit weights and activations, we observe that constraining the hidden layers of quantized neural networks to fit into 16-bit accumulators yields an average 98.2% sparsity with an estimated compression rate of 46.5x, all while maintaining 99.2% of the floating-point performance.

Robust Transferable Feature Extractors: Learning to Defend Pre-Trained Networks Against White Box Adversaries

Sep 14, 2022

Alexander Cann, Ian Colbert, Ihab Amer

Abstract: The widespread adoption of deep neural networks in computer vision applications has brought forth a significant interest in adversarial robustness. Existing research has shown that maliciously perturbed inputs specifically tailored for a given model (i.e., adversarial examples) can be successfully transferred to another independently trained model to induce prediction errors. Moreover, this property of adversarial examples has been attributed to features derived from predictive patterns in the data distribution. Thus, we are motivated to investigate the following question: Can adversarial defenses, like adversarial examples, be successfully transferred to other independently trained models? To this end, we propose a deep learning-based pre-processing mechanism, which we refer to as a robust transferable feature extractor (RTFE). After examining theoretical motivation and implications, we experimentally show that our method can provide adversarial robustness to multiple independently pre-trained classifiers that are otherwise ineffective against an adaptive white box adversary. Furthermore, we show that RTFEs can even provide one-shot adversarial robustness to models independently trained on different datasets.

Human-Like Navigation Behavior: A Statistical Evaluation Framework

Mar 10, 2022

Ian Colbert, Mehdi Saeedi

Abstract: Recent advancements in deep reinforcement learning have brought forth an impressive display of highly skilled artificial agents capable of complex intelligent behavior. In video games, these artificial agents are increasingly deployed as non-playable characters (NPCs) designed to enhance the experience of human players. However, while it has been shown that the convincing human-like behavior of NPCs leads to increased engagement in video games, the believability of an artificial agent's behavior is most often measured solely by its proficiency at a given task. Recent work has hinted that proficiency alone is not sufficient to discern human-like behavior. Motivated by this, we build a non-parametric two-sample hypothesis test designed to compare the behaviors of artificial agents to those of human players. We show that the resulting $p$-value not only aligns with anonymous human judgment of human-like behavior, but also that it can be used as a measure of similarity.
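
For readers unfamiliar with non-parametric two-sample testing, the sketch below shows a generic permutation test on two one-dimensional samples. It is not the test constructed in the paper: the difference-of-means statistic, the function name, and the idea of summarizing each trajectory by a single scalar feature are all assumptions made for illustration.

```python
import numpy as np

def permutation_test(a, b, n_perm=999, seed=0):
    """Two-sided permutation test on the difference of sample means; returns a p-value."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are exchangeable.
        perm = rng.permutation(pooled)
        stat = abs(perm[:a.size].mean() - perm[a.size:].mean())
        if stat >= observed:
            count += 1
    # The +1 terms count the observed labeling itself, so p is never exactly zero.
    return (count + 1) / (n_perm + 1)

# Hypothetical scalar features (e.g. one summary value per episode).
human = np.arange(10.0)
agent = np.arange(10.0) + 100.0
p_different = permutation_test(human, agent)
p_same = permutation_test(human, human)
```

A small p-value suggests the two samples come from different distributions; a larger one means the test cannot tell them apart.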

Generating GPU Compiler Heuristics using Reinforcement Learning

Nov 23, 2021

Ian Colbert, Jake Daly, Norm Rubin

Abstract: GPU compilers are complex software programs with many optimizations specific to target hardware. These optimizations are often controlled by heuristics hand-designed by compiler experts using time- and resource-intensive processes. In this paper, we develop a GPU compiler autotuning framework that uses off-policy deep reinforcement learning to generate heuristics that improve the frame rates of graphics applications. Furthermore, we demonstrate the resilience of these learned heuristics to frequent compiler updates by analyzing their stability across a year of code check-ins without retraining. We show that our machine learning-based compiler autotuning framework matches or surpasses the frame rates for 98% of graphics benchmarks with an average uplift of 1.6% and gains of up to 15.8%.

Training Deep Neural Networks with Joint Quantization and Pruning of Weights and Activations

Nov 01, 2021

Xinyu Zhang, Ian Colbert, Ken Kreutz-Delgado, Srinjoy Das

Abstract: Quantization and pruning are core techniques used to reduce the inference costs of deep neural networks. State-of-the-art quantization techniques are currently applied to both the weights and activations; however, pruning is most often applied to only the weights of the network. In this work, we jointly apply novel uniform quantization and unstructured pruning methods to both the weights and activations of deep neural networks during training. Using our methods, we empirically evaluate the currently accepted prune-then-quantize paradigm across a wide range of computer vision tasks and observe a non-commutative nature when applied to both the weights and activations of deep neural networks. Informed by these observations, we articulate the non-commutativity hypothesis: for a given deep neural network being trained for a specific task, there exists an exact training schedule in which quantization and pruning can be introduced to optimize network performance. We identify that this optimal ordering not only exists, but also varies across discriminative and generative tasks. Using the optimal training schedule within our training framework, we demonstrate increased performance per memory footprint over existing solutions.
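
The two operations being combined can be sketched in their simplest textbook form: symmetric uniform quantization and magnitude-based unstructured pruning. These are generic stand-ins, not the novel methods proposed in the paper, and the function names are hypothetical. The example applies them to a tensor in both orders; the paper's hypothesis concerns when each is introduced during training, which this one-shot composition does not model.

```python
import numpy as np

def uniform_quantize(x, bits=8):
    """Symmetric uniform quantization: snap values to a grid of 2^(bits-1) - 1 levels per sign."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(x).max()
    if max_abs == 0:
        return x.copy()  # nothing to scale in an all-zero tensor
    scale = max_abs / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def magnitude_prune(x, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest magnitude."""
    k = int(sparsity * x.size)
    if k == 0:
        return x.copy()
    # Magnitude of the k-th smallest entry; ties at the threshold are pruned too.
    threshold = np.sort(np.abs(x).ravel())[k - 1]
    return np.where(np.abs(x) > threshold, x, 0.0)

w = np.array([3.0, -1.0, 2.0, -4.0])
prune_then_quantize = uniform_quantize(magnitude_prune(w, 0.5), bits=4)
quantize_then_prune = magnitude_prune(uniform_quantize(w, bits=4), 0.5)
```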
