Warren J. Gross

Memory-Efficient FPGA Implementation of Stochastic Simulated Annealing

Jan 25, 2026

Duckgyu Shin, Naoya Onizawa, Warren J. Gross, Takahiro Hanyu

Abstract:Simulated annealing (SA) is a well-known algorithm for solving combinatorial optimization problems. However, the computation time of SA increases rapidly as the size of the problem grows. Recently, a stochastic simulated annealing (SSA) algorithm that converges faster than conventional SA has been reported. In this paper, we present a hardware-aware SSA (HA-SSA) algorithm for memory-efficient FPGA implementations. HA-SSA can reduce the memory used to store intermediate results while maintaining the computing speed of SSA. For evaluation purposes, the proposed algorithm is compared with the conventional SSA and SA approaches on maximum cut combinatorial optimization problems. HA-SSA achieves a convergence speed that is up to 114 times faster than that of the conventional SA algorithm, depending on the maximum cut problem selected from the G-set, a dataset of maximum cut problems. HA-SSA is implemented on a field-programmable gate array (FPGA) (Xilinx Kintex-7), and it achieves up to 6 times the memory efficiency of conventional SSA while maintaining high solution quality for optimization problems.

* 11 pages
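
For readers unfamiliar with the baseline, the following is a minimal sketch of conventional simulated annealing on a maximum cut instance. It is not the SSA or HA-SSA algorithm from the paper; the function names, the geometric cooling schedule, and the default parameters are all assumptions chosen for illustration.

```python
import math
import random


def cut_value(edges, spins):
    """Total weight of the edges whose endpoints lie on opposite sides."""
    return sum(w for u, v, w in edges if spins[u] != spins[v])


def simulated_annealing_maxcut(n, edges, steps=5000, t_start=2.0, t_end=0.05, seed=0):
    """Conventional single-flip SA for max-cut (illustrative, not HA-SSA).

    edges is a list of (u, v, weight) tuples without self-loops.
    Returns the best cut value found and the matching +1/-1 assignment.
    """
    rng = random.Random(seed)
    spins = [rng.choice((-1, 1)) for _ in range(n)]

    # Adjacency list so the effect of one flip can be evaluated locally.
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    current = cut_value(edges, spins)
    best, best_spins = current, spins[:]

    for step in range(steps):
        # Assumed geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        i = rng.randrange(n)
        # Flipping node i cuts edges to same-side neighbours (+w)
        # and un-cuts edges to opposite-side neighbours (-w).
        gain = sum(w if spins[i] == spins[j] else -w for j, w in adj[i])
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if gain >= 0 or rng.random() < math.exp(gain / t):
            spins[i] = -spins[i]
            current += gain
            if current > best:
                best, best_spins = current, spins[:]
    return best, best_spins
```

Calling `simulated_annealing_maxcut(4, [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)])` runs the sketch on a four-node cycle, whose maximum cut contains all four edges.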

Hardware-friendly IR-HARQ for Polar SCL Decoders

Aug 11, 2025

Marwan Jalaleddine, Jiajie Li, Warren J. Gross

Abstract:To extend the applications of polar codes within next-generation wireless communication systems, it is essential to incorporate support for Incremental Redundancy (IR) Hybrid Automatic Repeat Request (HARQ) schemes. The baseline IR-HARQ scheme's reliance on set-based operations leads to irregular memory access patterns, posing significant challenges for efficient hardware implementation. Furthermore, the introduction of new bit types increases the number of fast nodes that are decoded without traversing the sub-tree, resulting in a substantial area overhead when implemented in hardware. To address these issues and improve hardware compatibility, we propose transforming the set-based operations within the polar IR-HARQ scheme into binary vector operations. Additionally, we introduce a new fast node integration approach that avoids increasing the number of fast nodes, thereby minimizing the associated area overhead. Our proposed scheme results in a memory overhead of 25-27% compared to successive cancellation list (SCL) decoding without IR-HARQ support.
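
The general idea of replacing set-based operations with binary vector operations can be sketched as follows. This is a generic illustration, not the paper's IR-HARQ scheme: the helper names and the example index sets are hypothetical, and Python integers stand in for fixed-width hardware bit vectors.

```python
def set_to_mask(indices, n):
    """Encode a set of bit positions in [0, n) as an n-bit vector (Python int)."""
    mask = 0
    for i in indices:
        if not 0 <= i < n:
            raise ValueError(f"index {i} is outside [0, {n})")
        mask |= 1 << i
    return mask


def mask_to_set(mask):
    """Decode a non-negative bit vector back into the set of positions that are 1."""
    return {i for i in range(mask.bit_length()) if (mask >> i) & 1}


# Hypothetical index sets over a length-8 code, for illustration only.
n = 8
set_a = {0, 1, 2, 4}
set_b = {2, 3, 4, 7}
mask_a = set_to_mask(set_a, n)
mask_b = set_to_mask(set_b, n)

# Each set operation becomes one bitwise operation with a regular access pattern.
union = mask_a | mask_b          # set_a | set_b
intersection = mask_a & mask_b   # set_a & set_b
difference = mask_a & ~mask_b    # set_a - set_b
```

A bit vector touches every position in a fixed order, which is what makes the memory access pattern regular compared with iterating over variable-size sets.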

Automatic Pruning of Fine-tuning Datasets for Transformer-based Language Models

Jul 11, 2024

Mohammadreza Tayaranian, Seyyed Hasan Mozafari, Brett H. Meyer, James J. Clark, Warren J. Gross

Abstract:Transformer-based language models have shown state-of-the-art performance on a variety of natural language understanding tasks. To achieve this performance, these models are first pre-trained on a general corpus and then fine-tuned on downstream tasks. Previous work studied the effect of pruning the training set of the downstream tasks on the performance of the model on its evaluation set. In this work, we propose an automatic dataset pruning method for the training set of fine-tuning tasks. Our method is based on the model's success rate in correctly classifying each training data point. Unlike previous work, which relies on user feedback to determine subset size, our method automatically extracts training subsets that are adapted for each pair of model and fine-tuning task. Our method provides multiple subsets for use in dataset pruning that navigate the trade-off between subset size and evaluation accuracy. Our largest subset, which we also refer to as the winning ticket subset, is on average $3 \times$ smaller than the original training set of the fine-tuning task. Our experiments on 5 downstream tasks and 2 language models show that, on average, fine-tuning on the winning ticket subsets results in a $0.1 \%$ increase in the evaluation performance of the model.

* 28 pages, 17 figures. Accepted at the Third Conference on Lifelong Learning Agents (CoLLAs 2024)
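
A per-example success rate of the kind the abstract mentions could be computed along these lines. The abstract does not say how the rate is measured or how subsets are cut from it, so everything here is an assumption: the rate is taken over training epochs, and the hypothetical `select_subset` helper keeps the examples at or below a chosen rate.

```python
def success_rates(history):
    """Fraction of epochs in which each example was classified correctly.

    history[e][i] is True when example i was predicted correctly at epoch e
    (an assumed bookkeeping format, not taken from the paper).
    """
    n_epochs = len(history)
    n_examples = len(history[0])
    return [sum(epoch[i] for epoch in history) / n_epochs for i in range(n_examples)]


def select_subset(rates, max_rate):
    """Indices of examples whose success rate is at most max_rate.

    Which side of the threshold to keep is an assumption made for this sketch.
    """
    return [i for i, r in enumerate(rates) if r <= max_rate]


# Toy history: 4 epochs, 3 training examples.
history = [
    [True, False, True],
    [True, False, False],
    [True, True, False],
    [True, False, False],
]
rates = success_rates(history)
subset = select_subset(rates, max_rate=0.5)
```

Sweeping `max_rate` would yield a family of subsets of different sizes, which loosely mirrors the size versus accuracy trade-off described in the abstract.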

Step-GRAND: A Low Latency Universal Soft-input Decoder

Jul 27, 2023

Syed Mohsin Abbas, Marwan Jalaleddine, Chi-Ying Tsui, Warren J. Gross

Abstract:GRAND features both soft-input and hard-input variants that are well suited to efficient hardware implementations that can be characterized with achievable average and worst-case decoding latency. This paper introduces step-GRAND, a soft-input variant of GRAND that, in addition to achieving appealing average decoding latency, also reduces the worst-case decoding latency of the corresponding hardware implementation. The hardware implementation results demonstrate that the proposed step-GRAND can decode CA-polar code $(128,105+11)$ with an average information throughput of $47.7$ Gbps at the target FER of $\leq10^{-7}$. Furthermore, the proposed step-GRAND hardware is $10\times$ more area efficient than the previous soft-input ORBGRAND hardware implementation, and its worst-case latency is $\frac{1}{6.8}\times$ that of the previous ORBGRAND hardware.

* Submitted to 2023 IEEE Globecom Workshops

SSS3D: Fast Neural Architecture Search For Efficient Three-Dimensional Semantic Segmentation

Apr 21, 2023

Olivier Therrien, Marihan Amein, Zhuoran Xiong, Warren J. Gross, Brett H. Meyer

Abstract:We present SSS3D, a fast multi-objective NAS framework designed to find computationally efficient 3D semantic scene segmentation networks. It uses RandLA-Net, an off-the-shelf point-based network, as a super-network to enable weight sharing and reduce search time by 99.67% for single-stage searches. SSS3D has a complex search space composed of sampling and architectural parameters that can form $2.88 \times 10^{17}$ possible networks. To further reduce search time, SSS3D splits the complete search space and introduces a two-stage search that finds optimal subnetworks in 54% of the time required by single-stage searches.

* Accepted as a full paper by the TinyML Research Symposium 2023

FMAS: Fast Multi-Objective SuperNet Architecture Search for Semantic Segmentation

Mar 28, 2023

Zhuoran Xiong, Marihan Amein, Olivier Therrien, Warren J. Gross, Brett H. Meyer

Abstract:We present FMAS, a fast multi-objective neural architecture search framework for semantic segmentation. FMAS subsamples the structure and pre-trained parameters of DeepLabV3+, without fine-tuning, dramatically reducing training time during search. To further reduce candidate evaluation time, we use a subset of the validation dataset during the search. Only the final, Pareto non-dominated candidates are ultimately fine-tuned using the complete training set. We evaluate FMAS by searching for models that effectively trade accuracy and computational cost on the PASCAL VOC 2012 dataset. FMAS finds competitive designs quickly, e.g., taking just 0.5 GPU days to discover a DeepLabV3+ variant that reduces FLOPs and parameters by 10$\%$ and 20$\%$ respectively, for less than 3$\%$ increased error. We also search on an edge device called GAP8 and use its latency as the metric. FMAS is capable of finding a 2.2$\times$ faster network with 7.61$\%$ MIoU loss.

* Accepted as a full paper by the TinyML Research Symposium 2023
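
The notion of Pareto non-dominated candidates used in the abstract can be illustrated with a small filter. This is a generic sketch, not FMAS itself: the objective tuples (error, GFLOPs) and their values are made up, and both objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are assumed to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical (error, GFLOPs) pairs for four candidate networks.
candidates = [(0.20, 10.0), (0.22, 8.0), (0.25, 9.0), (0.21, 12.0)]
front = pareto_front(candidates)
```

Here `(0.25, 9.0)` is dominated by `(0.22, 8.0)` and `(0.21, 12.0)` by `(0.20, 10.0)`, so only the two remaining candidates would move on to a final fine-tuning stage.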

BD-KD: Balancing the Divergences for Online Knowledge Distillation

Dec 25, 2022

Ibtihel Amara, Nazanin Sepahvand, Brett H. Meyer, Warren J. Gross, James J. Clark

Abstract:Knowledge distillation (KD) has gained a lot of attention in the field of model compression for edge devices thanks to its effectiveness in compressing large powerful networks into smaller lower-capacity models. Online distillation, in which both the teacher and the student are learning collaboratively, has also gained much interest due to its ability to improve on the performance of the networks involved. The Kullback-Leibler (KL) divergence ensures the proper knowledge transfer between the teacher and student. However, most online KD techniques present some bottlenecks under the network capacity gap. When the models are trained cooperatively and simultaneously, the KL distance becomes incapable of properly minimizing the gap between the teacher's and student's distributions. Alongside accuracy, critical edge device applications are in need of well-calibrated compact networks. Confidence calibration provides a sensible way of getting trustworthy predictions. We propose BD-KD: Balancing of Divergences for online Knowledge Distillation. We show that adaptively balancing between the reverse and forward divergences shifts the focus of the training strategy to the compact student network without limiting the teacher network's learning process. We demonstrate that, by performing this balancing design at the level of the student distillation loss, we improve upon both performance accuracy and calibration of the compact student network. We conducted extensive experiments using a variety of network architectures and show improvements on multiple datasets including CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet. We illustrate the effectiveness of our approach through comprehensive comparisons and ablations with current state-of-the-art online and offline KD techniques.
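
To make the forward and reverse divergences concrete, here is a minimal sketch that computes both and mixes them with a fixed weight. The abstract does not give the adaptive balancing rule, so the constant `alpha` is only a placeholder, and the convention that "forward" means KL(teacher || student) is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def balanced_loss(teacher_logits, student_logits, alpha=0.5):
    """Weighted mix of forward and reverse KL (alpha is a hypothetical fixed weight)."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    forward = kl_divergence(p_teacher, p_student)   # assumed: KL(teacher || student)
    reverse = kl_divergence(p_student, p_teacher)   # assumed: KL(student || teacher)
    return alpha * forward + (1.0 - alpha) * reverse
```

Because KL divergence is asymmetric, the two terms penalize different kinds of mismatch, which is why weighting them changes where the student's training effort goes.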

Efficient Fine-Tuning of Compressed Language Models with Learners

Aug 03, 2022

Danilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard, James J. Clark, Brett H. Meyer, Warren J. Gross

Abstract:Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the computational challenges of training to downstream tasks. We introduce Learner modules and priming, novel methods for fine-tuning that exploit the overparameterization of pre-trained language models to gain benefits in convergence speed and resource utilization. Learner modules navigate the double bind of 1) training efficiently by fine-tuning a subset of parameters, and 2) training effectively by ensuring quick convergence and high metric scores. Our results on DistilBERT demonstrate that learners perform on par with or surpass the baselines. Learners train 7x fewer parameters than state-of-the-art methods on GLUE. On CoLA, learners fine-tune 20% faster, and have significantly lower resource utilization.

* 8 pages, 9 figures, 2 tables, presented at ICML 2022 workshop on Hardware-Aware Efficient Training (HAET 2022)

Efficient Fine-Tuning of BERT Models on the Edge

May 03, 2022

Danilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard, James J. Clark, Brett H. Meyer, Warren J. Gross

Abstract:Resource-constrained devices are increasingly the deployment targets of machine learning applications. Static models, however, do not always suffice for dynamic environments. On-device training of models allows for quick adaptability to new scenarios. With the increasing size of deep neural networks, as noted with the likes of BERT and other natural language processing models, comes increased resource requirements, namely memory, computation, energy, and time. Furthermore, training is far more resource intensive than inference. Resource-constrained on-device learning is thus doubly difficult, especially with large BERT-like models. By reducing the memory usage of fine-tuning, pre-trained BERT models can become efficient enough to fine-tune on resource-constrained devices. We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models that reduces the memory usage of activation maps during fine-tuning by avoiding unnecessary parameter updates. FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%. More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.

* 4 pages, 2 figures, 3 tables. To be published in ISCAS 2022 and made available on IEEE Xplore

Standard Deviation-Based Quantization for Deep Neural Networks

Feb 24, 2022

Amir Ardakani, Arash Ardakani, Brett Meyer, James J. Clark, Warren J. Gross

Abstract:Quantization of deep neural networks is a promising approach that reduces the inference cost, making it feasible to run deep networks on resource-restricted devices. Inspired by existing methods, we propose a new framework to learn the quantization intervals (discrete values) using the knowledge of the network's weight and activation distributions, i.e., standard deviation. Furthermore, we propose a novel base-2 logarithmic quantization scheme to quantize weights to power-of-two discrete values. Our proposed scheme allows us to replace resource-hungry high-precision multipliers with simple shift-add operations. According to our evaluations, our method outperforms existing work on CIFAR10 and ImageNet datasets and even achieves better accuracy performance with 3-bit weights and activations when compared to the full-precision models. Moreover, our scheme simultaneously prunes the network's parameters and allows us to flexibly adjust the pruning ratio during the quantization process.
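
The reason power-of-two weights remove the need for multipliers can be shown with a short sketch. It covers only generic base-2 logarithmic rounding and shift-based products; it does not reproduce the paper's learned, standard-deviation-based intervals, and the exponent range and function names are assumptions.

```python
import math


def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Round a real weight to sign * 2**exp, with exp clipped to an assumed range.

    Returns (sign, exp); a zero weight is returned with sign 0.
    """
    if w == 0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return sign, exp


def shift_multiply(x, sign, exp):
    """Compute x * sign * 2**exp for an integer activation x using shifts only.

    A negative exponent becomes a right shift, which floors the result.
    """
    if sign == 0:
        return 0
    shifted = x << exp if exp >= 0 else x >> -exp
    return shifted if sign > 0 else -shifted


def pow2_dot(activations, weights):
    """Dot product in which every multiplication is replaced by shift and add."""
    return sum(shift_multiply(x, *quantize_pow2(w)) for x, w in zip(activations, weights))
```

For example, a weight of 0.3 is rounded to $2^{-2}$, so multiplying an integer activation by it reduces to a right shift by two bits.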
