Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lukas Cavigelli

AC-LoRA: (Almost) Training-Free Access Control-Aware Multi-Modal LLMs

May 15, 2025

Lara Magdalena Lazier, Aritra Dhar, Vasilije Stambolic, Lukas Cavigelli

Abstract:Corporate LLMs are gaining traction for efficient knowledge dissemination and management within organizations. However, as current LLMs are vulnerable to leaking sensitive information, it has proven difficult to apply them in settings where strict access control is necessary. To this end, we design AC-LoRA, an end-to-end system for access control-aware corporate LLM chatbots that maintains a strong information isolation guarantee. AC-LoRA maintains separate LoRA adapters for permissioned datasets, along with the document embedding they are finetuned on. AC-LoRA retrieves a precise set of LoRA adapters based on the similarity score with the user query and their permission. This similarity score is later used to merge the responses if more than one LoRA is retrieved, without requiring any additional training for LoRA routing. We provide an end-to-end prototype of AC-LoRA, evaluate it on two datasets, and show that AC-LoRA matches or even exceeds the performance of state-of-the-art LoRA mixing techniques while providing strong isolation guarantees. Furthermore, we show that AC-LoRA design can be directly applied to different modalities.

Via

Access Paper or Ask Questions

Late Breaking Results: The Art of Beating the Odds with Predictor-Guided Random Design Space Exploration

Feb 26, 2025

Felix Arnold, Maxence Bouvier, Ryan Amaudruz, Renzo Andri, Lukas Cavigelli

Figure 1 for Late Breaking Results: The Art of Beating the Odds with Predictor-Guided Random Design Space Exploration

Figure 2 for Late Breaking Results: The Art of Beating the Odds with Predictor-Guided Random Design Space Exploration

Figure 3 for Late Breaking Results: The Art of Beating the Odds with Predictor-Guided Random Design Space Exploration

Figure 4 for Late Breaking Results: The Art of Beating the Odds with Predictor-Guided Random Design Space Exploration

Abstract:This work introduces an innovative method for improving combinational digital circuits through random exploration in MIG-based synthesis. High-quality circuits are crucial for performance, power, and cost, making this a critical area of active research. Our approach incorporates next-state prediction and iterative selection, significantly accelerating the synthesis process. This novel method achieves up to 14x synthesis speedup and up to 20.94% better MIG minimization on the EPFL Combinational Benchmark Suite compared to state-of-the-art techniques. We further explore various predictor models and show that increased prediction accuracy does not guarantee an equivalent increase in synthesis quality of results or speedup, observing that randomness remains a desirable factor.

* 2 pages, 3 figures, conference, this research manuscript is currently under review for publication in an IEEE conference

Via

Access Paper or Ask Questions

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Feb 12, 2025

Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli

Abstract:The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-$\theta$, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, thus not requiring the thresholds for different datasets to be recalibrated. Unlike top-k attention, Top-$\theta$ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.

* 8 pages, 11 figures, work under submission

Via

Access Paper or Ask Questions

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Jan 14, 2025

Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli

Figure 1 for PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Figure 2 for PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Figure 3 for PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Figure 4 for PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Abstract:Large language models (LLMs) are widely used across various applications, but their substantial computational requirements pose significant challenges, particularly in terms of HBM bandwidth bottlenecks and inter-device communication overhead. In this paper, we present PRESERVE, a novel prefetching framework designed to optimize LLM inference by overlapping memory reads for model weights and KV-cache with collective communication operations. Through extensive experiments conducted on commercial AI accelerators, we demonstrate up to 1.6x end-to-end speedup on state-of-the-art, open-source LLMs. Additionally, we perform a design space exploration that identifies the optimal hardware configuration for the proposed method, showing a further 1.25x improvement in performance per cost by selecting the optimal L2 cache size. Our results show that PRESERVE has the potential to mitigate the memory bottlenecks and communication overheads, offering a solution to improve the performance and scalability of the LLM inference systems.

Via

Access Paper or Ask Questions

SSSD: Simply-Scalable Speculative Decoding

Nov 08, 2024

Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Lorenz K. Müller, Lukas Cavigelli

Figure 1 for SSSD: Simply-Scalable Speculative Decoding

Figure 2 for SSSD: Simply-Scalable Speculative Decoding

Figure 3 for SSSD: Simply-Scalable Speculative Decoding

Figure 4 for SSSD: Simply-Scalable Speculative Decoding

Abstract:Over the past year, Speculative Decoding has gained popularity as a technique for accelerating Large Language Model inference. While several methods have been introduced, most struggle to deliver satisfactory performance at batch sizes typical for data centers ($\geq 8$) and often involve significant deployment complexities. In this work, we offer a theoretical explanation of how Speculative Decoding can be effectively utilized with larger batch sizes. We also introduce a method that integrates seamlessly into existing systems without additional training or the complexity of deploying a small LLM. In a continuous batching setting, we achieve a 4x increase in throughput without any latency impact for short context generation, and a 1.7-2x improvement in both latency and throughput for longer contexts.

* 14 pages, 7 figures

Via

Access Paper or Ask Questions

On-Device Domain Learning for Keyword Spotting on Low-Power Extreme Edge Embedded Systems

Mar 12, 2024

Cristian Cioflan, Lukas Cavigelli, Manuele Rusci, Miguel de Prado, Luca Benini

Abstract:Keyword spotting accuracy degrades when neural networks are exposed to noisy environments. On-site adaptation to previously unseen noise is crucial to recovering accuracy loss, and on-device learning is required to ensure that the adaptation process happens entirely on the edge device. In this work, we propose a fully on-device domain adaptation system achieving up to 14% accuracy gains over already-robust keyword spotting models. We enable on-device learning with less than 10 kB of memory, using only 100 labeled utterances to recover 5% accuracy after adapting to the complex speech noise. We demonstrate that domain adaptation can be achieved on ultra-low-power microcontrollers with as little as 806 mJ in only 14 s on always-on, battery-operated devices.

* 5 pages, 2 tables, 2 figures. Accepted at IEEE AICAS 2024

Via

Access Paper or Ask Questions

Boosting keyword spotting through on-device learnable user speech characteristics

Mar 12, 2024

Cristian Cioflan, Lukas Cavigelli, Luca Benini

Abstract:Keyword spotting systems for always-on TinyML-constrained applications require on-site tuning to boost the accuracy of offline trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture, composed of a pretrained backbone and a user-aware embedding learning the user's speech characteristics. The so-generated features are fused and used to classify the input utterance. For domain shifts generated by unseen speakers, we measure error rate reductions of up to 19% from 30.1% to 24.3% based on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. We moreover demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7 kparameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.

* 5 pages, 3 tables, 2 figures. Accepted as a full paper by the tinyML Research Symposium 2024

Via

Access Paper or Ask Questions

Stella Nera: Achieving 161 TOp/s/W with Multiplier-free DNN Acceleration based on Approximate Matrix Multiplication

Nov 16, 2023

Jannis Schönleber, Lukas Cavigelli, Renzo Andri, Matteo Perotti, Luca Benini

Abstract:From classical HPC to deep learning, MatMul is at the heart of today's computing. The recent Maddness method approximates MatMul without the need for multiplication by using a hash-based version of product quantization (PQ) indexing into a look-up table (LUT). Stella Nera is the first Maddness accelerator and it achieves 15x higher area efficiency (GMAC/s/mm^2) and more than 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators implemented in the same technology. The hash function is a decision tree, which allows for an efficient hardware implementation as the multiply-accumulate operations are replaced by decision tree passes and LUT lookups. The entire Maddness MatMul can be broken down into parts that allow an effective implementation with small computing units and memories, allowing it to reach extreme efficiency while remaining generically applicable for MatMul tasks. In a commercial 14nm technology and scaled to 3nm, we achieve an energy efficiency of 161 TOp/s/W@0.55V with a Top-1 accuracy on CIFAR-10 of more than 92.5% using ResNet9.

* 6 pages, 7 figures, preprint under review

Via

Access Paper or Ask Questions

RL-based Stateful Neural Adaptive Sampling and Denoising for Real-Time Path Tracing

Oct 05, 2023

Antoine Scardigli, Lukas Cavigelli, Lorenz K. Müller

Abstract:Monte-Carlo path tracing is a powerful technique for realistic image synthesis but suffers from high levels of noise at low sample counts, limiting its use in real-time applications. To address this, we propose a framework with end-to-end training of a sampling importance network, a latent space encoder network, and a denoiser network. Our approach uses reinforcement learning to optimize the sampling importance network, thus avoiding explicit numerically approximated gradients. Our method does not aggregate the sampled values per pixel by averaging but keeps all sampled values which are then fed into the latent space encoder. The encoder replaces handcrafted spatiotemporal heuristics by learned representations in a latent space. Finally, a neural denoiser is trained to refine the output image. Our approach increases visual quality on several challenging datasets and reduces rendering times for equal quality by a factor of 1.6x compared to the previous state-of-the-art, making it a promising solution for real-time applications.

* Submitted to NeurIPS. https://openreview.net/forum?id=xNyR7DXUzJ

Via

Access Paper or Ask Questions

Going Further With Winograd Convolutions: Tap-Wise Quantization for Efficient Inference on 4x4 Tile

Sep 26, 2022

Renzo Andri, Beatrice Bussolino, Antonio Cipolletta, Lukas Cavigelli, Zhe Wang

Figure 1 for Going Further With Winograd Convolutions: Tap-Wise Quantization for Efficient Inference on 4x4 Tile

Figure 2 for Going Further With Winograd Convolutions: Tap-Wise Quantization for Efficient Inference on 4x4 Tile

Figure 3 for Going Further With Winograd Convolutions: Tap-Wise Quantization for Efficient Inference on 4x4 Tile

Figure 4 for Going Further With Winograd Convolutions: Tap-Wise Quantization for Efficient Inference on 4x4 Tile

Abstract:Most of today's computer vision pipelines are built around deep neural networks, where convolution operations require most of the generally high compute effort. The Winograd convolution algorithm computes convolutions with fewer MACs compared to the standard algorithm, reducing the operation count by a factor of 2.25x for 3x3 convolutions when using the version with 2x2-sized tiles $F_2$. Even though the gain is significant, the Winograd algorithm with larger tile sizes, i.e., $F_4$, offers even more potential in improving throughput and energy efficiency, as it reduces the required MACs by 4x. Unfortunately, the Winograd algorithm with larger tile sizes introduces numerical issues that prevent its use on integer domain-specific accelerators and higher computational overhead to transform input and output data between spatial and Winograd domains. To unlock the full potential of Winograd $F_4$, we propose a novel tap-wise quantization method that overcomes the numerical issues of using larger tiles, enabling integer-only inference. Moreover, we present custom hardware units that process the Winograd transformations in a power- and area-efficient way, and we show how to integrate such custom modules in an industrial-grade, programmable DSA. An extensive experimental evaluation on a large set of state-of-the-art computer vision benchmarks reveals that the tap-wise quantization algorithm makes the quantized Winograd $F_4$ network almost as accurate as the FP32 baseline. The Winograd-enhanced DSA achieves up to 1.85x gain in energy efficiency and up to 1.83x end-to-end speed-up for state-of-the-art segmentation and detection networks.

* Accepted at IEEE/ACM MICRO 2022 (1-5 October 2022)

Via

Access Paper or Ask Questions