ETH Zurich
Abstract:N:M structured pruning is essential for large language models (LLMs) because it removes less important network weights and reduces memory and computation requirements. Existing pruning methods mainly focus on designing metrics to measure the importance of network components to guide pruning. Beyond the impact of these metrics, we observe that different layers have different sensitivities with respect to network performance. Thus, we propose an efficient method based on the trace of the Fisher Information Matrix (FIM) to quantitatively measure and verify the different sensitivities across layers. Based on this, we propose Mixed Sparsity Pruning (MSP), which uses a pruning-oriented evolutionary algorithm (EA) to determine the optimal sparsity levels for different layers. To guarantee fast convergence and achieve promising performance, we utilize an efficient FIM-inspired layer-wise sensitivity to initialize the population of the EA. In addition, MSP can work as a plug-and-play module, ready to be integrated into existing pruning methods. Extensive experiments with LLaMA and LLaMA-2 on language modeling and zero-shot tasks demonstrate superior performance. In particular, at extreme pruning ratios (e.g., 75%), our method outperforms existing methods in terms of perplexity (PPL) by orders of magnitude (Figure 1).
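As an illustration of the FIM-based sensitivity idea described above, the following Python sketch approximates the per-layer trace of the empirical Fisher Information Matrix as the sum of squared gradients over a small calibration set and uses it to seed per-layer sparsity levels. The calibration interface, the normalization, and the `spread` parameter are assumptions for illustration, not the published MSP procedure.

```python
import torch

def layer_fisher_trace(model, calib_loader, layer_params):
    """Approximate the per-layer trace of the empirical Fisher Information Matrix
    as the sum of squared gradients over a small calibration set (a common
    diagonal-FIM approximation; the paper's exact estimator may differ)."""
    traces = {name: 0.0 for name in layer_params}
    for batch in calib_loader:
        model.zero_grad()
        loss = model(**batch).loss  # assumes a HuggingFace-style LM interface
        loss.backward()
        for name, p in layer_params.items():
            if p.grad is not None:
                traces[name] += p.grad.detach().pow(2).sum().item()
    return traces

def init_sparsity_from_sensitivity(traces, target_sparsity=0.5, spread=0.1):
    """Seed per-layer sparsity levels for the evolutionary search: layers with a
    larger FIM trace (more sensitive) are pruned less, keeping the mean sparsity
    close to the overall target budget."""
    names = list(traces)
    t = torch.tensor([traces[n] for n in names])
    sens = t / t.sum()  # normalized layer sensitivity
    sparsity = target_sparsity - spread * (sens - sens.mean()) / (sens.std() + 1e-8)
    return dict(zip(names, sparsity.clamp(0.0, 1.0).tolist()))
```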
Abstract:The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area- and power-constraints of these devices. In this work, we propose a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1x and 3.4x faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect load and non-zero indices decompression operations required by our kernels, obtaining up to 1.9x extra speedup, at the cost of a 5% area overhead. Lastly, we extend an open-source DNN compiler to utilize our sparse kernels for complete networks, showing speedups of 3.21x and 1.81x on a ResNet18 and a Vision Transformer (ViT), with less than 1.5% accuracy drop compared to a dense baseline.
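The following NumPy sketch models the computation the sparse kernels perform for a 1:M layout (e.g., 1:8 or 1:16): only one weight per group of M is stored together with its intra-group index, and inputs are gathered via indirect loads. The data layout and function name are illustrative; the optimized RISC-V kernels and the ISA extension in the paper implement this access pattern far more efficiently.

```python
import numpy as np

def nm_sparse_matvec(values, indices, x, M):
    """Functional reference for a 1:M structured-sparse matrix-vector product.

    values  : (out_features, in_features // M) kept weights, one per group of M
    indices : same shape, position (0..M-1) of the kept weight inside its group
    x       : dense input vector of length in_features
    """
    out_features, groups = values.shape
    y = np.zeros(out_features, dtype=x.dtype)
    for o in range(out_features):
        acc = 0.0
        for g in range(groups):
            # "Indirect load": gather the single input element that matches the
            # stored non-zero weight of this group.
            acc += values[o, g] * x[g * M + indices[o, g]]
        y[o] = acc
    return y
```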
Abstract:Accurate and efficient electroencephalography (EEG) analysis is essential for detecting seizures and artifacts in long-term monitoring, with applications spanning hospital diagnostics to wearable health devices. Robust EEG analytics have the potential to greatly improve patient care. However, traditional deep learning models, especially Transformer-based architectures, are hindered by their quadratic time and memory complexity, making them less suitable for resource-constrained environments. To address these challenges, we present FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel self-supervised framework that establishes new efficiency benchmarks for EEG analysis through bidirectional state-space modeling. Unlike Transformer-based models, which incur quadratic time and memory complexity, FEMBA scales linearly with sequence length, enabling more scalable and efficient processing of extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and fine-tuned on three downstream tasks, FEMBA achieves performance competitive with Transformer-based models at significantly lower computational cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates viability for resource-constrained devices. These results pave the way for scalable, general-purpose EEG analytics in both clinical and wearable settings and highlight FEMBA as a promising candidate for wearable applications.
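As a rough illustration of bidirectional state-space modeling with linear sequence-length cost, the PyTorch sketch below runs a simplified diagonal linear recurrence forward and backward over the sequence and fuses the two directions. Dimensions, parameterization, and the block structure are assumptions and do not reproduce the actual FEMBA/Mamba formulation.

```python
import torch
import torch.nn as nn

class BidirectionalSSMBlock(nn.Module):
    """Simplified bidirectional state-space block: a diagonal linear recurrence is
    run over the sequence in both directions and the results are combined."""
    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.A = nn.Parameter(torch.rand(dim, state_dim) * -0.5)  # negative -> stable decay
        self.B = nn.Parameter(torch.randn(dim, state_dim) * 0.1)
        self.C = nn.Parameter(torch.randn(dim, state_dim) * 0.1)
        self.out = nn.Linear(2 * dim, dim)

    def scan(self, x):                       # x: (batch, seq, dim)
        b, t, d = x.shape
        h = x.new_zeros(b, d, self.A.shape[1])
        decay = torch.exp(self.A)            # per-channel diagonal transition
        ys = []
        for i in range(t):                   # O(t) recurrence, linear in sequence length
            h = h * decay + x[:, i, :, None] * self.B
            ys.append((h * self.C).sum(-1))
        return torch.stack(ys, dim=1)

    def forward(self, x):
        fwd = self.scan(x)
        bwd = self.scan(x.flip(1)).flip(1)   # same recurrence over the reversed sequence
        return self.out(torch.cat([fwd, bwd], dim=-1))
```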
Abstract:Electroencephalography (EEG) is a crucial tool for studying brain activity. Recently, self-supervised learning methods leveraging large unlabeled datasets have emerged as a potential solution to the scarcity of widely available annotated EEG data. However, current methods suffer from at least one of the following limitations: i) sub-optimal EEG signal modeling, ii) model sizes in the hundreds of millions of trainable parameters, and iii) reliance on private datasets and/or inconsistent public benchmarks, hindering reproducibility. To address these challenges, we introduce a Compact Encoder for Representations of Brain Oscillations using alternating attention (CEReBrO), a new small EEG foundation model. Our tokenization scheme represents EEG signals at a per-channel patch granularity. We propose an alternating attention mechanism that jointly models intra-channel temporal dynamics and inter-channel spatial correlations, achieving a 2x speed improvement and 6x lower memory use compared to standard self-attention. We present several model sizes ranging from 3.6 million to 85 million parameters. Pre-trained on over 20,000 hours of publicly available scalp EEG recordings with diverse channel configurations, our models set new benchmarks in emotion detection and seizure detection tasks, with competitive performance in anomaly classification and gait prediction. This validates our models' effectiveness and efficiency.
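The PyTorch sketch below illustrates the alternating-attention idea on per-channel patch tokens: one attention layer mixes patches within each channel (intra-channel temporal), the next mixes channels at each patch index (inter-channel spatial). Layer sizes, normalization placement, and the block layout are illustrative assumptions rather than the published CEReBrO configuration.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Alternating attention over per-channel EEG patch tokens: temporal attention
    within each channel, then spatial attention across channels per patch index."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, channels, patches, dim)
        b, c, p, d = x.shape
        # Intra-channel (temporal): each channel attends over its own patches.
        t = x.reshape(b * c, p, d)
        tn = self.norm1(t)
        t = t + self.temporal(tn, tn, tn)[0]
        x = t.reshape(b, c, p, d)
        # Inter-channel (spatial): each patch index attends across channels.
        s = x.transpose(1, 2).reshape(b * p, c, d)
        sn = self.norm2(s)
        s = s + self.spatial(sn, sn, sn)[0]
        return s.reshape(b, p, c, d).transpose(1, 2)
```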
Abstract:Since 2013, the PULP (Parallel Ultra-Low Power) Platform project has been one of the most active and successful initiatives in designing research IPs and releasing them as open-source. Its portfolio now ranges from processor cores to network-on-chips, peripherals, SoC templates, and full hardware accelerators. In this article, we focus on the PULP experience designing heterogeneous AI acceleration SoCs - an endeavour encompassing SoC architecture definition; the development, verification, and integration of acceleration IPs; front- and back-end VLSI design; testing; and the development of AI deployment software.
Abstract:Heart rate (HR) estimation from photoplethysmography (PPG) signals is a key feature of modern wearable devices for health and wellness monitoring. While deep learning models show promise, their performance relies on the availability of large datasets. We present EnhancePPG, a method that enhances state-of-the-art models by integrating self-supervised learning with data augmentation (DA). Our approach combines self-supervised pre-training with DA, allowing the model to learn more generalizable features without needing more labelled data. Inspired by a U-Net-like autoencoder architecture, we employ unsupervised PPG signal reconstruction, taking advantage of large amounts of unlabeled data during the pre-training phase, combined with data augmentation, to improve the performance of state-of-the-art models. Thanks to our approach and minimal modifications to the state-of-the-art model, we improve HR estimation by 12.2%, lowering the error on PPG-DaLiA from 4.03 beats per minute (BPM) to 3.54 BPM. Importantly, our EnhancePPG approach focuses exclusively on the training of the selected deep learning model, without significantly increasing its inference latency.
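A minimal sketch of the self-supervised reconstruction pre-training stage is shown below, assuming a generic U-Net-like autoencoder and a user-supplied augmentation function; the optimizer, loss, and hyperparameters are placeholders, not the EnhancePPG training recipe.

```python
import torch
import torch.nn as nn

def pretrain_reconstruction(autoencoder, unlabeled_loader, augment, epochs=10, lr=1e-3):
    """Illustrative pre-training loop: the autoencoder reconstructs clean PPG windows
    from augmented, unlabeled inputs; the trained encoder is later reused for the
    downstream HR-estimation model."""
    opt = torch.optim.Adam(autoencoder.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    autoencoder.train()
    for _ in range(epochs):
        for ppg in unlabeled_loader:          # ppg: (batch, 1, window_length)
            noisy = augment(ppg)              # e.g. jitter, scaling, time masking
            recon = autoencoder(noisy)
            loss = loss_fn(recon, ppg)        # reconstruct the clean signal
            opt.zero_grad()
            loss.backward()
            opt.step()
    return autoencoder
```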
Abstract:Nano-drones, with their small, lightweight design, are ideal for confined-space rescue missions and inherently safe for human interaction. However, their limited payload restricts the critical sensing needed for ego-velocity estimation and obstacle detection to single-beam laser-based time-of-flight (ToF) and low-resolution optical sensors. Although those sensors have demonstrated good performance, they fail in some complex real-world scenarios, especially when facing transparent or reflective surfaces (ToFs) or when lacking visual features (optical-flow sensors). Taking inspiration from bats, this paper proposes a novel two-way ranging-based method for ego-velocity estimation and obstacle avoidance based on down- and forward-facing ultra-low-power ultrasonic sensors, improving performance when the drone faces reflective materials or navigates in complete darkness. Our results demonstrate that our new sensing system achieves a mean square error of 0.019 m/s on ego-velocity estimation and allows exploration for a flight time of 8 minutes while covering 136 m on average in a challenging environment with transparent and reflective obstacles. We also compare ultrasonic and laser-based ToF sensing techniques for obstacle avoidance, as well as optical-flow and ultrasonic-based techniques for ego-velocity estimation, showing how these systems and methods can complement each other to enhance the robustness of nano-drone operations.
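As a toy illustration of deriving ego-velocity from successive two-way-ranging measurements, the sketch below differentiates the range readings over time and smooths the result; the actual estimator and its fusion with other sensors in the paper are more sophisticated than this finite-difference example.

```python
import numpy as np

def ego_velocity_from_ranges(ranges, timestamps):
    """Toy estimate of along-beam ego-velocity from successive ultrasonic
    two-way-ranging distances: finite differences followed by a short
    moving-average smoother to suppress ranging noise."""
    ranges = np.asarray(ranges, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    v = np.diff(ranges) / np.diff(timestamps)     # m/s along the sensor beam
    kernel = np.ones(5) / 5.0                     # simple smoothing window
    return np.convolve(v, kernel, mode="same")
```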
Abstract:While vision transformers show promise in numerous image restoration (IR) tasks, the challenge remains in efficiently generalizing and scaling up a model for multiple IR tasks. To strike a balance between efficiency and model capacity for a generalized transformer-based IR method, we propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR, which progressively propagates information among pixels in a bottom-up manner. Hi-IR constructs a hierarchical information tree representing the degraded image across three levels. Each level encapsulates a different type of information, with higher levels encompassing broader objects and concepts and lower levels focusing on local details. Moreover, the hierarchical tree architecture removes long-range self-attention, improving computational efficiency and memory utilization and thus preparing the model for effective scaling. On this basis, we explore model scaling to improve our method's capabilities, which is expected to positively impact IR in large-scale training settings. Extensive experimental results show that Hi-IR achieves state-of-the-art performance in seven common image restoration tasks, affirming its effectiveness and generalizability.
Abstract:Fine-tuning large-scale text-to-image diffusion models for various downstream tasks has yielded impressive results. However, the heavy computational burden of tuning large models prevents personal customization. Recent advances have attempted to employ parameter-efficient fine-tuning (PEFT) techniques to adapt the floating-point (FP) or quantized pre-trained weights. Nonetheless, the adaptation parameters in existing works are still restricted to FP arithmetic, hindering hardware-friendly acceleration. In this work, we propose IntLoRA to further push the efficiency limits by using integer-type (INT) low-rank parameters to adapt the quantized diffusion models. By working in integer arithmetic, our IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT, which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting, eliminating additional post-training quantization. Extensive experiments demonstrate that IntLoRA can achieve performance on par with or even superior to vanilla LoRA, accompanied by significant efficiency improvements. Code is available at \url{https://github.com/csguoh/IntLoRA}.
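The NumPy sketch below illustrates the kind of integer-only merge described above: an INT low-rank product is rescaled by a bit-shift and added to the quantized pre-trained weights without any floating-point arithmetic. The tensor names, the power-of-two rescaling, and the INT8 clamp are assumptions for illustration, not the exact IntLoRA algorithm.

```python
import numpy as np

def merge_intlora(W_q, A_int, B_int, shift):
    """Merge an integer low-rank update into INT8-quantized weights.

    W_q           : (out, in) int8 quantized pre-trained weights
    A_int, B_int  : (out, r) and (r, in) integer low-rank factors
    shift         : power-of-two rescaling applied as an arithmetic right shift
    """
    delta = (A_int.astype(np.int32) @ B_int.astype(np.int32)) >> shift  # bit-shift rescale
    merged = W_q.astype(np.int32) + delta                               # pure integer add
    return np.clip(merged, -128, 127).astype(np.int8)                   # stays INT8
```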
Abstract:This work explores the feasibility of employing ultrasound (US) technology in a wrist-worn IoT device for low-power, high-fidelity heart-rate (HR) extraction. US offers deep tissue penetration and can monitor pulsatile arterial blood flow in large vessels and the surrounding tissue, potentially improving robustness and accuracy compared to PPG. We present an IoT wearable system prototype utilizing a commercial microcontroller (MCU), employing the onboard ADC to capture high-frequency US signals, and an innovative low-power US pulser. An envelope filter lowers the bandwidth of the US signal by a factor of >5x, reducing the system's acquisition requirements without compromising accuracy (correlation coefficient between HR extracted from enveloped and raw signals, r(92)=0.99, p<0.001). The full signal processing pipeline is ported to fixed-point arithmetic for increased energy efficiency and runs entirely onboard. The system has an average power consumption of 5.8mW, competitive with PPG-based systems, and the HR extraction algorithm requires only 68kB of RAM and 71ms of processing time on an ARM Cortex-M4 MCU. The system is estimated to run continuously for more than 7 days on a smartwatch battery. To accurately evaluate the proposed circuit and algorithm and to identify the anatomical location on the wrist with the highest accuracy for HR extraction, we collected a dataset from 10 healthy adults at three different wrist positions. The dataset comprises roughly 5 hours of HR data with an average of 80.6±16.3 bpm. During recording, we synchronized the established ECG gold standard with our US-based method. The comparison yields a Pearson correlation coefficient of r(92)=0.99, p<0.001 and a mean error of 0.69±1.99 bpm in the lateral wrist position near the radial artery. The dataset and code have been open-sourced at https://github.com/mgiordy/Ultrasound-Heart-Rate.
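A floating-point sketch of a comparable processing chain is shown below: each US frame is rectified and averaged to form an envelope, the slow pulsatile component is low-pass filtered across frames, and peaks are counted to obtain HR. Cutoff frequencies, the frame handling, and the SciPy usage are illustrative assumptions, whereas the deployed pipeline runs in fixed point on the MCU.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def hr_from_ultrasound(us_frames, frame_rate, lowpass_hz=10.0):
    """Illustrative HR extraction from ultrasound echo frames.

    us_frames  : (n_frames, samples_per_frame) raw echo samples
    frame_rate : pulse-repetition rate in Hz (assumed well above the cutoff)
    """
    # Envelope per frame: rectification plus averaging reduces the bandwidth so a
    # slower acquisition/processing chain can keep up.
    envelope = np.abs(us_frames)
    pulsatile = envelope.mean(axis=1)                  # one value per frame over time
    b, a = butter(2, lowpass_hz / (frame_rate / 2), btype="low")
    pulsatile = filtfilt(b, a, pulsatile - pulsatile.mean())
    # Peak counting: assume at most ~3 Hz heart rate (180 bpm) for the minimum spacing.
    peaks, _ = find_peaks(pulsatile, distance=max(1, int(frame_rate / 3)))
    duration_s = len(pulsatile) / frame_rate
    return 60.0 * len(peaks) / duration_s              # beats per minute
```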