Wu

Post-Training Quantization for Vision Mamba with k-Scaled Quantization and Reparameterization

Jan 28, 2025

Bo-Yun Shi, Yi-Cheng Lo, An-Yeu Wu

Abstract:The Mamba model, utilizing a structured state-space model (SSM), offers linear time complexity and demonstrates significant potential. Vision Mamba (ViM) extends this framework to vision tasks by incorporating a bidirectional SSM and patch embedding, surpassing Transformer-based models in performance. While model quantization is essential for efficient computing, existing works have focused solely on the original Mamba model and have not been applied to ViM. Additionally, they neglect quantizing the SSM layer, which is central to Mamba and whose inherent structure causes substantial error propagation under naive quantization. In this paper, we focus on the post-training quantization (PTQ) of ViM. We address the issues with three core techniques: 1) a k-scaled token-wise quantization method for linear and convolutional layers, 2) a reparameterization technique to simplify hidden state quantization, and 3) a factor-determining method that reduces computational overhead by integrating operations. Through these methods, the error caused by PTQ can be mitigated. Experimental results on ImageNet-1k demonstrate only a 0.8-1.2% accuracy degradation due to PTQ, highlighting the effectiveness of our approach.
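
As a rough illustration of token-wise quantization, the sketch below applies symmetric uniform quantization with one scale per token (row). It is a minimal sketch under stated assumptions: the function names and the 8-bit default are hypothetical, and it does not implement the paper's k-scaled variant, its reparameterization, or its factor-determining method.

```python
def quantize_per_token(tokens, bits=8):
    """Symmetric uniform quantization with one scale per token (row).

    Illustrative helper, not the paper's k-scaled method.
    """
    qmax = 2 ** (bits - 1) - 1
    quantized, scales = [], []
    for row in tokens:
        max_abs = max(abs(v) for v in row)
        # Guard against an all-zero token, which would give a zero scale.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        quantized.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize_per_token(quantized, scales):
    """Map integer codes back to real values using each token's scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two tokens with very different dynamic ranges.
tokens = [[0.1, -0.2, 0.05], [10.0, -30.0, 20.0]]
codes, scales = quantize_per_token(tokens)
restored = dequantize_per_token(codes, scales)
```

Because each token gets its own scale, the small-valued first token is not crushed by the range of the second one; the rounding error of every element stays within half of its own token's step size.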

SoK: Prompt Hacking of Large Language Models

Oct 16, 2024

Baha Rababah, Shang Wu, Matthew Kwiatkowski, Carson Leung, Cuneyt Gurcan Akcora

Abstract:The safety and robustness of large language model (LLM)-based applications remain critical challenges in artificial intelligence. Among the key threats to these applications are prompt hacking attacks, which can significantly undermine the security and reliability of LLM-based systems. In this work, we offer a comprehensive and systematic overview of three distinct types of prompt hacking: jailbreaking, leaking, and injection, addressing the nuances that differentiate them despite their overlapping characteristics. To enhance the evaluation of LLM-based applications, we propose a novel framework that categorizes LLM responses into five distinct classes, moving beyond the traditional binary classification. This approach provides more granular insights into the AI's behavior, improving diagnostic precision and enabling more targeted enhancements to the system's safety and robustness.

Efficient and Reliable Vector Similarity Search Using Asymmetric Encoding with NAND-Flash for Many-Class Few-Shot Learning

Sep 12, 2024

Hao-Wei Chiang, Chi-Tse Huang, Hsiang-Yun Cheng, Po-Hao Tseng, Ming-Hsiu Lee, An-Yeu Wu

Abstract:While memory-augmented neural networks (MANNs) offer an effective solution for few-shot learning (FSL) by integrating deep neural networks with external memory, the capacity requirements and energy overhead of data movement become enormous due to the large number of support vectors in many-class FSL scenarios. Various in-memory search solutions have emerged to improve the energy efficiency of MANNs. NAND-based multi-bit content addressable memory (MCAM) is a promising option due to its high density and large capacity. Despite its potential, MCAM faces limitations such as a restricted number of word lines, limited quantization levels, and non-ideal effects like varying string currents and bottleneck effects, which lead to significant accuracy drops. To address these issues, we propose several innovative methods. First, the Multi-bit Thermometer Code (MTMC) leverages the extensive capacity of MCAM to enhance vector precision using cumulative encoding rules, thereby mitigating the bottleneck effect. Second, the Asymmetric vector similarity search (AVSS) reduces the precision of the query vector while maintaining that of the support vectors, thereby minimizing the search iterations and improving efficiency in many-class scenarios. Finally, the Hardware-Aware Training (HAT) method optimizes controller training by modeling the hardware characteristics of MCAM, thus enhancing the reliability of the system. Our integrated framework reduces search iterations by up to 32 times, and increases overall accuracy by 1.58% to 6.94%.
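
To make the idea of cumulative (thermometer) encoding concrete, here is a minimal sketch of the plain single-bit thermometer code. It is not the paper's Multi-bit Thermometer Code (MTMC), whose exact encoding rules are not given in the abstract; the function names are hypothetical. The property it shows is the reason such codes suit in-memory similarity search: the Hamming distance between two codes equals the absolute difference of the encoded values.

```python
def thermometer_encode(value, levels):
    """Encode an integer in [0, levels) as `value` ones followed by zeros."""
    if not 0 <= value < levels:
        raise ValueError("value must lie in [0, levels)")
    return [1] * value + [0] * (levels - 1 - value)


def hamming_distance(a, b):
    """Count the positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


# Distance between the codes for 2 and 5 on an 8-level scale.
code_a = thermometer_encode(2, 8)
code_b = thermometer_encode(5, 8)
distance = hamming_distance(code_a, code_b)
```

The cost of this property is code length: a value with `levels` quantization levels needs `levels - 1` cells instead of about `log2(levels)` bits.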

Relay-Assisted Carrier Aggregation (RACA) Uplink System for Enhancing Data Rate of Extended Reality (XR)

Jul 02, 2024

Chi-Wei Chen, Wen-Chiao Tsai, Lung-Sheng Tsai, An-Yeu Wu

Abstract:In Extended Reality (XR) applications, high data rates and low latency are crucial for immersive experiences. Uplink transmission in XR is challenging due to the limited antennas and power of lightweight XR devices. To improve data transmission rates, we investigate a relay-assisted carrier aggregation (RACA) system. The XR device simultaneously transmits data to an access point (AP) and a relay in proximity over low-frequency and high-frequency bands, respectively. Then, the relay down-converts and amplifies the signals to the AP, effectively acting as an additional transmit antenna for the XR device. In this paper, we propose two algorithms to maximize the data rate of the XR device in their respective protocols. In the centralized protocol, the rate maximization problem is equivalently transformed into a weighted mean square error minimization (WMMSE) problem, which can be solved iteratively by alternating optimization. In the distributed protocol, the rate maximization problem is decomposed into two independent sub-problems where the rate of the direct link and the rate of the relay link are maximized by singular value decomposition (SVD)-based methods with water-filling (WF). Simulation results show that the rate of the RACA system is improved by 32% compared to that of the conventional carrier aggregation scheme.
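
The water-filling step mentioned for the distributed protocol can be sketched as follows. This is a generic textbook water-filling allocation, not the authors' algorithm: it assumes the SVD has already produced parallel sub-channels, and that each entry of `gains` is a squared singular value divided by the noise power (both the name and that normalization are assumptions).

```python
import math


def water_filling(gains, total_power):
    """Allocate total_power over parallel channels to maximize the sum rate.

    gains: positive per-channel gains (assumed: squared singular value / noise power).
    Returns the per-channel powers in the original channel order.
    """
    order = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    inverse = [1.0 / gains[i] for i in order]
    powers = [0.0] * len(gains)
    # Try using the k strongest channels, dropping the weakest until all powers are positive.
    for k in range(len(inverse), 0, -1):
        water_level = (total_power + sum(inverse[:k])) / k
        if water_level > inverse[k - 1]:
            for j in range(k):
                powers[order[j]] = water_level - inverse[j]
            break
    return powers


def sum_rate(gains, powers):
    """Sum of log2(1 + gain * power) over all channels, in bits per channel use."""
    return sum(math.log2(1.0 + g * p) for g, p in zip(gains, powers))


gains = [4.0, 1.0, 0.1]
allocation = water_filling(gains, total_power=1.0)
```

Strong channels receive more power and channels below the water level receive none, which is why the weakest channel in this example is switched off.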

LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer

Apr 11, 2024

Jiing-Ping Wang, Ming-Guang Lin, An-Yeu Wu

Abstract:With the rise of Transformer models in the NLP and CV domains, Multi-Head Attention (MHA) has proven to be a game-changer. However, its expensive computation poses challenges to model throughput and efficiency, especially for long-sequence tasks. Exploiting the sparsity in attention has proven to be an effective way to reduce computation. Nevertheless, prior works do not consider the differing distributions among heads and lack a systematic method to determine the threshold. To address these challenges, we propose Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer (LATTE). LATTE employs a head-wise threshold-based filter with a low-precision dot product and a computation reuse mechanism to reduce the computation of MHA. Moreover, the trainable threshold is introduced to provide a systematic method for adjusting the thresholds and enable end-to-end optimization. Experimental results indicate LATTE can smoothly adapt to both NLP and CV tasks, offering significant computation savings with only a minor compromise in performance. Also, the trainable threshold is shown to be essential for the trade-off between performance and computation. As a result, LATTE filters up to 85.16% of keys with only a 0.87% accuracy drop in the CV task and 89.91% of keys with a 0.86 perplexity increase in the NLP task.
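
A threshold-based key filter for a single attention head can be sketched as below. This is a simplified illustration, not LATTE itself: the threshold is a fixed number rather than a trained parameter, a full-precision dot product stands in for the low-precision one, the usual scaling by the square root of the head dimension is omitted, and there is no computation reuse. All names are hypothetical.

```python
import math


def filtered_attention(query, keys, values, threshold):
    """Attend only to keys whose raw score reaches the threshold.

    Returns the attention output and the indices of the keys that were kept.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    if not kept:
        # Fall back to the single best key so the softmax is well defined.
        kept = [max(range(len(scores)), key=scores.__getitem__)]
    top = max(scores[i] for i in kept)
    weights = [math.exp(scores[i] - top) for i in kept]  # numerically stable softmax
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, kept)) / total
              for d in range(dim)]
    return output, kept


query = [1.0, 0.0]
keys = [[5.0, 0.0], [-5.0, 0.0], [4.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, kept = filtered_attention(query, keys, values, threshold=0.0)
```

Keys with low scores contribute almost nothing after the softmax, so skipping them saves the value-weighted sum for those positions at a small cost in accuracy.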

MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision Transformer

Feb 01, 2024

Yu-Shan Tai, An-Yeu Wu

Abstract:While vision transformers (ViTs) have shown great potential in computer vision tasks, their intense computation and memory requirements pose challenges for practical applications. Existing post-training quantization methods leverage value redistribution or specialized quantizers to address the non-normal distribution in ViTs. However, without considering the asymmetry in activations and relying on hand-crafted settings, these methods often struggle to maintain performance under low-bit quantization. To overcome these challenges, we introduce SmoothQuant with bias term (SQ-b) to alleviate the asymmetry issue and reduce the clamping loss. We also introduce optimal scaling factor ratio search (OPT-m) to determine quantization parameters by a data-dependent mechanism automatically. To further enhance the compressibility, we incorporate the above-mentioned techniques and propose a mixed-precision post-training quantization framework for vision transformers (MPTQ-ViT). We develop greedy mixed-precision quantization (Greedy MP) to allocate layer-wise bit-width considering both model performance and compressibility. Our experiments on ViT, DeiT, and Swin demonstrate significant accuracy improvements compared with SOTA on the ImageNet dataset. Specifically, our proposed methods achieve accuracy improvements ranging from 0.90% to 23.35% on 4-bit ViTs with single-precision and from 3.82% to 78.14% on 5-bit fully quantized ViTs with mixed-precision.

TSPTQ-ViT: Two-scaled post-training quantization for vision transformer

May 22, 2023

Yu-Shan Tai, Ming-Guang Lin, An-Yeu Wu

Abstract:Vision transformers (ViTs) have achieved remarkable performance in various computer vision tasks. However, intensive memory and computation requirements impede ViTs from running on resource-constrained edge devices. Due to the non-normally distributed values after Softmax and GeLU, post-training quantization on ViTs results in severe accuracy degradation. Moreover, conventional methods fail to address the high channel-wise variance in LayerNorm. To reduce the quantization loss and improve classification accuracy, we propose a two-scaled post-training quantization scheme for vision transformer (TSPTQ-ViT). We design the value-aware two-scaled scaling factors (V-2SF) specialized for post-Softmax and post-GeLU values, which leverage the bit sparsity in non-normal distribution to save bit-widths. In addition, the outlier-aware two-scaled scaling factors (O-2SF) are introduced to LayerNorm, alleviating the dominant impacts from outlier values. Our experimental results show that the proposed methods reach near-lossless accuracy drops (<0.5%) on the ImageNet classification task under 8-bit fully quantized ViTs.

C3-SL: Circular Convolution-Based Batch-Wise Compression for Communication-Efficient Split Learning

Jul 25, 2022

Cheng-Yen Hsieh, Yu-Chuan Chuang, An-Yeu Wu

Abstract:Most existing studies improve the efficiency of split learning (SL) by compressing the transmitted features. However, most works focus on dimension-wise compression that transforms high-dimensional features into a low-dimensional space. In this paper, we propose circular convolution-based batch-wise compression for SL (C3-SL) to compress multiple features into a single feature. To avoid information loss while merging multiple features, we exploit the quasi-orthogonality of features in high-dimensional space with circular convolution and superposition. To the best of our knowledge, we are the first to explore the potential of batch-wise compression under the SL scenario. Based on the simulation results on CIFAR-10 and CIFAR-100, our method achieves a 16x compression ratio with negligible accuracy drops compared with vanilla SL. Moreover, C3-SL reduces memory overhead by 1152x and computation overhead by 2.25x compared to the state-of-the-art dimension-wise compression method.

* 6 pages, IEEE MLSP 2022, Github: https://github.com/WesleyHsieh0806/Split-Learning-Compression
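
The binding-and-superposition idea in the abstract can be sketched with plain circular convolution and circular correlation. This is a generic illustration under assumptions, not the code from the linked repository: the random Gaussian keys, the dimension of 256, and the function names are all hypothetical choices, and no trained encoder or decoder is involved.

```python
import math
import random


def circular_convolve(a, b):
    """Circular convolution of two equal-length vectors (binds b to key a)."""
    n = len(a)
    return [sum(a[m] * b[(i - m) % n] for m in range(n)) for i in range(n)]


def circular_correlate(key, merged):
    """Circular correlation, the approximate inverse of binding with `key`."""
    n = len(key)
    return [sum(key[j] * merged[(i + j) % n] for j in range(n)) for i in range(n)]


def compress(features, keys):
    """Bind each feature to its key and superpose the results into one vector."""
    merged = [0.0] * len(features[0])
    for feature, key in zip(features, keys):
        bound = circular_convolve(key, feature)
        merged = [m + b for m, b in zip(merged, bound)]
    return merged


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
dim = 256
features = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
# Keys with variance 1/dim have roughly unit norm and are nearly orthogonal.
keys = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)] for _ in range(2)]
merged = compress(features, keys)               # two features sent as one vector
decoded = circular_correlate(keys[0], merged)   # noisy estimate of features[0]
```

The decoded vector is only a noisy estimate of the original feature: the cross-talk from the other bound features behaves like noise because random high-dimensional keys are quasi-orthogonal. The decoded estimate should correlate far more strongly with the feature it was bound from than with the others.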

Learnable Mixed-precision and Dimension Reduction Co-design for Low-storage Activation

Jul 19, 2022

Yu-Shan Tai, Cheng-Yang Chang, Chieh-Fang Teng, An-Yeu Wu

Abstract:Recently, deep convolutional neural networks (CNNs) have achieved many eye-catching results. However, deploying CNNs on resource-constrained edge devices is limited by the memory bandwidth needed to transmit large intermediate data, i.e., activations, during inference. Existing research utilizes mixed-precision and dimension reduction to reduce computational complexity but pays less attention to their application to activation compression. To further exploit the redundancy in activations, we propose a learnable mixed-precision and dimension reduction co-design system, which separates channels into groups and allocates specific compression policies according to their importance. In addition, the proposed dynamic searching technique enlarges the search space and finds the optimal bit-width allocation automatically. Our experimental results show that the proposed methods improve accuracy by 3.54%/1.27% and save 0.18/2.02 bits per value over existing mixed-precision methods on ResNet18 and MobileNetv2, respectively.

MAUS: A Dataset for Mental Workload Assessment on N-back Task Using Wearable Sensor

Nov 03, 2021

Win-Ken Beh, Yi-Hsuan Wu, An-Yeu Wu

Abstract:This paper describes an open-access database for studying mental workload (MW) assessment systems on wearable devices. A wristband photoplethysmogram (PPG) is provided as the representative wearable signal. In addition, recordings from a clinical device that captures electrocardiography (ECG), galvanic skin response (GSR), and fingertip PPG are included in the database as a reference. MW was induced by having 22 subjects perform the N-back task. The participants were asked to answer the Pittsburgh Sleep Quality Index (PSQI) questionnaire at the beginning of the experiment and the NASA Task Load Index (NASA-TLX) questionnaire after each N-back task. The results of the data analysis show the potential uses of the recorded modalities and the feasibility of the MW elicitation protocol. The MAUS dataset is now available for academic use (at IEEE Dataport: https://ieee-dataport.org/open-access/maus-dataset-mental-workload-assessment-n-back-task-using-wearable-sensor). We also present a reproducible baseline system as a preliminary benchmark (the code of the baseline system is available on GitHub: https://github.com/rickwu11/MAUS_dataset_baseline_system), whose testing accuracies are 71.6%, 66.7%, and 59.9% for ECG, fingertip PPG, and wristband PPG, respectively.
