Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zebin Yang

LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design

Feb 21, 2025

Renjie Wei, Songqiang Xu, Linfeng Zhong, Zebin Yang, Qingyu Guo, Yuan Wang, Runsheng Wang, Meng Li

Abstract:State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computation complexity with the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator get drastically improved. We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65x to 6.06x higher energy efficiency over the GPU baseline. When evaluated on Alveo U280 FPGA, LightMamba reaches 93 tokens/s, which is 1.43x that of the GPU baseline.

* Accepted by DATE 2025

Via

Access Paper or Ask Questions

Inherently Interpretable Tree Ensemble Learning

Oct 24, 2024

Zebin Yang, Agus Sudjianto, Xiaoming Li, Aijun Zhang

Figure 1 for Inherently Interpretable Tree Ensemble Learning

Figure 2 for Inherently Interpretable Tree Ensemble Learning

Figure 3 for Inherently Interpretable Tree Ensemble Learning

Figure 4 for Inherently Interpretable Tree Ensemble Learning

Abstract:Tree ensemble models like random forests and gradient boosting machines are widely used in machine learning due to their excellent predictive performance. However, a high-performance ensemble consisting of a large number of decision trees lacks sufficient transparency and explainability. In this paper, we demonstrate that when shallow decision trees are used as base learners, the ensemble learning algorithms can not only become inherently interpretable subject to an equivalent representation as the generalized additive models but also sometimes lead to better generalization performance. First, an interpretation algorithm is developed that converts the tree ensemble into the functional ANOVA representation with inherent interpretability. Second, two strategies are proposed to further enhance the model interpretability, i.e., by adding constraints in the model training stage and post-hoc effect pruning. Experiments on simulations and real-world datasets show that our proposed methods offer a better trade-off between model interpretation and predictive performance, compared with its counterpart benchmarks.

Via

Access Paper or Ask Questions

MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

Oct 23, 2024

Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li

Figure 1 for MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

Figure 2 for MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

Figure 3 for MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

Figure 4 for MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers

Abstract:In this paper, we propose MCUBERT to enable language models like BERT on tiny microcontroller units (MCUs) through network and scheduling co-optimization. We observe the embedding table contributes to the major storage bottleneck for tiny BERT models. Hence, at the network level, we propose an MCU-aware two-stage neural architecture search algorithm based on clustered low-rank approximation for embedding compression. To reduce the inference memory requirements, we further propose a novel fine-grained MCU-friendly scheduling strategy. Through careful computation tiling and re-ordering as well as kernel design, we drastically increase the input sequence lengths supported on MCUs without any latency or accuracy penalty. MCUBERT reduces the parameter size of BERT-tiny and BERT-mini by 5.7$\times$ and 3.0$\times$ and the execution memory by 3.5$\times$ and 4.3$\times$, respectively. MCUBERT also achieves 1.5$\times$ latency reduction. For the first time, MCUBERT enables lightweight BERT models on commodity MCUs and processing more than 512 tokens with less than 256KB of memory.

* ICCAD 2024

Via

Access Paper or Ask Questions

FastQuery: Communication-efficient Embedding Table Query for Private LLM Inference

May 25, 2024

Chenqi Lin, Tianshi Xu, Zebin Yang, Runsheng Wang, Ru Huang, Meng Li

Abstract:With the fast evolution of large language models (LLMs), privacy concerns with user queries arise as they may contain sensitive information. Private inference based on homomorphic encryption (HE) has been proposed to protect user query privacy. However, a private embedding table query has to be formulated as a HE-based matrix-vector multiplication problem and suffers from enormous computation and communication overhead. We observe the overhead mainly comes from the neglect of 1) the one-hot nature of user queries and 2) the robustness of the embedding table to low bit-width quantization noise. Hence, in this paper, we propose a private embedding table query optimization framework, dubbed FastQuery. FastQuery features a communication-aware embedding table quantization algorithm and a one-hot-aware dense packing algorithm to simultaneously reduce both the computation and communication costs. Compared to prior-art HE-based frameworks, e.g., Cheetah, Iron, and Bumblebee, FastQuery achieves more than $4.3\times$, $2.7\times$, $1.3\times$ latency reduction, respectively and more than $75.7\times$, $60.2\times$, $20.2\times$ communication reduction, respectively, on both LLAMA-7B and LLAMA-30B.

* 6 pages, DAC2024

Via

Access Paper or Ask Questions

ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding

Feb 21, 2024

Shuzhang Zhong, Zebin Yang, Meng Li, Ruihao Gong, Runsheng Wang, Ru Huang

Figure 1 for ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding

Figure 2 for ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding

Figure 3 for ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding

Figure 4 for ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding

Abstract:Recent advancements in generative large language models (LLMs) have significantly boosted the performance in natural language processing tasks. However, their efficiency is hampered by the inherent limitations in autoregressive token generation. While parallel decoding with token tree verification, e.g., Medusa, has been proposed to improve decoding parallelism and efficiency, it often struggles with maintaining contextual relationships due to its independent token prediction approach and incurs significant verification overhead, especially with large tree sizes and batch processing. In this paper, we propose ProPD, an efficient LLM parallel decoding framework based on dynamic token tree pruning and generation. ProPD features an advanced early pruning mechanism to efficiently eliminate unpromising token sequences to improve verification efficiency. Additionally, it introduces a dynamic token tree generation algorithm to balance the computation and parallelism of the verification phase in real-time and maximize the overall efficiency across different batch sizes, sequence lengths, and tasks, etc. We verify ProPD across a diverse set of datasets, LLMs, and batch sizes and demonstrate ProPD consistently outperforms existing decoding algorithms by 1.1-3.2x.

Via

Access Paper or Ask Questions

AttentionLego: An Open-Source Building Block For Spatially-Scalable Large Language Model Accelerator With Processing-In-Memory Technology

Jan 21, 2024

Rongqing Cong, Wenyang He, Mingxuan Li, Bangning Luo, Zebin Yang, Yuchao Yang, Ru Huang, Bonan Yan

Abstract:Large language models (LLMs) with Transformer architectures have become phenomenal in natural language processing, multimodal generative artificial intelligence, and agent-oriented artificial intelligence. The self-attention module is the most dominating sub-structure inside Transformer-based LLMs. Computation using general-purpose graphics processing units (GPUs) inflicts reckless demand for I/O bandwidth for transferring intermediate calculation results between memories and processing units. To tackle this challenge, this work develops a fully customized vanilla self-attention accelerator, AttentionLego, as the basic building block for constructing spatially expandable LLM processors. AttentionLego provides basic implementation with fully-customized digital logic incorporating Processing-In-Memory (PIM) technology. It is based on PIM-based matrix-vector multiplication and look-up table-based Softmax design. The open-source code is available online: https://bonany.cc/attentionleg.

* for associated source codes, see https://bonany.cc/attentionleg

Via

Access Paper or Ask Questions

PiML Toolbox for Interpretable Machine Learning Model Development and Validation

May 07, 2023

Agus Sudjianto, Aijun Zhang, Zebin Yang, Yu Su, Ningzhou Zeng

Abstract:PiML (read $\pi$-ML, /`pai.`em.`el/) is an integrated and open-access Python toolbox for interpretable machine learning model development and model diagnostics. It is designed with machine learning workflows in both low-code and high-code modes, including data pipeline, model training, model interpretation and explanation, and model diagnostics and comparison. The toolbox supports a growing list of interpretable models (e.g. GAM, GAMI-Net, XGB2) with inherent local and/or global interpretability. It also supports model-agnostic explainability tools (e.g. PFI, PDP, LIME, SHAP) and a powerful suite of model-agnostic diagnostics (e.g. weakness, uncertainty, robustness, fairness). Integration of PiML models and tests to existing MLOps platforms for quality assurance are enabled by flexible high-code APIs. Furthermore, PiML toolbox comes with a comprehensive user guide and hands-on examples, including the applications for model development and validation in banking. The project is available at https://github.com/SelfExplainML/PiML-Toolbox.

Via

Access Paper or Ask Questions

Explainable Recommendation Systems by Generalized Additive Models with Manifest and Latent Interactions

Dec 15, 2020

Yifeng Guo, Yu Su, Zebin Yang, Aijun Zhang

Figure 1 for Explainable Recommendation Systems by Generalized Additive Models with Manifest and Latent Interactions

Figure 2 for Explainable Recommendation Systems by Generalized Additive Models with Manifest and Latent Interactions

Figure 3 for Explainable Recommendation Systems by Generalized Additive Models with Manifest and Latent Interactions

Figure 4 for Explainable Recommendation Systems by Generalized Additive Models with Manifest and Latent Interactions

Abstract:In recent years, the field of recommendation systems has attracted increasing attention to developing predictive models that provide explanations of why an item is recommended to a user. The explanations can be either obtained by post-hoc diagnostics after fitting a relatively complex model or embedded into an intrinsically interpretable model. In this paper, we propose the explainable recommendation systems based on a generalized additive model with manifest and latent interactions (GAMMLI). This model architecture is intrinsically interpretable, as it additively consists of the user and item main effects, the manifest user-item interactions based on observed features, and the latent interaction effects from residuals. Unlike conventional collaborative filtering methods, the group effect of users and items are considered in GAMMLI. It is beneficial for enhancing the model interpretability, and can also facilitate the cold-start recommendation problem. A new Python package GAMMLI is developed for efficient model training and visualized interpretation of the results. By numerical experiments based on simulation data and real-world cases, the proposed method is shown to have advantages in both predictive performance and explainable recommendation.

Via

Access Paper or Ask Questions

Unwrapping The Black Box of Deep ReLU Networks: Interpretability, Diagnostics, and Simplification

Nov 08, 2020

Agus Sudjianto, William Knauth, Rahul Singh, Zebin Yang, Aijun Zhang

Figure 1 for Unwrapping The Black Box of Deep ReLU Networks: Interpretability, Diagnostics, and Simplification

Figure 2 for Unwrapping The Black Box of Deep ReLU Networks: Interpretability, Diagnostics, and Simplification

Figure 3 for Unwrapping The Black Box of Deep ReLU Networks: Interpretability, Diagnostics, and Simplification

Figure 4 for Unwrapping The Black Box of Deep ReLU Networks: Interpretability, Diagnostics, and Simplification

Abstract:The deep neural networks (DNNs) have achieved great success in learning complex patterns with strong predictive power, but they are often thought of as "black box" models without a sufficient level of transparency and interpretability. It is important to demystify the DNNs with rigorous mathematics and practical tools, especially when they are used for mission-critical applications. This paper aims to unwrap the black box of deep ReLU networks through local linear representation, which utilizes the activation pattern and disentangles the complex network into an equivalent set of local linear models (LLMs). We develop a convenient LLM-based toolkit for interpretability, diagnostics, and simplification of a pre-trained deep ReLU network. We propose the local linear profile plot and other visualization methods for interpretation and diagnostics, and an effective merging strategy for network simplification. The proposed methods are demonstrated by simulation examples, benchmark datasets, and a real case study in home lending credit risk assessment.

Via

Access Paper or Ask Questions

Hyperparameter Optimization via Sequential Uniform Designs

Sep 08, 2020

Zebin Yang, Aijun Zhang

Figure 1 for Hyperparameter Optimization via Sequential Uniform Designs

Figure 2 for Hyperparameter Optimization via Sequential Uniform Designs

Figure 3 for Hyperparameter Optimization via Sequential Uniform Designs

Figure 4 for Hyperparameter Optimization via Sequential Uniform Designs

Abstract:Hyperparameter tuning or optimization plays a central role in the automated machine learning (AutoML) pipeline. It is a challenging task as the response surfaces of hyperparameters are generally unknown, and the evaluation of each experiment is expensive. In this paper, we reformulate hyperparameter optimization as a kind of computer experiment and propose a novel sequential uniform design (SeqUD) for hyperparameter optimization. It is advantageous as a) it adaptively explores the hyperparameter space with evenly spread design points, which is free of the expensive meta-modeling and acquisition optimization procedures in Bayesian optimization; b) sequential design points are generated in batch, which can be easily parallelized; and c) a real-time augmented uniform design (AugUD) algorithm is developed for the efficient generation of new design points. Experiments are conducted on both global optimization tasks and hyperparameter optimization applications. The results show that SeqUD outperforms related hyperparameter optimization methods, which is demonstrated to be a promising and competitive alternative of existing tools.

Via

Access Paper or Ask Questions