Yue Guan

An Efficient Private GPT Never Autoregressively Decodes

May 21, 2025

Zhengyi Li, Yue Guan, Kang Yang, Yu Feng, Ning Liu, Yu Yu, Jingwen Leng, Minyi Guo

Abstract: The wide deployment of the generative pre-trained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one token and securely decoding multiple tokens take similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve in two ways: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a $2.1\times \sim 6.0\times$ speedup compared to standard decoding across three pairs of public-private models and different network conditions.

* Accepted by ICML 2025
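
The propose-then-verify loop described in this abstract can be illustrated with a plaintext toy. This is only a sketch under strong assumptions: it uses greedy token matching rather than the paper's private sampling protocol, it involves no cryptographic primitives, and `verify_draft`, `target_next`, and `toy_target` are hypothetical names introduced here for illustration.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens to append after checking a draft against a target model.

    Draft tokens are accepted while they match the target's greedy choice.
    At the first mismatch the target's token is emitted instead and checking
    stops. If every draft token matches, one extra target token is appended.
    """
    context = list(prefix)
    accepted: List[int] = []
    # Assumption: the target is queried position by position for clarity.
    # A real verifier would score all draft positions in one batched pass.
    for token in draft:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # replace the rejected token and stop
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # whole draft accepted: bonus token
    return accepted


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for the private model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # The third draft token disagrees with the toy target and is replaced.
    print(verify_draft([0], [1, 2, 5], toy_target))
```

In this sketch each verification call yields at least one token, and a fully accepted draft of k tokens yields k + 1, which is one way to see why the acceptance ratio matters for the overall speedup.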

Transkimmer: Transformer Learns to Layer-wise Skim

May 15, 2022

Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo

Abstract: The Transformer architecture has become the de facto model for many machine learning tasks in natural language processing and computer vision. As such, improving its computational efficiency becomes paramount. One of the major computational inefficiencies of Transformer-based models is that they spend the same amount of computation throughout all layers. Prior works have proposed to augment the Transformer model with the capability of skimming tokens to improve its computational efficiency. However, they lack effective end-to-end optimization of the discrete skimming predictor. To address the above limitations, we propose the Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer. The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers. The key idea in Transkimmer is to add a parameterized predictor before each layer that learns to make the skimming decision. We also propose to adopt the reparameterization trick and add a skim loss for the end-to-end training of Transkimmer. Transkimmer achieves a 10.97x average speedup on the GLUE benchmark compared with the vanilla BERT-base baseline, with less than 1% accuracy degradation.

* Published as a conference paper at ACL 2022

Block-Skim: Efficient Question Answering for Transformer

Dec 16, 2021

Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo, Yuhao Zhu

Abstract: Transformer models have achieved promising results on natural language processing (NLP) tasks including extractive question answering (QA). Common Transformer encoders used in NLP tasks process the hidden states of all input tokens in the context paragraph throughout all layers. However, unlike other tasks such as sequence classification, answering the raised question does not necessarily need all the tokens in the context paragraph. Following this motivation, we propose Block-Skim, which learns to skim unnecessary context in higher hidden layers to improve and accelerate the Transformer performance. The key idea of Block-Skim is to identify the context that must be processed further and the context that can be safely discarded early on during inference. Critically, we find that such information can be sufficiently derived from the self-attention weights inside the Transformer model. We further prune the hidden states corresponding to the unnecessary positions early in lower layers, achieving significant inference-time speedup. To our surprise, we observe that models pruned in this way outperform their full-size counterparts. Block-Skim improves QA models' accuracy on different datasets and achieves a 3x speedup on the BERT-base model.

* Published as a conference paper at AAAI 2022

Joint Optical Neuroimaging Denoising with Semantic Tasks

Sep 22, 2021

Tianfang Zhu, Yue Guan, Anan Li

Abstract: Optical neuroimaging is a vital tool for understanding brain structure and the connections between regions and nuclei. However, the image noise introduced during sample preparation and by the imaging system hinders the extraction of knowledge from the dataset, so denoising is usually necessary for optical neuroimaging. Supervised denoising methods often outperform unsupervised ones, but training supervised denoising models requires corresponding clean labels, which are not always available due to the high labeling cost. On the other hand, semantic labels, such as located soma positions, reconstructed neuronal fibers, and nuclei segmentation results, are generally available and accumulate from everyday neuroscience research. This work connects a supervised denoising model and a semantic segmentation model to form an end-to-end model, which can make use of the semantic labels while still providing a denoised image as an intermediate product. We use both supervised and self-supervised models for the denoising and introduce a new cost term for the joint denoising and segmentation setup. We test the proposed approach on both synthetic data and real-world data, including an optical neuroimaging dataset and an electron microscope dataset. The results show that joint denoising outperforms the denoising method used alone, and that the joint model benefits segmentation and other downstream tasks as well.

PointManifoldCut: Point-wise Augmentation in the Manifold for Point Clouds

Sep 15, 2021

Tianfang Zhu, Yue Guan, Anan Li

Abstract: Augmentation can benefit point cloud learning due to the limited availability of large-scale public datasets. This paper proposes a mix-up augmentation approach, PointManifoldCut, which replaces the neural network embedded points rather than the Euclidean space coordinates. This approach exploits the fact that points at the higher levels of the neural network are already trained to embed their neighbor relations, so mixing these representations does not disturb the relation between a point and its label. This regularizes the parameter space as other augmentation methods do, but without the need to determine the proper label of the replaced points. The experiments show that our proposed approach provides competitive performance on point cloud classification and segmentation when it is combined with cutting-edge vanilla point cloud networks. The results show a consistent performance boost over other state-of-the-art point cloud augmentation methods, such as PointMixup and PointCutMix. The code of this paper is available at: https://github.com/fun0515/PointManifoldCut.

How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT's Attention

Nov 03, 2020

Yue Guan, Jingwen Leng, Chao Li, Quan Chen, Minyi Guo

Abstract: Recent research on the multi-head attention mechanism, especially in pre-trained models such as BERT, has provided heuristics and clues for analyzing various aspects of the mechanism. As most of this research focuses on probing tasks or hidden states, previous works have found some primitive patterns of attention head behavior by heuristic analytical methods, but systematic analysis specific to the attention patterns remains primitive. In this work, we cluster the attention heatmaps into significantly different patterns through unsupervised clustering on top of a set of proposed features, which corroborates previous observations. We further study their corresponding functions through analytical study. In addition, our proposed features can be used to explain and calibrate different attention heads in Transformer models.

Learning Nash Equilibria in Zero-Sum Stochastic Games via Entropy-Regularized Policy Approximation

Sep 01, 2020

Qifan Zhang, Yue Guan, Panagiotis Tsiotras

Abstract: We explore the use of policy approximation for reducing the computational cost of learning Nash equilibria in multi-agent reinforcement learning scenarios. We propose a new algorithm for zero-sum stochastic games in which each agent simultaneously learns a Nash policy and an entropy-regularized policy. The two policies help each other towards convergence: the former guides the latter to the desired Nash equilibrium, while the latter serves as an efficient approximation of the former. We demonstrate the possibility of using the proposed algorithm to transfer previous training experiences to different environments, enabling the agents to adapt quickly to new tasks. We also provide a dynamic hyper-parameter scheduling scheme for further expedited convergence. Empirical results applied to a number of stochastic games show that the proposed algorithm converges to the Nash equilibrium while exhibiting a major speed-up over existing algorithms.
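
As background for the entropy-regularized policy mentioned in this abstract, the sketch below shows the standard single-agent closed form: over a fixed vector of action values, the policy that maximizes expected value plus temperature-scaled entropy is a softmax of the values divided by the temperature. This is not the paper's two-player algorithm; the function name and the example values are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def entropy_regularized_policy(
    q_values: Sequence[float], temperature: float
) -> List[float]:
    """Softmax policy: argmax over pi of sum(pi * q) + temperature * entropy(pi)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    best = max(q_values)
    weights = [math.exp((q - best) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]  # hypothetical action values
    for tau in (1.0, 0.1):
        print(tau, entropy_regularized_policy(q, tau))
```

Lowering the temperature concentrates the policy on the highest-valued action, while raising it pushes the policy toward uniform.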

Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity

Aug 29, 2020

Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, Yuhao Zhu

Abstract: Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain their accuracies, sparse models often carry randomly-distributed weights, leading to irregular computations. Consequently, sparse models cannot achieve meaningful speedup on commodity hardware (e.g., GPU) built for dense matrix computations. As such, prior works usually modify or design completely new sparsity-optimized architectures for exploiting sparsity. We propose an algorithm-software co-designed pruning method that achieves latency speedups on existing dense architectures. Our work builds upon the insight that matrix multiplication generally breaks the large matrix into multiple smaller tiles for parallel execution. We propose a tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain high accuracy. We implement and evaluate the sparsity pattern on GPU tensor cores, achieving a 1.95x speedup over the dense model.

* 12 pages, ACM/IEEE Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC20)
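
To make the "regular within a tile, irregular across tiles" idea from this abstract concrete, here is a small pure-Python sketch. It reflects one plausible reading of the abstract, not the authors' implementation: the weight matrix is split into column tiles, and inside each tile whole rows are kept or zeroed according to their L2 norm, so every tile can keep a different set of rows. The function name, the norm-based criterion, and the parameters are all assumptions.

```python
import math
from typing import List


def tile_wise_prune(
    weights: List[List[float]], tile_width: int, keep_per_tile: int
) -> List[List[float]]:
    """Zero all but the `keep_per_tile` strongest rows inside each column tile."""
    rows, cols = len(weights), len(weights[0])
    pruned = [[0.0] * cols for _ in range(rows)]
    for start in range(0, cols, tile_width):
        end = min(start + tile_width, cols)
        # Score each row segment of this tile by its L2 norm (assumed criterion).
        scores = [
            (math.sqrt(sum(weights[r][c] ** 2 for c in range(start, end))), r)
            for r in range(rows)
        ]
        keep = {r for _, r in sorted(scores, reverse=True)[:keep_per_tile]}
        # Copy back only the kept row segments; the rest stay zero.
        for r in keep:
            for c in range(start, end):
                pruned[r][c] = weights[r][c]
    return pruned


if __name__ == "__main__":
    w = [
        [1.0, 1.0, 0.1, 0.1],
        [0.1, 0.1, 2.0, 2.0],
        [3.0, 3.0, 0.2, 0.2],
    ]
    for row in tile_wise_prune(w, tile_width=2, keep_per_tile=1):
        print(row)
```

Because each tile retains whole row segments, the surviving weights of a tile could be packed into a smaller dense block, while the choice of rows still varies from tile to tile.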

Geometry Normalization Networks for Accurate Scene Text Detection

Sep 02, 2019

Youjiang Xu, Jiaqi Duan, Zhanghui Kuang, Xiaoyu Yue, Hongbin Sun, Yue Guan, Wayne Zhang

Abstract: Large geometry (e.g., orientation) variances are a key challenge in scene text detection. In this work, we first conduct experiments to investigate the capacity of networks for learning geometry variances when detecting scene text, and find that networks can handle only limited text geometry variances. Then, we put forward a novel Geometry Normalization Module (GNM) with multiple branches, each of which is composed of one Scale Normalization Unit and one Orientation Normalization Unit, to normalize each text instance to one desired canonical geometry range through at least one branch. The GNM is general and can be readily plugged into existing convolutional neural network based text detectors to construct end-to-end Geometry Normalization Networks (GNNets). Moreover, we propose a geometry-aware training scheme to effectively train the GNNets by sampling and augmenting text instances from a uniform geometry variance distribution. Finally, experiments on the popular ICDAR 2015 and ICDAR 2017 MLT benchmarks validate that our method outperforms all state-of-the-art approaches remarkably, obtaining one-forward test F-scores of 88.52 and 74.54 respectively.

* ICCV 2019
