Ping Gong

Efficient Long-Context LLM Inference via KV Cache Clustering

Jun 13, 2025

Jie Hu, Shengnan Wang, Yutong He, Ping Gong, Jiawei Yi, Juncheng Zhang, Youhui Bai, Renhai Chen, Gong Zhang, Cheng Li(+1 more)

Abstract: Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce Chelsea, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. Chelsea then merges the KV cache within each cluster into a single centroid. Additionally, we provide a theoretical analysis of the computational complexity and the optimality of the intra-chunk partitioning strategy. Extensive experiments across various models and long-context benchmarks demonstrate that Chelsea achieves up to 80% reduction in KV cache memory usage while maintaining comparable model performance. Moreover, with minimal computational overhead, Chelsea accelerates the decoding stage of inference by up to 3.19× and reduces end-to-end latency by up to 2.72×.
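
The chunk-wise clustering idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The names `compress_kv` and `cluster_chunk`, the even/odd alternating split, the cosine-similarity threshold, and plain averaging for the centroid are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_chunk(keys, values, threshold):
    """Cluster one chunk; returns a list of (key_centroid, value_centroid)."""
    anchors = range(1, len(keys), 2)      # odd positions act as anchors (assumed split)
    clusters = {j: [j] for j in anchors}
    for i in range(0, len(keys), 2):      # even positions look for their best anchor
        best_j, best_sim = None, threshold
        for j in anchors:
            sim = cosine(keys[i], keys[j])
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is None:
            clusters[i] = [i]             # no anchor is similar enough: keep as is
        else:
            clusters[best_j].append(i)    # join the most similar anchor's cluster
    return [
        (mean_vec([keys[m] for m in members]),
         mean_vec([values[m] for m in members]))
        for _, members in sorted(clusters.items())
    ]


def compress_kv(keys, values, chunk_size=4, threshold=0.9):
    """Apply the per-chunk clustering to the whole sequence (hypothetical helper)."""
    merged = []
    for start in range(0, len(keys), chunk_size):
        end = start + chunk_size
        merged.extend(cluster_chunk(keys[start:end], values[start:end], threshold))
    return merged


keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
values = [[1.0], [3.0], [5.0], [7.0]]
compressed = compress_kv(keys, values)
print(len(compressed))  # prints 2: four cache entries collapse into two centroids
```

Matching only within a chunk keeps the similarity search bounded by the chunk size rather than the full sequence length, which is the kind of cost saving the abstract attributes to chunking.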

Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control

Mar 14, 2025

Yifeng Zhang, Yilin Liu, Ping Gong, Peizhuo Li, Mingfeng Fan, Guillaume Sartoretti

Abstract: Adaptive traffic signal control (ATSC) is crucial in reducing congestion, maximizing throughput, and improving mobility in rapidly growing urban areas. Recent advancements in parameter-sharing multi-agent reinforcement learning (MARL) have greatly enhanced the scalable and adaptive optimization of complex, dynamic flows in large-scale homogeneous networks. However, the inherent heterogeneity of real-world traffic networks, with their varied intersection topologies and interaction dynamics, poses substantial challenges to achieving scalable and effective ATSC across different traffic scenarios. To address these challenges, we present Unicorn, a universal and collaborative MARL framework designed for efficient and adaptable network-wide ATSC. Specifically, we first propose a unified approach to map the states and actions of intersections with varying topologies into a common structure based on traffic movements. Next, we design a Universal Traffic Representation (UTR) module with a decoder-only network for general feature extraction, enhancing the model's adaptability to diverse traffic scenarios. Additionally, we incorporate an Intersection Specifics Representation (ISR) module, designed to identify key latent vectors that represent each intersection's unique topology and traffic dynamics through variational inference techniques. To further refine these latent representations, we employ a contrastive learning approach in a self-supervised manner, which enables better differentiation of intersection-specific features. Moreover, we integrate the state-action dependencies of neighboring agents into policy optimization, which effectively captures dynamic agent interactions and facilitates efficient regional collaboration. Our results show that Unicorn outperforms other methods across various evaluation metrics, highlighting its potential in complex, dynamic traffic networks.

SARA: Singular-Value Based Adaptive Low-Rank Adaption

Aug 06, 2024

Jihao Gu, Shuai Chen, Zelin Wang, Yibo Zhang, Ping Gong

Abstract: With the increasing number of parameters in large pre-trained models, LoRA, a parameter-efficient fine-tuning (PEFT) method, is widely used because it adds no inference overhead. The LoRA method assumes that weight changes during fine-tuning can be approximated by low-rank matrices. However, the rank values need to be manually verified to match different downstream tasks, and they cannot accommodate the varying importance of different layers in the model. In this work, we first analyze the relationship between the performance of different layers and their ranks using SVD. Based on this, we design the Singular-Value Based Adaptive Low-Rank Adaption (SARA), which adaptively finds the rank during initialization by performing SVD on the pre-trained weights. Additionally, we explore the Mixture-of-SARA (Mo-SARA), which significantly reduces the number of parameters by fine-tuning only multiple parallel sets of singular values controlled by a router. Extensive experiments on various complex tasks demonstrate the simplicity and parameter efficiency of our methods. They can effectively and adaptively find the most suitable rank for each layer of each model.
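
To make the idea of choosing a per-layer rank from singular values concrete, here is a small sketch. The abstract does not state the selection criterion, so the cumulative-sum threshold used below, the function name `rank_for_energy`, and the example spectra are assumptions for illustration and not necessarily what SARA does. In practice the singular values would come from an SVD of each pre-trained weight matrix (for example via `numpy.linalg.svd`).

```python
def rank_for_energy(singular_values, fraction=0.9):
    """Smallest k whose top-k singular values reach `fraction` of their total sum.

    This criterion is an assumed stand-in, not the rule from the paper.
    """
    ordered = sorted(singular_values, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0
    running = 0.0
    for k, sigma in enumerate(ordered, start=1):
        running += sigma
        if running >= fraction * total:
            return k
    return len(ordered)


# Hypothetical singular-value spectra for two layers.
layer_spectra = {
    "layer_a": [10.0, 5.0, 1.0, 0.5],   # fast decay: a low rank suffices
    "layer_b": [4.0, 4.0, 4.0, 4.0],    # flat spectrum: needs the full rank
}
ranks = {name: rank_for_energy(spectrum) for name, spectrum in layer_spectra.items()}
print(ranks)  # prints {'layer_a': 2, 'layer_b': 4}
```

The two example layers end up with different ranks from the same threshold, which is the behavior a fixed, manually chosen LoRA rank cannot provide.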

Understand Data Preprocessing for Effective End-to-End Training of Deep Neural Networks

Apr 18, 2023

Ping Gong, Yuxin Ma, Cheng Li, Xiaosong Ma, Sam H. Noh

Abstract: In this paper, we primarily focus on understanding the data preprocessing pipeline for DNN training in the public cloud. First, we run experiments to test the performance implications of the two major data preprocessing methods using either raw data or record files. The preliminary results show that data preprocessing is a clear bottleneck, even with the most efficient software and hardware configuration enabled by NVIDIA DALI, a highly optimized data preprocessing library. Second, we identify the potential causes, exercise a variety of optimization methods, and present their pros and cons. We hope this work will shed light on the new co-design of "data storage, loading pipeline" and "training framework" and flexible resource configurations between them so that the resources can be fully exploited and performance can be maximized.

Joint localization and classification of breast tumors on ultrasound images using a novel auxiliary attention-based framework

Oct 11, 2022

Zong Fan, Ping Gong, Shanshan Tang, Christine U. Lee, Xiaohui Zhang, Pengfei Song, Shigao Chen, Hua Li

Abstract: Automatic breast lesion detection and classification is an important task in computer-aided diagnosis, in which breast ultrasound (BUS) imaging is a common and frequently used screening tool. Recently, a number of deep learning-based methods have been proposed for joint localization and classification of breast lesions using BUS images. In these methods, features extracted by a shared network trunk are fed to two independent network branches to achieve classification and localization. Improper information sharing might cause conflicts in feature optimization in the two branches and lead to performance degradation. Also, these methods generally require large amounts of pixel-level annotated data for model training. To overcome these limitations, we propose a novel joint localization and classification model based on the attention mechanism and a disentangled semi-supervised learning strategy. The model used in this study is composed of a classification network and an auxiliary lesion-aware network. By use of the attention mechanism, the auxiliary lesion-aware network can optimize multi-scale intermediate feature maps and extract rich semantic information to improve classification and localization performance. The disentangled semi-supervised learning strategy only requires incomplete training datasets for model training. The proposed modularized framework allows flexible network replacement to be generalized for various applications. Experimental results on two different breast ultrasound image datasets demonstrate the effectiveness of the proposed method. The impacts of various network factors on model performance are also investigated to gain deep insights into the designed framework.

One-Shot Medical Landmark Localization by Edge-Guided Transform and Noisy Landmark Refinement

Jul 31, 2022

Zihao Yin, Ping Gong, Chunyu Wang, Yizhou Yu, Yizhou Wang

Abstract: As an important upstream task for many medical applications, supervised landmark localization still requires non-negligible annotation costs to achieve desirable performance. Besides, due to cumbersome collection procedures, the limited size of medical landmark datasets impacts the effectiveness of large-scale self-supervised pre-training methods. To address these challenges, we propose a two-stage framework for one-shot medical landmark localization, which first infers landmarks by unsupervised registration from the labeled exemplar to unlabeled targets, and then utilizes these noisy pseudo labels to train robust detectors. To handle the significant structure variations, we learn an end-to-end cascade of global alignment and local deformations, under the guidance of novel loss functions which incorporate edge information. In stage II, we explore self-consistency for selecting reliable pseudo labels and cross-consistency for semi-supervised learning. Our method achieves state-of-the-art performance on public datasets of different body parts, which demonstrates its general applicability.

BiFeat: Supercharge GNN Training via Graph Feature Quantization

Jul 29, 2022

Yuxin Ma, Ping Gong, Jun Yi, Zhewei Yao, Minjie Wang, Cheng Li, Yuxiong He, Feng Yan

Abstract: Graph Neural Networks (GNNs) are a promising approach for applications with non-Euclidean data. However, training GNNs on large-scale graphs with hundreds of millions of nodes is both resource and time consuming. Unlike DNNs, GNNs usually have larger memory footprints, and thus the GPU memory capacity and PCIe bandwidth are the main resource bottlenecks in GNN training. To address this problem, we present BiFeat: a graph feature quantization methodology to accelerate GNN training by significantly reducing the memory footprint and PCIe bandwidth requirement so that GNNs can take full advantage of GPU computing capabilities. Our key insight is that, unlike DNNs, GNNs are less prone to the information loss of input features caused by quantization. We identify the main accuracy impact factors in graph feature quantization and theoretically prove that BiFeat training converges to a network where the loss is within $\epsilon$ of the optimal loss of the uncompressed network. We perform an extensive evaluation of BiFeat using several popular GNN models and datasets, including GraphSAGE on MAG240M, the largest public graph dataset. The results demonstrate that BiFeat achieves a compression ratio of more than 30 and improves GNN training speed by 200%-320% with marginal accuracy loss. In particular, BiFeat achieves a record by training GraphSAGE on MAG240M within one hour using only four GPUs.
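
As a rough illustration of what feature quantization means here, the sketch below applies uniform scalar quantization to one node's feature vector. The abstract does not say which quantization scheme BiFeat uses, so the scheme, the function names (`quantize`, `dequantize`, `compression_ratio`), and the sample values are assumptions. An 8-bit code gives only a 4x reduction over 32-bit floats, so this simple scheme would not by itself reach the ratio of more than 30 reported above.

```python
def quantize(features, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] (assumed uniform scheme)."""
    levels = (1 << bits) - 1
    lo, hi = min(features), max(features)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in features]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


def compression_ratio(original_bits, quantized_bits):
    """Storage ratio per feature value, ignoring the small per-vector metadata."""
    return original_bits / quantized_bits


# Hypothetical feature vector for a single node.
features = [0.12, -0.5, 0.33, 0.9, -0.07]
codes, lo, scale = quantize(features, bits=8)
restored = dequantize(codes, lo, scale)

# Worst-case reconstruction error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(features, restored))
print(codes)
print(max_error <= scale / 2 + 1e-12)
print(compression_ratio(32, 8))
```

Only the integer codes plus the per-vector `lo` and `scale` need to be stored or moved to the GPU, which is how quantization reduces both memory footprint and transfer volume.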

Unsupervised Domain Adaptation Network with Category-Centric Prototype Aligner for Biomedical Image Segmentation

Mar 03, 2021

Ping Gong, Wenwen Yu, Qiuwen Sun, Ruohan Zhao, Junfeng Hu

Abstract: With the widespread success of deep learning in biomedical image segmentation, domain shift becomes a critical and challenging problem, as the gap between two domains can severely affect model performance when deployed to unseen data with heterogeneous features. To alleviate this problem, we present a novel unsupervised domain adaptation network for generalizing models learned from the labeled source domain to the unlabeled target domain for cross-modality biomedical image segmentation. Specifically, our approach consists of two key modules, a conditional domain discriminator (CDD) and a category-centric prototype aligner (CCPA). The CDD, extended from conditional domain adversarial networks in classifier tasks, is effective and robust in handling complex cross-modality biomedical images. The CCPA, improved from the graph-induced prototype alignment mechanism in cross-domain object detection, can exploit precise instance-level features through an elaborate prototype representation. In addition, it can address the negative effect of class imbalance via entropy-based loss. Extensive experiments on a public benchmark for the cardiac substructure segmentation task demonstrate that our method significantly improves performance on the target domain.

* Ping Gong and Wenwen Yu contributed equally to this work. 11 pages, 4 figures, 3 tables

PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks

Apr 16, 2020

Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, Rong Xiao

Abstract: Computer vision with state-of-the-art deep learning models has recently achieved huge success in the field of Optical Character Recognition (OCR), including text detection and recognition tasks. However, Key Information Extraction (KIE) from documents, the downstream task of OCR with a large number of real-world use scenarios, remains a challenge because documents have not only textual features extracted from OCR systems but also semantic visual features that are not fully exploited and play a critical role in KIE. Too little work has been devoted to efficiently making full use of both the textual and visual features of documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex document layouts for KIE by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baseline methods by significant margins.

* The first two authors contributed equally to this work. 8 pages, 3 figures, 4 tables

MASTER: Multi-Aspect Non-local Network for Scene Text Recognition

Oct 07, 2019

Ning Lu, Wenwen Yu, Xianbiao Qi, Yihao Chen, Ping Gong, Rong Xiao

Abstract: Attention-based scene text recognizers have gained huge success; they leverage a more compact intermediate representation to learn 1D or 2D attention with an RNN-based encoder-decoder architecture. However, such methods suffer from the attention-drift problem because high similarity among encoded features leads to attention confusion under the RNN-based local attention mechanism. Moreover, RNN-based methods have low efficiency due to poor parallelization. To overcome these problems, we propose MASTER, a self-attention-based scene text recognizer that (1) not only encodes the input-output attention but also learns self-attention, which encodes feature-feature and target-target relationships inside the encoder and decoder, (2) learns an intermediate representation that is more powerful and more robust to spatial distortion, and (3) offers better training and evaluation efficiency. Extensive experiments on various benchmarks demonstrate the superior performance of our MASTER on both regular and irregular scene text.

* Ning Lu and Wenwen Yu are co-first authors. 11 pages, 6 figures, 5 tables
