Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gong Zhang

Efficient Long-Context LLM Inference via KV Cache Clustering

Jun 13, 2025

Jie Hu, Shengnan Wang, Yutong He, Ping Gong, Jiawei Yi, Juncheng Zhang, Youhui Bai, Renhai Chen, Gong Zhang, Cheng Li(+1 more)

Abstract:Large language models (LLMs) with extended context windows have become increasingly prevalent for tackling complex tasks. However, the substantial Key-Value (KV) cache required for long-context LLMs poses significant deployment challenges. Existing approaches either discard potentially critical information needed for future generations or offer limited efficiency gains due to high computational overhead. In this paper, we introduce Chelsea, a simple yet effective framework for online KV cache clustering. Our approach is based on the observation that key states exhibit high similarity along the sequence dimension. To enable efficient clustering, we divide the sequence into chunks and propose Chunked Soft Matching, which employs an alternating partition strategy within each chunk and identifies clusters based on similarity. Chelsea then merges the KV cache within each cluster into a single centroid. Additionally, we provide a theoretical analysis of the computational complexity and the optimality of the intra-chunk partitioning strategy. Extensive experiments across various models and long-context benchmarks demonstrate that Chelsea achieves up to 80% reduction in KV cache memory usage while maintaining comparable model performance. Moreover, with minimal computational overhead, Chelsea accelerates the decoding stage of inference by up to 3.19$\times$ and reduces end-to-end latency by up to 2.72$\times$.

Via

Access Paper or Ask Questions

AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

May 29, 2025

Jiaxin Bai, Wei Fan, Qi Hu, Qing Zong, Chunyang Li, Hong Ting Tsang, Hongyu Luo, Yauwai Yim, Haoyu Huang, Xiao Zhou(+10 more)

Abstract:We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95\% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.

* 9 pages, preprint, code: https://github.com/HKUST-KnowComp/AutoSchemaKG

Via

Access Paper or Ask Questions

XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

May 28, 2024

Shengnan Wang, Youhui Bai, Lin Zhang, Pingyi Zhou, Shixiong Zhao, Gong Zhang, Sen Wang, Renhai Chen, Hua Xu, Hongwei Sun

Figure 1 for XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

Figure 2 for XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

Figure 3 for XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

Figure 4 for XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference

Abstract:Length generalization failure problem, namely the large language model (LLM) fails to generalize to texts longer than its maximum training length, greatly restricts the application of LLM in the scenarios with streaming long inputs. To address this problem, the existing methods either require substantial costs or introduce precision loss. In this paper, we empirically find that the accuracy of the LLM's prediction is highly correlated to its certainty. Based on this, we propose an efficient training free framework, named XL3M (it means extra-long large language model), which enables the LLMs trained on short sequences to reason extremely long sequence without any further training or fine-tuning. Under the XL3M framework, the input context will be firstly decomposed into multiple short sub-contexts, where each sub-context contains an independent segment and a common ``question'' which is a few tokens from the end of the original context. Then XL3M gives a method to measure the relevance between each segment and the ``question'', and constructs a concise key context by splicing all the relevant segments in chronological order. The key context is further used instead of the original context to complete the inference task. Evaluations on comprehensive benchmarks show the superiority of XL3M. Using our framework, a Llama2-7B model is able to reason 20M long sequences on an 8-card Huawei Ascend 910B NPU machine with 64GB memory per card.

* 11 pages, 5 figures

Via

Access Paper or Ask Questions

Super-Resolution Harmonic Retrieval of Non-Circular Signals

Jan 17, 2023

Yu Zhang, Yue Wang, Zhi Tian, Geert Leus, Gong Zhang

Figure 1 for Super-Resolution Harmonic Retrieval of Non-Circular Signals

Figure 2 for Super-Resolution Harmonic Retrieval of Non-Circular Signals

Figure 3 for Super-Resolution Harmonic Retrieval of Non-Circular Signals

Abstract:This paper proposes a super-resolution harmonic retrieval method for uncorrelated strictly non-circular signals, whose covariance and pseudo-covariance present Toeplitz and Hankel structures, respectively. Accordingly, the augmented covariance matrix constructed by the covariance and pseudo-covariance matrices is not only low rank but also jointly Toeplitz-Hankel structured. To efficiently exploit such a desired structure for high estimation accuracy, we develop a low-rank Toeplitz-Hankel covariance reconstruction (LRTHCR) solution employed over the augmented covariance matrix. Further, we design a fitting error constraint to flexibly implement the LRTHCR algorithm without knowing the noise statistics. In addition, performance analysis is provided for the proposed LRTHCR in practical settings. Simulation results reveal that the LRTHCR outperforms the benchmark methods in terms of lower estimation errors.

Via

Access Paper or Ask Questions

Trading off Quality for Efficiency of Community Detection: An Inductive Method across Graphs

Sep 29, 2022

Meng Qin, Chaorui Zhang, Bo Bai, Gong Zhang, Dit-Yan Yeung

Figure 1 for Trading off Quality for Efficiency of Community Detection: An Inductive Method across Graphs

Figure 2 for Trading off Quality for Efficiency of Community Detection: An Inductive Method across Graphs

Figure 3 for Trading off Quality for Efficiency of Community Detection: An Inductive Method across Graphs

Figure 4 for Trading off Quality for Efficiency of Community Detection: An Inductive Method across Graphs

Abstract:Many network applications can be formulated as NP-hard combinatorial optimization problems of community detection (CD). Due to the NP-hardness, to balance the CD quality and efficiency remains a challenge. Most existing CD methods are transductive, which are independently optimized only for the CD on a single graph. Some of these methods use advanced machine learning techniques to obtain high-quality CD results but usually have high complexity. Other approaches use fast heuristic approximation to ensure low runtime but may suffer from quality degradation. In contrast to these transductive methods, we propose an alternative inductive community detection (ICD) method across graphs of a system or scenario to alleviate the NP-hard challenge. ICD first conducts the offline training of an adversarial dual GNN on historical graphs to capture key properties of the system. The trained model is then directly generalized to new unseen graphs for online CD without additional optimization, where a better trade-off between quality and efficiency can be achieved. ICD can also capture the permutation invariant community labels in the offline training and tackle the online CD on new graphs with non-fixed number of nodes and communities. Experiments on a set of benchmarks demonstrate that ICD can achieve a significant trade-off between quality and efficiency over various baselines.

Via

Access Paper or Ask Questions

Grasping the Arrow of Time from the Singularity: Decoding Micromotion in Low-dimensional Latent Spaces from StyleGAN

Apr 27, 2022

Qiucheng Wu, Yifan Jiang, Junru Wu, Kai Wang, Gong Zhang, Humphrey Shi, Zhangyang Wang, Shiyu Chang

Figure 1 for Grasping the Arrow of Time from the Singularity: Decoding Micromotion in Low-dimensional Latent Spaces from StyleGAN

Figure 2 for Grasping the Arrow of Time from the Singularity: Decoding Micromotion in Low-dimensional Latent Spaces from StyleGAN

Figure 3 for Grasping the Arrow of Time from the Singularity: Decoding Micromotion in Low-dimensional Latent Spaces from StyleGAN

Figure 4 for Grasping the Arrow of Time from the Singularity: Decoding Micromotion in Low-dimensional Latent Spaces from StyleGAN

Abstract:The disentanglement of StyleGAN latent space has paved the way for realistic and controllable image editing, but does StyleGAN know anything about temporal motion, as it was only trained on static images? To study the motion features in the latent space of StyleGAN, in this paper, we hypothesize and demonstrate that a series of meaningful, natural, and versatile small, local movements (referred to as "micromotion", such as expression, head movement, and aging effect) can be represented in low-rank spaces extracted from the latent space of a conventionally pre-trained StyleGAN-v2 model for face generation, with the guidance of proper "anchors" in the form of either short text or video clips. Starting from one target face image, with the editing direction decoded from the low-rank space, its micromotion features can be represented as simple as an affine transformation over its latent feature. Perhaps more surprisingly, such micromotion subspace, even learned from just single target face, can be painlessly transferred to other unseen face images, even those from vastly different domains (such as oil painting, cartoon, and sculpture faces). It demonstrates that the local feature geometry corresponding to one type of micromotion is aligned across different face subjects, and hence that StyleGAN-v2 is indeed "secretly" aware of the subject-disentangled feature variations caused by that micromotion. We present various successful examples of applying our low-dimensional micromotion subspace technique to directly and effortlessly manipulate faces, showing high robustness, low computational overhead, and impressive domain transferability. Our codes are available at https://github.com/wuqiuche/micromotion-StyleGAN.

* 13 pages, 8 figures

Via

Access Paper or Ask Questions

More Control for Free! Image Synthesis with Semantic Diffusion Guidance

Dec 14, 2021

Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, Trevor Darrell

Figure 1 for More Control for Free! Image Synthesis with Semantic Diffusion Guidance

Figure 2 for More Control for Free! Image Synthesis with Semantic Diffusion Guidance

Figure 3 for More Control for Free! Image Synthesis with Semantic Diffusion Guidance

Figure 4 for More Control for Free! Image Synthesis with Semantic Diffusion Guidance

Abstract:Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We explore fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both. Guidance is injected into a pretrained unconditional diffusion model using the gradient of image-text or image matching scores. We explore CLIP-based textual guidance as well as both content and style-based image guidance in a unified form. Our text-guided synthesis approach can be applied to datasets without associated text annotations. We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis, synthesis of images related to a style or content example image, and examples with both textual and image guidance.

* Project page https://xh-liu.github.io/sdg/

Via

Access Paper or Ask Questions

Comprehensive and Clinically Accurate Head and Neck Organs at Risk Delineation via Stratified Deep Learning: A Large-scale Multi-Institutional Study

Nov 01, 2021

Dazhou Guo, Jia Ge, Xianghua Ye, Senxiang Yan, Yi Xin, Yuchen Song, Bing-shen Huang, Tsung-Min Hung, Zhuotun Zhu, Ling Peng(+15 more)

Figure 1 for Comprehensive and Clinically Accurate Head and Neck Organs at Risk Delineation via Stratified Deep Learning: A Large-scale Multi-Institutional Study

Figure 2 for Comprehensive and Clinically Accurate Head and Neck Organs at Risk Delineation via Stratified Deep Learning: A Large-scale Multi-Institutional Study

Figure 3 for Comprehensive and Clinically Accurate Head and Neck Organs at Risk Delineation via Stratified Deep Learning: A Large-scale Multi-Institutional Study

Figure 4 for Comprehensive and Clinically Accurate Head and Neck Organs at Risk Delineation via Stratified Deep Learning: A Large-scale Multi-Institutional Study

Abstract:Accurate organ at risk (OAR) segmentation is critical to reduce the radiotherapy post-treatment complications. Consensus guidelines recommend a set of more than 40 OARs in the head and neck (H&N) region, however, due to the predictable prohibitive labor-cost of this task, most institutions choose a substantially simplified protocol by delineating a smaller subset of OARs and neglecting the dose distributions associated with other OARs. In this work we propose a novel, automated and highly effective stratified OAR segmentation (SOARS) system using deep learning to precisely delineate a comprehensive set of 42 H&N OARs. SOARS stratifies 42 OARs into anchor, mid-level, and small & hard subcategories, with specifically derived neural network architectures for each category by neural architecture search (NAS) principles. We built SOARS models using 176 training patients in an internal institution and independently evaluated on 1327 external patients across six different institutions. It consistently outperformed other state-of-the-art methods by at least 3-5% in Dice score for each institutional evaluation (up to 36% relative error reduction in other metrics). More importantly, extensive multi-user studies evidently demonstrated that 98% of the SOARS predictions need only very minor or no revisions for direct clinical acceptance (saving 90% radiation oncologists workload), and their segmentation and dosimetric accuracy are within or smaller than the inter-user variation. These findings confirmed the strong clinical applicability of SOARS for the OAR delineation process in H&N cancer radiotherapy workflows, with improved efficiency, comprehensiveness, and quality.

Via

Access Paper or Ask Questions

Cross-lingual Entity Alignment with Adversarial Kernel Embedding and Adversarial Knowledge Translation

Apr 16, 2021

Gong Zhang, Yang Zhou, Sixing Wu, Zeru Zhang, Dejing Dou

Figure 1 for Cross-lingual Entity Alignment with Adversarial Kernel Embedding and Adversarial Knowledge Translation

Figure 2 for Cross-lingual Entity Alignment with Adversarial Kernel Embedding and Adversarial Knowledge Translation

Figure 3 for Cross-lingual Entity Alignment with Adversarial Kernel Embedding and Adversarial Knowledge Translation

Figure 4 for Cross-lingual Entity Alignment with Adversarial Kernel Embedding and Adversarial Knowledge Translation

Abstract:Cross-lingual entity alignment, which aims to precisely connect the same entities in different monolingual knowledge bases (KBs) together, often suffers challenges from feature inconsistency to sequence context unawareness. This paper presents a dual adversarial learning framework for cross-lingual entity alignment, DAEA, with two original contributions. First, in order to address the structural and attribute feature inconsistency between entities in two knowledge graphs (KGs), an adversarial kernel embedding technique is proposed to extract graph-invariant information in an unsupervised manner, and project two KGs into the common embedding space. Second, in order to further improve successful rate of entity alignment, we propose to produce multiple random walks through each entity to be aligned and mask these entities in random walks. With the guidance of known aligned entities in the context of multiple random walks, an adversarial knowledge translation model is developed to fill and translate masked entities in pairwise random walks from two KGs. Extensive experiments performed on real-world datasets show that DAEA can well solve the feature inconsistency and sequence context unawareness issues and significantly outperforms thirteen state-of-the-art entity alignment methods.

Via

Access Paper or Ask Questions

Query-aware Tip Generation for Vertical Search

Oct 19, 2020

Yang Yang, Junmei Hao, Canjia Li, Zili Wang, Jingang Wang, Fuzheng Zhang, Rao Fu, Peixu Hou, Gong Zhang, Zhongyuan Wang

Figure 1 for Query-aware Tip Generation for Vertical Search

Figure 2 for Query-aware Tip Generation for Vertical Search

Figure 3 for Query-aware Tip Generation for Vertical Search

Figure 4 for Query-aware Tip Generation for Vertical Search

Abstract:As a concise form of user reviews, tips have unique advantages to explain the search results, assist users' decision making, and further improve user experience in vertical search scenarios. Existing work on tip generation does not take query into consideration, which limits the impact of tips in search scenarios. To address this issue, this paper proposes a query-aware tip generation framework, integrating query information into encoding and subsequent decoding processes. Two specific adaptations of Transformer and Recurrent Neural Network (RNN) are proposed. For Transformer, the query impact is incorporated into the self-attention computation of both the encoder and the decoder. As for RNN, the query-aware encoder adopts a selective network to distill query-relevant information from the review, while the query-aware decoder integrates the query information into the attention computation during decoding. The framework consistently outperforms the competing methods on both public and real-world industrial datasets. Last but not least, online deployment experiments on Dianping demonstrate the advantage of the proposed framework for tip generation as well as its online business values.

* Accepted By CIKM 2020 Applied Research Track

Via

Access Paper or Ask Questions