Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yukun Cao

SkewRoute: Training-Free LLM Routing for Knowledge Graph Retrieval-Augmented Generation via Score Skewness of Retrieved Context

May 28, 2025

Hairu Wang, Yuan Feng, Yukun Cao, Xike Xie, S Kevin Zhou

Abstract:Large language models excel at many tasks but often incur high inference costs during deployment. To mitigate hallucination, many systems use a knowledge graph to enhance retrieval-augmented generation (KG-RAG). However, the large amount of retrieved knowledge contexts increase these inference costs further. A promising solution to balance performance and cost is LLM routing, which directs simple queries to smaller LLMs and complex ones to larger LLMs. However, no dedicated routing methods currently exist for RAG, and existing training-based routers face challenges scaling to this domain due to the need for extensive training data. We observe that the score distributions produced by the retrieval scorer strongly correlate with query difficulty. Based on this, we propose a novel, training-free routing framework, the first tailored to KG-RAG that effectively balances performance and cost in a plug-and-play manner. Experiments show our method reduces calls to larger LLMs by up to 50% without sacrificing response quality, demonstrating its potential for efficient and scalable LLM deployment.

Via

Access Paper or Ask Questions

Lego Sketch: A Scalable Memory-augmented Neural Network for Sketching Data Streams

May 26, 2025

Yuan Feng, Yukun Cao, Hairu Wang, Xike Xie, S Kevin Zhou

Abstract:Sketches, probabilistic structures for estimating item frequencies in infinite data streams with limited space, are widely used across various domains. Recent studies have shifted the focus from handcrafted sketches to neural sketches, leveraging memory-augmented neural networks (MANNs) to enhance the streaming compression capabilities and achieve better space-accuracy trade-offs.However, existing neural sketches struggle to scale across different data domains and space budgets due to inflexible MANN configurations. In this paper, we introduce a scalable MANN architecture that brings to life the {\it Lego sketch}, a novel sketch with superior scalability and accuracy. Much like assembling creations with modular Lego bricks, the Lego sketch dynamically coordinates multiple memory bricks to adapt to various space budgets and diverse data domains. Our theoretical analysis guarantees its high scalability and provides the first error bound for neural sketch. Furthermore, extensive experimental evaluations demonstrate that the Lego sketch exhibits superior space-accuracy trade-offs, outperforming existing handcrafted and neural sketches. Our code is available at https://github.com/FFY0/LegoSketch_ICML.

* ICML 2025

Via

Access Paper or Ask Questions

Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective

Feb 06, 2025

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S Kevin Zhou

Abstract:Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large Key-Value (KV) cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. Evaluations on the Needle-in-a-Haystack test and Longbench benchmark show our algorithm enhances state-of-the-art cache eviction methods. Further empirical analysis confirms that our algorithm achieves lower output perturbations in over 92% attention heads in Llama model, thereby providing a significant improvement over existing methods.

Via

Access Paper or Ask Questions

FRAG: A Flexible Modular Framework for Retrieval-Augmented Generation based on Knowledge Graphs

Jan 17, 2025

Zengyi Gao, Yukun Cao, Hairu Wang, Ao Ke, Yuan Feng, Xike Xie, S Kevin Zhou

Figure 1 for FRAG: A Flexible Modular Framework for Retrieval-Augmented Generation based on Knowledge Graphs

Figure 2 for FRAG: A Flexible Modular Framework for Retrieval-Augmented Generation based on Knowledge Graphs

Figure 3 for FRAG: A Flexible Modular Framework for Retrieval-Augmented Generation based on Knowledge Graphs

Figure 4 for FRAG: A Flexible Modular Framework for Retrieval-Augmented Generation based on Knowledge Graphs

Abstract:To mitigate the hallucination and knowledge deficiency in large language models (LLMs), Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) has shown promising potential by utilizing KGs as external resource to enhance LLMs reasoning.However, existing KG-RAG approaches struggle with a trade-off between flexibility and retrieval quality.Modular methods prioritize flexibility by avoiding the use of KG-fine-tuned models during retrieval, leading to fixed retrieval strategies and suboptimal retrieval quality.Conversely, coupled methods embed KG information within models to improve retrieval quality, but at the expense of flexibility.In this paper, we propose a novel flexible modular KG-RAG framework, termed FRAG, which synergizes the advantages of both approaches.FRAG estimates the hop range of reasoning paths based solely on the query and classify it as either simple or complex.To match the complexity of the query, tailored pipelines are applied to ensure efficient and accurate reasoning path retrieval, thus fostering the final reasoning process.By using the query text instead of the KG to infer the structural information of reasoning paths and employing adaptable retrieval strategies, FRAG improves retrieval quality while maintaining flexibility.Moreover, FRAG does not require extra LLMs fine-tuning or calls, significantly boosting efficiency and conserving resources.Extensive experiments show that FRAG achieves state-of-the-art performance with high efficiency and low resource consumption.

Via

Access Paper or Ask Questions

DPCL-Diff: The Temporal Knowledge Graph Reasoning based on Graph Node Diffusion Model with Dual-Domain Periodic Contrastive Learning

Nov 03, 2024

Yukun Cao, Lisheng Wang, Luobing Huang

Abstract:Temporal knowledge graph (TKG) reasoning that infers future missing facts is an essential and challenging task. Predicting future events typically relies on closely related historical facts, yielding more accurate results for repetitive or periodic events. However, for future events with sparse historical interactions, the effectiveness of this method, which focuses on leveraging high-frequency historical information, diminishes. Recently, the capabilities of diffusion models in image generation have opened new opportunities for TKG reasoning. Therefore, we propose a graph node diffusion model with dual-domain periodic contrastive learning (DPCL-Diff). Graph node diffusion model (GNDiff) introduces noise into sparsely related events to simulate new events, generating high-quality data that better conforms to the actual distribution. This generative mechanism significantly enhances the model's ability to reason about new events. Additionally, the dual-domain periodic contrastive learning (DPCL) maps periodic and non-periodic event entities to Poincar\'e and Euclidean spaces, leveraging their characteristics to distinguish similar periodic events effectively. Experimental results on four public datasets demonstrate that DPCL-Diff significantly outperforms state-of-the-art TKG models in event prediction, demonstrating our approach's effectiveness. This study also investigates the combined effectiveness of GNDiff and DPCL in TKG tasks.

* 11 pages, 2 figures

Via

Access Paper or Ask Questions

Prototype-based Optimal Transport for Out-of-Distribution Detection

Oct 10, 2024

Ao Ke, Wenlong Chen, Chuanwen Feng, Yukun Cao, Xike Xie, S. Kevin Zhou, Lei Feng

Figure 1 for Prototype-based Optimal Transport for Out-of-Distribution Detection

Figure 2 for Prototype-based Optimal Transport for Out-of-Distribution Detection

Figure 3 for Prototype-based Optimal Transport for Out-of-Distribution Detection

Figure 4 for Prototype-based Optimal Transport for Out-of-Distribution Detection

Abstract:Detecting Out-of-Distribution (OOD) inputs is crucial for improving the reliability of deep neural networks in the real-world deployment. In this paper, inspired by the inherent distribution shift between ID and OOD data, we propose a novel method that leverages optimal transport to measure the distribution discrepancy between test inputs and ID prototypes. The resulting transport costs are used to quantify the individual contribution of each test input to the overall discrepancy, serving as a desirable measure for OOD detection. To address the issue that solely relying on the transport costs to ID prototypes is inadequate for identifying OOD inputs closer to ID data, we generate virtual outliers to approximate the OOD region via linear extrapolation. By combining the transport costs to ID prototypes with the costs to virtual outliers, the detection of OOD data near ID data is emphasized, thereby enhancing the distinction between ID and OOD inputs. Experiments demonstrate the superiority of our method over state-of-the-art methods.

Via

Access Paper or Ask Questions

GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding

Sep 05, 2024

Yukun Cao, Shuo Han, Zengyi Gao, Zezhong Ding, Xike Xie, S. Kevin Zhou

Figure 1 for GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding

Figure 2 for GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding

Figure 3 for GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding

Figure 4 for GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding

Abstract:Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as ''positional biases''. To address this, we propose GraphInsight, a novel framework aimed at improving LLMs' comprehension of both macro- and micro-level graphical information. GraphInsight is grounded in two key strategies: 1) placing critical graphical information in positions where LLMs exhibit stronger memory performance, and 2) investigating a lightweight external knowledge base for regions with weaker memory performance, inspired by retrieval-augmented generation (RAG). Moreover, GraphInsight explores integrating these two strategies into LLM agent processes for composite graph tasks that require multi-step reasoning. Extensive empirical studies on benchmarks with a wide range of evaluation tasks show that GraphInsight significantly outperforms all other graph description methods (e.g., prompting techniques and reordering strategies) in understanding graph structures of varying sizes.

Via

Access Paper or Ask Questions

Optimizing KV Cache Eviction in LLMs: Adaptive Allocation for Enhanced Budget Utilization

Jul 16, 2024

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou

Figure 1 for Optimizing KV Cache Eviction in LLMs: Adaptive Allocation for Enhanced Budget Utilization

Figure 2 for Optimizing KV Cache Eviction in LLMs: Adaptive Allocation for Enhanced Budget Utilization

Figure 3 for Optimizing KV Cache Eviction in LLMs: Adaptive Allocation for Enhanced Budget Utilization

Figure 4 for Optimizing KV Cache Eviction in LLMs: Adaptive Allocation for Enhanced Budget Utilization

Abstract:Large Language Models have excelled in various fields but encounter efficiency limitations due to the extensive KV cache required for long sequences inference. Many efforts try to evict non-critical cache elements during runtime, thereby reducing cache size within a given memory budget while preserving generation quality. Our reexamination of their underlying principles discerns that prevailing strategies essentially aim to minimize an upper bound of eviction loss within a specific budget allocation. However, we observe that the current practice of uniformly allocating budgets across different attention heads during the eviction procedure tends to degrade the quality of generation posten-eviction. In light of these findings, we propose a simple yet effective adaptive allocation algorithm that not only theoretically ensures its loss upper bound does not exceed that of previous uniform allocation methods, but also effectively aligns with the characteristics of the self-attention mechanism, thus practically reducing the upper bound. Further, integrating this algorithm with two of the most advanced methods yields Ada-SnapKV and Ada-Pyramid. Extensive experimental validation across 16 datasets and the Needle-in-a-Haystack test confirm that Ada-SnapKV and Ada-Pyramid achieve further enhancements, establishing new benchmarks in state-of-the-art performance.

Via

Access Paper or Ask Questions

Learn to Explore: on Bootstrapping Interactive Data Exploration with Meta-learning

Dec 07, 2022

Yukun Cao, Xike Xie, Kexin Huang

Abstract:Interactive data exploration (IDE) is an effective way of comprehending big data, whose volume and complexity are beyond human abilities. The main goal of IDE is to discover user interest regions from a database through multi-rounds of user labelling. Existing IDEs adopt active-learning framework, where users iteratively discriminate or label the interestingness of selected tuples. The process of data exploration can be viewed as the process of training a classifier, which determines whether a database tuple is interesting to a user. An efficient exploration thus takes very few iterations of user labelling to reach the data region of interest. In this work, we consider the data exploration as the process of few-shot learning, where the classifier is learned with only a few training examples, or exploration iterations. To this end, we propose a learning-to-explore framework, based on meta-learning, which learns how to learn a classifier with automatically generated meta-tasks, so that the exploration process can be much shortened. Extensive experiments on real datasets show that our proposal outperforms existing explore-by-example solutions in terms of accuracy and efficiency.

Via

Access Paper or Ask Questions

Isomer: Transfer enhanced Dual-Channel Heterogeneous Dependency Attention Network for Aspect-based Sentiment Classification

Nov 21, 2021

Yukun Cao, Yijia Tang, Ziyue Wei, ChengKun Jin, Zeyu Miao, Yixin Fang, Haizhou Du, Feifei Xu

Figure 1 for Isomer: Transfer enhanced Dual-Channel Heterogeneous Dependency Attention Network for Aspect-based Sentiment Classification

Figure 2 for Isomer: Transfer enhanced Dual-Channel Heterogeneous Dependency Attention Network for Aspect-based Sentiment Classification

Figure 3 for Isomer: Transfer enhanced Dual-Channel Heterogeneous Dependency Attention Network for Aspect-based Sentiment Classification

Figure 4 for Isomer: Transfer enhanced Dual-Channel Heterogeneous Dependency Attention Network for Aspect-based Sentiment Classification

Abstract:Aspect-based sentiment classification aims to predict the sentiment polarity of a specific aspect in a sentence. However, most existing methods attempt to construct dependency relations into a homogeneous dependency graph with the sparsity and ambiguity, which cannot cover the comprehensive contextualized features of short texts or consider any additional node types or semantic relation information. To solve those issues, we present a sentiment analysis model named Isomer, which performs a dual-channel attention on heterogeneous dependency graphs incorporating external knowledge, to effectively integrate other additional information. Specifically, a transfer-enhanced dual-channel heterogeneous dependency attention network is devised in Isomer to model short texts using heterogeneous dependency graphs. These heterogeneous dependency graphs not only consider different types of information but also incorporate external knowledge. Experiments studies show that our model outperforms recent models on benchmark datasets. Furthermore, the results suggest that our method captures the importance of various information features to focus on informative contextual words.

* 6 Pages, 2 Figures

Via

Access Paper or Ask Questions