Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yilong Zhao

LEANN: A Low-Storage Vector Index

Jun 09, 2025

Yichuan Wang, Shu Liu, Zhifei Li, Yongji Wu, Ziming Mao, Yilong Zhao, Xiao Yan, Zhiying Xu, Yang Zhou, Ion Stoica(+3 more)

Abstract:Embedding-based search is widely used in applications such as recommendation and retrieval-augmented generation (RAG). Recently, there is a growing demand to support these capabilities over personal data stored locally on devices. However, maintaining the necessary data structure associated with the embedding-based search is often infeasible due to its high storage overhead. For example, indexing 100 GB of raw data requires 150 to 700 GB of storage, making local deployment impractical. Reducing this overhead while maintaining search quality and latency becomes a critical challenge. In this paper, we present LEANN, a storage-efficient approximate nearest neighbor (ANN) search index optimized for resource-constrained personal devices. LEANN combines a compact graph-based structure with an efficient on-the-fly recomputation strategy to enable fast and accurate retrieval with minimal storage overhead. Our evaluation shows that LEANN reduces index size to under 5% of the original raw data, achieving up to 50 times smaller storage than standard indexes, while maintaining 90% top-3 recall in under 2 seconds on real-world question answering benchmarks.

Via

Access Paper or Ask Questions

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

May 24, 2025

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng(+4 more)

Abstract:Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively.

Via

Access Paper or Ask Questions

Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Feb 03, 2025

Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li(+4 more)

Figure 1 for Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Figure 2 for Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Figure 3 for Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Figure 4 for Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity

Abstract:Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.

* 13 pages, 8 figures, 3 tables

Via

Access Paper or Ask Questions

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Nov 25, 2024

Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica

Figure 1 for BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Figure 2 for BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Figure 3 for BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Figure 4 for BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Abstract:Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping. However, a request schedule that maximizes resource overlapping can conflict with the schedule that maximizes prefix sharing, a widely-used performance optimization, causing sub-optimal inference throughput. We present BlendServe, a system that maximizes resource utilization of offline batch inference by combining the benefits of resource overlapping and prefix sharing using a resource-aware prefix tree. BlendServe exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing. We evaluate BlendServe on a variety of synthetic multi-modal workloads and show that it provides up to $1.44\times$ throughput boost compared to widely-used industry standards, vLLM and SGLang.

Via

Access Paper or Ask Questions

XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Nov 22, 2024

Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen

Figure 1 for XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Figure 2 for XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Figure 3 for XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Figure 4 for XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Abstract:The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing context-free grammar requires going through several stack states over all tokens in vocabulary during runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structure generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted during runtime. We further build transformations to expand the grammar context and reduce the number of context-independent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with LLM inference engine to overlap grammar computation with GPU executions. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it can generate near-zero overhead structure generation in end-to-end low-LLM serving.

Via

Access Paper or Ask Questions

A Large Language Model-based Framework for Semi-Structured Tender Document Retrieval-Augmented Generation

Oct 04, 2024

Yilong Zhao, Daifeng Li

Figure 1 for A Large Language Model-based Framework for Semi-Structured Tender Document Retrieval-Augmented Generation

Figure 2 for A Large Language Model-based Framework for Semi-Structured Tender Document Retrieval-Augmented Generation

Figure 3 for A Large Language Model-based Framework for Semi-Structured Tender Document Retrieval-Augmented Generation

Figure 4 for A Large Language Model-based Framework for Semi-Structured Tender Document Retrieval-Augmented Generation

Abstract:The drafting of documents in the procurement field has progressively become more complex and diverse, driven by the need to meet legal requirements, adapt to technological advancements, and address stakeholder demands. While large language models (LLMs) show potential in document generation, most LLMs lack specialized knowledge in procurement. To address this gap, we use retrieval-augmented techniques to achieve professional document generation, ensuring accuracy and relevance in procurement documentation.

Via

Access Paper or Ask Questions

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Jun 16, 2024

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

Figure 1 for Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Figure 2 for Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Figure 3 for Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Figure 4 for Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Abstract:As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe the criticality of a token highly depends on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 2.23x self-attention speedup, which reduces inference latency by 7.03x while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at http://github.com/mit-han-lab/Quest .

* ICML 2024

Via

Access Paper or Ask Questions

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Nov 07, 2023

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

Figure 1 for Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Figure 2 for Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Figure 3 for Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Figure 4 for Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Abstract:The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to $7.73\times$ compared to the FP16 and by $2.53\times$ compared to INT8 quantization, while maintaining the same latency target.

Via

Access Paper or Ask Questions

Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

Jan 30, 2022

Weidong Cao, Yilong Zhao, Adith Boloor, Yinhe Han, Xuan Zhang, Li Jiang

Figure 1 for Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

Figure 2 for Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

Figure 3 for Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

Figure 4 for Neural-PIM: Efficient Processing-In-Memory with Neural Approximation of Peripherals

Abstract:Processing-in-memory (PIM) architectures have demonstrated great potential in accelerating numerous deep learning tasks. Particularly, resistive random-access memory (RRAM) devices provide a promising hardware substrate to build PIM accelerators due to their abilities to realize efficient in-situ vector-matrix multiplications (VMMs). However, existing PIM accelerators suffer from frequent and energy-intensive analog-to-digital (A/D) conversions, severely limiting their performance. This paper presents a new PIM architecture to efficiently accelerate deep learning tasks by minimizing the required A/D conversions with analog accumulation and neural approximated peripheral circuits. We first characterize the different dataflows employed by existing PIM accelerators, based on which a new dataflow is proposed to remarkably reduce the required A/D conversions for VMMs by extending shift and add (S+A) operations into the analog domain before the final quantizations. We then leverage a neural approximation method to design both analog accumulation circuits (S+A) and quantization circuits (ADCs) with RRAM crossbar arrays in a highly-efficient manner. Finally, we apply them to build an RRAM-based PIM accelerator (i.e., \textbf{Neural-PIM}) upon the proposed analog dataflow and evaluate its system-level performance. Evaluations on different benchmarks demonstrate that Neural-PIM can improve energy efficiency by 5.36x (1.73x) and speed up throughput by 3.43x (1.59x) without losing accuracy, compared to the state-of-the-art RRAM-based PIM accelerators, i.e., ISAAC (CASCADE).

* IEEE Transactions on Computers, 2021
* 14 pages, 13 figures, Published in IEEE Transactions on Computers

Via

Access Paper or Ask Questions

SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit Sparsity of Neural Network

Mar 02, 2021

Fangxin Liu, Wenbo Zhao, Yilong Zhao, Zongwu Wang, Tao Yang, Zhezhi He, Naifeng Jing, Xiaoyao Liang, Li Jiang

Figure 1 for SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit Sparsity of Neural Network

Figure 2 for SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit Sparsity of Neural Network

Figure 3 for SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit Sparsity of Neural Network

Figure 4 for SME: ReRAM-based Sparse-Multiplication-Engine to Squeeze-Out Bit Sparsity of Neural Network

Abstract:Resistive Random-Access-Memory (ReRAM) crossbar is a promising technique for deep neural network (DNN) accelerators, thanks to its in-memory and in-situ analog computing abilities for Vector-Matrix Multiplication-and-Accumulations (VMMs). However, it is challenging for crossbar architecture to exploit the sparsity in the DNN. It inevitably causes complex and costly control to exploit fine-grained sparsity due to the limitation of tightly-coupled crossbar structure. As the countermeasure, we developed a novel ReRAM-based DNN accelerator, named Sparse-Multiplication-Engine (SME), based on a hardware and software co-design framework. First, we orchestrate the bit-sparse pattern to increase the density of bit-sparsity based on existing quantization methods. Second, we propose a novel weigh mapping mechanism to slice the bits of a weight across the crossbars and splice the activation results in peripheral circuits. This mechanism can decouple the tightly-coupled crossbar structure and cumulate the sparsity in the crossbar. Finally, a superior squeeze-out scheme empties the crossbars mapped with highly-sparse non-zeros from the previous two steps. We design the SME architecture and discuss its use for other quantization methods and different ReRAM cell technologies. Compared with prior state-of-the-art designs, the SME shrinks the use of crossbars up to 8.7x and 2.1x using Resent-50 and MobileNet-v2, respectively, with less than 0.3% accuracy drop on ImageNet.

* 7 pages, 12 figures

Via

Access Paper or Ask Questions