Yaxing Cai

XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Nov 22, 2024

Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen

Abstract: The applications of LLM agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing a context-free grammar requires going through several stack states over all tokens in the vocabulary at runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted at runtime. We further build transformations to expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU execution. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it achieves near-zero-overhead structured generation in end-to-end LLM serving.
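The split between context-independent and context-dependent tokens can be illustrated with a toy example. The sketch below is not XGrammar's implementation or API: it assumes a tiny bracket-matching language in place of a general context-free grammar, and the names `step`, `classify`, `build_cache`, and `allowed_tokens` are hypothetical. A token is prechecked using only the top of the stack; it is deferred to a runtime check only when deciding it would require looking below that top entry.

```python
# Toy language: properly nested "(" ")" and "[" "]". Tokens are multi-character strings.
PAIRS = {")": "(", "]": "["}

def step(stack, token):
    """Full runtime check: return the new stack, or None if the token is rejected."""
    stack = list(stack)
    for ch in token:
        if ch in PAIRS:
            if not stack or stack[-1] != PAIRS[ch]:
                return None
            stack.pop()
        else:
            stack.append(ch)
    return stack

def classify(top, token):
    """Classify a token knowing only the top stack symbol (None means empty stack)."""
    local = [] if top is None else [top]
    whole_stack_known = top is None
    for ch in token:
        if ch in PAIRS:
            if not local:
                # The token pops below what we know: defer unless the stack is empty.
                return "reject" if whole_stack_known else "dependent"
            if local[-1] != PAIRS[ch]:
                return "reject"
            local.pop()
        else:
            local.append(ch)
    return "accept"

def build_cache(vocab):
    """Precompute, per top-of-stack symbol, the accepted and context-dependent tokens."""
    cache = {}
    for top in (None, "(", "["):
        labels = {tok: classify(top, tok) for tok in vocab}
        cache[top] = (
            {t for t, lab in labels.items() if lab == "accept"},
            {t for t, lab in labels.items() if lab == "dependent"},
        )
    return cache

def allowed_tokens(stack, cache):
    """Runtime mask: cached accepts plus the deferred tokens that pass the full check."""
    accepted, dependent = cache[stack[-1] if stack else None]
    return accepted | {t for t in dependent if step(stack, t) is not None}

vocab = ["(", ")", "[", "]", "()", "))", ")]", "]("]
cache = build_cache(vocab)
print(sorted(allowed_tokens(["[", "("], cache)))
```

In this toy setting only the tokens that close more brackets than the known top entry need the full stack at runtime; every other token is decided once during preprocessing.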

Relax: Composable Abstractions for End-to-End Dynamic Machine Learning

Nov 01, 2023

Ruihang Lai, Junru Shao, Siyuan Feng, Steven S. Lyubomirsky, Bohan Hou, Wuwei Lin, Zihao Ye, Hongyi Jin, Yuchen Jin, Jiawei Liu (+9 more)

Abstract: Dynamic shape computations have become critical in modern machine learning workloads, especially in emerging large language models. The success of these models has driven demand for deploying them to a diverse set of backend environments. In this paper, we present Relax, a compiler abstraction for optimizing end-to-end dynamic machine learning workloads. Relax introduces first-class symbolic shape annotations to track dynamic shape computations globally across the program. It also introduces a cross-level abstraction that encapsulates computational graphs, loop-level tensor programs, and library calls in a single representation to enable cross-level optimizations. We build an end-to-end compilation framework using the proposed approach to optimize dynamic shape models. Experimental results on large language models show that Relax delivers performance competitive with state-of-the-art hand-optimized systems across platforms and enables deployment of emerging dynamic models to a broader set of environments, including mobile phones, embedded devices, and web browsers.
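As a rough illustration of what tracking symbolic shapes across a program means, the sketch below propagates a symbolic dimension through two matrix multiplications. It is not Relax's API: the `Shape` representation (integers for static dimensions, strings for symbolic ones) and the helpers `matmul_shape` and `bind` are assumptions made only for this example.

```python
from typing import Dict, Tuple, Union

Dim = Union[int, str]       # int = static dimension, str = symbolic variable such as "n"
Shape = Tuple[Dim, ...]

def matmul_shape(a: Shape, b: Shape) -> Shape:
    """Deduce the result shape of a 2-D matmul, keeping symbolic dimensions symbolic."""
    if len(a) != 2 or len(b) != 2:
        raise ValueError("this sketch only handles 2-D operands")
    if a[1] != b[0]:
        # Inner dimensions must be the same integer or the same symbol.
        raise ValueError(f"inner dimensions differ: {a[1]!r} vs {b[0]!r}")
    return (a[0], b[1])

def bind(shape: Shape, env: Dict[str, int]) -> Tuple[int, ...]:
    """Substitute concrete values for symbolic dimensions once they are known."""
    return tuple(env[d] if isinstance(d, str) else d for d in shape)

# "n" (for example, a sequence length) is unknown at compile time
# but is tracked through both layers.
x: Shape = ("n", 4096)
hidden = matmul_shape(x, (4096, 11008))
out = matmul_shape(hidden, (11008, 4096))
print(out)                   # symbolic result shape
print(bind(out, {"n": 7}))   # concrete shape once n is known
```

Because the output shape is expressed in terms of the same symbol as the input, a compiler can relate the two (for example, to plan memory) without knowing the runtime value.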

CVLight: Deep Reinforcement Learning for Adaptive Traffic Signal Control with Connected Vehicles

Apr 21, 2021

Wangzhi Li, Yaxing Cai, Ujwal Dinesha, Yongjie Fu, Xuan Di

Abstract: This paper develops a reinforcement learning (RL) scheme for adaptive traffic signal control (ATSC), called "CVLight", that leverages data collected only from connected vehicles (CV). Seven types of RL models are proposed within this scheme that contain various state and reward representations, including incorporation of CV delay and green light duration into the state and the use of CV delay as the reward. To further incorporate information from both CVs and non-CVs into CVLight, an actor-critic algorithm, A2C-Full, is proposed in which both CV and non-CV information is used to train the critic network, while only CV information is used to update the policy network and execute optimal signal timing. These models are compared at an isolated intersection under various CV market penetration rates. A full model with the best performance (i.e., minimum average travel delay per vehicle) is then selected and compared with state-of-the-art benchmarks under different levels of traffic demand, turning proportions, and dynamic traffic demands, respectively. Two case studies are performed on an isolated intersection and a corridor with three consecutive intersections located in Manhattan, New York, to further demonstrate the effectiveness of the proposed algorithm under real-world scenarios. Compared to other baseline models that use all vehicle information, the trained CVLight agent can efficiently control multiple intersections solely based on CV data and can achieve similar or even better performance when the CV penetration rate is no less than 20%.
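The asymmetric use of information described for A2C-Full (critic trained on CV and non-CV information, policy updated from CV information only) can be sketched as follows. This is a minimal illustration under strong assumptions, not the paper's model: it uses linear function approximators instead of neural networks, a one-step advantage estimate, and a hypothetical class name `AsymmetricA2C`.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class AsymmetricA2C:
    """Actor sees only the CV observation; critic sees the full (CV + non-CV) observation."""

    def __init__(self, cv_dim, full_dim, n_actions, lr=0.1, gamma=0.9):
        self.theta = [[0.0] * cv_dim for _ in range(n_actions)]  # policy weights
        self.w = [0.0] * full_dim                                # critic weights
        self.lr, self.gamma = lr, gamma

    def policy(self, cv_obs):
        return softmax([dot(row, cv_obs) for row in self.theta])

    def value(self, full_obs):
        return dot(self.w, full_obs)

    def update(self, cv_obs, full_obs, action, reward, next_full_obs):
        # One-step advantage estimated by the critic from full information.
        advantage = reward + self.gamma * self.value(next_full_obs) - self.value(full_obs)
        probs = self.policy(cv_obs)
        # Policy-gradient step for a softmax policy, using CV features only.
        for a, row in enumerate(self.theta):
            coeff = (1.0 if a == action else 0.0) - probs[a]
            for i, x in enumerate(cv_obs):
                row[i] += self.lr * advantage * coeff * x
        # Semi-gradient TD step for the linear critic, using full features.
        for i, x in enumerate(full_obs):
            self.w[i] += self.lr * advantage * x
        return advantage

agent = AsymmetricA2C(cv_dim=2, full_dim=3, n_actions=2)
cv_obs, full_obs = [1.0, 0.5], [1.0, 0.5, 2.0]
agent.update(cv_obs, full_obs, action=0, reward=1.0, next_full_obs=full_obs)
print(agent.policy(cv_obs))
```

At execution time only `policy(cv_obs)` is needed, so the controller can run on CV data alone even though training used the full observation.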

* 27 pages, 13 figures
