Abstract: Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.
![Figure 1 for GridTuner: Reinvestigate Grid Size Selection for Spatiotemporal Prediction Models [Technical Report]](/_next/image?url=https%3A%2F%2Fai2-s2-public.s3.amazonaws.com%2Ffigures%2F2017-08-08%2F96c4ddea729a54368beb3939f00ad8b67893edc7%2F1-Figure1-1.png&w=640&q=75)
![Figure 2 for GridTuner: Reinvestigate Grid Size Selection for Spatiotemporal Prediction Models [Technical Report]](/_next/image?url=https%3A%2F%2Fai2-s2-public.s3.amazonaws.com%2Ffigures%2F2017-08-08%2F96c4ddea729a54368beb3939f00ad8b67893edc7%2F14-Figure10-1.png&w=640&q=75)
![Figure 3 for GridTuner: Reinvestigate Grid Size Selection for Spatiotemporal Prediction Models [Technical Report]](/_next/image?url=https%3A%2F%2Fai2-s2-public.s3.amazonaws.com%2Ffigures%2F2017-08-08%2F96c4ddea729a54368beb3939f00ad8b67893edc7%2F14-Figure11-1.png&w=640&q=75)
![Figure 4 for GridTuner: Reinvestigate Grid Size Selection for Spatiotemporal Prediction Models [Technical Report]](/_next/image?url=https%3A%2F%2Fai2-s2-public.s3.amazonaws.com%2Ffigures%2F2017-08-08%2F96c4ddea729a54368beb3939f00ad8b67893edc7%2F14-Figure12-1.png&w=640&q=75)
Abstract: With the development of traffic prediction technology, spatiotemporal prediction models have attracted increasing attention from both academia and industry. However, most existing studies focus on reducing the model's prediction error and ignore the error caused by the uneven distribution of spatial events within a region. In this paper, we study a region partitioning problem, namely the optimal grid size selection problem (OGSS), which aims to minimize the real error of spatiotemporal prediction models by selecting the optimal grid size. To solve OGSS, we analyze the upper bound of the real error of spatiotemporal prediction models and minimize the real error by minimizing its upper bound. Through in-depth analysis, we find that the upper bound of the real error first decreases and then increases as the number of model grids grows from 1 to the maximum allowed value. We then propose two algorithms, namely Ternary Search and Iterative Method, to automatically find the optimal grid size. Finally, experiments verify that the prediction error follows the same trend as its upper bound, and that the upper bound of the real error first decreases and then increases as the number of model grids increases. Meanwhile, in a case study, selecting the optimal grid size improves the order dispatching results of a state-of-the-art prediction-based algorithm by up to 13.6%, which shows the effectiveness of our methods in tuning the region partition for spatiotemporal prediction models.
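The decrease-then-increase shape described in the abstract is what makes a ternary search applicable. The following is a minimal sketch of a generic integer ternary search, not the paper's actual algorithm: the function `toy_bound` is a hypothetical stand-in for the error upper bound (the abstract does not give its formula), and the name `ternary_search_min` and the range 1 to 100 are assumptions made for illustration.

```python
from typing import Callable


def ternary_search_min(f: Callable[[int], float], lo: int, hi: int) -> int:
    """Return an integer n in [lo, hi] that minimizes f.

    Assumes f strictly decreases and then strictly increases on [lo, hi]
    (the trend the abstract reports for the error upper bound).
    """
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1 = lo + third
        m2 = hi - third
        if f(m1) < f(m2):
            # The minimum cannot lie at or beyond m2.
            hi = m2 - 1
        else:
            # The minimum cannot lie at or before m1.
            lo = m1 + 1
    # At most three candidates remain; check them directly.
    return min(range(lo, hi + 1), key=f)


def toy_bound(n: int) -> float:
    """Hypothetical upper bound: one term shrinks with more grids, one grows."""
    return 100.0 / n + 0.5 * n


best_n = ternary_search_min(toy_bound, 1, 100)
print(best_n)
```

Each iteration discards roughly a third of the remaining range, so the bound is evaluated only a logarithmic number of times, which matters when every evaluation requires training or running a prediction model at that grid size.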
![Figure 1 for A Queueing-Theoretic Framework for Vehicle Dispatching in Dynamic Car-Hailing [technical report]](/_next/image?url=https%3A%2F%2Fai2-s2-public.s3.amazonaws.com%2Ffigures%2F2017-08-08%2Ffc232beef2f64fd18a8e781332d3e64520fe1361%2F1-Figure1-1.png&w=640&q=75)
![Figure 2 for A Queueing-Theoretic Framework for Vehicle Dispatching in Dynamic Car-Hailing [technical report]](/_next/image?url=https%3A%2F%2Fai2-s2-public.s3.amazonaws.com%2Ffigures%2F2017-08-08%2Ffc232beef2f64fd18a8e781332d3e64520fe1361%2F3-Table1-1.png&w=640&q=75)
![Figure 3 for A Queueing-Theoretic Framework for Vehicle Dispatching in Dynamic Car-Hailing [technical report]](/_next/image?url=https%3A%2F%2Fai2-s2-public.s3.amazonaws.com%2Ffigures%2F2017-08-08%2Ffc232beef2f64fd18a8e781332d3e64520fe1361%2F4-Figure2-1.png&w=640&q=75)
![Figure 4 for A Queueing-Theoretic Framework for Vehicle Dispatching in Dynamic Car-Hailing [technical report]](/_next/image?url=https%3A%2F%2Fai2-s2-public.s3.amazonaws.com%2Ffigures%2F2017-08-08%2Ffc232beef2f64fd18a8e781332d3e64520fe1361%2F10-Table2-1.png&w=640&q=75)
Abstract: With the rapid development of smart mobile devices, car-hailing platforms (e.g., Uber or Lyft) have attracted much attention from both academia and industry. In this paper, we consider an important dynamic car-hailing problem, namely maximum revenue vehicle dispatching (MRVD), in which rider requests arrive dynamically and drivers need to serve as many riders as possible such that the overall revenue of the platform is maximized. We prove that the MRVD problem is NP-hard and thus intractable. In addition, dynamic car-hailing platforms have no information about future riders, which makes the problem even harder. To handle the MRVD problem, we propose a queueing-based vehicle dispatching framework, which first uses existing machine learning algorithms to predict the future vehicle demand of each region, and then estimates the idle time periods of drivers through a queueing model for each region. Using the predicted vehicle demands and the estimated idle time periods of drivers, we propose two batch-based vehicle dispatching algorithms to efficiently assign suitable drivers to riders such that the expected overall revenue of the platform is maximized during each batch. Through extensive experiments, we demonstrate the efficiency and effectiveness of our proposed approaches on both real and synthetic datasets.