Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zirui Li

DocMMIR: A Framework for Document Multi-modal Information Retrieval

May 25, 2025

Zirui Li, Siwei Wu, Xingyu Wang, Yi Zhou, Yizhi Li, Chenghua Lin

Abstract:The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples, which systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP demonstrating reasonable zero-shot performance. Furthermore, we conduct a systematic investigation of training strategies, including cross-modal fusion methods and loss functions, and develop a tailored approach to train CLIP on our benchmark. This results in a +31% improvement in MRR@10 compared to the zero-shot baseline. All our data and code are released in https://github.com/J1mL1/DocMMIR.

* Comments: 13 pages, 7 figures. Code and data publicly available at https://github.com/J1mL1/DocMMIR

Via

Access Paper or Ask Questions

InterHub: A Naturalistic Trajectory Dataset with Dense Interaction for Autonomous Driving

Nov 30, 2024

Xiyan Jiang, Xiaocong Zhao, Yiru Liu, Zirui Li, Peng Hang, Lu Xiong, Jian Sun

Figure 1 for InterHub: A Naturalistic Trajectory Dataset with Dense Interaction for Autonomous Driving

Figure 2 for InterHub: A Naturalistic Trajectory Dataset with Dense Interaction for Autonomous Driving

Figure 3 for InterHub: A Naturalistic Trajectory Dataset with Dense Interaction for Autonomous Driving

Figure 4 for InterHub: A Naturalistic Trajectory Dataset with Dense Interaction for Autonomous Driving

Abstract:The driving interaction-a critical yet complex aspect of daily driving-lies at the core of autonomous driving research. However, real-world driving scenarios sparsely capture rich interaction events, limiting the availability of comprehensive trajectory datasets for this purpose. To address this challenge, we present InterHub, a dense interaction dataset derived by mining interaction events from extensive naturalistic driving records. We employ formal methods to describe and extract multi-agent interaction events, exposing the limitations of existing autonomous driving solutions. Additionally, we introduce a user-friendly toolkit enabling the expansion of InterHub with both public and private data. By unifying, categorizing, and analyzing diverse interaction events, InterHub facilitates cross-comparative studies and large-scale research, thereby advancing the evaluation and development of autonomous driving technologies.

Via

Access Paper or Ask Questions

ArrivalNet: Predicting City-wide Bus/Tram Arrival Time with Two-dimensional Temporal Variation Modeling

Oct 17, 2024

Zirui Li, Patrick Wolf, Meng Wang

Figure 1 for ArrivalNet: Predicting City-wide Bus/Tram Arrival Time with Two-dimensional Temporal Variation Modeling

Figure 2 for ArrivalNet: Predicting City-wide Bus/Tram Arrival Time with Two-dimensional Temporal Variation Modeling

Figure 3 for ArrivalNet: Predicting City-wide Bus/Tram Arrival Time with Two-dimensional Temporal Variation Modeling

Figure 4 for ArrivalNet: Predicting City-wide Bus/Tram Arrival Time with Two-dimensional Temporal Variation Modeling

Abstract:Accurate arrival time prediction (ATP) of buses and trams plays a crucial role in public transport operations. Current methods focused on modeling one-dimensional temporal information but overlooked the latent periodic information within time series. Moreover, most studies developed algorithms for ATP based on a single or a few routes of public transport, which reduces the transferability of the prediction models and their applicability in public transport management systems. To this end, this paper proposes \textit{ArrivalNet}, a two-dimensional temporal variation-based multi-step ATP for buses and trams. It decomposes the one-dimensional temporal sequence into intra-periodic and inter-periodic variations, which can be recast into two-dimensional tensors (2D blocks). Each row of a tensor contains the time points within a period, and each column involves the time points at the same intra-periodic index across various periods. The transformed 2D blocks in different frequencies have an image-like feature representation that enables effective learning with computer vision backbones (e.g., convolutional neural network). Drawing on the concept of residual neural network, the 2D block module is designed as a basic module for flexible aggregation. Meanwhile, contextual factors like workdays, peak hours, and intersections, are also utilized in the augmented feature representation to improve the performance of prediction. 125 days of public transport data from Dresden were collected for model training and validation. Experimental results show that the root mean square error, mean absolute error, and mean absolute percentage error of the proposed predictor decrease by at least 6.1\%, 14.7\%, and 34.2\% compared with state-of-the-art baseline methods.

* Under review at IEEE T-ITS

Via

Access Paper or Ask Questions

Updating Robot Safety Representations Online from Natural Language Feedback

Sep 22, 2024

Leonardo Santos, Zirui Li, Lasse Peters, Somil Bansal, Andrea Bajcsy

Figure 1 for Updating Robot Safety Representations Online from Natural Language Feedback

Figure 2 for Updating Robot Safety Representations Online from Natural Language Feedback

Figure 3 for Updating Robot Safety Representations Online from Natural Language Feedback

Figure 4 for Updating Robot Safety Representations Online from Natural Language Feedback

Abstract:Robots must operate safely when deployed in novel and human-centered environments, like homes. Current safe control approaches typically assume that the safety constraints are known a priori, and thus, the robot can pre-compute a corresponding safety controller. While this may make sense for some safety constraints (e.g., avoiding collision with walls by analyzing a floor plan), other constraints are more complex (e.g., spills), inherently personal, context-dependent, and can only be identified at deployment time when the robot is interacting in a specific environment and with a specific person (e.g., fragile objects, expensive rugs). Here, language provides a flexible mechanism to communicate these evolving safety constraints to the robot. In this work, we use vision language models (VLMs) to interpret language feedback and the robot's image observations to continuously update the robot's representation of safety constraints. With these inferred constraints, we update a Hamilton-Jacobi reachability safety controller online via efficient warm-starting techniques. Through simulation and hardware experiments, we demonstrate the robot's ability to infer and respect language-based safety constraints with the proposed approach.

* Submitted to ICRA 2025

Via

Access Paper or Ask Questions

A Nested Graph Reinforcement Learning-based Decision-making Strategy for Eco-platooning

Aug 14, 2024

Xin Gao, Xueyuan Li, Hao Liu, Ao Li, Zhaoyang Ma, Zirui Li

Abstract:Platooning technology is renowned for its precise vehicle control, traffic flow optimization, and energy efficiency enhancement. However, in large-scale mixed platoons, vehicle heterogeneity and unpredictable traffic conditions lead to virtual bottlenecks. These bottlenecks result in reduced traffic throughput and increased energy consumption within the platoon. To address these challenges, we introduce a decision-making strategy based on nested graph reinforcement learning. This strategy improves collaborative decision-making, ensuring energy efficiency and alleviating congestion. We propose a theory of nested traffic graph representation that maps dynamic interactions between vehicles and platoons in non-Euclidean spaces. By incorporating spatio-temporal weighted graph into a multi-head attention mechanism, we further enhance the model's capacity to process both local and global data. Additionally, we have developed a nested graph reinforcement learning framework to enhance the self-iterative learning capabilities of platooning. Using the I-24 dataset, we designed and conducted comparative algorithm experiments, generalizability testing, and permeability ablation experiments, thereby validating the proposed strategy's effectiveness. Compared to the baseline, our strategy increases throughput by 10% and decreases energy use by 9%. Specifically, increasing the penetration rate of CAVs significantly enhances traffic throughput, though it also increases energy consumption.

* 14 pages, 18 figures

Via

Access Paper or Ask Questions

QuantumSEA: In-Time Sparse Exploration for Noise Adaptive Quantum Circuits

Jan 10, 2024

Tianlong Chen, Zhenyu Zhang, Hanrui Wang, Jiaqi Gu, Zirui Li, David Z. Pan, Frederic T. Chong, Song Han, Zhangyang Wang

Abstract:Parameterized Quantum Circuits (PQC) have obtained increasing popularity thanks to their great potential for near-term Noisy Intermediate-Scale Quantum (NISQ) computers. Achieving quantum advantages usually requires a large number of qubits and quantum circuits with enough capacity. However, limited coherence time and massive quantum noises severely constrain the size of quantum circuits that can be executed reliably on real machines. To address these two pain points, we propose QuantumSEA, an in-time sparse exploration for noise-adaptive quantum circuits, aiming to achieve two key objectives: (1) implicit circuits capacity during training - by dynamically exploring the circuit's sparse connectivity and sticking a fixed small number of quantum gates throughout the training which satisfies the coherence time and enjoy light noises, enabling feasible executions on real quantum devices; (2) noise robustness - by jointly optimizing the topology and parameters of quantum circuits under real device noise models. In each update step of sparsity, we leverage the moving average of historical gradients to grow necessary gates and utilize salience-based pruning to eliminate insignificant gates. Extensive experiments are conducted with 7 Quantum Machine Learning (QML) and Variational Quantum Eigensolver (VQE) benchmarks on 6 simulated or real quantum computers, where QuantumSEA consistently surpasses noise-aware search, human-designed, and randomly generated quantum circuit baselines by a clear performance margin. For example, even in the most challenging on-chip training regime, our method establishes state-of-the-art results with only half the number of quantum gates and ~2x time saving of circuit executions. Codes are available at https://github.com/VITA-Group/QuantumSEA.

* IEEE International Conference on Quantum Computing and Engineering (QCE 2023)

Via

Access Paper or Ask Questions

Spatiotemporal-Linear: Towards Universal Multivariate Time Series Forecasting

Dec 22, 2023

Aiyinsi Zuo, Haixi Zhang, Zirui Li, Ce Zheng

Abstract:Within the field of complicated multivariate time series forecasting (TSF), popular techniques frequently rely on intricate deep learning architectures, ranging from transformer-based designs to recurrent neural networks. However, recent findings suggest that simple Linear models can surpass sophisticated constructs on diverse datasets. These models directly map observation to multiple future time steps, thereby minimizing error accumulation in iterative multi-step prediction. Yet, these models fail to incorporate spatial and temporal information within the data, which is critical for capturing patterns and dependencies that drive insightful predictions. This oversight often leads to performance bottlenecks, especially under specific sequence lengths and dataset conditions, preventing their universal application. In response, we introduce the SpatioTemporal-Linear (STL) framework. STL seamlessly integrates time-embedded and spatially-informed bypasses to augment the Linear-based architecture. These extra routes offer a more robust and refined regression to the data, particularly when the amount of observation is limited and the capacity of simple linear layers to capture dependencies declines. Empirical evidence highlights STL's prowess, outpacing both Linear and Transformer benchmarks across varied observation and prediction durations and datasets. Such robustness accentuates its suitability across a spectrum of applications, including but not limited to, traffic trajectory and rare disease progression forecasting. Through this discourse, we not only validate the STL's distinctive capacities to become a more general paradigm in multivariate time-series prediction using deep-learning techniques but also stress the need to tackle data-scarce prediction scenarios for universal application. Code will be made available.

Via

Access Paper or Ask Questions

RobustState: Boosting Fidelity of Quantum State Preparation via Noise-Aware Variational Training

Nov 27, 2023

Hanrui Wang, Yilian Liu, Pengyu Liu, Jiaqi Gu, Zirui Li, Zhiding Liang, Jinglei Cheng, Yongshan Ding, Xuehai Qian, Yiyu Shi(+3 more)

Figure 1 for RobustState: Boosting Fidelity of Quantum State Preparation via Noise-Aware Variational Training

Figure 2 for RobustState: Boosting Fidelity of Quantum State Preparation via Noise-Aware Variational Training

Figure 3 for RobustState: Boosting Fidelity of Quantum State Preparation via Noise-Aware Variational Training

Figure 4 for RobustState: Boosting Fidelity of Quantum State Preparation via Noise-Aware Variational Training

Abstract:Quantum state preparation, a crucial subroutine in quantum computing, involves generating a target quantum state from initialized qubits. Arbitrary state preparation algorithms can be broadly categorized into arithmetic decomposition (AD) and variational quantum state preparation (VQSP). AD employs a predefined procedure to decompose the target state into a series of gates, whereas VQSP iteratively tunes ansatz parameters to approximate target state. VQSP is particularly apt for Noisy-Intermediate Scale Quantum (NISQ) machines due to its shorter circuits. However, achieving noise-robust parameter optimization still remains challenging. We present RobustState, a novel VQSP training methodology that combines high robustness with high training efficiency. The core idea involves utilizing measurement outcomes from real machines to perform back-propagation through classical simulators, thus incorporating real quantum noise into gradient calculations. RobustState serves as a versatile, plug-and-play technique applicable for training parameters from scratch or fine-tuning existing parameters to enhance fidelity on target machines. It is adaptable to various ansatzes at both gate and pulse levels and can even benefit other variational algorithms, such as variational unitary synthesis. Comprehensive evaluation of RobustState on state preparation tasks for 4 distinct quantum algorithms using 10 real quantum machines demonstrates a coherent error reduction of up to 7.1 $\times$ and state fidelity improvement of up to 96\% and 81\% for 4-Q and 5-Q states, respectively. On average, RobustState improves fidelity by 50\% and 72\% for 4-Q and 5-Q states compared to baseline approaches.

* Accepted to FASTML @ ICCAD 2023. 14 pages, 20 figures

Via

Access Paper or Ask Questions

Reachability-Based Confidence-Aware Probabilistic Collision Detection in Highway Driving

Feb 14, 2023

Xinwei Wang, Zirui Li, Javier Alonso-Mora, Meng Wang

Abstract:Risk assessment is a crucial component of collision warning and avoidance systems in intelligent vehicles. To accurately detect potential vehicle collisions, reachability-based formal approaches have been developed to ensure driving safety, but suffer from over-conservatism, potentially leading to false-positive risk events in complicated real-world applications. In this work, we combine two reachability analysis techniques, i.e., backward reachable set (BRS) and stochastic forward reachable set (FRS), and propose an integrated probabilistic collision detection framework in highway driving. Within the framework, we can firstly use a BRS to formally check whether a two-vehicle interaction is safe; otherwise, a prediction-based stochastic FRS is employed to estimate a collision probability at each future time step. In doing so, the framework can not only identify non-risky events with guaranteed safety, but also provide accurate collision risk estimation in safety-critical events. To construct the stochastic FRS, we develop a neural network-based acceleration model for surrounding vehicles, and further incorporate confidence-aware dynamic belief to improve the prediction accuracy. Extensive experiments are conducted to validate the performance of the acceleration prediction model based on naturalistic highway driving data, and the efficiency and effectiveness of the framework with the infused confidence belief are tested both in naturalistic and simulated highway scenarios. The proposed risk assessment framework is promising in real-world applications.

* Under review at Engineering. arXiv admin note: text overlap with arXiv:2205.01357

Via

Access Paper or Ask Questions

Continual Interactive Behavior Learning With Traffic Divergence Measurement: A Dynamic Gradient Scenario Memory Approach

Dec 21, 2022

Yunlong Lin, Zirui Li, Cheng Gong, Chao Lu, Xinwei Wang, Jianwei Gong

Figure 1 for Continual Interactive Behavior Learning With Traffic Divergence Measurement: A Dynamic Gradient Scenario Memory Approach

Figure 2 for Continual Interactive Behavior Learning With Traffic Divergence Measurement: A Dynamic Gradient Scenario Memory Approach

Figure 3 for Continual Interactive Behavior Learning With Traffic Divergence Measurement: A Dynamic Gradient Scenario Memory Approach

Figure 4 for Continual Interactive Behavior Learning With Traffic Divergence Measurement: A Dynamic Gradient Scenario Memory Approach

Abstract:Developing autonomous vehicles (AVs) helps improve the road safety and traffic efficiency of intelligent transportation systems (ITS). Accurately predicting the trajectories of traffic participants is essential to the decision-making and motion planning of AVs in interactive scenarios. Recently, learning-based trajectory predictors have shown state-of-the-art performance in highway or urban areas. However, most existing learning-based models trained with fixed datasets may perform poorly in continuously changing scenarios. Specifically, they may not perform well in learned scenarios after learning the new one. This phenomenon is called "catastrophic forgetting". Few studies investigate trajectory predictions in continuous scenarios, where catastrophic forgetting may happen. To handle this problem, first, a novel continual learning (CL) approach for vehicle trajectory prediction is proposed in this paper. Then, inspired by brain science, a dynamic memory mechanism is developed by utilizing the measurement of traffic divergence between scenarios, which balances the performance and training efficiency of the proposed CL approach. Finally, datasets collected from different locations are used to design continual training and testing methods in experiments. Experimental results show that the proposed approach achieves consistently high prediction accuracy in continuous scenarios without re-training, which mitigates catastrophic forgetting compared to non-CL approaches. The implementation of the proposed approach is publicly available at https://github.com/BIT-Jack/D-GSM

Via

Access Paper or Ask Questions