Abstract: Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to switch dynamically between tool-augmented visual reasoning and text-only reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench uses a Matthews Correlation Coefficient (MCC) metric to evaluate the rationality of reasoning-mode selection, isolating this meta-cognitive ability by dynamically identifying task difficulty relative to each model's capability boundary. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key-step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. In contrast, key-step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.
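To make the mode-selection metric concrete, the sketch below computes MCC over binary decisions between tool-augmented and text-only reasoning. The decision pairs and the notion of "needing the tool mode" (a task lying beyond the model's text-only capability boundary) are hypothetical placeholders illustrating the standard MCC formula, not the benchmark's released implementation.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient over binary mode-selection decisions."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical decisions: each pair is (model_chose_tool_mode, task_needs_tool_mode),
# where "needs tool mode" means the task falls beyond the model's own
# text-only capability boundary (the dynamically determined difficulty label).
decisions = [(True, True), (True, False), (False, False), (False, True), (True, True)]
tp = sum(1 for chose, need in decisions if chose and need)
tn = sum(1 for chose, need in decisions if not chose and not need)
fp = sum(1 for chose, need in decisions if chose and not need)
fn = sum(1 for chose, need in decisions if not chose and need)
print(f"mode-selection MCC = {mcc(tp, tn, fp, fn):.3f}")
```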
Abstract: The comprehensiveness of vehicle-to-everything (V2X) recognition enriches and holistically shapes global Bird's-Eye-View (BEV) perception, incorporating rich semantics and integrating driving-scene information, thereby supporting trajectory prediction, decision-making, and driving planning. Utilizing V2X message sets to form a BEV representation proves to be an effective perception method for connected and automated vehicles (CAVs). Specifically, MAP, SPAT, and RSI data contribute to road connectivity awareness, synchronized traffic signal navigation, and obstacle warning. Moreover, time-sequential BSM information from multiple vehicles allows for perception of the current state and prediction of future trajectories. Therefore, this paper develops a comprehensive autonomous driving model that relies on BEV-V2X perception, Interacting Multiple Model Unscented Kalman Filter (IMM-UKF)-based trajectory prediction, and deep reinforcement learning (DRL)-based decision-making and planning. We establish a DRL environment with reward-shaping methods to formulate a unified set of optimal driving behaviors that encompass obstacle avoidance, lane changes, overtaking, turning maneuvers, and synchronized traffic signal navigation. A complex traffic intersection scenario was then simulated, and the trained model was applied for driving control. The observed driving behavior closely resembled that of an experienced driver, exhibiting anticipatory actions and revealing notable characteristics of the learned driving policy.
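As a rough illustration of how reward shaping can encode the driving behaviors listed above, the sketch below defines a toy shaped reward over a few assumed observation fields: speed tracking against a signal-synchronized target, lane keeping, time-to-collision, and red-light and collision penalties. The field names, weights, and thresholds are invented for illustration and are not the paper's actual reward design.

```python
from dataclasses import dataclass

@dataclass
class DrivingState:
    # Hypothetical observation fields distilled from BEV-V2X perception.
    speed: float          # ego speed (m/s)
    target_speed: float   # speed synchronized to the SPAT signal phase (m/s)
    lane_offset: float    # lateral offset from the lane centre (m)
    ttc: float            # time-to-collision with the nearest obstacle (s)
    ran_red_light: bool
    collided: bool

def shaped_reward(s: DrivingState) -> float:
    """A minimal reward-shaping sketch with assumed weights, not the paper's design.

    Rewards tracking the signal-synchronized speed and lane centre;
    penalizes unsafe gaps, red-light violations, and collisions.
    """
    r = 0.0
    r -= 0.1 * abs(s.speed - s.target_speed)   # green-wave / progress tracking
    r -= 0.05 * abs(s.lane_offset)             # lane keeping
    if s.ttc < 2.0:                            # discourage unsafe following gaps
        r -= 1.0 * (2.0 - s.ttc)
    if s.ran_red_light:
        r -= 10.0
    if s.collided:
        r -= 100.0
    return r

# Example: slightly over target speed, well centred, safe gap, no violations.
print(shaped_reward(DrivingState(12.0, 11.0, 0.2, 4.5, False, False)))
```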