Abstract:Vision-Language Models (VLMs) offer the ability to generate high-level, interpretable descriptions of complex activities from images and videos, making them valuable for situational awareness (SA) applications. In such settings, the focus is on identifying infrequent but significant events with high reliability and accuracy, while also extracting fine-grained details and assessing recognition quality. In this paper, we propose an approach that integrates VLMs with traditional computer vision methods through explicit logic reasoning to enhance SA in three key ways: (a) extracting fine-grained event details, (b) employing an intelligent fine-tuning (FT) strategy that achieves substantially higher accuracy than uninformed selection, and (c) generating justifications for VLM outputs during inference. We demonstrate that our intelligent FT mechanism improves the accuracy and provides a valuable means, during inferencing, to either confirm the validity of the VLM output or indicate why it may be questionable.
Abstract:In this paper, we demonstrate a proof of concept for characterizing vehicular behavior using only the roadside cameras of the ITS system. The essential advantage of this method is that it can be implemented in the roadside infrastructure transparently and inexpensively and can have a global view of each vehicle's behavior without any involvement of or awareness by the individual vehicles or drivers. By using a setup that includes programmatically controlled robot cars (to simulate different types of vehicular behaviors) and an external video camera set up to capture and analyze the vehicular behavior, we show that the driver classification based on the external video analytics yields accuracies that are within 1-2\% of the accuracies of direct vehicle-based characterization. We also show that the residual errors primarily relate to gaps in correct object identification and tracking and thus can be further reduced with a more sophisticated setup. The characterization can be used to enhance both the safety and performance of the traffic flow, particularly in the mixed manual and automated vehicle scenarios that are expected to be common soon.