Abstract:Autonomous vehicles face significant challenges in navigating adverse weather, particularly rain, due to the visual impairment of camera-based systems. In this study, we leveraged contemporary deep learning techniques to mitigate these challenges, aiming to develop a vision model that processes live vehicle camera feeds to eliminate rain-induced visual hindrances, yielding visuals closely resembling clear, rain-free scenes. Using the Car Learning to Act (CARLA) simulation environment, we generated a comprehensive dataset of clear and rainy images for model training and testing. In our model, we employed a classic encoder-decoder architecture with skip connections and concatenation operations. It was trained using novel batching schemes designed to effectively distinguish high-frequency rain patterns from low-frequency scene features across successive image frames. To evaluate the model performance, we integrated it with a steering module that processes front-view images as input. The results demonstrated notable improvements in steering accuracy, underscoring the model's potential to enhance navigation safety and reliability in rainy weather conditions.
Abstract:Image deraining holds great potential for enhancing the vision of autonomous vehicles in rainy conditions, contributing to safer driving. Previous works have primarily focused on employing a single network architecture to generate derained images. However, they often fail to fully exploit the rich prior knowledge embedded in the scenes. Particularly, most methods overlook the depth information that can provide valuable context about scene geometry and guide more robust deraining. In this work, we introduce a novel learning framework that integrates multiple networks: an AutoEncoder for deraining, an auxiliary network to incorporate depth information, and two supervision networks to enforce feature consistency between rainy and clear scenes. This multi-network design enables our model to effectively capture the underlying scene structure, producing clearer and more accurately derained images, leading to improved object detection for autonomous vehicles. Extensive experiments on three widely-used datasets demonstrated the effectiveness of our proposed method.
Abstract:Existing deep learning-based object detection models perform well under daytime conditions but face significant challenges at night, primarily because they are predominantly trained on daytime images. Additionally, training with nighttime images presents another challenge: even human annotators struggle to accurately label objects in low-light conditions. This issue is particularly pronounced in transportation applications, such as detecting vehicles and other objects of interest on rural roads at night, where street lighting is often absent, and headlights may introduce undesirable glare. This study addresses these challenges by introducing a novel framework for labeling-free data augmentation, leveraging CARLA-generated synthetic data for day-to-night image style transfer. Specifically, the framework incorporates the Efficient Attention Generative Adversarial Network for realistic day-to-night style transfer and uses CARLA-generated synthetic nighttime images to help the model learn vehicle headlight effects. To evaluate the efficacy of the proposed framework, we fine-tuned the YOLO11 model with an augmented dataset specifically curated for rural nighttime environments, achieving significant improvements in nighttime vehicle detection. This novel approach is simple yet effective, offering a scalable solution to enhance AI-based detection systems in low-visibility environments and extend the applicability of object detection models to broader real-world contexts.
Abstract:Reliable and interpretable traffic crash modeling is essential for understanding causality and improving road safety. This study introduces a novel approach to predicting collision types by utilizing a comprehensive dataset fused from multiple sources, including weather data, crash reports, high-resolution traffic information, pavement geometry, and facility characteristics. Central to our approach is the development of a Feature Group Tabular Transformer (FGTT) model, which organizes disparate data into meaningful feature groups, represented as tokens. These group-based tokens serve as rich semantic components, enabling effective identification of collision patterns and interpretation of causal mechanisms. The FGTT model is benchmarked against widely used tree ensemble models, including Random Forest, XGBoost, and CatBoost, demonstrating superior predictive performance. Furthermore, model interpretation reveals key influential factors, providing fresh insights into the underlying causality of distinct crash types.
Abstract:Harnessing the power of Large Language Models (LLMs), this study explores the use of three state-of-the-art LLMs, specifically GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B, for crash severity inference, framing it as a classification task. We generate textual narratives from original traffic crash tabular data using a pre-built template infused with domain knowledge. Additionally, we incorporated Chain-of-Thought (CoT) reasoning to guide the LLMs in analyzing the crash causes and then inferring the severity. This study also examine the impact of prompt engineering specifically designed for crash severity inference. The LLMs were tasked with crash severity inference to: (1) evaluate the models' capabilities in crash severity analysis, (2) assess the effectiveness of CoT and domain-informed prompt engineering, and (3) examine the reasoning abilities with the CoT framework. Our results showed that LLaMA3-70B consistently outperformed the other models, particularly in zero-shot settings. The CoT and Prompt Engineering techniques significantly enhanced performance, improving logical reasoning and addressing alignment issues. Notably, the CoT offers valuable insights into LLMs' reasoning processes, unleashing their capacity to consider diverse factors such as environmental conditions, driver behavior, and vehicle characteristics in severity analysis and inference.
Abstract:This paper introduces a novel approach to leverage features learned from both supervised and self-supervised paradigms, to improve image classification tasks, specifically for vehicle classification. Two state-of-the-art self-supervised learning methods, DINO and data2vec, were evaluated and compared for their representation learning of vehicle images. The former contrasts local and global views while the latter uses masked prediction on multi-layered representations. In the latter case, supervised learning is employed to finetune a pretrained YOLOR object detector for detecting vehicle wheels, from which definitive wheel positional features are retrieved. The representations learned from these self-supervised learning methods were combined with the wheel positional features for the vehicle classification task. Particularly, a random wheel masking strategy was utilized to finetune the previously learned representations in harmony with the wheel positional features during the training of the classifier. Our experiments show that the data2vec-distilled representations, which are consistent with our wheel masking strategy, outperformed the DINO counterpart, resulting in a celebrated Top-1 classification accuracy of 97.2% for classifying the 13 vehicle classes defined by the Federal Highway Administration.
Abstract:Classification using supervised learning requires annotating a large amount of classes-balanced data for model training and testing. This has practically limited the scope of applications with supervised learning, in particular deep learning. To address the issues associated with limited and imbalanced data, this paper introduces a sample-efficient co-supervised learning paradigm (SEC-CGAN), in which a conditional generative adversarial network (CGAN) is trained alongside the classifier and supplements semantics-conditioned, confidence-aware synthesized examples to the annotated data during the training process. In this setting, the CGAN not only serves as a co-supervisor but also provides complementary quality examples to aid the classifier training in an end-to-end fashion. Experiments demonstrate that the proposed SEC-CGAN outperforms the external classifier GAN (EC-GAN) and a baseline ResNet-18 classifier. For the comparison, all classifiers in above methods adopt the ResNet-18 architecture as the backbone. Particularly, for the Street View House Numbers dataset, using the 5% of training data, a test accuracy of 90.26% is achieved by SEC-CGAN as opposed to 88.59% by EC-GAN and 87.17% by the baseline classifier; for the highway image dataset, using the 10% of training data, a test accuracy of 98.27% is achieved by SEC-CGAN, compared to 97.84% by EC-GAN and 95.52% by the baseline classifier.
Abstract:Inspired by the recent success of deep learning in multiscale information encoding, we introduce a variational autoencoder (VAE) based semi-supervised method for detection of faulty traffic data, which is cast as a classification problem. Continuous wavelet transform (CWT) is applied to the time series of traffic volume data to obtain rich features embodied in time-frequency representation, followed by a twin of VAE models to separately encode normal data and faulty data. The resulting multiscale dual encodings are concatenated and fed to an attention-based classifier, consisting of a self-attention module and a multilayer perceptron. For comparison, the proposed architecture is evaluated against five different encoding schemes, including (1) VAE with only normal data encoding, (2) VAE with only faulty data encoding, (3) VAE with both normal and faulty data encodings, but without attention module in the classifier, (4) siamese encoding, and (5) cross-vision transformer (CViT) encoding. The first four encoding schemes adopted the same convolutional neural network (CNN) architecture while the fifth encoding scheme follows the transformer architecture of CViT. Our experiments show that the proposed architecture with the dual encoding scheme, coupled with attention module, outperforms other encoding schemes and results in classification accuracy of 96.4%, precision of 95.5%, and recall of 97.7%.