Abstract:In recent years, there has been increased interest in the design, training, and evaluation of end-to-end autonomous driving (AD) systems. One often overlooked aspect is the uncertainty of planned trajectories predicted by these systems, despite awareness of their own uncertainty being key to achieve safety and robustness. We propose to estimate this uncertainty by adapting loss prediction from the uncertainty quantification literature. To this end, we introduce a novel light-weight module, dubbed CATPlan, that is trained to decode motion and planning embeddings into estimates of the collision loss used to partially supervise end-to-end AD systems. During inference, these estimates are interpreted as collision risk. We evaluate CATPlan on the safety-critical, nerf-based, closed-loop benchmark NeuroNCAP and find that it manages to detect collisions with a $54.8\%$ relative improvement to average precision over a GMM-based baseline in which the predicted trajectory is compared to the forecasted trajectories of other road users. Our findings indicate that the addition of CATPlan can lead to safer end-to-end AD systems and hope that our work will spark increased interest in uncertainty quantification for such systems.
Abstract:Unsupervised image complexity representation often suffers from bias in positive sample selection and sensitivity to image content. We propose CLICv2, a contrastive learning framework that enforces content invariance for complexity representation. Unlike CLIC, which generates positive samples via cropping-introducing positive pairs bias-our shifted patchify method applies randomized directional shifts to image patches before contrastive learning. Patches at corresponding positions serve as positive pairs, ensuring content-invariant learning. Additionally, we propose patch-wise contrastive loss, which enhances local complexity representation while mitigating content interference. In order to further suppress the interference of image content, we introduce Masked Image Modeling as an auxiliary task, but we set its modeling objective as the entropy of masked patches, which recovers the entropy of the overall image by using the information of the unmasked patches, and then obtains the global complexity perception ability. Extensive experiments on IC9600 demonstrate that CLICv2 significantly outperforms existing unsupervised methods in PCC and SRCC, achieving content-invariant complexity representation without introducing positive pairs bias.
Abstract:Advancements in Large Language Models (LLMs) have opened transformative possibilities for human-robot interaction, especially in collaborative environments. However, Real-time human-AI collaboration requires agents to adapt to unseen human behaviors while maintaining effective communication dynamically. Existing benchmarks fall short in evaluating such adaptability for embodied agents, focusing mostly on the task performance of the agent itself. To address this gap, we propose a novel benchmark that assesses agents' reactive adaptability and instantaneous communication capabilities at every step. Based on this benchmark, we propose a Monitor-then-Adapt framework (MonTA), combining strong adaptability and communication with real-time execution. MonTA contains three key LLM modules, a lightweight \textit{Monitor} for monitoring the need for adaptation in high frequency, and two proficient \textit{Adapters} for subtask and path adaptation reasoning in low frequency. Our results demonstrate that MonTA outperforms other baseline agents on our proposed benchmark. Further user studies confirm the high reasonability adaptation plan and consistent language instruction provided by our framework.
Abstract:As an essential visual attribute, image complexity affects human image comprehension and directly influences the performance of computer vision tasks. However, accurately assessing and quantifying image complexity faces significant challenges. Previous works needed more generalization capabilities and well-labeled datasets to learn image complexity features. However, creating such datasets requires expensive manual labeling costs, and the models inevitably learn about human subjective biases. To address the above problems, we propose CLIC, an unsupervised framework based on contrastive learning, for learning image complexity representations. The method learns image complexity features on unlabeled data, avoiding the high labeling cost. Specifically, we propose a unique positive and negative sample selection strategy to reinforce the differences in complexity features. At the same time, we introduce an image prior-based Complexity-Aware Loss to constrain the learning process of the model. We conducted extensive experiments for verification, and the results show that CLIC can effectively learn the image complexity representation. CLIC obtained competitive results with supervised methods by fine-tuning on IC9600. In addition, CLIC applied to downstream tasks shows significant performance improvements, demonstrating the potential for application in various real-world scenarios. \href{https://github.com/xauat-liushipeng/CLIC}{code}
Abstract:Quantifying and evaluating image complexity can be instrumental in enhancing the performance of various computer vision tasks. Supervised learning can effectively learn image complexity features from well-annotated datasets. However, creating such datasets requires expensive manual annotation costs. The models may learn human subjective biases from it. In this work, we introduce the MoCo v2 framework. We utilize contrastive learning to represent image complexity, named CLIC (Contrastive Learning for Image Complexity). We find that there are complexity differences between different local regions of an image, and propose Random Crop and Mix (RCM), which can produce positive samples consisting of multi-scale local crops. RCM can also expand the train set and increase data diversity without introducing additional data. We conduct extensive experiments with CLIC, comparing it with both unsupervised and supervised methods. The results demonstrate that the performance of CLIC is comparable to that of state-of-the-art supervised methods. In addition, we establish the pipelines that can apply CLIC to computer vision tasks to effectively improve their performance.