Katrin Renz

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

Mar 12, 2025

Katrin Renz, Long Chen, Elahe Arani, Oleg Sinavski

Abstract: Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding, and achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to tackle vision-language understanding is using visual question answering. However, for autonomous driving, this is only useful if it is aligned with the action space. Otherwise, the model's answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model SimLingo is based on a vision language model (VLM) and uses only camera input, excluding expensive sensors like LiDAR. SimLingo obtains state-of-the-art performance on the widely used CARLA simulator on the Bench2Drive benchmark and is the winning entry at the CARLA Challenge 2024. Additionally, we achieve strong results in a wide variety of language-related tasks while maintaining high driving performance.

* CVPR 2025. 1st Place @ CARLA Challenge 2024. Challenge tech report (preliminary version of SimLingo): arXiv:2406.10165

CarLLaVA: Vision language models for camera-only closed-loop driving

Jun 14, 2024

Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, Oleg Sinavski

Abstract: In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, getting the advantages of the path for better lateral control and the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.

* Outstanding Champion & Innovation Award @ CARLA Autonomous Driving Challenge 2024; Project video: https://youtu.be/E1nsEgcHRuc
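
The CarLLaVA abstract above attributes lateral control to the predicted path and longitudinal control to the predicted waypoints. The following is a minimal sketch of how two such outputs could feed a controller; it is not taken from the report. The function names, the ego-frame convention (x forward, y left, in metres), the fixed time spacing `dt` between waypoints, and the lookahead rule are all assumptions made for illustration.

```python
import math


def target_speed(waypoints, dt):
    """Mean speed implied by waypoints spaced dt seconds apart (longitudinal cue).

    Hypothetical helper: assumes waypoints are (x, y) tuples in the ego frame.
    """
    if len(waypoints) < 2:
        return 0.0
    dist = sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))
    return dist / (dt * (len(waypoints) - 1))


def steering_angle(path, lookahead):
    """Angle (radians) to the first path point at least `lookahead` metres away.

    Hypothetical helper: `path` must be a non-empty list of (x, y) tuples
    in the ego frame. Falls back to the last point if the path is short.
    """
    for x, y in path:
        if math.hypot(x, y) >= lookahead:
            return math.atan2(y, x)
    x, y = path[-1]
    return math.atan2(y, x)
```

In this sketch the path carries no timing information, so it only determines where to steer, while the time-spaced waypoints only determine how fast to go. That separation is one way to read "semi-disentangled"; the report's actual controller may differ.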

Can Vehicle Motion Planning Generalize to Realistic Long-tail Scenarios?

Apr 11, 2024

Marcel Hallgarten, Julian Zapata, Martin Stoll, Katrin Renz, Andreas Zell

Abstract: Real-world autonomous driving systems must make safe decisions in the face of rare and diverse traffic scenarios. Current state-of-the-art planners are mostly evaluated on real-world datasets like nuScenes (open-loop) or nuPlan (closed-loop). In particular, nuPlan seems to be an expressive evaluation method since it is based on real-world data and closed-loop, yet it mostly covers basic driving scenarios. This makes it difficult to judge a planner's capabilities to generalize to rarely-seen situations. Therefore, we propose a novel closed-loop benchmark interPlan containing several edge cases and challenging driving scenarios. We assess existing state-of-the-art planners on our benchmark and show that neither rule-based nor learning-based planners can safely navigate the interPlan scenarios. A recently evolving direction is the usage of foundation models like large language models (LLMs) to handle generalization. We evaluate an LLM-only planner and introduce a novel hybrid planner that combines an LLM-based behavior planner with a rule-based motion planner that achieves state-of-the-art performance on our benchmark.

DriveLM: Driving with Graph Visual Question Answering

Dec 21, 2023

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li

Abstract: We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users. While recent approaches adapt VLMs to driving via single-round visual question answering (VQA), human drivers reason about decisions in multiple steps. Starting from the localization of key objects, humans estimate object interactions before taking actions. The key insight is that with our proposed task, Graph VQA, where we model graph-structured reasoning through perception, prediction and planning question-answer pairs, we obtain a suitable proxy task to mimic the human reasoning process. We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving. The experiments demonstrate that Graph VQA provides a simple, principled framework for reasoning about a driving scene, and DriveLM-Data provides a challenging benchmark for this task. Our DriveLM-Agent baseline performs end-to-end autonomous driving competitively in comparison to state-of-the-art driving-specific architectures. Notably, its benefits are pronounced when it is evaluated zero-shot on unseen objects or sensor configurations. We hope this work can be the starting point to shed new light on how to apply VLMs for autonomous driving. To facilitate future research, all code, data, and models are available to the public.

On Offline Evaluation of 3D Object Detection for Autonomous Driving

Aug 24, 2023

Tim Schreier, Katrin Renz, Andreas Geiger, Kashyap Chitta

Abstract: Prior work in 3D object detection evaluates models using offline metrics like average precision since closed-loop online evaluation on the downstream driving task is costly. However, it is unclear how indicative offline results are of driving performance. In this work, we perform the first empirical evaluation measuring how predictive different detection metrics are of driving performance when detectors are integrated into a full self-driving stack. We conduct extensive experiments on urban driving in the CARLA simulator using 16 object detection models. We find that the nuScenes Detection Score has a higher correlation to driving performance than the widely used average precision metric. In addition, our results call for caution on the exclusive reliance on the emerging class of 'planner-centric' metrics.

* Appears in: IEEE International Conference on Computer Vision (ICCV'23) Workshops
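
The abstract above compares detection metrics by how strongly they correlate with driving performance across a set of detectors. As a rough illustration of that kind of analysis, the sketch below computes a Pearson correlation between one offline metric and a driving score, with one value per detector. Both the choice of Pearson's coefficient and the function name are assumptions here; the paper may use a different correlation measure.

```python
import math


def pearson(metric_values, driving_scores):
    """Pearson correlation between an offline metric and a driving score.

    Each list holds one value per detection model, in the same order.
    Illustrative helper, not the paper's evaluation code.
    """
    n = len(metric_values)
    if n != len(driving_scores) or n < 2:
        raise ValueError("need two equally long lists with at least 2 entries")
    mean_m = sum(metric_values) / n
    mean_d = sum(driving_scores) / n
    cov = sum((m - mean_m) * (d - mean_d)
              for m, d in zip(metric_values, driving_scores))
    std_m = math.sqrt(sum((m - mean_m) ** 2 for m in metric_values))
    std_d = math.sqrt(sum((d - mean_d) ** 2 for d in driving_scores))
    if std_m == 0 or std_d == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_m * std_d)
```

Running this once per candidate metric over the same set of detectors gives one coefficient per metric, and the metric with the higher coefficient is the better predictor of driving performance in the sense the abstract describes.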

PlanT: Explainable Planning Transformers via Object-Level Representations

Oct 25, 2022

Katrin Renz, Kashyap Chitta, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata, Andreas Geiger

Abstract: Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. In this paper, we propose PlanT, a novel approach for planning in the context of self-driving that uses a standard transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. On the Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the driving score of the expert) while being 5.3x faster than equivalent pixel-based planning baselines during inference. Combining PlanT with an off-the-shelf perception module provides a sensor-based driving system that is more than 10 points better in terms of driving score than the existing state of the art. Furthermore, we propose an evaluation protocol to quantify the ability of planners to identify relevant objects, providing insights regarding their decision-making. Our results indicate that PlanT can focus on the most relevant object in the scene, even when this object is geometrically distant.

* CoRL 2022. Project Page: https://www.katrinrenz.de/plant/

TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving

May 31, 2022

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, Andreas Geiger

Abstract: How should we integrate representations from complementary sensors for autonomous driving? Geometry-based fusion has shown promise for perception (e.g. object detection, motion forecasting). However, in the context of end-to-end driving, we find that imitation learning based on existing sensor fusion methods underperforms in complex driving scenarios with a high density of dynamic agents. Therefore, we propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention. Our approach uses transformer modules at multiple resolutions to fuse perspective view and bird's eye view feature maps. We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as the official leaderboard of the CARLA urban driving simulator. At the time of submission, TransFuser outperforms all prior work on the CARLA leaderboard in terms of driving score by a large margin. Compared to geometry-based fusion, TransFuser reduces the average collisions per kilometer by 48%.

* arXiv admin note: text overlap with arXiv:2104.09224

KING: Generating Safety-Critical Driving Scenarios for Robust Imitation via Kinematics Gradients

Apr 28, 2022

Niklas Hanselmann, Katrin Renz, Kashyap Chitta, Apratim Bhattacharyya, Andreas Geiger

Abstract: Simulators offer the possibility of safe, low-cost development of self-driving systems. However, current driving simulators exhibit naïve behavior models for background traffic. Hand-tuned scenarios are typically added during simulation to induce safety-critical situations. An alternative approach is to adversarially perturb the background traffic trajectories. In this paper, we study this approach to safety-critical driving scenario generation using the CARLA simulator. We use a kinematic bicycle model as a proxy to the simulator's true dynamics and observe that gradients through this proxy model are sufficient for optimizing the background traffic trajectories. Based on this finding, we propose KING, which generates safety-critical driving scenarios with a 20% higher success rate than black-box optimization. By solving the scenarios generated by KING using a privileged rule-based expert algorithm, we obtain training data for an imitation learning policy. After fine-tuning on this new data, we show that the policy becomes better at avoiding collisions. Importantly, our generated data leads to reduced collisions on both held-out scenarios generated via KING as well as traditional hand-crafted scenarios, demonstrating improved robustness.
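
The KING abstract above relies on a kinematic bicycle model as a differentiable proxy for the simulator's dynamics. For readers unfamiliar with it, here is a minimal sketch of one common textbook formulation (rear-axle reference point, forward Euler integration). The wheelbase and time step defaults are arbitrary placeholder values, and the exact variant used in the paper may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class State:
    x: float    # position along the world x-axis, metres
    y: float    # position along the world y-axis, metres
    yaw: float  # heading, radians
    v: float    # forward speed, metres per second


def bicycle_step(s, accel, steer, wheelbase=2.9, dt=0.1):
    """Advance the state by one Euler step of the kinematic bicycle model.

    `wheelbase` and `dt` are placeholder defaults, not values from the paper.
    """
    return State(
        x=s.x + s.v * math.cos(s.yaw) * dt,
        y=s.y + s.v * math.sin(s.yaw) * dt,
        yaw=s.yaw + s.v / wheelbase * math.tan(steer) * dt,
        v=s.v + accel * dt,
    )


def rollout(s, controls, **kwargs):
    """Apply a sequence of (accel, steer) pairs and return all visited states."""
    trajectory = [s]
    for accel, steer in controls:
        s = bicycle_step(s, accel, steer, **kwargs)
        trajectory.append(s)
    return trajectory
```

Every operation in the update is smooth in the controls, so writing the same step in an automatic differentiation framework would let a collision-related cost on the rolled-out trajectory be differentiated with respect to the background agents' controls. That is the property the abstract exploits.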

Sign Segmentation with Changepoint-Modulated Pseudo-Labelling

Apr 28, 2021

Katrin Renz, Nicolaj C. Stache, Neil Fox, Gül Varol, Samuel Albanie

Abstract: The objective of this work is to find temporal boundaries between signs in continuous sign language. Motivated by the paucity of annotation available for this task, we propose a simple yet effective algorithm to improve segmentation performance on unlabelled signing footage from a domain of interest. We make the following contributions: (1) We motivate and introduce the task of source-free domain adaptation for sign language segmentation, in which labelled source data is available for an initial training phase, but is not available during adaptation. (2) We propose the Changepoint-Modulated Pseudo-Labelling (CMPL) algorithm to leverage cues from abrupt changes in motion-sensitive feature space to improve pseudo-labelling quality for adaptation. (3) We showcase the effectiveness of our approach for category-agnostic sign segmentation, transferring from the BSLCORPUS to the BSL-1K and RWTH-PHOENIX-Weather 2014 datasets, where we outperform the prior state of the art.

* Appears in: 2021 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW'21). 11 pages

Sign language segmentation with temporal convolutional networks

Nov 25, 2020

Katrin Renz, Nicolaj C. Stache, Samuel Albanie, Gül Varol

Abstract: The objective of this work is to determine the location of temporal boundaries between signs in continuous sign language videos. Our approach employs 3D convolutional neural network representations with iterative temporal segment refinement to resolve ambiguities between sign boundary cues. We demonstrate the effectiveness of our approach on the BSLCORPUS, PHOENIX14 and BSL-1K datasets, showing considerable improvement over the prior state of the art and the ability to generalise to new signers, languages and domains.
