Bailin Li

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Feb 25, 2024

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao

Abstract: A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of chain-of-thought (CoT) modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. DriveVLM-Dual achieves robust spatial understanding and real-time inference speed. Extensive experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the effectiveness of DriveVLM and the enhanced performance of DriveVLM-Dual, surpassing existing methods in complex and unpredictable driving conditions.

* Project Page: https://tsinghua-mars-lab.github.io/DriveVLM/

Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding

Dec 14, 2020

Qingxing Cao, Bailin Li, Xiaodan Liang, Keze Wang, Liang Lin

Abstract: Existing knowledge-based Visual Question Answering (VQA) datasets encourage VQA models to discover underlying knowledge by exploiting input-output correlations beyond the image and text contexts, but they are mostly annotated through crowdsourcing, e.g., by collecting questions and external reasons from different users via the internet. In addition to the challenge of knowledge reasoning, how to deal with annotator bias also remains unsolved, and this bias often leads to superficial, over-fitted correlations between questions and answers. To address this issue, we propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation. Considering that a desirable VQA model should correctly perceive the image context, understand the question, and incorporate its learned knowledge, our proposed dataset aims to cut off the shortcut learning exploited by current deep embedding models and push the research boundary of knowledge-based visual question reasoning. Specifically, we generate the question-answer pairs based on both the Visual Genome scene graph and an external knowledge base, using controlled programs to disentangle the knowledge from other biases. The programs can select one or two triplets from the scene graph or knowledge base to push multi-step reasoning, avoid answer ambiguity, and balance the answer distribution. In contrast to the existing VQA datasets, we further impose the following two major constraints on the programs to incorporate knowledge reasoning: i) multiple knowledge triplets can be related to the question, but only one knowledge triplet relates to the image object. This forces the VQA model to correctly perceive the image instead of guessing the knowledge from the given question alone; ii) all questions are based on different knowledge, but the candidate answers are the same for both the training and test sets.

* To appear in TNNLS 2021.
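
The generation procedure described in the abstract can be illustrated with a toy sketch. Everything below is hypothetical: the triplets, the question template, and the function name `generate_qa` are invented for illustration and are not taken from the paper or its dataset. The sketch only shows how one scene-graph triplet and one knowledge triplet might be chained into a two-step question, and how constraint i) could be checked (several knowledge triplets match the question's relation, but only one relates to an object in the image).

```python
# Toy illustration only: triplets and template are invented for this sketch.
scene_graph = [("man", "holds", "umbrella"), ("dog", "on", "grass")]
knowledge_base = [
    ("umbrella", "used_for", "keeping dry"),
    ("boat", "used_for", "sailing"),
    ("grass", "is_a", "plant"),
]


def generate_qa(scene_graph, knowledge_base, subject, relation, kb_relation):
    """Chain one scene-graph triplet with one knowledge triplet.

    Returns (question, answer), or None if the chain is missing or ambiguous.
    """
    # Step 1: objects in the image linked to `subject` by `relation`.
    objects = [o for s, r, o in scene_graph if s == subject and r == relation]
    # Step 2: knowledge triplets with the queried relation whose head is one
    # of those image objects. Other triplets with the same relation (here the
    # "boat" entry) are ignored because they do not relate to the image.
    matches = [t for h, r, t in knowledge_base if r == kb_relation and h in objects]
    # Reject missing or ambiguous chains so each question has a single answer.
    if len(matches) != 1:
        return None
    readable = kb_relation.replace("_", " ")
    question = f"What is the object that the {subject} {relation} {readable}?"
    return question, matches[0]


result = generate_qa(scene_graph, knowledge_base, "man", "holds", "used_for")
```

In this toy setup, a model that ignores the image cannot tell whether the answer should come from the umbrella or the boat triplet, which is the kind of shortcut the abstract says the dataset is meant to block.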

EagleEye: Fast Sub-net Evaluation for Efficient Neural Network Pruning

Jul 06, 2020

Bailin Li, Bowen Wu, Jiang Su, Guangrun Wang, Liang Lin

Abstract: Finding the computationally redundant part of a trained Deep Neural Network (DNN) is the key question that pruning algorithms target. Many algorithms try to predict the performance of pruned sub-nets by introducing various evaluation methods, but these are either inaccurate or too complicated for general application. In this work, we present a pruning method called EagleEye, in which a simple yet efficient evaluation component based on adaptive batch normalization is applied to unveil a strong correlation between different pruned DNN structures and their final settled accuracy. This strong correlation allows us to quickly spot the pruned candidates with the highest potential accuracy without actually fine-tuning them. This module is also general enough to plug into and improve some existing pruning algorithms. EagleEye achieves better pruning performance than all of the studied pruning algorithms in our experiments. Concretely, when pruning MobileNet V1 and ResNet-50, EagleEye outperforms all compared methods by up to 3.8%. Even in the more challenging experiments of pruning the compact MobileNet V1 model, EagleEye achieves the highest accuracy of 70.9% with an overall 50% of operations (FLOPs) pruned. All accuracy results are Top-1 ImageNet classification accuracy. Source code and models are available to the open-source community at https://github.com/anonymous47823493/EagleEye .

* Accepted at ECCV 2020 (Oral). Code is available at https://github.com/anonymous47823493/EagleEye
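
The abstract names adaptive batch normalization as the evaluation component but gives no details here. As a rough sketch, assuming it means re-estimating each pruned candidate's batch-normalization statistics from a few training batches (weights untouched, no fine-tuning) before scoring the candidate, the statistic update for a single channel might look like the following. The function name, the moving-average form, the momentum value, and the data are all hypothetical and not taken from the EagleEye code.

```python
from statistics import mean, pvariance


def recalibrate_bn_stats(batches, stale_mean, stale_var, momentum=0.5):
    """Re-estimate running mean/variance for one channel from a few batches.

    Assumption for this sketch: an exponential moving average that starts
    from the statistics inherited from the unpruned network.
    """
    running_mean, running_var = stale_mean, stale_var
    for batch in batches:
        # Blend the inherited statistic with the statistic of the new batch.
        running_mean = (1 - momentum) * running_mean + momentum * mean(batch)
        running_var = (1 - momentum) * running_var + momentum * pvariance(batch)
    return running_mean, running_var


# Toy activations of one channel after pruning; values are invented.
batches = [[2.0, 4.0], [2.0, 4.0]]
new_mean, new_var = recalibrate_bn_stats(batches, stale_mean=0.0, stale_var=1.0)
```

With each forward batch the running mean moves away from the inherited value toward the pruned network's actual activation mean; under the stated assumption, a candidate would be scored only after this recalibration.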

Explainable High-order Visual Question Reasoning: A New Benchmark and Knowledge-routed Network

Sep 23, 2019

Qingxing Cao, Bailin Li, Xiaodan Liang, Liang Lin

Abstract: Explanation and high-order reasoning capabilities are crucial for real-world visual question answering with diverse levels of inference complexity (e.g., what is the dog that is near the girl playing with?) and important for users to understand and diagnose the trustworthiness of the system. Current VQA benchmarks on natural images, which use only an accuracy metric, end up pushing models to exploit dataset biases and cannot provide any interpretable justification, which severely hinders advances in high-level question answering. In this work, we propose a new HVQR benchmark for evaluating explainable and high-order visual question reasoning ability with three distinguishing merits: 1) the questions often contain one or two relationship triplets, which require the model to perform multistep reasoning to predict plausible answers; 2) we provide an explicit evaluation of a multistep reasoning process that is constructed with image scene graphs and commonsense knowledge bases; and 3) each relationship triplet in a large-scale knowledge base appears only once among all questions, which poses challenges for existing networks that often attempt to overfit the knowledge already seen in the training set, and forces the models to handle unseen questions and knowledge fact usage. We also propose a new knowledge-routed modular network (KM-net) that incorporates the multistep reasoning process over a large knowledge base into visual question reasoning. An extensive dataset analysis and comparisons with existing models on the HVQR benchmark show that our benchmark provides explainable evaluations, comprehensive reasoning requirements, and realistic challenges for VQA systems, and that our KM-net is superior in terms of accuracy and explanation ability.

Interpretable Visual Question Answering by Reasoning on Dependency Trees

Sep 06, 2018

Qingxing Cao, Xiaodan Liang, Bailin Li, Liang Lin

Abstract: Collaborative reasoning for understanding each image-question pair is critical but underexplored for an interpretable visual question answering system. Although recent works have also attempted to use explicit compositional processes to assemble multiple subtasks embedded in the questions, their models rely heavily on annotations or handcrafted rules to obtain valid reasoning processes, leading to either heavy workloads or poor performance on compositional reasoning. In this paper, to better align image and language domains in diverse and unrestricted cases, we propose a novel neural network model that performs global reasoning on a dependency tree parsed from the question, and we thus name our model the parse-tree-guided reasoning network (PTGRN). This network consists of three collaborative modules: i) an attention module to exploit the local visual evidence for each word parsed from the question, ii) a gated residual composition module to compose the previously mined evidence, and iii) a parse-tree-guided propagation module to pass the mined evidence along the parse tree. Our PTGRN is thus capable of building an interpretable VQA system that gradually derives the image cues following a question-driven parse-tree reasoning route. Experiments on relational datasets demonstrate the superiority of our PTGRN over current state-of-the-art VQA methods, and the visualization results highlight the explainable capability of our reasoning system.

* 14 pages, 10 figures. arXiv admin note: text overlap with arXiv:1804.00105
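
The idea of passing mined evidence along a parse tree can be pictured with a small, generic sketch. This is not the PTGRN architecture: evidence is reduced to one number per word, the gate is a fixed constant instead of a learned function, and the tree, the values, and the function name `propagate` are hypothetical. It only shows a bottom-up traversal in which each word keeps its own evidence and adds a gated share of what its dependents have collected.

```python
def propagate(tree, node, local_evidence, gate=0.5):
    """Bottom-up pass: combine a word's own evidence with its children's.

    `tree` maps each word to the list of its dependents in the parse tree.
    """
    child_total = sum(
        propagate(tree, child, local_evidence, gate) for child in tree.get(node, [])
    )
    # Residual form: keep the local evidence, add a gated share of the children's.
    return local_evidence[node] + gate * child_total


# Invented toy dependency tree and per-word evidence scores.
tree = {"playing": ["dog", "with"], "with": ["what"]}
local_evidence = {"playing": 1.0, "dog": 2.0, "with": 0.0, "what": 4.0}
root_evidence = propagate(tree, "playing", local_evidence)
```

The value at the root depends on every word in the question, and the intermediate values at each node give a step-by-step trace of how it was assembled, which is the sense in which a tree-guided route can be inspected.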
