Jianhao Yuan

Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models

Nov 09, 2024

Arshia Hemmat, Adam Davies, Tom A. Lamb, Jianhao Yuan, Philip Torr, Ashkan Khakzar, Francesco Pinto

Abstract: Despite the importance of shape perception in human vision, early neural image classifiers relied less on shape information for object recognition than other (often spurious) features. While recent research suggests that current large Vision-Language Models (VLMs) exhibit more reliance on shape, we find them to still be seriously limited in this regard. To quantify such limitations, we introduce IllusionBench, a dataset that challenges current cutting-edge VLMs to decipher shape information when the shape is represented by an arrangement of visual elements in a scene. Our extensive evaluations reveal that, while these shapes are easily detectable by human annotators, current VLMs struggle to recognize them, indicating important avenues for future work in developing more robust visual perception systems. The full dataset and codebase are available at: https://arshiahemmat.github.io/illusionbench/

SpatialBot: Precise Spatial Understanding with Vision Language Models

Jun 19, 2024

Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, Bo Zhao

Abstract: Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding both RGB and depth images. Additionally, we have constructed the SpatialQA dataset, which involves multi-level depth-related questions to train VLMs for depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' capabilities in spatial understanding at different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks, and Embodied AI tasks demonstrate the remarkable improvements of SpatialBot trained on SpatialQA. The model, code and data are available at https://github.com/BAAI-DCAI/SpatialBot.

kNN-CLIP: Retrieval Enables Training-Free Segmentation on Continually Expanding Large Vocabularies

Apr 15, 2024

Zhongrui Gui, Shuyang Sun, Runjia Li, Jianhao Yuan, Zhaochong An, Karsten Roth, Ameya Prabhu, Philip Torr

Abstract: Rapid advancements in continual segmentation have yet to bridge the gap of scaling to large, continually expanding vocabularies under compute-constrained scenarios. We discover that traditional continual training leads to catastrophic forgetting under compute constraints and is unable to outperform zero-shot segmentation methods. We introduce a novel strategy for semantic and panoptic segmentation with zero forgetting, capable of adapting to continually growing vocabularies without the need for retraining or large memory costs. Our training-free approach, kNN-CLIP, leverages a database of instance embeddings to enable open-vocabulary segmentation approaches to continually expand their vocabulary on any given domain with a single pass through the data, while storing only embeddings, minimizing both compute and memory costs. This method achieves state-of-the-art mIoU performance across large-vocabulary semantic and panoptic segmentation datasets. We hope kNN-CLIP represents a step forward in enabling more efficient and adaptable continual segmentation, paving the way for advances in real-world large-vocabulary continual segmentation methods.

* 10 pages, 3 figures
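
The retrieval idea in the kNN-CLIP abstract (a database of embeddings that grows without retraining) can be illustrated with a toy sketch. This is not the authors' implementation: the `EmbeddingDatabase` class, the cosine-similarity majority vote, and the two-dimensional vectors are all assumptions made for illustration, and a real system would use embeddings from a vision-language model.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EmbeddingDatabase:
    """Hypothetical store of (embedding, label) pairs; nothing is ever retrained."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, label):
        # Expanding the vocabulary is just appending a new labelled embedding.
        self.entries.append((list(embedding), label))

    def query(self, embedding, k=3):
        # Rank stored embeddings by similarity and take a majority vote over the top k.
        if not self.entries:
            return None
        ranked = sorted(
            self.entries, key=lambda entry: cosine(embedding, entry[0]), reverse=True
        )
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]


db = EmbeddingDatabase()
db.add([1.0, 0.0], "cat")
db.add([0.9, 0.1], "cat")
db.add([0.0, 1.0], "dog")

# A new class is added later with a single append; earlier entries are untouched.
db.add([-1.0, 0.0], "tram")
```

Because earlier entries are never modified, adding "tram" cannot degrade retrieval of "cat" or "dog" in this toy setting, which is the sense in which a retrieval store avoids forgetting.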

SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model

Mar 05, 2024

Bin Cao, Jianhao Yuan, Yexin Liu, Jian Li, Shuyang Sun, Jing Liu, Bo Zhao

Abstract: In the rapidly evolving area of image synthesis, a serious challenge is the presence of complex artifacts that compromise the perceptual realism of synthetic images. To alleviate artifacts and improve the quality of synthetic images, we fine-tune a Vision-Language Model (VLM) as an artifact classifier to automatically identify and classify a wide range of artifacts and provide supervision for further optimizing generative models. Specifically, we develop a comprehensive artifact taxonomy and construct a dataset of synthetic images with artifact annotations for fine-tuning the VLM, named SynArtifact-1K. The fine-tuned VLM exhibits a superior ability to identify artifacts and outperforms the baseline by 25.66%. To our knowledge, this is the first time such an end-to-end artifact classification task and solution have been proposed. Finally, we leverage the output of the VLM as feedback to refine the generative model for alleviating artifacts. Visualization results and a user study demonstrate that the quality of images synthesized by the refined diffusion model has been clearly improved.

Efficient Multimodal Learning from Data-centric Perspective

Feb 18, 2024

Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, Bo Zhao

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference, limiting accessibility to the broader research and user communities. A straightforward solution is to leverage smaller pre-trained vision and language models, which inevitably causes a significant performance drop. In this paper, we demonstrate that it is possible to beat the scaling law and train a smaller but better MLLM by exploring more informative training data. Specifically, we introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning from condensed training data. Remarkably, our Bunny-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-v1.5-13B, on multiple benchmarks. The code, models and data can be found at https://github.com/BAAI-DCAI/Bunny.

RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model

Feb 16, 2024

Jianhao Yuan, Shuyang Sun, Daniel Omeiza, Bo Zhao, Paul Newman, Lars Kunze, Matthew Gadd

Abstract: Robots powered by 'blackbox' models need to provide human-understandable explanations which we can trust. Hence, explainability plays a critical role in trustworthy autonomous decision-making to foster transparency and acceptance among end users, especially in complex autonomous driving. Recent advancements in Multi-Modal Large Language Models (MLLMs) have shown promising potential in enhancing the explainability of a driving agent by producing control predictions along with natural language explanations. However, severe data scarcity due to expensive annotation costs and significant domain gaps between different datasets make the development of a robust and generalisable system an extremely challenging task. Moreover, the prohibitively expensive training requirements of MLLMs and the unsolved problem of catastrophic forgetting further limit their generalisability post-deployment. To address these challenges, we present RAG-Driver, a novel retrieval-augmented multi-modal large language model that leverages in-context learning for high-performance, explainable, and generalisable autonomous driving. By grounding in retrieved expert demonstrations, we empirically validate that RAG-Driver achieves state-of-the-art performance in producing driving action explanations, justifications, and control signal prediction. More importantly, it exhibits exceptional zero-shot generalisation capabilities to unseen environments without further training endeavours.

* 13 pages, 6 figures

Real-Fake: Effective Training Data Synthesis Through Distribution Matching

Oct 16, 2023

Jianhao Yuan, Jie Zhang, Shuyang Sun, Philip Torr, Bo Zhao

Abstract: Synthetic training data has gained prominence in numerous learning tasks and scenarios, offering advantages such as dataset augmentation, generalization evaluation, and privacy preservation. Despite these benefits, the efficiency of synthetic data generated by current methodologies remains inferior when training advanced deep models exclusively, limiting its practical utility. To address this challenge, we analyze the principles underlying training data synthesis for supervised learning and elucidate a principled theoretical framework from the distribution-matching perspective that explicates the mechanisms governing synthesis efficacy. Through extensive experiments, we demonstrate the effectiveness of our synthetic data across diverse image classification tasks, both as a replacement for and augmentation to real datasets, while also benefiting challenging tasks such as out-of-distribution generalization and privacy preservation.

* Code released at (https://github.com/BAAI-DCAI/Training-Data-Synthesis)

Off the Radar: Uncertainty-Aware Radar Place Recognition with Introspective Querying and Map Maintenance

Jun 21, 2023

Jianhao Yuan, Paul Newman, Matthew Gadd

Abstract: Localisation with Frequency-Modulated Continuous-Wave (FMCW) radar has gained increasing interest due to its inherent resistance to challenging environments. However, complex artefacts of the radar measurement process require appropriate uncertainty estimation to ensure the safe and reliable application of this promising sensor modality. In this work, we propose a multi-session map management system which constructs the best maps for further localisation based on learned variance properties in an embedding space. Using the same variance properties, we also propose a new way to introspectively reject localisation queries that are likely to be incorrect. For this, we apply robust noise-aware metric learning, which both leverages the short-timescale variability of radar data along a driven path (for data augmentation) and predicts the downstream uncertainty in metric-space-based place recognition. We prove the effectiveness of our method over extensive cross-validated tests on the Oxford Radar RobotCar and MulRan datasets. In this, we outperform the current state-of-the-art in radar place recognition and other uncertainty-aware methods when using only single nearest-neighbour queries. We also show consistent performance increases when rejecting queries based on uncertainty over a difficult test environment, which we did not observe for a competing uncertainty-aware place recognition system.

* International Conference on Intelligent Robots and Systems (IROS) 2023
* 8 pages, 6 figures
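
The introspective rejection described in the abstract above (discarding localisation queries that are likely to be incorrect) can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's method: the function names, the scalar `query_variance`, and the fixed `max_variance` threshold are hypothetical, and the learned variance model itself is not reproduced here.

```python
def nearest_reference(query, references):
    """Return (index, squared Euclidean distance) of the closest reference embedding."""
    best_index, best_distance = None, float("inf")
    for index, reference in enumerate(references):
        distance = sum((q - r) ** 2 for q, r in zip(query, reference))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


def localise(query, query_variance, references, max_variance):
    """Single nearest-neighbour place recognition with uncertainty-based rejection.

    Returns the index of the matched map entry, or None when the predicted
    variance for the query exceeds the (assumed) threshold.
    """
    if query_variance > max_variance:
        return None  # reject: the query is judged too uncertain to trust
    index, _ = nearest_reference(query, references)
    return index


# Toy map of two places in a 2-D embedding space (values are made up).
reference_map = [[0.0, 0.0], [1.0, 1.0]]
confident_match = localise([0.9, 1.1], 0.05, reference_map, max_variance=0.2)
rejected_match = localise([0.9, 1.1], 0.50, reference_map, max_variance=0.2)
```

In this toy setting the same query is matched when its predicted variance is low and rejected when it is high, so the threshold trades recall for a lower rate of wrong matches.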

Not Just Pretty Pictures: Text-to-Image Generators Enable Interpretable Interventions for Robust Representations

Dec 21, 2022

Jianhao Yuan, Francesco Pinto, Adam Davies, Aarushi Gupta, Philip Torr

Abstract: Neural image classifiers are known to undergo severe performance degradation when exposed to input that exhibits covariate shift with respect to the training distribution. Successful hand-crafted augmentation pipelines aim either to approximate the expected test domain conditions or to perturb the features that are specific to the training environment. The development of effective pipelines is typically cumbersome, and produces transformations whose impact on classifier performance is hard to understand and control. In this paper, we show that recent Text-to-Image (T2I) generators' ability to simulate image interventions via natural-language prompts can be leveraged to train more robust models, offering a more interpretable and controllable alternative to traditional augmentation methods. We find that a variety of prompting mechanisms are effective for producing synthetic training data sufficient to achieve state-of-the-art performance in widely-adopted domain-generalization benchmarks and reduce classifiers' dependency on spurious features. Our work suggests that further progress in T2I generation and a tighter integration with other research fields may represent a significant step towards the development of more robust machine learning systems.
