Haonan Luo

Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration

Dec 24, 2024

Zhixuan Shen, Haonan Luo, Kexun Chen, Fengmao Lv, Tianrui Li

Abstract: Understanding how humans cooperatively utilize semantic knowledge to explore unfamiliar environments and decide on navigation directions is critical for house service multi-robot systems. Previous methods primarily focused on single-robot centralized planning strategies, which severely limited exploration efficiency. Recent research has considered decentralized planning strategies for multiple robots, assigning separate planning models to each robot, but these approaches often overlook communication costs. In this work, we propose Multimodal Chain-of-Thought Co-Navigation (MCoCoNav), a modular approach that utilizes multimodal Chain-of-Thought to plan collaborative semantic navigation for multiple robots. MCoCoNav combines visual perception with Vision Language Models (VLMs) to evaluate exploration value through probabilistic scoring, thus reducing time costs and achieving stable outputs. Additionally, a global semantic map is used as a communication bridge, minimizing communication overhead while integrating observational results. Guided by scores that reflect exploration trends, robots utilize this map to assess whether to explore new frontier points or revisit history nodes. Experiments on HM3D_v0.2 and MP3D demonstrate the effectiveness of our approach. Our code is available at https://github.com/FrankZxShen/MCoCoNav.git.

* 16 pages, 10 figures, Extended Version of accepted AAAI 2025 Paper

Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering

Mar 14, 2024

Zhixuan Shen, Haonan Luo, Sijia Li, Tianrui Li

Abstract: Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text information often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial training in the embedding space of the OCR modality to enhance fault-tolerant representation of OCR texts, thereby reducing noise caused by OCR errors. Simultaneously, we add a Spatial-Aware Self-Attention (SASA) mechanism to help the model better capture the spatial relationships among OCR tokens. Various experiments demonstrate that our method achieves significant performance improvements on both the ST-VQA and TextVQA datasets and provides a novel paradigm for multimodal adversarial training.

* 6 pages, 3 figures, accepted by 2024 IEEE International Conference on Multimedia and Expo
