Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Angus Fung

MLLM-Search: A Zero-Shot Approach to Finding People using Multimodal Large Language Models

Nov 27, 2024

Angus Fung, Aaron Hao Tan, Haitong Wang, Beno Benhabib, Goldie Nejat

Abstract:Robotic search of people in human-centered environments, including healthcare settings, is challenging as autonomous robots need to locate people without complete or any prior knowledge of their schedules, plans or locations. Furthermore, robots need to be able to adapt to real-time events that can influence a person's plan in an environment. In this paper, we present MLLM-Search, a novel zero-shot person search architecture that leverages multimodal large language models (MLLM) to address the mobile robot problem of searching for a person under event-driven scenarios with varying user schedules. Our approach introduces a novel visual prompting method to provide robots with spatial understanding of the environment by generating a spatially grounded waypoint map, representing navigable waypoints by a topological graph and regions by semantic labels. This is incorporated into a MLLM with a region planner that selects the next search region based on the semantic relevance to the search scenario, and a waypoint planner which generates a search path by considering the semantically relevant objects and the local spatial context through our unique spatial chain-of-thought prompting approach. Extensive 3D photorealistic experiments were conducted to validate the performance of MLLM-Search in searching for a person with a changing schedule in different environments. An ablation study was also conducted to validate the main design choices of MLLM-Search. Furthermore, a comparison study with state-of-the art search methods demonstrated that MLLM-Search outperforms existing methods with respect to search efficiency. Real-world experiments with a mobile robot in a multi-room floor of a building showed that MLLM-Search was able to generalize to finding a person in a new unseen environment.

Via

Access Paper or Ask Questions

Find Everything: A General Vision Language Model Approach to Multi-Object Search

Oct 01, 2024

Daniel Choi, Angus Fung, Haitong Wang, Aaron Hao Tan

Figure 1 for Find Everything: A General Vision Language Model Approach to Multi-Object Search

Figure 2 for Find Everything: A General Vision Language Model Approach to Multi-Object Search

Figure 3 for Find Everything: A General Vision Language Model Approach to Multi-Object Search

Figure 4 for Find Everything: A General Vision Language Model Approach to Multi-Object Search

Abstract:The Multi-Object Search (MOS) problem involves navigating to a sequence of locations to maximize the likelihood of finding target objects while minimizing travel costs. In this paper, we introduce a novel approach to the MOS problem, called Finder, which leverages vision language models (VLMs) to locate multiple objects across diverse environments. Specifically, our approach introduces multi-channel score maps to track and reason about multiple objects simultaneously during navigation, along with a score fusion technique that combines scene-level and object-level semantic correlations. Experiments in both simulated and real-world settings showed that Finder outperforms existing methods using deep reinforcement learning and VLMs. Ablation and scalability studies further validated our design choices and robustness with increasing numbers of target objects, respectively. Website: https://find-all-my-things.github.io/

* 12 pages, 6 figures, submitted to ICRA2025

Via

Access Paper or Ask Questions

LDTrack: Dynamic People Tracking by Service Robots using Diffusion Models

Feb 26, 2024

Angus Fung, Beno Benhabib, Goldie Nejat

Abstract:Tracking of dynamic people in cluttered and crowded human-centered environments is a challenging robotics problem due to the presence of intraclass variations including occlusions, pose deformations, and lighting variations. This paper introduces a novel deep learning architecture, using conditional latent diffusion models, the Latent Diffusion Track (LDTrack), for tracking multiple dynamic people under intraclass variations. By uniquely utilizing conditional latent diffusion models to capture temporal person embeddings, our architecture can adapt to appearance changes of people over time. We incorporated a latent feature encoder network which enables the diffusion process to operate within a high-dimensional latent space to allow for the extraction and spatial-temporal refinement of such rich features as person appearance, motion, location, identity, and contextual information. Extensive experiments demonstrate the effectiveness of LDTrack over other state-of-the-art tracking methods in cluttered and crowded human-centered environments under intraclass variations. Namely, the results show our method outperforms existing deep learning robotic people tracking methods in both tracking accuracy and tracking precision with statistical significance.

Via

Access Paper or Ask Questions

Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations

Mar 01, 2022

Angus Fung, Beno Benhabib, Goldie Nejat

Figure 1 for Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations

Figure 2 for Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations

Figure 3 for Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations

Figure 4 for Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations

Abstract:Robotic detection of people in crowded and/or cluttered human-centered environments including hospitals, long-term care, stores and airports is challenging as people can become occluded by other people or objects, and deform due to variations in clothing or pose. There can also be loss of discriminative visual features due to poor lighting. In this paper, we present a novel multimodal person detection architecture to address the mobile robot problem of person detection under intraclass variations. We present a two-stage training approach using 1) a unique pretraining method we define as Temporal Invariant Multimodal Contrastive Learning (TimCLR), and 2) a Multimodal Faster R-CNN (MFRCNN) detector. TimCLR learns person representations that are invariant under intraclass variations through unsupervised learning. Our approach is unique in that it generates image pairs from natural variations within multimodal image sequences, in addition to synthetic data augmentation, and contrasts crossmodal features to transfer invariances between different modalities. These pretrained features are used by the MFRCNN detector for finetuning and person detection from RGB-D images. Extensive experiments validate the performance of our DL architecture in both human-centered crowded and cluttered environments. Results show that our method outperforms existing unimodal and multimodal person detection approaches in terms of detection accuracy in detecting people with body occlusions and pose deformations in different lighting conditions.

Via

Access Paper or Ask Questions