Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rajasimman Madhivanan

HomeEmergency -- Using Audio to Find and Respond to Emergencies in the Home

Apr 01, 2025

James F. Mullen Jr, Dhruva Kumar, Xuewei Qi, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha, Richard Kim

Abstract:In the United States alone accidental home deaths exceed 128,000 per year. Our work aims to enable home robots who respond to emergency scenarios in the home, preventing injuries and deaths. We introduce a new dataset of household emergencies based in the ThreeDWorld simulator. Each scenario in our dataset begins with an instantaneous or periodic sound which may or may not be an emergency. The agent must navigate the multi-room home scene using prior observations, alongside audio signals and images from the simulator, to determine if there is an emergency or not. In addition to our new dataset, we present a modular approach for localizing and identifying potential home emergencies. Underpinning our approach is a novel probabilistic dynamic scene graph (P-DSG), where our key insight is that graph nodes corresponding to agents can be represented with a probabilistic edge. This edge, when refined using Bayesian inference, enables efficient and effective localization of agents in the scene. We also utilize multi-modal vision-language models (VLMs) as a component in our approach, determining object traits (e.g. flammability) and identifying emergencies. We present a demonstration of our method completing a real-world version of our task on a consumer robot, showing the transferability of both our task and our method. Our dataset will be released to the public upon this papers publication.

Via

Access Paper or Ask Questions

ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera

Oct 14, 2024

Jing Liang, He Yin, Xuewei Qi, Jong Jin Park, Min Sun, Rajasimman Madhivanan, Dinesh Manocha

Figure 1 for ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera

Figure 2 for ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera

Figure 3 for ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera

Figure 4 for ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera

Abstract:We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from single RGB observation while simultaneously providing uncertainty estimates for semantic predictions. By designing a triplane-based deformable attention mechanism, our approach improves geometric understanding of the scene than other SOTA approaches and reduces noise in semantic predictions. Additionally, through the use of a Conditional Variational AutoEncoder (CVAE), we estimate the uncertainties of these predictions. The generated semantic and uncertainty maps will aid in the formulation of navigation strategies that facilitate safe and permissible decision-making in the future. Evaluated on the Semantic-KITTI dataset, ET-Former achieves the highest IoU and mIoU, surpassing other methods by 15.16% in IoU and 24.24% in mIoU, while reducing GPU memory usage of existing methods by 25%-50.5%.

Via

Access Paper or Ask Questions

VLPG-Nav: Object Navigation Using Visual Language Pose Graph and Object Localization Probability Maps

Aug 15, 2024

Senthil Hariharan Arul, Dhruva Kumar, Vivek Sugirtharaj, Richard Kim, Xuewei, Qi, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha

Figure 1 for VLPG-Nav: Object Navigation Using Visual Language Pose Graph and Object Localization Probability Maps

Figure 2 for VLPG-Nav: Object Navigation Using Visual Language Pose Graph and Object Localization Probability Maps

Figure 3 for VLPG-Nav: Object Navigation Using Visual Language Pose Graph and Object Localization Probability Maps

Figure 4 for VLPG-Nav: Object Navigation Using Visual Language Pose Graph and Object Localization Probability Maps

Abstract:We present VLPG-Nav, a visual language navigation method for guiding robots to specified objects within household scenes. Unlike existing methods primarily focused on navigating the robot toward objects, our approach considers the additional challenge of centering the object within the robot's camera view. Our method builds a visual language pose graph (VLPG) that functions as a spatial map of VL embeddings. Given an open vocabulary object query, we plan a viewpoint for object navigation using the VLPG. Despite navigating to the viewpoint, real-world challenges like object occlusion, displacement, and the robot's localization error can prevent visibility. We build an object localization probability map that leverages the robot's current observations and prior VLPG. When the object isn't visible, the probability map is updated and an alternate viewpoint is computed. In addition, we propose an object-centering formulation that locally adjusts the robot's pose to center the object in the camera view. We evaluate the effectiveness of our approach through simulations and real-world experiments, evaluating its ability to successfully view and center the object within the camera field of view. VLPG-Nav demonstrates improved performance in locating the object, navigating around occlusions, and centering the object within the robot's camera view, outperforming the selected baselines in the evaluation metrics.

Via

Access Paper or Ask Questions

LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

May 08, 2024

Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha

Figure 1 for LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

Figure 2 for LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

Figure 3 for LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

Figure 4 for LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

Abstract:In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38 - 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and real world, showing 5% and 16.67% improvement in terms of navigation success rate, respectively.

* Accepted to ICRA 2024

Via

Access Paper or Ask Questions

Learning to View: Decision Transformers for Active Object Detection

Jan 23, 2023

Wenhao Ding, Nathalie Majcherczyk, Mohit Deshpande, Xuewei Qi, Ding Zhao, Rajasimman Madhivanan, Arnie Sen

Figure 1 for Learning to View: Decision Transformers for Active Object Detection

Figure 2 for Learning to View: Decision Transformers for Active Object Detection

Figure 3 for Learning to View: Decision Transformers for Active Object Detection

Figure 4 for Learning to View: Decision Transformers for Active Object Detection

Abstract:Active perception describes a broad class of techniques that couple planning and perception systems to move the robot in a way to give the robot more information about the environment. In most robotic systems, perception is typically independent of motion planning. For example, traditional object detection is passive: it operates only on the images it receives. However, we have a chance to improve the results if we allow planning to consume detection signals and move the robot to collect views that maximize the quality of the results. In this paper, we use reinforcement learning (RL) methods to control the robot in order to obtain images that maximize the detection quality. Specifically, we propose using a Decision Transformer with online fine-tuning, which first optimizes the policy with a pre-collected expert dataset and then improves the learned policy by exploring better solutions in the environment. We evaluate the performance of proposed method on an interactive dataset collected from an indoor scenario simulator. Experimental results demonstrate that our method outperforms all baselines, including expert policy and pure offline RL methods. We also provide exhaustive analyses of the reward distribution and observation space.

* Accepted to ICRA 2023

Via

Access Paper or Ask Questions