Kavindie Katuwandeniya

CSIRO Robotics, Clayton, Australia

'What did the Robot do in my Absence?' Video Foundation Models to Enhance Intermittent Supervision

Nov 15, 2024

Kavindie Katuwandeniya, Leimin Tian, Dana Kulić

Abstract: This paper investigates the application of Video Foundation Models (ViFMs) for generating robot data summaries to enhance intermittent human supervision of robot teams. We propose a novel framework that produces both generic and query-driven summaries of long-duration robot vision data in three modalities: storyboards, short videos, and text. Through a user study involving 30 participants, we evaluate the efficacy of these summary methods in allowing operators to accurately retrieve the observations and actions that occurred while the robot was operating without supervision over an extended duration (40 min). Our findings reveal that query-driven summaries significantly improve retrieval accuracy compared to generic summaries or raw data, albeit with increased task duration. Storyboards are found to be the most effective presentation modality, especially for object-related queries. This work represents, to our knowledge, the first zero-shot application of ViFMs for generating multi-modal robot-to-human communication in intermittent supervision contexts, demonstrating both the promise and limitations of these models in human-robot interaction (HRI) scenarios.

* This work has been submitted to the IEEE RAL for possible publication

Multi-modal Scene-compliant User Intention Estimation for Navigation

Jun 13, 2021

Kavindie Katuwandeniya, Stefan H. Kiss, Lei Shi, Jaime Valls Miro

Abstract: This work proposes a multi-modal framework to generate user intention distributions when operating a mobile vehicle. The model learns from past observed trajectories and leverages traversability information derived from the visual surroundings to produce a set of future trajectories, suitable to be directly embedded into a perception-action shared control strategy on a mobile agent, or as a safety layer to supervise the prudent operation of the vehicle. We base our solution on a conditional Generative Adversarial Network with Long Short-Term Memory cells to capture trajectory distributions conditioned on past trajectories, further fused with traversability probabilities derived from visual segmentation with a Convolutional Neural Network. The proposed data-driven framework significantly reduces the error of the predicted trajectories (versus the ground truth) relative to comparable strategies in the literature (e.g. Social-GAN) that fail to account for information other than the agent's past history. Experiments were conducted on a dataset collected with a custom wheelchair model built onto the open-source urban driving simulator CARLA, also showing that the proposed framework can be used with a small, un-annotated dataset.

* 6 pages, 6 figures
