Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Motonari Kambara

Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement

Jan 28, 2025

Kei Katsumata, Motonari Kambara, Daichi Yashima, Ryosuke Korekata, Komei Sugiura

Abstract:We consider the problem of generating free-form mobile manipulation instructions based on a target object image and receptacle image. Conventional image captioning models are not able to generate appropriate instructions because their architectures are typically optimized for single-image. In this study, we propose a model that handles both the target object and receptacle to generate free-form instruction sentences for mobile manipulation tasks. Moreover, we introduce a novel training method that effectively incorporates the scores from both learning-based and n-gram based automatic evaluation metrics as rewards. This method enables the model to learn the co-occurrence relationships between words and appropriate paraphrases. Results demonstrate that our proposed method outperforms baseline methods including representative multimodal large language models on standard automatic evaluation metrics. Moreover, physical experiments reveal that using our method to augment data on language instructions improves the performance of an existing multimodal language understanding model for mobile manipulation.

* Accepted for IEEE RA-L 2025

Via

Access Paper or Ask Questions

Future Success Prediction in Open-Vocabulary Object Manipulation Tasks Based on End-Effector Trajectories

Jan 08, 2025

Motonari Kambara, Komei Sugiura

Figure 1 for Future Success Prediction in Open-Vocabulary Object Manipulation Tasks Based on End-Effector Trajectories

Figure 2 for Future Success Prediction in Open-Vocabulary Object Manipulation Tasks Based on End-Effector Trajectories

Figure 3 for Future Success Prediction in Open-Vocabulary Object Manipulation Tasks Based on End-Effector Trajectories

Figure 4 for Future Success Prediction in Open-Vocabulary Object Manipulation Tasks Based on End-Effector Trajectories

Abstract:This study addresses a task designed to predict the future success or failure of open-vocabulary object manipulation. In this task, the model is required to make predictions based on natural language instructions, egocentric view images before manipulation, and the given end-effector trajectories. Conventional methods typically perform success prediction only after the manipulation is executed, limiting their efficiency in executing the entire task sequence. We propose a novel approach that enables the prediction of success or failure by aligning the given trajectories and images with natural language instructions. We introduce Trajectory Encoder to apply learnable weighting to the input trajectories, allowing the model to consider temporal dynamics and interactions between objects and the end effector, improving the model's ability to predict manipulation outcomes accurately. We constructed a dataset based on the RT-1 dataset, a large-scale benchmark for open-vocabulary object manipulation tasks, to evaluate our method. The experimental results show that our method achieved a higher prediction accuracy than baseline approaches.

* Accepted for presentation at LangRob @ CoRL 2024

Via

Access Paper or Ask Questions

Task Success Prediction and Open-Vocabulary Object Manipulation

Dec 26, 2024

Motonari Kambara, Komei Sugiura

Figure 1 for Task Success Prediction and Open-Vocabulary Object Manipulation

Figure 2 for Task Success Prediction and Open-Vocabulary Object Manipulation

Figure 3 for Task Success Prediction and Open-Vocabulary Object Manipulation

Figure 4 for Task Success Prediction and Open-Vocabulary Object Manipulation

* Accepted for presentation at LangRob @ CoRL 2024

Via

Access Paper or Ask Questions

Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations

Oct 01, 2024

Miyu Goko, Motonari Kambara, Daichi Saito, Seitaro Otsuki, Komei Sugiura

Abstract:In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the representation between two images. We evaluate Contrastive $\lambda$-Repformer on a dataset based on a large-scale standard dataset, the RT-1 dataset, and on a physical robot platform. The results show that our approach outperformed existing approaches including MLLMs. Our best model achieved an improvement of 8.66 points in accuracy compared to the representative MLLM-based model.

* Accepted for presentation at CoRL2024

Via

Access Paper or Ask Questions

Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks

Jul 18, 2024

Takumi Komatsu, Motonari Kambara, Shumpei Hatanaka, Haruka Matsuo, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura

Figure 1 for Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks

Figure 2 for Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks

Figure 3 for Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks

Figure 4 for Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks

Abstract:Domestic service robots (DSRs) that support people in everyday environments have been widely investigated. However, their ability to predict and describe future risks resulting from their own actions remains insufficient. In this study, we focus on the linguistic explainability of DSRs. Most existing methods do not explicitly model the region of possible collisions; thus, they do not properly generate descriptions of these regions. In this paper, we propose the Nearest Neighbor Future Captioning Model that introduces the Nearest Neighbor Language Model for future captioning of possible collisions, which enhances the model output with a nearest neighbors retrieval mechanism. Furthermore, we introduce the Collision Attention Module that attends regions of possible collisions, which enables our model to generate descriptions that adequately reflect the objects associated with possible collisions. To validate our method, we constructed a new dataset containing samples of collisions that can occur when a DSR places an object in a simulation environment. The experimental results demonstrated that our method outperformed baseline methods, based on the standard metrics. In particular, on CIDEr-D, the baseline method obtained 25.09 points, whereas our method obtained 33.08 points.

* Accepted for presentation at Advanced Robotics 24

Via

Access Paper or Ask Questions

Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

Jul 01, 2024

Takayuki Nishimura, Katsuyuki Kuyo, Motonari Kambara, Komei Sugiura

Figure 1 for Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

Figure 2 for Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

Figure 3 for Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

Figure 4 for Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models

Abstract:We consider the task of generating segmentation masks for the target object from an object manipulation instruction, which allows users to give open vocabulary instructions to domestic service robots. Conventional segmentation generation approaches often fail to account for objects outside the camera's field of view and cases in which the order of vertices differs but still represents the same polygon, which leads to erroneous mask generation. In this study, we propose a novel method that generates segmentation masks from open vocabulary instructions. We implement a novel loss function using optimal transport to prevent significant loss where the order of vertices differs but still represents the same polygon. To evaluate our approach, we constructed a new dataset based on the REVERIE dataset and Matterport3D dataset. The results demonstrated the effectiveness of the proposed method compared with existing mask generation methods. Remarkably, our best model achieved a +16.32% improvement on the dataset compared with a representative polygon-based method.

* Accepted for presentation at IROS2024

Via

Access Paper or Ask Questions

Learning-To-Rank Approach for Identifying Everyday Objects Using a Physical-World Search Engine

Dec 26, 2023

Kanta Kaneda, Shunya Nagashima, Ryosuke Korekata, Motonari Kambara, Komei Sugiura

Abstract:Domestic service robots offer a solution to the increasing demand for daily care and support. A human-in-the-loop approach that combines automation and operator intervention is considered to be a realistic approach to their use in society. Therefore, we focus on the task of retrieving target objects from open-vocabulary user instructions in a human-in-the-loop setting, which we define as the learning-to-rank physical objects (LTRPO) task. For example, given the instruction "Please go to the dining room which has a round table. Pick up the bottle on it," the model is required to output a ranked list of target objects that the operator/user can select. In this paper, we propose MultiRankIt, which is a novel approach for the LTRPO task. MultiRankIt introduces the Crossmodal Noun Phrase Encoder to model the relationship between phrases that contain referring expressions and the target bounding box, and the Crossmodal Region Feature Encoder to model the relationship between the target object and multiple images of its surrounding contextual environment. Additionally, we built a new dataset for the LTRPO task that consists of instructions with complex referring expressions accompanied by real indoor environmental images that feature various target objects. We validated our model on the dataset and it outperformed the baseline method in terms of the mean reciprocal rank and recall@k. Furthermore, we conducted physical experiments in a setting where a domestic service robot retrieved everyday objects in a standardized domestic environment, based on users' instruction in a human--in--the--loop setting. The experimental results demonstrate that the success rate for object retrieval achieved 80%. Our code is available at https://github.com/keio-smilab23/MultiRankIt.

* Accepted for RAL 2023

Via

Access Paper or Ask Questions

DialMAT: Dialogue-Enabled Transformer with Moment-Based Adversarial Training

Nov 12, 2023

Kanta Kaneda, Ryosuke Korekata, Yuiga Wada, Shunya Nagashima, Motonari Kambara, Yui Iioka, Haruka Matsuo, Yuto Imai, Takayuki Nishimura, Komei Sugiura

Abstract:This paper focuses on the DialFRED task, which is the task of embodied instruction following in a setting where an agent can actively ask questions about the task. To address this task, we propose DialMAT. DialMAT introduces Moment-based Adversarial Training, which incorporates adversarial perturbations into the latent space of language, image, and action. Additionally, it introduces a crossmodal parallel feature extraction mechanism that applies foundation models to both language and image. We evaluated our model using a dataset constructed from the DialFRED dataset and demonstrated superior performance compared to the baseline method in terms of success rate and path weighted success rate. The model secured the top position in the DialFRED Challenge, which took place at the CVPR 2023 Embodied AI workshop.

* Accepted for presentation at Fourth Annual Embodied AI Workshop at CVPR

Via

Access Paper or Ask Questions

Fully Automated Task Management for Generation, Execution, and Evaluation: A Framework for Fetch-and-Carry Tasks with Natural Language Instructions in Continuous Space

Nov 07, 2023

Motonari Kambara, Komei Sugiura

Abstract:This paper aims to develop a framework that enables a robot to execute tasks based on visual information, in response to natural language instructions for Fetch-and-Carry with Object Grounding (FCOG) tasks. Although there have been many frameworks, they usually rely on manually given instruction sentences. Therefore, evaluations have only been conducted with fixed tasks. Furthermore, many multimodal language understanding models for the benchmarks only consider discrete actions. To address the limitations, we propose a framework for the full automation of the generation, execution, and evaluation of FCOG tasks. In addition, we introduce an approach to solving the FCOG tasks by dividing them into four distinct subtasks.

* Accepted at presentation for CVPR 2023 Embodied AI Workshop

Via

Access Paper or Ask Questions

Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

Jul 14, 2023

Ryosuke Korekata, Motonari Kambara, Yu Yoshida, Shintaro Ishikawa, Yosuke Kawasaki, Masaki Takahashi, Komei Sugiura

Figure 1 for Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

Figure 2 for Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

Figure 3 for Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

Figure 4 for Switching Head-Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks

Abstract:This paper describes a domestic service robot (DSR) that fetches everyday objects and carries them to specified destinations according to free-form natural language instructions. Given an instruction such as "Move the bottle on the left side of the plate to the empty chair," the DSR is expected to identify the bottle and the chair from multiple candidates in the environment and carry the target object to the destination. Most of the existing multimodal language understanding methods are impractical in terms of computational complexity because they require inferences for all combinations of target object candidates and destination candidates. We propose Switching Head-Tail Funnel UNITER, which solves the task by predicting the target object and the destination individually using a single model. Our method is validated on a newly-built dataset consisting of object manipulation instructions and semi photo-realistic images captured in a standard Embodied AI simulator. The results show that our method outperforms the baseline method in terms of language comprehension accuracy. Furthermore, we conduct physical experiments in which a DSR delivers standardized everyday objects in a standardized domestic environment as requested by instructions with referring expressions. The experimental results show that the object grasping and placing actions are achieved with success rates of more than 90%.

* Accepted for presentation at IROS2023

Via

Access Paper or Ask Questions