Yuchen Mo

DoorBot: Closed-Loop Task Planning and Manipulation for Door Opening in the Wild with Haptic Feedback

Apr 12, 2025

Zhi Wang, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan

Abstract: Robots operating in unstructured environments face significant challenges when interacting with everyday objects like doors. They particularly struggle to generalize across diverse door types and conditions. Existing vision-based and open-loop planning methods often lack the robustness to handle varying door designs, mechanisms, and push/pull configurations. In this work, we propose a haptic-aware closed-loop hierarchical control framework that enables robots to explore and open different unseen doors in the wild. Our approach leverages real-time haptic feedback, allowing the robot to adjust its strategy dynamically based on force feedback during manipulation. We test our system on 20 unseen doors across different buildings, featuring diverse appearances and mechanical types. Our framework achieves a 90% success rate, demonstrating its ability to generalize and robustly handle varied door-opening tasks. This scalable solution offers potential applications in broader open-world articulated object manipulation tasks.

* In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2025)

Sensor-Invariant Tactile Representation

Feb 27, 2025

Harsh Gupta, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan

Abstract: High-resolution tactile sensors have become critical for embodied perception and robotic manipulation. However, a key challenge in the field is the lack of transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. This limitation hinders the ability to transfer models or knowledge learned from one sensor to another. To address this, we introduce a novel method for extracting Sensor-Invariant Tactile Representations (SITR), enabling zero-shot transfer across optical tactile sensors. Our approach utilizes a transformer-based architecture trained on a diverse dataset of simulated sensor designs, allowing it to generalize to new sensors in the real world with minimal calibration. Experimental results demonstrate the method's effectiveness across various tactile sensing applications, facilitating data and model transferability for future advancements in the field.

* Accepted to ICLR'25

Learning to Double Guess: An Active Perception Approach for Estimating the Center of Mass of Arbitrary Objects

Feb 04, 2025

Shengmiao Jin, Yuchen Mo, Wenzhen Yuan

Abstract: Manipulating arbitrary objects in unstructured environments is a significant challenge in robotics, primarily due to difficulties in determining an object's center of mass. This paper introduces U-GRAPH: Uncertainty-Guided Rotational Active Perception with Haptics, a novel framework to enhance center of mass estimation using active perception. Traditional methods often rely on a single interaction and are limited by the inherent inaccuracies of Force-Torque (F/T) sensors. Our approach circumvents these limitations by integrating a Bayesian Neural Network (BNN) to quantify uncertainty and guide the robotic system through multiple, information-rich interactions via grid search and a neural network that scores each action. We demonstrate the generalizability and transferability of our method: trained on a small dataset with limited variation, it still performs well on unseen, complex real-world objects.

* Accepted to ICRA 25; 7 pages, 5 figures

InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions

Oct 18, 2023

Hanbo Zhang, Jie Xu, Yuchen Mo, Tao Kong

Abstract: Ambiguity is ubiquitous in human communication. Previous approaches in Human-Robot Interaction (HRI) have often relied on predefined interaction templates, leading to reduced performance in realistic and open-ended scenarios. To address these issues, we present a large-scale dataset, InViG, for interactive visual grounding under language ambiguity. Our dataset comprises over 520K images accompanied by open-ended goal-oriented disambiguation dialogues, encompassing millions of object instances and corresponding question-answer pairs. Leveraging the InViG dataset, we conduct extensive studies and propose a set of baseline solutions for end-to-end interactive visual disambiguation and grounding, achieving a 45.6% success rate during validation. To the best of our knowledge, the InViG dataset is the first large-scale dataset for resolving open-ended interactive visual grounding, presenting a practical yet highly challenging benchmark for ambiguity-aware HRI. Codes and datasets are available at: https://openivg.github.io.

* 8 pages, 9 figures, 3 tables, under review

Combo of Thinking and Observing for Outside-Knowledge VQA

May 10, 2023

Qingyi Si, Yuchen Mo, Zheng Lin, Huishan Ji, Weiping Wang

Abstract: Outside-knowledge visual question answering is a challenging task that requires both the acquisition and the use of open-ended real-world knowledge. Some existing solutions draw external knowledge into the cross-modality space, which overlooks the much vaster textual knowledge in natural-language space, while others transform the image into text that is then fused with the textual knowledge in natural-language space, completely abandoning the use of visual features. In this paper, we constrain the cross-modality space to coincide with the natural-language space, so that the visual features are preserved directly and the model still benefits from the vast knowledge in natural-language space. To this end, we propose a novel framework consisting of a multimodal encoder, a textual encoder and an answer decoder. This structure allows us to introduce more types of knowledge, including explicit and implicit multimodal and textual knowledge. Extensive experiments validate the superiority of the proposed method, which outperforms the state-of-the-art by 6.17% accuracy. We also conduct comprehensive ablations of each component, and systematically study the roles of varying types of knowledge. Codes and knowledge data can be found at https://github.com/PhoebusSi/Thinking-while-Observing.

* ACL-23, Main Conference

Rethinking Generative Coverage: A Pointwise Guaranteed Approach

Feb 20, 2019

Peilin Zhong, Yuchen Mo, Chang Xiao, Pengyu Chen, Changxi Zheng

Abstract: All generative models have to combat missing modes. The conventional wisdom is to do so by reducing a statistical distance (such as f-divergence) between the generated distribution and the provided data distribution through training. We defy this wisdom. We show that even a small statistical distance does not imply a plausible mode coverage, because this distance measures a global similarity between two distributions, but not their similarity in local regions--which is needed to ensure a complete mode coverage. From a starkly different perspective, we view the battle against missing modes as a two-player game, between a player choosing a data point and an adversary choosing a generator aiming to cover that data point. Enlightened by von Neumann's minimax theorem, we see that if a generative model can approximate a data distribution moderately well under a global statistical distance measure, then we should be able to find a mixture of generators which collectively covers every data point and thus every mode with a lower-bounded probability density. A constructive realization of this minimax duality--that is, our proposed algorithm of finding the mixture of generators--is connected to a multiplicative weights update rule. We prove the pointwise coverage guarantee of our algorithm, and our experiments on real and synthetic data confirm better mode coverage over recent approaches that also use a mixture of generators but focus on global statistical distances.
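
As a rough illustration of the multiplicative weights idea mentioned in the abstract above, the toy sketch below builds a mixture of "generators" over 1-D data. It is not the paper's algorithm: the generator here is a hypothetical stand-in (`fit_generator` simply picks the data point with the heaviest weighted neighbourhood), and the names `covers`, `eta`, `radius` and `rounds` are assumptions made for this example. The only part meant to mirror the abstract is the loop: points already covered are down-weighted, so later generators are pushed toward points that are still missed.

```python
def covers(center, point, radius=1.0):
    """Toy notion of coverage: the point lies within `radius` of the generator's centre."""
    return abs(point - center) <= radius


def fit_generator(points, weights, radius=1.0):
    """Hypothetical stand-in for training a generator on weighted data:
    return the data point whose neighbourhood carries the most weight."""
    def mass(center):
        return sum(w for p, w in zip(points, weights) if covers(center, p, radius))
    return max(points, key=mass)


def mixture_by_multiplicative_weights(points, rounds=4, eta=0.5):
    """Build a uniform mixture of toy generators, one per round."""
    n = len(points)
    weights = [1.0 / n] * n              # start from the uniform data distribution
    centers = []
    for _ in range(rounds):
        center = fit_generator(points, weights)
        centers.append(center)
        # Multiplicative update: shrink the weight of every covered point ...
        weights = [w * (1 - eta) if covers(center, p) else w
                   for p, w in zip(points, weights)]
        # ... then renormalise so the weights remain a distribution.
        total = sum(weights)
        weights = [w / total for w in weights]
    return centers


def coverage(points, centers):
    """Fraction of mixture components that cover each point."""
    return [sum(covers(c, p) for c in centers) / len(centers) for p in points]


# Six points near 0 (majority mode) and two near 10 (minority mode).
points = [0.0, 0.1, 0.2, 0.1, -0.1, 0.05, 10.0, 10.1]
centers = mixture_by_multiplicative_weights(points)
cov = coverage(points, centers)
print(centers)
print(cov)
```

In this toy run the first generators sit on the majority cluster, and the minority cluster near 10 is only picked up after the majority points have been down-weighted. Every point ends up covered by at least one component of the mixture, which is the flavour of pointwise guarantee the abstract describes, though here it holds only for this hand-built example.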

Active Clothing Material Perception using Tactile Sensing and Deep Learning

Feb 25, 2018

Wenzhen Yuan, Yuchen Mo, Shaoxiong Wang, Edward Adelson

Abstract: Humans represent and discriminate objects in the same category using their properties, and an intelligent robot should be able to do the same. In this paper, we build a robot system that can autonomously perceive object properties through touch. We work on the common object category of clothing. The robot moves under the guidance of an external Kinect sensor and squeezes the clothes with a GelSight tactile sensor; it then recognizes 11 properties of the clothing from the tactile data. Those properties include physical properties, like thickness, fuzziness, softness and durability, and semantic properties, like wearing season and preferred washing methods. We collect a dataset of 153 varied pieces of clothes, and conduct 6616 robot exploring iterations on them. To extract the useful information from the high-dimensional sensory output, we apply Convolutional Neural Networks (CNN) to the tactile data for recognizing the clothing properties, and to the Kinect depth images for selecting exploration locations. Experiments show that using the trained neural networks, the robot can autonomously explore unknown clothes and learn their properties. This work proposes a new framework for an active tactile perception system that combines vision and touch, and has the potential to enable robots to help humans with varied clothing-related housework.

* ICRA 2018 accepted
