Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hongyu Xie

SEM: Enhancing Spatial Understanding for Robust Robot Manipulation

May 22, 2025

Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, Yiwei Jin, Keyu Li, Zhizhong Su

Abstract:A key challenge in robot manipulation lies in developing policy models with strong spatial understanding, the ability to reason about 3D geometry, object relations, and robot embodiment. Existing methods often fall short: 3D point cloud models lack semantic abstraction, while 2D image encoders struggle with spatial reasoning. To address this, we propose SEM (Spatial Enhanced Manipulation model), a novel diffusion-based policy framework that explicitly enhances spatial understanding from two complementary perspectives. A spatial enhancer augments visual representations with 3D geometric context, while a robot state encoder captures embodiment-aware structure through graphbased modeling of joint dependencies. By integrating these modules, SEM significantly improves spatial understanding, leading to robust and generalizable manipulation across diverse tasks that outperform existing baselines.

Via

Access Paper or Ask Questions

BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

Nov 22, 2024

Xuewu Lin, Tianwei Lin, Lichao Huang, Hongyu Xie, Zhizhong Su

Figure 1 for BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

Figure 2 for BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

Figure 3 for BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

Figure 4 for BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence

Abstract:In embodied intelligence systems, a key component is 3D perception algorithm, which enables agents to understand their surrounding environments. Previous algorithms primarily rely on point cloud, which, despite offering precise geometric information, still constrain perception performance due to inherent sparsity, noise, and data scarcity. In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. Specifically, we leverage pre-trained 2D vision foundation models to enhance semantic understanding, and introduce a spatial enhancer module to improve spatial understanding. Together, these modules enable BIP3D to achieve multi-view, multi-modal feature fusion and end-to-end 3D perception. In our experiments, BIP3D outperforms current state-of-the-art results on the EmbodiedScan benchmark, achieving improvements of 5.69% in the 3D detection task and 15.25% in the 3D visual grounding task.

Via

Access Paper or Ask Questions