Xiaozhu Ju

ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning

Jun 06, 2025

Zhao Jin, Zhengping Che, Zhen Zhao, Kun Wu, Yuheng Zhang, Yinuo Zhao, Zehui Liu, Qiang Zhang, Xiaozhu Ju, Jing Tian(+2 more)

Abstract: Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models to master robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset comprising high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers adhering to unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, while physical fidelity is achieved via fine-tuned dynamic parameters. Meanwhile, the dataset pioneers embedded modular interaction behaviors within assets and pixel-level affordance annotations. Feature-map visualization and optical motion capture are employed to quantitatively demonstrate ArtVIP's visual and physical fidelity, with its applicability validated across imitation learning and reinforcement learning experiments. Provided in USD format with detailed production guidelines, ArtVIP is fully open-source, benefiting the research community and advancing robot learning research. Our project is at https://x-humanoid-artvip.github.io/.

Occupancy World Model for Robots

May 07, 2025

Zhang Zhang, Qiang Zhang, Wei Cui, Shuai Shi, Yijie Guo, Gang Han, Wen Zhao, Jingkai Sun, Jiahang Cao, Jiaxu Wang(+5 more)

Abstract: Understanding and forecasting scene evolution deeply affects the exploration and decision-making of embodied agents. While traditional methods simulate scene evolution through trajectory prediction of potential instances, current works use the occupancy world model as a generative framework for describing fine-grained overall scene dynamics. However, existing methods concentrate on structured outdoor road scenes and ignore the forecasting of 3D occupancy scene evolution for robots in indoor scenes. In this work, we explore a new framework for learning the scene evolution of observed fine-grained occupancy and propose an occupancy world model, called RoboOccWorld, based on a combined spatio-temporal receptive field and a guided autoregressive transformer to forecast scene evolution. We propose Conditional Causal State Attention (CCSA), which uses the camera poses of the next state as conditions to guide the autoregressive transformer to adapt to and understand indoor robotics scenarios. To effectively exploit the spatio-temporal cues from historical observations, Hybrid Spatio-Temporal Aggregation (HSTA) is proposed to obtain the combined spatio-temporal receptive field based on multi-scale spatio-temporal windows. In addition, we restructure the OccWorld-ScanNet benchmark based on local annotations to facilitate the evaluation of the indoor 3D occupancy scene evolution prediction task. Experimental results demonstrate that RoboOccWorld outperforms state-of-the-art methods on the indoor 3D occupancy scene evolution prediction task. The code will be released soon.

EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks

Mar 14, 2025

Yi Zhang, Qiang Zhang, Xiaozhu Ju, Zhaoyang Liu, Jilei Mao, Jingkai Sun, Jintao Wu, Shixiong Gao, Shihan Cai, Zhiyuan Qin(+6 more)

Abstract: While multimodal large language models (MLLMs) have made groundbreaking progress in embodied intelligence, they still face significant challenges in spatial reasoning for complex long-horizon tasks. To address this gap, we propose EmbodiedVSR (Embodied Visual Spatial Reasoning), a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning to enhance spatial understanding for embodied agents. By explicitly constructing structured knowledge representations through dynamic scene graphs, our method enables zero-shot spatial reasoning without task-specific fine-tuning. This approach not only disentangles intricate spatial relationships but also aligns reasoning steps with actionable environmental dynamics. To rigorously evaluate performance, we introduce the eSpatial-Benchmark, a comprehensive dataset including real-world embodied scenarios with fine-grained spatial annotations and adaptive task difficulty levels. Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence, particularly in long-horizon tasks requiring iterative environment interaction. The results reveal the untapped potential of MLLMs for embodied intelligence when equipped with structured, explainable reasoning mechanisms, paving the way for more reliable deployment in real-world spatial applications. The code and datasets will be released soon.

* technical report

RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

Dec 18, 2024

Kun Wu, Chengkai Hou, Jiaming Liu, Zhengping Che, Xiaozhu Ju, Zhuqin Yang, Meng Li, Yinuo Zhao, Zhiyuan Xu, Guang Yang(+26 more)

Abstract: Developing robust and general-purpose robotic manipulation policies is a key goal in the field of robotics. To achieve effective generalization, it is essential to construct comprehensive datasets that encompass a large number of demonstration trajectories and diverse tasks. Unlike vision or language data that can be collected from the Internet, robotic datasets require detailed observations and manipulation actions, necessitating significant investment in hardware-software infrastructure and human labor. While existing works have focused on assembling various individual robot datasets, there is still no unified data collection standard, and diversity in tasks, scenarios, and robot types remains insufficient. In this paper, we introduce RoboMIND (Multi-embodiment Intelligence Normative Data for Robot manipulation), featuring 55k real-world demonstration trajectories across 279 diverse tasks involving 61 different object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robot-related information, including multi-view RGB-D images, proprioceptive robot state information, end effector details, and linguistic task descriptions. To ensure dataset consistency and reliability during policy learning, RoboMIND is built on a unified data collection platform and standardized protocol, covering four distinct robotic embodiments. We provide a thorough quantitative and qualitative analysis of RoboMIND across multiple dimensions, offering detailed insights into the diversity of our datasets. In our experiments, we conduct extensive real-world testing with four state-of-the-art imitation learning methods, demonstrating that training with RoboMIND data results in a high manipulation success rate and strong generalization. Our project is at https://x-humanoid-robomind.github.io/.

An Efficient Generalizable Framework for Visuomotor Policies via Control-aware Augmentation and Privilege-guided Distillation

Jan 17, 2024

Yinuo Zhao, Kun Wu, Tianjiao Yi, Zhiyuan Xu, Xiaozhu Ju, Zhengping Che, Qinru Qiu, Chi Harold Liu, Jian Tang

Abstract: Visuomotor policies, which learn control mechanisms directly from high-dimensional visual observations, confront challenges in adapting to new environments with intricate visual variations. Data augmentation emerges as a promising method for bridging these generalization gaps by enriching data variety. However, straightforwardly augmenting the entire observation imposes an excessive burden on policy learning and may even result in performance degradation. In this paper, we propose to improve the generalization ability of visuomotor policies while preserving training stability in two ways: 1) We learn a control-aware mask through a self-supervised reconstruction task with three auxiliary losses and then apply strong augmentation only to the control-irrelevant regions identified by the mask to reduce the generalization gaps. 2) To address the training instability prevalent in visual reinforcement learning (RL), we distill the knowledge from a pretrained RL expert that processes low-level environment states into the student visuomotor policy. The policy is subsequently deployed to unseen environments without any further finetuning. We conducted comparison and ablation studies across various benchmarks: the DMControl Generalization Benchmark (DMC-GB), the enhanced Robot Manipulation Distraction Benchmark (RMDB), and a specialized long-horizon drawer-opening robotic task. The extensive experimental results demonstrate the effectiveness of our method, e.g., showing a 17% improvement over previous methods in the video-hard setting of DMC-GB.
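The mask-guided augmentation step described in this abstract can be illustrated with a minimal sketch. It assumes the control-aware mask is already available (the paper learns it through self-supervised reconstruction, which is not reproduced here), and it uses additive uniform noise as a stand-in for the "strong augmentation". The function name `masked_augment` and its parameters are hypothetical and not taken from the paper.

```python
import random


def masked_augment(obs, mask, noise_scale=0.5, rng=None):
    """Augment only the control-irrelevant pixels of a 2D observation.

    obs  : list of rows of floats (a toy grayscale image).
    mask : same shape, 1 for control-relevant pixels, 0 otherwise
           (assumed given; the paper learns this mask).
    Additive uniform noise is a placeholder for a strong augmentation.
    """
    if rng is None:
        rng = random.Random(0)
    out = []
    for obs_row, mask_row in zip(obs, mask):
        row = []
        for pixel, m in zip(obs_row, mask_row):
            augmented = pixel + rng.uniform(-noise_scale, noise_scale)
            # Keep the original pixel where m == 1, use the augmented one where m == 0.
            row.append(m * pixel + (1.0 - m) * augmented)
        out.append(row)
    return out


obs = [[0.1, 0.2], [0.3, 0.4]]
mask = [[1, 0], [0, 1]]
result = masked_augment(obs, mask)
```

In this toy example the pixels on the diagonal are marked control-relevant and pass through unchanged, while the off-diagonal pixels are perturbed.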

Recursive Hierarchical Projection for Whole-Body Control with Task Priority Transition

Sep 22, 2021

Gang Han, Jiajun Wang, Xiaozhu Ju, Mingguo Zhao

Abstract: Redundant robots are expected to execute multiple tasks with different priorities simultaneously. Task priorities must be transitioned for complex task scheduling in whole-body control (WBC). Many methods focus on guaranteeing control continuity during task priority transition, but they inevitably either increase the computational cost or sacrifice task accuracy. This work formulates the WBC problem with task priority transition as a Hierarchical Quadratic Programming (HQP) problem with Recursive Hierarchical Projection (RHP) matrices. The tasks of each level are solved recursively through HQP. We propose the RHP matrix to form the continuously changing projection of each level so that the task priority transition is achieved without increasing the computational cost. Additionally, the recursive approach solves the WBC problem without losing task accuracy. We verify the effectiveness of this scheme through comparative simulations of reactive collision avoidance with multi-task priority transitions.

* 6 pages, 9 figures, submitted to ICRA 2022

Mixed Control for Whole-Body Compliance of a Humanoid Robot

Sep 16, 2021

Xiaozhu Ju, Jiajun Wang, Gang Han, Mingguo Zhao

Abstract: Hierarchical quadratic programming (HQP) is commonly applied to handle strict hierarchies of multiple tasks and the robot's physical inequality constraints during whole-body compliance. However, for the one-step HQP, the solution can oscillate when it is close to the boundary of the constraints. This is because abruptly hitting the bounds gives rise to unrealisable jerks and even infeasible solutions. This paper proposes mixed control, which blends single-axis model predictive control (MPC) and proportional-derivative (PD) control for whole-body compliance to overcome these deficiencies. The MPC predicts the distances between the bounds and the control target of the critical tasks, and it provides smooth and feasible solutions by prediction and optimisation in advance. However, applying MPC inevitably increases the computation time. Therefore, to achieve a 500 Hz servo rate, PD controllers still regulate the other tasks to save computation resources. We also use a more efficient null space projection (NSP) whole-body controller instead of the HQP and distribute the single-axis MPCs across four CPU cores for parallel computation. Finally, we validate the desired capabilities of the proposed strategy via simulations and an experiment on the humanoid robot Walker X.

* 6 pages, 5 figures, submitted to ICRA 2022
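As an illustration of the null space projection (NSP) idea mentioned in this abstract, the following is a minimal sketch of the classic two-level strict-priority resolution at the velocity level. It is not the paper's controller: it assumes each task is one-dimensional (a single-row Jacobian) so that the pseudoinverse has a closed form, and the names `pinv_row`, `project_null`, and `two_level_nsp` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pinv_row(j):
    """Pseudoinverse of a single-row Jacobian j, i.e. j^T / (j j^T)."""
    norm_sq = dot(j, j)
    if norm_sq < 1e-12:
        # Degenerate row: no motion can serve this task.
        return [0.0] * len(j)
    return [x / norm_sq for x in j]


def project_null(j, v):
    """Project v onto the null space of row Jacobian j: (I - j^+ j) v."""
    jp = pinv_row(j)
    s = dot(j, v)
    return [vi - jpi * s for vi, jpi in zip(v, jp)]


def two_level_nsp(j1, dx1, j2, dx2):
    """Joint velocities that satisfy task 1 exactly and task 2 as far as
    possible inside the null space of task 1."""
    q1 = [p * dx1 for p in pinv_row(j1)]
    # Task-2 Jacobian restricted to the null space of task 1.
    j2n = project_null(j1, j2)
    residual = dx2 - dot(j2, q1)
    q2 = [p * residual for p in pinv_row(j2n)]
    return [a + b for a, b in zip(q1, q2)]


# Toy 3-joint example with two compatible scalar tasks.
j1, dx1 = [1.0, 0.0, 0.0], 0.5
j2, dx2 = [1.0, 1.0, 0.0], 1.0
dq = two_level_nsp(j1, dx1, j2, dx2)
```

When the two tasks conflict (for example, both depend only on the first joint), the restricted Jacobian vanishes and the secondary task is dropped, so the primary task is still met exactly.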

Whole-Body Control with Motion/Force Transmissibility for Parallel-Legged Robot

Sep 15, 2021

Jiajun Wang, Gang Han, Xiaozhu Ju, Mingguo Zhao

Abstract: Whole-body control (WBC) has been applied to the locomotion of legged robots. However, current WBC methods have not considered the intrinsic features of parallel mechanisms, especially motion/force transmissibility (MFT). In this work, we propose an MFT-enhanced WBC scheme. Introducing MFT into a WBC is challenging due to the nonlinear relationship between MFT indices and the robot configuration. To overcome this challenge, we establish the MFT preferable space of the robot and formulate it as a polyhedron in the joint space at the acceleration level. Then, the WBC employs the polyhedron as a soft constraint. As a result, the robot possesses high-speed and high-acceleration capabilities by satisfying this constraint as well as staying away from its singularity. Compared with a WBC that does not consider MFT, our proposed scheme is more robust to external disturbances, e.g., in push recovery and uneven terrain locomotion. Simulations and experiments on a parallel-legged bipedal robot are provided to demonstrate the performance and robustness of the proposed method.

* 6 pages, 7 figures, submitted to ICRA 2022
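To make the "polyhedron as a soft constraint" idea concrete, here is a minimal sketch that evaluates a quadratic penalty for leaving a polyhedron {x : A x <= b}. This is only an illustration under stated assumptions: the abstract does not say how the soft constraint enters the optimisation (slack variables in the QP are one common way), and the matrix, bounds, weight, and function name `soft_polyhedron_cost` are hypothetical.

```python
def soft_polyhedron_cost(a_rows, b, x, weight=10.0):
    """Quadratic penalty for violating the polyhedron {x : A x <= b}.

    a_rows : list of rows of A (one half-space per row).
    b      : list of upper bounds, one per row.
    x      : candidate point (e.g. joint accelerations in this sketch).
    Returns 0.0 inside the polyhedron and grows with the squared violation.
    """
    cost = 0.0
    for row, bound in zip(a_rows, b):
        violation = sum(a * xi for a, xi in zip(row, x)) - bound
        if violation > 0.0:
            cost += weight * violation ** 2
    return cost


# Toy 2D box |x_i| <= 1 written as four half-spaces.
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 1.0, 1.0, 1.0]
inside = soft_polyhedron_cost(A, b, [0.5, -0.5])
outside = soft_polyhedron_cost(A, b, [1.5, 0.0])
```

Because the penalty is finite rather than a hard limit, a controller using it can briefly leave the preferred region when a higher-priority objective demands it.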