Tongzhou Mu

Towards Embodiment Scaling Laws in Robot Locomotion

May 09, 2025

Bo Ai, Liu Dai, Nico Bohlinger, Dichen Li, Tongzhou Mu, Zhanxin Wu, K. Fay, Henrik I. Christensen, Jan Peters, Hao Su

Abstract:Developing generalist agents that can operate across diverse tasks, environments, and physical embodiments is a grand challenge in robotics and artificial intelligence. In this work, we focus on the axis of embodiment and investigate embodiment scaling laws, the hypothesis that increasing the number of training embodiments improves generalization to unseen ones. Using robot locomotion as a test bed, we procedurally generate a dataset of ~1,000 varied embodiments, spanning humanoids, quadrupeds, and hexapods, and train generalist policies capable of handling diverse observation and action spaces on random subsets. We find that increasing the number of training embodiments improves generalization to unseen ones, and scaling embodiments is more effective in enabling embodiment-level generalization than scaling data on small, fixed sets of embodiments. Notably, our best policy, trained on the full dataset, zero-shot transfers to novel embodiments in the real world, such as Unitree Go2 and H1. These results represent a step toward general embodied intelligence, with potential relevance to adaptive control for configurable robots, co-design of morphology and control, and beyond.

* 32 pages. Project website: https://embodiment-scaling-laws.github.io/

Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning

Mar 03, 2025

Adrià López Escoriza, Nicklas Hansen, Stone Tao, Tongzhou Mu, Hao Su

Abstract:Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.

* Project page can be found at https://adrialopezescoriza.github.io/demo3/

Responsive Noise-Relaying Diffusion Policy: Responsive and Efficient Visuomotor Control

Feb 18, 2025

Zhuoqun Chen, Xiu Yuan, Tongzhou Mu, Hao Su

Abstract:Imitation learning is an efficient method for teaching robots a variety of tasks. Diffusion Policy, which uses a conditional denoising diffusion process to generate actions, has demonstrated superior performance, particularly in learning from multi-modal demonstrations. However, it relies on executing multiple actions to retain performance and prevent mode bouncing, which limits its responsiveness, as actions are not conditioned on the most recent observations. To address this, we introduce Responsive Noise-Relaying Diffusion Policy (RNR-DP), which maintains a noise-relaying buffer with progressively increasing noise levels and employs a sequential denoising mechanism that generates immediate, noise-free actions at the head of the sequence, while appending noisy actions at the tail. This ensures that actions are responsive and conditioned on the latest observations, while maintaining motion consistency through the noise-relaying buffer. This design enables the handling of tasks requiring responsive control, and accelerates action generation by reusing denoising steps. Experiments on response-sensitive tasks demonstrate that, compared to Diffusion Policy, our method achieves an 18% improvement in success rate. Further evaluation on regular tasks demonstrates that RNR-DP also exceeds the best acceleration method by 6.9%, highlighting its computational efficiency advantage in scenarios where responsiveness is less critical.
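
The buffer mechanism described in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `NoiseRelayBuffer`, the scalar action, the all-noise initialization, and the stand-in `toy_denoise` function (which replaces a learned denoising network) are all assumptions made for illustration.

```python
import random
from typing import Callable, List

# Hypothetical signature: (noisy_action, noise_level, observation) -> action one level cleaner.
Denoiser = Callable[[float, int, float], float]


class NoiseRelayBuffer:
    """Toy buffer: slot k (0-indexed) holds an action at noise level k + 1."""

    def __init__(self, length: int, denoise: Denoiser, seed: int = 0):
        self.denoise = denoise
        self.rng = random.Random(seed)
        # Assumed initialization: every slot starts as pure Gaussian noise.
        self.actions: List[float] = [self.rng.gauss(0.0, 1.0) for _ in range(length)]

    def step(self, obs: float) -> float:
        # One denoising pass over the whole buffer, conditioned on the latest observation.
        self.actions = [self.denoise(a, k + 1, obs) for k, a in enumerate(self.actions)]
        # The head has reached noise level 0, so it is emitted for execution.
        clean = self.actions.pop(0)
        # The rest shift one slot toward the head; fresh noise is appended at the tail.
        self.actions.append(self.rng.gauss(0.0, 1.0))
        return clean


def toy_denoise(action: float, level: int, obs: float) -> float:
    # Stand-in for a learned network: blend toward a made-up target action, 2 * obs.
    target = 2.0 * obs
    return action * (1.0 - 1.0 / level) + target / level


buffer = NoiseRelayBuffer(length=4, denoise=toy_denoise)
first_action = buffer.step(obs=0.5)
```

Each control step costs a single denoising pass yet yields one clean action, which is the sense in which denoising work is reused across steps in this sketch.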

Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model

Dec 18, 2024

Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, Hao Su

Abstract:Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity, quality, and diversity of demonstrations. This paper explores improving offline-trained imitation learning models through online interactions with the environment. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large imitation learning models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning. Our evaluation spans eight tasks across two benchmarks (ManiSkill and Adroit) and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies. See our project page (https://policydecorator.github.io) for videos.

* Explore videos, data, code, and more at https://policydecorator.github.io
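
As a rough illustration of what a residual policy on top of a frozen base model can look like, here is a minimal sketch. It is not the paper's implementation: the `decorate` helper, the `scale` parameter, and the tanh bounding of the correction are assumptions chosen to show one simple way a residual could be kept small.

```python
import math
from typing import Callable, List, Sequence

# A policy maps an observation vector to an action vector.
Policy = Callable[[Sequence[float]], List[float]]


def decorate(base_policy: Policy, residual_policy: Policy, scale: float) -> Policy:
    """Wrap a frozen base policy with a bounded additive correction (illustrative only)."""

    def act(obs: Sequence[float]) -> List[float]:
        base_action = base_policy(obs)      # offline-trained model, used as a black box
        raw_residual = residual_policy(obs)  # small network trained online (hypothetical)
        # tanh keeps each correction inside [-scale, scale], an assumed way to limit exploration.
        return [b + scale * math.tanh(r) for b, r in zip(base_action, raw_residual)]

    return act


# Usage with stand-in policies: the base action is shifted by at most `scale` per dimension.
policy = decorate(lambda obs: [0.5, -0.5], lambda obs: [100.0, -100.0], scale=0.1)
action = policy([0.0])
```

Because the wrapper only calls the base policy, it does not depend on the base model's architecture, which is the sense of "model-agnostic" in this sketch.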

When Should We Prefer State-to-Visual DAgger Over Visual Reinforcement Learning?

Dec 18, 2024

Tongzhou Mu, Zhaoyang Li, Stanisław Wiktor Strzelecki, Xiu Yuan, Yunchao Yao, Litian Liang, Hao Su

Abstract:Learning policies from high-dimensional visual inputs, such as pixels and point clouds, is crucial in various applications. Visual reinforcement learning is a promising approach that directly trains policies from visual observations, although it faces challenges in sample efficiency and computational costs. This study conducts an empirical comparison of State-to-Visual DAgger, a two-stage framework that initially trains a state policy before adopting online imitation to learn a visual policy, and Visual RL across a diverse set of tasks. We evaluate both methods across 16 tasks from three benchmarks, focusing on their asymptotic performance, sample efficiency, and computational costs. Surprisingly, our findings reveal that State-to-Visual DAgger does not universally outperform Visual RL but shows significant advantages in challenging tasks, offering more consistent performance. In contrast, its benefits in sample efficiency are less pronounced, although it often reduces the overall wall-clock time required for training. Based on our findings, we provide recommendations for practitioners and hope that our results contribute valuable perspectives for future research in visual policy learning.

* Accepted by The 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025)

ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI

Oct 01, 2024

Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan (+10 more)

Abstract:Simulation has enabled unprecedented compute-scalable approaches to robot learning. However, many existing simulation frameworks typically support a narrow range of scenes/tasks and lack features critical for scaling generalizable robotics and sim2real. We introduce and open source ManiSkill3, the fastest state-visual GPU parallelized robotics simulator with contact-rich physics targeting generalizable manipulation. ManiSkill3 supports GPU parallelization of many aspects including simulation+rendering, heterogeneous simulation, pointclouds/voxels visual input, and more. Simulation with rendering on ManiSkill3 can run 10-1000x faster with 2-3x less GPU memory usage than other platforms, achieving up to 30,000+ FPS in benchmarked environments due to minimal Python/PyTorch overhead in the system, simulation on the GPU, and the use of the SAPIEN parallel rendering system. Tasks that used to take hours to train can now take minutes. We further provide the most comprehensive range of GPU parallelized environments/tasks spanning 12 distinct domains including but not limited to mobile manipulation for tasks such as drawing, humanoids, and dextrous manipulation in realistic scenes designed by artists or real-world digital twins. In addition, millions of demonstration frames are provided from motion planning, RL, and teleoperation. ManiSkill3 also provides a comprehensive set of baselines that span popular RL and learning-from-demonstrations algorithms.

* Project website: http://maniskill.ai/

DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks

Apr 25, 2024

Tongzhou Mu, Minghua Liu, Hao Su

Abstract:The success of many RL techniques heavily relies on human-engineered dense rewards, which typically demand substantial domain expertise and extensive trial and error. In our work, we propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks in a data-driven manner. By leveraging the stage structures of the task, DrS learns a high-quality dense reward from sparse rewards and, if given, demonstrations. The learned rewards can be reused in unseen tasks, thus reducing the human effort for reward engineering. Extensive experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms. The learned rewards even achieve comparable performance to human-engineered rewards on some tasks. See our project page (https://sites.google.com/view/iclr24drs) for more details.

* ICLR 2024. Explore videos, data, code, and more at https://sites.google.com/view/iclr24drs

AdaDemo: Data-Efficient Demonstration Expansion for Generalist Robotic Agent

Apr 11, 2024

Tongzhou Mu, Yijie Guo, Jie Xu, Ankit Goyal, Hao Su, Dieter Fox, Animesh Garg

Abstract:Encouraged by the remarkable achievements of language and vision foundation models, developing generalist robotic agents through imitation learning, using large demonstration datasets, has become a prominent area of interest in robot learning. The efficacy of imitation learning is heavily reliant on the quantity and quality of the demonstration datasets. In this study, we aim to scale up demonstrations in a data-efficient way to facilitate the learning of generalist robotic agents. We introduce AdaDemo (Adaptive Online Demonstration Expansion), a general framework designed to improve multi-task policy learning by actively and continually expanding the demonstration dataset. AdaDemo strategically collects new demonstrations to address the identified weakness in the existing policy, ensuring data efficiency is maximized. Through a comprehensive evaluation on a total of 22 tasks across two robotic manipulation benchmarks (RLBench and Adroit), we demonstrate AdaDemo's capability to progressively improve policy performance by guiding the generation of high-quality demonstration datasets in a data-efficient manner.

Unleashing the Creative Mind: Language Model As Hierarchical Policy For Improved Exploration on Challenging Problem Solving

Nov 01, 2023

Zhan Ling, Yunhao Fang, Xuanlin Li, Tongzhou Mu, Mingu Lee, Reza Pourreza, Roland Memisevic, Hao Su

Abstract:Large Language Models (LLMs) have achieved tremendous progress, yet they still often struggle with challenging reasoning problems. Current approaches address this challenge by sampling or searching detailed and low-level reasoning chains. However, these methods are still limited in their exploration capabilities, making it challenging for correct solutions to stand out in the huge solution space. In this work, we unleash LLMs' creative potential for exploring multiple diverse problem-solving strategies by framing an LLM as a hierarchical policy via in-context learning. This policy consists of a visionary leader that proposes multiple diverse high-level problem-solving tactics as hints, accompanied by a follower that executes detailed problem-solving processes following each high-level instruction. The follower uses each of the leader's directives as a guide and samples multiple reasoning chains to tackle the problem, generating a solution group for each leader proposal. Additionally, we propose an effective and efficient tournament-based approach to select among these explored solution groups to reach the final answer. Our approach produces meaningful and inspiring hints, enhances problem-solving strategy exploration, and improves the final answer accuracy on challenging problems in the MATH dataset. Code will be released at https://github.com/lz1oceani/LLM-As-Hierarchical-Policy.

Accelerated Doubly Stochastic Gradient Algorithm for Large-scale Empirical Risk Minimization

Apr 23, 2023

Zebang Shen, Hui Qian, Tongzhou Mu, Chao Zhang

Abstract:Nowadays, algorithms with fast convergence, small memory footprints, and low per-iteration complexity are particularly favorable for artificial intelligence applications. In this paper, we propose a doubly stochastic algorithm with a novel accelerating multi-momentum technique to solve large-scale empirical risk minimization problems for learning tasks. While enjoying a provably superior convergence rate, in each iteration the algorithm accesses only a mini batch of samples and updates only a small block of variable coordinates, which substantially reduces the number of memory references when both massive sample size and ultra-high dimensionality are involved. Empirical studies on huge-scale datasets are conducted to illustrate the efficiency of our method in practice.

* Accepted to IJCAI 2017. Corresponding author: Hui Qian
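
To make "doubly stochastic" concrete, the sketch below runs a plain update that samples both a mini batch of rows and a block of coordinates at every iteration, on a synthetic least-squares problem. It deliberately omits the paper's accelerating multi-momentum technique; the function names, the squared loss, the step size, and the synthetic data are all assumptions for illustration.

```python
import random
from typing import List, Sequence


def doubly_stochastic_step(w: List[float], X: Sequence[Sequence[float]], y: Sequence[float],
                           batch_idx: Sequence[int], block_idx: Sequence[int], lr: float) -> List[float]:
    """One non-accelerated step: gradient from a sample batch, applied to a coordinate block."""
    # Residuals are computed on the sampled mini batch only.
    residuals = [sum(X[i][j] * w[j] for j in range(len(w))) - y[i] for i in batch_idx]
    # Only the sampled coordinates are touched; the rest of w stays as it was.
    for j in block_idx:
        grad_j = sum(r * X[i][j] for r, i in zip(residuals, batch_idx)) / len(batch_idx)
        w[j] -= lr * grad_j
    return w


def mean_squared_error(w: Sequence[float], X: Sequence[Sequence[float]], y: Sequence[float]) -> float:
    return sum((sum(a * b for a, b in zip(x, w)) - t) ** 2 for x, t in zip(X, y)) / len(y)


def run(seed: int = 0, n: int = 200, d: int = 10, batch: int = 16,
        block: int = 3, steps: int = 3000, lr: float = 0.05):
    rng = random.Random(seed)
    w_true = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [sum(a * b for a, b in zip(x, w_true)) for x in X]  # noise-free targets
    w = [0.0] * d
    start = mean_squared_error(w, X, y)
    for _ in range(steps):
        batch_idx = rng.sample(range(n), batch)   # stochastic over samples
        block_idx = rng.sample(range(d), block)   # stochastic over coordinates
        doubly_stochastic_step(w, X, y, batch_idx, block_idx, lr)
    return start, mean_squared_error(w, X, y)


start_loss, final_loss = run()
```

Each iteration reads only `batch` rows and writes only `block` coordinates, which is the memory-access saving the abstract refers to.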
