Abstract: Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continual pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on developing multilingual models efficiently, covering five key aspects: data curation, pre-training, post-training, model customization, and evaluation. We hope that the Sailor2 models (Apache 2.0 license) will drive language development in the SEA region, and that the Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
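A minimal usage sketch, assuming the checkpoints are released on Hugging Face; the repository identifier below is hypothetical and not confirmed by the abstract, so check the official release for the exact name.

```python
# Minimal sketch: loading a Sailor2 checkpoint with Hugging Face Transformers.
# "sail/Sailor2-1B" is an assumed identifier, not confirmed by the abstract.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-1B"  # hypothetical repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Translate to Indonesian: Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```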
Abstract: Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandit problems have limited adaptivity: they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent, or saddled with a sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds, and $T$-optimal for multi-armed bandit problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime, while simultaneously adapting to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.
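To make the mechanism concrete, here is a schematic of the stability-penalty decomposition in FTRL (our notation, not the paper's). The learner plays
\[
p_t \in \arg\min_{p \in \Delta_K} \Big\langle \textstyle\sum_{s<t} \hat{\ell}_s,\, p \Big\rangle + \frac{\psi(p)}{\eta_t},
\]
and the regret roughly decomposes as
\[
\mathrm{Reg}_T \lesssim \sum_{t=1}^{T} \Big( \underbrace{\eta_t\, z_t}_{\text{stability}} + \underbrace{\Big(\tfrac{1}{\eta_{t+1}} - \tfrac{1}{\eta_t}\Big) h_{t+1}}_{\text{penalty}} \Big),
\]
where $z_t$ and $h_t$ depend on the observed losses and the regularizer $\psi$. Real-time SPM tunes each learning rate $\eta_t$ so that the stability and penalty terms are matched round by round using only quantities observable at time $t$.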
Abstract: Current optimization-based control techniques for humanoid locomotion struggle to adapt step duration and placement simultaneously in dynamic walking gaits because they rely on fixed-time discretization, which limits responsiveness to terrain conditions and results in suboptimal performance in challenging environments. In this work, we propose a Gait-Net-augmented implicit kino-dynamic model-predictive control (MPC) scheme that simultaneously optimizes step location, step duration, and contact forces for natural variable-frequency locomotion. The proposed method uses a Gait-Net-augmented sequential convex MPC algorithm that resolves the multi-linear couplings among decision variables through iterative quadratic programs. At its core, a lightweight Gait-frequency Network (Gait-Net) determines the preferred step duration in the form of variable MPC sampling times, reducing step duration optimization to the parameter level. Additionally, the method enhances and updates the spatial reference trajectory within each sequential iteration by incorporating local solutions, allowing kinematic constraints to be projected onto the design of reference trajectories. We validate the proposed algorithm in high-fidelity simulations and on small-size humanoid hardware, demonstrating its capability for variable-frequency and 3-D discrete-terrain locomotion with only a one-step preview of terrain data.
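A self-contained toy (ours, not the paper's controller) illustrating the core pattern: a lightweight network fixes the step duration, and hence the MPC sampling time, which renders the remaining subproblem convex; the loop then re-solves it sequentially in receding-horizon fashion.

```python
# Toy Gait-Net-augmented sequential convex MPC on a 1-D double integrator.
# All names and gains here are illustrative stand-ins, not the paper's.
import numpy as np

def gait_net(com_vel):
    """Stand-in heuristic for the learned Gait-frequency Network:
    maps the CoM velocity to a preferred step duration (= MPC sampling time)."""
    return float(np.clip(0.4 - 0.2 * abs(com_vel), 0.2, 0.4))

def solve_convex_mpc(x0, v0, x_goal, dt, horizon=5, reg=1e-2):
    """With dt fixed by gait_net, the 1-D double-integrator tracking problem
    is convex and solvable in closed form as regularized least squares."""
    # Position after k steps of piecewise-constant accelerations a_j:
    #   x_k = x0 + v0*k*dt + dt^2 * sum_{j<k} (k - j - 0.5) * a_j
    k = np.arange(1, horizon + 1)
    J = np.array([[(i - j + 0.5) * dt**2 if j <= i else 0.0
                   for j in range(horizon)] for i in range(horizon)])
    target = x_goal - (x0 + v0 * k * dt)
    return np.linalg.solve(J.T @ J + reg * np.eye(horizon), J.T @ target)

x, v, x_goal = 0.0, 0.0, 1.0
for it in range(3):                      # sequential iterations
    dt = gait_net(v)                     # network sets the sampling time
    a = solve_convex_mpc(x, v, x_goal, dt)
    x += v * dt + 0.5 * a[0] * dt**2     # apply only the first control
    v += a[0] * dt
    print(f"iter {it}: dt={dt:.2f}s  x={x:.3f}  v={v:.3f}")
```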
Abstract: Humanoid robots have great potential to perform various human-level skills involving locomotion, manipulation, and cognitive capabilities. Driven by advances in machine learning and the strength of existing model-based approaches, these capabilities have progressed rapidly, but often separately. A timely overview of current progress and future trends in this fast-evolving field is therefore essential. This survey first summarizes the model-based planning and control methods that have been the backbone of humanoid robotics for the past three decades. We then explore emerging learning-based methods, with a focus on reinforcement learning and imitation learning, which enhance the versatility of loco-manipulation skills. We also examine the potential of integrating foundation models with humanoid embodiments, assessing the prospects for developing generalist humanoid agents. In addition, the survey covers emerging research on whole-body tactile sensing, which unlocks new humanoid skills involving physical interaction. The survey concludes with a discussion of the challenges and future trends.
Abstract: As legged robots take on roles in industrial and autonomous construction, collaborative loco-manipulation is crucial for handling large, heavy objects that exceed the capabilities of a single robot. However, ensuring the safety of these multi-robot tasks is essential to prevent accidents and guarantee reliable operation. This paper presents a hierarchical control system for object manipulation using a team of quadrupedal robots. Combining a motion planner with decentralized locomotion controllers in a hierarchical structure enables safe, adaptive planning for teams in complex scenarios. A high-level nonlinear model predictive control (MPC) planner generates collision-free paths by incorporating control barrier functions that account for both static and dynamic obstacles; the planner also computes contact points and forces while adapting to unknown object and terrain properties. A decentralized loco-manipulation controller then ensures that each robot maintains stable locomotion and manipulation based on the planner's guidance. The effectiveness of our method is evaluated in simulation under various conditions and validated on robot hardware in real-world setups. By modifying the object's configuration, the robot team can maneuver unknown objects through an environment containing both static and dynamic obstacles. Our code is publicly available in an open-source repository at \url{https://github.com/DRCL-USC/collaborative_loco_manipulation}.
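For reference, the discrete-time control barrier function condition typically embedded as an NMPC constraint (generic form in our notation; the paper's exact formulation may differ): given a safe set $\{x : h(x) \ge 0\}$, the planner enforces
\[
h(x_{k+1}) \ge (1 - \gamma)\, h(x_k), \qquad 0 < \gamma \le 1,
\]
at every step $k$ of the horizon, e.g. with $h(x_k) = \lVert p_k - p_{\mathrm{obs}} \rVert^2 - r_{\mathrm{safe}}^2$ for an obstacle at position $p_{\mathrm{obs}}$ and safety radius $r_{\mathrm{safe}}$, which keeps the planned trajectory collision-free by construction.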
Abstract: As legged robots are deployed in industrial and autonomous construction tasks requiring collaborative manipulation, they must handle object manipulation while maintaining stable locomotion. The challenge intensifies in real-world environments, where robots must traverse discrete terrain, avoid obstacles, and coordinate with teammates for safe loco-manipulation. This work addresses safe motion planning for collaborative manipulation of an unknown payload on discrete terrain while avoiding obstacles. Our approach uses two sets of model predictive controllers (MPCs) as motion planners: a global MPC generates a safe trajectory for the team with obstacle avoidance, while decentralized MPCs for each robot ensure safe footholds on discrete terrain as they follow the global trajectory. A model reference adaptive whole-body controller (MRA-WBC) then tracks the desired path, compensating for model uncertainties introduced by the unknown payload. We validated our method in simulation and on hardware with a team of Unitree robots. The results demonstrate that our approach successfully guides the team through obstacle courses requiring planar positioning and height adjustments, all on discrete terrain such as stepping stones.
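As background, the textbook model-reference adaptive structure that an MRA-WBC builds on (generic form; the paper's whole-body formulation will differ in detail): with tracking error $e = x - x_m$ against a reference model $\dot{x}_m = A_m x_m + B_m r$, the control $u = \hat{K}_x^{\top} x + \hat{K}_r^{\top} r$ uses gains adapted as
\[
\dot{\hat{K}}_x = -\Gamma_x\, x\, e^{\top} P B, \qquad
\dot{\hat{K}}_r = -\Gamma_r\, r\, e^{\top} P B,
\]
where $P \succ 0$ solves the Lyapunov equation $A_m^{\top} P + P A_m = -Q$. This adaptation is what lets the tracking controller absorb the unknown payload's effect on the dynamics.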
Abstract: The minimax sample complexity of group distributionally robust optimization (GDRO) has been determined up to a $\log(K)$ factor, where $K$ is the number of groups. In this work, we venture beyond the minimax perspective via a novel notion of sparsity that we dub $(\lambda, \beta)$-sparsity. In short, this condition means that at any parameter $\theta$, there is a set of at most $\beta$ groups whose risks at $\theta$ are all at least $\lambda$ larger than the risks of the remaining groups. To find an $\epsilon$-optimal $\theta$, we show via a novel algorithm and analysis that the $\epsilon$-dependent term in the sample complexity can trade a linear dependence on $K$ for a linear dependence on the potentially much smaller $\beta$. This improvement leverages recent progress in sleeping bandits, revealing a fundamental connection between the two-player zero-sum game optimization framework for GDRO and per-action regret bounds in sleeping bandits. The result above assumes a particular $\lambda$ given as input. Perhaps surprisingly, we next show an adaptive algorithm which, up to log factors, attains a sample complexity that adapts to the best $(\lambda, \beta)$-sparsity condition that holds. Finally, for a particular input $\lambda$, we also show how to obtain a dimension-free sample complexity result.
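In symbols (our notation): GDRO seeks $\min_{\theta} \max_{k \in [K]} R_k(\theta)$, and $(\lambda, \beta)$-sparsity requires that for every $\theta$ there exist a set $S(\theta) \subseteq [K]$ with $|S(\theta)| \le \beta$ such that
\[
\min_{k \in S(\theta)} R_k(\theta) \;\ge\; \max_{k \notin S(\theta)} R_k(\theta) + \lambda,
\]
so only the $\beta$ groups in $S(\theta)$ can attain the worst-case risk once the optimization error falls below $\lambda$.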
Abstract: Loco-manipulation calls for effective whole-body control and contact-rich interaction with the object and the environment. Existing learning-based control frameworks rely on task-specific engineered rewards, training a set of low-level skill policies and explicitly switching between them with a high-level policy or finite-state machine (FSM), which leads to quasi-static and fragile transitions between skills. In contrast, to solve a highly dynamic task such as soccer, a robot should run toward the ball and decelerate into an optimal approach configuration to seamlessly switch to dribbling and eventually score a goal, all as a continuum of smooth motion. To this end, we propose learning a single Oracle Guided Multi-mode Policy (OGMP) that masters all the modes and transition maneuvers needed to solve uni-object bipedal loco-manipulation tasks. Specifically, we design a multi-mode oracle as a closed-loop state-reference generator, viewing it as a hybrid automaton with continuous reference-generating dynamics and discrete mode jumps. Given such an oracle, we train an OGMP through bounded exploration around the generated reference. Furthermore, to steer the policy toward the desired sequence of mode transitions, we present a novel task-agnostic mode-switching preference reward that enhances performance. The proposed approach achieves successful dynamic loco-manipulation in omnidirectional soccer and box-moving tasks with the 16-DoF bipedal robot HECTOR. Supplementary video results are available at https://www.youtube.com/watch?v=gfDaRqobheg
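A toy sketch (ours, not the paper's oracle) of such a multi-mode reference generator viewed as a hybrid automaton: per-mode continuous flow of the reference, plus discrete jumps when guard conditions fire. Mode names and thresholds below are hypothetical.

```python
# Toy hybrid-automaton reference generator for a soccer-like task
# (illustrative; mode names and thresholds are not taken from the paper).
import numpy as np

def guard(mode, ref, ball):
    """Discrete jumps: switch mode as the reference nears the ball."""
    d = np.linalg.norm(ball - ref)
    if mode == "run_to_ball" and d < 0.5:
        return "dribble"
    if mode == "dribble" and d < 0.1:
        return "shoot"
    return mode

def flow(mode, ref, ball, dt=0.02):
    """Continuous reference-generating dynamics for the current mode."""
    # Approach an offset point first, then converge onto the ball.
    target = ball - np.array([0.4, 0.0]) if mode == "run_to_ball" else ball
    return ref + 1.5 * dt * (target - ref)

mode, ref, ball = "run_to_ball", np.zeros(2), np.array([2.0, 1.0])
for _ in range(300):
    mode = guard(mode, ref, ball)   # discrete jump
    ref = flow(mode, ref, ball)     # continuous flow
print(mode, np.round(ref, 2))       # -> "shoot", reference near the ball
```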
Abstract: Current humanoid push-recovery strategies often rely on whole-body motion, yet posture regulation is frequently overlooked; during manipulation tasks, for instance, the upper body may need to stay upright with minimal recovery displacement. This paper introduces a novel approach to enhancing humanoid push-recovery performance under unknown disturbances while regulating body posture by tailoring the recovery stepping strategy. We propose a hierarchical-MPC-based scheme that detects instability within the prediction window and recovers quickly by adapting the gait frequency. Our approach integrates a high-level nonlinear MPC, a posture-aware gait-frequency adaptation planner, and a low-level convex locomotion MPC. The planners predict center-of-mass (CoM) state trajectories that can be assessed for precursors of instability and posture deviation. In simulation, we demonstrate a 131% average improvement in the maximum recoverable impulse compared with baseline approaches. In hardware experiments, the proposed approach advances the recovery stepping timing/reflex by 125 ms. We also demonstrate improved push-recovery performance while keeping the attitude change under 0.2 rad.
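An illustrative simplification (ours, not the paper's detector) of how instability precursors in the CoM state can trigger earlier stepping: monitor the divergent component of motion (DCM) and shrink the step duration as the stability margin shrinks.

```python
# Illustrative DCM-based gait-frequency adaptation (a simplification; the
# paper assesses MPC-predicted CoM trajectories rather than a one-shot check).
import numpy as np

G, H = 9.81, 0.55                   # gravity (m/s^2), nominal CoM height (m)
OMEGA = np.sqrt(G / H)              # linear-inverted-pendulum frequency

def dcm(com_pos, com_vel):
    """Divergent component of motion (instantaneous capture point)."""
    return com_pos + com_vel / OMEGA

def adapt_step_duration(com_pos, com_vel, support_center, support_radius,
                        t_nom=0.35, t_min=0.18):
    margin = support_radius - np.linalg.norm(dcm(com_pos, com_vel) - support_center)
    if margin <= 0.0:               # DCM leaves the support region: step ASAP
        return t_min
    # Smaller margin -> shorter step duration (higher gait frequency).
    return t_min + (t_nom - t_min) * min(margin / support_radius, 1.0)

# A hard push (large CoM velocity) should shorten the next step:
print(adapt_step_duration(np.zeros(2), np.array([0.9, 0.0]), np.zeros(2), 0.12))
```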
Abstract: Achieving precise target jumping with legged robots poses a significant challenge due to the long flight phase and the uncertainties inherent in contact dynamics and hardware. Forcefully attempting these agile motions directly on hardware could result in severe failures and potential damage. Motivated by these challenges, we propose an Iterative Learning Control (ILC) approach that learns and refines jumping skills from easy to difficult, instead of attempting the hardest tasks directly. We verify that learning from simplicity improves safety and target-jumping accuracy over trials. Compared with other ILC approaches for legged locomotion, our method can handle the long flight phase, during which no control input is available. In addition, our approach allows the robot to transfer what it learns from a simple jumping task to more challenging tasks within a few trials directly on hardware, instead of learning from scratch. We validate the method via extensive experiments with the A1 quadruped, both in simulation and on hardware, across various jumping tasks. Starting from a small jump (e.g., a forward leap of 40 cm), our learning approach empowers the robot to accomplish a variety of challenging targets, including jumping onto a 20 cm-high box, jumping a greater distance of up to 60 cm, and performing jumps while carrying an unknown 2 kg payload. Our framework allows the robot to reach the desired position and orientation targets with approximate errors of 1 cm and 1 degree within a few trials.
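A minimal sketch of the easy-to-difficult idea in generic iterative-learning form (a toy, not the paper's controller, which operates on full control trajectories): refine a scalar takeoff command across trials against a ballistic flight model, carrying the learned command over as the target gets harder.

```python
# Toy easy-to-difficult iterative learning: u_{j+1} = u_j + L * e_j on a
# scalar takeoff speed (illustrative only; gains and model are assumptions).
import numpy as np

G = 9.81

def jump_distance(v_takeoff, angle=np.pi / 4):
    """Ballistic flight model: no control input during the flight phase."""
    return v_takeoff**2 * np.sin(2 * angle) / G

v, L = 1.5, 0.8                           # initial takeoff speed, learning gain
for target in [0.40, 0.60, 0.80]:         # progressively harder targets (m)
    for _ in range(10):                   # trials at this difficulty
        err = target - jump_distance(v)   # measured landing error
        v += L * err                      # iterative learning update
    print(f"target {target:.2f} m -> landed at {jump_distance(v):.3f} m")
```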