Abstract:Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models. However, a significant lack of high-quality, human-centric video datasets presents a challenge to progress in this field. To bridge this gap, we introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset characterized by precise and detailed captions that encompass both human appearance and motion states, along with supplementary human motion conditions, including skeleton sequences and speech audio. To validate the efficacy of this dataset and the associated training strategies, we propose an extension of existing classical diffusion transformer architectures and conduct further pretraining of our models on the proposed dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos while preserving performance in general video generation tasks. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs. Based on these insights and corresponding methodologies, the straightforward extended network trained on the proposed dataset demonstrates an obvious improvement in the generation of human-centric videos. Project page: https://fudan-generative-vision.github.io/OpenHumanVid
Abstract:Leveraging the computing and sensing capabilities of vehicles, vehicular federated learning (VFL) has been applied to edge training for connected vehicles. The dynamic and interconnected nature of vehicular networks presents unique opportunities to harness direct vehicle-to-vehicle (V2V) communications, enhancing VFL training efficiency. In this paper, we formulate a stochastic optimization problem to optimize the VFL training performance, considering the energy constraints and mobility of vehicles, and propose a V2V-enhanced dynamic scheduling (VEDS) algorithm to solve it. The model aggregation requirements of VFL and the limited transmission time due to mobility result in a stepwise objective function, which presents challenges in solving the problem. We thus propose a derivative-based drift-plus-penalty method to convert the long-term stochastic optimization problem to an online mixed integer nonlinear programming (MINLP) problem, and provide a theoretical analysis to bound the performance gap between the online solution and the offline optimal solution. Further analysis of the scheduling priority reduces the original problem to a set of convex optimization problems, which are efficiently solved using the interior-point method. Experimental results demonstrate that, compared with the state-of-the-art benchmarks, the proposed algorithm enhances the image classification accuracy on the CIFAR-10 dataset by 3.18% and reduces the average displacement errors on the Argoverse trajectory prediction dataset by 10.21%.
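The drift-plus-penalty idea mentioned in the abstract above can be illustrated with the generic textbook form of the method. This is a minimal sketch under stated assumptions, not the derivative-based variant or the VEDS algorithm from the paper: the action set, the per-slot energy `budget`, and the trade-off weight `V` are all hypothetical. A virtual queue tracks accumulated energy overuse, and each slot greedily picks the action that minimizes `V * penalty + queue * energy`.

```python
def drift_plus_penalty(actions, budget, V, num_slots):
    """Generic drift-plus-penalty loop (illustrative, not the paper's method).

    actions: list of (penalty, energy) pairs available in every slot (hypothetical).
    budget:  allowed time-averaged energy per slot (hypothetical).
    V:       weight trading penalty minimization against constraint violation.
    Returns (average penalty, average energy, final virtual queue length).
    """
    queue = 0.0          # virtual queue: backlog of energy used beyond the budget
    total_penalty = 0.0
    total_energy = 0.0
    for _ in range(num_slots):
        # Per-slot online decision: minimize V * penalty + queue * energy.
        penalty, energy = min(actions, key=lambda a: V * a[0] + queue * a[1])
        total_penalty += penalty
        total_energy += energy
        # Queue grows when the slot overspends and drains when it underspends.
        queue = max(queue + energy - budget, 0.0)
    return total_penalty / num_slots, total_energy / num_slots, queue


# Two hypothetical actions: stay idle (penalty 1, no energy) or transmit (no penalty, energy 1).
avg_penalty, avg_energy, final_queue = drift_plus_penalty(
    actions=[(1.0, 0.0), (0.0, 1.0)], budget=0.5, V=1.0, num_slots=1000
)
```

Because the queue update never undercounts overspending, the time-averaged energy in this sketch exceeds the budget by at most the final queue length divided by the number of slots. A larger `V` favors a lower penalty at the cost of a longer queue.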
Abstract:Hierarchical federated learning (HFL) enables distributed training of models across multiple devices with the help of several edge servers and a cloud edge server in a privacy-preserving manner. In this paper, we consider HFL with highly mobile devices, mainly targeting vehicular networks. Through convergence analysis, we show that mobility influences the convergence speed by both fusing the edge data and shuffling the edge models. While mobility is usually considered a challenge from the perspective of communication, we prove that it increases the convergence speed of HFL with edge-level heterogeneous data, since more diverse data can be incorporated. Furthermore, we demonstrate that a higher speed leads to faster convergence, since it accelerates the fusion of data. Simulation results show that mobility increases the model accuracy of HFL by up to 15.1% when training a convolutional neural network on the CIFAR-10 dataset.
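The two-level aggregation that HFL relies on can be sketched as follows. This is an illustrative sketch under assumptions, not the training procedure analyzed in the paper: models are flat lists of floats, aggregation is sample-weighted averaging in the style of federated averaging, and the function names are hypothetical.

```python
def weighted_average(models, weights):
    """Coordinate-wise weighted average of equally sized parameter vectors."""
    total = sum(weights)
    dim = len(models[0])
    return [sum(w * m[i] for m, w in zip(models, weights)) / total for i in range(dim)]


def hierarchical_aggregate(edges):
    """Aggregate device models at each edge server, then edge models at the cloud.

    edges: one list per edge server, holding (model, num_samples) pairs for the
           devices currently attached to that edge (hypothetical data layout).
    """
    edge_models, edge_sizes = [], []
    for devices in edges:
        models = [model for model, _ in devices]
        sizes = [num_samples for _, num_samples in devices]
        edge_models.append(weighted_average(models, sizes))  # edge-level aggregation
        edge_sizes.append(sum(sizes))
    return weighted_average(edge_models, edge_sizes)          # cloud-level aggregation


# Two devices under the first edge server, one device under the second.
edges = [
    [([1.0, 0.0], 1), ([3.0, 0.0], 1)],
    [([0.0, 4.0], 2)],
]
global_model = hierarchical_aggregate(edges)
```

In this picture, mobility corresponds to devices changing which inner list they belong to between rounds, so each edge model gradually averages over data from more devices.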
Abstract:This work evaluates the impact of time step frequency and component scale on robotic manipulation simulation accuracy. Increasing the time step frequency for small-scale objects is shown to improve simulation accuracy. This simulation, demonstrating pre-assembly part picking for two object geometries, serves as a starting point for discussing how to improve Sim2Real transfer in robotic assembly processes.
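The link between time step frequency and accuracy for small parts can be illustrated with a toy integrator. This is a sketch under strong assumptions and not the simulation used in the work: contact is modeled as an undamped linear spring, integration is explicit Euler, and the masses, stiffness, and step rates are hypothetical. A lighter part on the same contact stiffness oscillates faster, so it needs a higher step rate for comparable accuracy.

```python
import math


def euler_position_error(mass, stiffness, frequency_hz, duration=0.1):
    """Explicit-Euler simulation of a mass on a spring, released from rest at x = 1.

    Returns the absolute position error against the analytic solution
    x(t) = cos(omega * t) at the final simulated time.
    """
    dt = 1.0 / frequency_hz
    steps = int(round(duration * frequency_hz))
    x, v = 1.0, 0.0
    for _ in range(steps):
        a = -(stiffness / mass) * x
        # Both updates use the state from the start of the step (explicit Euler).
        x, v = x + v * dt, v + a * dt
    omega = math.sqrt(stiffness / mass)
    exact = math.cos(omega * steps * dt)
    return abs(x - exact)


# Hypothetical small part (10 g) at two step rates, and a heavier part (1 kg) at the lower rate.
small_coarse = euler_position_error(mass=0.01, stiffness=100.0, frequency_hz=1000.0)
small_fine = euler_position_error(mass=0.01, stiffness=100.0, frequency_hz=10000.0)
large_coarse = euler_position_error(mass=1.0, stiffness=100.0, frequency_hz=1000.0)
```

Under these assumptions, the small part has a much larger error than the heavier part at the same step rate, and raising the step rate shrinks that error.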
Abstract:Federated learning enables distributed training of machine learning (ML) models across multiple devices in a privacy-preserving manner. Hierarchical federated learning (HFL) is further proposed to meet the requirements of both latency and coverage. In this paper, we consider a data-heterogeneous HFL scenario with mobility, mainly targeting vehicular networks. We derive the convergence upper bound of HFL with respect to mobility and data heterogeneity, and analyze how mobility impacts the performance of HFL. While mobility is considered a challenge from a communication point of view, our goal here is to exploit mobility to improve the learning performance by mitigating data heterogeneity. Simulation results verify the analysis and show that mobility can indeed improve the model accuracy by up to 15.1% when training a convolutional neural network on the CIFAR-10 dataset using HFL.
Abstract:We present a multi-robot task and motion planning method that, when applied to the rearrangement of objects by manipulators, produces solution times up to three orders of magnitude faster than existing methods. We achieve this improvement by decomposing the planning space into subspaces for independent manipulators, objects, and manipulators holding objects. We represent this decomposition with a hypergraph where vertices are substates and hyperarcs are transitions between substates. Existing methods use graph-based representations where vertices are full states and edges are transitions between states. Using the hypergraph reduces the size of the planning space: for multi-manipulator object rearrangement, the number of hypergraph vertices scales linearly with the number of either robots or objects, while the number of hyperarcs scales quadratically with the number of robots and linearly with the number of objects. In contrast, the number of vertices and edges in graph-based representations scales exponentially in the number of robots and objects. Additionally, the hypergraph provides a structure to reason over varying levels of (de)coupled spaces and the transitions between them, enabling a hybrid search of the planning space. We show that similar gains can be achieved for other multi-robot task and motion planning problems.
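The scaling contrast described above can be made concrete with a toy counting model. The counts below are assumptions chosen only to reproduce the stated growth rates and are not the paper's construction: substates are taken to be a robot alone, an object alone, or a robot holding an object; transitions are grasp/release pairs and robot-to-robot handoffs; and a full state assigns every object to a location or a gripper while ignoring collisions and gripper capacity.

```python
def hypergraph_size(num_robots, num_objects):
    """Vertex and hyperarc counts under the assumed substate decomposition."""
    # Vertices: each robot alone, each object alone, each (robot holding object) pair.
    vertices = num_robots + num_objects + num_robots * num_objects
    # Hyperarcs: one grasp and one release per (robot, object) pair ...
    grasp_release = 2 * num_robots * num_objects
    # ... plus one handoff per ordered pair of distinct robots and per object.
    handoffs = num_robots * (num_robots - 1) * num_objects
    return vertices, grasp_release + handoffs


def full_state_count(num_robots, num_objects, num_locations):
    """Full-state count when each object sits at a location or in a gripper (assumed model)."""
    return (num_locations + num_robots) ** num_objects


# Hypothetical instance: 2 robots, 3 objects, 4 placement locations.
vertices, hyperarcs = hypergraph_size(2, 3)
states = full_state_count(2, 3, 4)
```

In this toy model, adding one object adds a fixed number of hypergraph vertices but multiplies the number of full states by a constant factor.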
Abstract:A robot needs multiple interaction modes to robustly collaborate with a human in complicated industrial tasks. We develop a Coexistence-and-Cooperation (CoCo) human-robot collaboration system. Coexistence mode enables the robot to work with the human on different sub-tasks independently in a shared space. Cooperation mode enables the robot to follow human guidance and recover from failures. A human intention tracking algorithm takes both human and robot motion measurements as input and switches between the interaction modes. We demonstrate the effectiveness of the CoCo system in a use case analogous to a real-world multi-step assembly task.
Abstract:Significant progress in robotics reveals new opportunities to advance manufacturing. Next-generation industrial automation will require both integration of distinct robotic technologies and their application to challenging industrial environments. This paper presents lessons from a collaborative assembly project between three academic research groups and an industry partner. The goal of the project is to develop a flexible, safe, and productive manufacturing cell for sub-centimeter precision assembly. Solving this problem in a high-mix, low-volume production line motivates multiple research thrusts in robotics. This work identifies new directions in collaborative robotics for industrial applications and offers insight toward strengthening collaborations between institutions in academia and industry on the development of new technologies.
Abstract:Collaborative robots require effective intention estimation to safely and smoothly work with humans in less structured tasks such as industrial assembly. During these tasks, human intention continuously changes across multiple steps, and is composed of a hierarchy including high-level interactive intention and low-level task intention. Thus, we propose the concept of intention tracking and introduce a collaborative robot system with a hierarchical framework that concurrently tracks intentions at both levels by observing force/torque measurements, robot state sequences, and tracked human trajectories. The high-level intention estimate enables the robot to both (1) safely avoid collision with the human to minimize interruption and (2) cooperatively approach the human and help recover from an assembly failure through admittance control. The low-level intention estimate provides the robot with task-specific information (e.g., which part the human is working on) for concurrent task execution. We implement the system on a UR5e robot, and demonstrate robust, seamless and ergonomic collaboration between the human and the robot in an assembly use case through an ablative pilot study.
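Admittance control, mentioned above as the mechanism for cooperative recovery, maps a measured external force to robot motion through virtual dynamics. The following is a minimal one-dimensional sketch under assumptions, not the controller implemented on the UR5e: the virtual mass, damping, time step, and function name are hypothetical, and the virtual stiffness is set to zero so that the law is `M * a + D * v = F`.

```python
def simulate_admittance(force, mass=2.0, damping=10.0, dt=0.001, steps=5000):
    """Integrate the virtual dynamics M * a + D * v = F for a constant external force.

    Returns the commanded (position, velocity) after `steps` control cycles.
    All parameter values are hypothetical.
    """
    x, v = 0.0, 0.0
    for _ in range(steps):
        a = (force - damping * v) / mass  # acceleration from the virtual dynamics
        v += a * dt                       # commanded velocity
        x += v * dt                       # commanded position
    return x, v


# A steady 5 N push from the human drives the commanded velocity toward force / damping.
position, velocity = simulate_admittance(force=5.0)
```

With a constant force, the commanded velocity settles at `force / damping`, so a lower virtual damping makes the robot feel lighter to guide by hand.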
Abstract:This paper presents accessibility and small-time local controllability (STLC) results for $N$-link horizontal planar manipulators with only one unactuated joint. STLC is important in controls, both for design considerations and because large, swinging maneuvers may be avoided for close reconfiguration if a system is STLC. Although the controllability of underactuated horizontal planar manipulators has been extensively studied, most work has focused only on three-link manipulators and global controllability. This paper thus makes two contributions: 1) using Lie brackets to study the accessibility and STLC of underactuated two-link manipulators with different actuator configurations, and illustrating the results from the perspective of system dynamics; 2) obtaining accessibility and STLC results for $N$-link manipulators with one unactuated joint by considering realistic models and different actuator configurations. It is found that an $N$-link manipulator ($N\geq 3$) with the first joint actuated is STLC for a subset of equilibrium points, based on Sussmann's general theorem for STLC. Moreover, considering the dynamics of the $N$-link manipulator in the controllability analysis yields relatively simple forms for the nontrivial vector fields, which makes it easy to determine the configurations at which a model loses the full-rank condition for accessibility.
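For reference, the standard objects behind the Lie-bracket analysis mentioned above can be written down as follows. The notation is assumed for illustration and is not taken from the paper: the manipulator is written as a control-affine system with state $x = (q, \dot{q}) \in \mathbb{R}^{2N}$, drift field $f$, and $m = N - 1$ input fields $g_i$.

```latex
% Control-affine form (assumed notation): N joints, one unactuated, so m = N - 1 inputs.
\dot{x} = f(x) + \sum_{i=1}^{m} g_i(x)\, u_i , \qquad x \in \mathbb{R}^{2N}

% Lie bracket of two vector fields f and g.
[f, g](x) = \frac{\partial g}{\partial x}(x)\, f(x) - \frac{\partial f}{\partial x}(x)\, g(x)

% Accessibility (Lie algebra) rank condition at x_0: the fields f, g_1, ..., g_m
% and their iterated brackets span the full state space.
\dim \operatorname{span} \bigl\{ f,\; g_i,\; [f, g_i],\; [g_i, g_j],\; [f, [f, g_i]],\; \dots \bigr\}(x_0) = 2N
```

The rank condition gives accessibility only. STLC at an equilibrium needs further conditions on particular brackets, which is where Sussmann's general theorem comes in.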