Abstract:Customized image generation is crucial for delivering personalized content based on user-provided image prompts, aligning large-scale text-to-image diffusion models with individual needs. However, existing models often overlook the relationships between customized objects in generated images. This work addresses that gap by focusing on relation-aware customized image generation, which aims to preserve the identities from image prompts while maintaining the predicate relations described in text prompts. Specifically, we introduce RelationBooth, a framework that disentangles identity and relation learning through a well-curated dataset. Our training data consists of relation-specific images, independent object images containing identity information, and text prompts that guide relation generation. We then propose two key modules to tackle the two main challenges: generating accurate and natural relations, especially when significant pose adjustments are required, and avoiding object confusion in cases of overlap. First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses closely tied to their relationships. Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases. Extensive results on three benchmarks demonstrate the superiority of RelationBooth in generating precise relations while preserving object identities across a diverse set of objects and relations. The source code and trained models will be made publicly available.
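To make the keypoint matching loss above concrete, the following is a minimal PyTorch sketch of one way such a loss could be written; the (B, K, 2) keypoint layout, the squared-distance penalty, and the optional visibility weighting are illustrative assumptions, not RelationBooth's exact formulation.

```python
import torch

def keypoint_matching_loss(pred_keypoints, target_keypoints, visibility=None):
    """Penalize deviation between predicted and target 2D keypoints.

    pred_keypoints, target_keypoints: (B, K, 2) tensors of normalized coordinates.
    visibility: optional (B, K) mask that down-weights occluded keypoints.
    """
    # Per-keypoint squared distance between predicted and target locations.
    dist = ((pred_keypoints - target_keypoints) ** 2).sum(dim=-1)  # (B, K)
    if visibility is not None:
        dist = dist * visibility
        return dist.sum() / visibility.sum().clamp(min=1.0)
    return dist.mean()

# Hypothetical usage: keypoints extracted from the generated image vs. keypoints
# consistent with the relation described in the text prompt.
pred = torch.rand(4, 17, 2, requires_grad=True)
target = torch.rand(4, 17, 2)
loss = keypoint_matching_loss(pred, target)
loss.backward()
```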
Abstract:We propose a method for formally certifying and quantifying individual fairness of deep neural networks (DNNs). Individual fairness guarantees that any two individuals who are identical except for a legally protected attribute (e.g., gender or race) receive the same treatment. While there are existing techniques that provide such a guarantee, they tend to suffer from a lack of scalability or accuracy as the size and input dimension of the DNN increase. Our method overcomes this limitation by applying abstraction to a symbolic-interval-based analysis of the DNN, followed by iterative refinement guided by the fairness property. Furthermore, our method lifts the symbolic-interval-based analysis from conventional qualitative certification to quantitative certification by computing the percentage of individuals whose classification outputs are provably fair, instead of merely deciding whether the DNN is fair. We have implemented our method and evaluated it on deep neural networks trained on four popular fairness research datasets. The experimental results show that our method is not only more accurate than state-of-the-art techniques but also several orders of magnitude faster.
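As a rough illustration of the underlying symbolic-interval idea (without the abstraction and refinement steps described above), the following numpy sketch propagates an input box, in which only a protected attribute is allowed to vary, through a toy ReLU network and checks whether one output class provably dominates; the network, attribute encoding, and dominance check are hypothetical stand-ins, not the paper's procedure.

```python
import numpy as np

def interval_forward(lo, hi, weights, biases):
    """Propagate an input box [lo, hi] through a ReLU MLP using interval arithmetic."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        new_lo = W_pos @ lo + W_neg @ hi + b
        new_hi = W_pos @ hi + W_neg @ lo + b
        lo, hi = new_lo, new_hi
        if i < len(weights) - 1:          # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

# Toy 3-input, 2-class network; feature 0 plays the role of the protected attribute.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 3)), rng.normal(size=(2, 8))]
biases = [np.zeros(8), np.zeros(2)]

x = np.array([0.3, 0.7, 0.1])            # one individual
lo, hi = x.copy(), x.copy()
lo[0], hi[0] = 0.0, 1.0                  # protected attribute ranges over its whole domain

out_lo, out_hi = interval_forward(lo, hi, weights, biases)
# Certified fair for this individual if one output logit provably dominates the
# other over the entire box (i.e., the two output intervals do not overlap).
fair = bool(out_lo[0] > out_hi[1] or out_lo[1] > out_hi[0])
print("provably fair for this individual:", fair)
```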
Abstract:Out-of-Time-Ordered Correlators (OTOCs) are widely used to investigate information scrambling in quantum systems. However, directly computing OTOCs with classical computers is often impractical, because it requires simulating the dynamics of quantum many-body systems, which entails computational costs that scale exponentially with system size. Similarly, exact simulation of the dynamics with a quantum computer (QC) will generally require a fault-tolerant QC, which is currently beyond technological capabilities. Therefore, alternative approaches are needed for computing OTOCs and related quantities. In this study, we explore four parameterised sets of Hamiltonians describing quantum systems of interest in condensed matter physics. For each set, we investigate whether classical kernel methods can accurately learn the XZ-OTOC, as well as a particular sum of OTOCs, as functions of the Hamiltonian parameters. We frame the problem as a regression task, generating labelled data via an efficient numerical algorithm that uses matrix product operators to simulate quantum many-body systems with up to 40 qubits. Using this data, we train a variety of standard kernel machines and observe that the best kernels consistently achieve a high coefficient of determination ($R^2$) on the testing sets, typically between 0.9 and 0.99, and almost always exceeding 0.8. This demonstrates that classical kernels supplied with a moderate amount of training data can closely and efficiently approximate OTOCs and related quantities for a diverse range of quantum many-body systems.
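The regression setup can be illustrated with a minimal scikit-learn sketch; the Hamiltonian parameter vectors and OTOC labels below are synthetic placeholders standing in for the matrix-product-operator-generated training data described above, and kernel ridge regression with an RBF kernel is used as one representative classical kernel machine.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Placeholder data: each row is a Hamiltonian parameter vector, y is the scalar
# OTOC value that the matrix-product-operator simulation would assign to it.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 4))
y = np.sin(X @ np.array([1.0, 0.5, -0.8, 0.3])) + 0.01 * rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the kernel machine and evaluate the coefficient of determination on held-out data.
model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=1.0)
model.fit(X_train, y_train)
print("test R^2:", r2_score(y_test, model.predict(X_test)))
```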
Abstract:We present SMPLOlympics, a collection of physically simulated environments that allow humanoids to compete in a variety of Olympic sports. Sports simulation offers a rich and standardized testing ground for evaluating and improving the capabilities of learning algorithms due to the diversity and physically demanding nature of athletic activities. As humans have competed in these sports for many years, there is also a wealth of existing knowledge on preferred strategies for achieving better performance. To leverage these existing human demonstrations from videos and motion capture, we design our humanoid to be compatible with the widely used SMPL and SMPL-X human models from the vision and graphics community. We provide a suite of individual sports environments, including golf, javelin throw, high jump, long jump, and hurdling, as well as competitive sports, including both 1v1 and 2v2 games such as table tennis, tennis, fencing, boxing, soccer, and basketball. Our analysis shows that combining strong motion priors with simple rewards can result in human-like behavior in various sports. By providing a unified sports benchmark and baseline implementations of state and reward designs, we hope that SMPLOlympics can help the control and animation communities achieve human-like and performant behaviors.
Abstract:Recent years have seen significant advancements in humanoid control, largely due to the availability of large-scale motion capture data and the application of reinforcement learning methodologies. However, many real-world tasks, such as moving large and heavy furniture, require multi-character collaboration. Given the scarcity of data on multi-character collaboration and the efficiency challenges associated with multi-agent learning, these tasks cannot be straightforwardly addressed using training paradigms designed for single-agent scenarios. In this paper, we introduce Cooperative Human-Object Interaction (CooHOI), a novel framework that addresses multi-character object transport through a two-phase learning paradigm: individual skill acquisition and subsequent transfer. Initially, a single agent learns to perform tasks using the Adversarial Motion Priors (AMP) framework. Following this, the agent learns to collaborate with others by considering the shared dynamics of the manipulated object during parallel training with Multi-Agent Proximal Policy Optimization (MAPPO). When one agent interacts with the object, producing specific changes in the object's dynamics, the other agents learn to respond appropriately, thereby achieving implicit communication and coordination between teammates. Unlike previous approaches that rely on tracking-based methods for multi-character HOI, CooHOI is inherently efficient, does not depend on motion capture data of multi-character interactions, and can be seamlessly extended to more participants and a wide range of object types.
Abstract:The advancement of diffusion models has pushed the boundary of text-to-3D object generation. While it is straightforward to composite objects into a scene with reasonable geometry, it is nontrivial to texture such a scene perfectly due to style inconsistency and occlusions between objects. To tackle these problems, we propose a coarse-to-fine 3D scene texturing framework, referred to as RoomTex, to generate high-fidelity and style-consistent textures for untextured compositional scene meshes. In the coarse stage, RoomTex first unwraps the scene mesh to a panoramic depth map and leverages ControlNet to generate a room panorama, which serves as a coarse reference to ensure global texture consistency. In the fine stage, based on the panoramic image and perspective depth maps, RoomTex iteratively refines and textures each object in the room along a series of selected camera views until the object is completely painted. Moreover, we propose to maintain superior alignment between the RGB and depth spaces via subtle edge detection methods. Extensive experiments show that our method is capable of generating high-quality and diverse room textures and, more importantly, supports interactive fine-grained texture control and flexible scene editing thanks to our inpainting-based framework and compositional mesh input. Our project page is available at https://qwang666.github.io/RoomTex/.
Abstract:We address the challenge of content diversity and controllability in pedestrian simulation for driving scenarios. Recent pedestrian animation frameworks have a significant limitation: they primarily focus on either following a given trajectory [46] or reproducing the content of a reference video [57], consequently overlooking the potential diversity of human motion in such scenarios. This limitation restricts their ability to generate pedestrian behaviors with a wider range of variations and realistic motions, and therefore restricts their usage in providing rich motion content for other components of the driving simulation system, e.g., sudden motion changes to which the autonomous vehicle should respond. In our approach, we strive to surpass this limitation by simulating diverse human motions obtained from various sources, such as generated human motions, in addition to following the given trajectory. The fundamental contribution of our framework lies in combining the motion tracking task with trajectory following, which enables tracking of specific motion parts (e.g., the upper body) while simultaneously following the given trajectory with a single policy. In this way, we significantly enhance both the diversity of simulated human motion within a given scenario and the controllability of the content, including language-based control. Our framework facilitates the generation of a wide range of human motions, contributing to greater realism and adaptability in pedestrian simulations for driving scenarios. More information is on our project page https://wangjingbo1219.github.io/papers/CVPR2024_PACER_PLUS/PACERPLUSPage.html .
Abstract:This work introduces MotionLCM, extending controllable motion generation to a real-time level. Existing methods for spatial control in text-conditioned motion generation suffer from significant runtime inefficiency. To address this issue, we first propose the motion latent consistency model (MotionLCM) for motion generation, building upon the motion latent diffusion model (MLD). By employing one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model for motion generation. To ensure effective controllability, we incorporate a motion ControlNet within the latent space of MotionLCM and enable explicit control signals (e.g., a pelvis trajectory) in the vanilla motion space to control the generation process directly, similar to controlling other latent-free diffusion models for motion generation. By employing these techniques, our approach can generate human motions from text and control signals in real time. Experimental results demonstrate the remarkable generation and controlling capabilities of MotionLCM while maintaining real-time runtime efficiency.
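As a rough sketch of what one-step sampling from a latent consistency model with latent-space control could look like, the following PyTorch snippet uses hypothetical stand-ins (`consistency_model`, `motion_decoder`, and the conditioning arguments) rather than MotionLCM's actual interfaces.

```python
import torch

@torch.no_grad()
def one_step_sample(consistency_model, motion_decoder, text_emb, control_signal,
                    latent_shape, sigma_max=80.0):
    """Draw a motion in a single denoising step from a latent consistency model.

    text_emb:       conditioning embedding of the text prompt.
    control_signal: spatial control (e.g., a pelvis trajectory), assumed to be
                    injected through a latent-space ControlNet inside the model.
    """
    # Start from pure noise at the largest noise level.
    z_T = sigma_max * torch.randn(latent_shape)
    # The consistency function maps the noisy latent directly to a clean latent.
    z_0 = consistency_model(z_T, sigma_max, text_emb, control_signal)
    # Decode the clean latent back into a joint-space motion sequence.
    return motion_decoder(z_0)

# Hypothetical usage with identity-like placeholder modules.
latent_shape = (1, 16, 256)
dummy_cm = lambda z, sigma, txt, ctrl: z / (1.0 + sigma)  # stands in for the trained consistency model
dummy_dec = lambda z: z                                    # stands in for the motion VAE decoder
motion = one_step_sample(dummy_cm, dummy_dec, None, None, latent_shape)
print(motion.shape)
```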
Abstract:This paper addresses the task of 3D clothed human generation from textual descriptions. Previous works usually encode the human body and clothes as a holistic model and generate the whole model in a single-stage optimization, which makes them struggle with clothing editing and lose fine-grained control over the generation process. To solve this, we propose a layer-wise clothed human representation combined with a progressive optimization strategy, which produces clothing-disentangled 3D human models while providing control over the generation process. The basic idea is to progressively generate a minimally clothed human body and layer-wise clothes. During clothing generation, a novel stratified compositional rendering method is proposed to fuse multi-layer human models, and a new loss function is utilized to help decouple the clothing model from the human body. The proposed method achieves high-quality disentanglement, which thereby provides an effective way for 3D garment generation. Extensive experiments demonstrate that our approach achieves state-of-the-art 3D clothed human generation while also supporting cloth editing applications such as virtual try-on. Project page: http://jtdong.com/tela_layer/
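The stratified compositional rendering method is only named above; as an illustration of how multi-layer renders might be fused, here is a minimal back-to-front alpha-compositing sketch in PyTorch, where the innermost-first layer ordering and the per-layer opacity maps are assumptions rather than the paper's actual renderer.

```python
import torch

def composite_layers(rgbs, alphas):
    """Fuse per-layer renders (body first, then clothing layers) back-to-front.

    rgbs:   list of (H, W, 3) color images, one per layer, innermost first.
    alphas: list of (H, W, 1) opacity maps from each layer's renderer.
    """
    out = torch.zeros_like(rgbs[0])
    for rgb, alpha in zip(rgbs, alphas):      # later (outer) layers overwrite earlier ones
        out = alpha * rgb + (1.0 - alpha) * out
    return out

# Hypothetical usage: a body layer plus one garment layer.
body_rgb, body_a = torch.rand(256, 256, 3), torch.ones(256, 256, 1)
cloth_rgb, cloth_a = torch.rand(256, 256, 3), (torch.rand(256, 256, 1) > 0.5).float()
image = composite_layers([body_rgb, cloth_rgb], [body_a, cloth_a])
print(image.shape)
```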
Abstract:Generating natural human grasps necessitates consideration of not just object geometry but also semantic information. Relying solely on object shape for grasp generation limits the applicability of prior methods in downstream tasks. This paper presents a novel semantic-based grasp generation method, termed SemGrasp, which generates a static human grasp pose by incorporating semantic information into the grasp representation. We introduce a discrete representation that aligns the grasp space with the semantic space, enabling the generation of grasp postures in accordance with language instructions. A Multimodal Large Language Model (MLLM) is subsequently fine-tuned, integrating object, grasp, and language within a unified semantic space. To facilitate the training of SemGrasp, we have compiled a large-scale, grasp-text-aligned dataset named CapGrasp, featuring about 260k detailed captions and 50k diverse grasps. Experimental findings demonstrate that SemGrasp efficiently generates natural human grasps in alignment with linguistic intentions. Our code, models, and dataset are publicly available at: https://kailinli.github.io/SemGrasp.