Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jingmin Chen

Stimulating Imagination: Towards General-purpose Object Rearrangement

Aug 03, 2024

Jianyang Wu, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen

Figure 1 for Stimulating Imagination: Towards General-purpose Object Rearrangement

Figure 2 for Stimulating Imagination: Towards General-purpose Object Rearrangement

Figure 3 for Stimulating Imagination: Towards General-purpose Object Rearrangement

Figure 4 for Stimulating Imagination: Towards General-purpose Object Rearrangement

Abstract:General-purpose object placement is a fundamental capability of an intelligent generalist robot, i.e., being capable of rearranging objects following human instructions even in novel environments. To achieve this, we break the rearrangement down into three parts, including object localization, goal imagination and robot control, and propose a framework named SPORT. SPORT leverages pre-trained large vision models for broad semantic reasoning about objects, and learns a diffusion-based 3D pose estimator to ensure physically-realistic results. Only object types (to be moved or reference) are communicated between these two parts, which brings two benefits. One is that we can fully leverage the powerful ability of open-set object localization and recognition since no specific fine-tuning is needed for robotic scenarios. Furthermore, the diffusion-based estimator only need to "imagine" the poses of the moving and reference objects after the placement, while no necessity for their semantic information. Thus the training burden is greatly reduced and no massive training is required. The training data for goal pose estimation is collected in simulation and annotated with GPT-4. A set of simulation and real-world experiments demonstrate the potential of our approach to accomplish general-purpose object rearrangement, placing various objects following precise instructions.

* 9 pages

Via

Access Paper or Ask Questions

VideoTetris: Towards Compositional Text-to-Video Generation

Jun 06, 2024

Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan(+2 more)

Abstract:Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may face challenges when handling complex (long) video generation scenarios that involve multiple objects or dynamic changes in object numbers. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we propose spatio-temporal compositional diffusion to precisely follow complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing to enhance the training data regarding motion dynamics and prompt understanding, equipped with a new reference frame attention mechanism to improve the consistency of auto-regressive video generation. Extensive experiments demonstrate that our VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. Code is available at: https://github.com/YangLing0818/VideoTetris

* Code: https://github.com/YangLing0818/VideoTetris

Via

Access Paper or Ask Questions

Exploiting Behavioral Consistence for Universal User Representation

Dec 11, 2020

Jie Gu, Feng Wang, Qinghui Sun, Zhiquan Ye, Xiaoxiao Xu, Jingmin Chen, Jun Zhang

Figure 1 for Exploiting Behavioral Consistence for Universal User Representation

Figure 2 for Exploiting Behavioral Consistence for Universal User Representation

Figure 3 for Exploiting Behavioral Consistence for Universal User Representation

Figure 4 for Exploiting Behavioral Consistence for Universal User Representation

Abstract:User modeling is critical for developing personalized services in industry. A common way for user modeling is to learn user representations that can be distinguished by their interests or preferences. In this work, we focus on developing universal user representation model. The obtained universal representations are expected to contain rich information, and be applicable to various downstream applications without further modifications (e.g., user preference prediction and user profiling). Accordingly, we can be free from the heavy work of training task-specific models for every downstream task as in previous works. In specific, we propose Self-supervised User Modeling Network (SUMN) to encode behavior data into the universal representation. It includes two key components. The first one is a new learning objective, which guides the model to fully identify and preserve valuable user information under a self-supervised learning framework. The other one is a multi-hop aggregation layer, which benefits the model capacity in aggregating diverse behaviors. Extensive experiments on benchmark datasets show that our approach can outperform state-of-the-art unsupervised representation methods, and even compete with supervised ones.

* Preprint of accepted AAAI2021 paper

Via

Access Paper or Ask Questions