Junbang Liang

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Jun 24, 2024

Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, Carl Vondrick

Abstract: A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.

* Project page: https://dreamitate.cs.columbia.edu/

PaperBot: Learning to Design Real-World Tools Using Paper

Mar 14, 2024

Ruoshi Liu, Junbang Liang, Sruthi Sudhakar, Huy Ha, Cheng Chi, Shuran Song, Carl Vondrick

Abstract: Paper is a cheap, recyclable, and clean material that is often used to make practical tools. Traditional tool design relies on either simulation or physical analysis, which is often inaccurate and time-consuming. In this paper, we propose PaperBot, an approach that directly learns to design and use a tool in the real world using paper without human intervention. We demonstrate the effectiveness and efficiency of PaperBot on two tool design tasks: (1) learning to fold and throw paper airplanes for maximum travel distance and (2) learning to cut paper into grippers that exert maximum gripping force. We present a self-supervised learning framework that learns to perform a sequence of folding, cutting, and dynamic manipulation actions in order to optimize the design and use of a tool. We deploy our system to a real-world two-arm robotic system to solve challenging design tasks that involve aerodynamics (paper airplane) and friction (paper gripper) that are impossible to simulate accurately.

* Project Website: https://paperbot.cs.columbia.edu/

SHARE: Single-view Human Adversarial REconstruction

Dec 30, 2023

Shreelekha Revankar, Shijia Liao, Yu Shen, Junbang Liang, Huaishu Peng, Ming Lin

Abstract: The accuracy of 3D Human Pose and Shape reconstruction (HPS) from an image is progressively improving. Yet, no known method is robust across all image distortions. To address issues due to variations of camera poses, we introduce SHARE, a novel fine-tuning method that utilizes adversarial data augmentation to enhance the robustness of existing HPS techniques. We perform a comprehensive analysis of the impact of camera poses on HPS reconstruction outcomes. We first generate large-scale image datasets captured systematically from diverse camera perspectives. We then establish a mapping between camera poses and reconstruction errors as a continuous function that characterizes the relationship between camera poses and HPS quality. Leveraging this representation, we introduce RoME (Regions of Maximal Error), a novel sampling technique for our adversarial fine-tuning method. The SHARE framework is generalizable across various single-view HPS methods and we demonstrate its performance on HMR, SPIN, PARE, CLIFF and ExPose. Our results illustrate a reduction in mean joint errors across single-view HPS techniques, for images captured from multiple camera positions, without compromising their baseline performance. In many challenging cases, our method surpasses the performance of existing models, highlighting its practical significance for diverse real-world applications.

VLAP: Efficient Video-Language Alignment via Frame Prompting and Distilling for Video Question Answering

Dec 13, 2023

Xijun Wang, Junbang Liang, Chun-Kai Wang, Kenan Deng, Yu Lou, Ming Lin, Shan Yang

Abstract: In this work, we propose an efficient Video-Language Alignment via Frame-Prompting and Distilling (VLAP) network. Our VLAP model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our VLAP network, we design a new learnable question-aware Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models have shown promising results on problems such as visual question answering. However, how to efficiently and effectively sample image frames when adapting pre-trained large image-language models to video-language alignment remains a major challenge. Compared with prior work, our VLAP model demonstrates the capability of selecting key frames with critical contents, thus improving the video-language alignment accuracy while reducing the inference latency (+3.3% on NExT-QA Temporal with 3.0X speed up). Overall, our VLAP network outperforms the state-of-the-art methods on the video question-answering benchmarks (e.g., +4.6% on STAR Interaction and +2.2% on STAR average with 3.0X speed up; our 2-frame model outperforms the 4-frame SeViLA on VLEP with 4.2X speed up).

MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation

Oct 06, 2023

Muhammad Osama Khan, Junbang Liang, Chun-Kai Wang, Shan Yang, Yu Lou

Abstract: Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. For instance, self-supervised learning (SSL) is particularly effective by alleviating the need for large datasets with dense ground-truth depth maps. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate significant changes in these later layers during fine-tuning, indicating the ineffectiveness of their pre-trained features for depth estimation. To address these limitations, we propose MeSa, a comprehensive framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. Hence, MeSa benefits from not only general-purpose representations learnt via masked pre-training but also specialized depth-specific features acquired via geometric and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.
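
The MeSa abstract refers to a layer-wise CKA analysis without saying which variant is used. As an illustration only, the sketch below computes linear CKA (centered kernel alignment) between two sets of layer activations recorded on the same inputs, for example one layer before and after fine-tuning. The function names and the toy activations are hypothetical and are not taken from the paper; this is a minimal sketch of the general technique, not the authors' implementation.

```python
import math

def center_columns(m):
    """Subtract the per-feature (column) mean from an n_samples x n_features matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_cov_sq_norm(a, b):
    """Squared Frobenius norm of A^T B, for matrices that share the same rows (samples)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F).

    x and y hold activations for the same samples (rows); their feature
    counts may differ. Undefined if either input has constant features only.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_cov_sq_norm(xc, yc)
    denominator = math.sqrt(cross_cov_sq_norm(xc, xc) * cross_cov_sq_norm(yc, yc))
    return numerator / denominator

# Toy activations (hypothetical): 4 samples, 2 features per layer.
layer_before = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
layer_after = [[3.0 * v for v in row] for row in layer_before]  # same features, rescaled
print(linear_cka(layer_before, layer_after))
```

Linear CKA is invariant to isotropic scaling and to orthogonal transformations of the features, so a value near 1 suggests a layer's representation changed little, while a low value suggests it changed substantially.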

ICAR: Image-based Complementary Auto Reasoning

Aug 17, 2023

Xijun Wang, Anqi Liang, Junbang Liang, Ming Lin, Yu Lou, Shan Yang

Abstract: Scene-aware Complementary Item Retrieval (CIR) is a challenging task that requires generating a set of compatible items across domains. Due to the subjectivity of the task, it is difficult to set up a rigorous standard for both data collection and learning objectives. To address this challenging task, we propose a visual compatibility concept, composed of similarity (resemblance in color, geometry, texture, etc.) and complementarity (different items, such as a table and a chair, completing a group). Based on this notion, we propose a compatibility learning framework, a category-aware Flexible Bidirectional Transformer (FBT), for visual "scene-based set compatibility reasoning" with cross-domain visual similarity input and auto-regressive complementary item generation. The FBT consists of an encoder with flexible masking, a category prediction arm, and an auto-regressive visual embedding prediction arm. The inputs to FBT are cross-domain visual similarity invariant embeddings, making this framework quite generalizable. Furthermore, our proposed FBT model learns the inter-object compatibility from a large set of scene images in a self-supervised way. Compared with the SOTA methods, this approach achieves improvements of up to 5.3% and 9.6% in FITB score and 22.3% and 31.8% in SFID on fashion and furniture, respectively.

Differentiable Simulation of Soft Multi-body Systems

May 03, 2022

Yi-Ling Qiao, Junbang Liang, Vladlen Koltun, Ming C. Lin

Abstract: We present a method for differentiable simulation of soft articulated bodies. Our work enables the integration of differentiable physical dynamics into gradient-based pipelines. We develop a top-down matrix assembly algorithm within Projective Dynamics and derive a generalized dry friction model for soft continuum using a new matrix splitting strategy. We derive a differentiable control framework for soft articulated bodies driven by muscles, joint torques, or pneumatic tubes. The experiments demonstrate that our designs make soft body simulation more stable and realistic compared to other frameworks. Our method accelerates the solution of system identification problems by more than an order of magnitude, and enables efficient gradient-based learning of motion control with soft robots.

* NeurIPS 2021

Efficient Differentiable Simulation of Articulated Bodies

Sep 16, 2021

Yi-Ling Qiao, Junbang Liang, Vladlen Koltun, Ming C. Lin

Abstract: We present a method for efficient differentiable simulation of articulated bodies. This enables integration of articulated body dynamics into deep learning frameworks, and gradient-based optimization of neural networks that operate on articulated bodies. We derive the gradients of the forward dynamics using spatial algebra and the adjoint method. Our approach is an order of magnitude faster than autodiff tools. By only saving the initial states throughout the simulation process, our method reduces memory requirements by two orders of magnitude. We demonstrate the utility of efficient differentiable dynamics for articulated bodies in a variety of applications. We show that reinforcement learning with articulated systems can be accelerated using gradients provided by our method. In applications to control and inverse problems, gradient-based optimization enabled by our work accelerates convergence by more than an order of magnitude.

* ICML 2021

Scalable Differentiable Physics for Learning and Control

Jul 04, 2020

Yi-Ling Qiao, Junbang Liang, Vladlen Koltun, Ming C. Lin

Abstract: Differentiable physics is a powerful approach to learning and control problems that involve physical objects and environments. While notable progress has been made, the capabilities of differentiable physics solvers remain limited. We develop a scalable framework for differentiable physics that can support a large number of objects and their interactions. To accommodate objects with arbitrary geometry and topology, we adopt meshes as our representation and leverage the sparsity of contacts for scalable differentiable collision handling. Collisions are resolved in localized regions to minimize the number of optimization variables even when the number of simulated objects is high. We further accelerate implicit differentiation of optimization with nonlinear constraints. Experiments demonstrate that the presented framework requires up to two orders of magnitude less memory and computation in comparison to recent particle-based methods. We further validate the approach on inverse problems and control scenarios, where it outperforms derivative-free and model-free baselines by at least an order of magnitude.

* Proceedings of the 37th International Conference on Machine Learning, ICML 2020

Shape-Aware Human Pose and Shape Reconstruction Using Multi-View Images

Aug 26, 2019

Junbang Liang, Ming C. Lin

Abstract: We propose a scalable neural network framework to reconstruct the 3D mesh of a human body from multi-view images, in the subspace of the SMPL model. Use of multi-view images can significantly reduce the projection ambiguity of the problem, increasing the reconstruction accuracy of the 3D human body under clothing. Our experiments show that this method benefits from the synthetic dataset generated from our pipeline since it has good flexibility of variable control and can provide ground-truth for validation. Our method outperforms existing methods on real-world images, especially on shape estimations.

* To be published in ICCV 2019
