Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lin Yen-Chen

Cosmos World Foundation Model Platform for Physical AI

Jan 07, 2025

NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen(+69 more)

Figure 1 for Cosmos World Foundation Model Platform for Physical AI

Figure 2 for Cosmos World Foundation Model Platform for Physical AI

Figure 3 for Cosmos World Foundation Model Platform for Physical AI

Figure 4 for Cosmos World Foundation Model Platform for Physical AI

Abstract:Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.

Via

Access Paper or Ask Questions

Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement

Jul 10, 2023

Anthony Simeonov, Ankit Goyal, Lucas Manuelli, Lin Yen-Chen, Alina Sarmiento, Alberto Rodriguez, Pulkit Agrawal, Dieter Fox

Figure 1 for Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement

Figure 2 for Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement

Figure 3 for Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement

Figure 4 for Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement

Abstract:We propose a system for rearranging objects in a scene to achieve a desired object-scene placing relationship, such as a book inserted in an open slot of a bookshelf. The pipeline generalizes to novel geometries, poses, and layouts of both scenes and objects, and is trained from demonstrations to operate directly on 3D point clouds. Our system overcomes challenges associated with the existence of many geometrically-similar rearrangement solutions for a given scene. By leveraging an iterative pose de-noising training procedure, we can fit multi-modal demonstration data and produce multi-modal outputs while remaining precise and accurate. We also show the advantages of conditioning on relevant local geometric features while ignoring irrelevant global structure that harms both generalization and precision. We demonstrate our approach on three distinct rearrangement tasks that require handling multi-modality and generalization over object shape and pose in both simulation and the real world. Project website, code, and videos: https://anthonysimeonov.github.io/rpdiff-multi-modal/

* Project page: https://anthonysimeonov.github.io/rpdiff-multi-modal/

Via

Access Paper or Ask Questions

MIRA: Mental Imagery for Robotic Affordances

Dec 12, 2022

Lin Yen-Chen, Pete Florence, Andy Zeng, Jonathan T. Barron, Yilun Du, Wei-Chiu Ma, Anthony Simeonov, Alberto Rodriguez Garcia, Phillip Isola

Abstract:Humans form mental images of 3D scenes to support counterfactual imagination, planning, and motor control. Our abilities to predict the appearance and affordance of the scene from previously unobserved viewpoints aid us in performing manipulation tasks (e.g., 6-DoF kitting) with a level of ease that is currently out of reach for existing robot learning frameworks. In this work, we aim to build artificial systems that can analogously plan actions on top of imagined images. To this end, we introduce Mental Imagery for Robotic Affordances (MIRA), an action reasoning framework that optimizes actions with novel-view synthesis and affordance prediction in the loop. Given a set of 2D RGB images, MIRA builds a consistent 3D scene representation, through which we synthesize novel orthographic views amenable to pixel-wise affordances prediction for action optimization. We illustrate how this optimization process enables us to generalize to unseen out-of-plane rotations for 6-DoF robotic manipulation tasks given a limited number of demonstrations, paving the way toward machines that autonomously learn to understand the world around them for planning actions.

* CoRL 2022, webpage: https://yenchenlin.me/mira

Via

Access Paper or Ask Questions

SE(3)-Equivariant Relational Rearrangement with Neural Descriptor Fields

Nov 17, 2022

Anthony Simeonov, Yilun Du, Lin Yen-Chen, Alberto Rodriguez, Leslie Pack Kaelbling, Tomas Lozano-Perez, Pulkit Agrawal

Abstract:We present a method for performing tasks involving spatial relations between novel object instances initialized in arbitrary poses directly from point cloud observations. Our framework provides a scalable way for specifying new tasks using only 5-10 demonstrations. Object rearrangement is formalized as the question of finding actions that configure task-relevant parts of the object in a desired alignment. This formalism is implemented in three steps: assigning a consistent local coordinate frame to the task-relevant object parts, determining the location and orientation of this coordinate frame on unseen object instances, and executing an action that brings these frames into the desired alignment. We overcome the key technical challenge of determining task-relevant local coordinate frames from a few demonstrations by developing an optimization method based on Neural Descriptor Fields (NDFs) and a single annotated 3D keypoint. An energy-based learning scheme to model the joint configuration of the objects that satisfies a desired relational task further improves performance. The method is tested on three multi-object rearrangement tasks in simulation and on a real robot. Project website, videos, and code: https://anthonysimeonov.github.io/r-ndf/

* CoRL 2022, first two authors contributed equally, website and code: https://anthonysimeonov.github.io/r-ndf/

Via

Access Paper or Ask Questions

Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

Jul 12, 2022

Kai-En Lin, Lin Yen-Chen, Wei-Sheng Lai, Tsung-Yi Lin, Yi-Chang Shih, Ravi Ramamoorthi

Figure 1 for Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

Figure 2 for Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

Figure 3 for Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

Figure 4 for Vision Transformer for NeRF-Based View Synthesis from a Single Input Image

Abstract:Although neural radiance fields (NeRF) have shown impressive advances for novel view synthesis, most methods typically require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches condition on local image features to reconstruct a 3D object, but often render blurry predictions at viewpoints that are far away from the source view. To address this issue, we propose to leverage both the global and local features to form an expressive 3D representation. The global features are learned from a vision transformer, while the local features are extracted from a 2D convolutional network. To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering. This novel 3D representation allows the network to reconstruct unseen regions without enforcing constraints like symmetry or canonical coordinate systems. Our method can render novel views from only a single input image and generalize across multiple object categories using a single model. Quantitative and qualitative evaluations demonstrate that the proposed method achieves state-of-the-art performance and renders richer details than existing approaches.

* Project website: https://cseweb.ucsd.edu/~viscomp/projects/VisionNeRF/

Via

Access Paper or Ask Questions

NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields

Mar 03, 2022

Lin Yen-Chen, Pete Florence, Jonathan T. Barron, Tsung-Yi Lin, Alberto Rodriguez, Phillip Isola

Figure 1 for NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields

Figure 2 for NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields

Figure 3 for NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields

Figure 4 for NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields

Abstract:Thin, reflective objects such as forks and whisks are common in our daily lives, but they are particularly challenging for robot perception because it is hard to reconstruct them using commodity RGB-D cameras or multi-view stereo techniques. While traditional pipelines struggle with objects like these, Neural Radiance Fields (NeRFs) have recently been shown to be remarkably effective for performing view synthesis on objects with thin structures or reflective materials. In this paper we explore the use of NeRF as a new source of supervision for robust robot vision systems. In particular, we demonstrate that a NeRF representation of a scene can be used to train dense object descriptors. We use an optimized NeRF to extract dense correspondences between multiple views of an object, and then use these correspondences as training data for learning a view-invariant representation of the object. NeRF's usage of a density field allows us to reformulate the correspondence problem with a novel distribution-of-depths formulation, as opposed to the conventional approach of using a depth map. Dense correspondence models supervised with our method significantly outperform off-the-shelf learned descriptors by 106% (PCK@3px metric, more than doubling performance) and outperform our baseline supervised with multi-view stereo by 29%. Furthermore, we demonstrate the learned dense descriptors enable robots to perform accurate 6-degree of freedom (6-DoF) pick and place of thin and reflective objects.

* ICRA 2022, Website: https://yenchenlin.me/nerf-supervision/

Via

Access Paper or Ask Questions

Learning to See before Learning to Act: Visual Pre-training for Manipulation

Jul 01, 2021

Lin Yen-Chen, Andy Zeng, Shuran Song, Phillip Isola, Tsung-Yi Lin

Figure 1 for Learning to See before Learning to Act: Visual Pre-training for Manipulation

Figure 2 for Learning to See before Learning to Act: Visual Pre-training for Manipulation

Figure 3 for Learning to See before Learning to Act: Visual Pre-training for Manipulation

Figure 4 for Learning to See before Learning to Act: Visual Pre-training for Manipulation

Abstract:Does having visual priors (e.g. the ability to detect objects) facilitate learning to perform vision-based manipulation (e.g. picking up objects)? We study this problem under the framework of transfer learning, where the model is first trained on a passive vision task, and adapted to perform an active manipulation task. We find that pre-training on vision tasks significantly improves generalization and sample efficiency for learning to manipulate objects. However, realizing these gains requires careful selection of which parts of the model to transfer. Our key insight is that outputs of standard vision models highly correlate with affordance maps commonly used in manipulation. Therefore, we explore directly transferring model parameters from vision networks to affordance prediction networks, and show that this can result in successful zero-shot adaptation, where a robot can pick up certain objects with zero robotic experience. With just a small amount of robotic experience, we can further fine-tune the affordance model to achieve better results. With just 10 minutes of suction experience or 1 hour of grasping experience, our method achieves ~80% success rate at picking up novel objects.

* Accepted to ICRA 2020. Porject page: http://yenchenlin.me/vision2action/

Via

Access Paper or Ask Questions

Neural Volume Rendering: NeRF And Beyond

Jan 14, 2021

Frank Dellaert, Lin Yen-Chen

Abstract:Besides the COVID-19 pandemic and political upheaval in the US, 2020 was also the year in which neural volume rendering exploded onto the scene, triggered by the impressive NeRF paper by Mildenhall et al. (2020). Both of us have tried to capture this excitement, Frank on a blog post (Dellaert, 2020) and Yen-Chen in a Github collection (Yen-Chen, 2020). This note is an annotated bibliography of the relevant papers, and we posted the associated bibtex file on the repository.

* Blog: https://dellaert.github.io/NeRF/ Bibtex: https://github.com/yenchenlin/awesome-NeRF

Via

Access Paper or Ask Questions

iNeRF: Inverting Neural Radiance Fields for Pose Estimation

Dec 10, 2020

Lin Yen-Chen, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, Tsung-Yi Lin

Figure 1 for iNeRF: Inverting Neural Radiance Fields for Pose Estimation

Figure 2 for iNeRF: Inverting Neural Radiance Fields for Pose Estimation

Figure 3 for iNeRF: Inverting Neural Radiance Fields for Pose Estimation

Figure 4 for iNeRF: Inverting Neural Radiance Fields for Pose Estimation

Abstract:We present iNeRF, a framework that performs pose estimation by "inverting" a trained Neural Radiance Field (NeRF). NeRFs have been shown to be remarkably effective for the task of view synthesis - synthesizing photorealistic novel views of real-world scenes or objects. In this work, we investigate whether we can apply analysis-by-synthesis with NeRF for 6DoF pose estimation - given an image, find the translation and rotation of a camera relative to a 3D model. Starting from an initial pose estimate, we use gradient descent to minimize the residual between pixels rendered from an already-trained NeRF and pixels in an observed image. In our experiments, we first study 1) how to sample rays during pose refinement for iNeRF to collect informative gradients and 2) how different batch sizes of rays affect iNeRF on a synthetic dataset. We then show that for complex real-world scenes from the LLFF dataset, iNeRF can improve NeRF by estimating the camera poses of novel images and using these images as additional training data for NeRF. Finally, we show iNeRF can be combined with feature-based pose initialization. The approach outperforms all other RGB-based methods relying on synthetic data on LineMOD.

* Website: http://yenchenlin.me/inerf/

Via

Access Paper or Ask Questions

Debiased Contrastive Learning

Jul 05, 2020

Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba, Stefanie Jegelka

Figure 1 for Debiased Contrastive Learning

Figure 2 for Debiased Contrastive Learning

Figure 3 for Debiased Contrastive Learning

Figure 4 for Debiased Contrastive Learning

Abstract:A prominent technique for self-supervised representation learning has been to contrast semantically similar and dissimilar pairs of samples. Without access to labels, dissimilar (negative) points are typically taken to be randomly sampled datapoints, implicitly accepting that these points may, in reality, actually have the same label. Perhaps unsurprisingly, we observe that sampling negative examples from truly different labels improves performance, in a synthetic setting where labels are available. Motivated by this observation, we develop a debiased contrastive objective that corrects for the sampling of same-label datapoints, even without knowledge of the true labels. Empirically, the proposed objective consistently outperforms the state-of-the-art for representation learning in vision, language, and reinforcement learning benchmarks. Theoretically, we establish generalization bounds for the downstream classification task.

Via

Access Paper or Ask Questions