Linyi Jin

Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis

Apr 30, 2025

Michal Geyer, Omer Tov, Linyi Jin, Richard Tucker, Inbar Mosseri, Tali Dekel, Noah Snavely

Abstract: The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases, first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions. See videos at https://video-eye2eye.github.io
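
The multi-phase baseline that the abstract contrasts against (estimate disparity, warp, then inpaint) can be illustrated with a toy forward warp on a single image row. This is only a sketch of the generic warping step, not the method of this paper or of any specific prior work. The function name, the integer disparities, and the sign convention (a pixel at column x moves to x - d in the second view) are assumptions made for illustration.

```python
def warp_row_by_disparity(row, disparity):
    """Forward-warp one image row to a horizontally shifted viewpoint.

    row: list of pixel values.
    disparity: one integer shift (in pixels) per input pixel.
    Returns the warped row; None marks disoccluded holes that a
    separate inpainting step would have to fill.
    """
    width = len(row)
    warped = [None] * width
    best_disp = [None] * width  # disparity of the pixel currently stored
    for x, (value, d) in enumerate(zip(row, disparity)):
        target = x - d  # assumed convention: nearer pixels shift further
        if 0 <= target < width:
            # When two pixels land on the same column, keep the nearer one
            # (larger disparity), since it would occlude the other.
            if best_disp[target] is None or d > best_disp[target]:
                warped[target] = value
                best_disp[target] = d
    return warped
```

For example, warping the row `a b c F F d e` with disparities `0 0 0 2 2 0 0` moves the foreground pixels `F` two columns left, giving `a F F _ _ d e`, where the two `_` entries are holes. Because each pixel carries exactly one disparity value, a reflection or a transparent surface, where two depths are visible at the same pixel, cannot be represented. That is the single-layer limitation the abstract describes.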

Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

Dec 12, 2024

Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, Aleksander Holynski

Abstract: Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging from robotics to scene reconstruction. Yet, unlike other problems where large-scale supervised training has enabled rapid progress, directly supervising methods for recovering 3D motion remains challenging due to the fundamental difficulty of obtaining ground truth annotations. We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. Our system fuses and filters the outputs of camera pose estimation, stereo depth estimation, and temporal tracking methods into high-quality dynamic 3D reconstructions. We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds with long-term motion trajectories. We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs, showing that training on our reconstructed data enables generalization to diverse real-world scenes. Project page: https://stereo4d.github.io

MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos

Dec 05, 2024

Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, Noah Snavely

Abstract: We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. Most conventional structure from motion and monocular SLAM techniques assume input videos that feature predominantly static scenes with large amounts of parallax. Such methods tend to produce erroneous estimates in the absence of these conditions. Recent neural network-based approaches attempt to overcome these challenges; however, such methods are either computationally expensive or brittle when run on dynamic videos with uncontrolled camera motion or unknown field of view. We demonstrate the surprising effectiveness of a deep visual SLAM framework: with careful modifications to its training and inference schemes, this system can scale to real-world videos of complex dynamic scenes with unconstrained camera paths, including videos with little camera parallax. Extensive experiments on both synthetic and real videos demonstrate that our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work, with faster or comparable running times. See interactive results on our project page: https://mega-sam.github.io/

3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surface

Mar 13, 2024

Linyi Jin, Nilesh Kulkarni, David Fouhey

Abstract: This paper introduces 3DFIRES, a novel system for scene-level 3D reconstruction from posed images. Designed to work with as few as one view, 3DFIRES reconstructs the complete geometry of unseen scenes, including hidden surfaces. With multiple view inputs, our method produces full reconstruction within all camera frustums. A key feature of our approach is the fusion of multi-view information at the feature level, enabling the production of coherent and comprehensive 3D reconstruction. We train our system on non-watertight scans from a large-scale real scene dataset. We show it matches the efficacy of single-view reconstruction methods with only one input and surpasses existing techniques in both quantitative and qualitative measures for sparse-view 3D reconstruction.

* Accepted to CVPR 2024. Project Page: https://jinlinyi.github.io/3DFIRES/

FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation

Mar 05, 2024

Chris Rockwell, Nilesh Kulkarni, Linyi Jin, Jeong Joon Park, Justin Johnson, David F. Fouhey

Abstract: Estimating relative camera poses between images has been a central problem in computer vision. Methods that find correspondences and solve for the fundamental matrix offer high precision in most cases. Conversely, methods predicting pose directly using neural networks are more robust to limited overlap and can infer absolute translation scale, but at the expense of reduced precision. We show how to combine the best of both methods; our approach yields results that are both precise and robust, while also accurately inferring translation scales. At the heart of our model lies a Transformer that (1) learns to balance between solved and learned pose estimations, and (2) provides a prior to guide a solver. A comprehensive analysis supports our design choices and demonstrates that our method adapts flexibly to various feature extractors and correspondence estimators, showing state-of-the-art performance in 6DoF pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-free Relocalization.

* Accepted to CVPR 2024. Project Page: https://crockwell.github.io/far/

Learning to Predict Scene-Level Implicit 3D from Posed RGBD Data

Jun 14, 2023

Nilesh Kulkarni, Linyi Jin, Justin Johnson, David F. Fouhey

Abstract: We introduce a method that can learn to predict scene-level implicit functions for 3D reconstruction from posed RGBD data. At test time, our system maps a previously unseen RGB image to a 3D reconstruction of a scene via implicit functions. While implicit functions for 3D reconstruction have often been tied to meshes, we show that we can train one using only a set of posed RGBD images. This setting may help 3D reconstruction unlock the sea of accelerometer+RGBD data that is coming with new phones. Our system, D2-DRDF, can match and sometimes outperform current methods that use mesh supervision and shows better robustness to sparse data.

* Project page: https://nileshkulkarni.github.io/d2drdf/

Perspective Fields for Single Image Camera Calibration

Dec 06, 2022

Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Matzen, Matthew Sticha, David F. Fouhey

Abstract: Geometric camera calibration is often required for applications that understand the perspective of the image. We propose Perspective Fields as a representation that models the local perspective properties of an image. Perspective Fields contain per-pixel information about the camera view, parameterized as an up vector and a latitude value. This representation has a number of advantages as it makes minimal assumptions about the camera model and is invariant or equivariant to common image editing operations like cropping, warping, and rotation. It is also more interpretable and aligned with human perception. We train a neural network to predict Perspective Fields, and the predicted Perspective Fields can be converted to calibration parameters easily. We demonstrate the robustness of our approach under various scenarios compared with camera calibration-based methods and show example applications in image compositing.

* Project Page: https://jinlinyi.github.io/PerspectiveFields/
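
To make the "up vector and latitude" parameterization concrete, here is a minimal sketch for the simplest case of an ideal pinhole camera. It assumes the latitude of a pixel is the angle between its viewing ray and the horizontal plane, and that the up vector is the image-space direction in which a point on that ray moves when nudged along the world up direction. The function name, the camera axes (x right, y down, z forward), and the default `up_cam` of a level camera are illustrative assumptions, not the project's API.

```python
import math

def perspective_field_at(u, v, f, cx, cy, up_cam=(0.0, -1.0, 0.0)):
    """Up vector and latitude at pixel (u, v) for an ideal pinhole camera.

    f: focal length in pixels; (cx, cy): principal point.
    up_cam: unit world-up direction expressed in camera coordinates
            (assumed axes: x right, y down, z forward).
    Returns ((up_u, up_v), latitude_in_radians).
    """
    # Viewing ray through the pixel, in camera coordinates.
    x, y, z = (u - cx) / f, (v - cy) / f, 1.0
    norm = math.sqrt(x * x + y * y + z * z)

    # Latitude: angle between the ray and the plane perpendicular to up.
    dot = (x * up_cam[0] + y * up_cam[1] + z * up_cam[2]) / norm
    latitude = math.asin(max(-1.0, min(1.0, dot)))

    # Up vector: nudge the 3D point along up, reproject, and take the
    # normalized image-space displacement (a finite-difference estimate).
    eps = 1e-4
    px, py, pz = x + eps * up_cam[0], y + eps * up_cam[1], z + eps * up_cam[2]
    du = f * px / pz + cx - u
    dv = f * py / pz + cy - v
    length = math.hypot(du, dv)  # zero only if the ray is parallel to up
    return (du / length, dv / length), latitude
```

For a level camera, the pixel at the principal point gets latitude 0 and an up vector pointing toward the top of the image, and a pixel one focal length above the principal point gets a latitude of 45 degrees. Cropping the image only changes which pixels are kept, not their values, which is one way to read the abstract's claim of invariance to cropping.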

PlaneFormers: From Sparse View Planes to 3D Reconstruction

Aug 08, 2022

Samir Agarwala, Linyi Jin, Chris Rockwell, David F. Fouhey

Abstract: We present an approach for the planar surface reconstruction of a scene from images with limited overlap. This reconstruction task is challenging since it requires jointly reasoning about single image 3D reconstruction, correspondence between images, and the relative camera pose between images. Past work has proposed optimization-based approaches. We introduce a simpler approach, the PlaneFormer, that uses a transformer applied to 3D-aware plane tokens to perform 3D reasoning. Our experiments show that our approach is substantially more effective than prior work, and that several 3D-specific design decisions are crucial for its success.

* Accepted to ECCV 2022

Understanding 3D Object Articulation in Internet Videos

Mar 30, 2022

Shengyi Qian, Linyi Jin, Chris Rockwell, Siyi Chen, David F. Fouhey

Abstract: We propose to investigate detecting and characterizing the 3D planar articulation of objects from ordinary videos. While seemingly easy for humans, this problem poses many challenges for computers. We propose to approach this problem by combining a top-down detection system that finds planes that can be articulated along with an optimization approach that solves for a 3D plane that can explain a sequence of observed articulations. We show that this system can be trained on a combination of videos and 3D scan datasets. When tested on a dataset of challenging Internet videos and the Charades dataset, our approach obtains strong performance. Project site: https://jasonqsy.github.io/Articulation3D

* Accepted to CVPR 2022

Planar Surface Reconstruction from Sparse Views

Mar 26, 2021

Linyi Jin, Shengyi Qian, Andrew Owens, David F. Fouhey

Abstract: The paper studies planar surface reconstruction of indoor scenes from two views with unknown camera poses. While prior approaches have successfully created object-centric reconstructions of many scenes, they fail to exploit other structures, such as planes, which are typically the dominant components of indoor scenes. In this paper, we reconstruct planar surfaces from multiple views, while jointly estimating camera pose. Our experiments demonstrate that our method is able to advance the state of the art of reconstruction from sparse views, on challenging scenes from Matterport3D. Project site: https://jinlinyi.github.io/SparsePlanes/
