Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yen-Chi Cheng

3D-Fixup: Advancing Photo Editing with 3D Priors

May 15, 2025

Yen-Chi Cheng, Krishna Kumar Singh, Jae Shin Yoon, Alex Schwing, Liangyan Gui, Matheus Gadelha, Paul Guerrero, Nanxuan Zhao

Figure 1 for 3D-Fixup: Advancing Photo Editing with 3D Priors

Figure 2 for 3D-Fixup: Advancing Photo Editing with 3D Priors

Figure 3 for 3D-Fixup: Advancing Photo Editing with 3D Priors

Figure 4 for 3D-Fixup: Advancing Photo Editing with 3D Priors

Abstract:Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at https://3dfixup.github.io/

* SIGGRAPH 2025. Project page: https://3dfixup.github.io/

Via

Access Paper or Ask Questions

Virtual Pets: Animatable Animal Generation in 3D Scenes

Dec 21, 2023

Yen-Chi Cheng, Chieh Hubert Lin, Chaoyang Wang, Yash Kant, Sergey Tulyakov, Alexander Schwing, Liangyan Gui, Hsin-Ying Lee

Abstract:Toward unlocking the potential of generative models in immersive 4D experiences, we introduce Virtual Pet, a novel pipeline to model realistic and diverse motions for target animal species within a 3D environment. To circumvent the limited availability of 3D motion data aligned with environmental geometry, we leverage monocular internet videos and extract deformable NeRF representations for the foreground and static NeRF representations for the background. For this, we develop a reconstruction strategy, encompassing species-level shared template learning and per-video fine-tuning. Utilizing the reconstructed data, we then train a conditional 3D motion model to learn the trajectory and articulation of foreground animals in the context of 3D backgrounds. We showcase the efficacy of our pipeline with comprehensive qualitative and quantitative evaluations using cat videos. We also demonstrate versatility across unseen cats and indoor environments, producing temporally coherent 4D outputs for enriched virtual experiences.

* Preprint. Project page: https://yccyenchicheng.github.io/VirtualPets/

Via

Access Paper or Ask Questions

DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Dec 07, 2023

Tao Tu, Ming-Feng Li, Chieh Hubert Lin, Yen-Chi Cheng, Min Sun, Ming-Hsuan Yang

Figure 1 for DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Figure 2 for DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Figure 3 for DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Figure 4 for DreaMo: Articulated 3D Reconstruction From A Single Casual Video

Abstract:Articulated 3D reconstruction has valuable applications in various domains, yet it remains costly and demands intensive work from domain experts. Recent advancements in template-free learning methods show promising results with monocular videos. Nevertheless, these approaches necessitate a comprehensive coverage of all viewpoints of the subject in the input video, thus limiting their applicability to casually captured videos from online sources. In this work, we study articulated 3D shape reconstruction from a single and casually captured internet video, where the subject's view coverage is incomplete. We propose DreaMo that jointly performs shape reconstruction while solving the challenging low-coverage regions with view-conditioned diffusion prior and several tailored regularizations. In addition, we introduce a skeleton generation strategy to create human-interpretable skeletons from the learned neural bones and skinning weights. We conduct our study on a self-collected internet video collection characterized by incomplete view coverage. DreaMo shows promising quality in novel-view rendering, detailed articulated shape reconstruction, and skeleton generation. Extensive qualitative and quantitative studies validate the efficacy of each proposed component, and show existing methods are unable to solve correct geometry due to the incomplete view coverage.

* Project page: https://ttaoretw.github.io/DreaMo/

Via

Access Paper or Ask Questions

SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation

Dec 08, 2022

Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander Schwing, Liangyan Gui

Abstract:In this work, we present a novel framework built to simplify 3D asset generation for amateur users. To enable interactive generation, our method supports a variety of input modalities that can be easily provided by a human, including images, text, partially observed shapes and combinations of these, further allowing to adjust the strength of each input. At the core of our approach is an encoder-decoder, compressing 3D shapes into a compact latent representation, upon which a diffusion model is learned. To enable a variety of multi-modal inputs, we employ task-specific encoders with dropout followed by a cross-attention mechanism. Due to its flexibility, our model naturally supports a variety of tasks, outperforming prior works on shape completion, image-based 3D reconstruction, and text-to-3D. Most interestingly, our model can combine all these tasks into one swiss-army-knife tool, enabling the user to perform shape generation using incomplete shapes, images, and textual descriptions at the same time, providing the relative weights for each input and facilitating interactivity. Despite our approach being shape-only, we further show an efficient method to texture the generated shape using large-scale text-to-image models.

* Pre-print. Please check the project page: https://yccyenchicheng.github.io/SDFusion/

Via

Access Paper or Ask Questions

AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation

Mar 17, 2022

Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, Shubham Tulsiani

Figure 1 for AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation

Figure 2 for AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation

Figure 3 for AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation

Figure 4 for AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation

Abstract:Powerful priors allow us to perform inference with insufficient information. In this paper, we propose an autoregressive prior for 3D shapes to solve multimodal 3D tasks such as shape completion, reconstruction, and generation. We model the distribution over 3D shapes as a non-sequential autoregressive distribution over a discretized, low-dimensional, symbolic grid-like latent representation of 3D shapes. This enables us to represent distributions over 3D shapes conditioned on information from an arbitrary set of spatially anchored query locations and thus perform shape completion in such arbitrary settings (e.g., generating a complete chair given only a view of the back leg). We also show that the learned autoregressive prior can be leveraged for conditional tasks such as single-view reconstruction and language-based generation. This is achieved by learning task-specific naive conditionals which can be approximated by light-weight models trained on minimal paired data. We validate the effectiveness of the proposed method using both quantitative and qualitative evaluation and show that the proposed method outperforms the specialized state-of-the-art methods trained for individual tasks. The project page with code and video visualizations can be found at https://yccyenchicheng.github.io/AutoSDF/.

* In CVPR 2022. The first two authors contributed equally to this work. Project: https://yccyenchicheng.github.io/AutoSDF/

Via

Access Paper or Ask Questions

InfinityGAN: Towards Infinite-Resolution Image Synthesis

Apr 08, 2021

Chieh Hubert Lin, Hsin-Ying Lee, Yen-Chi Cheng, Sergey Tulyakov, Ming-Hsuan Yang

Figure 1 for InfinityGAN: Towards Infinite-Resolution Image Synthesis

Figure 2 for InfinityGAN: Towards Infinite-Resolution Image Synthesis

Figure 3 for InfinityGAN: Towards Infinite-Resolution Image Synthesis

Figure 4 for InfinityGAN: Towards Infinite-Resolution Image Synthesis

Abstract:We present InfinityGAN, a method to generate arbitrary-resolution images. The problem is associated with several key challenges. First, scaling existing models to a high resolution is resource-constrained, both in terms of computation and availability of high-resolution training data. Infinity-GAN trains and infers patch-by-patch seamlessly with low computational resources. Second, large images should be locally and globally consistent, avoid repetitive patterns, and look realistic. To address these, InfinityGAN takes global appearance, local structure and texture into account.With this formulation, we can generate images with resolution and level of detail not attainable before. Experimental evaluation supports that InfinityGAN generates imageswith superior global structure compared to baselines at the same time featuring parallelizable inference. Finally, we how several applications unlocked by our approach, such as fusing styles spatially, multi-modal outpainting and image inbetweening at arbitrary input and output resolutions

* Project page: https://hubert0527.github.io/infinityGAN/

Via

Access Paper or Ask Questions

In&Out : Diverse Image Outpainting via GAN Inversion

Apr 01, 2021

Yen-Chi Cheng, Chieh Hubert Lin, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Ming-Hsuan Yang

Figure 1 for In&Out : Diverse Image Outpainting via GAN Inversion

Figure 2 for In&Out : Diverse Image Outpainting via GAN Inversion

Figure 3 for In&Out : Diverse Image Outpainting via GAN Inversion

Figure 4 for In&Out : Diverse Image Outpainting via GAN Inversion

Abstract:Image outpainting seeks for a semantically consistent extension of the input image beyond its available content. Compared to inpainting -- filling in missing pixels in a way coherent with the neighboring pixels -- outpainting can be achieved in more diverse ways since the problem is less constrained by the surrounding pixels. Existing image outpainting methods pose the problem as a conditional image-to-image translation task, often generating repetitive structures and textures by replicating the content available in the input image. In this work, we formulate the problem from the perspective of inverting generative adversarial networks. Our generator renders micro-patches conditioned on their joint latent code as well as their individual positions in the image. To outpaint an image, we seek for multiple latent codes not only recovering available patches but also synthesizing diverse outpainting by patch-based generation. This leads to richer structure and content in the outpainted regions. Furthermore, our formulation allows for outpainting conditioned on the categorical input, thereby enabling flexible user controls. Extensive experimental results demonstrate the proposed method performs favorably against existing in- and outpainting methods, featuring higher visual quality and diversity.

* Project Page: https://yccyenchicheng.github.io/InOut/

Via

Access Paper or Ask Questions

Controllable Image Synthesis via SegVAE

Jul 17, 2020

Yen-Chi Cheng, Hsin-Ying Lee, Min Sun, Ming-Hsuan Yang

Figure 1 for Controllable Image Synthesis via SegVAE

Figure 2 for Controllable Image Synthesis via SegVAE

Figure 3 for Controllable Image Synthesis via SegVAE

Figure 4 for Controllable Image Synthesis via SegVAE

Abstract:Flexible user controls are desirable for content creation and image editing. A semantic map is commonly used intermediate representation for conditional image generation. Compared to the operation on raw RGB pixels, the semantic map enables simpler user modification. In this work, we specifically target at generating semantic maps given a label-set consisting of desired categories. The proposed framework, SegVAE, synthesizes semantic maps in an iterative manner using conditional variational autoencoder. Quantitative and qualitative experiments demonstrate that the proposed model can generate realistic and diverse semantic maps. We also apply an off-the-shelf image-to-image translation model to generate realistic RGB images to better understand the quality of the synthesized semantic maps. Furthermore, we showcase several real-world image-editing applications including object removal, object insertion, and object replacement.

* ECCV 2020. Project page: https://yccyenchicheng.github.io/SegVAE/ Code: https://github.com/yccyenchicheng/SegVAE

Via

Access Paper or Ask Questions

Radiotherapy Target Contouring with Convolutional Gated Graph Neural Network

Apr 05, 2019

Chun-Hung Chao, Yen-Chi Cheng, Hsien-Tzu Cheng, Chi-Wen Huang, Tsung-Ying Ho, Chen-Kan Tseng, Le Lu, Min Sun

Figure 1 for Radiotherapy Target Contouring with Convolutional Gated Graph Neural Network

Figure 2 for Radiotherapy Target Contouring with Convolutional Gated Graph Neural Network

Abstract:Tomography medical imaging is essential in the clinical workflow of modern cancer radiotherapy. Radiation oncologists identify cancerous tissues, applying delineation on treatment regions throughout all image slices. This kind of task is often formulated as a volumetric segmentation task by means of 3D convolutional networks with considerable computational cost. Instead, inspired by the treating methodology of considering meaningful information across slices, we used Gated Graph Neural Network to frame this problem more efficiently. More specifically, we propose convolutional recurrent Gated Graph Propagator (GGP) to propagate high-level information through image slices, with learnable adjacency weighted matrix. Furthermore, as physicians often investigate a few specific slices to refine their decision, we model this slice-wise interaction procedure to further improve our segmentation result. This can be set by editing any slice effortlessly as updating predictions of other slices using GGP. To evaluate our method, we collect an Esophageal Cancer Radiotherapy Target Treatment Contouring dataset of 81 patients which includes tomography images with radiotherapy target. On this dataset, our convolutional graph network produces state-of-the-art results and outperforms the baselines. With the addition of interactive setting, performance is improved even further. Our method has the potential to be easily applied to diverse kinds of medical tasks with volumetric images. Incorporating both the ability to make a feasible prediction and to consider the human interactive input, the proposed method is suitable for clinical scenarios.

* Machine Learning for Health (ML4H) Workshop at NeurIPS 2018. Version 2

Via

Access Paper or Ask Questions

Point-to-Point Video Generation

Apr 05, 2019

Tsun-Hsuan Wang, Yen-Chi Cheng, Chieh Hubert Lin, Hwann-Tzong Chen, Min Sun

Figure 1 for Point-to-Point Video Generation

Figure 2 for Point-to-Point Video Generation

Figure 3 for Point-to-Point Video Generation

Figure 4 for Point-to-Point Video Generation

Abstract:While image manipulation achieves tremendous breakthroughs (e.g., generating realistic faces) in recent years, video generation is much less explored and harder to control, which limits its applications in the real world. For instance, video editing requires temporal coherence across multiple clips and thus poses both start and end constraints within a video sequence. We introduce point-to-point video generation that controls the generation process with two control points: the targeted start- and end-frames. The task is challenging since the model not only generates a smooth transition of frames, but also plans ahead to ensure that the generated end-frame conforms to the targeted end-frame for videos of various length. We propose to maximize the modified variational lower bound of conditional data likelihood under a skip-frame training strategy. Our model can generate sequences such that their end-frame is consistent with the targeted end-frame without loss of quality and diversity. Extensive experiments are conducted on Stochastic Moving MNIST, Weizmann Human Action, and Human3.6M to evaluate the effectiveness of the proposed method. We demonstrate our method under a series of scenarios (e.g., dynamic length generation) and the qualitative results showcase the potential and merits of point-to-point generation. For project page, see https://zswang666.github.io/P2PVG-Project-Page/

* *indicates equal contribution; ver.1

Via

Access Paper or Ask Questions