Abstract:Collecting accurate camera poses of training images has been shown to benefit the learning of 3D-aware generative adversarial networks (GANs), yet it can be quite expensive in practice. This work targets learning 3D-aware GANs from unposed images, for which we propose to perform on-the-fly pose estimation of training images with a learned template feature field (TeFF). Concretely, in addition to a generative radiance field as in previous approaches, we ask the generator to also learn a field from 2D semantic features while sharing the density with the radiance field. Such a framework allows us to acquire a canonical 3D feature template by leveraging the dataset mean discovered by the generative model, and to further estimate the pose parameters of real data efficiently. Experimental results on various challenging datasets demonstrate the superiority of our approach over state-of-the-art alternatives from both the qualitative and the quantitative perspectives.
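The shared-density design is the core architectural point of the abstract above: one density prediction serves both the radiance (RGB) rendering and the semantic-feature rendering. Below is a minimal PyTorch sketch of such a per-point decoder, assuming a simple MLP trunk; the module name (SharedDensityDecoder), layer sizes, and input features are illustrative assumptions, not the paper's actual architecture.

# Hedged sketch: a per-point decoder that shares one density prediction between a
# radiance (RGB) branch and a semantic-feature branch. Sizes and names are assumptions.
import torch
import torch.nn as nn

class SharedDensityDecoder(nn.Module):
    def __init__(self, in_dim=64, hidden=128, feat_dim=32):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)          # shared by both fields
        self.rgb_head = nn.Linear(hidden, 3)              # generative radiance field
        self.feature_head = nn.Linear(hidden, feat_dim)   # semantic feature field

    def forward(self, x):
        h = self.trunk(x)
        sigma = torch.relu(self.density_head(h))          # one density for both renderings
        rgb = torch.sigmoid(self.rgb_head(h))
        feat = self.feature_head(h)
        return sigma, rgb, feat

if __name__ == "__main__":
    dec = SharedDensityDecoder()
    pts = torch.randn(1024, 64)                           # per-point features from a backbone
    sigma, rgb, feat = dec(pts)
    print(sigma.shape, rgb.shape, feat.shape)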
Abstract:Unsupervised learning of 3D-aware generative adversarial networks has recently made rapid progress. Some recent work demonstrates promising results in learning human generative models using neural articulated radiance fields, yet their generalization ability and controllability lag behind parametric human models, i.e., they do not generalize well to novel poses/shapes and do not support part-level control. To solve these problems, we propose VeRi3D, a generative human vertex-based radiance field parameterized by the vertices of the parametric human template, SMPL. We map each 3D point to the local coordinate system defined by its neighboring vertices, and use the corresponding vertex features and local coordinates to map it to color and density values. We demonstrate that our simple approach generates photorealistic human images with free control over camera pose, human pose, and shape, while also enabling part-level editing.
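As a rough illustration of the vertex-based parameterization described above, the following PyTorch sketch assigns each query point to its nearest template vertices, expresses it as local offsets, and decodes color and density from learned vertex features plus those offsets. The k-nearest-neighbor assignment, distance-based blending, and feature sizes are illustrative assumptions rather than the paper's exact formulation.

# Hedged sketch of a vertex-parameterized radiance field; SMPL has 6890 vertices,
# everything else (k-NN lookup, blending, decoder) is an assumption for illustration.
import torch
import torch.nn as nn

class VertexRadianceField(nn.Module):
    def __init__(self, n_verts=6890, feat_dim=32, k=4):
        super().__init__()
        self.k = k
        self.vert_feats = nn.Parameter(torch.randn(n_verts, feat_dim))
        self.decoder = nn.Sequential(nn.Linear(feat_dim + 3, 64), nn.ReLU(),
                                     nn.Linear(64, 4))    # RGB + density

    def forward(self, points, verts):
        # points: (N, 3) query points; verts: (V, 3) posed template vertices (e.g. SMPL)
        dists = torch.cdist(points, verts)                    # (N, V)
        knn_d, knn_idx = dists.topk(self.k, largest=False)    # nearest template vertices
        local = points.unsqueeze(1) - verts[knn_idx]          # offsets in each vertex's frame
        feats = self.vert_feats[knn_idx]                      # per-vertex learned features
        out = self.decoder(torch.cat([feats, local], dim=-1))
        w = torch.softmax(-knn_d, dim=-1).unsqueeze(-1)       # distance-based blending weights
        out = (w * out).sum(dim=1)
        rgb, sigma = torch.sigmoid(out[:, :3]), torch.relu(out[:, 3:])
        return rgb, sigma

if __name__ == "__main__":
    field = VertexRadianceField()
    pts, verts = torch.randn(2048, 3), torch.randn(6890, 3)
    rgb, sigma = field(pts, verts)
    print(rgb.shape, sigma.shape)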
Abstract:Human motion prediction aims to predict future 3D skeletal sequences given a limited observed motion sequence as input. Two popular families of methods, recurrent neural networks and feed-forward deep networks, can predict the rough motion trend, but motion details such as limb movements may be lost. To predict more accurate future human motion, we propose an Adversarial Refinement Network (ARNet) that follows a simple yet effective coarse-to-fine mechanism with novel adversarial error augmentation. Specifically, we take both the historical motion sequences and the coarse prediction as inputs to our cascaded refinement network to predict refined human motion, and strengthen the refinement network with adversarial error augmentation. During training, we deliberately inject an error distribution learned through the adversarial mechanism among different subjects. At test time, our cascaded refinement network robustly alleviates the prediction error of the coarse predictor, resulting in a finer prediction. This adversarial error augmentation provides rich error cases as input to our refinement network, leading to better generalization performance on the testing dataset. We conduct extensive experiments on three standard benchmark datasets and show that our proposed ARNet outperforms other state-of-the-art methods, especially on challenging aperiodic actions, in both short-term and long-term predictions.
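A minimal PyTorch sketch of the coarse-to-fine pipeline described above: a coarse predictor maps the observed sequence to a rough future, and a refinement network conditioned on both the history and that coarse guess outputs a residual correction. The GRU backbones, layer sizes, and residual formulation are assumptions for illustration; the adversarial error-augmentation branch is omitted.

# Hedged sketch of a coarse predictor plus cascaded refinement network.
import torch
import torch.nn as nn

class CoarsePredictor(nn.Module):
    def __init__(self, joint_dim=66, hidden=256, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.enc = nn.GRU(joint_dim, hidden, batch_first=True)
        self.dec = nn.Linear(hidden, joint_dim * horizon)

    def forward(self, history):                   # history: (B, T, joint_dim)
        _, h = self.enc(history)
        out = self.dec(h[-1])
        return out.view(history.size(0), self.horizon, -1)

class RefinementNet(nn.Module):
    def __init__(self, joint_dim=66, hidden=256):
        super().__init__()
        self.enc = nn.GRU(joint_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, joint_dim)

    def forward(self, history, coarse):
        x = torch.cat([history, coarse], dim=1)   # condition on history + coarse prediction
        out, _ = self.enc(x)
        residual = self.head(out[:, -coarse.size(1):])
        return coarse + residual                  # refined future motion

if __name__ == "__main__":
    hist = torch.randn(8, 25, 66)                 # 25 observed frames, 22 joints x 3 coords
    coarse = CoarsePredictor()(hist)
    refined = RefinementNet()(hist, coarse)
    print(coarse.shape, refined.shape)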
Abstract:Human pose estimation is the task of localizing body keypoints from still images. State-of-the-art methods suffer from insufficient examples of challenging cases such as symmetric appearance, heavy occlusion, and nearby persons. To increase the number of challenging cases, previous methods augment images by cropping and pasting image patches with weak semantics, which leads to unrealistic appearances and limited diversity. We instead propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularities. Furthermore, we propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict tailored pasting configurations. Given an off-the-shelf pose estimation network as the discriminator, the generator seeks the most confusing transformation to increase the loss of the discriminator, while the discriminator takes the generated sample as input and learns from it. The whole pipeline is optimized in an adversarial manner. State-of-the-art results are achieved on challenging benchmarks.
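To make the adversarial objective concrete, here is a hedged PyTorch sketch of one training step: a small generator predicts a pasting configuration (part selection, location, scale) and is updated to maximize the pose estimator's loss, after which the estimator is updated on the detached augmented sample. The configuration parameterization, the paste routine, and the toy pose network are illustrative stand-ins, not the paper's implementation.

# Hedged sketch of adversarial augmentation: generator maximizes the pose loss,
# the pose estimator (discriminator) minimizes it on the generated hard sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfigGenerator(nn.Module):
    """Predicts a pasting configuration from image features (assumed parameterization)."""
    def __init__(self, in_dim=128, n_parts=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_parts + 3))  # part logits + (x, y, scale)

    def forward(self, feats):
        out = self.net(feats)
        part = out[:, :-3].softmax(dim=-1)        # which body part to paste (soft selection)
        xy = out[:, -3:-1].sigmoid()              # normalized paste location
        scale = out[:, -1:].sigmoid() + 0.5       # paste scale in [0.5, 1.5]
        return part, xy, scale

def adversarial_step(gen, pose_net, paste_fn, imgs, feats, targets, opt_gen, opt_pose):
    # Generator step: seek the most confusing augmentation (maximize pose loss).
    part, xy, scale = gen(feats)
    aug = paste_fn(imgs, part, xy, scale)
    loss_pose = F.mse_loss(pose_net(aug), targets)
    opt_gen.zero_grad()
    (-loss_pose).backward()
    opt_gen.step()
    # Discriminator (pose estimator) step: learn from the generated hard sample.
    part, xy, scale = gen(feats)
    aug = paste_fn(imgs, part, xy, scale).detach()
    loss = F.mse_loss(pose_net(aug), targets)
    opt_pose.zero_grad()
    loss.backward()
    opt_pose.step()
    return loss.item()

if __name__ == "__main__":
    # Dummy demo: a trivial "paste" (brightness shift by scale) and a toy pose regressor.
    gen = ConfigGenerator()
    pose_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 34))
    paste_fn = lambda img, part, xy, scale: img + 0.1 * scale.view(-1, 1, 1, 1)
    imgs, feats, targets = torch.randn(4, 3, 64, 64), torch.randn(4, 128), torch.randn(4, 34)
    opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
    opt_p = torch.optim.Adam(pose_net.parameters(), lr=1e-4)
    print(adversarial_step(gen, pose_net, paste_fn, imgs, feats, targets, opt_g, opt_p))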
Abstract:Crowd counting is a challenging task of broad concern in computer vision. Existing density-map-based methods focus excessively on localizing individuals, which harms counting performance in highly congested scenes. In addition, the dependency between regions of different densities is ignored. In this paper, we propose Relevant Region Prediction (RRP) for crowd counting, which consists of a Count Map and a Region Relation-Aware Module (RRAM). Each pixel in the count map represents the number of heads falling into the corresponding local area of the input image, which discards detailed spatial information and forces the network to pay more attention to counting rather than localizing individuals. Based on a Graph Convolutional Network (GCN), the Region Relation-Aware Module is proposed to capture and exploit important region dependencies. The module builds a fully connected directed graph over regions of different densities, where each node (region) is represented by a weighted globally pooled feature, and a GCN is learned to map this region graph to a set of relation-aware region representations. Experimental results on three datasets show that our method clearly outperforms existing state-of-the-art methods.
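As an illustration of the count-map target described above, the following NumPy sketch counts how many annotated head points fall into each local patch of the image; sub-patch localization is deliberately discarded. The patch size and array layout are assumptions for illustration, not the paper's configuration.

# Hedged sketch: build a count-map regression target from head point annotations.
import numpy as np

def build_count_map(points, img_h, img_w, patch=32):
    """points: (N, 2) array of (x, y) head annotations in pixel coordinates."""
    gh, gw = img_h // patch, img_w // patch
    count_map = np.zeros((gh, gw), dtype=np.float32)
    for x, y in points:
        i = min(int(y) // patch, gh - 1)      # row of the local area containing this head
        j = min(int(x) // patch, gw - 1)
        count_map[i, j] += 1.0                # count heads per region, not per pixel
    return count_map

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(0, [640, 480], size=(300, 2))   # 300 heads in a 640x480 image
    cm = build_count_map(pts, img_h=480, img_w=640)
    print(cm.shape, cm.sum())                         # total count is preserved (300)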