Yuhongze Zhou

CamDirector: Towards Long-Term Coherent Video Trajectory Editing

Feb 27, 2026

Zhihao Shi, Kejia Yin, Weilin Wan, Yuhongze Zhou, Yuanhao Yu, Xinxin Zuo, Qiang Sun, Juwei Lu

Abstract: Video (camera) trajectory editing (VTE) aims to synthesize new videos that follow user-defined camera paths while preserving scene content and plausibly inpainting previously unseen regions, upgrading amateur footage into professionally styled videos. Existing VTE methods struggle with precise camera control and long-range consistency because they either inject target poses through a limited-capacity embedding or rely on single-frame warping with only implicit cross-frame aggregation in video diffusion models. To address these issues, we introduce a new VTE framework that 1) explicitly aggregates information across the entire source video via a hybrid warping scheme: static regions are progressively fused into a world cache and then rendered to target camera poses, while dynamic regions are directly warped, and their fusion yields globally consistent coarse frames that guide refinement; and 2) processes video segments jointly with their history via a history-guided autoregressive diffusion model, while the world cache is incrementally updated to reinforce already inpainted content, enabling long-term temporal coherence. Finally, we present iPhone-PTZ, a new VTE benchmark with diverse camera motions and large trajectory variations, and achieve state-of-the-art performance with fewer parameters.

IMFine: 3D Inpainting via Geometry-guided Multi-view Refinement

Mar 06, 2025

Zhihao Shi, Dong Huo, Yuhongze Zhou, Kejia Yin, Yan Min, Juwei Lu, Xinxin Zuo

Abstract: Current 3D inpainting and object removal methods are largely limited to front-facing scenes, facing substantial challenges when applied to diverse, "unconstrained" scenes where the camera orientation and trajectory are unrestricted. To bridge this gap, we introduce a novel approach that produces inpainted 3D scenes with consistent visual quality and coherent underlying geometry across both front-facing and unconstrained scenes. Specifically, we propose a robust 3D inpainting pipeline that incorporates geometric priors and a multi-view refinement network trained via test-time adaptation, building on a pre-trained image inpainting model. Additionally, we develop a novel inpainting mask detection technique to derive targeted inpainting masks from object masks, boosting the performance in handling unconstrained scenes. To validate the efficacy of our approach, we create a challenging and diverse benchmark that spans a wide range of scenes. Comprehensive experiments demonstrate that our proposed method substantially outperforms existing state-of-the-art approaches.

* Accepted at CVPR 2025. Project page: https://xinxinzuo2353.github.io/imfine/

Attentional Graph Convolutional Network for Structure-aware Audio-Visual Scene Classification

Dec 31, 2022

Liguang Zhou, Yuhongze Zhou, Xiaonan Qi, Junjie Hu, Tin Lun Lam, Yangsheng Xu

Abstract: Audio-visual scene understanding is a challenging problem due to the unstructured spatial-temporal relations in audio signals and the spatial layouts of different objects and varied texture patterns in visual images. Recently, many studies have focused on abstracting features from convolutional neural networks, while the explicit learning of semantically relevant frames of sound signals and visual images has been overlooked. To this end, we present an end-to-end framework, namely the attentional graph convolutional network (AGCN), for structure-aware audio-visual scene representation. First, the sound spectrogram and the input image are processed by a backbone network for feature extraction. Then, to build multi-scale hierarchical information from the input features, we use an attention fusion mechanism to aggregate features from multiple layers of the backbone network. Notably, to represent the salient regions and contextual information of audio-visual inputs, the salient acoustic graph (SAG), contextual acoustic graph (CAG), salient visual graph (SVG), and contextual visual graph (CVG) are constructed for the audio-visual scene representation. Finally, the constructed graphs pass through a graph convolutional network for structure-aware audio-visual scene recognition. Extensive experiments on audio, visual, and audio-visual scene recognition datasets show that AGCN achieves promising results. Visualizations of the graphs on spectrograms and images show that the proposed CAG/SAG and CVG/SVG focus on salient and semantically relevant regions.

Peer Learning for Unbiased Scene Graph Generation

Dec 31, 2022

Liguang Zhou, Junjie Hu, Yuhongze Zhou, Tin Lun Lam, Yangsheng Xu

Abstract: In this paper, we propose a novel framework dubbed peer learning to deal with the problem of biased scene graph generation (SGG). This framework uses predicate sampling and consensus voting (PSCV) to encourage different peers to learn from each other, improving model diversity and mitigating bias in SGG. To address the heavily long-tailed distribution of predicate classes, we propose to use predicate sampling to divide and conquer this issue. As a result, the model is less biased and makes more balanced predicate predictions. Specifically, one peer may not be sufficiently diverse to discriminate between different levels of predicate distributions. Therefore, we sample the data distribution into sub-distributions based on predicate frequency, selecting head, body, and tail classes to combine and feed to different peers as complementary predicate knowledge during training. The complementary predicate knowledge of these peers is then ensembled using a consensus voting strategy, which simulates a civilized voting process in our society that emphasizes the majority opinion and diminishes the minority opinion. This approach ensures that the learned representations of each peer are optimally adapted to the various data distributions. Extensive experiments on the Visual Genome dataset demonstrate that PSCV outperforms previous methods. We establish a new state of the art (SOTA) on the SGCls task by achieving a mean of 31.6.
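
The two ideas named in this abstract, frequency-based predicate sampling and consensus voting, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the head/tail fractions, and the plain majority vote with first-peer tie-breaking are all assumptions made for illustration.

```python
from collections import Counter


def split_by_frequency(predicate_counts, head_frac=0.2, tail_frac=0.4):
    """Rank predicate classes by frequency and split them into groups.

    head_frac and tail_frac are hypothetical thresholds, not values
    taken from the paper.
    """
    ranked = [
        name
        for name, _ in sorted(
            predicate_counts.items(), key=lambda kv: (-kv[1], kv[0])
        )
    ]
    n = len(ranked)
    n_head = max(1, round(n * head_frac))
    n_tail = max(1, round(n * tail_frac))
    return {
        "head": ranked[:n_head],
        "body": ranked[n_head:n - n_tail],
        "tail": ranked[n - n_tail:],
    }


def consensus_vote(peer_predictions):
    """Return the label most peers agree on (plain majority vote).

    Ties go to the earliest peer; this tie-break is an assumption.
    """
    counts = Counter(peer_predictions)
    best = max(counts.values())
    for label in peer_predictions:
        if counts[label] == best:
            return label


counts = {"on": 1000, "has": 800, "wearing": 300, "riding": 50, "parked on": 10}
groups = split_by_frequency(counts)
winner = consensus_vote(["on", "riding", "riding"])
```

In this toy setup each peer would be trained on a different combination of the head, body, and tail groups, and their per-relation predictions would be merged with `consensus_vote`.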

Context-aware Mixture-of-Experts for Unbiased Scene Graph Generation

Aug 21, 2022

Liguang Zhou, Yuhongze Zhou, Tin Lun Lam, Yangsheng Xu

Abstract: Scene graph generation has made tremendous progress in recent years. However, its intrinsic long-tailed distribution of predicate classes is a challenging problem. Almost all existing scene graph generation (SGG) methods follow the same framework, using a similar backbone network for object detection and a customized network for scene graph generation. These methods often design a sophisticated context encoder to extract the inherent relevance of scene context w.r.t. the intrinsic predicates, along with complicated networks to improve the model's learning capability on highly imbalanced data distributions. To address the unbiased SGG problem, we present a simple yet effective method called Context-Aware Mixture-of-Experts (CAME) to improve model diversity and alleviate biased SGG without a sophisticated design. Specifically, we propose to use a mixture of experts to remedy the heavily long-tailed distribution of predicate classes, which is suitable for most unbiased scene graph generators. With a mixture of relation experts, the long-tailed distribution of predicates is addressed in a divide-and-ensemble manner. As a result, biased SGG is mitigated and the model tends to make more balanced predicate predictions. However, experts with the same weight are not sufficiently diverse to discriminate between the different levels of predicate distributions. Hence, we simply use the built-in context-aware encoder to help the network dynamically leverage rich scene characteristics and further increase the diversity of the model. By using the context information of the image, the importance of each expert w.r.t. the scene context is dynamically assigned. We have conducted extensive experiments on three tasks on the Visual Genome dataset to show that CAME achieves superior performance over previous methods.
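
The dynamic weighting of experts described here can be sketched as a softmax gate over per-expert scores. This is a generic mixture-of-experts illustration under stated assumptions, not CAME itself: `gate_scores` stands in for whatever the context-aware encoder would produce, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_logits, gate_scores):
    """Combine per-expert predicate logits with context-dependent weights.

    expert_logits: one list of class logits per expert.
    gate_scores: one scalar per expert (assumed to come from a
    context encoder; here they are just given as inputs).
    """
    weights = softmax(gate_scores)
    n_classes = len(expert_logits[0])
    return [
        sum(w * logits[c] for w, logits in zip(weights, expert_logits))
        for c in range(n_classes)
    ]


experts = [[2.0, 0.0], [0.0, 2.0]]
uniform = mix_experts(experts, [0.0, 0.0])   # equal weights: plain average
skewed = mix_experts(experts, [10.0, -10.0])  # context favors expert 0
```

With equal gate scores the result is the plain average of the experts, which corresponds to the "same weight" case the abstract calls insufficiently diverse; unequal scores let one expert dominate for a given scene.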

Feature Pyramid Attention based Residual Neural Network for Environmental Sound Classification

May 28, 2022

Liguang Zhou, Yuhongze Zhou, Xiaonan Qi, Junjie Hu, Tin Lun Lam, Yangsheng Xu

Abstract: Environmental sound classification (ESC) is a challenging problem due to the unstructured spatial-temporal relations that exist in sound signals. Recently, many studies have focused on abstracting features from convolutional neural networks, while the learning of semantically relevant frames of sound signals has been overlooked. To this end, we present an end-to-end framework, namely feature pyramid attention network (FPAM), focusing on abstracting the semantically relevant features for ESC. We first extract the feature maps of the preprocessed spectrogram of the sound waveform with a backbone network. Then, to build multi-scale hierarchical features of sound spectrograms, we construct a feature pyramid representation of the spectrograms by aggregating the feature maps from multi-scale layers, where the temporal frames and spatial locations of semantically relevant frames are localized by FPAM. Specifically, the multiple features are first processed by a dimension alignment module. Afterward, the pyramid spatial attention module (PSA) is attached to localize the important frequency regions spatially with a spatial attention module (SAM). Last, the processed feature maps are refined by a pyramid channel attention (PCA) module to localize the important temporal frames. To justify the effectiveness of the proposed FPAM, we present visualizations of attention maps on the spectrograms. The visualization results show that FPAM can focus more on the semantically relevant regions while neglecting noise. The effectiveness of the proposed method is validated on two widely used ESC datasets: ESC-50 and ESC-10. The experimental results show that FPAM yields performance comparable to state-of-the-art methods and a substantial performance increase over the baseline methods.

View Blind-spot as Inpainting: Self-Supervised Denoising with Mask Guided Residual Convolution

Sep 10, 2021

Yuhongze Zhou, Liguang Zhou, Tin Lun Lam, Yangsheng Xu

Abstract: In recent years, self-supervised denoising methods have shown impressive performance, circumventing the painstaking collection of noisy-clean image pairs required by supervised denoising methods and broadening the applicability of denoising in the real world. One well-known self-supervised denoising strategy is the blind-spot training scheme. However, few works attempt to improve blind-spot based self-denoisers in terms of network architecture. In this paper, we take an intuitive view of the blind-spot strategy and consider its process of using neighboring pixels to predict manipulated pixels as an inpainting process. Therefore, we propose integrating a novel Mask Guided Residual Convolution (MGRConv) into common convolutional neural networks, e.g. U-Net, to promote blind-spot based denoising. MGRConv can be regarded as a soft partial convolution that finds a trade-off among partial convolution, learnable attention maps, and gated convolution. It enables dynamic mask learning with appropriate mask constraints. Unlike partial convolution and gated convolution, it provides moderate freedom for network learning. Unlike learnable attention maps, it also avoids external learnable parameters for mask activation. Experiments show that the proposed plug-and-play MGRConv can help blind-spot based denoising networks reach promising results with both existing single-image based and dataset-based methods.
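
The blind-spot scheme this abstract builds on can be sketched generically: a few pixels are overwritten with a random neighbor's value, and the loss is computed only at those positions against the original noisy values. The sketch below shows that masking and loss in pure Python. It does not implement MGRConv, and the function names and the 8-neighborhood replacement rule are assumptions.

```python
import random


def mask_blind_spots(image, n_spots, rng):
    """Replace n_spots pixels with a random neighbor's value.

    image: 2D list of floats (the noisy input).
    Returns the manipulated copy and the list of masked coordinates.
    """
    h, w = len(image), len(image[0])
    masked = [row[:] for row in image]
    coords = rng.sample([(r, c) for r in range(h) for c in range(w)], n_spots)
    for r, c in coords:
        neighbors = [
            (r + dr, c + dc)
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < h and 0 <= c + dc < w
        ]
        nr, nc = rng.choice(neighbors)
        masked[r][c] = image[nr][nc]
    return masked, coords


def blind_spot_loss(prediction, noisy, coords):
    """Mean squared error evaluated only at the masked positions."""
    return sum((prediction[r][c] - noisy[r][c]) ** 2 for r, c in coords) / len(coords)


noisy = [[float(4 * r + c) for c in range(4)] for r in range(4)]
masked, coords = mask_blind_spots(noisy, 3, random.Random(0))
loss_if_perfect = blind_spot_loss(noisy, noisy, coords)
```

Because the network only sees `masked` and is scored only at `coords`, it has to predict each hidden pixel from its surroundings, which is the sense in which the abstract treats blind-spot denoising as inpainting.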

Human Perception Modeling for Automatic Natural Image Matting

Mar 31, 2021

Yuhongze Zhou, Liguang Zhou, Tin Lun Lam, Yangsheng Xu

Abstract: Natural image matting aims to precisely separate foreground objects from the background using an alpha matte. Fully automatic natural image matting without external annotation is quite challenging. Well-performing matting methods usually require an accurate handcrafted trimap as extra input, which is labor-intensive and time-consuming to produce, while the performance of automatic trimap generation by dilating a foreground segmentation fluctuates with segmentation quality. In this paper, we argue that handling the trade-off of additional information input is a major issue in automatic matting, which we decompose into two subtasks: trimap estimation and alpha estimation. By leveraging easily accessible coarse annotations and modeling the manual alpha matte creation process of capturing the rough foreground/background/transition boundary and then carving delicate details in the transition region, we propose an intuitively designed, trimap-free, two-stage matting approach without additional annotations, e.g. trimap and background image. Specifically, given an image and its coarse foreground segmentation, the Trimap Generation Network estimates probabilities of foreground, unknown, and background regions to guide the alpha feature flow of our proposed Non-Local Matting network, which is equipped with a trimap-guided global aggregation attention block. Experimental results show that our matting algorithm performs competitively with current state-of-the-art methods in both trimap-free and trimap-based settings.
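
For readers unfamiliar with the terms, the sketch below illustrates the baseline the abstract contrasts against, namely deriving a trimap from a coarse binary foreground mask by dilation and erosion, together with the standard compositing relation I = alpha * F + (1 - alpha) * B that defines an alpha matte. It is not the paper's Trimap Generation Network; the function names, the square structuring element, and the 0.0/0.5/1.0 encoding are assumptions.

```python
def dilate(mask, radius):
    """Binary dilation with a (2*radius+1) square window."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                    if mask[rr][cc]:
                        out[r][c] = 1
    return out


def erode(mask, radius):
    """Binary erosion, computed as the complement of dilating the complement."""
    inverted = [[1 - v for v in row] for row in mask]
    return [[1 - v for v in row] for row in dilate(inverted, radius)]


def trimap_from_mask(mask, radius=1):
    """1.0 = sure foreground, 0.0 = sure background, 0.5 = unknown band."""
    grown, shrunk = dilate(mask, radius), erode(mask, radius)
    return [
        [1.0 if s else (0.5 if g else 0.0) for s, g in zip(s_row, g_row)]
        for s_row, g_row in zip(shrunk, grown)
    ]


def composite(alpha, foreground, background):
    """Per-pixel compositing: I = alpha * F + (1 - alpha) * B."""
    return alpha * foreground + (1.0 - alpha) * background


# 7x7 mask with a 3x3 foreground block in the middle.
mask = [[1 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)] for r in range(7)]
trimap = trimap_from_mask(mask, radius=1)
```

The width of the unknown band depends entirely on `radius` and on how accurate the input mask is, which is the fragility the abstract attributes to dilation-based trimaps.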

GAN-Based Facial Attractiveness Enhancement

Jun 04, 2020

Yuhongze Zhou, Qinjie Xiao

Abstract: We propose a generative framework based on a generative adversarial network (GAN) to enhance facial attractiveness while preserving facial identity and high fidelity. Given a portrait image as input, we apply gradient descent to recover a latent vector from which the framework can synthesize an image resembling the input. Beauty semantic editing of the recovered latent vector, based on InterFaceGAN, then enables the framework to achieve facial image beautification. We compare our system with Beholder-GAN and with our proposed result-enhanced version of Beholder-GAN. Our framework obtains state-of-the-art attractiveness enhancement results. The code is available at https://github.com/zoezhou1999/BeautifyBasedOnGAN.
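
The latent editing step can be sketched as moving the recovered latent vector along a fixed semantic direction, which is the linear edit InterFaceGAN is known for. The sketch assumes the direction has already been learned and the latent vector already recovered; the function name, the toy vectors, and the `strength` parameter are hypothetical, and no GAN is involved here.

```python
import math


def edit_latent(latent, direction, strength):
    """Shift a latent vector along a unit-normalized semantic direction.

    latent: recovered latent vector (list of floats).
    direction: attribute direction in latent space (assumed given).
    strength: how far to move; the sign selects the edit direction.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [z + strength * u for z, u in zip(latent, unit)]


edited = edit_latent([0.0, 0.0], [3.0, 4.0], 5.0)
```

In the full pipeline the edited vector would be passed back through the generator to synthesize the modified portrait.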
