Xu Yao

VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing

Apr 08, 2025

Juan Luis Gonzalez Bello, Xu Yao, Alex Whelan, Kyle Olszewski, Hyeongwoo Kim, Pablo Garrido

Abstract: We present an implicit video representation for occlusion, appearance, and motion disentanglement from monocular videos, which we call Video SPatiotemporal Splines (VideoSPatS). Unlike previous methods that map time and coordinates to deformation and canonical colors, our VideoSPatS maps input coordinates into Spatial and Color Spline deformation fields $D_s$ and $D_c$, which disentangle motion and appearance in videos. With spline-based parametrization, our method naturally generates temporally consistent flow and guarantees long-term temporal consistency, which is crucial for convincing video editing. Using multiple prediction branches, our VideoSPatS model also performs layer separation between the latent video and the selected occluder. By disentangling occlusions, appearance, and motion, our method enables better spatiotemporal modeling and editing of diverse videos, including in-the-wild talking head videos with challenging occlusions, shadows, and specularities while maintaining an appropriate canonical space for editing. We also present general video modeling results on the DAVIS and CoDeF datasets, as well as our own talking head video dataset collected from open-source web videos. Extensive ablations show the combination of $D_s$ and $D_c$ under neural splines can overcome motion and appearance ambiguities, paving the way for more advanced video editing models.

* CVPR25, project website: https://juanluisg-flwls.github.io/videospats-website/
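
To illustrate why a spline parametrization over time yields smooth, temporally consistent motion, here is a minimal sketch that evaluates a 2D displacement from a handful of control points with a uniform cubic B-spline. This is not the VideoSPatS architecture: the choice of B-spline, the function names `bspline_basis` and `displacement_at`, and the control-point layout are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bspline_basis(u: float) -> Tuple[float, float, float, float]:
    """Uniform cubic B-spline blending weights for a local parameter u in [0, 1]."""
    b0 = (1 - u) ** 3 / 6.0
    b1 = (3 * u**3 - 6 * u**2 + 4) / 6.0
    b2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0
    b3 = u**3 / 6.0
    return b0, b1, b2, b3


def displacement_at(control: List[Point], t: float) -> Point:
    """Evaluate a 2D displacement at normalized time t in [0, 1].

    `control` holds hypothetical per-pixel control displacements; at least
    four are needed to form one cubic segment.
    """
    n_segments = len(control) - 3
    if n_segments < 1:
        raise ValueError("need at least 4 control points")
    s = min(max(t, 0.0), 1.0) * n_segments  # position along the whole curve
    i = min(int(s), n_segments - 1)         # index of the active segment
    u = s - i                               # local parameter inside the segment
    weights = bspline_basis(u)
    dx = sum(w * control[i + k][0] for k, w in enumerate(weights))
    dy = sum(w * control[i + k][1] for k, w in enumerate(weights))
    return dx, dy


if __name__ == "__main__":
    ctrl = [(0.0, 0.0), (1.0, 0.5), (2.0, -0.5), (1.5, 1.0), (0.5, 0.0)]
    for frame in range(5):
        print(frame, displacement_at(ctrl, frame / 4))
```

Because the blending weights vary smoothly and always sum to one, nearby time steps produce nearby displacements, which is the property the abstract relies on for long-term consistency.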

Smart Audit System Empowered by LLM

Oct 10, 2024

Xu Yao, Xiaoxu Wu, Xi Li, Huan Xu, Chenlei Li, Ping Huang, Si Li, Xiaoning Ma, Jiulong Shan

Abstract: Manufacturing quality audits are pivotal for ensuring high product standards in mass production environments. Traditional auditing processes, however, are labor-intensive and reliant on human expertise, posing challenges in maintaining transparency, accountability, and continuous improvement across complex global supply chains. To address these challenges, we propose a smart audit system empowered by large language models (LLMs). Our approach introduces three innovations: a dynamic risk assessment model that streamlines audit procedures and optimizes resource allocation; a manufacturing compliance copilot that enhances data processing, retrieval, and evaluation for a self-evolving manufacturing knowledge base; and a Re-act framework commonality analysis agent that provides real-time, customized analysis to empower engineers with insights for supplier improvement. These enhancements elevate audit efficiency and effectiveness, with testing scenarios demonstrating an improvement of over 24%.

Enhancing Molecular Property Prediction via Mixture of Collaborative Experts

Dec 06, 2023

Xu Yao, Shuang Liang, Songqiao Han, Hailiang Huang

Abstract: The Molecular Property Prediction (MPP) task involves predicting biochemical properties based on molecular features, such as molecular graph structures, contributing to the discovery of lead compounds in drug development. To address data scarcity and imbalance in MPP, some studies have adopted Graph Neural Networks (GNN) as an encoder to extract commonalities from molecular graphs. However, these approaches often use a separate predictor for each task, neglecting the shared characteristics among predictors corresponding to different tasks. In response to this limitation, we introduce the GNN-MoCE architecture. It employs the Mixture of Collaborative Experts (MoCE) as predictors, exploiting task commonalities while confronting the homogeneity issue in the expert pool and the decision dominance dilemma within the expert group. To enhance expert diversity for collaboration among all experts, the Expert-Specific Projection method is proposed to assign a unique projection perspective to each expert. To balance decision-making influence for collaboration within the expert group, the Expert-Specific Loss is presented to integrate individual expert loss into the weighted decision loss of the group for more equitable training. Benefiting from the enhancements of MoCE in expert creation, dynamic expert group formation, and experts' collaboration, our model demonstrates superior performance over traditional methods on 24 MPP datasets, especially in tasks with limited data or high imbalance.

* 11 pages, 8 figures
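
The idea of folding individual expert losses into the group's weighted decision loss can be sketched as follows. This is only an illustration under stated assumptions, not the GNN-MoCE implementation: the softmax gate, the binary cross-entropy loss, the averaging of expert losses, the mixing weight `alpha`, and all function names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bce(prob: float, label: float) -> float:
    """Binary cross-entropy for one prediction, clipped to avoid log(0)."""
    eps = 1e-7
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def group_loss_with_expert_terms(
    expert_probs: List[float],
    gate_scores: List[float],
    label: float,
    alpha: float = 0.5,  # hypothetical weight on the per-expert term
) -> float:
    """Weighted group decision loss plus the mean of individual expert losses."""
    weights = softmax(gate_scores)
    group_prob = sum(w * p for w, p in zip(weights, expert_probs))
    decision_loss = bce(group_prob, label)
    expert_loss = sum(bce(p, label) for p in expert_probs) / len(expert_probs)
    return decision_loss + alpha * expert_loss


if __name__ == "__main__":
    # One dominant expert (high gate score) and two weaker ones.
    print(group_loss_with_expert_terms([0.9, 0.4, 0.3], [3.0, 0.0, 0.0], label=1.0))
```

With `alpha = 0` only the gated group prediction is trained, so an expert with a dominant gate weight receives most of the gradient; the extra term gives every expert in the group a direct training signal.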

Video Coding Using Learned Latent GAN Compression

Jul 12, 2022

Mustafa Shukor, Bharath Bhushan Damodaran, Xu Yao, Pierre Hellier

Abstract: We propose in this paper a new paradigm for facial video compression. We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video, including intra and inter compression. Each frame is inverted in the latent space of StyleGAN, from which the optimal compression is learned. To do so, a diffeomorphic latent representation is learned using a normalizing flows model, where an entropy model can be optimized for image coding. In addition, we propose a new perceptual loss that is more efficient than other counterparts. Finally, an entropy model for video inter coding with residual is also learned in the previously constructed latent representation. Our method (SGANC) is simple, faster to train, and achieves better results for image and video coding compared to state-of-the-art codecs such as VTM, AV1, and recent deep learning techniques. In particular, it drastically minimizes perceptual distortion at low bit rates.

* Accepted at ACM Multimedia 2022

Semantic Unfolding of StyleGAN Latent Space

Jun 29, 2022

Mustafa Shukor, Xu Yao, Bharath Bhushan Damodaran, Pierre Hellier

Abstract: Generative adversarial networks (GANs) have proven to be surprisingly efficient for image editing by inverting and manipulating the latent code corresponding to an input real image. This editing property emerges from the disentangled nature of the latent space. In this paper, we identify that the facial attribute disentanglement is not optimal, and thus facial editing relying on linear attribute separation is flawed. We therefore propose to improve semantic disentanglement with supervision. Our method consists in learning a proxy latent representation using normalizing flows, and we show that this leads to a more efficient space for face image editing.

* Accepted at ICIP22

Feature-Style Encoder for Style-Based GAN Inversion

Feb 04, 2022

Xu Yao, Alasdair Newson, Yann Gousseau, Pierre Hellier

Abstract: We propose a novel architecture for GAN inversion, which we call Feature-Style encoder. The style encoder is key for the manipulation of the obtained latent codes, while the feature encoder is crucial for optimal image reconstruction. Our model achieves accurate inversion of real images from the latent space of a pre-trained style-based GAN model, obtaining better perceptual quality and lower reconstruction error than existing methods. Thanks to its encoder structure, the model allows fast and accurate image editing. Additionally, we demonstrate that the proposed encoder is especially well-suited for inversion and editing on videos. We conduct extensive experiments for several style-based generators pre-trained on different data domains. Our proposed method yields state-of-the-art results for style-based GAN inversion, significantly outperforming competing approaches. Source code is available at https://github.com/InterDigitalInc/FeatureStyleEncoder.

Semantic and Geometric Unfolding of StyleGAN Latent Space

Jul 09, 2021

Mustafa Shukor, Xu Yao, Bharath Bhushan Damodaran, Pierre Hellier

Abstract: Generative adversarial networks (GANs) have proven to be surprisingly efficient for image editing by inverting and manipulating the latent code corresponding to a natural image. This property emerges from the disentangled nature of the latent space. In this paper, we identify two geometric limitations of such a latent space: (a) Euclidean distances differ from image perceptual distance, and (b) disentanglement is not optimal and facial attribute separation using a linear model is a limiting hypothesis. We thus propose a new method to learn a proxy latent representation using normalizing flows to remedy these limitations, and show that this leads to a more efficient space for face image editing.

* 16 pages
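
A proxy latent representation learned with normalizing flows depends on the mapping being exactly invertible, so edits made in the proxy space can be carried back to the original latent space. The sketch below shows one common flow building block, an affine coupling layer in the style of RealNVP. The abstract does not say which flow is used, so the layer type, the toy `toy_scale` and `toy_shift` functions standing in for learned networks, and all names are assumptions.

```python
import math
from typing import Callable, List

Vector = List[float]
Net = Callable[[Vector], Vector]


def _split(v: Vector):
    """Split an even-length vector into two halves."""
    if len(v) % 2 != 0:
        raise ValueError("this sketch assumes an even-length latent vector")
    half = len(v) // 2
    return v[:half], v[half:]


def coupling_forward(z: Vector, scale_net: Net, shift_net: Net) -> Vector:
    """Keep the first half; scale and shift the second half conditioned on it."""
    kept, moved = _split(z)
    log_s, t = scale_net(kept), shift_net(kept)
    out = [m * math.exp(s) + b for m, s, b in zip(moved, log_s, t)]
    return kept + out


def coupling_inverse(y: Vector, scale_net: Net, shift_net: Net) -> Vector:
    """Undo coupling_forward exactly, using the untouched first half."""
    kept, moved = _split(y)
    log_s, t = scale_net(kept), shift_net(kept)
    out = [(m - b) * math.exp(-s) for m, s, b in zip(moved, log_s, t)]
    return kept + out


def toy_scale(v: Vector) -> Vector:
    """Stand-in for a learned log-scale network (hypothetical)."""
    return [math.tanh(sum(v))] * len(v)


def toy_shift(v: Vector) -> Vector:
    """Stand-in for a learned shift network (hypothetical)."""
    return [0.5 * x for x in v]


if __name__ == "__main__":
    z = [0.3, -1.2, 0.7, 2.0]
    y = coupling_forward(z, toy_scale, toy_shift)
    print(y, coupling_inverse(y, toy_scale, toy_shift))
```

The inverse needs no approximation because the first half passes through unchanged and fully determines the scale and shift applied to the second half.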

A Latent Transformer for Disentangled and Identity-Preserving Face Editing

Jun 22, 2021

Xu Yao, Alasdair Newson, Yann Gousseau, Pierre Hellier

Abstract: High-quality facial image editing is a challenging problem in the movie post-production industry, requiring a high degree of control and identity preservation. Previous works that attempt to tackle this problem may suffer from the entanglement of facial attributes and the loss of the person's identity. Furthermore, many algorithms are limited to a certain task. To tackle these limitations, we propose to edit facial attributes via the latent space of a StyleGAN generator, by training a dedicated latent transformation network and incorporating explicit disentanglement and identity preservation terms in the loss function. We further introduce a pipeline to generalize our face editing to videos. Our model achieves disentangled, controllable, and identity-preserving facial attribute editing, even in the challenging case of real (i.e., non-synthetic) images and videos. We conduct extensive experiments on image and video datasets and show that our model outperforms other state-of-the-art methods in visual quality and quantitative evaluation.

High Resolution Face Age Editing

May 09, 2020

Xu Yao, Gilles Puy, Alasdair Newson, Yann Gousseau, Pierre Hellier

Abstract: Face age editing has become a crucial task in film post-production, and is also becoming popular for general-purpose photography. Recently, adversarial training has produced some of the most visually impressive results for image manipulation, including the face aging/de-aging task. In spite of considerable progress, current methods often present visual artifacts and can only deal with low-resolution images. In order to achieve aging/de-aging with the high quality and robustness necessary for wider use, these problems need to be addressed. This is the goal of the present work. We present an encoder-decoder architecture for face age editing. The core idea of our network is to create both a latent space containing the face identity, and a feature modulation layer corresponding to the age of the individual. We then combine these two elements to produce an output image of the person with a desired target age. Our architecture is greatly simplified with respect to other approaches, and allows for continuous age editing on high-resolution images in a single unified model.

Photo style transfer with consistency losses

May 09, 2020

Xu Yao, Gilles Puy, Patrick Pérez

Abstract: We address the problem of style transfer between two photos and propose a new way to preserve photorealism. Using the single pair of photos available as input, we train a pair of deep convolutional networks (convnets), each of which transfers the style of one photo to the other. To enforce photorealism, we introduce a content-preserving mechanism by combining a cycle-consistency loss with a self-consistency loss. Experimental results show that this method does not suffer from typical artifacts observed in methods working in the same settings. We then further analyze some properties of these trained convnets. First, we notice that they can be used to stylize other unseen images with the same known style. Second, we show that retraining only a small subset of the network parameters can be sufficient to adapt these convnets to new styles.

* In 2019 IEEE International Conference on Image Processing (ICIP) (pp. 2314-2318). IEEE
