Cai Yu

A Causal Convolutional Low-rank Representation Model for Imputation of Water Quality Data

Apr 21, 2025

Xin Liao, Bing Yang, Tan Dongli, Cai Yu

Abstract: The monitoring of water quality is a crucial part of environmental protection, and a large number of monitors are widely deployed to monitor water quality. Due to unavoidable factors such as data acquisition breakdowns and sensor and communication failures, water quality monitoring data suffers from missing values over time, resulting in High-Dimensional and Sparse (HDS) Water Quality Data (WQD). The simple and rough filling of the missing values leads to inaccurate results and affects the implementation of relevant measures. Therefore, this paper proposes a Causal convolutional Low-rank Representation (CLR) model for imputing missing WQD to improve the completeness of the WQD, which employs a two-fold idea: a) applying a causal convolutional operation to consider the temporal dependence of the low-rank representation, thus incorporating temporal information to improve the imputation accuracy; and b) implementing a hyperparameter adaptation scheme to automatically adjust the hyperparameters during model training, thereby reducing the tedious manual adjustment of hyperparameters. Experimental studies on three real-world water quality datasets demonstrate that the proposed CLR model is superior to some of the existing state-of-the-art imputation models in terms of imputation accuracy and time cost, and indicate that the proposed model provides more reliable decision support for environmental monitoring.

* 9 pages, 3 figures
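
The causal convolution mentioned in the abstract above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `causal_conv1d` and `smooth_time_factors`, the kernel weights, and the idea of filtering each latent dimension independently are assumptions made purely for illustration. The only property shown is the defining one, namely that the output at time t depends on inputs at times up to t and never on later ones.

```python
import numpy as np


def causal_conv1d(x, kernel):
    """Causal 1-D convolution: y[t] = sum_k kernel[k] * x[t - k].

    The input is zero-padded on the left, so y[t] never depends on x[t + 1:].
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    # padded[t:t + k] holds x[t - k + 1], ..., x[t]; the reversed kernel
    # therefore puts kernel[0] on x[t] and kernel[k - 1] on x[t - k + 1].
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])


def smooth_time_factors(factors, kernel):
    """Hypothetical helper: filter each latent dimension (column) of a
    (time_steps, rank) factor matrix with the same causal kernel."""
    factors = np.asarray(factors, dtype=float)
    columns = [causal_conv1d(factors[:, r], kernel) for r in range(factors.shape[1])]
    return np.stack(columns, axis=1)


x = np.array([1.0, 2.0, 3.0, 4.0])
y = causal_conv1d(x, [0.5, 0.5])  # average of the current and previous sample
print(y)
```

Changing the last entry of `x` leaves every earlier entry of `y` unchanged, which is what distinguishes a causal filter from an ordinary centered one.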

Learning to Discover Forgery Cues for Face Forgery Detection

Sep 02, 2024

Jiahe Tian, Peng Chen, Cai Yu, Xiaomeng Fu, Xi Wang, Jiao Dai, Jizhong Han

Abstract: Locating manipulation maps, i.e., pixel-level annotation of forgery cues, is crucial for providing interpretable detection results in face forgery detection. Related learning objects have also been widely adopted as auxiliary tasks to improve the classification performance of detectors, but they require comparisons between paired real and forged faces to obtain manipulation maps as supervision. This requirement restricts their applicability to unpaired faces and contradicts real-world scenarios. Moreover, the comparison methods used annotate all changed pixels, including noise introduced by compression and upsampling. Using such maps as supervision hinders the learning of exploitable cues and makes models prone to overfitting. To address these issues, we introduce a weakly supervised model in this paper, named Forgery Cue Discovery (FoCus), to locate forgery cues in unpaired faces. Unlike some detectors that claim to locate forged regions in attention maps, FoCus is designed to sidestep their shortcomings of capturing partial and inaccurate forgery cues. Specifically, we propose a classification attentive regions proposal module to locate forgery cues during classification and a complementary learning module to facilitate the learning of richer cues. The produced manipulation maps can serve as better supervision to enhance face forgery detectors. Visualization of the manipulation maps of the proposed FoCus exhibits superior interpretability and robustness compared to existing methods. Experiments on five datasets and four multi-task models demonstrate the effectiveness of FoCus in both in-dataset and cross-dataset evaluations.

* TIFS 2024

Explicit Correlation Learning for Generalizable Cross-Modal Deepfake Detection

Apr 30, 2024

Cai Yu, Shan Jia, Xiaomeng Fu, Jin Liu, Jiahe Tian, Jiao Dai, Xi Wang, Siwei Lyu, Jizhong Han

Abstract: With the rising prevalence of deepfakes, there is a growing interest in developing generalizable detection methods for various types of deepfakes. While effective in their specific modalities, traditional detection methods fall short in addressing the generalizability of detection across diverse cross-modal deepfakes. This paper aims to explicitly learn potential cross-modal correlation to enhance deepfake detection towards various generation scenarios. Our approach introduces a correlation distillation task, which models the inherent cross-modal correlation based on content information. This strategy helps to prevent the model from overfitting merely to audio-visual synchronization. Additionally, we present the Cross-Modal Deepfake Dataset (CMDFD), a comprehensive dataset with four generation methods to evaluate the detection of diverse cross-modal deepfakes. The experimental results on CMDFD and FakeAVCeleb datasets demonstrate the superior generalizability of our method over existing state-of-the-art methods. Our code and data can be found at https://github.com/ljj898/CMDFD-Dataset-and-Deepfake-Detection.

* accepted by ICME 2024

OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions

Sep 28, 2023

Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han

Abstract: One-shot talking head generation has no explicit head movement reference, so it is difficult to generate talking heads with head motions. Some existing works only edit the mouth area and generate still talking heads, leading to unrealistic talking head performance. Other works construct a one-to-one mapping between audio signal and head motion sequences, introducing ambiguous correspondences into the mapping since people can behave differently in head motions when speaking the same content. This unreasonable mapping form fails to model the diversity and produces either nearly static or even exaggerated head motions, which are unnatural and strange. Therefore, the one-shot talking head generation task is actually a one-to-many ill-posed problem, and people present diverse head motions when speaking. Based on the above observation, we propose OSM-Net, a one-to-many one-shot talking head generation network with natural head motions. OSM-Net constructs a motion space that contains rich and various clip-level head motion features. Each basis of the space represents a feature of meaningful head motion in a clip rather than just a frame, thus providing more coherent and natural motion changes in talking heads. The driving audio is mapped into the motion space, around which various motion features can be sampled within a reasonable range to achieve the one-to-many mapping. Besides, the landmark constraint and time window feature input improve the accurate expression feature extraction and video generation. Extensive experiments show that OSM-Net generates more natural, realistic head motions under a reasonable one-to-many mapping paradigm compared with other methods.

* Paper Under Review

MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model

Aug 31, 2023

Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han

Abstract: Face-to-face communication is a common scenario including roles of speakers and listeners. Most existing research methods focus on producing speaker videos, while the generation of listener heads remains largely overlooked. Responsive listening head generation is an important task that aims to model face-to-face communication scenarios by generating a listener head video given a speaker video and a listener head image. An ideal generated responsive listening video should respond to the speaker by expressing attitude or viewpoint while maintaining diversity in interaction patterns and accuracy in listener identity information. To achieve this goal, we propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net). Specifically, MFR-Net employs the probabilistic denoising diffusion model to predict diverse head pose and expression features. In order to perform multi-faceted response to the speaker video while maintaining accurate listener identity preservation, we design the Feature Aggregation Module to boost listener identity features and fuse them with other speaker-related features. Finally, a renderer finetuned with identity consistency loss produces the final listening head videos. Our extensive experiments demonstrate that MFR-Net not only achieves multi-faceted responses in diversity and speaker identity information but also in attitude and viewpoint expression.

* Accepted by ACM MM 2023

Modality-Agnostic Audio-Visual Deepfake Detection

Jul 26, 2023

Cai Yu, Peng Chen, Jiahe Tian, Jin Liu, Jiao Dai, Xi Wang, Yesheng Chai, Jizhong Han

Abstract: As AI-generated content (AIGC) thrives, Deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either audio or visual components can be manipulated. While using two unimodal detectors can detect audio-visual deepfakes, cross-modal forgery clues could be overlooked. Existing multimodal deepfake detection methods typically establish correspondence between the audio and visual modalities for binary real/fake classification, and require the co-occurrence of both modalities. However, in real-world multi-modal applications, missing modality scenarios may occur where either modality is unavailable. In such cases, audio-visual detection methods are less practical than two independent unimodal methods. Consequently, the detector cannot always obtain the number or type of manipulated modalities beforehand, necessitating a fake-modality-agnostic audio-visual detector. In this work, we propose a unified fake-modality-agnostic scenarios framework that enables the detection of multimodal deepfakes and handles missing modality cases, whether the manipulation is hidden in audio, video, or even cross-modal forms. To enhance the modeling of cross-modal forgery clues, we choose audio-visual speech recognition (AVSR) as a preceding task, which effectively extracts speech correlation across modalities that is difficult for deepfakes to reproduce. Additionally, we propose a dual-label detection approach that follows the structure of AVSR to support the independent detection of each modality. Extensive experiments show that our scheme not only outperforms other state-of-the-art binary detection methods across all three audio-visual datasets but also achieves satisfying performance on detecting modality-agnostic audio/video fakes. Moreover, it even surpasses the joint use of two unimodal methods in the presence of missing modality cases.

FONT: Flow-guided One-shot Talking Head Generation with Natural Head Motions

Mar 31, 2023

Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han

Abstract: One-shot talking head generation has received growing attention in recent years, with various creative and practical applications. An ideal natural and vivid generated talking head video should contain natural head pose changes. However, it is challenging to map head pose sequences from driving audio since there exists a natural gap between the audio and visual modalities. In this work, we propose a Flow-guided One-shot model that achieves NaTural head motions (FONT) over generated talking heads. Specifically, the head pose prediction module is designed to generate head pose sequences from the source face and driving audio. We add the random sampling operation and the structural similarity constraint to model the diversity in the one-to-many mapping between the audio and visual modalities, thus predicting natural head poses. Then we develop a keypoint predictor that produces unsupervised keypoints from the source face, driving audio and pose sequences to describe the facial structure information. Finally, a flow-guided occlusion-aware generator is employed to produce photo-realistic talking head videos from the estimated keypoints and source face. Extensive experimental results prove that FONT generates talking heads with natural head poses and synchronized mouth shapes, outperforming other compared methods.

* Accepted by ICME 2023

OPT: One-shot Pose-Controllable Talking Head Generation

Feb 16, 2023

Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han

Abstract: One-shot talking head generation produces lip-sync talking heads based on arbitrary audio and one source face. To guarantee the naturalness and realness, recent methods propose to achieve free pose control instead of simply editing mouth areas. However, existing methods do not accurately preserve the identity of the source face when generating head motions. To solve the identity mismatch problem and achieve high-quality free pose control, we present One-shot Pose-controllable Talking head generation network (OPT). Specifically, the Audio Feature Disentanglement Module separates content features from audio, eliminating the influence of speaker-specific information contained in arbitrary driving audio. Later, the mouth expression feature is extracted from the content feature and source face, during which the landmark loss is designed to enhance the accuracy of facial structure and identity preserving quality. Finally, to achieve free pose control, controllable head pose features from reference videos are fed into the Video Generator along with the expression feature and source face to generate new talking heads. Extensive quantitative and qualitative experimental results verify that OPT generates high-quality pose-controllable talking heads with no identity mismatch problem, outperforming previous SOTA methods.

* Accepted by ICASSP 2023

LI-Net: Large-Pose Identity-Preserving Face Reenactment Network

Apr 07, 2021

Jin Liu, Peng Chen, Tao Liang, Zhaoxing Li, Cai Yu, Shuqiao Zou, Jiao Dai, Jizhong Han

Abstract: Face reenactment is a challenging task, as it is difficult to maintain accurate expression, pose and identity simultaneously. Most existing methods directly apply driving facial landmarks to reenact source faces and ignore the intrinsic gap between two identities, resulting in the identity mismatch issue. Besides, they neglect the entanglement of expression and pose features when encoding driving faces, leading to inaccurate expressions and visual artifacts on large-pose reenacted faces. To address these problems, we propose a Large-pose Identity-preserving face reenactment network, LI-Net. Specifically, the Landmark Transformer is adopted to adjust driving landmark images, which aims to narrow the identity gap between driving and source landmark images. Then the Face Rotation Module and the Expression Enhancing Generator decouple the transformed landmark image into pose and expression features, and reenact those attributes separately to generate identity-preserving faces with accurate expressions and poses. Both qualitative and quantitative experimental results demonstrate the superiority of our method.

* IEEE International Conference on Multimedia and Expo (ICME) 2021, Oral
