Jun Wah Ng

KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features

Aug 10, 2025

Ivan Kukanov, Jun Wah Ng

Abstract: The rapid development of audio-driven talking head generators and advanced Text-To-Speech (TTS) models has led to more sophisticated temporal deepfakes. These advances highlight the need for robust methods capable of detecting and localizing deepfakes, even under novel, unseen attack scenarios. Current state-of-the-art deepfake detectors, while accurate, are often computationally expensive and struggle to generalize to novel manipulation techniques. To address these challenges, we propose multimodal approaches for the AV-Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we adapt a self-supervised learning (SSL) backbone coupled with graph attention networks to capture rich audio representations, improving detection robustness. Our approach strikes a balance between performance and real-world deployment, focusing on resilience and potential interpretability. On the AV-Deepfake1M++ dataset, our multimodal system achieves an AUC of 92.78% on the deepfake classification task, and an IoU of 0.3536 for temporal localization using only the audio modality.

* 7 pages, accepted to the 33rd ACM International Conference on Multimedia (MM'25)
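
The abstract above reports an IoU score for temporal localization. As a rough illustration of what such a score measures, here is a minimal sketch of intersection-over-union for two time intervals. The function name `temporal_iou` and the `(start, end)` tuple format are assumptions made for this example; the challenge's official metric may aggregate over segments or thresholds differently.

```python
def temporal_iou(pred, truth):
    """IoU of two time intervals, each given as (start, end) in seconds.

    Hypothetical helper for illustration only; not the challenge's scorer.
    """
    # Overlap is the gap between the later start and the earlier end,
    # clamped at zero when the intervals do not touch.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    # Union is the sum of both lengths minus the shared part.
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


# A predicted fake segment of 0-2 s against a true segment of 1-3 s
# shares 1 s out of a 3 s union.
score = temporal_iou((0.0, 2.0), (1.0, 3.0))
```

A score of 1.0 means the predicted and true segments coincide exactly, and 0.0 means they do not overlap at all.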

Synthetic Voice Detection and Audio Splicing Detection using SE-Res2Net-Conformer Architecture

Oct 07, 2022

Lei Wang, Benedict Yeoh, Jun Wah Ng

Abstract: Synthetic voice and spliced audio clips have been generated to spoof Internet users and artificial intelligence (AI) technologies such as voice authentication. Existing research work treats spoofing countermeasures as a binary classification problem: bonafide vs. spoof. This paper extends the existing Res2Net by incorporating the recent Conformer block to further exploit the local patterns in acoustic features. Experimental results on the ASVspoof 2019 database show that the proposed SE-Res2Net-Conformer architecture improves spoofing countermeasure performance for the logical access scenario. In addition, this paper proposes to reformulate the existing audio splicing detection problem. Instead of identifying the complete spliced segments, it is more useful to detect the boundaries of the spliced segments. Moreover, a deep learning approach can be used to solve the problem, in contrast to previous signal processing techniques.

* Accepted by the 13th International Symposium on Chinese Spoken Language Processing (ISCSLP 2022)
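
The abstract above recasts splicing detection as finding segment boundaries rather than whole segments. The sketch below only illustrates that change of target, assuming per-frame labels (0 for original, 1 for spliced) are available. The function name `segment_boundaries` and the label encoding are hypothetical and do not come from the paper.

```python
def segment_boundaries(frame_labels):
    """Return the frame indices at which the label changes.

    Hypothetical helper: each returned index i marks a boundary between
    frame i - 1 and frame i. This is not the paper's detection model.
    """
    return [
        i
        for i in range(1, len(frame_labels))
        if frame_labels[i] != frame_labels[i - 1]
    ]


# Frames 2-4 are spliced, so boundaries fall at the entry and exit points.
boundaries = segment_boundaries([0, 0, 1, 1, 1, 0])
```

Under this framing, a detector is scored on locating the few transition points instead of labeling every frame inside a spliced region.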
