Surya Koppisetti

What Does an Audio Deepfake Detector Focus on? A Study in the Time Domain

Jan 27, 2025

Petr Grinberg, Ankur Kumar, Surya Koppisetti, Gaurav Bharaj

Abstract: Adding explanations to audio deepfake detection (ADD) models will boost their real-world application by providing insight into the decision-making process. In this paper, we propose a relevancy-based explainable AI (XAI) method to analyze the predictions of transformer-based ADD models. We compare against standard Grad-CAM and SHAP-based methods, using quantitative faithfulness metrics as well as a partial spoof test, to comprehensively analyze the relative importance of different temporal regions in an audio signal. We consider large datasets, unlike previous works where only limited utterances are studied, and find that the XAI methods differ in their explanations. The proposed relevancy-based XAI method performs the best overall on a variety of metrics. Further investigation of the relative importance of speech/non-speech, phonetic content, and voice onsets/offsets suggests that the XAI results obtained from analyzing limited utterances don't necessarily hold when evaluated on large datasets.

* Accepted to ICASSP 2025
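
The abstract does not say which faithfulness metrics are used. As a rough illustration only, one common family of faithfulness checks is perturbation-based: mask the time samples an explanation ranks as most relevant and measure how much the detector's score drops. The sketch below assumes a hypothetical `model` callable that maps a waveform to a scalar score and a hypothetical per-sample `relevance` array; neither name comes from the paper.

```python
import numpy as np

def deletion_drop(model, audio, relevance, frac=0.2):
    """Zero out the top `frac` most relevant samples and return the score drop.

    `model` and `relevance` are hypothetical stand-ins, not the paper's API.
    A larger drop suggests the explanation points at samples the model relies on.
    """
    audio = np.asarray(audio, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    k = max(1, int(round(frac * audio.size)))
    top = np.argsort(relevance)[-k:]  # indices of the k highest relevance values
    masked = audio.copy()
    masked[top] = 0.0                 # delete the supposedly important samples
    return model(audio) - model(masked)
```

With a toy model that only looks at the first half of the signal, an explanation that ranks those samples highest should produce a larger drop than one that ranks the ignored second half highest.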

Learn from Real: Reality Defender's Submission to ASVspoof5 Challenge

Oct 09, 2024

Yi Zhu, Chirag Goel, Surya Koppisetti, Trang Tran, Ankur Kumar, Gaurav Bharaj

Abstract: Audio deepfake detection is crucial to combat the malicious use of AI-synthesized speech. Among many efforts undertaken by the community, the ASVspoof challenge has become one of the benchmarks to evaluate the generalizability and robustness of detection models. In this paper, we present Reality Defender's submission to the ASVspoof5 challenge, highlighting a novel pretraining strategy which significantly improves generalizability while maintaining low computational cost during training. Our system SLIM learns the style-linguistics dependency embeddings from various types of bonafide speech using self-supervised contrastive learning. The learned embeddings help to discriminate spoof from bonafide speech by focusing on the relationship between the style and linguistics aspects. We evaluated our system on ASVspoof5, ASV2019, and In-the-wild. Our submission achieved a minDCF of 0.1499 and an EER of 5.5% on ASVspoof5 Track 1, and EERs of 7.4% and 10.8% on ASV2019 and In-the-wild, respectively.

* Accepted into ASVspoof5 workshop
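
For readers unfamiliar with the EER figures quoted above: the equal error rate is the operating point where the false acceptance rate (spoof accepted as bonafide) equals the false rejection rate (bonafide rejected as spoof). The following is a minimal sketch of that definition, not the official ASVspoof scoring tool. The score convention (higher means more likely bonafide) and the label encoding are assumptions made for the example.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER from detection scores and binary labels.

    Assumed convention: higher score = more likely bonafide,
    label 1 = bonafide, label 0 = spoof.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)  # sorted unique candidate thresholds
    # False acceptance: spoof scored at or above the threshold.
    far = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])
    # False rejection: bonafide scored below the threshold.
    frr = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # threshold where the two rates are closest
    return (far[idx] + frr[idx]) / 2.0
```

On a finite score set the two rates rarely cross exactly, so the sketch averages them at the closest threshold; evaluation toolkits typically interpolate instead.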

SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

Jul 26, 2024

Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj

Abstract: Audio deepfake detection (ADD) is crucial to combat the misuse of speech synthesized from generative AI models. Existing ADD models suffer from generalization issues, with a large performance discrepancy between in-domain and out-of-domain data. Moreover, the black-box nature of existing models limits their use in real-world scenarios, where explanations are required for model decisions. To alleviate these issues, we introduce a new ADD model that explicitly uses the Style-LInguistics Mismatch (SLIM) in fake speech to separate it from real speech. SLIM first employs self-supervised pretraining on only real samples to learn the style-linguistics dependency in the real class. The learned features are then used in complement with standard pretrained acoustic features (e.g., Wav2vec) to learn a classifier on the real and fake classes. When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. The features learned by SLIM allow us to quantify the (mis)match between style and linguistic content in a sample, hence facilitating an explanation of the model decision.
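
The abstract does not state how the (mis)match is quantified. Purely as an illustration of the idea, the sketch below scores a sample by the cosine distance between a style embedding and a linguistic embedding, assuming both have already been projected into a shared space. The function name and inputs are hypothetical and are not taken from the SLIM implementation.

```python
import numpy as np

def mismatch_score(style_emb, ling_emb):
    """Cosine distance in [0, 2]; larger means the two embeddings disagree more.

    Hypothetical illustration: assumes both embeddings live in the same space.
    """
    s = np.asarray(style_emb, dtype=float)
    l = np.asarray(ling_emb, dtype=float)
    cos = np.dot(s, l) / (np.linalg.norm(s) * np.linalg.norm(l))
    return 1.0 - cos
```

Under this assumed scoring, identical embeddings give 0 and orthogonal embeddings give 1, so a threshold on the score could flag samples whose style and linguistic content do not fit together.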

Towards Attention-based Contrastive Learning for Audio Spoof Detection

Jul 03, 2024

Chirag Goel, Surya Koppisetti, Ben Colman, Ali Shahriyari, Gaurav Bharaj

Abstract: Vision transformers (ViT) have made substantial progress for classification tasks in computer vision. Recently, Gong et al. '21 introduced attention-based modeling for several audio tasks. However, the use of a ViT for the audio spoof detection task remains relatively unexplored. We bridge this gap and introduce ViTs for this task. A vanilla baseline built on fine-tuning the SSAST (Gong et al. '22) audio ViT model achieves sub-optimal equal error rates (EERs). To improve performance, we propose a novel attention-based contrastive learning framework (SSAST-CL) that uses cross-attention to aid the representation learning. Experiments show that our framework successfully disentangles the bonafide and spoof classes and helps learn better classifiers for the task. With an appropriate data augmentation policy, a model trained on our framework achieves competitive performance on the ASVspoof 2021 challenge. We provide comparisons and ablation studies to justify our claim.

* Proc. INTERSPEECH 2023

AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

Jun 05, 2024

Trevine Oorloff, Surya Koppisetti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj

Abstract: With the rapid growth in deepfake video content, we require improved and generalizable methods to detect it. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely, the latter predominantly focuses on discerning audio-visual cues within the training corpus, thereby potentially overlooking correspondences that can help detect unseen deepfakes. We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual complementary masking and feature fusion strategy. The learned representations are tuned in the second stage, where deepfake classification is pursued via supervised learning on both real and fake videos. Extensive experiments and analysis suggest that our novel representation learning paradigm is highly discriminative in nature. We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively.

* Accepted to CVPR 2024
