Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Muhammad Umar Farooq

Korea National University of Transportation

A Lightweight and Interpretable Deepfakes Detection Framework

Jan 21, 2025

Muhammad Umar Farooq, Ali Javed, Khalid Mahmood Malik, Muhammad Anas Raza

Abstract:The recent realistic creation and dissemination of so-called deepfakes poses a serious threat to social life, civil rest, and law. Celebrity defaming, election manipulation, and deepfakes as evidence in court of law are few potential consequences of deepfakes. The availability of open source trained models based on modern frameworks such as PyTorch or TensorFlow, video manipulations Apps such as FaceApp and REFACE, and economical computing infrastructure has easen the creation of deepfakes. Most of the existing detectors focus on detecting either face-swap, lip-sync, or puppet master deepfakes, but a unified framework to detect all three types of deepfakes is hardly explored. This paper presents a unified framework that exploits the power of proposed feature fusion of hybrid facial landmarks and our novel heart rate features for detection of all types of deepfakes. We propose novel heart rate features and fused them with the facial landmark features to better extract the facial artifacts of fake videos and natural variations available in the original videos. We used these features to train a light-weight XGBoost to classify between the deepfake and bonafide videos. We evaluated the performance of our framework on the world leaders dataset (WLDR) that contains all types of deepfakes. Experimental results illustrate that the proposed framework offers superior detection performance over the comparative deepfakes detection methods. Performance comparison of our framework against the LSTM-FCN, a candidate of deep learning model, shows that proposed model achieves similar results, however, it is more interpretable.

* International Conference of Advanced Engineering, Technology and Applications, 2021

Via

Access Paper or Ask Questions

Transferable Adversarial Attacks on Audio Deepfake Detection

Jan 21, 2025

Muhammad Umar Farooq, Awais Khan, Kutub Uddin, Khalid Mahmood Malik

Figure 1 for Transferable Adversarial Attacks on Audio Deepfake Detection

Figure 2 for Transferable Adversarial Attacks on Audio Deepfake Detection

Figure 3 for Transferable Adversarial Attacks on Audio Deepfake Detection

Figure 4 for Transferable Adversarial Attacks on Audio Deepfake Detection

Abstract:Audio deepfakes pose significant threats, including impersonation, fraud, and reputation damage. To address these risks, audio deepfake detection (ADD) techniques have been developed, demonstrating success on benchmarks like ASVspoof2019. However, their resilience against transferable adversarial attacks remains largely unexplored. In this paper, we introduce a transferable GAN-based adversarial attack framework to evaluate the effectiveness of state-of-the-art (SOTA) ADD systems. By leveraging an ensemble of surrogate ADD models and a discriminator, the proposed approach generates transferable adversarial attacks that better reflect real-world scenarios. Unlike previous methods, the proposed framework incorporates a self-supervised audio model to ensure transcription and perceptual integrity, resulting in high-quality adversarial attacks. Experimental results on benchmark dataset reveal that SOTA ADD systems exhibit significant vulnerabilities, with accuracies dropping from 98% to 26%, 92% to 54%, and 94% to 84% in white-box, gray-box, and black-box scenarios, respectively. When tested in other data sets, performance drops of 91% to 46%, and 94% to 67% were observed against the In-the-Wild and WaveFake data sets, respectively. These results highlight the significant vulnerabilities of existing ADD systems and emphasize the need to enhance their robustness against advanced adversarial threats to ensure security and reliability.

* WACV 2025

Via

Access Paper or Ask Questions

Securing Social Media Against Deepfakes using Identity, Behavioral, and Geometric Signatures

Dec 07, 2024

Muhammad Umar Farooq, Awais Khan, Ijaz Ul Haq, Khalid Mahmood Malik

Figure 1 for Securing Social Media Against Deepfakes using Identity, Behavioral, and Geometric Signatures

Figure 2 for Securing Social Media Against Deepfakes using Identity, Behavioral, and Geometric Signatures

Figure 3 for Securing Social Media Against Deepfakes using Identity, Behavioral, and Geometric Signatures

Figure 4 for Securing Social Media Against Deepfakes using Identity, Behavioral, and Geometric Signatures

Abstract:Trust in social media is a growing concern due to its ability to influence significant societal changes. However, this space is increasingly compromised by various types of deepfake multimedia, which undermine the authenticity of shared content. Although substantial efforts have been made to address the challenge of deepfake content, existing detection techniques face a major limitation in generalization: they tend to perform well only on specific types of deepfakes they were trained on.This dependency on recognizing specific deepfake artifacts makes current methods vulnerable when applied to unseen or varied deepfakes, thereby compromising their performance in real-world applications such as social media platforms. To address the generalizability of deepfake detection, there is a need for a holistic approach that can capture a broader range of facial attributes and manipulations beyond isolated artifacts. To address this, we propose a novel deepfake detection framework featuring an effective feature descriptor that integrates Deep identity, Behavioral, and Geometric (DBaG) signatures, along with a classifier named DBaGNet. Specifically, the DBaGNet classifier utilizes the extracted DBaG signatures, leveraging a triplet loss objective to enhance generalized representation learning for improved classification. Specifically, the DBaGNet classifier utilizes the extracted DBaG signatures and applies a triplet loss objective to enhance generalized representation learning for improved classification. To test the effectiveness and generalizability of our proposed approach, we conduct extensive experiments using six benchmark deepfake datasets: WLDR, CelebDF, DFDC, FaceForensics++, DFD, and NVFAIR. Specifically, to ensure the effectiveness of our approach, we perform cross-dataset evaluations, and the results demonstrate significant performance gains over several state-of-the-art methods.

Via

Access Paper or Ask Questions

Progressive unsupervised domain adaptation for ASR using ensemble models and multi-stage training

Feb 07, 2024

Rehan Ahmad, Muhammad Umar Farooq, Thomas Hain

Abstract:In Automatic Speech Recognition (ASR), teacher-student (T/S) training has shown to perform well for domain adaptation with small amount of training data. However, adaption without ground-truth labels is still challenging. A previous study has shown the effectiveness of using ensemble teacher models in T/S training for unsupervised domain adaptation (UDA) but its performance still lags behind compared to the model trained on in-domain data. This paper proposes a method to yield better UDA by training multi-stage students with ensemble teacher models. Initially, multiple teacher models are trained on labelled data from read and meeting domains. These teachers are used to train a student model on unlabelled out-of-domain telephone speech data. To improve the adaptation, subsequent student models are trained sequentially considering previously trained model as their teacher. Experiments are conducted with three teachers trained on AMI, WSJ and LibriSpeech and three stages of students on SwitchBoard data. Results shown on eval00 test set show significant WER improvement with multi-stage training with an absolute gain of 9.8%, 7.7% and 3.3% at each stage.

Via

Access Paper or Ask Questions

MUST: A Multilingual Student-Teacher Learning approach for low-resource speech recognition

Oct 29, 2023

Muhammad Umar Farooq, Rehan Ahmad, Thomas Hain

Abstract:Student-teacher learning or knowledge distillation (KD) has been previously used to address data scarcity issue for training of speech recognition (ASR) systems. However, a limitation of KD training is that the student model classes must be a proper or improper subset of the teacher model classes. It prevents distillation from even acoustically similar languages if the character sets are not same. In this work, the aforementioned limitation is addressed by proposing a MUltilingual Student-Teacher (MUST) learning which exploits a posteriors mapping approach. A pre-trained mapping model is used to map posteriors from a teacher language to the student language ASR. These mapped posteriors are used as soft labels for KD learning. Various teacher ensemble schemes are experimented to train an ASR model for low-resource languages. A model trained with MUST learning reduces relative character error rate (CER) up to 9.5% in comparison with a baseline monolingual ASR.

* Accepted for IEEE ASRU 2023

Via

Access Paper or Ask Questions

Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition

Jun 14, 2023

Muhammad Umar Farooq, Thomas Hain

Abstract:Exploiting cross-lingual resources is an effective way to compensate for data scarcity of low resource languages. Recently, a novel multilingual model fusion technique has been proposed where a model is trained to learn cross-lingual acoustic-phonetic similarities as a mapping function. However, handcrafted lexicons have been used to train hybrid DNN-HMM ASR systems. To remove this dependency, we extend the concept of learnable cross-lingual mappings for end-to-end speech recognition. Furthermore, mapping models are employed to transliterate the source languages to the target language without using parallel data. Finally, the source audio and its transliteration is used for data augmentation to retrain the target language ASR. The results show that any source language ASR model can be used for a low-resource target language recognition followed by proposed mapping model. Furthermore, data augmentation results in a relative gain up to 5% over baseline monolingual model.

* Accepted for Interspeech 2023

Via

Access Paper or Ask Questions

Ensemble CNNs for Breast Tumor Classification

Apr 11, 2023

Muhammad Umar Farooq, Zahid Ullah, Jeonghwan Gwak

Abstract:To improve the recognition ability of computer-aided breast mass classification among mammographic images, in this work we explore the state-of-the-art classification networks to develop an ensemble mechanism. First, the regions of interest (ROIs) are obtained from the original dataset, and then three models, i.e., XceptionNet, DenseNet, and EfficientNet, are trained individually. After training, we ensemble the mechanism by summing the probabilities outputted from each network which enhances the performance up to 5%. The scheme has been validated on a public dataset and we achieved accuracy, precision, and recall 88%, 85%, and 76% respectively.

* SMA 2021: The 10th International Conference on Smart Media and Applications Gunsan Saemangeum Convention Center and Kunsan National University Gunsan-si, South Korea, September 9-11, 2021

Via

Access Paper or Ask Questions

Towards domain generalisation in ASR with elitist sampling and ensemble knowledge distillation

Mar 01, 2023

Rehan Ahmad, Md Asif Jalal, Muhammad Umar Farooq, Anna Ollerenshaw, Thomas Hain

Figure 1 for Towards domain generalisation in ASR with elitist sampling and ensemble knowledge distillation

Figure 2 for Towards domain generalisation in ASR with elitist sampling and ensemble knowledge distillation

Figure 3 for Towards domain generalisation in ASR with elitist sampling and ensemble knowledge distillation

Figure 4 for Towards domain generalisation in ASR with elitist sampling and ensemble knowledge distillation

Abstract:Knowledge distillation has widely been used for model compression and domain adaptation for speech applications. In the presence of multiple teachers, knowledge can easily be transferred to the student by averaging the models output. However, previous research shows that the student do not adapt well with such combination. This paper propose to use an elitist sampling strategy at the output of ensemble teacher models to select the best-decoded utterance generated by completely out-of-domain teacher models for generalizing unseen domain. The teacher models are trained on AMI, LibriSpeech and WSJ while the student is adapted for the Switchboard data. The results show that with the selection strategy based on the individual models posteriors the student model achieves a better WER compared to all the teachers and baselines with a minimum absolute improvement of about 8.4 percent. Furthermore, an insights on the model adaptation with out-of-domain data has also been studied via correlation analysis.

Via

Access Paper or Ask Questions

Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion

Jul 07, 2022

Muhammad Umar Farooq, Darshan Adiga Haniya Narayana, Thomas Hain

Figure 1 for Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion

Figure 2 for Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion

Figure 3 for Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion

Figure 4 for Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion

Abstract:Multilingual speech recognition has drawn significant attention as an effective way to compensate data scarcity for low-resource languages. End-to-end (e2e) modelling is preferred over conventional hybrid systems, mainly because of no lexicon requirement. However, hybrid DNN-HMMs still outperform e2e models in limited data scenarios. Furthermore, the problem of manual lexicon creation has been alleviated by publicly available trained models of grapheme-to-phoneme (G2P) and text to IPA transliteration for a lot of languages. In this paper, a novel approach of hybrid DNN-HMM acoustic models fusion is proposed in a multilingual setup for the low-resource languages. Posterior distributions from different monolingual acoustic models, against a target language speech signal, are fused together. A separate regression neural network is trained for each source-target language pair to transform posteriors from source acoustic model to the target language. These networks require very limited data as compared to the ASR training. Posterior fusion yields a relative gain of 14.65% and 6.5% when compared with multilingual and monolingual baselines respectively. Cross-lingual model fusion shows that the comparable results can be achieved without using posteriors from the language dependent ASR.

* Accepted for Interspeech 2022

Via

Access Paper or Ask Questions

Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition

Jul 07, 2022

Muhammad Umar Farooq, Thomas Hain

Figure 1 for Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition

Figure 2 for Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition

Figure 3 for Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition

Figure 4 for Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition

Abstract:Multilingual automatic speech recognition (ASR) systems mostly benefit low resource languages but suffer degradation in performance across several languages relative to their monolingual counterparts. Limited studies have focused on understanding the languages behaviour in the multilingual speech recognition setups. In this paper, a novel data-driven approach is proposed to investigate the cross-lingual acoustic-phonetic similarities. This technique measures the similarities between posterior distributions from various monolingual acoustic models against a target speech signal. Deep neural networks are trained as mapping networks to transform the distributions from different acoustic models into a directly comparable form. The analysis observes that the languages closeness can not be truly estimated by the volume of overlapping phonemes set. Entropy analysis of the proposed mapping networks exhibits that a language with lesser overlap can be more amenable to cross-lingual transfer, and hence more beneficial in the multilingual setup. Finally, the proposed posterior transformation approach is leveraged to fuse monolingual models for a target language. A relative improvement of ~8% over monolingual counterpart is achieved.

* Accepted for Interspeech 2022

Via

Access Paper or Ask Questions