Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuhao Luo

CAD: A General Multimodal Framework for Video Deepfake Detection via Cross-Modal Alignment and Distillation

May 21, 2025

Yuxuan Du, Zhendong Wang, Yuhao Luo, Caiyong Piao, Zhiyuan Yan, Hao Li, Li Yuan

Abstract:The rapid emergence of multimodal deepfakes (visual and auditory content are manipulated in concert) undermines the reliability of existing detectors that rely solely on modality-specific artifacts or cross-modal inconsistencies. In this work, we first demonstrate that modality-specific forensic traces (e.g., face-swap artifacts or spectral distortions) and modality-shared semantic misalignments (e.g., lip-speech asynchrony) offer complementary evidence, and that neglecting either aspect limits detection performance. Existing approaches either naively fuse modality-specific features without reconciling their conflicting characteristics or focus predominantly on semantic misalignment at the expense of modality-specific fine-grained artifact cues. To address these shortcomings, we propose a general multimodal framework for video deepfake detection via Cross-Modal Alignment and Distillation (CAD). CAD comprises two core components: 1) Cross-modal alignment that identifies inconsistencies in high-level semantic synchronization (e.g., lip-speech mismatches); 2) Cross-modal distillation that mitigates feature conflicts during fusion while preserving modality-specific forensic traces (e.g., spectral distortions in synthetic audio). Extensive experiments on both multimodal and unimodal (e.g., image-only/video-only)deepfake benchmarks demonstrate that CAD significantly outperforms previous methods, validating the necessity of harmonious integration of multimodal complementary information.

Via

Access Paper or Ask Questions

Sky-Drive: A Distributed Multi-Agent Simulation Platform for Socially-Aware and Human-AI Collaborative Future Transportation

Apr 25, 2025

Zilin Huang, Zihao Sheng, Zhengyang Wan, Yansong Qu, Yuhao Luo, Boyue Wang, Pei Li, Yen-Jung Chen, Jiancong Chen, Keke Long(+3 more)

Abstract:Recent advances in autonomous system simulation platforms have significantly enhanced the safe and scalable testing of driving policies. However, existing simulators do not yet fully meet the needs of future transportation research, particularly in modeling socially-aware driving agents and enabling effective human-AI collaboration. This paper introduces Sky-Drive, a novel distributed multi-agent simulation platform that addresses these limitations through four key innovations: (a) a distributed architecture for synchronized simulation across multiple terminals; (b) a multi-modal human-in-the-loop framework integrating diverse sensors to collect rich behavioral data; (c) a human-AI collaboration mechanism supporting continuous and adaptive knowledge exchange; and (d) a digital twin (DT) framework for constructing high-fidelity virtual replicas of real-world transportation environments. Sky-Drive supports diverse applications such as autonomous vehicle (AV)-vulnerable road user (VRU) interaction modeling, human-in-the-loop training, socially-aware reinforcement learning, personalized driving policy, and customized scenario generation. Future extensions will incorporate foundation models for context-aware decision support and hardware-in-the-loop (HIL) testing for real-world validation. By bridging scenario generation, data collection, algorithm training, and hardware integration, Sky-Drive has the potential to become a foundational platform for the next generation of socially-aware and human-centered autonomous transportation research. The demo video and code are available at:https://sky-lab-uw.github.io/Sky-Drive-website/

* 15 pages, 7 figures

Via

Access Paper or Ask Questions

Can We Leave Deepfake Data Behind in Training Deepfake Detector?

Aug 30, 2024

Jikang Cheng, Zhiyuan Yan, Ying Zhang, Yuhao Luo, Zhongyuan Wang, Chen Li

Figure 1 for Can We Leave Deepfake Data Behind in Training Deepfake Detector?

Figure 2 for Can We Leave Deepfake Data Behind in Training Deepfake Detector?

Figure 3 for Can We Leave Deepfake Data Behind in Training Deepfake Detector?

Figure 4 for Can We Leave Deepfake Data Behind in Training Deepfake Detector?

Abstract:The generalization ability of deepfake detectors is vital for their applications in real-world scenarios. One effective solution to enhance this ability is to train the models with manually-blended data, which we termed "blendfake", encouraging models to learn generic forgery artifacts like blending boundary. Interestingly, current SoTA methods utilize blendfake without incorporating any deepfake data in their training process. This is likely because previous empirical observations suggest that vanilla hybrid training (VHT), which combines deepfake and blendfake data, results in inferior performance to methods using only blendfake data (so-called "1+1<2"). Therefore, a critical question arises: Can we leave deepfake behind and rely solely on blendfake data to train an effective deepfake detector? Intuitively, as deepfakes also contain additional informative forgery clues (e.g., deep generative artifacts), excluding all deepfake data in training deepfake detectors seems counter-intuitive. In this paper, we rethink the role of blendfake in detecting deepfakes and formulate the process from "real to blendfake to deepfake" to be a progressive transition. Specifically, blendfake and deepfake can be explicitly delineated as the oriented pivot anchors between "real-to-fake" transitions. The accumulation of forgery information should be oriented and progressively increasing during this transition process. To this end, we propose an Oriented Progressive Regularizor (OPR) to establish the constraints that compel the distribution of anchors to be discretely arranged. Furthermore, we introduce feature bridging to facilitate the smooth transition between adjacent anchors. Extensive experiments confirm that our design allows leveraging forgery information from both blendfake and deepfake effectively and comprehensively.

Via

Access Paper or Ask Questions

GRANP: A Graph Recurrent Attentive Neural Process Model for Vehicle Trajectory Prediction

Apr 09, 2024

Yuhao Luo, Kehua Chen, Meixin Zhu

Abstract:As a vital component in autonomous driving, accurate trajectory prediction effectively prevents traffic accidents and improves driving efficiency. To capture complex spatial-temporal dynamics and social interactions, recent studies developed models based on advanced deep-learning methods. On the other hand, recent studies have explored the use of deep generative models to further account for trajectory uncertainties. However, the current approaches demonstrating indeterminacy involve inefficient and time-consuming practices such as sampling from trained models. To fill this gap, we proposed a novel model named Graph Recurrent Attentive Neural Process (GRANP) for vehicle trajectory prediction while efficiently quantifying prediction uncertainty. In particular, GRANP contains an encoder with deterministic and latent paths, and a decoder for prediction. The encoder, including stacked Graph Attention Networks, LSTM and 1D convolutional layers, is employed to extract spatial-temporal relationships. The decoder is used to learn a latent distribution and thus quantify prediction uncertainty. To reveal the effectiveness of our model, we evaluate the performance of GRANP on the highD dataset. Extensive experiments show that GRANP achieves state-of-the-art results and can efficiently quantify uncertainties. Additionally, we undertake an intuitive case study that showcases the interpretability of the proposed approach. The code is available at https://github.com/joy-driven/GRANP.

Via

Access Paper or Ask Questions

Domain-Aware Cross-Attention for Cross-domain Recommendation

Jan 22, 2024

Yuhao Luo, Shiwei Ma, Mingjun Nie, Changping Peng, Zhangang Lin, Jingping Shao, Qianfang Xu

Abstract:Cross-domain recommendation (CDR) is an important method to improve recommender system performance, especially when observations in target domains are sparse. However, most existing cross-domain recommendations fail to fully utilize the target domain's special features and are hard to be generalized to new domains. The designed network is complex and is not suitable for rapid industrial deployment. Our method introduces a two-step domain-aware cross-attention, extracting transferable features of the source domain from different granularity, which allows the efficient expression of both domain and user interests. In addition, we simplify the training process, and our model can be easily deployed on new domains. We conduct experiments on both public datasets and industrial datasets, and the experimental results demonstrate the effectiveness of our method. We have also deployed the model in an online advertising system and observed significant improvements in both Click-Through-Rate (CTR) and effective cost per mille (ECPM).

* 6 pages, 1 figure

Via

Access Paper or Ask Questions

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Nov 19, 2023

Zhiyuan Yan, Yuhao Luo, Siwei Lyu, Qingshan Liu, Baoyuan Wu

Figure 1 for Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Figure 2 for Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Figure 3 for Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Figure 4 for Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

Abstract:Deepfake detection faces a critical generalization hurdle, with performance deteriorating when there is a mismatch between the distributions of training and testing data. A broadly received explanation is the tendency of these detectors to be overfitted to forgery-specific artifacts, rather than learning features that are widely applicable across various forgeries. To address this issue, we propose a simple yet effective detector called LSDA (\underline{L}atent \underline{S}pace \underline{D}ata \underline{A}ugmentation), which is based on a heuristic idea: representations with a wider variety of forgeries should be able to learn a more generalizable decision boundary, thereby mitigating the overfitting of method-specific features (see Figure. 1). Following this idea, we propose to enlarge the forgery space by constructing and simulating variations within and across forgery features in the latent space. This approach encompasses the acquisition of enriched, domain-specific features and the facilitation of smoother transitions between different forgery types, effectively bridging domain gaps. Our approach culminates in refining a binary classifier that leverages the distilled knowledge from the enhanced features, striving for a generalizable deepfake detector. Comprehensive experiments show that our proposed method is surprisingly effective and transcends state-of-the-art detectors across several widely used benchmarks.

Via

Access Paper or Ask Questions

Audio compression-assisted feature extraction for voice replay attack detection

Oct 10, 2023

Xiangyu Shi, Yuhao Luo, Li Wang, Haorui He, Hao Li, Lei Wang, Zhizheng Wu

Figure 1 for Audio compression-assisted feature extraction for voice replay attack detection

Figure 2 for Audio compression-assisted feature extraction for voice replay attack detection

Figure 3 for Audio compression-assisted feature extraction for voice replay attack detection

Figure 4 for Audio compression-assisted feature extraction for voice replay attack detection

Abstract:Replay attack is one of the most effective and simplest voice spoofing attacks. Detecting replay attacks is challenging, according to the Automatic Speaker Verification Spoofing and Countermeasures Challenge 2021 (ASVspoof 2021), because they involve a loudspeaker, a microphone, and acoustic conditions (e.g., background noise). One obstacle to detecting replay attacks is finding robust feature representations that reflect the channel noise information added to the replayed speech. This study proposes a feature extraction approach that uses audio compression for assistance. Audio compression compresses audio to preserve content and speaker information for transmission. The missed information after decompression is expected to contain content- and speaker-independent information (e.g., channel noise added during the replay process). We conducted a comprehensive experiment with a few data augmentation techniques and 3 classifiers on the ASVspoof 2021 physical access (PA) set and confirmed the effectiveness of the proposed feature extraction approach. To the best of our knowledge, the proposed approach achieves the lowest EER at 22.71% on the ASVspoof 2021 PA evaluation set.

Via

Access Paper or Ask Questions

AdvSV: An Over-the-Air Adversarial Attack Dataset for Speaker Verification

Oct 09, 2023

Li Wang, Jiaqi Li, Yuhao Luo, Jiahao Zheng, Lei Wang, Hao Li, Ke Xu, Chengfang Fang, Jie Shi, Zhizheng Wu

Abstract:It is known that deep neural networks are vulnerable to adversarial attacks. Although Automatic Speaker Verification (ASV) built on top of deep neural networks exhibits robust performance in controlled scenarios, many studies confirm that ASV is vulnerable to adversarial attacks. The lack of a standard dataset is a bottleneck for further research, especially reproducible research. In this study, we developed an open-source adversarial attack dataset for speaker verification research. As an initial step, we focused on the over-the-air attack. An over-the-air adversarial attack involves a perturbation generation algorithm, a loudspeaker, a microphone, and an acoustic environment. The variations in the recording configurations make it very challenging to reproduce previous research. The AdvSV dataset is constructed using the Voxceleb1 Verification test set as its foundation. This dataset employs representative ASV models subjected to adversarial attacks and records adversarial samples to simulate over-the-air attack settings. The scope of the dataset can be easily extended to include more types of adversarial attacks. The dataset will be released to the public under the CC-BY license. In addition, we also provide a detection baseline for reproducible research.

* Submitted to ICASSP2024

Via

Access Paper or Ask Questions

A framework with updateable joint images re-ranking for Person Re-identification

Mar 08, 2018

Mingyue Yuan, Dong Yin, Jingwen Ding, Yuhao Luo, Zhipeng Zhou, Chengfeng Zhu, Rui Zhang

Figure 1 for A framework with updateable joint images re-ranking for Person Re-identification

Figure 2 for A framework with updateable joint images re-ranking for Person Re-identification

Figure 3 for A framework with updateable joint images re-ranking for Person Re-identification

Figure 4 for A framework with updateable joint images re-ranking for Person Re-identification

Abstract:Person re-identification plays an important role in realistic video surveillance with increasing demand for public safety. In this paper, we propose a novel framework with rules of updating images for person re-identification in real-world surveillance system. First, Image Pool is generated by using mean-shift tracking method to automatically select video frame fragments of the target person. Second, features extracted from Image Pool by convolutional network work together to re-rank original ranking list of the main image and matching results will be generated. In addition, updating rules are designed for replacing images in Image Pool when a new image satiating with our updating critical formula in video system. These rules fall into two categories: if the new image is from the same camera as the previous updated image, it will replace one of assist images; otherwise, it will replace the main image directly. Experiments are conduced on Market-1501, iLIDS-VID and PRID-2011 and our ITSD datasets to validate that our framework outperforms on rank-1 accuracy and mAP for person re-identification. Furthermore, the update ability of our framework provides consistently remarkable accuracy rate in real-world surveillance system.

* 23 pages,5 figures. submitted to JVCI

Via

Access Paper or Ask Questions