Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Siran Peng

MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution

May 04, 2025

Siran Peng, Zipei Wang, Li Gao, Xiangyu Zhu, Tianshuo Zhang, Ajian Liu, Haoyuan Zhang, Zhen Lei

Abstract:Reliable face forgery detection algorithms are crucial for countering the growing threat of deepfake-driven disinformation. Previous research has demonstrated the potential of Multimodal Large Language Models (MLLMs) in identifying manipulated faces. However, existing methods typically depend on either the Large Language Model (LLM) alone or an external detector to generate classification results, which often leads to sub-optimal integration of visual and textual modalities. In this paper, we propose VLF-FFD, a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. Our key contributions are twofold. First, we present EFF++, a frame-level, explainability-driven extension of the widely used FaceForensics++ (FF++) dataset. In EFF++, each manipulated video frame is paired with a textual annotation that describes both the forgery artifacts and the specific manipulation technique applied, enabling more effective and informative MLLM training. Second, we design a Vision-Language Fusion Network (VLF-Net) that promotes bidirectional interaction between visual and textual features, supported by a three-stage training pipeline to fully leverage its potential. VLF-FFD achieves state-of-the-art (SOTA) performance in both cross-dataset and intra-dataset evaluations, underscoring its exceptional effectiveness in face forgery detection.

Via

Access Paper or Ask Questions

A General Adaptive Dual-level Weighting Mechanism for Remote Sensing Pansharpening

Mar 17, 2025

Jie Huang, Haorui Chen, Jiaxuan Ren, Siran Peng, Liangjian Deng

Abstract:Currently, deep learning-based methods for remote sensing pansharpening have advanced rapidly. However, many existing methods struggle to fully leverage feature heterogeneity and redundancy, thereby limiting their effectiveness. We use the covariance matrix to model the feature heterogeneity and redundancy and propose Correlation-Aware Covariance Weighting (CACW) to adjust them. CACW captures these correlations through the covariance matrix, which is then processed by a nonlinear function to generate weights for adjustment. Building upon CACW, we introduce a general adaptive dual-level weighting mechanism (ADWM) to address these challenges from two key perspectives, enhancing a wide range of existing deep-learning methods. First, Intra-Feature Weighting (IFW) evaluates correlations among channels within each feature to reduce redundancy and enhance unique information. Second, Cross-Feature Weighting (CFW) adjusts contributions across layers based on inter-layer correlations, refining the final output. Extensive experiments demonstrate the superior performance of ADWM compared to recent state-of-the-art (SOTA) methods. Furthermore, we validate the effectiveness of our approach through generality experiments, redundancy visualization, comparison experiments, key variables and complexity analysis, and ablation studies. Our code is available at https://github.com/Jie-1203/ADWM.

* This paper is accepted at the CVPR Conference on Computer Vision and Pattern Recognition 2025

Via

Access Paper or Ask Questions

WMamba: Wavelet-based Mamba for Face Forgery Detection

Jan 16, 2025

Siran Peng, Tianshuo Zhang, Li Gao, Xiangyu Zhu, Haoyuan Zhang, Kai Pang, Zhen Lei

Figure 1 for WMamba: Wavelet-based Mamba for Face Forgery Detection

Figure 2 for WMamba: Wavelet-based Mamba for Face Forgery Detection

Figure 3 for WMamba: Wavelet-based Mamba for Face Forgery Detection

Figure 4 for WMamba: Wavelet-based Mamba for Face Forgery Detection

Abstract:With the rapid advancement of deepfake generation technologies, the demand for robust and accurate face forgery detection algorithms has become increasingly critical. Recent studies have demonstrated that wavelet analysis can uncover subtle forgery artifacts that remain imperceptible in the spatial domain. Wavelets effectively capture important facial contours, which are often slender, fine-grained, and global in nature. However, existing wavelet-based approaches fail to fully leverage these unique characteristics, resulting in sub-optimal feature extraction and limited generalizability. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear computational complexity. This efficiency allows for the extraction of fine-grained, global forgery artifacts from small image patches. Extensive experimental results show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness and superiority in face forgery detection.

Via

Access Paper or Ask Questions

FusionMamba: Efficient Image Fusion with State Space Model

Apr 11, 2024

Siran Peng, Xiangyu Zhu, Haoyu Deng, Zhen Lei, Liang-Jian Deng

Figure 1 for FusionMamba: Efficient Image Fusion with State Space Model

Figure 2 for FusionMamba: Efficient Image Fusion with State Space Model

Figure 3 for FusionMamba: Efficient Image Fusion with State Space Model

Figure 4 for FusionMamba: Efficient Image Fusion with State Space Model

Abstract:Image fusion aims to generate a high-resolution multi/hyper-spectral image by combining a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Current deep learning (DL)-based methods for image fusion primarily rely on CNNs or Transformers to extract features and merge different types of data. While CNNs are efficient, their receptive fields are limited, restricting their capacity to capture global context. Conversely, Transformers excel at learning global information but are hindered by their quadratic complexity. Fortunately, recent advancements in the State Space Model (SSM), particularly Mamba, offer a promising solution to this issue by enabling global awareness with linear complexity. However, there have been few attempts to explore the potential of SSM in information fusion, which is a crucial ability in domains like image fusion. Therefore, we propose FusionMamba, an innovative method for efficient image fusion. Our contributions mainly focus on two aspects. Firstly, recognizing that images from different sources possess distinct properties, we incorporate Mamba blocks into two U-shaped networks, presenting a novel architecture that extracts spatial and spectral features in an efficient, independent, and hierarchical manner. Secondly, to effectively combine spatial and spectral information, we extend the Mamba block to accommodate dual inputs. This expansion leads to the creation of a new module called the FusionMamba block, which outperforms existing fusion techniques such as concatenation and cross-attention. To validate FusionMamba's effectiveness, we conduct a series of experiments on five datasets related to three image fusion tasks. The quantitative and qualitative evaluation results demonstrate that our method achieves state-of-the-art (SOTA) performance, underscoring the superiority of FusionMamba.

Via

Access Paper or Ask Questions

Source-Aware Spatial-Spectral-Integrated Double U-Net for Image Fusion

Dec 13, 2022

Siran Peng, Chenhao Guo, Xiao Wu

Abstract:In image fusion tasks, pictures from different sources possess distinctive properties, therefore treating them equally will lead to inadequate feature extracting. Besides, multi-scaled networks capture information more sufficiently than single-scaled models in pixel-wised problems. In light of these factors, we propose a source-aware spatial-spectral-integrated double U-shaped network called $\rm{(SU)^2}$Net. The network is mainly composed of a spatial U-net and a spectral U-net, which learn spatial details and spectral characteristics discriminately and hierarchically. In contrast with most previous works that simply apply concatenation to integrate spatial and spectral information, a novel structure named the spatial-spectral block (called $\rm{S^2}$Block) is specially designed to merge feature maps from different sources effectively. Experiment results show that our method outperforms the representative state-of-the-art (SOTA) approaches in both quantitative and qualitative evaluations for a variety of image fusion missions, including remote sensing pansharpening and hyperspectral image super-resolution (HISR).

Via

Access Paper or Ask Questions