Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jianfu Zhang

Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Jun 08, 2025

Yikun Ji, Hong Yan, Jun Lan, Huijia Zhu, Weiqiang Wang, Qi Fan, Liqing Zhang, Jianfu Zhang

Figure 1 for Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Figure 2 for Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Figure 3 for Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Figure 4 for Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Abstract:The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then finetune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws, significantly outperforming baseline methods.

Via

Access Paper or Ask Questions

COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization

Apr 25, 2025

Haozhen Yan, Yan Hong, Jiahui Zhan, Yikun Ji, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang

Figure 1 for COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization

Figure 2 for COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization

Figure 3 for COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization

Figure 4 for COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization

Abstract:Recent advancements in image manipulation have achieved unprecedented progress in generating photorealistic content, but also simultaneously eliminating barriers to arbitrary manipulation and editing, raising concerns about multimedia authenticity and cybersecurity. However, existing Image Manipulation Detection and Localization (IMDL) methodologies predominantly focus on splicing or copy-move forgeries, lacking dedicated benchmarks for inpainting-based manipulations. To bridge this gap, we present COCOInpaint, a comprehensive benchmark specifically designed for inpainting detection, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage with 258,266 inpainted images with rich semantic diversity. Our benchmark is constructed to emphasize intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We establish a rigorous evaluation protocol using three standard metrics to assess existing IMDL approaches. The dataset will be made publicly available to facilitate future research in this area.

* 10 pages, 3 figures

Via

Access Paper or Ask Questions

Towards Explainable Fake Image Detection with Multi-Modal Large Language Models

Apr 19, 2025

Yikun Ji, Yan Hong, Jiahui Zhan, Haoxing Chen, jun lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang

Figure 1 for Towards Explainable Fake Image Detection with Multi-Modal Large Language Models

Figure 2 for Towards Explainable Fake Image Detection with Multi-Modal Large Language Models

Figure 3 for Towards Explainable Fake Image Detection with Multi-Modal Large Language Models

Figure 4 for Towards Explainable Fake Image Detection with Multi-Modal Large Language Models

Abstract:Progress in image generation raises significant public security concerns. We argue that fake image detection should not operate as a "black box". Instead, an ideal approach must ensure both strong generalization and transparency. Recent progress in Multi-modal Large Language Models (MLLMs) offers new opportunities for reasoning-based AI-generated image detection. In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators, highlighting their strengths and limitations. Furthermore, we design six distinct prompts and propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system. The code is available at https://github.com/Gennadiyev/mllm-defake.

Via

Access Paper or Ask Questions

InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation

Apr 15, 2025

Yukang Lin, Yan Hong, Zunnan Xu, Xindi Li, Chao Xu, Chuanbiao Song, Ronghui Li, Haoxing Chen, Jun Lan, Huijia Zhu(+3 more)

Abstract:Recent video generation research has focused heavily on isolated actions, leaving interactive motions-such as hand-face interactions-largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.

* under preview

Via

Access Paper or Ask Questions

High-Quality 3D Head Reconstruction from Any Single Portrait Image

Mar 11, 2025

Jianfu Zhang, yujie Gao, Jiahui Zhan, Wentao Wang, Yiyi Zhang, Haohua Zhao, Liqing Zhang

Figure 1 for High-Quality 3D Head Reconstruction from Any Single Portrait Image

Figure 2 for High-Quality 3D Head Reconstruction from Any Single Portrait Image

Figure 3 for High-Quality 3D Head Reconstruction from Any Single Portrait Image

Figure 4 for High-Quality 3D Head Reconstruction from Any Single Portrait Image

Abstract:In this work, we introduce a novel high-fidelity 3D head reconstruction method from a single portrait image, regardless of perspective, expression, or accessories. Despite significant efforts in adapting 2D generative models for novel view synthesis and 3D optimization, most methods struggle to produce high-quality 3D portraits. The lack of crucial information, such as identity, expression, hair, and accessories, limits these approaches in generating realistic 3D head models. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totalling 21,792 frames, featuring diverse expressions and accessories. To further improve performance, we integrate identity and expression information into the multi-view diffusion process to enhance facial consistency across views. Specifically, we apply identity- and expression-aware guidance and supervision to extract accurate facial representations, which guide the model and enforce objective functions to ensure high identity and expression consistency during generation. Finally, we generate an orbital video around the portrait consisting of 96 multi-view frames, which can be used for 3D portrait model reconstruction. Our method demonstrates robust performance across challenging scenarios, including side-face angles and complex accessories

Via

Access Paper or Ask Questions

GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

Nov 27, 2024

Wentao Wang, Hang Ye, Fangzhou Hong, Xue Yang, Jianfu Zhang, Yizhou Wang, Ziwei Liu, Liang Pan

Figure 1 for GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

Figure 2 for GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

Figure 3 for GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

Figure 4 for GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

Abstract:Given a single in-the-wild human photo, it remains a challenging task to reconstruct a high-fidelity 3D human model. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, building upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, serving as GeneMAN 2D human prior and 3D human prior for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization-&-Sculpting pipeline is leveraged to recover high-quality 3D human geometry given a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, consecutively refining textures in the latent and the pixel spaces. Extensive experimental results demonstrate that GeneMAN could generate high-quality 3D human models from a single image input, outperforming prior state-of-the-art methods. Notably, GeneMAN could reveal much better generalizability in dealing with in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.

* Project page: https://roooooz.github.io/GeneMAN/

Via

Access Paper or Ask Questions

DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning

Nov 07, 2024

Yuxuan Duan, Yan Hong, Bo Zhang, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang, Li Niu, Liqing Zhang

Figure 1 for DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning

Figure 2 for DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning

Figure 3 for DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning

Figure 4 for DomainGallery: Few-shot Domain-driven Image Generation by Attribute-centric Finetuning

Abstract:The recent progress in text-to-image models pretrained on large-scale datasets has enabled us to generate various images as long as we provide a text prompt describing what we want. Nevertheless, the availability of these models is still limited when we expect to generate images that fall into a specific domain either hard to describe or just unseen to the models. In this work, we propose DomainGallery, a few-shot domain-driven image generation method which aims at finetuning pretrained Stable Diffusion on few-shot target datasets in an attribute-centric manner. Specifically, DomainGallery features prior attribute erasure, attribute disentanglement, regularization and enhancement. These techniques are tailored to few-shot domain-driven generation in order to solve key issues that previous works have failed to settle. Extensive experiments are given to validate the superior performance of DomainGallery on a variety of domain-driven generation scenarios. Codes are available at https://github.com/Ldhlwh/DomainGallery.

* NeurIPS 2024

Via

Access Paper or Ask Questions

Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On

Jun 15, 2024

Lingxiao Lu, Shengyi Wu, Haoxuan Sun, Junhong Gou, Jianlou Si, Chen Qian, Jianfu Zhang, Liqing Zhang

Abstract:Virtual clothes try-on has emerged as a vital feature in online shopping, offering consumers a critical tool to visualize how clothing fits. In our research, we introduce an innovative approach for virtual clothes try-on, utilizing a self-supervised Vision Transformer (ViT) coupled with a diffusion model. Our method emphasizes detail enhancement by contrasting local clothing image embeddings, generated by ViT, with their global counterparts. Techniques such as conditional guidance and focus on key regions have been integrated into our approach. These combined strategies empower the diffusion model to reproduce clothing details with increased clarity and realism. The experimental results showcase substantial advancements in the realism and precision of details in virtual try-on experiences, significantly surpassing the capabilities of existing technologies.

Via

Access Paper or Ask Questions

DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

May 30, 2024

Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang(+1 more)

Figure 1 for DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

Figure 2 for DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

Figure 3 for DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

Figure 4 for DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark

Abstract:Recently, video generation techniques have advanced rapidly. Given the popularity of video content on social media platforms, these models intensify concerns about the spread of fake information. Therefore, there is a growing demand for detectors capable of distinguishing between fake AI-generated videos and mitigating the potential harm caused by fake information. However, the lack of large-scale datasets from the most advanced video generators poses a barrier to the development of such detectors. To address this gap, we introduce the first AI-generated video detection dataset, GenVideo. It features the following characteristics: (1) a large volume of videos, including over one million AI-generated and real videos collected; (2) a rich diversity of generated content and methodologies, covering a broad spectrum of video categories and generation techniques. We conducted extensive studies of the dataset and proposed two evaluation methods tailored for real-world-like scenarios to assess the detectors' performance: the cross-generator video classification task assesses the generalizability of trained detectors on generators; the degraded video classification task evaluates the robustness of detectors to handle videos that have degraded in quality during dissemination. Moreover, we introduced a plug-and-play module, named Detail Mamba (DeMamba), designed to enhance the detectors by identifying AI-generated videos through the analysis of inconsistencies in temporal and spatial dimensions. Our extensive experiments demonstrate DeMamba's superior generalizability and robustness on GenVideo compared to existing detectors. We believe that the GenVideo dataset and the DeMamba module will significantly advance the field of AI-generated video detection. Our code and dataset will be aviliable at \url{https://github.com/chenhaoxing/DeMamba}.

Via

Access Paper or Ask Questions

Supervised Contrastive Learning for Snapshot Spectral Imaging Face Anti-Spoofing

May 29, 2024

Chuanbiao Song, Yan Hong, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang

Figure 1 for Supervised Contrastive Learning for Snapshot Spectral Imaging Face Anti-Spoofing

Figure 2 for Supervised Contrastive Learning for Snapshot Spectral Imaging Face Anti-Spoofing

Figure 3 for Supervised Contrastive Learning for Snapshot Spectral Imaging Face Anti-Spoofing

Figure 4 for Supervised Contrastive Learning for Snapshot Spectral Imaging Face Anti-Spoofing

Abstract:This study reveals a cutting-edge re-balanced contrastive learning strategy aimed at strengthening face anti-spoofing capabilities within facial recognition systems, with a focus on countering the challenges posed by printed photos, and highly realistic silicone or latex masks. Leveraging the HySpeFAS dataset, which benefits from Snapshot Spectral Imaging technology to provide hyperspectral images, our approach harmonizes class-level contrastive learning with data resampling and an innovative real-face oriented reweighting technique. This method effectively mitigates dataset imbalances and reduces identity-related biases. Notably, our strategy achieved an unprecedented 0.0000\% Average Classification Error Rate (ACER) on the HySpeFAS dataset, ranking first at the Chalearn Snapshot Spectral Imaging Face Anti-spoofing Challenge on CVPR 2024.

* We rank first at the Chalearn Snapshot Spectral Imaging Face Anti-spoofing Challenge on CVPR 2024; the paper is accepted by CVPR 2024 workshop;

Via

Access Paper or Ask Questions