Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Haiyong Zheng

MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion

Sep 16, 2025

Guihui Li, Bowei Dong, Kaizhi Dong, Jiayi Li, Haiyong Zheng

Figure 1 for MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion

Figure 2 for MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion

Figure 3 for MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion

Figure 4 for MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion

Abstract:Infrared and visible image fusion has garnered considerable attention owing to the strong complementarity of these two modalities in complex, harsh environments. While deep learning-based fusion methods have made remarkable advances in feature extraction, alignment, fusion, and reconstruction, they still depend largely on low-level visual cues, such as texture and contrast, and struggle to capture the high-level semantic information embedded in images. Recent attempts to incorporate text as a source of semantic guidance have relied on unstructured descriptions that neither explicitly model entities, attributes, and relationships nor provide spatial localization, thereby limiting fine-grained fusion performance. To overcome these challenges, we introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations, and then synchronously refines high-level semantics and low-level details through successive modules for scene graph representation, hierarchical aggregation, and graph-driven fusion. Extensive experiments on multiple public benchmarks show that MSGFusion significantly outperforms state-of-the-art approaches, particularly in detail preservation and structural clarity, and delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.

Via

Access Paper or Ask Questions

Face Forgery Detection with Elaborate Backbone

Sep 25, 2024

Zonghui Guo, Yingjie Liu, Jie Zhang, Haiyong Zheng, Shiguang Shan

Figure 1 for Face Forgery Detection with Elaborate Backbone

Figure 2 for Face Forgery Detection with Elaborate Backbone

Figure 3 for Face Forgery Detection with Elaborate Backbone

Figure 4 for Face Forgery Detection with Elaborate Backbone

Abstract:Face Forgery Detection (FFD), or Deepfake detection, aims to determine whether a digital face is real or fake. Due to different face synthesis algorithms with diverse forgery patterns, FFD models often overfit specific patterns in training datasets, resulting in poor generalization to other unseen forgeries. This severe challenge requires FFD models to possess strong capabilities in representing complex facial features and extracting subtle forgery cues. Although previous FFD models directly employ existing backbones to represent and extract facial forgery cues, the critical role of backbones is often overlooked, particularly as their knowledge and capabilities are insufficient to address FFD challenges, inevitably limiting generalization. Therefore, it is essential to integrate the backbone pre-training configurations and seek practical solutions by revisiting the complete FFD workflow, from backbone pre-training and fine-tuning to inference of discriminant results. Specifically, we analyze the crucial contributions of backbones with different configurations in FFD task and propose leveraging the ViT network with self-supervised learning on real-face datasets to pre-train a backbone, equipping it with superior facial representation capabilities. We then build a competitive backbone fine-tuning framework that strengthens the backbone's ability to extract diverse forgery cues within a competitive learning mechanism. Moreover, we devise a threshold optimization mechanism that utilizes prediction confidence to improve the inference reliability. Comprehensive experiments demonstrate that our FFD model with the elaborate backbone achieves excellent performance in FFD and extra face-related tasks, i.e., presentation attack detection. Code and models are available at https://github.com/zhenglab/FFDBackbone.

Via

Access Paper or Ask Questions

Advancing H&E-to-IHC Stain Translation in Breast Cancer: A Multi-Magnification and Attention-Based Approach

Aug 04, 2024

Linhao Qu, Chengsheng Zhang, Guihui Li, Haiyong Zheng, Chen Peng, Wei He

Figure 1 for Advancing H&E-to-IHC Stain Translation in Breast Cancer: A Multi-Magnification and Attention-Based Approach

Figure 2 for Advancing H&E-to-IHC Stain Translation in Breast Cancer: A Multi-Magnification and Attention-Based Approach

Figure 3 for Advancing H&E-to-IHC Stain Translation in Breast Cancer: A Multi-Magnification and Attention-Based Approach

Figure 4 for Advancing H&E-to-IHC Stain Translation in Breast Cancer: A Multi-Magnification and Attention-Based Approach

Abstract:Breast cancer presents a significant healthcare challenge globally, demanding precise diagnostics and effective treatment strategies, where histopathological examination of Hematoxylin and Eosin (H&E) stained tissue sections plays a central role. Despite its importance, evaluating specific biomarkers like Human Epidermal Growth Factor Receptor 2 (HER2) for personalized treatment remains constrained by the resource-intensive nature of Immunohistochemistry (IHC). Recent strides in deep learning, particularly in image-to-image translation, offer promise in synthesizing IHC-HER2 slides from H\&E stained slides. However, existing methodologies encounter challenges, including managing multiple magnifications in pathology images and insufficient focus on crucial information during translation. To address these issues, we propose a novel model integrating attention mechanisms and multi-magnification information processing. Our model employs a multi-magnification processing strategy to extract and utilize information from various magnifications within pathology images, facilitating robust image translation. Additionally, an attention module within the generative network prioritizes critical information for image distribution translation while minimizing less pertinent details. Rigorous testing on a publicly available breast cancer dataset demonstrates superior performance compared to existing methods, establishing our model as a state-of-the-art solution in advancing pathology image translation from H&E to IHC staining.

* Accepted by IEEE CIS-RAM 2024 Invited Session Oral

Via

Access Paper or Ask Questions

Quantum Recurrent Neural Networks for Sequential Learning

Feb 07, 2023

Yanan Li, Zhimin Wang, Rongbing Han, Shangshang Shi, Jiaxin Li, Ruimin Shang, Haiyong Zheng, Guoqiang Zhong, Yongjian Gu

Figure 1 for Quantum Recurrent Neural Networks for Sequential Learning

Figure 2 for Quantum Recurrent Neural Networks for Sequential Learning

Figure 3 for Quantum Recurrent Neural Networks for Sequential Learning

Figure 4 for Quantum Recurrent Neural Networks for Sequential Learning

Abstract:Quantum neural network (QNN) is one of the promising directions where the near-term noisy intermediate-scale quantum (NISQ) devices could find advantageous applications against classical resources. Recurrent neural networks are the most fundamental networks for sequential learning, but up to now there is still a lack of canonical model of quantum recurrent neural network (QRNN), which certainly restricts the research in the field of quantum deep learning. In the present work, we propose a new kind of QRNN which would be a good candidate as the canonical QRNN model, where, the quantum recurrent blocks (QRBs) are constructed in the hardware-efficient way, and the QRNN is built by stacking the QRBs in a staggered way that can greatly reduce the algorithm's requirement with regard to the coherent time of quantum devices. That is, our QRNN is much more accessible on NISQ devices. Furthermore, the performance of the present QRNN model is verified concretely using three different kinds of classical sequential data, i.e., meteorological indicators, stock price, and text categorization. The numerical experiments show that our QRNN achieves much better performance in prediction (classification) accuracy against the classical RNN and state-of-the-art QNN models for sequential learning, and can predict the changing details of temporal sequence data. The practical circuit structure and superior performance indicate that the present QRNN is a promising learning model to find quantum advantageous applications in the near term.

Via

Access Paper or Ask Questions

SinTra: Learning an inspiration model from a single multi-track music segment

Apr 21, 2022

Qingwei Song, Qiwei Sun, Dongsheng Guo, Haiyong Zheng

Figure 1 for SinTra: Learning an inspiration model from a single multi-track music segment

Figure 2 for SinTra: Learning an inspiration model from a single multi-track music segment

Figure 3 for SinTra: Learning an inspiration model from a single multi-track music segment

Figure 4 for SinTra: Learning an inspiration model from a single multi-track music segment

Abstract:In this paper, we propose SinTra, an auto-regressive sequential generative model that can learn from a single multi-track music segment, to generate coherent, aesthetic, and variable polyphonic music of multi-instruments with an arbitrary length of bar. For this task, to ensure the relevance of generated samples and training music, we present a novel pitch-group representation. SinTra, consisting of a pyramid of Transformer-XL with a multi-scale training strategy, can learn both the musical structure and the relative positional relationship between notes of the single training music segment. Additionally, for maintaining the inter-track correlation, we use the convolution operation to process multi-track music, and when decoding, the tracks are independent to each other to prevent interference. We evaluate SinTra with both subjective study and objective metrics. The comparison results show that our framework can learn information from a single music segment more sufficiently than Music Transformer. Also the comparison between SinTra and its variant, i.e., the single-stage SinTra with the first stage only, shows that the pyramid structure can effectively suppress overly-fragmented notes.

Via

Access Paper or Ask Questions

SGUIE-Net: Semantic Attention Guided Underwater Image Enhancement with Multi-Scale Perception

Jan 08, 2022

Qi Qi, Kunqian Li, Haiyong Zheng, Xiang Gao, Guojia Hou, Kun Sun

Figure 1 for SGUIE-Net: Semantic Attention Guided Underwater Image Enhancement with Multi-Scale Perception

Figure 2 for SGUIE-Net: Semantic Attention Guided Underwater Image Enhancement with Multi-Scale Perception

Figure 3 for SGUIE-Net: Semantic Attention Guided Underwater Image Enhancement with Multi-Scale Perception

Figure 4 for SGUIE-Net: Semantic Attention Guided Underwater Image Enhancement with Multi-Scale Perception

Abstract:Due to the wavelength-dependent light attenuation, refraction and scattering, underwater images usually suffer from color distortion and blurred details. However, due to the limited number of paired underwater images with undistorted images as reference, training deep enhancement models for diverse degradation types is quite difficult. To boost the performance of data-driven approaches, it is essential to establish more effective learning mechanisms that mine richer supervised information from limited training sample resources. In this paper, we propose a novel underwater image enhancement network, called SGUIE-Net, in which we introduce semantic information as high-level guidance across different images that share common semantic regions. Accordingly, we propose semantic region-wise enhancement module to perceive the degradation of different semantic regions from multiple scales and feed it back to the global attention features extracted from its original scale. This strategy helps to achieve robust and visually pleasant enhancements to different semantic objects, which should thanks to the guidance of semantic information for differentiated enhancement. More importantly, for those degradation types that are not common in the training sample distribution, the guidance connects them with the already well-learned types according to their semantic relevance. Extensive experiments on the publicly available datasets and our proposed dataset demonstrated the impressive performance of SGUIE-Net. The code and proposed dataset are available at: https://trentqq.github.io/SGUIE-Net.html

Via

Access Paper or Ask Questions

ReshapeGAN: Object Reshaping by Providing A Single Reference Image

May 16, 2019

Ziqiang Zheng, Yang Wu, Zhibin Yu, Yang Yang, Haiyong Zheng, Takeo Kanade

Figure 1 for ReshapeGAN: Object Reshaping by Providing A Single Reference Image

Figure 2 for ReshapeGAN: Object Reshaping by Providing A Single Reference Image

Figure 3 for ReshapeGAN: Object Reshaping by Providing A Single Reference Image

Figure 4 for ReshapeGAN: Object Reshaping by Providing A Single Reference Image

Abstract:The aim of this work is learning to reshape the object in an input image to an arbitrary new shape, by just simply providing a single reference image with an object instance in the desired shape. We propose a new Generative Adversarial Network (GAN) architecture for such an object reshaping problem, named ReshapeGAN. The network can be tailored for handling all kinds of problem settings, including both within-domain (or single-dataset) reshaping and cross-domain (typically across mutiple datasets) reshaping, with paired or unpaired training data. The appearance of the input object is preserved in all cases, and thus it is still identifiable after reshaping, which has never been achieved as far as we are aware. We present the tailored models of the proposed ReshapeGAN for all the problem settings, and have them tested on 8 kinds of reshaping tasks with 13 different datasets, demonstrating the ability of ReshapeGAN on generating convincing and superior results for object reshaping. To the best of our knowledge, we are the first to be able to make one GAN framework work on all such object reshaping tasks, especially the cross-domain tasks on handling multiple diverse datasets. We present here both ablation studies on our proposed ReshapeGAN models and comparisons with the state-of-the-art models when they are made comparable, using all kinds of applicable metrics that we are aware of.

* 25 pages, 23 figures

Via

Access Paper or Ask Questions

One-Shot Image-to-Image Translation via Part-Global Learning with a Multi-adversarial Framework

May 12, 2019

Ziqiang Zheng, Zhibin Yu, Haiyong Zheng, Yang Yang, Heng Tao Shen

Figure 1 for One-Shot Image-to-Image Translation via Part-Global Learning with a Multi-adversarial Framework

Figure 2 for One-Shot Image-to-Image Translation via Part-Global Learning with a Multi-adversarial Framework

Figure 3 for One-Shot Image-to-Image Translation via Part-Global Learning with a Multi-adversarial Framework

Figure 4 for One-Shot Image-to-Image Translation via Part-Global Learning with a Multi-adversarial Framework

Abstract:It is well known that humans can learn and recognize objects effectively from several limited image samples. However, learning from just a few images is still a tremendous challenge for existing main-stream deep neural networks. Inspired by analogical reasoning in the human mind, a feasible strategy is to translate the abundant images of a rich source domain to enrich the relevant yet different target domain with insufficient image data. To achieve this goal, we propose a novel, effective multi-adversarial framework (MA) based on part-global learning, which accomplishes one-shot cross-domain image-to-image translation. In specific, we first devise a part-global adversarial training scheme to provide an efficient way for feature extraction and prevent discriminators being over-fitted. Then, a multi-adversarial mechanism is employed to enhance the image-to-image translation ability to unearth the high-level semantic representation. Moreover, a balanced adversarial loss function is presented, which aims to balance the training data and stabilize the training process. Extensive experiments demonstrate that the proposed approach can obtain impressive results on various datasets between two extremely imbalanced image domains and outperform state-of-the-art methods on one-shot image-to-image translation.

* 9 pages, 13 figures

Via

Access Paper or Ask Questions

Generative Adversarial Network with Multi-Branch Discriminator for Cross-Species Image-to-Image Translation

Jan 24, 2019

Ziqiang Zheng, Zhibin Yu, Haiyong Zheng, Yang Wu, Bing Zheng, Ping Lin

Figure 1 for Generative Adversarial Network with Multi-Branch Discriminator for Cross-Species Image-to-Image Translation

Figure 2 for Generative Adversarial Network with Multi-Branch Discriminator for Cross-Species Image-to-Image Translation

Figure 3 for Generative Adversarial Network with Multi-Branch Discriminator for Cross-Species Image-to-Image Translation

Figure 4 for Generative Adversarial Network with Multi-Branch Discriminator for Cross-Species Image-to-Image Translation

Abstract:Current approaches have made great progress on image-to-image translation tasks benefiting from the success of image synthesis methods especially generative adversarial networks (GANs). However, existing methods are limited to handling translation tasks between two species while keeping the content matching on the semantic level. A more challenging task would be the translation among more than two species. To explore this new area, we propose a simple yet effective structure of a multi-branch discriminator for enhancing an arbitrary generative adversarial architecture (GAN), named GAN-MBD. It takes advantage of the boosting strategy to break a common discriminator into several smaller ones with fewer parameters, which can enhance the generation and synthesis abilities of GANs efficiently and effectively. Comprehensive experiments show that the proposed multi-branch discriminator can dramatically improve the performance of popular GANs on cross-species image-to-image translation tasks while reducing the number of parameters for computation. The code and some datasets are attached as supplementary materials for reference.

* 10 pages, 16 figures

Via

Access Paper or Ask Questions

Discriminative Region Proposal Adversarial Networks for High-Quality Image-to-Image Translation

Aug 06, 2018

Chao Wang, Haiyong Zheng, Zhibin Yu, Ziqiang Zheng, Zhaorui Gu, Bing Zheng

Figure 1 for Discriminative Region Proposal Adversarial Networks for High-Quality Image-to-Image Translation

Figure 2 for Discriminative Region Proposal Adversarial Networks for High-Quality Image-to-Image Translation

Figure 3 for Discriminative Region Proposal Adversarial Networks for High-Quality Image-to-Image Translation

Figure 4 for Discriminative Region Proposal Adversarial Networks for High-Quality Image-to-Image Translation

Abstract:Image-to-image translation has been made much progress with embracing Generative Adversarial Networks (GANs). However, it's still very challenging for translation tasks that require high quality, especially at high-resolution and photorealism. In this paper, we present Discriminative Region Proposal Adversarial Networks (DRPAN) for high-quality image-to-image translation. We decompose the procedure of image-to-image translation task into three iterated steps, first is to generate an image with global structure but some local artifacts (via GAN), second is using our DRPnet to propose the most fake region from the generated image, and third is to implement "image inpainting" on the most fake region for more realistic result through a reviser, so that the system (DRPAN) can be gradually optimized to synthesize images with more attention on the most artifact local part. Experiments on a variety of image-to-image translation tasks and datasets validate that our method outperforms state-of-the-arts for producing high-quality translation results in terms of both human perceptual studies and automatic quantitative measures.

* ECCV 2018

Via

Access Paper or Ask Questions