Abstract:If unaligned multimodal medical images can be simultaneously aligned and fused using a single-stage approach within a unified processing framework, it will not only achieve mutual promotion of dual tasks but also help reduce the complexity of the model. However, the design of this model faces the challenge of incompatible requirements for feature fusion and alignment; specifically, feature alignment requires consistency among corresponding features, whereas feature fusion requires the features to be complementary to each other. To address this challenge, this paper proposes an unaligned medical image fusion method called Bidirectional Stepwise Feature Alignment and Fusion (BSFA-F) strategy. To reduce the negative impact of modality differences on cross-modal feature matching, we incorporate the Modal Discrepancy-Free Feature Representation (MDF-FR) method into BSFA-F. MDF-FR utilizes a Modality Feature Representation Head (MFRH) to integrate the global information of the input image. By injecting the information contained in MFRH of the current image into other modality images, it effectively reduces the impact of modality differences on feature alignment while preserving the complementary information carried by different images. In terms of feature alignment, BSFA-F employs a bidirectional stepwise alignment deformation field prediction strategy based on the path independence of vector displacement between two points. This strategy solves the problem of large spans and inaccurate deformation field prediction in single-step alignment. Finally, Multi-Modal Feature Fusion block achieves the fusion of aligned features. The experimental results across multiple datasets demonstrate the effectiveness of our method. The source code is available at https://github.com/slrl123/BSAFusion.
Abstract:Video Paragraph Grounding (VPG) aims to precisely locate the most appropriate moments within a video that are relevant to a given textual paragraph query. However, existing methods typically rely on large-scale annotated temporal labels and assume that the correspondence between videos and paragraphs is known. This is impractical in real-world applications, as constructing temporal labels requires significant labor costs, and the correspondence is often unknown. To address this issue, we propose a Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding method (DMR-JRG). In this method, retrieval and grounding tasks are mutually reinforced rather than being treated as separate issues. DMR-JRG mainly consists of two branches: a retrieval branch and a grounding branch. The retrieval branch uses inter-video contrastive learning to roughly align the global features of paragraphs and videos, reducing modality differences and constructing a coarse-grained feature space to break free from the need for correspondence between paragraphs and videos. Additionally, this coarse-grained feature space further facilitates the grounding branch in extracting fine-grained contextual representations. In the grounding branch, we achieve precise cross-modal matching and grounding by exploring the consistency between local, global, and temporal dimensions of video segments and textual paragraphs. By synergizing these dimensions, we construct a fine-grained feature space for video and textual features, greatly reducing the need for large-scale annotated temporal labels.
Abstract:Infrared and visible (IR-VIS) image fusion has gained significant attention for its broad application value. However, existing methods often neglect the complementary role of infrared image in restoring visible image features under hazy conditions. To address this, we propose a joint learning framework that utilizes infrared image for the restoration and fusion of hazy IR-VIS images. To mitigate the adverse effects of feature diversity between IR-VIS images, we introduce a prompt generation mechanism that regulates modality-specific feature incompatibility. This creates a prompt selection matrix from non-shared image information, followed by prompt embeddings generated from a prompt pool. These embeddings help generate candidate features for dehazing. We further design an infrared-assisted feature restoration mechanism that selects candidate features based on haze density, enabling simultaneous restoration and fusion within a single-stage framework. To enhance fusion quality, we construct a multi-stage prompt embedding fusion module that leverages feature supplementation from the prompt generation module. Our method effectively fuses IR-VIS images while removing haze, yielding clear, haze-free fusion results. In contrast to two-stage methods that dehaze and then fuse, our approach enables collaborative training in a single-stage framework, making the model relatively lightweight and suitable for practical deployment. Experimental results validate its effectiveness and demonstrate advantages over existing methods.
Abstract:The primary value of infrared and visible image fusion technology lies in applying the fusion results to downstream tasks. However, existing methods face challenges such as increased training complexity and significantly compromised performance of individual tasks when addressing multiple downstream tasks simultaneously. To tackle this, we propose Task-Oriented Adaptive Regulation (T-OAR), an adaptive mechanism specifically designed for multi-task environments. Additionally, we introduce the Task-related Dynamic Prompt Injection (T-DPI) module, which generates task-specific dynamic prompts from user-input text instructions and integrates them into target representations. This guides the feature extraction module to produce representations that are more closely aligned with the specific requirements of downstream tasks. By incorporating the T-DPI module into the T-OAR framework, our approach generates fusion images tailored to task-specific requirements without the need for separate training or task-specific weights. This not only reduces computational costs but also enhances adaptability and performance across multiple tasks. Experimental results show that our method excels in object detection, semantic segmentation, and salient object detection, demonstrating its strong adaptability, flexibility, and task specificity. This provides an efficient solution for image fusion in multi-task environments, highlighting the technology's potential across diverse applications.
Abstract:Visual grounding has attracted wide attention thanks to its broad application in various visual language tasks. Although visual grounding has made significant research progress, existing methods ignore the promotion effect of the association between text and image features at different hierarchies on cross-modal matching. This paper proposes a Phrase Decoupling Cross-Modal Hierarchical Matching and Progressive Position Correction Visual Grounding method. It first generates a mask through decoupled sentence phrases, and a text and image hierarchical matching mechanism is constructed, highlighting the role of association between different hierarchies in cross-modal matching. In addition, a corresponding target object position progressive correction strategy is defined based on the hierarchical matching mechanism to achieve accurate positioning for the target object described in the text. This method can continuously optimize and adjust the bounding box position of the target object as the certainty of the text description of the target object improves. This design explores the association between features at different hierarchies and highlights the role of features related to the target object and its position in target positioning. The proposed method is validated on different datasets through experiments, and its superiority is verified by the performance comparison with the state-of-the-art methods.
Abstract:The reconstruction of high dynamic range (HDR) images from multi-exposure low dynamic range (LDR) images in dynamic scenes presents significant challenges, especially in preserving and restoring information in oversaturated regions and avoiding ghosting artifacts. While current methods often struggle to address these challenges, our work aims to bridge this gap by developing a multi-exposure HDR image reconstruction network for dynamic scenes, complemented by single-frame HDR image reconstruction. This network, comprising single-frame HDR reconstruction with enhanced stop image (SHDR-ESI) and SHDR-ESI-assisted multi-exposure HDR reconstruction (SHDRA-MHDR), effectively leverages the ghost-free characteristic of single-frame HDR reconstruction and the detail-enhancing capability of ESI in oversaturated areas. Specifically, SHDR-ESI innovatively integrates single-frame HDR reconstruction with the utilization of ESI. This integration not only optimizes the single image HDR reconstruction process but also effectively guides the synthesis of multi-exposure HDR images in SHDR-AMHDR. In this method, the single-frame HDR reconstruction is specifically applied to reduce potential ghosting effects in multiexposure HDR synthesis, while the use of ESI images assists in enhancing the detail information in the HDR synthesis process. Technically, SHDR-ESI incorporates a detail enhancement mechanism, which includes a self-representation module and a mutual-representation module, designed to aggregate crucial information from both reference image and ESI. To fully leverage the complementary information from non-reference images, a feature interaction fusion module is integrated within SHDRA-MHDR. Additionally, a ghost suppression module, guided by the ghost-free results of SHDR-ESI, is employed to suppress the ghosting artifacts.
Abstract:Recovering a clear image from a single hazy image is an open inverse problem. Although significant research progress has been made, most existing methods ignore the effect that downstream tasks play in promoting upstream dehazing. From the perspective of the haze generation mechanism, there is a potential relationship between the depth information of the scene and the hazy image. Based on this, we propose a dual-task collaborative mutual promotion framework to achieve the dehazing of a single image. This framework integrates depth estimation and dehazing by a dual-task interaction mechanism and achieves mutual enhancement of their performance. To realize the joint optimization of the two tasks, an alternative implementation mechanism with the difference perception is developed. On the one hand, the difference perception between the depth maps of the dehazing result and the ideal image is proposed to promote the dehazing network to pay attention to the non-ideal areas of the dehazing. On the other hand, by improving the depth estimation performance in the difficult-to-recover areas of the hazy image, the dehazing network can explicitly use the depth information of the hazy image to assist the clear image recovery. To promote the depth estimation, we propose to use the difference between the dehazed image and the ground truth to guide the depth estimation network to focus on the dehazed unideal areas. It allows dehazing and depth estimation to leverage their strengths in a mutually reinforcing manner. Experimental results show that the proposed method can achieve better performance than that of the state-of-the-art approaches.
Abstract:Current infrared and visible image fusion (IVIF) methods go to great lengths to excavate complementary features and design complex fusion strategies, which is extremely challenging. To this end, we rethink the IVIF outside the box, proposing a complementary to harmonious information transfer network (CHITNet). It reasonably transfers complementary information into harmonious one, which integrates both the shared and complementary features from two modalities. Specifically, to skillfully sidestep aggregating complementary information in IVIF, we design a mutual information transfer (MIT) module to mutually represent features from two modalities, roughly transferring complementary information into harmonious one. Then, a harmonious information acquisition supervised by source image (HIASSI) module is devised to further ensure the complementary to harmonious information transfer after MIT. Meanwhile, we also propose a structure information preservation (SIP) module to guarantee that the edge structure information of the source images can be transferred to the fusion results. Moreover, a mutual promotion training paradigm (MPTP) with interaction loss is adopted to facilitate better collaboration among MIT, HIASSI and SIP. In this way, the proposed method is able to generate fused images with higher qualities. Extensive experimental results demonstrate the superiority of our CHITNet over state-of-the-art algorithms in terms of visual quality and quantitative evaluations.
Abstract:Multifocus image fusion is an effective way to overcome the limitation of optical lenses. Many existing methods obtain fused results by generating decision maps. However, such methods often assume that the focused areas of the two source images are complementary, making it impossible to achieve simultaneous fusion of multiple images. Additionally, the existing methods ignore the impact of hard pixels on fusion performance, limiting the visual quality improvement of fusion image. To address these issues, a combining generation and recombination model, termed as GRFusion, is proposed. In GRFusion, focus property detection of each source image can be implemented independently, enabling simultaneous fusion of multiple source images and avoiding information loss caused by alternating fusion. This makes GRFusion free from the number of inputs. To distinguish the hard pixels from the source images, we achieve the determination of hard pixels by considering the inconsistency among the detection results of focus areas in source images. Furthermore, a multi-directional gradient embedding method for generating full focus images is proposed. Subsequently, a hard-pixel-guided recombination mechanism for constructing fused result is devised, effectively integrating the complementary advantages of feature reconstruction-based method and focused pixel recombination-based method. Extensive experimental results demonstrate the effectiveness and the superiority of the proposed method.The source code will be released on https://github.com/xxx/xxx.
Abstract:Text-Pedestrian Image Retrieval aims to use the text describing pedestrian appearance to retrieve the corresponding pedestrian image. This task involves not only modality discrepancy, but also the challenge of the textual diversity of pedestrians with the same identity. At present, although existing research progress has been made in text-pedestrian image retrieval, these methods do not comprehensively consider the above-mentioned problems. Considering these, this paper proposes a progressive feature mining and external knowledge-assisted feature purification method. Specifically, we use a progressive mining mode to enable the model to mine discriminative features from neglected information, thereby avoiding the loss of discriminative information and improving the expression ability of features. In addition, to further reduce the negative impact of modal discrepancy and text diversity on cross-modal matching, we propose to use other sample knowledge of the same modality, i.e., external knowledge to enhance identity-consistent features and weaken identity-inconsistent features. This process purifies features and alleviates the interference caused by textual diversity and negative sample correlation features of the same modal. Extensive experiments on three challenging datasets demonstrate the effectiveness and superiority of the proposed method, and the retrieval performance even surpasses that of the large-scale model-based method on large-scale datasets.