Title: MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval

Oct 31, 2024

Haiwen Li, Fei Su, Zhicheng Zhao

Abstract: Composed Image Retrieval (CIR) is a challenging vision-language task that uses bi-modal (image+text) queries to retrieve target images. Despite the impressive performance of supervised CIR, its dependence on costly, manually labeled triplets limits scalability and zero-shot capability. To address this issue, zero-shot composed image retrieval (ZS-CIR) has been proposed, along with projection-based approaches. However, such methods face two major problems: task discrepancy between pre-training (image $\leftrightarrow$ text) and inference (image+text $\rightarrow$ image), and modality discrepancy. The latter affects approaches that train the projection on text only, because features must still be extracted from the reference image at inference time. In this paper, we propose a two-stage framework to tackle both discrepancies. First, to ensure efficiency and scalability, a textual inversion network is pre-trained on large-scale caption datasets. Second, we put forward Modality-Task Dual Alignment (MoTaDual), in which large language models (LLMs) generate triplet data for fine-tuning and prompt learning is introduced in a multi-modal context to alleviate both modality and task discrepancies. Experimental results show that MoTaDual achieves state-of-the-art performance across four widely used ZS-CIR benchmarks while maintaining low training time and computational cost. The code will be released soon.
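
To make the projection-based inference path (image+text $\rightarrow$ image) concrete, the following is a minimal toy sketch and not the authors' implementation, whose code has not been released. It assumes the textual inversion network is a single linear map and replaces the pre-trained text encoder with mean pooling over token embeddings. All function names, the weight matrix, and the 3-dimensional features are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def project_to_pseudo_token(image_feat, weight):
    """Hypothetical textual inversion network: one linear map from an
    image feature to a pseudo-word embedding in the token space."""
    return [sum(w * x for w, x in zip(row, image_feat)) for row in weight]


def encode_text(token_embeddings):
    """Stand-in for a pre-trained text encoder: mean-pool the tokens."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) / n for i in range(dim)]


def compose_query(image_feat, modifier_tokens, weight):
    """Build the bi-modal query: the pseudo-word token from the reference
    image is placed alongside the modifier-text tokens and encoded as text."""
    pseudo = project_to_pseudo_token(image_feat, weight)
    return encode_text([pseudo] + modifier_tokens)


def retrieve(query, gallery, top_k=1):
    """Rank gallery images by cosine similarity to the composed query."""
    ranked = sorted(gallery.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Toy data: identity projection, one reference image, one modifier token.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reference_image = [1.0, 0.0, 0.0]
modifier_tokens = [[0.0, 1.0, 0.0]]

gallery = {
    "reference_like": [1.0, 0.0, 0.0],
    "target": [1.0, 1.0, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}

query = compose_query(reference_image, modifier_tokens, identity)
print(retrieve(query, gallery, top_k=2))
```

In this toy setup the composed query lies between the reference image and the modifier direction, so the gallery item that combines both ranks first. The sketch also shows where the modality discrepancy comes from: if the projection is trained on text features only, `project_to_pseudo_token` still receives an image feature at inference time.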
