Title: Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment

Aug 27, 2023

Jiamin Zhuang, Jing Yu, Yang Ding, Xiangyan Qu, Yue Hu

Abstract: Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pretrained encoders, which still demand time-consuming similarity measurement or a heavyweight model structure in the retrieval stage. In this work, we propose an image-text alignment module, SelfAlign, on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency without extra supervision. SelfAlign contains two collaborative sub-modules that force image-text alignment at both the concept level and the context level by self-supervised contrastive learning. It does not require cross-modal embedding interactions during training while maintaining independent image and text encoders during retrieval. With comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art non-pretraining independent-embedding models by 9.1%, 4.2%, and 6.6% in terms of R@sum score on the Flickr30K, MS-COCO 1K, and MS-COCO 5K datasets, respectively. The retrieval accuracy also outperforms most existing interactive-embedding models with an orders-of-magnitude decrease in retrieval time. The source code is available at: https://github.com/Zjamie813/SelfAlign.
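To make the evaluation setting in the abstract concrete, here is a minimal sketch of retrieval with independently computed embeddings and of the R@sum metric. It is not the authors' code: the function names (`cosine`, `recall_at_k`, `r_sum`) and the toy vectors are hypothetical. It assumes that each image has exactly one matching caption at the same index (the benchmark datasets pair each image with several captions), and that R@sum is the sum of R@1, R@5, and R@10 over both retrieval directions, as is common in this literature.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recall_at_k(sim, k):
    """Percentage of queries whose match (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


def r_sum(image_embs, text_embs, ks=(1, 5, 10)):
    """Sum of R@K over image-to-text and text-to-image retrieval."""
    # Independent embedding: the two modalities only meet in this
    # similarity matrix, so embeddings can be precomputed offline.
    i2t = [[cosine(img, txt) for txt in text_embs] for img in image_embs]
    t2i = [list(col) for col in zip(*i2t)]  # transpose for the other direction
    return sum(recall_at_k(i2t, k) + recall_at_k(t2i, k) for k in ks)


# Toy embeddings (made up for illustration): item i matches item i.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
print(round(r_sum(image_embs, text_embs), 2))
```

Because the similarity is a plain dot product of precomputed vectors, retrieval cost does not depend on any cross-modal interaction module, which is the efficiency argument the abstract makes for independent-embedding models.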

* IEEE Transactions on Multimedia (Early Access), 29 May 2023
* Accepted in IEEE Transactions on Multimedia (TMM)
