Title: FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning

Oct 26, 2022

Suvir Mirchandani, Licheng Yu, Mengjiao Wang, Animesh Sinha, Wenwen Jiang, Tao Xiang, Ning Zhang

Abstract: Multimodal tasks in the fashion domain have significant potential for e-commerce, but involve challenging vision-and-language learning problems, e.g., retrieving a fashion item given a reference image plus text feedback from a user. Prior works on multimodal fashion tasks have either been limited by the data in individual benchmarks, or have leveraged generic vision-and-language pre-training but have not taken advantage of the characteristics of fashion data. Additionally, these works have mainly been restricted to multimodal understanding tasks. To address these gaps, we make two key contributions. First, we propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks. Second, we propose a flexible decoder-based model architecture capable of both fashion retrieval and captioning tasks. Together, our model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization.

* 14 pages, 4 figures. To appear at Conference on Empirical Methods in Natural Language Processing (EMNLP) 2022
