Aligning the visual and language spaces requires training deep neural networks from scratch on giant multimodal datasets; CLIP trains both an image and a text encoder, while LiT trains only the latter by taking advantage of a pretrained vision network. In this paper, we show that sparse relative representations are sufficient to align text and images without training any network. Our method relies on readily available single-domain encoders (trained with or without supervision) and a comparatively modest number of image-text pairs. ASIF redefines what constitutes a multimodal model by explicitly disentangling memory from processing: here the model is defined not only by the parameters of the two encoders but also by the embeddings of every entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate transfer abilities typical of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions about their data efficiency and the role of retrieval in machine learning.
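To make the retrieval-based mechanism concrete, the sketch below illustrates one way to build sparse relative representations with NumPy and use them for zero-shot classification. The function names, the default values of k and p, and the assumption that unit-normalized unimodal embeddings are already available are illustrative choices, not a verbatim transcription of the method's implementation.

```python
# Minimal sketch of ASIF-style alignment, assuming unit-normalized unimodal
# embeddings are already computed. Names and defaults (k, p) are illustrative.
import numpy as np

def relative_rep(z, anchors, k=800, p=8):
    """Sparse relative representation of embedding z w.r.t. anchor embeddings.

    z:       (d,)   unit-normalized embedding of the query (image or text)
    anchors: (n, d) unit-normalized embeddings of the anchor set (same modality)
    Keeps the k largest cosine similarities, raises them to the power p,
    and re-normalizes; the result lives in the shared n-dimensional space.
    """
    sims = anchors @ z                        # cosine similarity to each anchor
    k = min(k, sims.size)
    idx = np.argpartition(sims, -k)[-k:]      # indices of the top-k anchors
    sparse = np.zeros_like(sims)
    sparse[idx] = np.clip(sims[idx], 0, None) ** p   # sparsify and sharpen
    return sparse / (np.linalg.norm(sparse) + 1e-8)

def zero_shot_classify(image_emb, prompt_embs, anchor_img_embs, anchor_txt_embs):
    """Score class prompts for one image by comparing relative representations."""
    ri = relative_rep(image_emb, anchor_img_embs)                            # image side
    rt = np.stack([relative_rep(t, anchor_txt_embs) for t in prompt_embs])   # text side
    return rt @ ri   # one score per class prompt; argmax gives the prediction
```

Because the anchor embeddings fully determine the shared space, the "model" can be updated or audited by editing the multimodal dataset alone, with no retraining of either encoder.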