Title: CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations

Dec 14, 2021

Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, Vicente Ordonez

Abstract: We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information-efficient lower bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly less data and smaller batch sizes while obtaining better performance than CLIP. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +15.4% absolute gain in mAP on Pascal VOC classification and a +22.1% gain in top-1 accuracy on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, by performing explicit image-text alignment during representation learning, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks.
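
The abstract does not name the lower bound it uses, so the following is only a minimal sketch of what a "one negative per positive" mutual-information objective could look like. It assumes a Jensen-Shannon-style estimator, a plain dot product as the image-text score, and in-batch shifting to pick the single mismatched caption. The function names (`softplus`, `jsd_mi_lower_bound`, `contrastive_loss`) are hypothetical and are not taken from the paper or its code.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(u, v):
    # Assumed image-text score: a plain dot product of the two feature vectors.
    return sum(a * b for a, b in zip(u, v))


def jsd_mi_lower_bound(image_feats, text_feats):
    """Jensen-Shannon-style estimate (assumed form, not confirmed by the abstract).

    Each image i is scored against its own caption (the positive pair) and
    against exactly one mismatched caption from the same batch (the negative).
    """
    n = len(image_feats)
    assert n == len(text_feats) and n >= 2
    pos_term = 0.0
    neg_term = 0.0
    for i in range(n):
        j = (i + 1) % n  # the single negative: the next caption in the batch
        pos_term += -softplus(-dot(image_feats[i], text_feats[i]))
        neg_term += softplus(dot(image_feats[i], text_feats[j]))
    return (pos_term - neg_term) / n


def contrastive_loss(image_feats, text_feats):
    # Maximizing the bound is the same as minimizing its negative.
    return -jsd_mi_lower_bound(image_feats, text_feats)


# Toy batch of two image-caption pairs with 2-d features.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]
aligned = jsd_mi_lower_bound(images, captions)
shuffled = jsd_mi_lower_bound(images, captions[::-1])
```

In this toy batch the estimate is higher for the aligned pairs (`aligned`) than for the swapped captions (`shuffled`). The cost per batch is two scores per image, whereas a batch-wide softmax objective scores every image against every caption in the batch.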
