Romain Bielawski

Does language help generalization in vision models?

May 15, 2021

Benjamin Devillers, Bhavin Choksi, Romain Bielawski, Rufin VanRullen

Abstract: Vision models trained on multimodal datasets can benefit from the wide availability of large image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot and transfer learning settings. This could imply that linguistic or "semantic grounding" confers additional generalization abilities to the visual feature space. Here, we systematically evaluate various multimodal architectures and vision-only models in terms of unsupervised clustering, few-shot learning, transfer learning and adversarial robustness. In each setting, multimodal training produced no additional generalization capability compared to standard supervised visual training. We conclude that work is still required for semantic grounding to help improve vision models.

* Paper accepted for presentation at the ViGIL 2021 workshop @NAACL. This version: added models to the comparison (ICMLM, TSM); added tests of adversarial robustness; identified and corrected a mistake in the normalization of image features; updated results and conclusions accordingly.
