Title: Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations

Jun 14, 2023

Gregor Geigle, Radu Timofte, Goran Glavaš

Abstract: Vision-and-language (VL) models with separate encoders for each modality (e.g., CLIP) have become the go-to models for zero-shot image classification and image-text retrieval. The bulk of the evaluation of these models is, however, performed with English text only: the costly creation of language-specific image-caption datasets has limited multilingual VL benchmarks to a handful of high-resource languages. In this work, we introduce Babel-ImageNet, a massively multilingual benchmark that offers (partial) translations of 1000 ImageNet labels to 92 languages, built without resorting to machine translation (MT) or requiring manual annotation. We instead automatically obtain reliable translations of ImageNet concepts by linking them, via shared WordNet synsets, to BabelNet, a massively multilingual lexico-semantic network. We evaluate 8 different publicly available multilingual CLIP models on zero-shot image classification (ZS-IC) for each of the 92 Babel-ImageNet languages, demonstrating a significant gap between English ImageNet performance and that of high-resource languages (e.g., German or Chinese), and an even bigger gap for low-resource languages (e.g., Sinhala or Lao). Crucially, we show that the models' ZS-IC performance on Babel-ImageNet highly correlates with their performance in image-text retrieval, validating that Babel-ImageNet is suitable for estimating the quality of the multilingual VL representation spaces for the vast majority of languages that lack gold image-text data. Finally, we show that the performance of multilingual CLIP for low-resource languages can be drastically improved via cheap, parameter-efficient language-specific training. We make our code and data publicly available: https://github.com/gregor-ge/Babel-ImageNet
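To make the ZS-IC setup concrete, here is a minimal sketch of zero-shot classification with per-language label embeddings. It is not the authors' code: the function names, the toy three-dimensional embeddings, and the placeholder language code `xx` are all hypothetical, and it assumes that a language with only partial label coverage is scored only on images whose gold class has a label in that language. In a real run the embeddings would come from a multilingual CLIP model's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_accuracy(image_embs, gold, label_embs):
    """Top-1 accuracy of nearest-label classification.

    label_embs maps class id -> text embedding of that class's label in one
    language. Classes without a label in this language are absent from the
    dict, and images of those classes are skipped (an assumption of this
    sketch about how partial coverage is handled).
    """
    correct = 0
    total = 0
    for emb, gold_class in zip(image_embs, gold):
        if gold_class not in label_embs:
            continue  # no label for this class in the current language
        pred = max(label_embs, key=lambda c: cosine(emb, label_embs[c]))
        correct += int(pred == gold_class)
        total += 1
    return correct / total if total else float("nan")


# Toy stand-ins for encoder outputs (hypothetical values, not real data).
image_embs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
    [0.2, 0.8, 0.1],
]
gold = [0, 1, 2, 1]

labels_en = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}
# "xx" is a placeholder language that only covers classes 0 and 1.
labels_xx = {0: [1.0, 0.0, 0.0], 1: [0.7, 0.7, 0.0]}

acc_en = zero_shot_accuracy(image_embs, gold, labels_en)
acc_xx = zero_shot_accuracy(image_embs, gold, labels_xx)
```

Running the same function once per language, each with its own label dictionary, gives one accuracy per language that can then be compared against the English score.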