Title: ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources

Jan 18, 2022

Quan Cui, Boyan Zhou, Yu Guo, Weidong Yin, Hao Wu, Osamu Yoshie

Abstract: Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing and further exploring them. To this end, we explore a stack of simple but effective heuristics and provide comprehensive training guidance, which allows us to conduct dual-encoder multi-modal representation alignment with limited resources. We provide a reproducible strong baseline with competitive results, namely ZeroVL, using only 14M samples from publicly accessible academic datasets and 8 V100 GPUs. Additionally, we collect 100M web data samples for pre-training and achieve results comparable or superior to those of state-of-the-art methods, further demonstrating the effectiveness of our method on large-scale data. We hope that this work will provide useful data points and experience for future research in multi-modal pre-training. Our code is available at https://github.com/zerovl/ZeroVL.
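For readers unfamiliar with the contrastive objective that dual-encoder models such as CLIP are trained with, the following is a minimal pure-Python sketch of a symmetric image-text contrastive loss. It is an illustration only, not the ZeroVL implementation: the function names, the toy embeddings, and the temperature value of 0.07 are assumptions made for this example, and the abstract does not specify the exact loss or hyperparameters used.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy losses.

    image_embs[i] and text_embs[i] are assumed to come from the same pair;
    every other combination in the batch is treated as a negative.
    The temperature default is an illustrative assumption.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: matched pairs should give a much lower loss than mismatched pairs.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                           [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the loss is small when each image embedding is closest to its own caption embedding and large when pairs are mismatched, which is the alignment behavior the abstract refers to.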

* Code is released
