Wei Ye

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

Dec 14, 2023

Chaoya Jiang, Wei Ye, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Shikun Zhang

Abstract: Self-supervised Multi-modal Contrastive Learning (SMCL) has markedly advanced modern Vision-Language Pre-training (VLP) models by aligning the visual and linguistic modalities. Because web-harvested text-image pairs are noisy, however, scaling up the training data volume in SMCL runs into considerable obstacles in computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix achieves comparable performance on downstream tasks, even with less training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader adoption of VLP models in practical scenarios.

* Accepted at AAAI 2024
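
The abstract does not spell out how TiMix chooses what to mix or how it weights the mixed targets, so the following is only a generic illustration of the broader idea it names: combining mix-based augmentation with a cross-modal contrastive loss. It assumes plain mixup-style pixel blending and soft targets proportional to the mixing ratio, and it is not the TiMix algorithm. The function names, the flat-list "images", and the use of raw pixels as stand-in embeddings are all hypothetical simplifications.

```python
import math
import random


def mix_images(img_a, img_b, lam):
    """Mixup-style blend of two equally sized flat images (assumed mixing rule)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(img_a, img_b)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def normalize(v):
    # Fall back to 1.0 so an all-zero vector does not divide by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def soft_contrastive_loss(image_emb, text_embs, targets, temperature=0.07):
    """Cross-entropy between soft targets and a softmax over image-text similarities.

    With one-hot targets this reduces to the usual image-to-text InfoNCE term.
    """
    image_emb = normalize(image_emb)
    logits = [dot(image_emb, normalize(t)) / temperature for t in text_embs]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(w * (l - log_z) for w, l in zip(targets, logits))


# Toy usage: raw pixels stand in for encoder outputs (hypothetical shortcut).
random.seed(0)
img_a = [random.random() for _ in range(8)]
img_b = [random.random() for _ in range(8)]
text_embs = [img_a, img_b]  # pretend each caption embeds exactly like its image
lam = 0.7                   # assumed mixing ratio
mixed = mix_images(img_a, img_b, lam)
loss = soft_contrastive_loss(mixed, text_embs, [lam, 1.0 - lam])
print(f"soft-target contrastive loss on mixed image: {loss:.4f}")
```

In this sketch the mixed image is pulled toward both captions in proportion to the mixing ratio, which is one simple way a mixed sample can soften the one-hot contrastive target.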
