Title: Devil in the Number: Towards Robust Multi-modality Data Filter

Sep 24, 2023

Yichen Xu, Zihan Xu, Wenhao Chai, Zhonghan Zhao, Enxin Song, Gaoang Wang

Abstract: Filtering web-scale multi-modality datasets appropriately is crucial for boosting performance and reducing training costs. For instance, LAION employs a CLIP score filter to select data whose CLIP scores exceed a certain threshold. T-MARS, on the other hand, achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score. By analyzing the dataset, we observe that a significant proportion of the textual content is redundant information, such as numbers. Our experiments on a subset of the data reveal that these redundant elements have a profound impact on CLIP scores. A logical approach is therefore to re-evaluate the CLIP scores after eliminating these influences. Experimentally, our text-based CLIP filter outperforms the top-ranked method on the "small scale" track of DataComp (a data filtering benchmark) on ImageNet distribution shifts, achieving a 3.6% performance improvement. The results also show that our proposed text-masked filter outperforms the original CLIP score filter when selecting the top 40% of the data. The impact of numbers on CLIP and the way they are handled provide valuable insights for improving the effectiveness of CLIP training, including language rewrite techniques.

* ICCV 2023 Workshop: TNGCV-DataComp
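
The number-masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the regular expression, the names `mask_numbers` and `select_top_fraction`, and the `score_fn` callable (standing in for an image-text CLIP similarity) are all assumptions made for illustration. The abstract does not say exactly how numbers are detected or removed. The sketch strips digit runs from each caption, scores the masked caption, and keeps the top-scoring fraction of samples (40% by default, matching the setting mentioned in the abstract).

```python
import re
from typing import Any, Callable, List, Sequence, Tuple

# Assumed pattern: a run of digits, optionally joined by common separators
# (e.g. "10.5", "2023-09-24", "12:30"). The paper's exact rule is not given.
_NUMBER_RE = re.compile(r"\d+(?:[.,:/-]\d+)*")


def mask_numbers(caption: str) -> str:
    """Remove numeric tokens from a caption and collapse leftover whitespace."""
    without_numbers = _NUMBER_RE.sub(" ", caption)
    return " ".join(without_numbers.split())


def select_top_fraction(
    samples: Sequence[Tuple[Any, str]],
    score_fn: Callable[[Any, str], float],
    fraction: float = 0.4,
) -> List[int]:
    """Return sorted indices of the samples kept by the filter.

    `samples` holds (image, caption) pairs. `score_fn` is a hypothetical
    stand-in for a CLIP image-text similarity; it is called with the
    number-masked caption rather than the original one.
    """
    scored = [
        (score_fn(image, mask_numbers(caption)), index)
        for index, (image, caption) in enumerate(samples)
    ]
    keep = max(1, int(len(scored) * fraction))
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return sorted(index for _, index in ranked[:keep])
```

In practice `score_fn` would wrap a real CLIP model that embeds the image and the masked caption and returns their cosine similarity. Here it is left as a parameter so the filtering logic can be read and tested on its own.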
