Title: Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math

Dec 28, 2023

Zengzhi Wang, Rui Xia, Pengfei Liu

Abstract: High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope our MathPile can help to enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile with the scripts used for processing, to facilitate future developments in this field.
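
To make two of the steps named in the abstract concrete, here is a minimal sketch of exact deduplication and of n-gram overlap checking against benchmark test items. It is an illustration under stated assumptions, not the MathPile processing scripts: the function names (`normalize`, `dedup_exact`, `word_ngrams`, `is_contaminated`), the whitespace normalization, and the 5-word window are all hypothetical choices made for this example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def dedup_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams of the normalized text."""
    words = normalize(text).split()
    # Texts shorter than n words yield an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(doc: str, benchmark_items: list[str], n: int = 5) -> bool:
    """Flag a document that shares at least one n-gram with any benchmark item.

    The window size n is an assumed value for this sketch.
    """
    doc_grams = word_ngrams(doc, n)
    return any(doc_grams & word_ngrams(item, n) for item in benchmark_items)
```

A corpus builder would typically run `dedup_exact` over the collected documents and then drop any document for which `is_contaminated` returns `True`. Near-duplicate detection (for example with MinHash) is a common extension that this sketch does not attempt.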

* 37 pages. Work in progress. https://github.com/GAIR-NLP/MathPile/
