Recently published work on rephrasing natural text data for LLM pre-training has shown promising results when the original dataset is combined with the synthetically rephrased data. We build upon this work by replicating existing results on C4 and extending them, with our optimized rephrasing pipeline, to the English, German, Italian, and Spanish OSCAR subsets of CulturaX. Our pipeline improves performance on standard evaluation benchmarks in both the mono- and multilingual setup. In addition, we provide a detailed study of our pipeline, investigating the choice of the base dataset and the LLM used for rephrasing, as well as the relationship between the size of the rephrasing model and the performance after pre-training. By exploring data of different perceived quality levels, we show that the gains decrease as the quality of the base data increases. Furthermore, we find the difference in performance between model families to be larger than that between model sizes, which highlights the necessity of detailed tests before choosing an LLM to rephrase large amounts of data. Moreover, we investigate the effect of pre-training with synthetic data on supervised fine-tuning. Here, we observe improvements that remain inconclusive and depend strongly on the benchmark used; these results again highlight the need for better benchmarking setups. In summary, we show that rephrasing multilingual and low-quality data is a very promising direction for extending LLM pre-training data.