Title: The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

Nov 16, 2023

Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, Chloé Clavel

Abstract: This study investigates the consequences of training large language models (LLMs) on synthetic data generated by their predecessors, an increasingly prevalent practice aimed at addressing the limited supply of human-generated training data. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we developed a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive fine-tuning experiments across various natural language generation tasks. Our findings reveal a marked decrease in the diversity of the models' outputs through successive iterations. This trend underscores the potential risks of training LLMs on predecessor-generated text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of LLMs.

* Work in progress
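
To make the idea of a lexical diversity metric concrete, here is a minimal sketch of distinct-n, a commonly used measure: the fraction of unique n-grams among all n-grams in a set of outputs. This is an illustration only and is not claimed to be one of the metrics proposed in the paper. The function name `distinct_n`, the whitespace tokenization, and the toy "generation" samples are all assumptions made for the example.

```python
def distinct_n(texts, n=2):
    """Return the fraction of unique n-grams among all n-grams in `texts`.

    Illustrative sketch only: lowercased whitespace tokenization is an
    assumption, and this is not the paper's own metric.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of size n over the token list.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    # Guard against empty input or texts shorter than n tokens.
    return len(unique) / total if total else 0.0


# Hypothetical outputs from two successive model generations.
generation_0 = ["the cat sat on the mat", "a dog ran across the yard"]
generation_1 = ["the cat sat on the mat", "the cat sat on the rug"]

print(distinct_n(generation_0, n=2))
print(distinct_n(generation_1, n=2))
```

A lower score for the later generation would indicate more repeated phrasing, which is the kind of trend the abstract describes. Syntactic and semantic diversity would need different tooling, such as parsers or sentence embeddings, and are not covered by this sketch.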
