Abstract: With a combination of quantitative experiments, human judgments, and qualitative analyses, we evaluate the quantity and quality of African American Language (AAL) representation in 12 predominantly English, open-source pretraining corpora. We specifically focus on the sources, variation, and naturalness of included AAL texts representing the AAL-speaking community. We find that AAL is underrepresented in all evaluated pretraining corpora compared to US demographics, constituting as little as 0.007% of documents. We also find that more than 25% of AAL texts in C4 may be inappropriate for LLMs to generate and may reinforce harmful stereotypes. Finally, we find that most automated language, toxicity, and quality filters are more likely to retain White Mainstream English (WME) texts than AAL texts in pretraining corpora.
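The two measurements this abstract reports, dialect representation and differential filter retention, can be illustrated with a minimal Python sketch. The toy documents, dialect labels, and `quality_filter` heuristic below are hypothetical stand-ins for illustration only, not the paper's actual classifiers or filtering pipeline.

```python
# Hypothetical sketch: measuring dialect representation and differential
# filter retention in a corpus. Labels and the filter are illustrative
# assumptions, not the paper's actual pipeline.
from collections import Counter

# Toy corpus of (text, dialect_label) pairs; labels would normally come
# from a dialect identification model or human annotation.
documents = [
    ("He been working on that project for months.", "AAL"),
    ("She has been working on that project for months.", "WME"),
    ("They finna head out soon.", "AAL"),
    ("They are about to head out soon.", "WME"),
]

def quality_filter(text: str) -> bool:
    """Stand-in for an automated quality filter: keep documents whose
    tokens all appear in a fixed 'standard' vocabulary. Real filters
    (classifier- or perplexity-based) are far more complex."""
    vocab = {"he", "she", "has", "been", "working", "on", "that", "project",
             "for", "months", "they", "are", "about", "to", "head", "out",
             "soon"}
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return all(t in vocab for t in tokens)

# 1) Representation: share of documents per dialect.
counts = Counter(label for _, label in documents)
total = sum(counts.values())
for dialect, n in counts.items():
    print(f"{dialect}: {n / total:.1%} of documents")

# 2) Differential retention: fraction of each dialect's documents
#    that the filter would keep.
for dialect in counts:
    kept = [quality_filter(t) for t, d in documents if d == dialect]
    print(f"{dialect}: filter retains {sum(kept) / len(kept):.1%}")
```

A gap between the two retention rates is the kind of disparity the abstract describes, where a filter disproportionately removes AAL texts relative to WME texts.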
Abstract: This work finds limited evidence supporting the hypothesis that training sequence-to-sequence transformer language models on multiple tasks can improve performance on some metrics. In particular, the multi-task generalist t5-small outperforms the specialist t5-small with an $F_1$ of $0.771$, up from $0.692$, which may point to underlying cross-task knowledge generalization. This further suggests that, even with the same network, "re-using" the same data in a different way may lead to higher performance on some metrics. However, the inverse task alone is likely only an optimization strategy, since it does not yield a significant general improvement at the model sizes explored in this work. Also, adding $\approx 4500$ LLM-annotated records (interleaved with the $12800$ WebNLG training records) does not substantially change automatic-metric performance compared to the same t5-small model without the synthetic data. This may be due to a learning-capacity bottleneck imposed by model size, and the decreases observed may be due to distributional differences between the corpora. Future research using larger models or human evaluation is required to more fully explain the mechanisms contributing to performance on these tasks.
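Since the abstract centers on a multi-task setup in which one t5-small is trained on both the forward (triples to text) and inverse (text to triples) directions, a minimal sketch of how such a setup might look with Hugging Face transformers follows. The task prefixes, the toy WebNLG-style record, and the training loop are assumptions for illustration only, not the exact configuration used in this work.

```python
# Minimal sketch: one t5-small fine-tuned on both the forward (triples -> text)
# and inverse (text -> triples) tasks, distinguished by task prefixes.
# Prefix strings, the toy record, and the loop are illustrative assumptions.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One linearized WebNLG-style record; real training would iterate over the
# ~12800 WebNLG records (plus any LLM-annotated additions).
triples = "Alan_Bean | occupation | Test_pilot"
text = "Alan Bean worked as a test pilot."

# Forward and inverse views of the same record share one set of parameters.
examples = [
    ("generate text: " + triples, text),     # data-to-text
    ("generate triples: " + text, triples),  # inverse task
]

for source, target in examples:
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()  # in practice, step an optimizer over batches
```

The design point is that both directions update the same weights, which is what would allow "re-using" the same data in a different way to transfer knowledge across tasks.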