Tania Chakraborty

Splits! A Flexible Dataset for Evaluating a Model's Demographic Social Inference

Apr 06, 2025

Eylon Caplan, Tania Chakraborty, Dan Goldwasser

Abstract: Understanding how people of various demographics think, feel, and express themselves (collectively called group expression) is essential for social science and underlies the assessment of bias in Large Language Models (LLMs). While LLMs can effectively summarize group expression when provided with empirical examples, coming up with generalizable theories of how a group's expression manifests in real-world text is challenging. In this paper, we define a new task called Group Theorization, in which a system must write theories that differentiate expression across demographic groups. We make available a large dataset on this task, Splits!, constructed by splitting Reddit posts by neutral topics (e.g. sports, cooking, and movies) and by demographics (e.g. occupation, religion, and race). Finally, we suggest a simple evaluation framework for assessing how effectively a method can generate 'better' theories about group expression, backed by human validation. We publicly release the raw corpora and evaluation scripts for Splits! to help researchers assess how methods infer, and potentially misrepresent, group differences in expression. We make Splits! and our evaluation module available at https://github.com/eyloncaplan/splits.

* Under review for COLM 2025

Mining Large-Scale Low-Resource Pronunciation Data From Wikipedia

Jan 27, 2021

Tania Chakraborty, Manasa Prasad, Theresa Breiner, Sandy Ritchie, Daan van Esch

Abstract: Pronunciation modeling is a key task for building speech technology in new languages, and while solid grapheme-to-phoneme (G2P) mapping systems exist, language coverage can stand to be improved. The information needed to build G2P models for many more languages can easily be found on Wikipedia, but unfortunately, it is stored in disparate formats. We report on a system we built to mine a pronunciation data set in 819 languages from loosely structured tables within Wikipedia. The data includes phoneme inventories, and for 63 low-resource languages, also includes the grapheme-to-phoneme (G2P) mapping. 54 of these languages do not have easily findable G2P mappings online otherwise. We turned the information from Wikipedia into a structured, machine-readable TSV format, and make the resulting data set publicly available so it can be improved further and used in a variety of applications involving low-resource languages.

* 7 pages, 9 figures
