Abstract: Measures of textual similarity and divergence are increasingly used to study cultural change. But which measures align, in practice, with social evidence about change? We apply three different representations of text (topic models, document embeddings, and word-level perplexity) to three different corpora (literary studies, economics, and fiction). In every case, works by highly cited authors and younger authors are textually ahead of the curve. We do not find clear evidence that one representation of text is to be preferred over the others. But alignment with social evidence is strongest when texts are represented through the top quartile of passages, suggesting that a text's impact may depend more on its most forward-looking moments than on sustaining a high level of innovation throughout.
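To make the comparison concrete, the following is a minimal sketch of one way to score whether a work is "ahead of the curve," assuming each work has already been reduced to a vector (a topic mixture or a document embedding; a perplexity-based variant would substitute language-model scores for cosine similarity). The function names, the 20-year window, and the quartile helper are illustrative assumptions, not the paper's exact procedure.

```python
# A minimal sketch, assuming works arrive as date-stamped vectors.
# Window size, function names, and the quartile helper are assumptions.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def precocity(doc_vec, doc_year, corpus_vecs, corpus_years, window=20):
    """Similarity to the future minus similarity to the past.

    A positive score means the work resembles what comes after it more
    than what came before it, i.e. it is textually ahead of the curve.
    """
    past = [cosine(doc_vec, v) for v, y in zip(corpus_vecs, corpus_years)
            if doc_year - window <= y < doc_year]
    future = [cosine(doc_vec, v) for v, y in zip(corpus_vecs, corpus_years)
              if doc_year < y <= doc_year + window]
    if not past or not future:
        return float("nan")  # not enough temporal context on one side
    return float(np.mean(future) - np.mean(past))

def top_quartile_score(passage_scores):
    """Represent a text by its most forward-looking quarter of passages."""
    s = np.sort(np.asarray(passage_scores))
    return float(s[int(0.75 * len(s)):].mean())
```

Under this scheme, scoring each passage separately and aggregating only the top quartile implements the abstract's closing observation: a work can register as influential on the strength of its most forward-looking moments even if its average passage is unremarkable.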
Abstract: Measuring similarity is a basic task in information retrieval, and now often a building block for more complex arguments about cultural change. But do measures of textual similarity and distance really correspond to evidence about cultural proximity and differentiation? To explore that question empirically, this paper compares textual and social measures of the similarities between genres of English-language fiction. Existing measures of textual similarity (cosine similarity on tf-idf vectors or topic vectors) are also compared to new strategies that use supervised learning to anchor textual measurement in a social context.
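For reference, the baseline measure named in this abstract (cosine similarity on tf-idf vectors) can be computed in a few lines with scikit-learn. This is only a toy sketch: the two one-sentence "genres" are placeholders, where a real comparison would aggregate term counts across many volumes per genre.

```python
# A toy sketch of the baseline: cosine similarity between tf-idf vectors.
# Each "genre" here is a placeholder string; in practice each would be the
# concatenated text of many volumes assigned to that genre.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

detective = "the detective examined the body in the locked room"
gothic = "the ghost drifted through the ruined abbey at midnight"

vectors = TfidfVectorizer().fit_transform([detective, gothic])
print(cosine_similarity(vectors[0], vectors[1])[0, 0])
```

One plausible reading of "supervised learning to anchor textual measurement in a social context" is to train classifiers on socially defined genre labels and treat cross-genre confusability as a similarity measure; the details are corpus-specific and not shown here.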
Abstract: To mine large digital libraries in humanistically meaningful ways, scholars need to divide them by genre. This is a task that classification algorithms are well suited to assist with, but they need adjustment to address the specific challenges of this domain. Digital libraries pose two problems of scale not usually found in the article datasets used to test these algorithms. 1) Because libraries span several centuries, the genres being identified may change gradually across the time axis. 2) Because volumes are much longer than articles, they tend to be internally heterogeneous, and the classification task needs to begin with segmentation. We describe a multi-layered solution that trains hidden Markov models to segment volumes and uses ensembles of overlapping classifiers to address historical change. We test this approach on a collection of 469,200 volumes drawn from the HathiTrust Digital Library. To demonstrate the humanistic value of these methods, we extract 32,209 volumes of fiction from the digital library and trace the changing proportions of first- and third-person narration in the corpus. We note that narrative points of view seem to have strong associations with particular themes and genres.
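The segmentation layer can be pictured as follows: a page-level classifier emits genre probabilities, and a hidden-Markov pass smooths them into contiguous spans. The sketch below is only an illustration of that idea, with a made-up three-genre label set and hand-set transition weights rather than the trained models the paper describes.

```python
# A sketch of HMM smoothing over per-page classifier output. Per-page genre
# probabilities serve as emission scores, and a Viterbi pass over a simple
# transition matrix favors contiguous runs of the same genre. The label set
# and the self-transition weight are illustrative assumptions.
import numpy as np

GENRES = ["fiction", "poetry", "nonfiction"]  # hypothetical label set

def viterbi_smooth(page_probs, stay=0.9):
    """Return the most probable genre sequence for one volume.

    page_probs: (n_pages, n_genres) array of classifier probabilities.
    stay: assumed self-transition probability (pages tend to keep their genre).
    """
    n_pages, n_genres = page_probs.shape
    switch = (1.0 - stay) / (n_genres - 1)
    log_trans = np.log(np.full((n_genres, n_genres), switch)
                       + np.eye(n_genres) * (stay - switch))
    log_emit = np.log(np.clip(page_probs, 1e-12, None))

    score = np.zeros((n_pages, n_genres))
    back = np.zeros((n_pages, n_genres), dtype=int)
    score[0] = log_emit[0] - np.log(n_genres)  # uniform start
    for t in range(1, n_pages):
        cand = score[t - 1][:, None] + log_trans  # rows: from, cols: to
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]

    path = [int(score[-1].argmax())]
    for t in range(n_pages - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return [GENRES[i] for i in path]

# Example: a weak, probably spurious fiction signal on one page is smoothed
# over by its nonfiction neighbors.
probs = np.array([[0.1, 0.1, 0.8],
                  [0.6, 0.1, 0.3],
                  [0.1, 0.1, 0.8],
                  [0.2, 0.1, 0.7]])
print(viterbi_smooth(probs))  # all four pages come out 'nonfiction'
```

The design point this illustrates is why segmentation precedes classification: a volume-level vote would blur mixed volumes, while sequence smoothing recovers internally coherent spans that can then be classified (and, in the paper's fuller design, handed to period-specific ensembles).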