Abstract: Heaps' law is an empirical relation in text analysis that predicts vocabulary growth as a function of corpus size. While this law has been validated in diverse human-authored text corpora, its applicability to large language model-generated text remains unexplored. This study addresses this gap, focusing on the emulation of corpora using the suite of GPT-Neo large language models. To conduct our investigation, we emulated corpora of PubMed abstracts using three different parameter sizes of the GPT-Neo model. Our emulation strategy involved using the initial five words of each PubMed abstract as a prompt and instructing the model to expand the content up to the original abstract's length. Our findings indicate that the generated corpora adhere to Heaps' law. Interestingly, as the GPT-Neo model size grows, its generated vocabulary increasingly adheres to Heaps' law as observed in human-authored text. To further improve the richness and authenticity of GPT-Neo outputs, future iterations could emphasize enhancing model size or refining the model architecture to curtail vocabulary repetition.
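To make the measurement concrete, the sketch below shows one common way to check Heaps' law on an arbitrary collection of documents: track vocabulary size V(n) against tokens seen n, then fit V = K·n^β by linear regression in log-log space. This is an illustrative assumption, not the study's exact pipeline; the tokenizer, the placeholder documents, and the fitting choice are ours, and the GPT-Neo generation step is omitted.

```python
# Minimal sketch (not the study's exact pipeline) of measuring Heaps' law:
# track distinct-vocabulary size V(n) versus tokens seen n, then fit
# V = K * n^beta via linear regression in log-log space.
import re
import numpy as np

def heaps_curve(documents):
    """Return arrays (n, V): running token count and distinct vocabulary size."""
    seen = set()
    n_tokens, vocab_sizes = [], []
    count = 0
    for doc in documents:
        for token in re.findall(r"[a-z]+", doc.lower()):  # naive tokenizer (assumption)
            count += 1
            seen.add(token)
            n_tokens.append(count)
            vocab_sizes.append(len(seen))
    return np.array(n_tokens), np.array(vocab_sizes)

def fit_heaps(n, V):
    """Fit log V = log K + beta * log n; return (K, beta)."""
    beta, log_K = np.polyfit(np.log(n), np.log(V), 1)
    return np.exp(log_K), beta

# Placeholder documents for illustration; in the study these would be
# PubMed abstracts or their GPT-Neo emulations.
docs = ["the cell cycle is regulated by cyclins",
        "cyclins bind kinases during the cell cycle"]
n, V = heaps_curve(docs)
K, beta = fit_heaps(n, V)
print(f"Heaps fit: V ~ {K:.2f} * n^{beta:.2f}")
```

Comparing the fitted exponent β across human-authored and model-generated corpora is the kind of comparison the study reports.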
Abstract: Domain-specific terminology extraction is an important task in text analysis. A term in a corpus is said to be "bursty" when its occurrences are concentrated in a few out of many documents. Being content-rich, bursty terms are well suited to subject matter characterization and serve as natural candidates for identification as technical terminology. Multiple measures of term burstiness have been proposed in the literature. However, the statistical significance testing paradigm has remained underexplored in text analysis, including in relation to term burstiness. To test these waters, we propose as our main contribution a multinomial language model-based exact test of statistical significance for term burstiness. Because the exact test is computationally prohibitive, we advance a heuristic formula designed to serve as a proxy for test P-values. As a complementary theoretical contribution, we derive a previously unreported relationship connecting the inverse document frequency and inverse collection frequency (two foundational quantities in text analysis) under the multinomial language model. The relation is used in the evaluation of our heuristic. Using the GENIA Term corpus benchmark, we compare our approach against established methods, demonstrating our heuristic's potential in identifying domain-specific technical terms. We hope this demonstration of statistical significance testing in text analysis serves as a springboard for future research.
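For readers unfamiliar with the two foundational quantities mentioned above, the sketch below computes inverse document frequency (IDF) and inverse collection frequency (ICF) under their standard textbook definitions. It does not reproduce the paper's derived relationship, exact test, or heuristic; the definitions, tokenizer, and toy corpus are our own illustrative assumptions.

```python
# Minimal sketch of IDF and ICF under standard definitions (not the paper's
# derived relation or its exact test):
#   idf(t) = log(N / df(t))   with N = number of documents, df = doc frequency
#   icf(t) = log(T / cf(t))   with T = total tokens, cf = collection frequency
import math
from collections import Counter

def idf_icf(documents):
    """Return {term: (idf, icf)} for a whitespace-tokenized corpus."""
    tokenized = [doc.lower().split() for doc in documents]
    N = len(tokenized)
    T = sum(len(toks) for toks in tokenized)
    cf = Counter(tok for toks in tokenized for tok in toks)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    return {t: (math.log(N / df[t]), math.log(T / cf[t])) for t in cf}

# Intuitively, a bursty term packs a high collection frequency into few
# documents, i.e. its cf is large relative to its df.
docs = ["protein kinase kinase signalling",
        "cell growth and division",
        "gene expression in the cell"]
for term, (idf, icf) in sorted(idf_icf(docs).items()):
    print(f"{term:12s} idf={idf:.2f} icf={icf:.2f}")
```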
Abstract: In this paper, we propose a recommender system that takes a user-selected story as input and returns a ranked list of similar stories on the basis of shared literary themes. The user of our system first selects a story of interest from a list of background stories, and then sets, as desired, a handful of knowledge-based filtering options, including the similarity measure used to quantify the similarity between story pairs. As a proof of concept, we experimentally validate our system on a dataset comprising 452 manually themed Star Trek television franchise episodes by using a benchmark of curated sets of related stories. We show that our manual approach to theme assignment significantly outperforms an automated approach to theme identification based on the application of topic models to episode transcripts. Additionally, we compare approaches to constructing story-to-story similarity functions based on theme sets and on a hierarchical-semantic organization of themes. The recommender system is implemented in the R package stoRy. A related R Shiny web application is available publicly along with the Star Trek dataset, including the theme ontology, episode annotations, storyset benchmarks, transcripts, and evaluation setup.
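To illustrate the set-based flavor of story similarity referred to above, the sketch below ranks candidate stories by the Jaccard index between their theme sets. This is only one plausible set-based similarity, not necessarily the measure implemented in the stoRy package, and the hierarchy-aware alternatives the paper compares are not reproduced; the episode annotations shown are hypothetical.

```python
# Minimal sketch of one set-based story similarity: the Jaccard index between
# two stories' theme sets, used to rank candidates against a query story.
# This is an illustrative assumption, not the stoRy package's implementation.
def jaccard_similarity(themes_a, themes_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two theme sets."""
    a, b = set(themes_a), set(themes_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def rank_similar_stories(query_themes, story_themes):
    """Rank candidate stories by descending similarity to the query story."""
    scores = {sid: jaccard_similarity(query_themes, themes)
              for sid, themes in story_themes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical theme annotations for illustration only.
stories = {
    "TOS: The City on the Edge of Forever": {"time travel", "sacrifice", "love"},
    "TNG: The Inner Light": {"memory", "mortality", "love"},
    "VOY: Year of Hell": {"time travel", "sacrifice", "obsession"},
}
query = {"time travel", "sacrifice", "regret"}
for story_id, score in rank_similar_stories(query, stories):
    print(f"{score:.2f}  {story_id}")
```

A hierarchy-aware similarity would additionally credit partial matches between related themes in the theme ontology, rather than treating themes as unrelated atomic labels.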