Abstract:Lexical diversity measures the vocabulary variation in texts. While its utility is evident for analyses in language change and applied linguistics, it is not yet clear how to operationalize this concept in a unique way. We here investigate entropy and text-token ratio, two widely employed metrics for lexical diversities, in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in registers and genres, thus constituting a diverse testbed for a quantitative approach to lexical diversity. Strikingly, we find a functional relation between entropy and text-token ratio that holds across the corpora under consideration. Further, in the limit of large vocabularies we find an analytical expression that sheds light on the origin of this relation and its connection with both Zipf and Heaps laws. Our results then contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
Abstract:Flamenco, recognized by UNESCO as part of the Intangible Cultural Heritage of Humanity, is a profound expression of cultural identity rooted in Andalusia, Spain. However, there is a lack of quantitative studies that help identify characteristic patterns in this long-lived music tradition. In this work, we present a computational analysis of Flamenco lyrics, employing natural language processing and machine learning to categorize over 2000 lyrics into their respective Flamenco genres, termed as $\textit{palos}$. Using a Multinomial Naive Bayes classifier, we find that lexical variation across styles enables to accurately identify distinct $\textit{palos}$. More importantly, from an automatic method of word usage, we obtain the semantic fields that characterize each style. Further, applying a metric that quantifies the inter-genre distance we perform a network analysis that sheds light on the relationship between Flamenco styles. Remarkably, our results suggest historical connections and $\textit{palo}$ evolutions. Overall, our work illuminates the intricate relationships and cultural significance embedded within Flamenco lyrics, complementing previous qualitative discussions with quantitative analyses and sparking new discussions on the origin and development of traditional music genres.