Johann-Mattis List

Using Correspondence Patterns to Identify Irregular Words in Cognate sets Through Leave-One-Out Validation

Feb 02, 2026

Frederic Blum, Johann-Mattis List

Abstract:Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85% with the datasets based on real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of our new regularity measure and the irregular cognate identification method based on it, we conclude that they could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.

* Accepted for the L'Change workshop @ EACL 2026

Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data

Mar 14, 2025

Annika Tjuka, Robert Forkel, Christoph Rzymski, Johann-Mattis List

Abstract:Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with an enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.

Annotating and Inferring Compositional Structures in Numeral Systems Across Languages

Mar 04, 2025

Arne Rubehn, Christoph Rzymski, Luca Ciucci, Kellen Parker van Dam, Alžběta Kučerová, Katja Bocklage, David Snee, Abishek Stephen, Johann-Mattis List

Abstract:Numeral systems across the world's languages vary in fascinating ways, both regarding their synchronic structure and the diachronic processes that determined how they evolved in their current shape. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find allomorphy as the major reason for segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.

* Submitted to the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (SIGTYP)

From Isolates to Families: Using Neural Networks for Automated Language Affiliation

Feb 17, 2025

Frederic Blum, Steffen Herbold, Johann-Mattis List

Abstract:In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,000 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models solely based on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-ranging relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can help us to obtain first hints on the affiliation of so far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.

* Submitted to the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria

Everybody Likes to Sleep: A Computer-Assisted Comparison of Object Naming Data from 30 Languages

Jan 14, 2025

Alžběta Kučerová, Johann-Mattis List

Abstract:Object naming - the act of identifying an object with a word or a phrase - is a fundamental skill in interpersonal communication, relevant to many disciplines, such as psycholinguistics, cognitive linguistics, or language and vision research. Object naming datasets, which consist of concept lists with picture pairings, are used to gain insights into how humans access and select names for objects in their surroundings and to study the cognitive processes involved in converting visual stimuli into semantic concepts. Unfortunately, object naming datasets often lack transparency and have a highly idiosyncratic structure. Our study tries to make current object naming data transparent and comparable by using a multilingual, computer-assisted approach that links individual items of object naming lists to unified concepts. Our current sample links 17 object naming datasets that cover 30 languages from 10 different language families. We illustrate how the comparative dataset can be explored by searching for concepts that recur across the majority of datasets and comparing the conceptual spaces of covered object naming datasets with classical basic vocabulary lists from historical linguistics and linguistic typology. Our findings can serve as a basis for enhancing cross-linguistic object naming research and as a guideline for future studies dealing with object naming tasks.

* To appear in the Proceedings of the Global WordNet Conference 2025

Generating Feature Vectors from Phonetic Transcriptions in Cross-Linguistic Data Formats

May 07, 2024

Arne Rubehn, Jessica Nieder, Robert Forkel, Johann-Mattis List

Abstract:When comparing speech sounds across languages, scholars often make use of feature representations of individual sounds in order to determine fine-grained sound similarities. Although binary feature systems for large numbers of speech sounds have been proposed, large-scale computational applications often face the challenges that the proposed feature systems -- even if they list features for several thousand sounds -- only cover a smaller part of the numerous speech sounds reflected in actual cross-linguistic data. In order to address the problem of missing data for attested speech sounds, we propose a new approach that can create binary feature vectors dynamically for all sounds that can be represented in the standardized version of the International Phonetic Alphabet proposed by the Cross-Linguistic Transcription Systems (CLTS) reference catalog. Since CLTS is actively used in large data collections, covering more than 2,000 distinct language varieties, our procedure for the generation of binary feature vectors provides immediate access to a very large collection of multilingual wordlists. Testing our feature system in different ways on different datasets proves that the system is not only useful to provide a straightforward means to compare the similarity of speech sounds, but also illustrates its potential to be used in future cross-linguistic machine learning applications.

* To appear in the Proceedings of the 2024 Meeting of the Society for Computation in Linguistics (SCiL)

Are Sounds Sound for Phylogenetic Reconstruction?

Feb 05, 2024

Luise Häuser, Gerhard Jäger, Taraka Rama, Johann-Mattis List, Alexandros Stamatakis

Abstract:In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for phylogenetic inference of language family trees. However, to date, computational approaches have typically not taken this potential into account. Most computational studies still rely on lexical cognates as major data source for phylogenetic reconstruction in linguistics, although there do exist a few studies in which authors praise the benefits of comparing words at the level of sound sequences. Building on (a) ten diverse datasets from different language families, and (b) state-of-the-art methods for automated cognate and sound correspondence detection, we test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer, by approximately one third with respect to the generalized quartet distance on average, to the gold standard phylogenies than phylogenies reconstructed from sound correspondences.

* Paper accepted for SIGTYP (2024): Häuser, Luise; Jäger, Gerhard; List, Johann-Mattis; Rama, Taraka; and Stamatakis, Alexandros (2024): Are sounds sound for phylogenetic reconstruction? In: Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (SIGTYP 2024)

A Computational Model for the Assessment of Mutual Intelligibility Among Closely Related Languages

Feb 05, 2024

Jessica Nieder, Johann-Mattis List

Abstract:Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it. Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments. To study mutual intelligibility computationally, we propose a computer-assisted method using the Linear Discriminative Learner, a computational model developed to approximate the cognitive processes by which humans learn languages, which we expand with multilingual semantic vectors and multilingual sound classes. We test the model on cognate data from German, Dutch, and English, three closely related Germanic languages. We find that our model's comprehension accuracy depends on 1) the automatic trimming of inflections and 2) the language pair for which comprehension is tested. Our multilingual modelling approach does not only offer new methodological findings for automatic testing of mutual intelligibility across languages but also extends the use of Linear Discriminative Learning to multilingual settings.

* To appear in: Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (SIGTYP 2024)

Representing and Computing Uncertainty in Phonological Reconstruction

Oct 19, 2023

Johann-Mattis List, Nathan W. Hill, Robert Forkel, Frederic Blum

Abstract:Despite the inherently fuzzy nature of reconstructions in historical linguistics, most scholars do not represent their uncertainty when proposing proto-forms. With the increasing success of recently proposed approaches to automating certain aspects of the traditional comparative method, the formal representation of proto-forms has also improved. This formalization makes it possible to address both the representation and the computation of uncertainty. Building on recent advances in supervised phonological reconstruction, during which an algorithm learns how to reconstruct words in a given proto-language relying on previously annotated data, and inspired by improved methods for automated word prediction from cognate sets, we present a new framework that allows for the representation of uncertainty in linguistic reconstruction and also includes a workflow for the computation of fuzzy reconstructions from linguistic data.

* To appear in: Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists

Aug 09, 2023

Julius Steuer, Badr Abdullah, Johann-Mattis List, Dietrich Klakow

Abstract:We present a cross-linguistic study that aims to quantify vowel harmony using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have relied heavily on inflected word-forms in the analysis of vowel harmony. We instead train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists with a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate that the neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.

* Presented at SIGTYP at EACL 2023
