Robert Forkel

Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data

Mar 14, 2025

Annika Tjuka, Robert Forkel, Christoph Rzymski, Johann-Mattis List

Abstract: Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection, and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.
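
As a rough illustration of what counting colexifications involves, the following minimal sketch groups the concepts expressed by each word form within a language and counts, for every concept pair, the number of languages in which a single form covers both. It is not the database's actual workflow: the function name `colexifications`, the triple layout (language, concept, form), and the toy entries are all assumptions made up for this example, and the forms are invented placeholders rather than real language data.

```python
from collections import defaultdict
from itertools import combinations


def colexifications(wordlist):
    """Count, per concept pair, the languages in which one form expresses both.

    `wordlist` is an iterable of (language, concept, form) triples
    (a hypothetical layout chosen for this sketch).
    """
    # Collect the set of concepts expressed by each form within each language.
    concepts_by_form = defaultdict(set)
    for language, concept, form in wordlist:
        concepts_by_form[(language, form)].add(concept)

    # A concept pair is colexified in a language if one form covers both.
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)

    return {pair: len(langs) for pair, langs in languages_by_pair.items()}


# Invented toy data: placeholder forms, not attested words.
toy = [
    ("lang_a", "ARM", "form1"),
    ("lang_a", "HAND", "form1"),
    ("lang_b", "ARM", "form2"),
    ("lang_b", "HAND", "form2"),
    ("lang_b", "TREE", "form3"),
    ("lang_b", "WOOD", "form3"),
    ("lang_c", "ARM", "form4"),
    ("lang_c", "HAND", "form5"),
]
counts = colexifications(toy)
```

With this toy input, `counts` maps `("ARM", "HAND")` to 2 and `("TREE", "WOOD")` to 1. Counting languages rather than raw form matches keeps a single language with many synonyms from inflating a pair's score.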

Generating Feature Vectors from Phonetic Transcriptions in Cross-Linguistic Data Formats

May 07, 2024

Arne Rubehn, Jessica Nieder, Robert Forkel, Johann-Mattis List

Abstract: When comparing speech sounds across languages, scholars often make use of feature representations of individual sounds in order to determine fine-grained sound similarities. Although binary feature systems for large numbers of speech sounds have been proposed, large-scale computational applications often face the challenge that the proposed feature systems, even if they list features for several thousand sounds, cover only a small part of the numerous speech sounds reflected in actual cross-linguistic data. In order to address the problem of missing data for attested speech sounds, we propose a new approach that can create binary feature vectors dynamically for all sounds that can be represented in the standardized version of the International Phonetic Alphabet proposed by the Cross-Linguistic Transcription Systems (CLTS) reference catalog. Since CLTS is actively used in large data collections, covering more than 2,000 distinct language varieties, our procedure for the generation of binary feature vectors provides immediate access to a very large collection of multilingual wordlists. Testing our feature system in different ways on different datasets proves that the system not only provides a straightforward means to compare the similarity of speech sounds but also has the potential to be used in future cross-linguistic machine learning applications.

* To appear in the Proceedings of the 2024 Meeting of the Society for Computation in Linguistics (SCiL)
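
The idea of deriving feature vectors dynamically from a sound's description, rather than looking them up in a fixed table, can be sketched as follows. This is a toy illustration and not the authors' feature system: the five-feature inventory `FEATURES`, the `RULES` mapping, and the helper names `feature_vector` and `distance` are hypothetical, and the rules are heavily simplified. The sketch assumes each sound is described by a space-separated string of descriptors such as "voiced bilabial stop consonant".

```python
# Hypothetical, heavily simplified feature inventory for this sketch.
FEATURES = ("consonantal", "voiced", "labial", "nasal", "continuant")

# Each descriptor word sets one or more features to +1 or -1.
# Features that no descriptor touches stay at 0 (unspecified).
RULES = {
    "consonant": {"consonantal": 1},
    "vowel": {"consonantal": -1, "voiced": 1, "continuant": 1},
    "voiced": {"voiced": 1},
    "voiceless": {"voiced": -1},
    "bilabial": {"labial": 1},
    "alveolar": {"labial": -1},
    "nasal": {"nasal": 1, "continuant": -1},
    "stop": {"nasal": -1, "continuant": -1},
    "fricative": {"nasal": -1, "continuant": 1},
}


def feature_vector(description):
    """Build a feature vector from a space-separated sound description."""
    values = dict.fromkeys(FEATURES, 0)
    for word in description.split():
        # Unknown descriptors are ignored in this sketch.
        values.update(RULES.get(word, {}))
    return tuple(values[name] for name in FEATURES)


def distance(a, b):
    """Number of feature positions in which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


b = feature_vector("voiced bilabial stop consonant")
p = feature_vector("voiceless bilabial stop consonant")
m = feature_vector("voiced bilabial nasal consonant")
```

Here `b` and `p` differ only in voicing and `b` and `m` only in nasality, so each pair has distance 1, while `p` and `m` differ in both and have distance 2. Because vectors are composed from descriptors, any sound whose description uses known descriptors gets a vector without being listed in advance.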

Representing and Computing Uncertainty in Phonological Reconstruction

Oct 19, 2023

Johann-Mattis List, Nathan W. Hill, Robert Forkel, Frederic Blum

Abstract: Despite the inherently fuzzy nature of reconstructions in historical linguistics, most scholars do not represent their uncertainty when proposing proto-forms. With the increasing success of recently proposed approaches to automating certain aspects of the traditional comparative method, the formal representation of proto-forms has also improved. This formalization makes it possible to address both the representation and the computation of uncertainty. Building on recent advances in supervised phonological reconstruction, during which an algorithm learns how to reconstruct words in a given proto-language relying on previously annotated data, and inspired by improved methods for automated word prediction from cognate sets, we present a new framework that allows for the representation of uncertainty in linguistic reconstruction and also includes a workflow for the computation of fuzzy reconstructions from linguistic data.

* To appear in: Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

A New Framework for Fast Automated Phonological Reconstruction Using Trimmed Alignments and Sound Correspondence Patterns

Apr 10, 2022

Johann-Mattis List, Robert Forkel, Nathan W. Hill

Abstract: Computational approaches in historical linguistics have been increasingly applied during the past decade, and many new methods that implement parts of the traditional comparative method have been proposed. Despite these increased efforts, there are not many easy-to-use and fast approaches for the task of phonological reconstruction. Here we present a new framework that combines state-of-the-art techniques for automated sequence comparison with novel techniques for phonetic alignment analysis and sound correspondence pattern detection to allow for the supervised reconstruction of word forms in ancestral languages. We test the method on a new dataset covering six groups from three different language families. The results show that our method yields promising results while being not only fast but also easy to apply and expand.

* To appear at the 3rd Workshop on Computational Approaches to Historical Language Change, co-located with the ACL 2022 conference. https://www.aclweb.org/portal/content/3rd-workshop-computational-approaches-historical-language-change
