Suresh Kolichala

Jambu: A historical linguistic database for South Asian languages

Jun 05, 2023

Aryaman Arora, Adam Farris, Samopriya Basu, Suresh Kolichala

Abstract: We introduce Jambu, a cognate database of South Asian languages which unifies dozens of previous sources in a structured and accessible format. The database includes 287k lemmata from 602 lects, grouped together in 23k sets of cognates. We outline the data wrangling necessary to compile the dataset and train neural models for reflex prediction on the Indo-Aryan subset of the data. We hope that Jambu is an invaluable resource for all historical linguists and Indologists, and look towards further improvement and expansion of the database.

* 5 pages main text, 10 pages total. To appear at SIGMORPHON

Computational historical linguistics and language diversity in South Asia

Mar 23, 2022

Aryaman Arora, Adam Farris, Samopriya Basu, Suresh Kolichala

Abstract: South Asia is home to a plethora of languages, many of which severely lack access to new language technologies. This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics -- fields which necessitate the gathering of extensive data from many languages. We claim that data scatteredness (rather than scarcity) is the primary obstacle in the development of South Asian language technology, and suggest that the study of language history is uniquely aligned with surmounting this obstacle. We review recent developments in and at the intersection of South Asian NLP and historical-comparative linguistics, describing our and others' current efforts in this area. We also offer new strategies towards breaking the data barrier.

* 14 pages; accepted to ACL 2022 Theme Track
