Alexander Magidow

Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus

Sep 11, 2018

Yonatan Belinkov, Alexander Magidow, Alberto Barrón-Cedeño, Avi Shmidman, Maxim Romanov

Abstract: Arabic is a widely-spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.

Shamela: A Large-Scale Historical Arabic Corpus

Dec 28, 2016

Yonatan Belinkov, Alexander Magidow, Maxim Romanov, Avi Shmidman, Moshe Koppel

Abstract: Arabic is a widely-spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case-studies in which we show its application to the digital humanities.

* Slightly expanded version of Coling LT4DH workshop paper
