Maxim Romanov

Studying the History of the Arabic Language: Language Technology and a Large-Scale Historical Corpus

Sep 11, 2018

Yonatan Belinkov, Alexander Magidow, Alberto Barrón-Cedeño, Avi Shmidman, Maxim Romanov

Abstract: Arabic is a widely-spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, and confirm other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.

Important New Developments in Arabographic Optical Character Recognition (OCR)

Mar 28, 2017

Maxim Romanov, Matthew Thomas Miller, Sarah Bowen Savant, Benjamin Kiessling

Abstract: The OpenITI team has achieved Optical Character Recognition (OCR) accuracy rates for classical Arabic-script texts in the high nineties. These numbers are based on our tests of seven different Arabic-script texts of varying quality and typefaces, totaling over 7,000 lines. These accuracy rates not only represent a distinct improvement over the actual accuracy rates of the various proprietary OCR options for classical Arabic-script texts, but, equally important, they are produced using an open-source OCR software, thus enabling us to make this Arabic-script OCR technology freely available to the broader Islamic, Persian, and Arabic Studies communities.

Shamela: A Large-Scale Historical Arabic Corpus

Dec 28, 2016

Yonatan Belinkov, Alexander Magidow, Maxim Romanov, Avi Shmidman, Moshe Koppel

Abstract: Arabic is a widely-spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case-studies in which we show its application to the digital humanities.

* Slightly expanded version of Coling LT4DH workshop paper
