Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ogabek Sobirov

UzbekTagger: The rule-based POS tagger for Uzbek language

Jan 30, 2023

Maksud Sharipov, Elmurod Kuriyozov, Ollabergan Yuldashev, Ogabek Sobirov

Abstract:This research paper presents a part-of-speech (POS) annotated dataset and tagger tool for the low-resource Uzbek language. The dataset includes 12 tags, which were used to develop a rule-based POS-tagger tool. The corpus text used in the annotation process was made sure to be balanced over 20 different fields in order to ensure its representativeness. Uzbek being an agglutinative language so the most of the words in an Uzbek sentence are formed by adding suffixes. This nature of it makes the POS-tagging task difficult to find the stems of words and the right part-of-speech they belong to. The methodology proposed in this research is the stemming of the words with an affix/suffix stripping approach including database of the stem forms of the words in the Uzbek language. The tagger tool was tested on the annotated dataset and showed high accuracy in identifying and tagging parts of speech in Uzbek text. This newly presented dataset and tagger tool can be used for a variety of natural language processing tasks such as language modeling, machine translation, and text-to-speech synthesis. The presented dataset is the first of its kind to be made publicly available for Uzbek, and the POS-tagger tool created can also be used as a pivot to use as a base for other closely-related Turkic languages.

* Preprint of the accepted paper to The 10th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, April 21-23, 2023, Pozna\'n, Poland

Via

Access Paper or Ask Questions

Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language

Oct 28, 2022

Maksud Sharipov, Ogabek Sobirov

Figure 1 for Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language

Figure 2 for Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language

Figure 3 for Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language

Figure 4 for Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language

Abstract:Lemmatization is one of the core concepts in natural language processing, thus creating a lemmatization tool is an important task. This paper discusses the construction of a lemmatization algorithm for the Uzbek language. The main purpose of the work is to remove affixes of words in the Uzbek language by means of the finite state machine and to identify a lemma (a word that can be found in the dictionary) of the word. The process of removing affixes uses a database of affixes and part of speech knowledge. This lemmatization consists of the general rules and a part of speech data of the Uzbek language, affixes, classification of affixes, removing affixes on the basis of the finite state machine for each class, as well as a definition of this word lemma.

* Preprint version of the paper to be published in The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural Language Processing (ALTNLP), June 6, 2022, Koper, Slovenia

Via

Access Paper or Ask Questions