Albina Khusainova

Evaluation of Morphological Embeddings for the Russian Language

Mar 11, 2021

Vitaly Romanov, Albina Khusainova

Abstract: A number of morphology-based word embedding models have been introduced in recent years. However, their evaluation has mostly been limited to English, which is known to be a morphologically simple language. In this paper, we explore whether and to what extent incorporating morphology into word embeddings improves performance on downstream NLP tasks for the morphologically rich Russian language. The NLP tasks we chose are POS tagging, chunking, and NER; for Russian, all three can mostly be solved using morphology alone, without understanding the semantics of words. Our experiments show that morphology-based embeddings trained with the Skipgram objective do not outperform an existing embedding model, FastText. Moreover, BERT, a more complex but morphology-unaware model, achieves significantly greater performance on tasks that presumably require understanding of a word's morphology.

* Published in Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval

Evaluation of Morphological Embeddings for English and Russian Languages

Mar 11, 2021

Vitaly Romanov, Albina Khusainova

Abstract: This paper evaluates morphology-based embeddings for the English and Russian languages. Despite the introduction of several morphology-based word embedding models in the past, and their reported performance improvements on word similarity and language modeling tasks, in our experiments we did not observe any stable preference over our two baseline models, SkipGram and FastText. The performance exhibited by morphological embeddings is the average of the two baselines mentioned above.

* Published in Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP. arXiv admin note: text overlap with arXiv:2103.06628

Hierarchical Transformer for Multilingual Machine Translation

Mar 05, 2021

Albina Khusainova, Adil Khan, Adín Ramírez Rivera, Vitaly Romanov

Abstract: The choice of parameter sharing strategy in multilingual machine translation models determines how optimally the parameter space is used and hence directly influences the ultimate translation quality. Inspired by linguistic trees that show the degree of relatedness between different languages, a new general approach to parameter sharing in multilingual machine translation was suggested recently. The main idea is to use these expert language hierarchies as a basis for the multilingual architecture: the closer two languages are, the more parameters they share. In this work, we test this idea using the Transformer architecture and show that, despite the success in previous work, there are problems inherent to training such hierarchical models. We demonstrate that, with a carefully chosen training strategy, the hierarchical architecture can outperform bilingual models and multilingual models with full parameter sharing.

* Accepted to VarDial 2021
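
The sharing rule in the abstract above ("the closer two languages are, the more parameters they share") can be illustrated with a small sketch. This is not the authors' architecture: the language tree, the node names, and the functions `path_to` and `shared_blocks` are hypothetical, and the sketch only assumes that each tree node owns one parameter block and that a language uses every block on its path from the root.

```python
# Hypothetical language tree for illustration only; not taken from the paper.
TREE = {
    "root": ["turkic", "slavic"],
    "turkic": ["tatar", "kazakh"],
    "slavic": ["russian", "ukrainian"],
}


def path_to(language, tree=TREE, node="root"):
    """Return the list of nodes from the root down to `language` (empty if absent)."""
    if node == language:
        return [node]
    for child in tree.get(node, []):
        sub_path = path_to(language, tree, child)
        if sub_path:
            return [node] + sub_path
    return []


def shared_blocks(lang_a, lang_b):
    """Parameter blocks used by both languages: the common prefix of their paths."""
    shared = []
    for a, b in zip(path_to(lang_a), path_to(lang_b)):
        if a != b:
            break
        shared.append(a)
    return shared


# Closely related languages share more blocks than distant ones.
print(shared_blocks("tatar", "kazakh"))   # blocks shared within one family
print(shared_blocks("tatar", "russian"))  # blocks shared across families
```

Under these assumptions, two languages in the same family share the root block and the family block, while languages from different families share only the root block.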

A Survey of Methods to Leverage Monolingual Data in Low-resource Neural Machine Translation

Oct 01, 2019

Ilshat Gibadullin, Aidar Valeev, Albina Khusainova, Adil Khan

Abstract: Neural machine translation has become the state of the art for language pairs with large parallel corpora. However, the quality of machine translation for low-resource languages leaves much to be desired. There are several approaches to mitigate this problem, such as transfer learning and semi-supervised and unsupervised learning techniques. In this paper, we review the existing methods whose main idea is to exploit the power of monolingual data, which, compared to parallel data, is usually easier to obtain and available in significantly greater amounts.

* Presented at ICATHS'19

Application of Low-resource Machine Translation Techniques to Russian-Tatar Language Pair

Oct 01, 2019

Aidar Valeev, Ilshat Gibadullin, Albina Khusainova, Adil Khan

Abstract: Neural machine translation is the current state of the art in machine translation. Although it is successful in a resource-rich setting, its applicability to low-resource language pairs is still debatable. In this paper, we explore the effect of different techniques for improving machine translation quality when a parallel corpus is as small as 324,000 sentences, taking as an example the previously unexplored Russian-Tatar language pair. We apply techniques such as transfer learning and semi-supervised learning to the base Transformer model, and empirically show that the resulting models improve Russian-to-Tatar and Tatar-to-Russian translation quality by +2.57 and +3.66 BLEU, respectively.

* Presented at ICATHS'19

SART - Similarity, Analogies, and Relatedness for Tatar Language: New Benchmark Datasets for Word Embeddings Evaluation

Mar 31, 2019

Albina Khusainova, Adil Khan, Adín Ramírez Rivera

Abstract: There is a huge imbalance between the languages currently spoken and the resources available to study them. Most of the attention naturally goes to the "big" languages: those with the largest presence in terms of media and number of speakers. Other, less represented languages sometimes do not even have a good-quality corpus to study them. In this paper, we tackle this imbalance by presenting a new set of evaluation resources for Tatar, a language of the Turkic language family that is mainly spoken in the Republic of Tatarstan, Russia. We present three datasets: Similarity and Relatedness datasets, which consist of human-scored word pairs and can be used to evaluate semantic models; and an Analogies dataset, which comprises analogy questions and allows exploring semantic, syntactic, and morphological aspects of language modeling. All three datasets build upon existing datasets for the English language and follow the same structure. However, they are not mere translations: they take into account specifics of the Tatar language and expand beyond the original datasets. We evaluate state-of-the-art word embedding models for the two languages, using our proposed datasets for Tatar and the original datasets for English, and report our findings on performance comparison.

* The datasets are available at https://github.com/tat-nlp/SART
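
As a rough illustration of how a dataset of human-scored word pairs is commonly used to evaluate embeddings, the sketch below computes the cosine similarity of each pair and correlates it with the human scores using Spearman's rank correlation. The abstract does not state which metric the authors used, so the choice of Spearman correlation is an assumption; the toy vectors, word pairs, and scores are invented and are not taken from SART, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """1-based ranks of `values`, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate_similarity(embeddings, scored_pairs):
    """Correlate model similarities with human scores, skipping out-of-vocabulary pairs."""
    model_scores, human_scores = [], []
    for w1, w2, human in scored_pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(human)
    return spearman(model_scores, human_scores)


# Invented toy data (assumption): 2-d vectors and made-up human scores.
toy_embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "bus": [0.1, 0.9],
}
toy_pairs = [("cat", "dog", 9.0), ("cat", "bus", 2.0), ("cat", "car", 1.0)]

print(evaluate_similarity(toy_embeddings, toy_pairs))
```

A correlation close to 1 means the model ranks word pairs in nearly the same order as the human annotators; pairs with words missing from the vocabulary are skipped here, which is one of several possible conventions.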
