Ekaterina Chernyak

Automated Word Stress Detection in Russian

Jul 12, 2019

Maria Ponomareva, Kirill Milintsevich, Ekaterina Chernyak, Anatoly Starostin

Abstract: In this study, we address the problem of automated word stress detection in Russian using character-level models and no part-of-speech taggers. We use a simple bidirectional RNN with LSTM nodes and achieve an accuracy of 90% or higher. We experiment with two training datasets and show that using data from an annotated corpus is much more efficient than using a dictionary, since it allows us to take into account word frequencies and the morphological context of the word.
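
Character-level stress detection can be framed as labeling each character of a word as stressed or unstressed. The sketch below illustrates only that framing, not the authors' code: the use of the combining acute accent (U+0301) as the stress marker, the `encode`/`decode` helper names, and the per-character scores passed to `decode` are all assumptions made for illustration. In a real system the scores would come from a trained model such as the bidirectional LSTM described above.

```python
ACUTE = "\u0301"                  # assumed stress marker: combining acute accent
VOWELS = set("аеёиоуыэюя")        # Russian vowel letters


def encode(stressed_word):
    """Split a stress-marked word into characters and 0/1 stress labels."""
    chars, labels = [], []
    for ch in stressed_word:
        if ch == ACUTE:
            if labels:
                labels[-1] = 1    # the accent follows the stressed vowel
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


def decode(chars, scores):
    """Place the stress mark after the highest-scoring vowel.

    `scores` stands in for per-character model outputs (hypothetical here).
    """
    vowel_positions = [i for i, c in enumerate(chars) if c.lower() in VOWELS]
    if not vowel_positions:
        return "".join(chars)     # nothing to stress
    best = max(vowel_positions, key=lambda i: scores[i])
    return "".join(chars[: best + 1]) + ACUTE + "".join(chars[best + 1 :])


chars, labels = encode("молоко" + ACUTE)
restored = decode(chars, [0.1, 0.2, 0.0, 0.3, 0.0, 0.9])
```

Restricting the prediction to vowel positions in `decode` is a design choice of this sketch; it guarantees that exactly one vowel receives the stress mark.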

* Published in Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 31-35, Copenhagen, Denmark, September 7, 2017
* SCLeM 2017

Char-RNN for Word Stress Detection in East Slavic Languages

Jun 10, 2019

Ekaterina Chernyak, Maria Ponomareva, Kirill Milintsevich

Abstract: We explore how well a sequence labeling approach, namely a recurrent neural network, is suited to the task of resource-poor, POS-tagging-free word stress detection in the Russian, Ukrainian, and Belarusian languages. We present new datasets, annotated with word stress, for the three languages, compare several RNN models trained on the three languages, and explore possible applications of transfer learning to the task. We show that it is possible to train a model in a cross-lingual setting and that using additional languages improves the quality of the results.
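
One way a cross-lingual character-level setup could be arranged is to pool the word lists of all three languages and index them with a single shared character vocabulary, so that letters common to the languages map to the same input index. The sketch below is a hypothetical illustration of that idea only; the `build_vocab` and `encode` helpers, the special tokens, and the toy word lists are assumptions and are not taken from the paper.

```python
def build_vocab(datasets):
    """Build one character vocabulary shared by all languages.

    `datasets` maps a language code to a list of words (toy data here).
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for lang in sorted(datasets):          # sorted for a reproducible index order
        for word in datasets[lang]:
            for ch in word:
                vocab.setdefault(ch, len(vocab))
    return vocab


def encode(word, vocab):
    """Map a word to character indices, using <unk> for unseen characters."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in word]


# Toy word lists, invented for illustration only.
toy_data = {
    "ru": ["молоко"],
    "uk": ["молоко", "їжак"],
    "be": ["малако"],
}
shared_vocab = build_vocab(toy_data)
ids = encode("молоко", shared_vocab)
```

Because the vocabulary is shared, a letter such as "м" has the same index whichever language the word comes from, while language-specific letters such as "ї" simply receive their own entries.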

* 2019, In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 35-41, Ann Arbor, Michigan, Association for Computational Linguistics
* Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects at NAACL-2019

Creation and Evaluation of Datasets for Distributional Semantics Tasks in the Digital Humanities Domain

Mar 07, 2019

Gerhard Wohlgenannt, Ariadna Barinova, Dmitry Ilvovsky, Ekaterina Chernyak

Abstract: Word embeddings are already well studied in the general domain. They are usually trained on large text corpora and have been evaluated, for example, on word similarity and analogy tasks, but also as an input to downstream NLP processes. In contrast, in this work we explore the suitability of word embedding technologies in the specialized digital humanities domain. After training embedding models of various types on two popular fantasy novel book series, we evaluate their performance on two task types: term analogies and word intrusion. To this end, we manually construct test datasets with domain experts. Among the contributions is the evaluation of various word embedding techniques on the different task types, with the finding that even embeddings trained on small corpora perform well, for example, on the word intrusion task. Furthermore, we provide extensive and high-quality datasets in digital humanities for further investigation, as well as the implementation to easily reproduce or extend the experiments.
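
The two task types can be sketched with cosine similarity over embedding vectors. The example below is a minimal illustration, not the authors' implementation: the four-dimensional vectors are invented toy values, the analogy solver uses the common vector-offset method (an assumption, since the abstract does not name the method), and the intruder is taken to be the word with the lowest mean similarity to the rest of its set.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector offset b - a + c."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def find_intruder(emb, words):
    """Return the word with the lowest mean similarity to the others."""
    def mean_similarity(w):
        others = [o for o in words if o != w]
        return sum(cosine(emb[w], emb[o]) for o in others) / len(others)
    return min(words, key=mean_similarity)


# Toy vectors, invented for illustration; real vectors would come from a
# trained embedding model.
EMB = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "sword": [0.1, 0.1, 0.0, 1.0],
}
analogy_answer = solve_analogy(EMB, "man", "king", "woman")
intruder = find_intruder(EMB, ["king", "queen", "man", "sword"])
```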

Relation Extraction Datasets in the Digital Humanities Domain and their Evaluation with Word Embeddings

Mar 04, 2019

Gerhard Wohlgenannt, Ekaterina Chernyak, Dmitry Ilvovsky, Ariadna Barinova, Dmitry Mouromtsev

Abstract: In this research, we manually create high-quality datasets in the digital humanities domain for the evaluation of language models, specifically word embedding models. The first step comprises the creation of unigram and n-gram datasets for two fantasy novel book series, each covering two task types: analogy and doesn't-match. This is followed by the training of models on the two book series with various popular word embedding model types such as word2vec, GloVe, fastText, or LexVec. Finally, we evaluate the suitability of word embedding models for such specific relation extraction tasks in a situation of comparatively small corpus sizes. In the evaluations, we also investigate and analyze particular aspects such as the impact of corpus term frequencies and task difficulty on accuracy. The datasets, the underlying system, and the word embedding models are available on GitHub and can be easily extended with new datasets and tasks, used to reproduce the presented results, or transferred to other domains.
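
An analysis of how corpus term frequency relates to accuracy could, for instance, group task items into frequency bins and compute accuracy per bin. The sketch below assumes a hypothetical record format of `(lowest term frequency in the task item, whether the model answered correctly)`; the bin edges and the records are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_frequency(records, bin_edges):
    """Compute accuracy per frequency bin.

    records:   iterable of (term_frequency, is_correct) pairs
    bin_edges: ascending thresholds; bin k holds frequencies that are
               >= exactly k of the edges (bin 0 is below the first edge)
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for freq, is_correct in records:
        bin_index = sum(freq >= edge for edge in bin_edges)
        totals[bin_index] += 1
        hits[bin_index] += int(is_correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Invented records: (lowest term frequency in the item, answered correctly?)
toy_records = [
    (3, False), (8, True), (2, False),      # rare terms
    (15, True), (40, True), (12, False),    # mid-frequency terms
    (120, True),                            # frequent terms
]
per_bin = accuracy_by_frequency(toy_records, bin_edges=[10, 100])
```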
