Elena Bruches

Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers

May 13, 2024

Alena Tsanda, Elena Bruches

Abstract: The paper describes the creation of a multimodal dataset of Russian-language scientific papers and the testing of existing language models on the task of automatic text summarization. The dataset is multimodal: it includes texts, tables, and figures. The paper presents the results of experiments with two language models: Gigachat from SBER and YandexGPT from Yandex. The dataset consists of 420 papers and is publicly available at https://github.com/iis-research-team/summarization-dataset.

* 12 pages, accepted to AINL

Automatic Aspect Extraction from Scientific Texts

Oct 06, 2023

Anna Marshalova, Elena Bruches, Tatiana Batura

Abstract: Being able to extract the main points, key insights, and other important information from scientific papers, referred to here as aspects, might facilitate the process of conducting a scientific literature review. The aim of our research is therefore to create a tool for automatic aspect extraction from Russian-language scientific texts of any domain. In this paper, we present a cross-domain dataset of scientific texts in Russian, annotated with the aspects Task, Contribution, Method, and Conclusion, as well as a baseline algorithm for aspect extraction based on the multilingual BERT model fine-tuned on our data. We show that aspect representation differs somewhat between domains, but even though our model was trained on a limited number of scientific domains, it is still able to generalize to new domains, as cross-domain experiments showed. The code and the dataset are available at https://github.com/anna-marshalova/automatic-aspect-extraction-from-scientific-texts.

TERMinator: A system for scientific texts processing

Sep 29, 2022

Elena Bruches, Olga Tikhobaeva, Yana Dementyeva, Tatiana Batura

Abstract: This paper is devoted to the extraction of entities and semantic relations between them from scientific texts, where we consider scientific terms as entities. We present a dataset that includes annotations for two tasks and develop a system called TERMinator to study the influence of language models on term recognition and to compare different approaches to relation extraction. Experiments show that language models pre-trained on the target language do not always show the best performance. Adding some heuristic approaches may also improve the overall quality on a particular task. The developed tool and the annotated corpus are publicly available at https://github.com/iis-research-team/terminator and may be useful for other researchers.

A system for information extraction from scientific texts in Russian

Sep 14, 2021

Elena Bruches, Anastasia Mezentseva, Tatiana Batura

Abstract: In this paper, we present a system for information extraction from scientific texts in the Russian language. The system performs several tasks in an end-to-end manner: term recognition, extraction of relations between terms, and term linking with entities from the knowledge base. These tasks are extremely important for information retrieval, recommendation systems, and classification. The advantage of the implemented methods is that the system does not require a large amount of labeled data, which saves time and effort on data labeling, so it can be applied in low- and mid-resource settings. The source code is publicly available and can be used for different research purposes.

Entity Recognition and Relation Extraction from Scientific and Technical Texts in Russian

Dec 14, 2020

Elena Bruches, Alexey Pauls, Tatiana Batura, Vladimir Isachenko

Abstract: This paper is devoted to the study of methods for information extraction (entity recognition and relation classification) from scientific texts on information technology. Scientific publications provide valuable information about cutting-edge scientific advances, but efficient processing of increasing amounts of data is a time-consuming task. In this paper, several modifications of methods for the Russian language are proposed. The paper also includes the results of experiments comparing a keyword extraction method, a vocabulary method, and some methods based on neural networks. Text collections for these tasks exist for the English language and are actively used by the scientific community, but at present such datasets in Russian are not publicly available. We present a corpus of scientific texts in Russian, RuSERRC. This dataset consists of 1600 unlabeled documents and 80 documents labeled with entities and semantic relations (6 relation types were considered). The dataset and models are available at https://github.com/iis-research-team. We hope they can be useful for research purposes and for the development of information extraction systems.
