Abstract: This study investigates global properties of literary and non-literary texts. Within the literary texts, a distinction is made between canonical and non-canonical works. The central hypothesis of the study is that the three text types (non-literary, literary/canonical, and literary/non-canonical) exhibit systematic differences with respect to structural design features as correlates of aesthetic responses in readers. To investigate these differences, we compiled a corpus containing texts of the three categories of interest, the Jena Textual Aesthetics Corpus. Two aspects of global structure are investigated, variability and self-similar (fractal) patterns, which reflect long-range correlations along texts. We use four types of basic observations: (i) the frequency of POS tags per sentence, (ii) sentence length, (iii) lexical diversity in chunks of text, and (iv) the distribution of topic probabilities in chunks of text. These basic observations are grouped into two more general categories: (a) the low-level properties (i) and (ii), which are observed at the level of the sentence (reflecting linguistic decoding), and (b) the high-level properties (iii) and (iv), which are observed at the textual level (reflecting comprehension). The basic observations are transformed into time series, and these time series are subjected to multifractal detrended fluctuation analysis (MFDFA). Our results show that low-level properties of texts are better discriminators than high-level properties for the three text types under analysis. Canonical literary texts differ from non-canonical ones primarily in terms of variability. Fractality seems to be a universal feature of text, more pronounced in non-literary than in literary texts. Beyond the specific results of the study, we intend to open up new perspectives on the experimental study of textual aesthetics.
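The core of the method described above is detrended fluctuation analysis of a text-derived time series. The following sketch illustrates the monofractal (single-exponent) variant of DFA on a toy "sentence length" series; the full MFDFA of the abstract additionally varies a moment order q, which is omitted here for brevity. The function name `dfa`, the scale choices, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dfa(series, scales, order=1):
    """Monofractal DFA: fluctuation function F(s) for each window size s.

    This is a simplified sketch; MFDFA would raise the per-window
    variances to a power q/2 before averaging, for a range of q values.
    """
    x = np.asarray(series, dtype=float)
    # Step 1: integrate the mean-centered series into a "profile"
    profile = np.cumsum(x - x.mean())
    fluctuations = []
    for s in scales:
        n_windows = len(profile) // s
        variances = []
        t = np.arange(s)
        for i in range(n_windows):
            segment = profile[i * s:(i + 1) * s]
            # Step 2: fit and remove a local polynomial trend per window
            trend = np.polyval(np.polyfit(t, segment, order), t)
            variances.append(np.mean((segment - trend) ** 2))
        # Step 3: root-mean-square fluctuation at this scale
        fluctuations.append(np.sqrt(np.mean(variances)))
    return np.array(fluctuations)

# Toy example: i.i.d. "sentence lengths" (no long-range correlation),
# so the scaling exponent H should come out near 0.5.
rng = np.random.default_rng(0)
lengths = rng.poisson(20, size=2048)
scales = [16, 32, 64, 128, 256]
F = dfa(lengths, scales)
# The Hurst-like exponent is the slope of log F(s) versus log s
H, _ = np.polyfit(np.log(scales), np.log(F), 1)
```

For a genuinely long-range-correlated series, H would exceed 0.5; comparing such exponents across the three text types is, in essence, what the fractality analysis measures.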
Abstract: The goal of the NER task is to classify the proper nouns of a text into classes such as person, location, and organization. This is an important preprocessing step in many NLP tasks such as question answering and summarization. Although many research studies have been conducted in this area for English, and state-of-the-art NER systems have reached F1 scores above 90 percent, there are very few research studies on this task for Persian. One important cause of this may be the lack of a standard Persian NER dataset for training and testing NER systems. In this research we create a standard, sufficiently large tagged Persian NER dataset, which will be distributed free of charge for research purposes. In order to construct such a standard dataset, we studied the standard NER datasets constructed for English and found that almost all of them are built from news texts. We therefore collected documents from ten news websites. Then, in order to provide annotators with guidelines for tagging these documents, we studied the guidelines used to construct the CoNLL and MUC standard English datasets and developed our own guidelines, taking Persian linguistic rules into account.
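The CoNLL-style datasets mentioned above conventionally mark entities with BIO tags (B- begins an entity, I- continues it, O is outside any entity). The sketch below shows how entity spans are recovered from such tags; the example sentence is a transliterated placeholder (real data would be Persian), and the helper `extract_entities` is an illustrative assumption, not part of the dataset's tooling.

```python
# Toy CoNLL-style BIO-tagged sentence; tokens are transliterated
# placeholders standing in for Persian news text.
tagged = [
    ("Tehran", "B-LOC"),
    ("is", "O"),
    ("the", "O"),
    ("capital", "O"),
    ("of", "O"),
    ("Iran", "B-LOC"),
]

def extract_entities(pairs):
    """Collect (entity_text, label) spans from a BIO-tagged token list."""
    entities, tokens, label = [], [], None
    for token, tag in pairs:
        if tag.startswith("B-"):
            if tokens:  # close any entity already in progress
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens:
            tokens.append(token)  # continue the current entity
        else:  # O tag (or stray I-): close the current entity, if any
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:  # flush an entity that ends the sentence
        entities.append((" ".join(tokens), label))
    return entities

# extract_entities(tagged) yields [("Tehran", "LOC"), ("Iran", "LOC")]
```

Agreement on exactly where such spans begin and end is what annotation guidelines like those described in the abstract are meant to standardize.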