Sandra M. Aluisio

Evaluating Word Embeddings for Sentence Boundary Detection in Speech Transcripts

Aug 15, 2017

Marcos V. Treviso, Christopher D. Shulby, Sandra M. Aluisio

Abstract: This paper is motivated by the automation of neuropsychological tests involving discourse analysis in the retellings of narratives by patients with potential cognitive impairment. In this scenario, the task of sentence boundary detection in speech transcripts is important because discourse analysis involves the application of Natural Language Processing tools, such as taggers and parsers, which depend on the sentence as a processing unit. Our aim in this paper is to verify which embedding induction method works best for the sentence boundary detection task: those proposed to capture semantic, syntactic, or morphological similarities.

* Accepted at STIL 2017

A Lightweight Regression Method to Infer Psycholinguistic Properties for Brazilian Portuguese

May 19, 2017

Leandro B. dos Santos, Magali S. Duran, Nathan S. Hartmann, Arnaldo Candido Jr., Gustavo H. Paetzold, Sandra M. Aluisio

Abstract: Psycholinguistic properties of words have been used in various approaches to Natural Language Processing tasks, such as text simplification and readability assessment. Most of these properties are subjective, and gathering them requires costly and time-consuming surveys. Recent approaches use the limited datasets of psycholinguistic properties to extend them automatically to large lexicons. However, some of the resources used by such approaches are not available for most languages. This study presents a method to infer psycholinguistic properties for Brazilian Portuguese (BP) using regressors built with a light set of features usually available for less-resourced languages: word length, frequency lists, lexical databases composed of school dictionaries, and word embedding models. The correlations between the inferred properties are close to those obtained by related works. The resulting resource contains 26,874 words in BP annotated with concreteness, age of acquisition, imageability, and subjective frequency.

* Paper accepted for TSD2017
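
As a rough illustration of the idea in the abstract above (predicting a word-level property from cheap features), the following is a minimal sketch, not the authors' method. It assumes only two of the listed features (word length and log frequency) plus an intercept, and it swaps in plain ordinary least squares for whatever regressors the paper uses. The words, the frequency counts, and the target values are synthetic and exist only to make the example run; all function names are hypothetical.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def features(word, frequency):
    """Assumed feature vector: intercept, word length, log frequency."""
    return [1.0, float(len(word)), math.log(frequency)]


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, row):
    return sum(c * v for c, v in zip(coefficients, row))


# Synthetic lexicon: the frequency counts are invented for illustration only.
lexicon = [("casa", 1000), ("cachorro", 200), ("sol", 5000),
           ("felicidade", 50), ("rio", 30)]
rows = [features(word, freq) for word, freq in lexicon]

# Synthetic targets generated from known coefficients, so the fit can be checked.
true_coefficients = [1.0, 0.5, -0.25]
targets = [predict(true_coefficients, row) for row in rows]

coefficients = fit_least_squares(rows, targets)
```

Because the targets are generated without noise from `true_coefficients`, the fitted `coefficients` should match them up to floating-point error. With real ratings, the fit would only approximate the data, and additional features such as embedding dimensions would extend the vector returned by `features`.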

Complex networks analysis of language complexity

Feb 19, 2013

Diego R. Amancio, Sandra M. Aluisio, Osvaldo N. Oliveira Jr., Luciano da F. Costa

Abstract: Methods from statistical physics, such as those involving complex networks, have been increasingly used in quantitative analysis of linguistic phenomena. In this paper, we represented pieces of text with different levels of simplification in co-occurrence networks and found that topological regularity correlated negatively with textual complexity. Furthermore, in less complex texts the distance between concepts, represented as nodes, tended to decrease. The complex network metrics were treated with multivariate pattern recognition techniques, which allowed us to distinguish between original texts and their simplified versions. For each original text, two simplified versions were generated manually with an increasing number of simplification operations. As expected, distinction was easier for the strongly simplified versions, where the most relevant metrics were node strength, shortest paths and diversity. Also, the discrimination of complex texts was improved with higher hierarchical network metrics, thus pointing to the usefulness of considering wider contexts around the concepts. Though the accuracy rate in the distinction was not as high as in methods using deep linguistic knowledge, the complex network approach is still useful for a rapid screening of texts whenever assessing complexity is essential to guarantee accessibility to readers with limited reading ability.

* Europhysics Letters (2012) 100 58002
* The Supplementary Information (SI) is available from https://dl.dropbox.com/u/2740286/supplementary.pdf
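
To make the network representation in the abstract above concrete, here is a minimal sketch of a word co-occurrence network and two of the metrics it names, node strength and shortest paths. It rests on assumptions not stated in the abstract: adjacent tokens are linked by an undirected edge whose weight counts co-occurrences, no stopword removal or lemmatization is applied, and path length is measured in hops. The function names are hypothetical.

```python
from collections import defaultdict, deque


def cooccurrence_network(tokens):
    """Link each pair of adjacent tokens; the edge weight counts co-occurrences."""
    weights = defaultdict(lambda: defaultdict(int))
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops from repeated words
            weights[left][right] += 1
            weights[right][left] += 1
    return weights


def node_strength(weights, node):
    """Sum of the weights of all edges attached to the node."""
    return sum(weights.get(node, {}).values())


def shortest_path_lengths(weights, source):
    """Hop distance from the source to every reachable node (breadth-first search)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in weights.get(current, {}):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance


def average_shortest_path(weights):
    """Mean hop distance over all ordered pairs of connected, distinct nodes."""
    total, pairs = 0, 0
    for source in list(weights):
        for target, hops in shortest_path_lengths(weights, source).items():
            if target != source:
                total += hops
                pairs += 1
    return total / pairs if pairs else 0.0


tokens = "the cat saw the dog".split()
network = cooccurrence_network(tokens)
strength_of_the = node_strength(network, "the")        # 3
hops_dog_to_cat = shortest_path_lengths(network, "dog")["cat"]  # 2
```

In this toy sentence, "the" is adjacent to "cat", "saw", and "dog" once each, so its strength is 3, and "dog" reaches "cat" only through "the", giving a distance of 2 hops. Comparing such metrics between an original text and its simplified version is the kind of analysis the abstract describes, though the exact preprocessing and metric definitions used in the paper may differ.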
