Iñaki San Vicente

Give your Text Representation Models some Love: the Case for Basque

Apr 02, 2020

Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, Eneko Agirre

Abstract: Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties rather than building their own. This is suboptimal because, for many languages, the models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state of the art in those tasks for Basque. All benchmarks and models used in this work are publicly available.

* Accepted at LREC 2020; 8 pages, 7 tables

Talaia: a Real time Monitor of Social Media and Digital Press

Sep 28, 2018

Iñaki San Vicente, Xabier Saralegi, Rodrigo Agerri

Abstract: Talaia is a platform for monitoring social media and digital press. A configurable crawler gathers content with respect to user-defined domains or topics. Crawled data is processed by means of the IXA-pipes NLP chain and the EliXa sentiment analysis system. A Django-powered interface provides data visualization to support the user's analysis of the data. This paper presents the architecture of the system and describes its components in detail. To demonstrate the validity of the approach, two real use cases are presented, one in the cultural domain and one in the political domain. An evaluation of the sentiment analysis task in both scenarios is also provided, showing the capacity for domain adaptation.

* Preprint draft, 21 pages

EliXa: A Modular and Flexible ABSA Platform

Feb 07, 2017

Iñaki San Vicente, Xabier Saralegi, Rodrigo Agerri

Abstract: This paper presents a supervised Aspect Based Sentiment Analysis (ABSA) system. Our aim is to develop a modular platform that makes it easy to conduct experiments by replacing modules or adding new features. We obtain the best result in the Opinion Target Extraction (OTE) task (slot 2) using an off-the-shelf sequence labeler. The target polarity classification (slot 3) is addressed by means of a multiclass SVM algorithm which includes lexical-based features such as the polarity values obtained from domain and open polarity lexicons. The system obtains accuracies of 0.70 and 0.73 for the restaurant and laptop domains, respectively, and performs second best in the out-of-domain hotel domain, achieving an accuracy of 0.80.

* Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Association for Computational Linguistics, June 2015, Denver, Colorado, pp. 748-752
* 5 pages, conference

Q-WordNet PPV: Simple, Robust and Unsupervised Generation of Polarity Lexicons for Multiple Languages

Feb 06, 2017

Iñaki San Vicente, Rodrigo Agerri, German Rigau

Abstract: This paper presents a simple, robust and (almost) unsupervised dictionary-based method, qwn-ppv (Q-WordNet as Personalized PageRanking Vector), to automatically generate polarity lexicons. We show that qwn-ppv outperforms other automatically generated lexicons in the four extrinsic evaluations presented here. It also shows very competitive and robust results with respect to manually annotated ones. The results suggest that no single lexicon is best for every task and dataset, and that the intrinsic evaluation of polarity lexicons is not a good indicator of performance on a Sentiment Analysis task. The qwn-ppv method makes it easy to create quality polarity lexicons whenever no domain-based annotated corpora are available for a given language.

* Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), pages 88-97, Gothenburg, Sweden, April 26-30 2014
* 8 pages plus 2 pages of references
