Viviana Cotik

Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources

Jan 17, 2025

Belu Ticona, Fernando Carranza, Viviana Cotik

Abstract: Argentina has a diverse, yet little-known, Indigenous language heritage. Most of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, no unified information on speakers and computational tools is available for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, along with national demographic data on the country's Indigenous population. The languages are classified into seven families: Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon. We also provide an introductory survey of the computational resources available for these languages, whether or not they are specifically developed for Argentine varieties.

* Accepted to COLING Main 2025

Exploring Large Language Models for Hate Speech Detection in Rioplatense Spanish

Oct 16, 2024

Juan Manuel Pérez, Paula Miguel, Viviana Cotik

Abstract: Hate speech detection deals with many language variants, slang, slurs, expression modalities, and cultural nuances. This outlines the importance of working with specific corpora, when addressing hate speech within the scope of Natural Language Processing, recently revolutionized by the irruption of Large Language Models. This work presents a brief analysis of the performance of large language models in the detection of Hate Speech for Rioplatense Spanish. We performed classification experiments leveraging chain-of-thought reasoning with ChatGPT 3.5, Mixtral, and Aya, comparing their results with those of a state-of-the-art BERT classifier. These experiments outline that, even if large language models show a lower precision compared to the fine-tuned BERT classifier and, in some cases, they find hard-to-get slurs or colloquialisms, they still are sensitive to highly nuanced cases (particularly, homophobic/transphobic hate speech). We make our code and models publicly available for future research.

MessIRve: A Large-Scale Spanish Information Retrieval Dataset

Sep 09, 2024

Francisco Valentini, Viviana Cotik, Damián Furman, Ivan Bercovich, Edgar Altszyler, Juan Manuel Pérez

Abstract: Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, current IR benchmarks lack Spanish data, hindering the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with around 730 thousand queries from Google's autocomplete API and relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.

Assessing the impact of contextual information in hate speech detection

Oct 05, 2022

Juan Manuel Pérez, Franco Luque, Demian Zayat, Martín Kondratzky, Agustín Moro, Pablo Serrati, Joaquín Zajac, Paula Miguel, Natalia Debandi, Agustín Gravano (+1 more)

Abstract: In recent years, hate speech has gained great relevance in social networks and other virtual media because of its intensity and its relationship with violent acts against members of protected groups. Due to the great amount of content generated by users, great effort has been made in the research and development of automatic tools to aid the analysis and moderation of this speech, at least in its most threatening forms. One of the limitations of current approaches to automatic hate speech detection is the lack of context. Most studies and resources are performed on data without context; that is, isolated messages without any type of conversational context or the topic being discussed. This restricts the available information to define if a post on a social network is hateful or not. In this work, we provide a novel corpus for contextualized hate speech detection based on user responses to news posts from media outlets on Twitter. This corpus was collected in the Rioplatense dialectal variety of Spanish and focuses on hate speech associated with the COVID-19 pandemic. Classification experiments using state-of-the-art techniques show evidence that adding contextual information improves hate speech detection performance for two proposed tasks (binary and multi-label prediction). We make our code, models, and corpus available for further research.

Creation of an Annotated Corpus of Spanish Radiology Reports

Oct 30, 2017

Viviana Cotik, Darío Filippo, Roland Roller, Hans Uszkoreit, Feiyu Xu

Abstract: This paper presents a new annotated corpus of 513 anonymized radiology reports written in Spanish. Reports were manually annotated with entities, negation and uncertainty terms and relations. The corpus was conceived as an evaluation resource for named entity recognition and relation extraction algorithms, and as input for the use of supervised methods. Biomedical annotated resources are scarce due to confidentiality issues and associated costs. This work provides some guidelines that could help other researchers to undertake similar tasks.

* WiNLP Workshop ACL
