Sowmya Vajjala

Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?

May 22, 2025

Gaurav Kamath, Sowmya Vajjala

Abstract:Named Entity Recognition (NER) for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.

* pre-print

Text Classification in the LLM Era - Where do we stand?

Feb 17, 2025

Sowmya Vajjala, Shwetali Shimangaud

Abstract:Large Language Models revolutionized NLP and showed dramatic performance improvements across several tasks. In this paper, we investigated the role of such language models in text classification and how they compare with other approaches relying on smaller pre-trained language models. Considering 32 datasets spanning 8 languages, we compared zero-shot classification, few-shot fine-tuning and synthetic data based classifiers with classifiers built using the complete human labeled dataset. Our results show that zero-shot approaches do well for sentiment classification, but are outperformed by other approaches for the rest of the tasks, and synthetic data sourced from multiple LLMs can build better classifiers than zero-shot open LLMs. We also see wide performance disparities across languages in all the classification scenarios. We expect that these findings would guide practitioners working on developing text classification systems across languages.

* Pre-print

LLMs in Education: Novel Perspectives, Challenges, and Opportunities

Sep 18, 2024

Bashar Alhafni, Sowmya Vajjala, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar

Abstract:The role of large language models (LLMs) in education is an increasing area of interest today, considering the new opportunities they offer for teaching, learning, and assessment. This cutting-edge tutorial provides an overview of the educational applications of NLP and the impact that the recent advances in LLMs have had on this field. We will discuss the key challenges and opportunities presented by LLMs, grounding them in the context of four major educational applications: reading, writing, and speaking skills, and intelligent tutoring systems (ITS). This COLING 2025 tutorial is designed for researchers and practitioners interested in the educational applications of NLP and the role LLMs have to play in this area. It is the first of its kind to address this timely topic.

* COLING 2025 Tutorial

Annotation Errors and NER: A Study with OntoNotes 5.0

Jun 27, 2024

Gabriel Bernier-Colborne, Sowmya Vajjala

Abstract:Named Entity Recognition (NER) is a well-studied problem in NLP. However, there is much less focus on studying NER datasets, compared to developing new NER models. In this paper, we employed three simple techniques to detect annotation errors in the OntoNotes 5.0 corpus for English NER, which is the largest available NER corpus for English. Our techniques corrected ~10% of the sentences in train/dev/test data. In terms of entity mentions, we corrected the span and/or type of ~8% of mentions in the dataset, while adding/deleting/splitting/merging a few more. These are large numbers of changes, considering the size of OntoNotes. We used three NER libraries to train, evaluate and compare the models trained with the original and the re-annotated datasets, which showed an average improvement of 1.23% in overall F-scores, with large (>10%) improvements for some of the entity types. While our annotation error detection methods are not exhaustive and there is some manual annotation effort involved, they are largely language agnostic and can be employed with other NER datasets, and other sequence labelling tasks.

* Unpublished report. Originally submitted to LREC 2022

Dravidian language family through Universal Dependencies lens

Jun 20, 2024

Taraka Rama, Sowmya Vajjala

Abstract:The Universal Dependencies (UD) project aims to create a cross-linguistically consistent dependency annotation for multiple languages, to facilitate multilingual NLP. It currently supports 114 languages. Dravidian languages are spoken by over 200 million people across the world, and yet there are only two languages from this family in UD. This paper examines some of the morphological and syntactic features of Dravidian languages and explores how they can be annotated in the UD framework.

* unpublished report from 2021

Scope Ambiguities in Large Language Models

Apr 05, 2024

Gaurav Kamath, Sebastian Schuster, Sowmya Vajjala, Siva Reddy

Abstract:Sentences containing multiple semantic operators with overlapping scope often create ambiguities in interpretation, known as scope ambiguities. These ambiguities offer rich insights into the interaction between semantic structure and world knowledge in language processing. Despite this, there has been little research into how modern large language models treat them. In this paper, we investigate how different versions of certain autoregressive language models -- GPT-2, GPT-3/3.5, Llama 2 and GPT-4 -- treat scope ambiguous sentences, and compare this with human judgments. We introduce novel datasets that contain a joint total of almost 1,000 unique scope-ambiguous sentences, containing interactions between a range of semantic operators, and annotated for human judgments. Using these datasets, we find evidence that several models (i) are sensitive to the meaning ambiguity in these sentences, in a way that patterns well with human judgments, and (ii) can successfully identify human-preferred readings at a high level of accuracy (over 90% in some cases).

* To be published in Transactions of the Association for Computational Linguistics

A Multilingual Evaluation of NER Robustness to Adversarial Inputs

May 30, 2023

Akshay Srinivasan, Sowmya Vajjala

Abstract:Adversarial evaluations of language models typically focus on English alone. In this paper, we performed a multilingual evaluation of Named Entity Recognition (NER) in terms of its robustness to small perturbations in the input. Our results showed the NER models we explored across three languages (English, German and Hindi) are not very robust to such changes, as indicated by the fluctuations in the overall F1 score as well as in a more fine-grained evaluation. With that knowledge, we further explored whether it is possible to improve the existing NER models using a part of the generated adversarial data sets as augmented training data to train a new NER model or as fine-tuning data to adapt an existing NER model. Our results showed that both these approaches improve performance on the original as well as adversarial test sets. While there is no significant difference between the two approaches for English, re-training is significantly better than fine-tuning for German and Hindi.

* Paper accepted at Repl4NLP workshop, ACL 2023

Automatic Text Simplification of News Articles in the Context of Public Broadcasting

Dec 26, 2022

Diego Maupomé, Fanny Rancourt, Thomas Soulas, Alexandre Lachance, Marie-Jean Meurs, Desislava Aleksandrova, Olivier Brochu Dufour, Igor Pontes, Rémi Cardon, Michel Simard (+1 more)

Abstract:This report summarizes the work carried out by the authors during the Twelfth Montreal Industrial Problem Solving Workshop, held at Université de Montréal in August 2022. The team tackled a problem submitted by CBC/Radio-Canada on the theme of Automatic Text Simplification (ATS).

What do we Really Know about State of the Art NER?

May 04, 2022

Sowmya Vajjala, Ramya Balasubramaniam

Abstract:Named Entity Recognition (NER) is a well-researched NLP task and is widely used in real-world NLP scenarios. NER research typically focuses on the creation of new ways of training NER, with relatively less emphasis on resources and evaluation. Further, state of the art (SOTA) NER models, trained on standard datasets, typically report only a single performance measure (F-score) and we don't really know how well they do for different entity types and genres of text, or how robust they are to new, unseen entities. In this paper, we perform a broad evaluation of NER using a popular dataset, taking into consideration the various text genres and sources that constitute the dataset. Additionally, we generate six new adversarial test sets through small perturbations in the original test set, replacing select entities while retaining the context. We also train and test our models on randomly generated train/dev/test splits, followed by an experiment where the models are trained on a select set of genres but tested on genres not seen in training. These comprehensive evaluation strategies were performed using three SOTA NER models. Based on our results, we recommend some useful reporting practices for NER researchers that could help in providing a better understanding of a SOTA model's performance in the future.

* LREC 2022
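
The entity-replacement perturbation mentioned in this abstract can be illustrated with a minimal sketch. It assumes BIO-tagged tokens and a hand-made replacement list. The function names `entity_spans` and `replace_entities`, the sample sentence, and the replacement names are hypothetical and are not taken from the paper, whose exact procedure is not described in this listing.

```python
import random
from typing import Dict, List, Sequence, Tuple


def entity_spans(tags: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, type) for each entity in a BIO tag sequence."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # The open entity continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


def replace_entities(
    tokens: Sequence[str],
    tags: Sequence[str],
    replacements: Dict[str, List[str]],  # hypothetical: entity type -> candidate names
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Swap each entity whose type has candidates; keep all context tokens unchanged."""
    new_tokens: List[str] = []
    new_tags: List[str] = []
    cursor = 0
    for start, end, etype in entity_spans(tags):
        # Copy the untouched context before this entity.
        new_tokens.extend(tokens[cursor:start])
        new_tags.extend(tags[cursor:start])
        if etype in replacements:
            repl = rng.choice(replacements[etype]).split()
            new_tokens.extend(repl)
            # Re-tag the replacement so the gold labels stay aligned.
            new_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(repl) - 1))
        else:
            new_tokens.extend(tokens[start:end])
            new_tags.extend(tags[start:end])
        cursor = end
    new_tokens.extend(tokens[cursor:])
    new_tags.extend(tags[cursor:])
    return new_tokens, new_tags


if __name__ == "__main__":
    toks = ["Angela", "Merkel", "visited", "Paris", "."]
    tgs = ["B-PER", "I-PER", "O", "B-LOC", "O"]
    print(replace_entities(toks, tgs, {"LOC": ["Addis Ababa"]}, random.Random(0)))
```

Because the replacement may have a different number of tokens than the original mention, the sketch rebuilds the tag sequence together with the tokens, so a model's predictions on the perturbed sentence can still be scored against aligned gold labels.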

A Neural Pairwise Ranking Model for Readability Assessment

Mar 14, 2022

Justin Lee, Sowmya Vajjala

Abstract:Automatic Readability Assessment (ARA), the task of assigning a reading level to a text, is traditionally treated as a classification problem in NLP research. In this paper, we propose the first neural, pairwise ranking approach to ARA and compare it with existing classification, regression, and (non-neural) ranking methods. We establish the performance of our model by conducting experiments with three English datasets, one French dataset and one Spanish dataset. We demonstrate that our approach performs well in monolingual single/cross corpus testing scenarios and achieves a zero-shot cross-lingual ranking accuracy of over 80% for both French and Spanish when trained on English data. We also release a new parallel bilingual readability dataset in English and French. To our knowledge, this paper proposes the first neural pairwise ranking model for ARA, and shows the first results of cross-lingual, zero-shot evaluation of ARA with neural models.

* to appear in Findings of ACL 2022
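
As a rough illustration of what a pairwise ranking accuracy measures, the sketch below scores a ranker by the fraction of document pairs with different gold reading levels that it orders correctly. This is one common definition and is an assumption here; the paper's exact metric is not specified in this listing, and the function name `pairwise_ranking_accuracy` and the toy values are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_ranking_accuracy(scores: Sequence[float], levels: Sequence[int]) -> float:
    """Fraction of pairs with different gold levels that the scores order correctly.

    scores: model outputs, higher meaning "harder to read" (assumed convention).
    levels: gold reading levels, higher meaning harder.
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(levels)), 2):
        if levels[i] == levels[j]:
            continue  # pairs at the same level carry no ordering information
        total += 1
        # The pair is correct when score and level differences share a sign.
        if (scores[i] - scores[j]) * (levels[i] - levels[j]) > 0:
            correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Three toy documents at levels 1, 2, 3; the first two are scored in the wrong order.
    print(pairwise_ranking_accuracy([0.2, 0.1, 0.9], [1, 2, 3]))
```

A metric of this kind depends only on relative order, not on the label inventory of a particular corpus, which is one reason ranking-style evaluation is convenient when training and test data use different level schemes.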
