Lokesh Madasu

TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu

Apr 17, 2024

Gopichand Kanumolu, Lokesh Madasu, Nirmal Surange, Manish Shrivastava

Abstract: News headline generation is a crucial task in increasing productivity for both the readers and producers of news. This task can easily be aided by automated news headline-generation models. However, the presence of irrelevant headlines in scraped news articles results in sub-optimal performance of generation models. We propose that relevance-based headline classification can greatly aid the task of generating relevant headlines. Relevance-based headline classification involves categorizing news headlines based on their relevance to the corresponding news articles. While this task is well-established in English, it remains under-explored in low-resource languages like Telugu due to a lack of annotated data. To address this gap, we present TeClass, the first-ever human-annotated Telugu news headline classification dataset, containing 78,534 annotations across 26,178 article-headline pairs. We experiment with various baseline models and provide a comprehensive analysis of their results. We further demonstrate the impact of this work by fine-tuning various headline generation models using the TeClass dataset. The headlines generated by the models fine-tuned on highly relevant article-headline pairs showed an increase of about 5 points in ROUGE-L scores. To encourage future research, the annotated dataset as well as the annotation guidelines will be made publicly available.
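The counts in the abstract work out to exactly three annotations per article-headline pair (78,534 / 26,178 = 3). The sketch below shows one way three labels per pair could be collapsed into a single label by majority vote. This is only an illustration: the abstract does not say how the annotations are aggregated, and the label names (`high`, `medium`, `low`) and the function `majority_label` are hypothetical.

```python
from collections import Counter

# Figures taken from the abstract.
TOTAL_ANNOTATIONS = 78_534
TOTAL_PAIRS = 26_178
ANNOTATIONS_PER_PAIR = TOTAL_ANNOTATIONS // TOTAL_PAIRS  # exact division


def majority_label(labels):
    """Return the most frequent label, or None when the top two labels tie.

    Assumption: majority voting is used here purely for illustration;
    the dataset's actual aggregation rule is not stated in the abstract.
    """
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority, e.g. three different labels
    return ranked[0][0]


if __name__ == "__main__":
    print(ANNOTATIONS_PER_PAIR)
    print(majority_label(["high", "high", "low"]))
    print(majority_label(["high", "medium", "low"]))
```

Pairs without a clear majority are returned as `None` so that a downstream step can decide whether to drop them or send them for adjudication.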

* Accepted at LREC-COLING 2024

SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 14 Languages

Feb 15, 2024

Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Abinew Ali Ayele, Pavan Baswani (+17 more)

Abstract: Exploring and quantifying semantic relatedness is central to representing language. It holds significant implications across various NLP tasks, including offering insights into the capabilities and performance of Large Language Models (LLMs). While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, related challenges when building the datasets, and their impact and utility in NLP. We further report experiments for each language and across the different languages.
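To give a feel for how comparative judgments can be turned into real-valued scores, here is a minimal sketch of Best-Worst Scaling, one widely used comparative annotation scheme. Whether it matches the exact framework used for SemRel is an assumption, since the abstract does not name the scheme. The function `best_worst_scores` and the item identifiers (`p1` to `p4`) are hypothetical.

```python
from collections import defaultdict


def best_worst_scores(judgments):
    """Score each item as (#times best - #times worst) / #times shown.

    `judgments` is an iterable of (items, best, worst) tuples, where
    `items` is the group shown to an annotator and `best` / `worst` are
    the items picked as most and least related. Scores lie in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, picked_best, picked_worst in judgments:
        for item in items:
            shown[item] += 1
        best[picked_best] += 1
        worst[picked_worst] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


if __name__ == "__main__":
    group = ("p1", "p2", "p3", "p4")  # four sentence pairs shown together
    judgments = [
        (group, "p1", "p4"),  # annotator A
        (group, "p1", "p3"),  # annotator B
    ]
    for item, score in sorted(best_worst_scores(judgments).items()):
        print(item, score)
```

Scores produced this way can be linearly rescaled to another range if a dataset reports relatedness on a different scale.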

* 18 pages

Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference?

Dec 03, 2023

Gopichand Kanumolu, Lokesh Madasu, Pavan Baswani, Ananya Mukherjee, Manish Shrivastava

Abstract: Fluency is a crucial goal of all Natural Language Generation (NLG) systems. Widely used automatic evaluation metrics fall short in capturing the fluency of machine-generated text. Assessing the fluency of NLG systems poses a challenge since these models are not limited to simply reusing words from the input but may also generate abstractions. Existing reference-based fluency evaluations, such as word overlap measures, often exhibit weak correlations with human judgments. This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference. Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures. We also experiment with other available multilingual Language Models (LMs). To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages, correlating the obtained fluency scores with human judgments. Our code and human-annotated benchmark test set for fluency are available at https://github.com/AnanyaCoder/TextFluencyForIndicLanaguges.
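The idea of scoring fluency from a language model alone, with no reference text, can be sketched as a length-normalised log-probability. The example below swaps in a tiny add-one-smoothed bigram model in place of the RNN language models the abstract describes, purely to keep the sketch self-contained; the class `BigramLM`, its methods, and the toy English corpus are all hypothetical and are not taken from the authors' repository.

```python
import math
from collections import Counter


class BigramLM:
    """Toy bigram language model with add-one smoothing (illustrative only)."""

    def __init__(self, corpus):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = {"<s>", "</s>"}
        for sentence in corpus:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def log_prob(self, prev, word):
        # One extra vocabulary slot is reserved for unseen words.
        vocab_size = len(self.vocab) + 1
        numerator = self.bigram_counts[(prev, word)] + 1
        denominator = self.context_counts[prev] + vocab_size
        return math.log(numerator / denominator)

    def fluency(self, sentence):
        """Average log-probability per predicted token; higher is more fluent."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        total = sum(self.log_prob(p, w) for p, w in zip(tokens, tokens[1:]))
        return total / (len(tokens) - 1)


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "a cat saw the dog",
    ]
    lm = BigramLM(corpus)
    print(lm.fluency("the cat sat on the rug"))
    print(lm.fluency("rug the on sat cat the"))
```

In an evaluation like the one the abstract describes, scores of this kind would then be correlated with human fluency ratings for the same sentences.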

* Accepted at IJCNLP-AACL SEALP Workshop

Mukhyansh: A Headline Generation Dataset for Indic Languages

Nov 29, 2023

Lokesh Madasu, Gopichand Kanumolu, Nirmal Surange, Manish Shrivastava

Abstract: The task of headline generation within the realm of Natural Language Processing (NLP) holds immense significance, as it strives to distill the true essence of textual content into concise and attention-grabbing summaries. While noteworthy progress has been made in headline generation for widely spoken languages like English, there persist numerous challenges when it comes to generating headlines in low-resource languages, such as the rich and diverse Indian languages. A prominent obstacle that specifically hinders headline generation in Indian languages is the scarcity of high-quality annotated data. To address this crucial gap, we proudly present Mukhyansh, an extensive multilingual dataset, tailored for Indian language headline generation. Comprising an impressive collection of over 3.39 million article-headline pairs, Mukhyansh spans across eight prominent Indian languages, namely Telugu, Tamil, Kannada, Malayalam, Hindi, Bengali, Marathi, and Gujarati. We present a comprehensive evaluation of several state-of-the-art baseline models. Additionally, through an empirical analysis of existing works, we demonstrate that Mukhyansh outperforms all other models, achieving an impressive average ROUGE-L score of 31.43 across all 8 languages.
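For readers unfamiliar with the metric quoted above, ROUGE-L is based on the longest common subsequence (LCS) between a generated headline and its reference. The following is a minimal sketch of an F1-style ROUGE-L over whitespace tokens. Published implementations differ in tokenisation, stemming, and precision/recall weighting, so this sketch should not be expected to reproduce the reported numbers; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """F1-style ROUGE-L between two strings, in the range [0, 1].

    Assumption: whitespace tokenisation and equal weighting of
    precision and recall.
    """
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    candidate = "the cat is on the mat"
    print(round(rouge_l_f1(reference, candidate), 4))
```

Scores such as the 31.43 in the abstract are conventionally this kind of value multiplied by 100 and averaged over a test set.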

* Accepted at PACLIC 2023
