Abhay Shanbhag

MahaParaphrase: A Marathi Paraphrase Detection Corpus and BERT-based Models

Aug 24, 2025

Suramya Jadhav, Abhay Shanbhag, Amogh Thakurdesai, Ridhima Sinare, Ananya Joshi, Raviraj Joshi

Abstract: Paraphrases are a vital tool for language understanding tasks such as question answering, style transfer, semantic parsing, and data augmentation. Indic languages are challenging for natural language processing (NLP) due to their rich morphological and syntactic variation, diverse scripts, and limited availability of annotated data. In this work, we present the L3Cube-MahaParaphrase Dataset, a high-quality paraphrase corpus for Marathi, a low-resource Indic language, consisting of 8,000 sentence pairs, each annotated by human experts as either Paraphrase (P) or Non-paraphrase (NP). We also present the results of standard transformer-based BERT models on this dataset. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP

On Limitations of LLM as Annotator for Low Resource Languages

Nov 26, 2024

Suramya Jadhav, Abhay Shanbhag, Amogh Thakurdesai, Ridhima Sinare, Raviraj Joshi

Abstract: Low-resource languages face significant challenges due to the lack of sufficient linguistic data, resources, and tools for tasks such as supervised learning, annotation, and classification. This shortage hinders the development of accurate models and datasets, making it difficult to perform critical NLP tasks like sentiment analysis or hate speech detection. To bridge this gap, Large Language Models (LLMs) offer an opportunity as potential annotators, capable of generating datasets and resources for these underrepresented languages. In this paper, we focus on Marathi, a low-resource language, and evaluate the performance of both closed-source and open-source LLMs as annotators. We assess models such as GPT-4o, Gemini 1.0 Pro, Gemma 2 (2B and 9B), and Llama 3.1 (8B) on classification tasks including sentiment analysis, news classification, and hate speech detection. Our findings reveal that while LLMs excel in annotation tasks for high-resource languages like English, they still fall short when applied to Marathi. Even advanced closed models like Gemini and GPT underperform in comparison to BERT-based baselines, highlighting the limitations of LLMs as annotators for low-resource languages.
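
As a rough illustration of what evaluating an LLM as an annotator can involve, the sketch below compares a list of model-produced labels against gold labels using accuracy and macro-averaged F1. The abstract does not state which metrics the paper reports, so both metrics, the function names, and the toy sentiment labels are assumptions made for this example only.

```python
from typing import List


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of items where the annotator's label matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: gold sentiment labels vs. labels returned by an LLM.
gold_labels = ["pos", "neg", "neu", "pos"]
llm_labels = ["pos", "neg", "pos", "pos"]

print(f"accuracy = {accuracy(gold_labels, llm_labels):.2f}")
print(f"macro F1 = {macro_f1(gold_labels, llm_labels):.2f}")
```

Macro averaging weights every class equally, so a class the annotator never predicts (here `neu`) pulls the score down even when overall accuracy looks acceptable.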

BERT or FastText? A Comparative Analysis of Contextual as well as Non-Contextual Embeddings

Nov 26, 2024

Abhay Shanbhag, Suramya Jadhav, Amogh Thakurdesai, Ridhima Sinare, Raviraj Joshi

Abstract: Natural Language Processing (NLP) for low-resource languages presents significant challenges, particularly due to the scarcity of high-quality annotated data and linguistic resources. The choice of embeddings plays a critical role in enhancing the performance of NLP tasks, such as news classification, sentiment analysis, and hate speech detection, especially for low-resource languages like Marathi. In this study, we investigate the impact of various embedding techniques (contextual BERT-based, non-contextual BERT-based, and FastText-based) on NLP classification tasks specific to the Marathi language. Our research includes a thorough evaluation of both compressed and uncompressed embeddings, providing a comprehensive overview of how these embeddings perform across different scenarios. Specifically, we compare two BERT model embeddings, MuRIL and MahaBERT, as well as two FastText model embeddings, IndicFT and MahaFT. Our evaluation includes applying embeddings to a Multiple Logistic Regression (MLR) classifier for task performance assessment, as well as t-SNE visualizations to observe the spatial distribution of these embeddings. The results demonstrate that contextual embeddings outperform non-contextual embeddings. Furthermore, BERT-based non-contextual embeddings extracted from the first BERT embedding layer yield better results than FastText-based embeddings, suggesting a potential alternative to FastText embeddings.
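
To make the term "non-contextual" concrete: a non-contextual embedding gives each token the same vector regardless of the words around it, so a sentence can be turned into a fixed-size feature vector by a table lookup followed by pooling. The sketch below shows that idea with a tiny hand-written table. The table values, the zero vector for unknown tokens, and the use of mean pooling are all assumptions for illustration; the abstract does not say how the paper pools token vectors, and no real model weights are used here.

```python
from typing import Dict, List

# Hypothetical 3-dimensional static embedding table (toy values, not taken
# from MahaBERT, MuRIL, or any FastText model).
STATIC_TABLE: Dict[str, List[float]] = {
    "bank": [1.0, 0.0, 2.0],
    "river": [0.0, 3.0, 1.0],
    "money": [2.0, 0.0, 0.0],
}
UNK: List[float] = [0.0, 0.0, 0.0]  # assumed fallback for out-of-vocabulary tokens


def sentence_vector(tokens: List[str]) -> List[float]:
    """Mean-pool the static token vectors into one fixed-size feature vector."""
    if not tokens:
        return list(UNK)
    vectors = [STATIC_TABLE.get(tok, UNK) for tok in tokens]
    dim = len(UNK)
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# "bank" contributes the identical vector in both sentences; that is what
# makes the lookup non-contextual.
features_a = sentence_vector(["river", "bank"])
features_b = sentence_vector(["money", "bank"])
print(features_a)
print(features_b)
```

Feature vectors of this kind are what would then be passed to a downstream classifier such as the MLR mentioned in the abstract. A contextual model would instead produce a different vector for "bank" in each sentence.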
