Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Pushpak Bhattacharyya

Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach

Jul 02, 2025

Aditya Tomar, Rudra Murthy, Pushpak Bhattacharyya

Abstract:Bias and stereotypes in language models can cause harm, especially in sensitive areas like content moderation and decision-making. This paper addresses bias and stereotype detection by exploring how jointly learning these tasks enhances model performance. We introduce StereoBias, a unique dataset labeled for bias and stereotype detection across five categories: religion, gender, socio-economic status, race, profession, and others, enabling a deeper study of their relationship. Our experiments compare encoder-only models and fine-tuned decoder-only models using QLoRA. While encoder-only models perform well, decoder-only models also show competitive results. Crucially, joint training on bias and stereotype detection significantly improves bias detection compared to training them separately. Additional experiments with sentiment analysis confirm that the improvements stem from the connection between bias and stereotypes, not multi-task learning alone. These findings highlight the value of leveraging stereotype information to build fairer and more effective AI systems.

Via

Access Paper or Ask Questions

Mathematics Isn't Culture-Free: Probing Cultural Gaps via Entity and Scenario Perturbations

Jul 01, 2025

Aditya Tomar, Nihar Ranjan Sahoo, Ashish Mittal, Rudra Murthy, Pushpak Bhattacharyya

Abstract:Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks like GSM8K are predominantly rooted in Western norms, including names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions Africa, India, China, Korea, and Japan using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks

Via

Access Paper or Ask Questions

Understand the Implication: Learning to Think for Pragmatic Understanding

Jun 16, 2025

Settaluri Lakshmi Sravanthi, Kishan Maharaj, Sravani Gunnu, Abhijit Mishra, Pushpak Bhattacharyya

Abstract:Pragmatics, the ability to infer meaning beyond literal interpretation, is crucial for social cognition and communication. While LLMs have been benchmarked for their pragmatic understanding, improving their performance remains underexplored. Existing methods rely on annotated labels but overlook the reasoning process humans naturally use to interpret implicit meaning. To bridge this gap, we introduce a novel pragmatic dataset, ImpliedMeaningPreference, that includes explicit reasoning (thoughts) for both correct and incorrect interpretations. Through preference-tuning and supervised fine-tuning, we demonstrate that thought-based learning significantly enhances LLMs' pragmatic understanding, improving accuracy by 11.12% across model families. We further discuss a transfer-learning study where we evaluate the performance of thought-based training for the other tasks of pragmatics (presupposition, deixis) that are not seen during the training time and observe an improvement of 16.10% compared to label-trained models.

* SS and KM contributed equally to this work

Via

Access Paper or Ask Questions

Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings

Apr 04, 2025

Kaustubh Shivshankar Shejole, Pushpak Bhattacharyya

Abstract:Stereotypes are known to be highly pernicious, making their detection critically important. However, current research predominantly focuses on detecting and evaluating stereotypical biases in LLMs, leaving the study of stereotypes in its early stages. Many studies have failed to clearly distinguish between stereotypes and stereotypical biases, which has significantly slowed progress in advancing research in this area. Stereotype and anti-stereotype detection is a problem that requires knowledge of society; hence, it is one of the most difficult areas in Responsible AI. This work investigates this task, where we propose a four-tuple definition and provide precise terminology distinguishing stereotype, anti-stereotype, stereotypical bias, and bias, offering valuable insights into their various aspects. In this paper, we propose StereoDetect, a high-quality benchmarking dataset curated for this task by optimally utilizing current datasets such as StereoSet and WinoQueer, involving a manual verification process and the transfer of semantic information. We demonstrate that language models for reasoning with fewer than 10B parameters often get confused when detecting anti-stereotypes. We also demonstrate the critical importance of well-curated datasets by comparing our model with other current models for stereotype detection. The dataset and code is available at https://github.com/KaustubhShejole/StereoDetect.

Via

Access Paper or Ask Questions

Main Predicate and Their Arguments as Explanation Signals For Intent Classification

Feb 03, 2025

Sameer Pimparkhede, Pushpak Bhattacharyya

Figure 1 for Main Predicate and Their Arguments as Explanation Signals For Intent Classification

Figure 2 for Main Predicate and Their Arguments as Explanation Signals For Intent Classification

Figure 3 for Main Predicate and Their Arguments as Explanation Signals For Intent Classification

Figure 4 for Main Predicate and Their Arguments as Explanation Signals For Intent Classification

Abstract:Intent classification is crucial for conversational agents (chatbots), and deep learning models perform well in this area. However, little research has been done on the explainability of intent classification due to the absence of suitable benchmark data. Human annotation of explanation signals in text samples is time-consuming and costly. However, from inspection of data on intent classification, we see that, more often than not, the main verb denotes the action, and the direct object indicates the domain of conversation, serving as explanation signals for intent. This observation enables us to hypothesize that the main predicate in the text utterances, along with the arguments of the main predicate, can serve as explanation signals. Leveraging this, we introduce a new technique to automatically augment text samples from intent classification datasets with word-level explanations. We mark main predicates (primarily verbs) and their arguments (dependency relations) as explanation signals in benchmark intent classification datasets ATIS and SNIPS, creating a unique 21k-instance dataset for explainability. Further, we experiment with deep learning and language models. We observe that models that work well for classification do not perform well in explainability metrics like plausibility and faithfulness. We also observe that guiding models to focus on explanation signals from our dataset during training improves the plausibility Token F1 score by 3-4%, improving the model's reasoning.

Via

Access Paper or Ask Questions

Giving the Old a Fresh Spin: Quality Estimation-Assisted Constrained Decoding for Automatic Post-Editing

Jan 28, 2025

Sourabh Deoghare, Diptesh Kanojia, Pushpak Bhattacharyya

Abstract:Automatic Post-Editing (APE) systems often struggle with over-correction, where unnecessary modifications are made to a translation, diverging from the principle of minimal editing. In this paper, we propose a novel technique to mitigate over-correction by incorporating word-level Quality Estimation (QE) information during the decoding process. This method is architecture-agnostic, making it adaptable to any APE system, regardless of the underlying model or training approach. Our experiments on English-German, English-Hindi, and English-Marathi language pairs show the proposed approach yields significant improvements over their corresponding baseline APE systems, with TER gains of $0.65$, $1.86$, and $1.44$ points, respectively. These results underscore the complementary relationship between QE and APE tasks and highlight the effectiveness of integrating QE information to reduce over-correction in APE systems.

* Accepted to NAACL 2025 Main Conference: Short Papers

Via

Access Paper or Ask Questions

"My life is miserable, have to sign 500 autographs everyday": Exposing Humblebragging, the Brags in Disguise

Dec 28, 2024

Sharath Naganna, Saprativa Bhattacharjee, Pushpak Bhattacharyya, Biplab Banerjee

Figure 1 for "My life is miserable, have to sign 500 autographs everyday": Exposing Humblebragging, the Brags in Disguise

Figure 2 for "My life is miserable, have to sign 500 autographs everyday": Exposing Humblebragging, the Brags in Disguise

Figure 3 for "My life is miserable, have to sign 500 autographs everyday": Exposing Humblebragging, the Brags in Disguise

Figure 4 for "My life is miserable, have to sign 500 autographs everyday": Exposing Humblebragging, the Brags in Disguise

Abstract:Humblebragging is a phenomenon where individuals present self-promotional statements under the guise of modesty or complaints. For example, a statement like, "Ugh, I can't believe I got promoted to lead the entire team. So stressful!", subtly highlights an achievement while pretending to be complaining. Detecting humblebragging is important for machines to better understand the nuances of human language, especially in tasks like sentiment analysis and intent recognition. However, this topic has not yet been studied in computational linguistics. For the first time, we introduce the task of automatically detecting humblebragging in text. We formalize the task by proposing a 4-tuple definition of humblebragging and evaluate machine learning, deep learning, and large language models (LLMs) on this task, comparing their performance with humans. We also create and release a dataset called HB24, containing 3,340 humblebrags generated using GPT-4o. Our experiments show that detecting humblebragging is non-trivial, even for humans. Our best model achieves an F1-score of 0.88. This work lays the foundation for further exploration of this nuanced linguistic phenomenon and its integration into broader natural language understanding systems.

* Under review at ARR

Via

Access Paper or Ask Questions

"Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)

Dec 27, 2024

Pritam Sil, Bhaskaran Raman, Pushpak Bhattacharyya

Figure 1 for "Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)

Figure 2 for "Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)

Figure 3 for "Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)

Figure 4 for "Did my figure do justice to the answer?" : Towards Multimodal Short Answer Grading with Feedback (MMSAF)

Abstract:Personalized feedback plays a vital role in a student's learning process. While existing systems are adept at providing feedback over MCQ-based evaluation, this work focuses more on subjective and open-ended questions, which is similar to the problem of Automatic Short Answer Grading (ASAG) with feedback. Additionally, we introduce the Multimodal Short Answer grading with Feedback (MMSAF) problem over the traditional ASAG feedback problem to address the scenario where the student answer and reference answer might contain images. Moreover, we introduce the MMSAF dataset with 2197 data points along with an automated framework for generating such data sets. Our evaluations on existing LLMs over this dataset achieved an overall accuracy of 55\% on Level of Correctness labels, 75\% on Image Relevance labels and a score of 4.27 out of 5 in correctness level of LLM generated feedback as rated by experts. As per experts, Pixtral achieved a rating of above 4 out of all metrics, indicating that it is more aligned to human judgement, and that it is the best solution for assisting students.

Via

Access Paper or Ask Questions

Reconsidering SMT Over NMT for Closely Related Languages: A Case Study of Persian-Hindi Pair

Dec 22, 2024

Waisullah Yousofi, Pushpak Bhattacharyya

Abstract:This paper demonstrates that Phrase-Based Statistical Machine Translation (PBSMT) can outperform Transformer-based Neural Machine Translation (NMT) in moderate-resource scenarios, specifically for structurally similar languages, like the Persian-Hindi pair. Despite the Transformer architecture's typical preference for large parallel corpora, our results show that PBSMT achieves a BLEU score of 66.32, significantly exceeding the Transformer-NMT score of 53.7 on the same dataset. Additionally, we explore variations of the SMT architecture, including training on Romanized text and modifying the word order of Persian sentences to match the left-to-right (LTR) structure of Hindi. Our findings highlight the importance of choosing the right architecture based on language pair characteristics and advocate for SMT as a high-performing alternative, even in contexts commonly dominated by NMT.

* 5 pages, 2 figures

Via

Access Paper or Ask Questions

Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages

Dec 14, 2024

Poulami Ghosh, Raj Dabre, Pushpak Bhattacharyya

Figure 1 for Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages

Figure 2 for Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages

Figure 3 for Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages

Figure 4 for Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages

Abstract:Pre-trained language models (PLMs) are known to be susceptible to perturbations to the input text, but existing works do not explicitly focus on linguistically grounded attacks, which are subtle and more prevalent in nature. In this paper, we study whether PLMs are agnostic to linguistically grounded attacks or not. To this end, we offer the first study addressing this, investigating different Indic languages and various downstream tasks. Our findings reveal that although PLMs are susceptible to linguistic perturbations, when compared to non-linguistic attacks, PLMs exhibit a slightly lower susceptibility to linguistic attacks. This highlights that even constrained attacks are effective. Moreover, we investigate the implications of these outcomes across a range of languages, encompassing diverse language families and different scripts.

* Work in Progress

Via

Access Paper or Ask Questions