Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Pablo Mosteiro

Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation

Jul 17, 2025

Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro, Massimo Poesio, Ayoub Bagheri, Anastasia Giachanou

Abstract:Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this inf luence, finding that while statistically present, demographic factors account for a minor fraction ( 8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating if guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a more reliable path towards fairness than potentially persona simulation.

Via

Access Paper or Ask Questions

How and where does CLIP process negation?

Jul 15, 2024

Vincent Quantmeyer, Pablo Mosteiro, Albert Gatt

Figure 1 for How and where does CLIP process negation?

Figure 2 for How and where does CLIP process negation?

Figure 3 for How and where does CLIP process negation?

Figure 4 for How and where does CLIP process negation?

Abstract:Various benchmarks have been proposed to test linguistic understanding in pre-trained vision \& language (VL) models. Here we build on the existence task from the VALSE benchmark (Parcalabescu et al, 2022) which we use to test models' understanding of negation, a particularly interesting issue for multimodal models. However, while such VL benchmarks are useful for measuring model performance, they do not reveal anything about the internal processes through which these models arrive at their outputs in such visio-linguistic tasks. We take inspiration from the growing literature on model interpretability to explain the behaviour of VL models on the understanding of negation. Specifically, we approach these questions through an in-depth analysis of the text encoder in CLIP (Radford et al, 2021), a highly influential VL model. We localise parts of the encoder that process negation and analyse the role of attention heads in this task. Our contributions are threefold. We demonstrate how methods from the language model interpretability literature (such as causal tracing) can be translated to multimodal models and tasks; we provide concrete insights into how CLIP processes negation on the VALSE existence task; and we highlight inherent limitations in the VALSE dataset as a benchmark for linguistic understanding.

* Accepted at the 3rd Workshop on Advances in Language and Vision Research (ALVR 2024)

Via

Access Paper or Ask Questions

UU-Tax at SemEval-2022 Task 3: Improving the generalizability of language models for taxonomy classification through data augmentation

Oct 07, 2022

Injy Sarhan, Pablo Mosteiro, Marco Spruit

Figure 1 for UU-Tax at SemEval-2022 Task 3: Improving the generalizability of language models for taxonomy classification through data augmentation

Figure 2 for UU-Tax at SemEval-2022 Task 3: Improving the generalizability of language models for taxonomy classification through data augmentation

Figure 3 for UU-Tax at SemEval-2022 Task 3: Improving the generalizability of language models for taxonomy classification through data augmentation

Figure 4 for UU-Tax at SemEval-2022 Task 3: Improving the generalizability of language models for taxonomy classification through data augmentation

Abstract:This paper presents our strategy to address the SemEval-2022 Task 3 PreTENS: Presupposed Taxonomies Evaluating Neural Network Semantics. The goal of the task is to identify if a sentence is deemed acceptable or not, depending on the taxonomic relationship that holds between a noun pair contained in the sentence. For sub-task 1 -- binary classification -- we propose an effective way to enhance the robustness and the generalizability of language models for better classification on this downstream task. We design a two-stage fine-tuning procedure on the ELECTRA language model using data augmentation techniques. Rigorous experiments are carried out using multi-task learning and data-enriched fine-tuning. Experimental results demonstrate that our proposed model, UU-Tax, is indeed able to generalize well for our downstream task. For sub-task 2 -- regression -- we propose a simple classifier that trains on features obtained from Universal Sentence Encoder (USE). In addition to describing the submitted systems, we discuss other experiments that employ pre-trained language models and data augmentation techniques. For both sub-tasks, we perform error analysis to further understand the behaviour of the proposed models. We achieved a global F1_Binary score of 91.25% in sub-task 1 and a rho score of 0.221 in sub-task 2.

* In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 271-281, Seattle, United States. Association for Computational Linguistics

Via

Access Paper or Ask Questions

Bias Discovery in Machine Learning Models for Mental Health

May 24, 2022

Pablo Mosteiro, Jesse Kuiper, Judith Masthoff, Floortje Scheepers, Marco Spruit

Figure 1 for Bias Discovery in Machine Learning Models for Mental Health

Figure 2 for Bias Discovery in Machine Learning Models for Mental Health

Figure 3 for Bias Discovery in Machine Learning Models for Mental Health

Figure 4 for Bias Discovery in Machine Learning Models for Mental Health

Abstract:Fairness and bias are crucial concepts in artificial intelligence, yet they are relatively ignored in machine learning applications in clinical psychiatry. We computed fairness metrics and present bias mitigation strategies using a model trained on clinical mental health data. We collected structured data related to the admission, diagnosis, and treatment of patients in the psychiatry department of the University Medical Center Utrecht. We trained a machine learning model to predict future administrations of benzodiazepines on the basis of past data. We found that gender plays an unexpected role in the predictions-this constitutes bias. Using the AI Fairness 360 package, we implemented reweighing and discrimination-aware regularization as bias mitigation strategies, and we explored their implications for model performance. This is the first application of bias exploration and mitigation in a machine learning model trained on real clinical psychiatry data.

Via

Access Paper or Ask Questions

Federated learning for violence incident prediction in a simulated cross-institutional psychiatric setting

May 17, 2022

Thomas Borger, Pablo Mosteiro, Heysem Kaya, Emil Rijcken, Albert Ali Salah, Floortje Scheepers, Marco Spruit

Figure 1 for Federated learning for violence incident prediction in a simulated cross-institutional psychiatric setting

Figure 2 for Federated learning for violence incident prediction in a simulated cross-institutional psychiatric setting

Figure 3 for Federated learning for violence incident prediction in a simulated cross-institutional psychiatric setting

Figure 4 for Federated learning for violence incident prediction in a simulated cross-institutional psychiatric setting

Abstract:Inpatient violence is a common and severe problem within psychiatry. Knowing who might become violent can influence staffing levels and mitigate severity. Predictive machine learning models can assess each patient's likelihood of becoming violent based on clinical notes. Yet, while machine learning models benefit from having more data, data availability is limited as hospitals typically do not share their data for privacy preservation. Federated Learning (FL) can overcome the problem of data limitation by training models in a decentralised manner, without disclosing data between collaborators. However, although several FL approaches exist, none of these train Natural Language Processing models on clinical notes. In this work, we investigate the application of Federated Learning to clinical Natural Language Processing, applied to the task of Violence Risk Assessment by simulating a cross-institutional psychiatric setting. We train and compare four models: two local models, a federated model and a data-centralised model. Our results indicate that the federated model outperforms the local models and has similar performance as the data-centralised model. These findings suggest that Federated Learning can be used successfully in a cross-institutional setting and is a step towards new applications of Federated Learning based on clinical notes

* Expert Systems with Applications Volume 199, 1 August 2022, 116720

Via

Access Paper or Ask Questions

Making sense of violence risk predictions using clinical notes

Apr 29, 2022

Pablo Mosteiro, Emil Rijcken, Kalliopi Zervanou, Uzay Kaymak, Floortje Scheepers, Marco Spruit

Figure 1 for Making sense of violence risk predictions using clinical notes

Figure 2 for Making sense of violence risk predictions using clinical notes

Figure 3 for Making sense of violence risk predictions using clinical notes

Figure 4 for Making sense of violence risk predictions using clinical notes

Abstract:Violence risk assessment in psychiatric institutions enables interventions to avoid violence incidents. Clinical notes written by practitioners and available in electronic health records (EHR) are valuable resources that are seldom used to their full potential. Previous studies have attempted to assess violence risk in psychiatric patients using such notes, with acceptable performance. However, they do not explain why classification works and how it can be improved. We explore two methods to better understand the quality of a classifier in the context of clinical note analysis: random forests using topic models, and choice of evaluation metric. These methods allow us to understand both our data and our methodology more profoundly, setting up the groundwork to work on improved models that build upon this understanding. This is particularly important when it comes to the generalizability of evaluated classifiers to new data, a trustworthiness problem that is of great interest due to the increased availability of new data in electronic format.

* In: Huang, Z., Siuly, S., Wang, H., Zhou, R., Zhang, Y. (eds) Health Information Science. HIS 2020. Lecture Notes in Computer Science(), vol 12435. Springer, Cham
* arXiv admin note: substantial text overlap with arXiv:2204.13535

Via

Access Paper or Ask Questions

Machine Learning for Violence Risk Assessment Using Dutch Clinical Notes

Apr 28, 2022

Pablo Mosteiro, Emil Rijcken, Kalliopi Zervanou, Uzay Kaymak, Floortje Scheepers, Marco Spruit

Figure 1 for Machine Learning for Violence Risk Assessment Using Dutch Clinical Notes

Figure 2 for Machine Learning for Violence Risk Assessment Using Dutch Clinical Notes

Figure 3 for Machine Learning for Violence Risk Assessment Using Dutch Clinical Notes

Figure 4 for Machine Learning for Violence Risk Assessment Using Dutch Clinical Notes

Abstract:Violence risk assessment in psychiatric institutions enables interventions to avoid violence incidents. Clinical notes written by practitioners and available in electronic health records are valuable resources capturing unique information, but are seldom used to their full potential. We explore conventional and deep machine learning methods to assess violence risk in psychiatric patients using practitioner notes. The performance of our best models is comparable to the currently used questionnaire-based method, with an area under the Receiver Operating Characteristic curve of approximately 0.8. We find that the deep-learning model BERTje performs worse than conventional machine learning methods. We also evaluate our data and our classifiers to understand the performance of our models better. This is particularly important for the applicability of evaluated classifiers to new data, and is also of great interest to practitioners, due to the increased availability of new data in electronic format.

* Journal of Artificial Intelligence for Medical Sciences Volume 2, Issue 1-2, June 2021, Pages 44 - 54

Via

Access Paper or Ask Questions