Abstract:Named entity recognition (NER) is a crucial task that aims to identify structured information, which is often replete with complex, technical terms and a high degree of variability. Accurate and reliable NER can facilitate the extraction and analysis of important information. However, NER for other than English is challenging due to limited data availability, as the high expertise, time, and expenses are required to annotate its data. In this paper, by using the limited data, we explore various factors including model structure, corpus annotation scheme and data augmentation techniques to improve the performance of a NER model for French. Our experiments demonstrate that these approaches can significantly improve the model's F1 score from original CRF score of 62.41 to 79.39. Our findings suggest that considering different extrinsic factors and combining these techniques is a promising approach for improving NER performance where the size of data is limited.
Abstract:Large Language Models (LLMs), representing a significant achievement in artificial intelligence (AI) research, have demonstrated their ability in a multitude of tasks. This project aims to explore the capabilities of GPT-3.5, a leading example of LLMs, in processing the sentiment analysis of Internet memes. Memes, which include both verbal and visual aspects, act as a powerful yet complex tool for expressing ideas and sentiments, demanding an understanding of societal norms and cultural contexts. Notably, the detection and moderation of hateful memes pose a significant challenge due to their implicit offensive nature. This project investigates GPT's proficiency in such subjective tasks, revealing its strengths and potential limitations. The tasks include the classification of meme sentiment, determination of humor type, and detection of implicit hate in memes. The performance evaluation, using datasets from SemEval-2020 Task 8 and Facebook hateful memes, offers a comparative understanding of GPT responses against human annotations. Despite GPT's remarkable progress, our findings underscore the challenges faced by these models in handling subjective tasks, which are rooted in their inherent limitations including contextual understanding, interpretation of implicit meanings, and data biases. This research contributes to the broader discourse on the applicability of AI in handling complex, context-dependent tasks, and offers valuable insights for future advancements.
Abstract:Traditional evaluation metrics for classification in natural language processing such as accuracy and area under the curve fail to differentiate between models with different predictive behaviors despite their similar performance metrics. We introduce sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level to provide insights into disparities in their decision-making logic. We assess the sensitivity score on a set of representative words in the test set using two classifiers trained for hospital readmission classification with similar performance statistics. Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores. The results indicate that the language model's sensitivity score aligns better with the professionals than the xgboost classifier on tf-idf embeddings, which suggests that xgboost uses some spurious features. Overall, this metric offers a novel perspective on assessing models' robustness by quantifying their discrepancy with professional opinions. Our code is available on GitHub (https://github.com/nyuolab/Model_Sensitivity).