Abstract: In this study, we take a closer look at how Winograd schema challenges can be used to evaluate common sense reasoning in LLMs. Specifically, we evaluate generative models of different sizes on the popular WinoGrande benchmark. We release WinoWhat, a new corpus in which each instance of the WinoGrande validation set is paraphrased. Additionally, we evaluate performance on the challenge across five common sense knowledge categories, giving more fine-grained insight into which types of knowledge are most challenging for LLMs. Surprisingly, all models perform significantly worse on WinoWhat, implying that LLM reasoning capabilities are overestimated on WinoGrande. To verify whether this is an effect of benchmark memorization, we match benchmark instances to LLM training data and create two test suites. We observe that memorization has a minimal effect on model performance on WinoGrande.
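As a rough illustration of what matching benchmark instances to training data could look like (the abstract does not specify the paper's actual procedure), the sketch below flags instances whose n-grams overlap heavily with a training corpus; the function names, n-gram size, and threshold are all illustrative assumptions.

```python
# Hypothetical sketch: flag benchmark items that appear (near-)verbatim
# in a model's training data via n-gram overlap. Not the paper's method.

def ngrams(tokens, n=8):
    """Return the set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_memorized(item_text, training_ngrams, n=8, threshold=0.5):
    """Flag an instance if enough of its n-grams occur in the training data."""
    item_ngrams = ngrams(item_text.lower().split(), n)
    if not item_ngrams:
        return False
    overlap = len(item_ngrams & training_ngrams) / len(item_ngrams)
    return overlap >= threshold

# Splitting the validation set into "seen" and "unseen" test suites;
# `validation_items` and `training_ngrams` are hypothetical inputs.
# seen = [x for x in validation_items if is_memorized(x, training_ngrams)]
# unseen = [x for x in validation_items if not is_memorized(x, training_ngrams)]
```

Comparing model accuracy on the two resulting suites then indicates how much of the benchmark score can be attributed to memorization rather than reasoning.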
Abstract: Emotion lexica are commonly used resources to combat data poverty in automatic emotion detection. However, methodological issues emerge when employing them: lexica are often not very extensive, and the way they are constructed can vary widely -- from lab conditions to crowdsourced approaches and distant supervision. Furthermore, both categorical and dimensional frameworks coexist, in which theorists provide many different sets of categorical labels or dimensional axes. The heterogeneous nature of the resulting emotion detection resources creates a need for a unified approach to utilising them. This paper contributes to the field of emotion analysis in NLP by a) presenting the first study to unify existing emotion detection resources automatically and thus learn more about the relationships between them; b) exploring the use of existing lexica for this unification task; c) presenting an approach to automatically combining emotion lexica, namely via a multi-view variational auto-encoder (VAE), which facilitates the mapping of datasets into a joint emotion label space. We test the utility of joint emotion lexica by using them as additional features in state-of-the-art emotion detection models. Our overall finding is that emotion lexica can offer complementary information to even extremely large pre-trained models such as BERT. Our models perform comparably to state-of-the-art models that are specifically engineered for certain datasets, and even outperform the state of the art on four datasets.
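To make the multi-view idea concrete, here is a minimal sketch of a multi-view VAE in which each lexicon is a "view" with its own encoder and decoder over a shared latent space; all layer sizes, names, and the training step are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: a multi-view VAE mapping words scored in several
# emotion lexica (views) into one shared latent "joint emotion space".
import torch
import torch.nn as nn

class MultiViewVAE(nn.Module):
    def __init__(self, view_dims, latent_dim=8):
        super().__init__()
        # One encoder/decoder per lexicon; views may use categorical
        # labels or dimensional axes, hence differing dimensionalities.
        self.encoders = nn.ModuleList(
            nn.Linear(d, 2 * latent_dim) for d in view_dims
        )
        self.decoders = nn.ModuleList(
            nn.Linear(latent_dim, d) for d in view_dims
        )

    def forward(self, x, view):
        mu, logvar = self.encoders[view](x).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterisation trick
        # Decode into every view, so a word annotated in one lexicon
        # receives predicted scores in all the others.
        recons = [dec(z) for dec in self.decoders]
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return recons, kl

# Toy usage: two lexica, one with 8 categorical labels, one with 3 axes (VAD).
model = MultiViewVAE(view_dims=[8, 3])
x = torch.rand(16, 8)  # batch of 16 words scored in view 0
recons, kl = model(x, view=0)
loss = nn.functional.mse_loss(recons[0], x) + kl.mean()
loss.backward()
```

The shared latent vector z plays the role of the joint emotion label space: once trained, it can be decoded into any view, or attached as an additional feature vector to a downstream emotion detection model.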