Abstract:Despite large language models (LLMs) being known to exhibit bias against non-mainstream varieties, there are no known labeled datasets for sentiment analysis of English. To address this gap, we introduce BESSTIE, a benchmark for sentiment and sarcasm classification for three varieties of English: Australian (en-AU), Indian (en-IN), and British (en-UK). Using web-based content from two domains, namely, Google Place reviews and Reddit comments, we collect datasets for these language varieties using two methods: location-based and topic-based filtering. Native speakers of the language varieties manually annotate the datasets with sentiment and sarcasm labels. Subsequently, we fine-tune nine large language models (LLMs) (representing a range of encoder/decoder and mono/multilingual models) on these datasets, and evaluate their performance on the two tasks. Our results reveal that the models consistently perform better on inner-circle varieties (i.e., en-AU and en-UK), with significant performance drops for en-IN, particularly in sarcasm detection. We also report challenges in cross-variety generalisation, highlighting the need for language variety-specific datasets such as ours. BESSTIE promises to be a useful evaluative benchmark for future research in equitable LLMs, specifically in terms of language varieties. The BESSTIE datasets, code, and models are currently available on request, while the paper is under review. Please email aditya.joshi@unsw.edu.au.
Abstract:This paper investigates data sampling strategies to create a benchmark for dialectal sentiment classification of Google Places reviews written in English. Based on location-based filtering, we collect a self-supervised dataset of reviews in Australian (Australian English), Indian (Indian English), and British (British English) English with self-supervised sentiment labels (1-star to 5-star). We employ sampling techniques based on label semantics, review length, and sentiment proportion and report performances on three fine-tuned BERT-based models. Our multi-dialect evaluation provides pointers to challenging scenarios for inner-circle (Australian English and British English) as well as non-native dialects (Indian English) of English, highlighting the need for more diverse benchmarks.