The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish--a hybrid of Hindi and English--used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
Social media (SM) platforms (e.g. Facebook, Twitter, and Reddit) are increasingly leveraged to share opinions and emotions, specifically during challenging events, such as natural disasters, pandemics, and political elections, and joyful occasions like festivals and celebrations. Among the SM platforms, Reddit provides a unique space for its users to anonymously express their experiences and thoughts on sensitive issues such as health and daily life. In this work, we present a novel dataset, called NepEMO, for multi-label emotion (MLE) and sentiment classification (SC) on the Nepali subreddit post. We curate and build a manually annotated dataset of 4,462 posts (January 2019- June 2025) written in English, Romanised Nepali and Devanagari script for five emotions (fear, anger, sadness, joy, and depression) and three sentiment classes (positive, negative, and neutral). We perform a detailed analysis of posts to capture linguistic insights, including emotion trends, co-occurrence of emotions, sentiment-specific n-grams, and topic modelling using Latent Dirichlet Allocation and TF-IDF keyword extraction. Finally, we compare various traditional machine learning (ML), deep learning (DL), and transformer models for MLE and SC tasks. The result shows that transformer models consistently outperform the ML and DL models for both tasks.
This paper asks whether promotional Twitter/X bots form behavioural families and whether members evolve similarly. We analyse 2,798,672 tweets from 2,615 ground-truth promotional bot accounts (2006-2021), focusing on complete years 2009 to 2020. Each bot is encoded as a sequence of symbolic blocks (``digital DNA'') from seven categorical post-level behavioural features (posting action, URL, media, text duplication, hashtags, emojis, sentiment), preserving temporal order only. Using non-overlapping blocks (k=7), cosine similarity over block-frequency vectors, and hierarchical clustering, we obtain four coherent families: Unique Tweeters, Duplicators with URLs, Content Multipliers, and Informed Contributors. Families share behavioural cores but differ systematically in engagement strategies and life-cycle dynamics (beginning/middle/end). We then model behavioural change as mutations. Within each family we align sequences via multiple sequence alignment (MSA) and label events as insertions, deletions, substitutions, alterations, and identity. This quantifies mutation rates, change-prone blocks/features, and mutation hotspots. Deletions and substitutions dominate, insertions are rare, and mutation profiles differ by family, with hotspots early for some families and dispersed for others. Finally, we test predictive value: bots within the same family share mutations more often than bots across families; closer bots share and propagate mutations more than distant ones; and responses to external triggers (e.g., Christmas, Halloween) follow family-specific, partly predictable patterns. Overall, sequence-based family modelling plus mutation analysis provides a fine-grained account of how promotional bot behaviour adapts over time.
Machine learning models typically assume that training and test data follow the same distribution, an assumption that often fails in real-world scenarios due to distribution shifts. This issue is especially pronounced in low-resource settings, where data scarcity and limited domain diversity hinder robust generalization. Domain generalization (DG) approaches address this challenge by learning features that remain invariant across domains, often using causal mechanisms to improve model robustness. In this study, we examine two distinct causal DG techniques in low-resource natural language tasks. First, we investigate a causal data augmentation (CDA) approach that automatically generates counterfactual examples to improve robustness to spurious correlations. We apply this method to sentiment classification on the NaijaSenti Twitter corpus, expanding the training data with semantically equivalent paraphrases to simulate controlled distribution shifts. Second, we explore an invariant causal representation learning (ICRL) approach using the DINER framework, originally proposed for debiasing aspect-based sentiment analysis. We adapt DINER to a multilingual setting. Our findings demonstrate that both approaches enhance robustness to unseen domains: counterfactual data augmentation yields consistent cross-domain accuracy gains in sentiment classification, while causal representation learning with DINER improves out-of-distribution performance in multilingual sentiment analysis, albeit with varying gains across languages.
Aspect-based sentiment analysis (ABSA) aims to identify aspect terms and determine their sentiment polarity. While dependency trees combined with contextual semantics effectively identify aspect sentiment, existing methods relying on syntax trees and aspect-aware attention struggle to model complex semantic relationships. Their dependence on linear dot-product features fails to capture nonlinear associations, allowing noisy similarity from irrelevant words to obscure key opinion terms. Motivated by Differentiable Optimal Matching, we propose the Optimal Transport Enhanced Syntactic-Semantic Graph Network (OTESGN), which introduces a Syntactic-Semantic Collaborative Attention. It comprises a Syntactic Graph-Aware Attention for mining latent syntactic dependencies and modeling global syntactic topology, as well as a Semantic Optimal Transport Attention designed to uncover fine-grained semantic alignments amidst textual noise, thereby accurately capturing sentiment signals obscured by irrelevant tokens. A Adaptive Attention Fusion module integrates these heterogeneous features, and contrastive regularization further improves robustness. Experiments demonstrate that OTESGN achieves state-of-the-art results, outperforming previous best models by +1.01% F1 on Twitter and +1.30% F1 on Laptop14 benchmarks. Ablative studies and visual analyses corroborate its efficacy in precise localization of opinion words and noise resistance.
Mental disorders including depression, anxiety, and other neurological disorders pose a significant global challenge, particularly among individuals exhibiting social avoidance tendencies. This study proposes a hybrid approach by leveraging smartphone sensor data measuring daily physical activities and analyzing their social media (Twitter) interactions for evaluating an individual's depression level. Using CNN-based deep learning models and Naive Bayes classification, we identify human physical activities accurately and also classify the user sentiments. A total of 33 participants were recruited for data acquisition, and nine relevant features were extracted from the physical activities and analyzed with their weekly depression scores, evaluated using the Geriatric Depression Scale (GDS) questionnaire. Of the nine features, six are derived from physical activities, achieving an activity recognition accuracy of 95%, while three features stem from sentiment analysis of Twitter activities, yielding a sentiment analysis accuracy of 95.6%. Notably, several physical activity features exhibited significant correlations with the severity of depression symptoms. For classifying the depression severity, a support vector machine (SVM)-based algorithm is employed that demonstrated a very high accuracy of 94%, outperforming alternative models, e.g., the multilayer perceptron (MLP) and k-nearest neighbor. It is a simple approach yet highly effective in the long run for monitoring depression without breaching personal privacy.
This study explores the intersection of fashion trends and social media sentiment through computational analysis of Twitter data using the T4SA (Twitter for Sentiment Analysis) dataset. By applying natural language processing and machine learning techniques, we examine how sentiment patterns in fashion-related social media conversations can serve as predictors for emerging fashion trends. Our analysis involves the identification and categorization of fashion-related content, sentiment classification with improved normalization techniques, time series decomposition, statistically validated causal relationship modeling, cross-platform sentiment comparison, and brand-specific sentiment analysis. Results indicate correlations between sentiment patterns and fashion theme popularity, with accessories and streetwear themes showing statistically significant rising trends. The Granger causality analysis establishes sustainability and streetwear as primary trend drivers, showing bidirectional relationships with several other themes. The findings demonstrate that social media sentiment analysis can serve as an effective early indicator of fashion trend trajectories when proper statistical validation is applied. Our improved predictive model achieved 78.35% balanced accuracy in sentiment classification, establishing a reliable foundation for trend prediction across positive, neutral, and negative sentiment categories.
Social media platforms like Twitter have increasingly relied on Natural Language Processing NLP techniques to analyze and understand the sentiments expressed in the user generated content. One such state of the art NLP model is Bidirectional Encoder Representations from Transformers BERT which has been widely adapted in sentiment analysis. BERT is susceptible to adversarial attacks. This paper aims to scrutinize the inherent vulnerabilities of such models in Twitter sentiment analysis. It aims to formulate a framework for constructing targeted adversarial texts capable of deceiving these models, while maintaining stealth. In contrast to conventional methodologies, such as Importance Reweighting, this framework core idea resides in its reliance on gradients to prioritize the importance of individual words within the text. It uses a whitebox approach to attain fine grained sensitivity, pinpointing words that exert maximal influence on the classification outcome. This paper is organized into three interdependent phases. It starts with fine-tuning a pre-trained BERT model on Twitter data. It then analyzes gradients of the model to rank words on their importance, and iteratively replaces those with feasible candidates until an acceptable solution is found. Finally, it evaluates the effectiveness of the adversarial text against the custom trained sentiment classification model. This assessment would help in gauging the capacity of the adversarial text to successfully subvert classification without raising any alarm.
We present a framework for large-scale sentiment and topic analysis of Twitter discourse. Our pipeline begins with targeted data collection using conflict-specific keywords, followed by automated sentiment labeling via multiple pre-trained models to improve annotation robustness. We examine the relationship between sentiment and contextual features such as timestamp, geolocation, and lexical content. To identify latent themes, we apply Latent Dirichlet Allocation (LDA) on partitioned subsets grouped by sentiment and metadata attributes. Finally, we develop an interactive visualization interface to support exploration of sentiment trends and topic distributions across time and regions. This work contributes a scalable methodology for social media analysis in dynamic geopolitical contexts.
Internet penetration rates in Africa are rising steadily, and mobile Internet is getting an even bigger boost with the availability of smartphones. Young people are increasingly using the Internet, especially social networks, and Senegal is no exception to this revolution. Social networks have become the main means of expression for young people. Despite this evolution in Internet access, there are few operators on the market, which limits the alternatives available in terms of value for money. In this paper, we will look at how young people feel about the price of mobile Internet in Senegal, in relation to the perceived quality of the service, through their comments on social networks. We scanned a set of Twitter and Facebook comments related to the subject and applied a sentiment analysis model to gather their general feelings.