Abstract:Recent advances in large language models (LLMs) have demonstrated strong performance on simple text classification tasks, frequently under zero-shot settings. However, their efficacy declines when tackling complex social media challenges such as propaganda detection, hateful meme classification, and toxicity identification. Much of the existing work has focused on using LLMs to generate synthetic training data, overlooking the potential of LLM-based text preprocessing and semantic augmentation. In this paper, we introduce an approach that prompts LLMs to clean noisy text and provide context-rich explanations, thereby enhancing training sets without substantial increases in data volume. We systematically evaluate on the SemEval 2024 multi-label Persuasive Meme dataset and further validate on the Google Jigsaw toxic comments and Facebook hateful memes datasets to assess generalizability. Our results reveal that zero-shot LLM classification underperforms on these high-context tasks compared to supervised models. In contrast, integrating LLM-based semantic augmentation yields performance on par with approaches that rely on human-annotated data, at a fraction of the cost. These findings underscore the importance of strategically incorporating LLMs into machine learning (ML) pipeline for social media classification tasks, offering broad implications for combating harmful content online.
Abstract:In the realm of political advertising, persuasion operates as a pivotal element within the broader framework of propaganda, exerting profound influences on public opinion and electoral outcomes. In this paper, we (1) introduce a lightweight model for persuasive text detection that achieves state-of-the-art performance in Subtask 3 of SemEval 2023 Task 3, while significantly reducing the computational resource requirements; and (2) leverage the proposed model to gain insights into political campaigning strategies on social media platforms by applying it to a real-world dataset we curated, consisting of Facebook political ads from the 2022 Australian Federal election campaign. Our study shows how subtleties can be found in persuasive political advertisements and presents a pragmatic approach to detect and analyze such strategies with limited resources, enhancing transparency in social media political campaigns.
Abstract:High-quality textual training data is essential for the success of multimodal data processing tasks, yet outputs from image captioning models like BLIP and GIT often contain errors and anomalies that are difficult to rectify using rule-based methods. While recent work addressing this issue has predominantly focused on using GPT models for data preprocessing on relatively simple public datasets, there is a need to explore a broader range of Large Language Models (LLMs) and tackle more challenging and diverse datasets. In this study, we investigate the use of multiple LLMs, including LLaMA 3.1 70B, GPT-4 Turbo, and Sonnet 3.5 v2, to refine and clean the textual outputs of BLIP and GIT. We assess the impact of LLM-assisted data cleaning by comparing downstream-task (SemEval 2024 Subtask "Multilabel Persuasion Detection in Memes") models trained on cleaned versus non-cleaned data. While our experimental results show improvements when using LLM-cleaned captions, statistical tests reveal that most of these improvements are not significant. This suggests that while LLMs have the potential to enhance data cleaning and repairing, their effectiveness may be limited depending on the context they are applied to, the complexity of the task, and the level of noise in the text. Our findings highlight the need for further research into the capabilities and limitations of LLMs in data preprocessing pipelines, especially when dealing with challenging datasets, contributing empirical evidence to the ongoing discussion about integrating LLMs into data preprocessing pipelines.