Abstract:Nowadays, the use of intelligent systems to detect redundant information in news articles has become especially prevalent with the proliferation of news media outlets in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: Simple heuristics such as whether a pair of news are both about politics can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics under more narrow domains. However, this requires the existence of topic-specific datasets, which are currently lacking. In this article, we propose a new dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Futhermore, we present four distinct approaches for generating news pairs, which are used in the creation of datasets specifically designed for news similarity detection task. We benchmarked the created datasets using MinHash, BERT, SBERT, and SimCSE models.
Abstract:The paper is devoted to the participation of the TUDublin team in Constraint@AAAI2021 - COVID19 Fake News Detection Challenge. Today, the problem of fake news detection is more acute than ever in connection with the pandemic. The number of fake news is increasing rapidly and it is necessary to create AI tools that allow us to identify and prevent the spread of false information about COVID-19 urgently. The main goal of the work was to create a model that would carry out a binary classification of messages from social media as real or fake news in the context of COVID-19. Our team constructed the ensemble consisting of Bidirectional Long Short Term Memory, Support Vector Machine, Logistic Regression, Naive Bayes and a combination of Logistic Regression and Naive Bayes. The model allowed us to achieve 0.94 F1-score, which is within 5\% of the best result.