Title: SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels

Sep 21, 2023

Elena Shushkevich, Long Mai, Manuel V. Loureiro, Steven Derby, Tri Kurniawan Wijaya

Abstract: Intelligent systems that detect redundant information in news articles have become especially prevalent as news media outlets proliferate, with the aim of enhancing user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: simple heuristics, such as whether a pair of news articles are both about politics, can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics within narrower domains. However, this requires topic-specific datasets, which are currently lacking. In this article, we propose a new dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Furthermore, we present four distinct approaches for generating news pairs, which are used in the creation of datasets specifically designed for the news similarity detection task. We benchmarked the created datasets using MinHash, BERT, SBERT, and SimCSE models.
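To make the MinHash baseline named in the abstract concrete, here is a minimal sketch of how MinHash can estimate the similarity of two news texts. It is not the authors' benchmark code: the word-shingle size `k`, the signature length `num_perm`, the seeded BLAKE2b hashing, and the two example sentences are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=2):
    """Return the set of k-word shingles of a text (lowercased, split on whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    """64-bit hash of an item under a given seed; each seed acts as one hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=128):
    """MinHash signature: the minimum hash of the set under each seeded hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison with the estimate."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up example sentences (not taken from the SPICED dataset).
text_a = "the central bank raised interest rates by a quarter point on tuesday"
text_b = "the central bank raised interest rates by half a point on tuesday"

set_a, set_b = shingles(text_a), shingles(text_b)
estimate = estimated_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(f"exact={exact_jaccard(set_a, set_b):.3f} estimate={estimate:.3f}")
```

A pair would then be labeled similar when the estimate exceeds some threshold; that threshold is a tuning choice and is not specified in the abstract. Unlike BERT, SBERT, and SimCSE, this approach only measures lexical overlap.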
