Title: A diverse Multilingual News Headlines Dataset from around the World

Mar 28, 2024

Felix Leeb, Bernhard Schölkopf

Figures 1-4 for A diverse Multilingual News Headlines Dataset from around the World

Abstract: Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide, with English translations of all articles included. Designed for natural language processing and media studies, it serves as a high-quality dataset for training or evaluating language models, and it offers a simple, accessible collection of articles for uses such as analyzing global news coverage and cultural narratives. As a simple demonstration of the analyses this dataset facilitates, we use a basic procedure based on a TF-IDF weighted similarity metric to group articles into clusters about the same event. We then visualize the event signature of each event, which shows the languages in which articles appear over time, revealing intuitive features based on the proximity and unexpectedness of the event. The dataset is available on Kaggle (https://www.kaggle.com/datasets/felixludos/babel-briefings) and HuggingFace (https://huggingface.co/datasets/felixludos/babel-briefings), with accompanying code on GitHub (https://github.com/felixludos/babel-briefings).
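To make the clustering idea in the abstract concrete, here is a minimal sketch of grouping headlines by TF-IDF weighted cosine similarity. It is an illustration under stated assumptions, not the authors' procedure: the whitespace tokenizer, the smoothed IDF formula, the similarity threshold of 0.3, the single-link merging, and the function names `tfidf_vectors`, `cosine`, and `cluster_headlines` are all choices made here for the example. It operates on English text, as the translated headlines in the dataset would allow.

```python
import math
from collections import Counter


def tfidf_vectors(headlines):
    """Return one L2-normalised sparse TF-IDF vector (term -> weight) per headline."""
    docs = [h.lower().split() for h in headlines]  # assumption: whitespace tokens
    n = len(docs)
    # Document frequency: number of headlines containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption of this sketch): ln((1 + n) / (1 + df)) + 1.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cluster_headlines(headlines, threshold=0.3):
    """Single-link clustering: merge headlines whose similarity reaches the threshold."""
    vectors = tfidf_vectors(headlines)
    parent = list(range(len(headlines)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(headlines)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical headlines, invented for illustration (not taken from the dataset).
example = [
    "earthquake strikes coastal city overnight",
    "strong earthquake strikes coastal city",
    "central bank raises interest rates",
    "bank raises interest rates again",
]
clusters = cluster_headlines(example)
```

On these four invented headlines the two earthquake stories end up in one cluster and the two interest-rate stories in another. The pairwise loop is quadratic in the number of headlines, so a collection of millions of headlines would need some form of blocking first, for example comparing only headlines published within a few days of each other.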

* Published in NAACL 2024 Proceedings (Short Paper track)
