Abstract: We present 3DLNews, a novel dataset of local news articles from the United States spanning the period from 1996 to 2024. It contains almost 1 million URLs (with HTML text) from over 14,000 local newspapers, TV stations, and radio stations across all 50 states, and provides a broad snapshot of the US local news landscape. The dataset was collected by scraping Google and Twitter search results. We employed a multi-step filtering process to remove links that do not point to news articles, and enriched the dataset with metadata such as the names and geo-coordinates of the source news media organizations and the article publication dates. Furthermore, we demonstrated the utility of 3DLNews by outlining four applications.
Abstract: Social bots remain a major vector for spreading disinformation on social media and a menace to the public. Despite the progress made in developing multiple sophisticated social bot detection algorithms and tools, bot detection remains a challenging, unsolved problem that is fraught with uncertainty due to the heterogeneity of bot behaviors, training data, and detection algorithms. Detection models often disagree on whether to label the same account as bot or human-controlled, yet they provide no measure of uncertainty to indicate how much their results should be trusted. We propose to address both bot detection and the quantification of uncertainty at the account level, a novel feature of this research. This dual focus is crucial because the quantified uncertainty of each prediction is additional information that can improve decision-making and the reliability of bot classifications. Specifically, our approach facilitates targeted interventions for bots when predictions are made with high confidence and suggests caution (e.g., gathering more data) when predictions are uncertain.
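To make the idea of account-level uncertainty concrete, here is a minimal Python sketch. It is not the method of the paper, which the abstract does not specify. It assumes that several detectors each output a bot probability for the same account, and it uses the binary entropy of their mean as a stand-in uncertainty measure. The function names and the `max_entropy` threshold are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def assess_account(bot_probs, max_entropy=0.5):
    """Combine per-model bot probabilities into a label plus an uncertainty.

    bot_probs: bot probabilities for ONE account, one value per detector
               (hypothetical inputs, not the output of any specific tool).
    max_entropy: illustrative cutoff above which we decline to act.
    """
    mean_p = sum(bot_probs) / len(bot_probs)
    uncertainty = binary_entropy(mean_p)
    return {
        "label": "bot" if mean_p >= 0.5 else "human",
        "uncertainty": uncertainty,
        # Act only on confident predictions; otherwise gather more data.
        "action": "act" if uncertainty <= max_entropy else "gather_more_data",
    }


# Detectors agree: low uncertainty, so an intervention is reasonable.
confident = assess_account([0.95, 0.97, 0.93])

# Detectors disagree: high uncertainty, so the sketch suggests caution.
uncertain = assess_account([0.9, 0.2, 0.5])
```

Entropy of the mean is only one of several possible choices; it conflates disagreement between models with each model's own hesitation, which a more careful treatment would separate.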
Abstract: We investigate the overlap of topics of online news articles from a variety of sources. To do this, we provide a platform for studying the news by measuring this overlap and scoring news stories according to the degree of attention they receive, in near-real time. This can enable multiple studies, including identifying topics that receive the most attention from news organizations and distinguishing slow news days from major news days. Our application, StoryGraph, extracts the first five news articles from the RSS feeds of 17 US news media organizations across the partisanship spectrum (left, center, and right) every 10 minutes. From these articles, StoryGraph extracts named entities (PEOPLE, LOCATIONS, ORGANIZATIONS, etc.) and then represents each news article by its set of extracted named entities. Finally, StoryGraph generates a news similarity graph in which the nodes represent news articles, and an edge between a pair of nodes indicates a high degree of similarity between the two articles (similar news stories). Each news story within the news similarity graph is assigned an attention score, which quantifies the amount of attention the topics in the news story collectively receive from the news media organizations. The StoryGraph service has been running since August 2017, and using this method, we determined that the top news story of 2018 was the "Kavanaugh hearings," with an attention score of 25.85 on September 27, 2018. Similarly, the top news story for 2019 so far (as of 2019-12-12) is "AG William Barr's release of his principal conclusions of the Mueller Report," with an attention score of 22.93 on March 24, 2019.
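The pipeline described above (entity sets, similarity graph, attention score) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not StoryGraph's implementation: the abstract does not give the similarity measure, the threshold, or the attention-score formula, so the sketch assumes plain Jaccard similarity, a hypothetical threshold of 0.3, and the average degree of a connected component as a stand-in score. The entity sets are made-up inputs; entity extraction itself is out of scope.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two entity sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(articles, threshold=0.3):
    """Link every pair of articles whose entity sets are similar enough.

    articles: dict mapping article id -> set of named entities.
    threshold: hypothetical cutoff; the abstract does not state one.
    """
    graph = {aid: set() for aid in articles}
    for u, v in combinations(articles, 2):
        if jaccard(articles[u], articles[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def connected_components(graph):
    """Group articles into stories: each component is one news story."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def attention_score(graph, component):
    """Average degree of the component (assumed stand-in for the score)."""
    return sum(len(graph[n]) for n in component) / len(component)


# Made-up entity sets for four articles from different outlets.
articles = {
    "a1": {"Brett Kavanaugh", "Senate", "Christine Blasey Ford"},
    "a2": {"Brett Kavanaugh", "Senate", "Washington"},
    "a3": {"Brett Kavanaugh", "Senate", "Christine Blasey Ford", "Washington"},
    "a4": {"NASA", "Mars"},
}

graph = build_graph(articles)
stories = connected_components(graph)
scores = {frozenset(s): attention_score(graph, s) for s in stories}
```

In this toy input, the three Kavanaugh articles form one fully connected story, while the NASA article stays isolated. Under the assumed scoring, a story covered by many outlets with overlapping entities yields a denser component and therefore a higher score.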