Yumeng Qin

How UMass-FSD Inadvertently Leverages Temporal Bias

Aug 02, 2022

Dominik Wurzer, Yumeng Qin

Abstract: First Story Detection describes the task of identifying new events in a stream of documents. The UMass-FSD system is known for its strong performance in First Story Detection competitions. Recently, it has been frequently used as a high-accuracy baseline in research publications. We are the first to discover that UMass-FSD inadvertently leverages temporal bias. Interestingly, the discovered bias contrasts with previously known biases and performs significantly better. Our analysis reveals an increased contribution of temporally distant documents, resulting from an unusual way of handling incremental term statistics. We show that this form of temporal bias is also applicable to other well-known First Story Detection systems, where it improves the detection accuracy. To provide a more generalizable conclusion and demonstrate that the observed bias is not only an artefact of a particular implementation, we present a model that intentionally leverages a bias on temporal distance. Our model significantly improves the detection effectiveness of state-of-the-art First Story Detection systems.

* SIGIR 20, July 2020
* Temporal Bias, First Story Detection, Topic Detection and Tracking, UMass-FSD, LSH-FSD

Parameterizing Kterm Hashing

Aug 02, 2022

Dominik Wurzer, Yumeng Qin

Abstract: Kterm Hashing provides an innovative approach to novelty detection on massive data streams. Previous research focused on maximizing the efficiency of Kterm Hashing and succeeded in scaling First Story Detection to Twitter-size data streams without sacrificing detection accuracy. In this paper, we focus on improving the effectiveness of Kterm Hashing. Traditionally, all kterms are considered equally important when calculating a document's degree of novelty with respect to the past. We believe that certain kterms are more important than others and hypothesize that uniform kterm weights are sub-optimal for determining novelty in data streams. To validate our hypothesis, we parameterize Kterm Hashing by assigning weights to kterms based on their characteristics. Our experiments apply Kterm Hashing in a First Story Detection setting and reveal that parameterized Kterm Hashing can surpass state-of-the-art detection accuracy and significantly outperform the uniformly weighted approach.

* SIGIR 18, July 2018, Ann Arbor, MI, USA
* Kterm Hashing, Novelty Detection, First Story Detection
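
The contrast between uniform and weighted kterm scoring in the abstract above can be illustrated with a minimal sketch. It rests on assumptions not stated on this page: a kterm is taken to be any unordered combination of up to `k` distinct terms from a document, novelty is taken to be the (weighted) share of a document's kterms that have not been seen before, and a plain Python set stands in for the hashed memory a real system would use. The names `kterms`, `novelty`, and the length-based weight in the usage note are hypothetical; the abstract does not say which kterm characteristics the authors weight by.

```python
from itertools import combinations


def kterms(tokens, k=2):
    """Return all unordered combinations of 1..k distinct terms (assumed kterm definition)."""
    terms = sorted(set(tokens))
    result = set()
    for size in range(1, k + 1):
        result.update(combinations(terms, size))
    return result


def novelty(tokens, seen, k=2, weight=None):
    """Weighted share of the document's kterms that are not yet in `seen`.

    `seen` is a set standing in for the hashed kterm memory; it is updated
    in place after scoring. With weight=None every kterm counts equally.
    """
    if weight is None:
        weight = lambda kterm: 1.0  # uniform weighting
    doc_kterms = kterms(tokens, k)
    total = sum(weight(kt) for kt in doc_kterms)
    unseen = sum(weight(kt) for kt in doc_kterms if kt not in seen)
    seen.update(doc_kterms)
    return unseen / total if total else 0.0
```

As a usage example, scoring `["quake", "hits", "city"]` against an empty memory gives a novelty of 1.0, and scoring `["quake", "hits", "town"]` next gives 0.5 under uniform weights, because three of its six kterms are new. Passing a hypothetical weight such as `weight=len`, which counts two-term kterms double, raises that second score to 5/9. This only illustrates that the choice of weights changes the novelty score; it does not reproduce the weighting used in the paper.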

Spotting Rumors via Novelty Detection

Nov 19, 2016

Yumeng Qin, Dominik Wurzer, Victor Lavrenko, Cunchen Tang

Abstract: Rumour detection is hard because the most accurate systems operate retrospectively, only recognizing rumours once they have collected repeated signals. By then the rumours might have already spread and caused harm. We introduce a new category of features based on novelty, tailored to detect rumours early on. To compensate for the absence of repeated signals, we make use of news wire as an additional data source. Unconfirmed (novel) information with respect to the news articles is considered an indication of rumours. Additionally, we introduce pseudo feedback, which assumes that documents that are similar to previous rumours are more likely to also be rumours. Comparison with other real-time approaches shows that novelty-based features in conjunction with pseudo feedback perform significantly better when detecting rumours instantly after their publication.
