Abstract: Background: Unstructured, textual data is increasing rapidly, and Latent Dirichlet Allocation (LDA) topic modeling is a popular method for analyzing it. Past work suggests that the instability of LDA topics may lead to systematic errors. Aim: We propose a method that relies on replicated LDA runs and clustering, and that provides a stability metric for the topics. Method: We generate k LDA topics and replicate this process n times, resulting in n*k topics. We then use K-medoids to cluster the n*k topics into k clusters. The k clusters now represent the original LDA topics, and we present them as in standard LDA, showing the ten most probable words. For the clusters, we try multiple stability metrics, of which we recommend Rank-Biased Overlap, showing the stability of the topics inside the clusters. Results: We provide an initial validation in which our method is applied to 270,000 Mozilla Firefox commit messages with k=20 and n=20. We show how our topic stability metrics relate to the contents of the topics. Conclusions: Advances in text mining enable us to analyze large masses of text in software engineering, but non-deterministic algorithms, such as LDA, may lead to unreplicable conclusions. Our approach makes LDA stability transparent, and it is complementary to, rather than an alternative to, prior work that focuses on LDA parameter tuning.
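Below is a minimal Python sketch of the pipeline the abstract outlines, assuming gensim for LDA, scipy for a Jensen-Shannon distance between topic-word distributions, and scikit-learn-extra for K-medoids; the toy commit messages, the small k and n values, the distance choice, and the truncated RBO implementation are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch: replicated LDA runs -> K-medoids clustering -> per-cluster RBO stability.
from itertools import combinations

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from scipy.spatial.distance import jensenshannon
from sklearn_extra.cluster import KMedoids


def rbo(list1, list2, p=0.9):
    """Truncated Rank-Biased Overlap (Webber et al., 2010) over the shared prefix."""
    depth = min(len(list1), len(list2))
    score = sum(p ** (d - 1) * len(set(list1[:d]) & set(list2[:d])) / d
                for d in range(1, depth + 1))
    return (1 - p) * score


# Toy stand-in for the 270,000 Firefox commit messages.
commit_messages = [
    "fix crash when rendering tabs", "fix crash in the html parser",
    "update css layout for toolbar", "refactor css layout rules",
    "add tests for network cache", "fix network cache eviction",
    "update docs for build system", "fix typo in build docs",
] * 5  # repeat so the toy corpus is not degenerate

docs = [msg.lower().split() for msg in commit_messages]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

k, n = 5, 4  # the paper uses k=20 topics replicated n=20 times
topic_words, topic_dists = [], []
for run in range(n):  # n replicated LDA runs, each with a different seed
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=run)
    for t in range(k):
        topic_words.append([w for w, _ in lda.show_topic(t, topn=10)])
        topic_dists.append(lda.get_topics()[t])  # full word distribution

# Pairwise Jensen-Shannon distances between all n*k topic distributions.
m = len(topic_dists)
dist = np.zeros((m, m))
for i, j in combinations(range(m), 2):
    dist[i, j] = dist[j, i] = jensenshannon(topic_dists[i], topic_dists[j])

# K-medoids on the precomputed distance matrix: n*k topics -> k clusters.
labels = KMedoids(n_clusters=k, metric="precomputed",
                  random_state=0).fit_predict(dist)

# Stability of each cluster: mean pairwise RBO of its members' top-10 word lists.
for c in range(k):
    members = [topic_words[i] for i in np.where(labels == c)[0]]
    pairs = list(combinations(members, 2))
    stability = np.mean([rbo(a, b) for a, b in pairs]) if pairs else 1.0
    print(f"cluster {c}: RBO stability {stability:.2f}, top words {members[0]}")
```

Clustering on a precomputed distance matrix keeps the topic representation pluggable: any distance between topic-word distributions, or between ranked word lists, can be swapped in without changing the rest of the pipeline.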
Abstract: Emotional arousal increases activation and performance but may also lead to burnout in software development. We present the first version of a Software Engineering Arousal lexicon (SEA), specifically designed to address the problem of emotional arousal in the software developer ecosystem. SEA is built using a bootstrapping approach that combines a word embedding model trained on issue-tracking data with manual scoring of the items in the lexicon. We show that our lexicon is able to differentiate between issue priorities, which are a source of emotional activation and thus act as a proxy for arousal. The best performance is obtained by combining SEA (428 words) with a previously created general-purpose lexicon by Warriner et al. (13,915 words), achieving Cohen's d effect sizes of up to 0.5.
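A minimal sketch of the kind of evaluation the abstract describes: score issue texts for arousal with a word-to-score lexicon, then compare priority groups with Cohen's d. The toy lexicon entries and issue texts are invented placeholders, not the actual SEA or Warriner et al. data, and the scoring rule (mean arousal of matched words) and pooled-variance Cohen's d are standard choices rather than necessarily the paper's exact procedure.

```python
# Sketch: lexicon-based arousal scoring and a Cohen's d group comparison.
import numpy as np

# Hypothetical combined lexicon: word -> arousal score (e.g., on a 1..9 scale).
lexicon = {"crash": 7.2, "urgent": 7.8, "blocker": 7.5,
           "minor": 2.1, "cleanup": 2.4, "typo": 1.9}


def arousal(text):
    """Mean arousal of the lexicon words found in the text; NaN if none match."""
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return np.mean(scores) if scores else np.nan


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    a, b = np.asarray(a), np.asarray(b)
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                      (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled


# Invented high- and low-priority issue texts as the two comparison groups.
high = [arousal(t) for t in ["urgent crash on startup",
                             "blocker crash in parser"]]
low = [arousal(t) for t in ["fix typo in docs",
                            "minor cleanup of tests"]]
print(f"Cohen's d between priority groups: {cohens_d(high, low):.2f}")
```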