Abstract: Stopword removal is a critical stage in many Machine Learning methods but often receives little consideration; this neglect interferes with model visualizations and disrupts user confidence. Inappropriately chosen or hastily omitted stopwords not only lead to suboptimal performance but also significantly affect the quality of models, thus reducing the willingness of practitioners and stakeholders to rely on the output visualizations. This paper proposes a novel extraction method that provides a corpus-specific probabilistic estimation of stopword likelihood, together with an interactive visualization system to support stopword analysis. We evaluated our approach and interface using real-world data, a commonly used Machine Learning method (Topic Modelling), and a comprehensive qualitative experiment probing user confidence. Our results show that the system increases user confidence in the credibility of topic models by (1) returning reasonable probabilities, (2) generating an appropriate and representative extension of common stopword lists, and (3) providing an adjustable threshold for estimating and analyzing stopwords visually. Finally, we discuss insights, recommendations, and best practices to support practitioners in improving the output of Machine Learning methods and topic model visualizations through robust stopword analysis and removal.
Abstract: Searching large digital repositories can be extremely frustrating, as common list-based formats encourage users to adopt a convenience-sampling approach that favours chance discovery and random search over meaningful exploration. We have designed a methodology that allows users to explore corpora visually and thematically while developing personalised, holistic reading strategies. We describe the results of a three-phase qualitative study in which experienced researchers used our interactive visualisation approach to analyse a set of publications and select relevant themes and papers. Using in-depth semi-structured interviews and stimulated recall, we found that users: (i) selected papers that they otherwise would not have read, (ii) developed a more coherent reading strategy, and (iii) understood the thematic structure and relationships between papers more effectively. Finally, to enhance current digital repositories, we make six design recommendations that we have shown encourage users to adopt a more holistic and thematic research approach.