Abstract: Documents in the health domain are often annotated with semantic concepts (i.e., terms) from controlled vocabularies. As the volume of these documents grows, the annotation work is increasingly done by algorithms. Compared to humans, automatic indexing algorithms are imperfect and may assign wrong terms to documents, which affects subsequent search tasks where queries contain these terms. In this work, we aim to understand the performance impact of using imperfectly assigned terms in Boolean semantic searches. We used MeSH terms and biomedical literature search as a case study. We evaluated multiple automatic indexing algorithms on real-world Boolean queries composed of MeSH terms, and found that (1) probabilistic logic can handle inaccurately assigned terms better than traditional Boolean logic, (2) query-level performance is mostly limited by the lowest-performing terms in a query, and (3) mixing a small amount of human indexing with automatic indexing can restore excellent query-level performance. These findings have important implications for future work on automatic indexing.
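The contrast between traditional Boolean logic and probabilistic logic over imperfect term assignments can be illustrated with a minimal sketch. This is not the paper's implementation: the document, the confidence scores, and the noisy-AND scoring rule below are illustrative assumptions.

```python
def boolean_and(doc_terms: set[str], query_terms: list[str]) -> bool:
    """Strict Boolean AND: one wrongly missing term rejects the document."""
    return all(t in doc_terms for t in query_terms)


def probabilistic_and(doc_term_probs: dict[str, float],
                      query_terms: list[str]) -> float:
    """Noisy-AND relaxation: multiply per-term assignment probabilities
    instead of requiring exact membership, so an uncertain automatic
    assignment lowers the score rather than discarding the document."""
    score = 1.0
    for t in query_terms:
        score *= doc_term_probs.get(t, 0.0)
    return score


# Example: an automatic indexer assigned "Neoplasms" with low confidence.
doc = {"Humans": 0.98, "Neoplasms": 0.35}
query = ["Humans", "Neoplasms"]

# Thresholding at 0.5 drops the uncertain term, so strict Boolean AND fails.
print(boolean_and({t for t, p in doc.items() if p >= 0.5}, query))  # False
# The probabilistic score still ranks the document, just lower.
print(probabilistic_and(doc, query))  # 0.343
```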
Abstract: When people search for information about a new topic within large document collections, they implicitly construct a mental model of the unfamiliar information space to represent what they currently know and to guide their exploration into the unknown. Building this mental model can be challenging, as it requires not only finding relevant documents but also synthesizing important concepts and the relationships that connect those concepts both within and across documents. This paper describes a novel interactive approach designed to help users construct a mental model of an unfamiliar information space during exploratory search. We propose a new semantic search system that organizes and visualizes important concepts and their relations for a set of search results. A user study ($n=20$) was conducted to compare the proposed approach against a baseline faceted search system on exploratory literature search tasks. Experimental results show that the proposed approach is more effective in helping users recognize relationships between key concepts, leading to a more sophisticated understanding of the search topic while maintaining functionality and usability comparable to a faceted search system.
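One way to organize concepts and relations for a result set is a co-occurrence graph. The sketch below is a minimal illustration under assumptions not stated in the abstract (concepts are already extracted per document; relations are co-occurrence counts); it does not reproduce the proposed system's concept extraction or visualization.

```python
from collections import Counter
from itertools import combinations


def build_concept_graph(results: list[dict]) -> Counter:
    """Map (concept_a, concept_b) pairs to co-occurrence counts across the
    retrieved documents; each result carries a 'concepts' list."""
    edges = Counter()
    for doc in results:
        for a, b in combinations(sorted(set(doc["concepts"])), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical search results with pre-extracted concepts.
results = [
    {"concepts": ["insulin", "diabetes", "metformin"]},
    {"concepts": ["diabetes", "metformin"]},
]
print(build_concept_graph(results).most_common(1))
# [(('diabetes', 'metformin'), 2)]
```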
Abstract: Text simplification is concerned with reducing language complexity and improving the readability of professional content so that the text is accessible to readers of different ages and educational levels. Although it is a promising practice for improving the fairness and transparency of text information systems, the notion of text simplification has been treated inconsistently in the existing literature, ranging from assessing the complexity of single words to automatically generating simplified documents. We show that the general problem of text simplification can be formally decomposed into a compact pipeline of tasks to ensure the transparency and explainability of the process. In this paper, we present a systematic analysis of the first two steps in this pipeline: 1) predicting the complexity of a given piece of text, and 2) identifying complex components within text deemed complex. We show that these two tasks can be solved separately, using either lexical approaches or state-of-the-art deep learning methods, or jointly through an end-to-end, explainable machine learning predictor. We propose formal evaluation metrics for both tasks, through which we are able to compare the performance of the candidate approaches on multiple datasets from diverse domains.
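The two pipeline steps can be sketched with a purely lexical baseline. The thresholds, features, and function names below are illustrative assumptions for exposition only, not the models evaluated in the paper.

```python
def predict_complexity(text: str, avg_len_threshold: float = 6.0) -> bool:
    """Step 1: flag a piece of text as complex when its average word length
    exceeds a threshold (a crude stand-in for a trained classifier)."""
    words = text.split()
    return bool(words) and sum(len(w) for w in words) / len(words) > avg_len_threshold


def identify_complex_components(text: str, word_len_threshold: int = 9) -> list[str]:
    """Step 2: within text deemed complex, return the words treated as
    complex components (here, simply the long words)."""
    return [w for w in text.split() if len(w) >= word_len_threshold]


sentence = "Pharmacokinetic interactions complicate anticoagulation management"
if predict_complexity(sentence):
    print(identify_complex_components(sentence))
```

An end-to-end predictor would instead learn both decisions jointly, but the separate formulation above mirrors the two evaluation targets described in the abstract.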