Abstract:Background: Clear language makes communication easier between any two parties. A layman may have difficulty communicating with a professional due to not understanding the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical terminology which can lead to poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map laymen medical terms to professional medical terms and vice versa. Objective: Many of the presented vocabularies are built manually or semi-automatically requiring large investments of time and human effort and consequently the slow growth of these vocabularies. In this paper, we present an automatic method to enrich laymen's vocabularies that has the benefit of being able to be applied to vocabularies in any domain. Methods: Our entirely automatic approach uses machine learning, specifically Global Vectors for Word Embeddings (GloVe), on a corpus collected from a social media healthcare platform to extend and enhance consumer health vocabularies (CHV). Our approach further improves the CHV by incorporating synonyms and hyponyms from the WordNet ontology. The basic GloVe and our novel algorithms incorporating WordNet were evaluated using two laymen datasets from the National Library of Medicine (NLM), Open-Access Consumer Health Vocabulary (OAC CHV) and MedlinePlus Healthcare Vocabulary. Results: The results show that GloVe was able to find new laymen terms with an F-score of 48.44%. Furthermore, our enhanced GloVe approach outperformed basic GloVe with an average F-score of 61%, a relative improvement of 25%. Furthermore, the enhanced GloVe showed a statistical significance over the two ground truth datasets with P<.001.
Abstract:Word vector representations open up new opportunities to extract useful information from unstructured text. Defining a word as a vector made it easy for the machine learning algorithms to understand a text and extract information from. Word vector representations have been used in many applications such word synonyms, word analogy, syntactic parsing, and many others. GloVe, based on word contexts and matrix vectorization, is an ef-fective vector-learning algorithm. It improves on previous vector-learning algorithms. However, the GloVe model fails to explicitly consider the order in which words appear within their contexts. In this paper, multiple methods of incorporating word order in GloVe word embeddings are proposed. Experimental results show that our Word Order Vector (WOVe) word embeddings approach outperforms unmodified GloVe on the natural lan-guage tasks of analogy completion and word similarity. WOVe with direct concatenation slightly outperformed GloVe on the word similarity task, increasing average rank by 2%. However, it greatly improved on the GloVe baseline on a word analogy task, achieving an average 36.34% improvement in accuracy.
Abstract:Open-Access and Collaborative Consumer Health Vocabulary (OAC CHV, or CHV for short), is a collection of medical terms written in plain English. It provides a list of simple, easy, and clear terms that laymen prefer to use rather than an equivalent professional medical term. The National Library of Medicine (NLM) has integrated and mapped the CHV terms to their Unified Medical Language System (UMLS). These CHV terms mapped to 56000 professional concepts on the UMLS. We found that about 48% of these laymen's terms are still jargon and matched with the professional terms on the UMLS. In this paper, we present an enhanced word embedding technique that generates new CHV terms from a consumer-generated text. We downloaded our corpus from a healthcare social media and evaluated our new method based on iterative feedback to word embedding using ground truth built from the existing CHV terms. Our feedback algorithm outperformed unmodified GLoVe and new CHV terms have been detected.