Bob Vanderheyden

Method for Customizable Automated Tagging: Addressing the Problem of Over-tagging and Under-tagging Text Documents

Apr 30, 2020

Maharshi R. Pandya, Jessica Reyes, Bob Vanderheyden

Abstract: Using author-provided tags to predict tags for a new document often results in the overgeneration of tags. When the author provides no tags, documents are severely under-tagged. In this paper, we present a method to generate a universal set of tags that can be applied widely to a large document corpus. First, using IBM Watson's NLU service, we collect keywords and phrases, which we call "complex document tags", from 8,854 popular reports in the corpus. We then apply an LDA model to these complex document tags to generate a set of 765 unique "simple tags". To tag the corpus, we run each document through IBM Watson NLU and apply the appropriate simple tags. Using only 765 simple tags, our method tags 87,397 of the 88,583 documents in the corpus with at least one tag. About 92.1% of those 87,397 documents are also determined to be sufficiently tagged. Finally, we discuss the performance of our method and its limitations.

* Work done by Maharshi R. Pandya and Jessica Reyes as IBM interns under the leadership of Bob Vanderheyden. Article to be published.
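
The coverage figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the abstract; the variable names are illustrative, and the sufficiently-tagged count is an estimate because the abstract gives only a rounded percentage.

```python
# Counts reported in the abstract.
total_documents = 88_583
tagged_documents = 87_397
sufficiently_tagged_share = 0.921  # "about 92.1%" of the tagged documents

# Documents that received no tag at all.
untagged_documents = total_documents - tagged_documents

# Share of the corpus that received at least one tag.
tag_coverage = tagged_documents / total_documents

# Approximate count only: the source percentage is rounded.
sufficiently_tagged_estimate = round(tagged_documents * sufficiently_tagged_share)

print(f"untagged documents: {untagged_documents}")
print(f"tag coverage: {tag_coverage:.1%}")
print(f"sufficiently tagged (approx.): {sufficiently_tagged_estimate}")
```

On these numbers, 1,186 documents remain untagged, coverage is about 98.7% of the corpus, and roughly 80,500 documents count as sufficiently tagged.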

Logistic Ensemble Models

Jun 12, 2018

Bob Vanderheyden, Jennifer Priestley

Abstract: Predictive models that are developed in a regulated industry or for a regulated application, such as determining credit worthiness, must be interpretable and rational (e.g., meaningful improvements in basic credit behavior must result in improved credit worthiness scores). Machine learning technologies provide very good performance with minimal analyst intervention, making them well suited to a high-volume analytic environment, but the majority are black-box tools that provide very limited insight into the key drivers of model performance or of predicted output values. This paper presents a methodology that blends one of the most popular predictive statistical modeling methods for binary classification with a core model enhancement strategy found in machine learning. The resulting prediction methodology provides solid performance with minimal analyst effort, while providing the interpretability and rationality required in regulated industries, as well as in other environments where model parameters must be interpreted (e.g., businesses that need to interpret a model in order to act on it).

* Presented at the 30th Annual Conference of the International Academy of Business Disciplines
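
The abstract does not name the ensemble strategy, so the following is only a minimal sketch of one plausible reading: logistic regression fitted on bootstrap resamples, with the coefficients averaged so that the result is still a single interpretable logistic model. The function names, the training settings, and the synthetic "on-time payment rate" feature are all assumptions made for illustration, not the authors' method or data.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, x):
    # weights = [intercept, coefficient_1, ...]
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(rows, labels, epochs=200, lr=0.1):
    """Fit [intercept, coefficients...] by batch gradient descent on log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict_proba(weights, x) - y
            grad[0] += error
            for j, v in enumerate(x):
                grad[j + 1] += error * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def fit_bagged_logistic(rows, labels, n_models=10, seed=0):
    """Average the coefficients of models fitted on bootstrap resamples.

    Assumption: coefficient averaging stands in for the unnamed ensemble step.
    """
    rng = random.Random(seed)
    n = len(rows)
    fits = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        fits.append(fit_logistic([rows[i] for i in idx], [labels[i] for i in idx]))
    return [sum(ws) / n_models for ws in zip(*fits)]


# Hypothetical data: one feature (on-time payment rate in [0, 1]);
# label 1 means "good credit" and becomes more likely as the rate rises.
rng = random.Random(42)
rows = [[rng.random()] for _ in range(200)]
labels = [1 if rng.random() < sigmoid(6 * (x[0] - 0.5)) else 0 for x in rows]

weights = fit_bagged_logistic(rows, labels)
print("intercept, coefficient:", weights)
print("P(good | rate=0.1):", predict_proba(weights, [0.1]))
print("P(good | rate=0.9):", predict_proba(weights, [0.9]))
```

Because the ensemble collapses to one coefficient vector, the rationality requirement described in the abstract can be checked directly: a positive coefficient on the payment-rate feature means that better payment behavior always raises the predicted score.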
