Abstract: The use of AI in an increasing number of fields is the latest iteration of a long process in which machines and systems have replaced humans, or changed the roles they play, in various tasks. Although humans are often resistant to technological innovation, especially in workplaces, there is a general trend towards increasing automation and, more recently, AI. AI is now capable of carrying out, or assisting with, many tasks that used to be regarded as exclusively requiring human expertise. In this paper we consider tasks that could be performed either by human experts or by AI and locate them on a continuum running from exclusively human task performance at one end to AI autonomy at the other, with a variety of forms of human-AI interaction between those extremes. Implementation of AI is constrained by the context of the systems and workflows within which it will be embedded. There is an urgent need for methods to determine how AI should be used in different situations and to develop appropriate methods of human-AI interaction so that humans and AI can work together effectively to perform tasks. In response to the evolving landscape of AI progress and increasing mastery, we introduce an AI Mastery Lifecycle framework and discuss its implications for human-AI interaction. The framework provides guidance on human-AI task allocation and on how human-AI interfaces need to adapt to improvements in AI task performance over time. Within the framework we identify a zone of uncertainty where the issues of human-AI task allocation and user interface design are likely to be most challenging.
Abstract: Redacted emails satisfy most privacy requirements, but they make it more difficult to detect anomalous emails that may be indicative of data exfiltration. In this paper we develop an enhanced method of Active Learning using an information gain maximizing heuristic, and we evaluate its effectiveness in a real-world setting where, due to privacy concerns, only redacted versions of email could be labeled by human analysts. In the first case study we examined how Active Learning should be carried out. We found that model performance was best when a single analyst who was highly skilled at the labeling task provided the labels. In the second case study we used confidence ratings to estimate the labeling uncertainty of analysts and then prioritized instances for labeling based on the expected information gain (the difference between model uncertainty and analyst uncertainty) that labeling each instance would provide. We found that the information gain maximizing heuristic improved model performance over existing sampling methods for Active Learning. Based on the results obtained, we recommend that analysts should be screened, and possibly trained, prior to implementation of Active Learning in cybersecurity applications. We also recommend that the information gain maximizing sampling method (based on expert confidence) should be used in early stages of Active Learning, provided that well-calibrated confidence ratings can be obtained. Finally, we note that the expertise of analysts should be assessed prior to Active Learning, as we found that analysts with lower labeling skill had poorly calibrated (over-) confidence in their labels.
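As a minimal sketch of how the information gain maximizing heuristic described above could be operationalized, the following Python fragment scores each unlabeled instance as model uncertainty minus analyst uncertainty and selects the highest-scoring instances for labeling. The function names, the entropy-based treatment of both uncertainties, and the batch size are illustrative assumptions rather than the exact formulation used in the study.

```python
import numpy as np

def expected_information_gain(model_probs, analyst_confidence):
    """Score each unlabeled instance as (model uncertainty - analyst uncertainty).

    model_probs: (n, n_classes) predicted class probabilities from the classifier.
    analyst_confidence: (n,) self-reported probability (in [0.5, 1.0]) that the
        analyst's label for the instance would be correct.
    """
    eps = 1e-12
    # Entropy of the model's predictive distribution = model uncertainty.
    model_uncertainty = -np.sum(model_probs * np.log2(model_probs + eps), axis=1)
    # Entropy of a Bernoulli with the analyst's confidence = analyst uncertainty.
    p = np.clip(analyst_confidence, eps, 1 - eps)
    analyst_uncertainty = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return model_uncertainty - analyst_uncertainty

def select_for_labeling(model_probs, analyst_confidence, batch_size=10):
    """Pick the instances whose labeling is expected to be most informative."""
    scores = expected_information_gain(model_probs, analyst_confidence)
    return np.argsort(scores)[::-1][:batch_size]
```

Under this scoring, instances on which the model is unsure but the analyst is confident rank highest, which is the intuition behind prioritizing labels that contribute the most information, provided the analyst's confidence is well calibrated.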
Abstract: Exfiltration of data via email is a serious cybersecurity threat for many organizations. Detecting data exfiltration (anomaly) patterns typically requires labeling, most often done by a human annotator, to reduce the high number of false alarms. Active Learning (AL) is a promising approach for labeling data efficiently, but it needs to choose an efficient order in which cases are to be labeled, and there are uncertainties as to what scoring procedure should be used to prioritize cases for labeling, especially when detecting rare cases of interest is crucial. We propose an adaptive AL sampling strategy that leverages the underlying prior data distribution, as well as model uncertainty, to produce batches of cases to be labeled that contain instances of rare anomalies. We show that (1) the classifier benefits from a batch of representative and informative instances of both normal and anomalous examples, and (2) unsupervised anomaly detection plays a useful role in building the classifier in the early stages of training, when relatively little labeling has been done. Our approach to AL for anomaly detection outperformed existing AL approaches on three highly unbalanced UCI benchmarks and on one real-world redacted email dataset.
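A minimal sketch of the kind of adaptive batch construction described above is given below: an unsupervised anomaly detector stands in for the prior data distribution, a margin score captures model uncertainty, and the two rankings are mixed into one labeling batch. The use of IsolationForest, the fixed anomaly fraction, and the function names are assumptions for illustration only; the proposed strategy adapts this mix over the course of training.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def build_labeling_batch(X_unlabeled, model_probs, batch_size=20, anomaly_fraction=0.5):
    """Mix likely anomalies (unsupervised) with the classifier's most uncertain cases."""
    # Unsupervised anomaly scores stand in for the prior data distribution;
    # lower score_samples values indicate more anomalous instances.
    iso = IsolationForest(random_state=0).fit(X_unlabeled)
    anomaly_rank = np.argsort(iso.score_samples(X_unlabeled))  # most anomalous first

    # Margin-based uncertainty of the current classifier (small margin = uncertain).
    sorted_p = np.sort(model_probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]
    uncertainty_rank = np.argsort(margin)

    n_anomalous = int(batch_size * anomaly_fraction)
    chosen = list(anomaly_rank[:n_anomalous])
    for idx in uncertainty_rank:  # fill the remainder with uncertain instances
        if len(chosen) >= batch_size:
            break
        if idx not in chosen:
            chosen.append(idx)
    return np.array(chosen)
```

Weighting the unsupervised component heavily early on, and shifting toward model uncertainty as more labels accumulate, is one way to realize the adaptive behavior described in the abstract.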
Abstract: Research on email anomaly detection has typically relied on specially prepared datasets that may not adequately reflect the type of data that occurs in industry settings. In our research, at a major financial services company, privacy concerns prevented inspection of the bodies of emails and attachment details (although subject headings and attachment filenames were available). This made labeling possible anomalies in the resulting redacted emails more difficult. Another source of difficulty is the high volume of emails combined with the scarcity of resources, which makes machine learning (ML) a necessity but also creates a need for more efficient human training of ML models. Active Learning (AL) has been proposed as a way to make human training of ML models more efficient. However, implementing AL methods is a human-centered AI challenge due to potential human analyst uncertainty, and the labeling task can be further complicated in domains such as cybersecurity, healthcare, and aviation, where mistakes in labeling can have highly adverse consequences. In this paper we present research results concerning the application of Active Learning to anomaly detection in redacted emails, comparing the utility of different methods for implementing Active Learning in this context. We evaluate different AL strategies and their impact on resulting model performance. We also examine how the confidence ratings that experts assign to their labels can inform AL. The results obtained are discussed in terms of their implications for AL methodology and for the role of experts in model-assisted email anomaly screening.
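For context, a generic pool-based Active Learning loop of the kind compared in this work can be sketched as follows; the query strategies shown (uncertainty sampling and random sampling), the classifier, and the labeling budget are illustrative assumptions rather than the exact experimental protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_strategy(model, X_pool):
    """Rank pool instances by how close the predicted anomaly probability is to 0.5."""
    p = model.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(p - 0.5))

def random_strategy(model, X_pool):
    """Baseline: rank pool instances in a fixed random order."""
    return np.random.default_rng(0).permutation(len(X_pool))

def active_learning_run(X_pool, y_oracle, strategy, seed_size=20, batch_size=10, budget=200):
    """Pool-based AL loop; y_oracle stands in for the human analyst's labels.

    Assumes the seed set contains both classes; with rare anomalies the seed
    would normally be chosen to guarantee this.
    """
    labeled = list(range(seed_size))
    while len(labeled) < budget:
        model = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_oracle[labeled])
        ranked = [i for i in strategy(model, X_pool) if i not in labeled]
        labeled.extend(ranked[:batch_size])
    return LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_oracle[labeled])
```

Comparing strategies then amounts to running the same loop with different ranking functions and tracking the resulting model performance as the labeling budget is spent.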
Abstract: While many machine learning methods have been used for medical prediction and risk factor analysis on healthcare data, most prior research has involved single-task learning (STL) methods. However, healthcare research often involves multiple related tasks, for instance, predicting disease scores and carrying out risk factor analysis in multiple subgroups of patients simultaneously, or carrying out risk factor analysis at multiple levels at once. In this paper, we developed a new ensemble machine learning Python package based on multi-task learning (MTL), referred to as the Med-Multi-Task Learning (MD-MTL) package, and applied it to predicting disease scores of patients and to carrying out risk factor analysis on multiple subgroups of patients simultaneously. Our experimental results on two datasets demonstrate the utility of the MD-MTL package and show the advantage of MTL (vs. STL) when analyzing data that is organized into different categories (tasks, which can be various age groups, different levels of disease severity, etc.).
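To illustrate the difference between STL and MTL on data organized into tasks (e.g., age groups), the following sketch fits an independent ridge model per task (STL) versus a mean-regularized formulation in which each task's weights are pulled toward a shared average (a simple form of MTL). This is not the MD-MTL API; the function names, the age bins, and the regularized formulation are assumptions chosen for illustration.

```python
import numpy as np

def split_into_tasks(X, y, age, bins=(0, 40, 60, 120)):
    """Group patients into age-band tasks (illustrative grouping only)."""
    task_ids = np.digitize(age, bins)
    return [(X[task_ids == t], y[task_ids == t]) for t in np.unique(task_ids)]

def fit_stl(tasks, lam=1.0):
    """Single-task learning: an independent ridge solution per task."""
    return [np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
            for X, y in tasks]

def fit_mtl(tasks, lam=1.0, mu=10.0):
    """Toy multi-task learning: each task's weights are pulled toward a shared
    mean weight vector (mean-regularized MTL), so small subgroups borrow
    strength from the others."""
    W = np.array(fit_stl(tasks, lam))
    for _ in range(20):  # alternate between the shared mean and per-task updates
        w_bar = W.mean(axis=0)
        W = np.array([
            np.linalg.solve(X.T @ X + (lam + mu) * np.eye(X.shape[1]),
                            X.T @ y + mu * w_bar)
            for X, y in tasks
        ])
    return W
```

The shared regularization is what lets sparsely populated subgroups (a task with few patients) benefit from the other tasks, which is the advantage of MTL over STL that the abstract refers to when data are split into categories such as age groups or disease severity levels.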