Abstract:Besides the classical offline setup of machine learning, stream learning constitutes a well-established setup where data arrives over time in potentially non-stationary environments. Concept drift, the phenomenon that the underlying distribution changes over time poses a significant challenge. Yet, despite high practical relevance, there is little to no foundational theory for learning in the drifting setup comparable to classical statistical learning theory in the offline setting. This can be attributed to the lack of an underlying object comparable to a probability distribution as in the classical setup. While there exist approaches to transfer ideas to the streaming setup, these start from a data perspective rather than an algorithmic one. In this work, we suggest a new model of data over time that is aimed at the algorithm's perspective. Instead of defining the setup using time points, we utilize a window-based approach that resembles the inner workings of most stream learning algorithms. We compare our framework to others from the literature on a theoretical basis, showing that in many cases both model the same situation. Furthermore, we perform a numerical evaluation and showcase an application in the domain of critical infrastructure.
Abstract:Concept drift refers to the change of data distributions over time. While drift poses a challenge for learning models, requiring their continual adaption, it is also relevant in system monitoring to detect malfunctions, system failures, and unexpected behavior. In the latter case, the robust and reliable detection of drifts is imperative. This work studies the shortcomings of commonly used drift detection schemes. We show how to construct data streams that are drifting without being detected. We refer to those as drift adversarials. In particular, we compute all possible adversairals for common detection schemes and underpin our theoretical findings with empirical evaluations.
Abstract:Fairness is an important objective throughout society. From the distribution of limited goods such as education, over hiring and payment, to taxes, legislation, and jurisprudence. Due to the increasing importance of machine learning approaches in all areas of daily life including those related to health, security, and equity, an increasing amount of research focuses on fair machine learning. In this work, we focus on the fairness of partition- and prototype-based models. The contribution of this work is twofold: 1) we develop a general framework for fair machine learning of partition-based models that does not depend on a specific fairness definition, and 2) we derive a fair version of learning vector quantization (LVQ) as a specific instantiation. We compare the resulting algorithm against other algorithms from the literature on theoretical and real-world data showing its practical relevance.
Abstract:Research on methods for planning and controlling water distribution networks gains increasing relevance as the availability of drinking water will decrease as a consequence of climate change. So far, the majority of approaches is based on hydraulics and engineering expertise. However, with the increasing availability of sensors, machine learning techniques constitute a promising tool. This work presents the main tasks in water distribution networks, discusses how they relate to machine learning and analyses how the particularities of the domain pose challenges to and can be leveraged by machine learning approaches. Besides, it provides a technical toolkit by presenting evaluation benchmarks and a structured survey of the exemplary task of leakage detection and localization.
Abstract:Plenty of works have brought social biases in language models to attention and proposed methods to detect such biases. As a result, the literature contains a great deal of different bias tests and scores, each introduced with the premise to uncover yet more biases that other scores fail to detect. What severely lacks in the literature, however, are comparative studies that analyse such bias scores and help researchers to understand the benefits or limitations of the existing methods. In this work, we aim to close this gap for cosine based bias scores. By building on a geometric definition of bias, we propose requirements for bias scores to be considered meaningful for quantifying biases. Furthermore, we formally analyze cosine based scores from the literature with regard to these requirements. We underline these findings with experiments to show that the bias scores' limitations have an impact in the application case.
Abstract:Leakages are a major risk in water distribution networks as they cause water loss and increase contamination risks. Leakage detection is a difficult task due to the complex dynamics of water distribution networks. In particular, small leakages are hard to detect. From a machine-learning perspective, leakages can be modeled as concept drift. Thus, a wide variety of drift detection schemes seems to be a suitable choice for detecting leakages. In this work, we explore the potential of model-loss-based and distribution-based drift detection methods to tackle leakage detection. We additionally discuss the issue of temporal dependencies in the data and propose a way to cope with it when applying distribution-based detection. We evaluate different methods systematically for leakages of different sizes and detection times. Additionally, we propose a first drift-detection-based technique for localizing leakages.
Abstract:Concept drift, i.e., the change of the data generating distribution, can render machine learning models inaccurate. Several works address the phenomenon of concept drift in the streaming context usually assuming that consecutive data points are independent of each other. To generalize to dependent data, many authors link the notion of concept drift to time series. In this work, we show that the temporal dependencies are strongly influencing the sampling process. Thus, the used definitions need major modifications. In particular, we show that the notion of stationarity is not suited for this setup and discuss alternatives. We demonstrate that these alternative formal notions describe the observable learning behavior in numerical experiments.
Abstract:The world surrounding us is subject to constant change. These changes, frequently described as concept drift, influence many industrial and technical processes. As they can lead to malfunctions and other anomalous behavior, which may be safety-critical in many scenarios, detecting and analyzing concept drift is crucial. In this paper, we provide a literature review focusing on concept drift in unsupervised data streams. While many surveys focus on supervised data streams, so far, there is no work reviewing the unsupervised setting. However, this setting is of particular relevance for monitoring and anomaly detection which are directly applicable to many tasks and challenges in engineering. This survey provides a taxonomy of existing work on drift detection. Besides, it covers the current state of research on drift localization in a systematic way. In addition to providing a systematic literature review, this work provides precise mathematical definitions of the considered problems and contains standardized experiments on parametric artificial datasets allowing for a direct comparison of different strategies for detection and localization. Thereby, the suitability of different schemes can be analyzed systematically and guidelines for their usage in real-world scenarios can be provided. Finally, there is a section on the emerging topic of explaining concept drift.
Abstract:Facing climate change the already limited availability of drinking water will decrease in the future rendering drinking water an increasingly scarce resource. Considerable amounts of it are lost through leakages in water transportation and distribution networks. Leakage detection and localization are challenging problems due to the complex interactions and changing demands in water distribution networks. Especially small leakages are hard to pinpoint yet their localization is vital to avoid water loss over long periods of time. While there exist different approaches to solving the tasks of leakage detection and localization, they are relying on various information about the system, e.g. real-time demand measurements and the precise network topology, which is an unrealistic assumption in many real-world scenarios. In contrast, this work attempts leakage localization using pressure measurements only. For this purpose, first, leakages in the water distribution network are modeled employing Bayesian networks, and the system dynamics are analyzed. We then show how the problem is connected to and can be considered through the lens of concept drift. In particular, we argue that model-based explanations of concept drift are a promising tool for localizing leakages given limited information about the network. The methodology is experimentally evaluated using realistic benchmark scenarios.
Abstract:The notion of concept drift refers to the phenomenon that the distribution generating the observed data changes over time. If drift is present, machine learning models can become inaccurate and need adjustment. While there do exist methods to detect concept drift or to adjust models in the presence of observed drift, the question of explaining drift, i.e., describing the potentially complex and high dimensional change of distribution in a human-understandable fashion, has hardly been considered so far. This problem is of importance since it enables an inspection of the most prominent characteristics of how and where drift manifests itself. Hence, it enables human understanding of the change and it increases acceptance of life-long learning models. In this paper, we present a novel technology characterizing concept drift in terms of the characteristic change of spatial features based on various explanation techniques. To do so, we propose a methodology to reduce the explanation of concept drift to an explanation of models that are trained in a suitable way extracting relevant information regarding the drift. This way a large variety of explanation schemes is available. Thus, a suitable method can be selected for the problem of drift explanation at hand. We outline the potential of this approach and demonstrate its usefulness in several examples.