APTIKAL
Abstract:Tree-based ensemble methods such as random forests, gradient-boosted trees, and Bayesianadditive regression trees have been successfully used for regression problems in many applicationsand research studies. In this paper, we study ensemble versions of probabilisticregression trees that provide smooth approximations of the objective function by assigningeach observation to each region with respect to a probability distribution. We prove thatthe ensemble versions of probabilistic regression trees considered are consistent, and experimentallystudy their bias-variance trade-off and compare them with the state-of-the-art interms of performance prediction.
Abstract:Supervised machine learning often requires large training sets to train accurate models, yet obtaining large amounts of labeled data is not always feasible. Hence, it becomes crucial to explore active learning methods for reducing the size of training sets while maintaining high accuracy. The aim is to select the optimal subset of data for labeling from an initial unlabeled set, ensuring precise prediction of outcomes. However, conventional active learning approaches are comparable to classical random sampling. This paper proposes a wrapper active learning method for classification, organizing the sampling process into a tree structure, that improves state-of-the-art algorithms. A classification tree constructed on an initial set of labeled samples is considered to decompose the space into low-entropy regions. Input-space based criteria are used thereafter to sub-sample from these regions, the total number of points to be labeled being decomposed into each region. This adaptation proves to be a significant enhancement over existing active learning methods. Through experiments conducted on various benchmark data sets, the paper demonstrates the efficacy of the proposed framework by being effective in constructing accurate classification models, even when provided with a severely restricted labeled data set.
Abstract:This paper introduces a new structural causal model tailored for representing threshold-based IT systems and presents a new algorithm designed to rapidly detect root causes of anomalies in such systems. When root causes are not causally related, the method is proven to be correct; while an extension is proposed based on the intervention of an agent to relax this assumption. Our algorithm and its agent-based extension leverage causal discovery from offline data and engage in subgraph traversal when encountering new anomalies in online data. Our extensive experiments demonstrate the superior performance of our methods, even when applied to data generated from alternative structural causal models or real IT monitoring data.
Abstract:We study the problem of identifiability of the total effect of an intervention from observational time series only given an abstraction of the causal graph of the system. Specifically, we consider two types of abstractions: the extended summary causal graph which conflates all lagged causal relations but distinguishes between lagged and instantaneous relations; and the summary causal graph which does not give any indication about the lag between causal relations. We show that the total effect is always identifiable in extended summary causal graphs and we provide necessary and sufficient graphical conditions for identifiability in summary causal graphs. Furthermore, we provide adjustment sets allowing to estimate the total effect whenever it is identifiable.
Abstract:Machine learning methods usually rely on large sample size to have good performance, while it is difficult to provide labeled set in many applications. Pool-based active learning methods are there to detect, among a set of unlabeled data, the ones that are the most relevant for the training. We propose in this paper a meta-approach for pool-based active learning strategies in the context of multi-class classification tasks based on Proper Topological Regions. PTR, based on topological data analysis (TDA), are relevant regions used to sample cold-start points or within the active learning scheme. The proposed method is illustrated empirically on various benchmark datasets, being competitive to the classical methods from the literature.
Abstract:Information technology (IT) systems are vital for modern businesses, handling data storage, communication, and process automation. Monitoring these systems is crucial for their proper functioning and efficiency, as it allows collecting extensive observational time series data for analysis. The interest in causal discovery is growing in IT monitoring systems as knowing causal relations between different components of the IT system helps in reducing downtime, enhancing system performance and identifying root causes of anomalies and incidents. It also allows proactive prediction of future issues through historical data analysis. Despite its potential benefits, applying causal discovery algorithms on IT monitoring data poses challenges, due to the complexity of the data. For instance, IT monitoring data often contains misaligned time series, sleeping time series, timestamp errors and missing values. This paper presents case studies on applying causal discovery algorithms to different IT monitoring datasets, highlighting benefits and ongoing challenges.
Abstract:Constraint-based and noise-based methods have been proposed to discover summary causal graphs from observational time series under strong assumptions which can be violated or impossible to verify in real applications. Recently, a hybrid method (Assaad et al, 2021) that combines these two approaches, proved to be robust to assumption violation. However, this method assumes that the summary causal graph is acyclic, but cycles are common in many applications. For example, in ecological communities, there may be cyclic relationships between predator and prey populations, creating feedback loops. Therefore, this paper presents two new frameworks for hybrids of constraint-based and noise-based methods that can discover summary causal graphs that may or may not contain cycles. For each framework, we provide two hybrid algorithms which are experimentally tested on simulated data, realistic ecological data, and real data from various applications. Experiments show that our hybrid approaches are robust and yield good results over most datasets.
Abstract:This study addresses the problem of learning an extended summary causal graph on time series. The algorithms we propose fit within the well-known constraint-based framework for causal discovery and make use of information-theoretic measures to determine (in)dependencies between time series. We first introduce generalizations of the causation entropy measure to any lagged or instantaneous relations, prior to using this measure to construct extended summary causal graphs by adapting two well-known algorithms, namely PC and FCI. The behavior of our methods is illustrated through several experiments run on simulated and real datasets.
Abstract:In recent years, semi-supervised algorithms have received a lot of interest in both academia and industry. Among the existing techniques, self-training methods have arguably received more attention in the last few years. These models are designed to search the decision boundary on low density regions without making extra assumptions on the data distribution, and use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the set of unlabeled training samples with a margin greater than a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data and train a new classifier in conjunction with the labeled training set. We present self-training methods for binary and multiclass classification and their variants which were recently developed using Neural Networks. Finally, we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.
Abstract:Self-learning is a classical approach for learning with both labeled and unlabeled observations which consists in giving pseudo-labels to unlabeled training instances with a confidence score over a predetermined threshold. At the same time, the pseudo-labeling technique is prone to error and runs the risk of adding noisy labels into unlabeled training data. In this paper, we present a probabilistic framework for analyzing self-learning in the multi-class classification scenario with partially labeled data. First, we derive a transductive bound over the risk of the multi-class majority vote classifier. Based on this result, we propose to automatically choose the threshold for pseudo-labeling that minimizes the transductive bound. Then, we introduce a mislabeling error model to analyze the error of the majority vote classifier in the case of the pseudo-labeled data. We derive a probabilistic C-bound over the majority vote error when an imperfect label is given. Empirical results on different data sets show the effectiveness of our framework compared to several state-of-the-art semi-supervised approaches.