MILES, LAMSADE
Abstract:The impact of outliers and anomalies on model estimation and data processing is of paramount importance, as evidenced by the extensive body of research spanning various fields over several decades: thousands of research papers have been published on the subject. As a consequence, numerous reviews, surveys, and textbooks have sought to summarize the existing literature, encompassing a wide range of methods from both the statistical and data mining communities. While these endeavors to organize and summarize the research are invaluable, they face inherent challenges due to the pervasive nature of outliers and anomalies in all data-intensive applications, irrespective of the specific application field or scientific discipline. As a result, the resulting collection of papers remains voluminous and somewhat heterogeneous. To address the need for knowledge organization in this domain, this paper implements the first systematic meta-survey of general surveys and reviews on outlier and anomaly detection. Employing a classical systematic survey approach, the study collects nearly 500 papers using two specialized scientific search engines. From this comprehensive collection, a subset of 56 papers that claim to be general surveys on outlier detection is selected using a snowball search technique to enhance field coverage. A meticulous quality assessment phase further refines the selection to a subset of 25 high-quality general surveys. Using this curated collection, the paper investigates the evolution of the outlier detection field over a 20-year period, revealing emerging themes and methods. Furthermore, an analysis of the surveys sheds light on the survey writing practices adopted by scholars from different communities who have contributed to this field. Finally, the paper delves into several topics where consensus has emerged from the literature. These include taxonomies of outlier types, challenges posed by high-dimensional data, the importance of anomaly scores, the impact of learning conditions, difficulties in benchmarking, and the significance of neural networks. Non-consensual aspects are also discussed, particularly the distinction between local and global outliers and the challenges in organizing detection methods into meaningful taxonomies.
Abstract:In recent years, Transformer-based auto-attention mechanisms have been successfully applied to the analysis of a variety of context-reliant data types, from texts to images and beyond, including data from non-Euclidean geometries. In this paper, we present such a mechanism, designed to classify sequences of Symmetric Positive Definite matrices while preserving their Riemannian geometry throughout the analysis. We apply our method to automatic sleep staging on timeseries of EEG-derived covariance matrices from a standard dataset, obtaining high levels of stage-wise performance.
Abstract:This paper presents an introduction to the state-of-the-art in anomaly and change-point detection. On the one hand, the main concepts needed to understand the vast scientific literature on those subjects are introduced. On the other, a selection of important surveys and books, as well as two selected active research topics in the field, are presented.
Abstract:CNNs are often assumed to be capable of using contextual information about distinct objects (such as their directional relations) inside their receptive field. However, the nature and limits of this capacity has never been explored in full. We explore a specific type of relationship~-- directional~-- using a standard U-Net trained to optimize a cross-entropy loss function for segmentation. We train this network on a pretext segmentation task requiring directional relation reasoning for success and state that, with enough data and a sufficiently large receptive field, it succeeds to learn the proposed task. We further explore what the network has learned by analysing scenarios where the directional relationships are perturbed, and show that the network has learned to reason using these relationships.
Abstract:Epistemic voting interprets votes as noisy signals about a ground truth. We consider contexts where the truth consists of a set of objective winners, knowing a lower and upper bound on its cardinality. A prototypical problem for this setting is the aggre-gation of multi-label annotations with prior knowledge on the size of the ground truth. We posit noisemodels, for which we define rules that output an optimal set of winners. We report on experiments on multi-label annotations (which we collected).
Abstract:In causality, estimating the effect of a treatment without confounding inference remains a major issue because requires to assess the outcome in both case with and without treatment. Not being able to observe simultaneously both of them, the estimation of potential outcome remains a challenging task. We propose an innovative approach where the problem is reformulated as a missing data model. The aim is to estimate the hidden distribution of \emph{causal populations}, defined as a function of treatment and outcome. A Causal Auto-Encoder (CAE), enhanced by a prior dependent on treatment and outcome information, assimilates the latent space to the probability distribution of the target populations. The features are reconstructed after being reduced to a latent space and constrained by a mask introduced in the intermediate layer of the network, containing treatment and outcome information.
Abstract:Epistemic social choice aims at unveiling a hidden ground truth given votes, which are interpreted as noisy signals about it. We consider here a simple setting where votes consist of approval ballots: each voter approves a set of alternatives which they believe can possibly be the ground truth. Based on the intuitive idea that more reliable votes contain fewer alternatives, we define several noise models that are approval voting variants of the Mallows model. The likelihood-maximizing alternative is then characterized as the winner of a weighted approval rule, where the weight of a ballot decreases with its cardinality. We have conducted an experiment on three image annotation datasets; they conclude that rules based on our noise model outperform standard approval voting; the best performance is obtained by a variant of the Condorcet noise model.
Abstract:This technical report is devoted to the continuous estimation of an epsilon-assignment. Roughly speaking, an epsilon assignment between two sets V1 and V2 may be understood as a bijective mapping between a sub part of V1 and a sub part of V2 . The remaining elements of V1 (not included in this mapping) are mapped onto an epsilon pseudo element of V2 . We say that such elements are deleted. Conversely, the remaining elements of V2 correspond to the image of the epsilon pseudo element of V1. We say that these elements are inserted. As a result our method provides a result similar to the one of the Sinkhorn algorithm with the additional ability to reject some elements which are either inserted or deleted. It thus naturally handles sets V1 and V2 of different sizes and decides mappings/insertions/deletions in a unified way. Our algorithms are iterative and differentiable and may thus be easily inserted within a backpropagation based learning framework such as artificial neural networks.
Abstract:Functional connectivity is a key approach to investigate oscillatory activities of the brain that provides important insights on the underlying dynamic of neuronal interactions and that is mostly applied for brain activity analysis. Building on the advances in information geometry for brain-computer interface, we propose a novel framework that combines functional connectivity estimators and covariance-based pipelines to classify mental states, such as motor imagery. A Riemannian classifier is trained for each estimator and an ensemble classifier combines the decisions in each feature space. A thorough assessment of the functional connectivity estimators is provided and the best performing pipeline, called FUCONE, is evaluated on different conditions and datasets. Using a meta-analysis to aggregate results across datasets, FUCONE performed significantly better than all state-of-the-art methods. The performance gain is mostly imputable to the improved diversity of the feature spaces, increasing the robustness of the ensemble classifier with respect to the inter- and intra-subject variability.
Abstract:The inference of minimum spanning arborescences within a set of objects is a general problem which translates into numerous application-specific unsupervised learning tasks. We introduce a unified and generic structure called edit arborescence that relies on edit paths between data in a collection, as well as the Min Edit Arborescence Problem, which asks for an edit arborescence that minimizes the sum of costs of its inner edit paths. Through the use of suitable cost functions, this generic framework allows to model a variety of problems. In particular, we show that by introducing encoding size preserving edit costs, it can be used as an efficient method for compressing collections of labeled graphs. Experiments on various graph datasets, with comparisons to standard compression tools, show the potential of our method.