Abstract: We introduce Natural Learning (NL), a novel algorithm that elevates the explainability and interpretability of machine learning to an extreme level. NL simplifies decisions into intuitive rules, like "We rejected your loan because your income, employment status, and age collectively resemble a rejected prototype more than an accepted prototype." When applied to real-life datasets, NL produces impressive results. For example, in a colon cancer dataset with 1545 patients and 10935 genes, NL achieves 98.1% accuracy, comparable to DNNs and RF, by analyzing just 3 genes of test samples against 2 discovered prototypes. Similarly, in the UCI's WDBC dataset, NL achieves 98.3% accuracy using only 7 features and 2 prototypes. Even on the MNIST dataset (0 vs. 1), NL achieves 99.5% accuracy with only 3 pixels from 2 prototype images. NL is inspired by prototype theory, an old concept in cognitive psychology suggesting that people learn single sparse prototypes to categorize objects. Leveraging this relaxed assumption, we redesign Support Vector Machines (SVM), replacing its mathematical formulation with a fully nearest-neighbor-based solution, and to address the curse of dimensionality, we utilize locality-sensitive hashing. Following the theory's generalizability principle, we propose a recursive method to prune non-core features. As a result, NL efficiently discovers the sparsest prototypes in O(n^2pL) with high parallelization capacity in terms of n. Evaluation of NL on 17 benchmark datasets shows that it significantly outperforms decision trees and logistic regression, two methods widely favored in healthcare for their interpretability. Moreover, NL achieves performance comparable to fine-tuned black-box models such as deep neural networks and random forests in 40% of cases, with only a 1-2% lower average accuracy. The code is available at http://natural-learning.cc.
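To make the decision rule concrete, here is a minimal sketch of nearest-prototype prediction in the spirit of NL. This is not the authors' implementation: the prototypes, labels, and core-feature indices below are hypothetical placeholders for what NL is assumed to have discovered during training.

```python
# Minimal sketch of nearest-prototype classification (hypothetical values,
# not NL's discovered prototypes or its training procedure).
import numpy as np

def predict(x, prototypes, labels, core_idx):
    """Assign x the label of the nearest prototype, comparing only the
    small subset of core features assumed to survive NL's pruning."""
    dists = [np.linalg.norm(x[core_idx] - p[core_idx]) for p in prototypes]
    return labels[int(np.argmin(dists))]

# Two prototypes (one per class) and, e.g., three core features.
prototypes = [np.array([0.2, 1.5, 0.1, 3.0]),   # "rejected" prototype
              np.array([0.9, 0.4, 2.2, 1.0])]   # "accepted" prototype
labels = ["rejected", "accepted"]
core_idx = np.array([0, 1, 3])                  # hypothetical core features

print(predict(np.array([0.3, 1.2, 0.5, 2.8]), prototypes, labels, core_idx))
# -> "rejected": the sample resembles the rejected prototype more.
```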
Abstract: SimTensor is a multi-platform, open-source software package for generating artificial tensor data (with either CP/PARAFAC or Tucker structure) for reproducible research on tensor factorization algorithms. SimTensor is a stand-alone application based on MATLAB. It provides a wide range of facilities for generating tensor data with various configurations. It comes with a user-friendly graphical user interface, which enables the user to generate tensors with complicated settings in an easy way. It can also export generated data to universal formats such as CSV and HDF5, which can be imported by a wide range of programming languages (C, C++, Java, R, Fortran, MATLAB, Perl, Python, and many more). The most innovative part of SimTensor is its ability to generate temporal tensors with periodic waves, seasonal effects, and streaming structure. It can apply constraints such as non-negativity and different kinds of sparsity to the data. SimTensor can also simulate different kinds of change-points and inject various types of anomalies. The source code and binary versions of SimTensor are available for download at http://www.simtensor.org.
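For illustration, a minimal NumPy sketch of the kind of data SimTensor automates: a third-order tensor with a known CP/PARAFAC structure. The mode sizes, rank, noise level, and CSV export below are arbitrary choices, not SimTensor's actual interface.

```python
# Minimal sketch: synthesize a noisy third-order tensor with CP structure.
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 20, 15, 30, 3            # mode sizes and CP rank (arbitrary)

# One non-negative factor matrix per mode.
A, B, C = (rng.random((n, R)) for n in (I, J, K))

# X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r], plus additive noise.
X = np.einsum('ir,jr,kr->ijk', A, B, C)
X += 0.01 * rng.standard_normal(X.shape)

np.savetxt('factor_A.csv', A, delimiter=',')   # export, e.g., to CSV
```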
Abstract: Hotspot detection aims at identifying subgroups in the observations that are unexpected with respect to some baseline information. For instance, in disease surveillance, the purpose is to detect sub-regions in spatiotemporal space where the count of reported diseases (e.g., cancer) is higher than expected with respect to the population. The state-of-the-art method for this kind of problem is Space-Time Scan Statistics (STScan), which exhaustively searches the whole space with a sliding window, looking for significant spatiotemporal clusters. STScan makes some restrictive assumptions about the distribution of the data, the shape of the hotspots, and the quality of the data, which can be unrealistic for some nontraditional data sources. We propose a novel methodology called EigenSpot that, instead of exhaustively searching the space, tracks changes in the space-time correlation structure. Not only is the new approach much more computationally efficient, but it also makes no assumptions about the data distribution, hotspot shape, or data quality. The principal idea is that, by jointly combining the abnormal elements of the principal spatial and temporal singular vectors, the location of hotspots in spatiotemporal space can be approximated. A comprehensive experimental evaluation on both simulated and real data sets reveals the effectiveness of the proposed method.
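A minimal sketch of the underlying idea, assuming a region-by-time count matrix: take the leading spatial and temporal singular vectors and flag their abnormal elements with a simple robust rule. The paper's actual criterion may differ; the injected hotspot and the MAD threshold below are illustrative assumptions.

```python
# Minimal sketch: locate a hotspot from abnormal singular-vector elements.
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(5.0, size=(50, 100)).astype(float)  # regions x days
counts[17, 60:70] += 40                                   # injected hotspot

U, s, Vt = np.linalg.svd(counts, full_matrices=False)
spatial, temporal = np.abs(U[:, 0]), np.abs(Vt[0, :])     # principal vectors

def abnormal(v, k=3.0):
    """Flag elements further than k MADs from the median (a simple
    robust rule; the paper's criterion may differ)."""
    med = np.median(v)
    mad = np.median(np.abs(v - med)) + 1e-12
    return np.where(np.abs(v - med) > k * mad)[0]

# Jointly combining the flagged indices approximates the hotspot location.
print("regions:", abnormal(spatial))    # expected to include region 17
print("days:   ", abnormal(temporal))   # expected to include days 60..69
```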
Abstract: Space and time are two critical components of many real-world systems. For this reason, the analysis of anomalies in spatiotemporal data has been of great interest. In this work, the application of tensor decomposition and eigenspace techniques to spatiotemporal hotspot detection is investigated. An algorithm called SST-Hotspot is proposed, which accounts for spatiotemporal variations in the data and detects hotspots by matching the eigenvector elements of the case and population tensors. The experimental results reveal the promise of tensor decomposition and eigenvector-based techniques for hotspot analysis.
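A minimal sketch of the eigenvector-matching idea, with the leading singular vectors of mode unfoldings standing in for a full tensor decomposition. The synthetic data, the injected hotspot, and the top-3 reporting are illustrative assumptions, not the SST-Hotspot procedure itself.

```python
# Minimal sketch: compare leading eigenvectors of case vs. population tensors.
import numpy as np

rng = np.random.default_rng(2)
population = rng.uniform(1e3, 5e3, size=(40, 40, 52))        # x, y, week
rates = 0.01 * rng.uniform(0.8, 1.2, size=population.shape)
cases = rng.poisson(population * rates).astype(float)
cases[10:13, 22:25, 30:40] *= 3                              # injected hotspot

def leading_vector(T, mode):
    """Leading left singular vector of the mode-n unfolding of tensor T."""
    M = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
    return np.abs(np.linalg.svd(M, full_matrices=False)[0][:, 0])

# Elements where the case eigenvector departs most from the population's
# mark hotspot candidates along each mode.
for mode, name in [(0, 'x'), (1, 'y'), (2, 'week')]:
    diff = leading_vector(cases, mode) - leading_vector(population, mode)
    print(name, 'candidates:', np.argsort(diff)[-3:])
```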
Abstract: Syndromic surveillance systems continuously monitor multiple pre-diagnostic daily streams of indicators from different regions with the aim of early detection of disease outbreaks. The main objective of these systems is to detect outbreaks hours or days before clinical and laboratory confirmation. The data generated by these systems are usually multivariate and seasonal, with spatial and temporal dimensions. What's Strange About Recent Events (WSARE) is the state-of-the-art algorithm for such problems. It exhaustively searches for contrast sets in the multivariate data and signals an alarm when it finds statistically significant rules. This bottom-up approach achieves a much lower detection delay than existing top-down approaches. However, WSARE is very sensitive to small-scale changes and consequently suffers from a relatively high false alarm rate. We propose a new approach called EigenEvent that is neither fully top-down nor bottom-up. Instead of a top-down or bottom-up search, we track changes in the data correlation structure via eigenspace techniques. This methodology enables us to detect both overall changes (via eigenvalues) and dimension-level changes (via eigenvectors). Experimental results on a hundred benchmark data sets reveal that EigenEvent presents better overall performance than the state of the art, in particular in terms of false alarm rate.
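A minimal sketch of eigenspace tracking in this spirit (not the paper's exact algorithm): compare the covariance eigenstructure of a recent window against a baseline window, where an eigenvalue jump signals an overall change and an eigenvector shift localizes the changed dimension. The window sizes and the injected outbreak are illustrative.

```python
# Minimal sketch: track covariance eigenstructure across two windows.
import numpy as np

def eigen_change(baseline, recent):
    """Return (eigenvalue ratio, per-dimension eigenvector shift)."""
    def leading(X):
        vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
        return vals[-1], vecs[:, -1]              # largest eigenpair
    lam_b, v_b = leading(baseline)
    lam_r, v_r = leading(recent)
    v_r = v_r if v_r @ v_b >= 0 else -v_r         # resolve sign ambiguity
    return lam_r / lam_b, np.abs(v_r - v_b)

rng = np.random.default_rng(3)
baseline = rng.normal(size=(60, 5))               # 60 days x 5 indicators
recent = rng.normal(size=(14, 5))
recent[:, 2] += np.linspace(0, 4, 14)             # ramping outbreak, stream 2

ratio, shift = eigen_change(baseline, recent)
print("overall change (eigenvalue ratio):", round(ratio, 2))
print("most-shifted indicator:", np.argmax(shift))  # expected: stream 2
```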
Abstract: Failure detection in telecommunication networks is a vital task. So far, several supervised and unsupervised solutions have been proposed for discovering failures in such networks. Among them, unsupervised approaches have attracted more attention, since no labeled data are required. Often, network devices are not able to provide information about the type of failure. In such cases, the type of failure is not known in advance, and the unsupervised setting is more appropriate for diagnosis. Among unsupervised approaches, Principal Component Analysis (PCA) is a well-known solution that has been widely used in the anomaly detection literature and can be applied to matrix data (e.g., users x features). However, an important property of network data is its sequential, temporal nature, so considering the interaction of dimensions over a third mode, such as time, may provide better insights into the nature of network failures. In this paper, we demonstrate the power of three-way analysis to detect events and anomalies in time-evolving network data.
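As an illustration of what the third mode adds, a minimal sketch that stacks user-by-feature snapshots into a tensor and inspects the temporal profile obtained from the time-mode unfolding; the data, the injected failure, and the threshold are assumptions for demonstration only, not the paper's method.

```python
# Minimal sketch: time-mode analysis of a users x features x time tensor.
import numpy as np

rng = np.random.default_rng(4)
users, feats, T = 100, 8, 48
X = rng.normal(size=(users, feats, T))        # hourly network snapshots
X[:, 3, 40:] += 2.0                           # failure: feature 3 shifts late

# Mode-3 (time) unfolding: one row per time point, then PCA via SVD.
M = np.moveaxis(X, 2, 0).reshape(T, -1)
U, s, Vt = np.linalg.svd(M - M.mean(axis=0), full_matrices=False)
temporal = np.abs(U[:, 0])                    # leading temporal profile

print("anomalous hours:", np.where(temporal > 3 * np.median(temporal))[0])
# Expected to flag the final hours, where the failure was injected.
```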