Abstract: This paper addresses the problem of accurately estimating a function on one domain when only its discrete samples are available on another domain. To meet this challenge, we use a neural network trained to incorporate prior knowledge of the function. By carefully analyzing the problem, we obtain a bound on the error over the extrapolation domain and define a condition number that quantifies the difficulty of the setup. Compared to other machine learning methods for time series prediction, such as transformers, our approach is suitable for setups where the interpolation and extrapolation regions are general subdomains and, in particular, manifolds. In addition, our construction leads to an improved loss function that boosts the accuracy and robustness of our neural network. We conduct comprehensive numerical tests and compare our extrapolation approach against standard methods. The results illustrate the effectiveness of our approach in various scenarios.
Abstract: High-dimensional imbalanced data poses a challenge for machine learning. In the absence of sufficient or high-quality labels, unsupervised feature selection methods are crucial for the success of subsequent algorithms, so there is a growing need for unsupervised feature selection algorithms focused on imbalanced data. We propose the Marginal Laplacian Score (MLS), a modification of the well-known Laplacian Score (LS) that is better suited to imbalanced data. We introduce the assumption that minority-class or anomalous samples appear more frequently in the margin of the features. Consequently, MLS aims to preserve the local structure of the data set's margin. As MLS is better suited for handling imbalanced data, we propose its integration into modern feature selection methods that utilize the Laplacian score. We integrate the MLS algorithm into Differentiable Unsupervised Feature Selection (DUFS), resulting in DUFS-MLS. The proposed methods demonstrate robust and improved performance on synthetic and public data sets.
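For orientation, the standard Laplacian Score that MLS modifies can be sketched as follows. This is a minimal NumPy illustration of the classical score only (k-nearest-neighbor graph with heat-kernel weights); the marginal modification is not specified in the abstract and is not reproduced here. The function name `laplacian_score`, its parameters, and the default bandwidth heuristic are assumptions made for this sketch.

```python
import numpy as np

def laplacian_score(X, n_neighbors=5, t=None):
    """Classical Laplacian Score per feature (lower = better locality preservation).

    Hypothetical helper for illustration; requires n_neighbors < number of samples.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

    # Symmetric k-nearest-neighbor graph, excluding self-loops.
    d2_no_self = d2.copy()
    np.fill_diagonal(d2_no_self, np.inf)
    nbrs = np.argsort(d2_no_self, axis=1)[:, :n_neighbors]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), n_neighbors), nbrs.ravel()] = True
    mask |= mask.T

    # Heat-kernel weights; the default bandwidth (mean squared edge length)
    # is an assumption of this sketch, not part of the definition.
    if t is None:
        t = max(d2[mask].mean(), 1e-12)
    S = np.where(mask, np.exp(-d2 / t), 0.0)

    deg = S.sum(axis=1)        # diagonal of the degree matrix D
    L = np.diag(deg) - S       # graph Laplacian L = D - S

    # Remove the degree-weighted mean from each feature.
    Xc = X - (deg @ X) / deg.sum()

    num = np.sum(Xc * (L @ Xc), axis=0)            # f^T L f for each feature
    den = np.sum(Xc * (deg[:, None] * Xc), axis=0)  # f^T D f for each feature
    return num / np.maximum(den, 1e-12)
```

Features with lower scores vary smoothly over the neighborhood graph relative to their overall variance, so ranking features by ascending score gives the usual LS selection order.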
Abstract: We consider a self-supervised approach to anomaly detection in tabular data. Random transformations are applied to the data, and then each transformation is identified based on its output. These predicted transformations are used to identify anomalies. In tabular data, this approach faces many challenges related to the uncorrelated nature of the data. These challenges affect both the choice of transformations and the use of their predictions. To this end, we propose SORTAD, a novel algorithm tailor-made to solve these challenges. SORTAD optimally chooses random transformations that help the classification process, and has a scoring function that is more sensitive to the changes in the transformation classification predictions encountered in tabular data. SORTAD achieved state-of-the-art results on multiple commonly used anomaly detection data sets, as well as in the overall results across all data sets tested.
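The general transformation-prediction idea described above can be sketched in a few lines. This is a generic illustration, not SORTAD itself: it assumes random affine transformations, a nearest-centroid classifier in place of a trained network, and a summed negative log-probability score. The function names `fit_transform_centroids` and `anomaly_score` and all parameters are hypothetical.

```python
import numpy as np

def fit_transform_centroids(X_train, n_transforms=8, out_dim=4, seed=0):
    """Draw random affine transformations and record where normal data lands.

    Hypothetical helper: returns the transformation matrices, the bias
    vectors, and one centroid per transformation.
    """
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    mats = rng.normal(size=(n_transforms, d, out_dim))
    biases = rng.normal(size=(n_transforms, out_dim))
    # Centroid m is the mean of the training data under transformation m.
    centroids = np.stack(
        [(X_train @ W + b).mean(axis=0) for W, b in zip(mats, biases)]
    )
    return mats, biases, centroids

def anomaly_score(X, mats, biases, centroids):
    """Sum over transformations of -log p(correct transformation | output)."""
    scores = np.zeros(X.shape[0])
    for m, (W, b) in enumerate(zip(mats, biases)):
        Z = X @ W + b  # transformed samples, shape (n, out_dim)
        # Squared distance from each transformed sample to every centroid.
        d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Softmax over negative distances acts as a simple classifier.
        logits = -d2
        logits = logits - logits.max(axis=1, keepdims=True)
        log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        scores += -log_p[:, m]
    return scores
```

Samples whose applied transformation is hard to identify receive a high score. In this sketch the transformations are drawn blindly; the abstract's point is that SORTAD instead selects transformations and a scoring function suited to tabular data.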