Anton Hinel

Targeted synthetic data generation for tabular data via hardness characterization

Oct 01, 2024

Tommaso Ferracci, Leonie Tabea Goldmann, Anton Hinel, Francesco Sanna Passino

Abstract: Synthetic data generation has proven successful in improving model performance and robustness when data are scarce or of low quality. Using the data valuation framework to statistically identify beneficial and detrimental observations, we introduce a novel augmentation pipeline that generates only high-value training points based on hardness characterization. We first demonstrate via benchmarks on real data that Shapley-based data valuation methods perform comparably with learning-based methods in hardness characterization tasks, while offering significant theoretical and computational advantages. Then, we show that synthetic data generators trained on the hardest points outperform non-targeted data augmentation on simulated data and on a large-scale credit default prediction task. In particular, our approach improves the quality of out-of-sample predictions and is computationally more efficient than non-targeted methods.
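
To make the idea of Shapley-based data valuation concrete, here is a minimal sketch of a permutation-sampling (Monte Carlo) Shapley estimate. It is not the authors' pipeline: the 1-nearest-neighbour scorer, the toy one-dimensional data, and every function name are assumptions chosen for illustration only. Each training point's value is its average marginal contribution to test accuracy over random orderings. Treating low-valued points as candidate "hard" points is likewise an assumption here.

```python
import random

def knn1_accuracy(train, test_set):
    """Accuracy of a 1-nearest-neighbour classifier on 1-D points.

    Returns 0.0 for an empty training set (illustrative convention).
    """
    if not train:
        return 0.0
    correct = 0
    for x, y in test_set:
        nearest = min(train, key=lambda p: abs(p[0] - x))
        correct += nearest[1] == y
    return correct / len(test_set)

def monte_carlo_shapley(train, test_set, n_permutations=200, seed=0):
    """Estimate each training point's Shapley value by permutation sampling."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset = []
        prev = knn1_accuracy(subset, test_set)
        for i in order:
            subset.append(train[i])
            score = knn1_accuracy(subset, test_set)
            values[i] += score - prev  # marginal contribution of point i
            prev = score
    return [v / n_permutations for v in values]

# Toy data (hypothetical): two clusters plus one point whose label
# disagrees with its neighbours (the last entry).
train = [(0.0, 0), (0.1, 0), (0.2, 0), (1.0, 1), (1.1, 1), (1.2, 1), (0.15, 1)]
test_set = [(0.05, 0), (0.12, 0), (0.18, 0), (1.05, 1), (1.15, 1)]

values = monte_carlo_shapley(train, test_set)
# Indices sorted from lowest to highest estimated value.
ranking = sorted(range(len(train)), key=lambda i: values[i])
```

Because each permutation's marginal contributions telescope, the estimated values sum to the gap between the score on the full training set and the score on the empty set, which is a quick sanity check on any implementation of this kind.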

Extended Deep Adaptive Input Normalization for Preprocessing Time Series Data for Neural Networks

Oct 23, 2023

Marcus A. K. September, Francesco Sanna Passino, Leonie Goldmann, Anton Hinel

Abstract: Data preprocessing is a crucial part of any machine learning pipeline, and it can have a significant impact on both performance and training efficiency. This is especially evident when using deep neural networks for time series prediction and classification: real-world time series data often exhibit irregularities such as multi-modality, skewness and outliers, and model performance can degrade rapidly if these characteristics are not adequately addressed. In this work, we propose the EDAIN (Extended Deep Adaptive Input Normalization) layer, a novel adaptive neural layer that learns how to appropriately normalize irregular time series data for a given task in an end-to-end fashion, instead of using a fixed normalization scheme. This is achieved by optimizing its unknown parameters simultaneously with the deep neural network using back-propagation. Our experiments, conducted using synthetic data, a credit default prediction dataset, and a large-scale limit order book benchmark dataset, demonstrate the superior performance of the EDAIN layer when compared to conventional normalization methods and existing adaptive time series preprocessing layers.
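
The phrase "optimizing its unknown parameters simultaneously with the deep neural network" can be illustrated with a deliberately simplified stand-in. The sketch below is not the EDAIN layer. It assumes a single learnable shift `m` and scale `s` applied to a scalar input, followed by a one-weight linear model, with all four parameters updated together by gradient descent on a mean squared error. The function name, data, and hyperparameters are hypothetical.

```python
def train_adaptive_norm(xs, ys, lr=0.05, epochs=200):
    """Jointly fit a shift/scale normalization and a linear model.

    Forward pass: z = (x - m) * s, prediction = w * z + b.
    Gradients are derived by hand via the chain rule.
    """
    m, s = 0.0, 1.0  # normalization parameters (learned, not fixed)
    w, b = 0.0, 0.0  # downstream model parameters
    n = len(xs)
    losses = []
    for _ in range(epochs):
        gm = gs = gw = gb = 0.0
        loss = 0.0
        for x, y in zip(xs, ys):
            z = (x - m) * s
            err = w * z + b - y
            loss += err * err / n
            d = 2.0 * err / n      # d(loss)/d(prediction)
            gw += d * z            # d(prediction)/dw = z
            gb += d                # d(prediction)/db = 1
            gs += d * w * (x - m)  # chain rule through z
            gm += d * w * (-s)     # chain rule through z
        losses.append(loss)        # loss before this epoch's update
        m -= lr * gm
        s -= lr * gs
        w -= lr * gw
        b -= lr * gb
    return (m, s, w, b), losses

# Toy regression target y = 2x + 1 (hypothetical data).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
params, losses = train_adaptive_norm(xs, ys)
```

The point of the sketch is only that the normalization parameters receive gradients from the task loss, rather than being fixed in advance from summary statistics such as the mean and standard deviation.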
