Title: Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark

Oct 25, 2023

Lasse Hansen, Nabeel Seedat, Mihaela van der Schaar, Andrija Petrovic

Abstract: Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques that profile the data to guide the synthetic data generation process. Moreover, we shed light on the often-ignored consequences of neglecting these data profiles during synthetic data generation, even when statistical fidelity appears high. Subsequently, we propose a novel framework to evaluate the integration of data profiles to guide the creation of more representative synthetic data. In an empirical study, we evaluate the performance of five state-of-the-art models for tabular data generation on eleven distinct tabular datasets. The findings offer critical insights into the successes and limitations of current synthetic data generation techniques. Finally, we provide practical recommendations for integrating data-centric insights into the synthetic data generation process, with a specific focus on classification performance, model selection, and feature selection. This study aims to reevaluate conventional approaches to synthetic data generation and promote the application of data-centric AI techniques in improving the quality and effectiveness of synthetic data.

* Presented at NeurIPS 2023 (Datasets & Benchmarks). *Hansen & Seedat contributed equally
