Samuel Maddock

Synthetic Tabular Data: Methods, Attacks and Defenses

Jun 06, 2025

Graham Cormode, Samuel Maddock, Enayat Ullah, Shripad Gade

Abstract: Synthetic data is often positioned as a solution to replace sensitive fixed-size datasets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade, leveraging corresponding advances in machine learning and data analytics. In this survey, we cover the key developments and the main concepts in tabular synthetic data generation, including paradigms based on probabilistic graphical models and on deep learning. We provide background and motivation, before giving a technical deep-dive into the methodologies. We also address the limitations of synthetic data by studying attacks that seek to retrieve information about the original sensitive data. Finally, we present extensions and open problems in this area.

* Survey paper for accepted lecture-style tutorial at ACM KDD 2025

Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation

Apr 15, 2025

Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock

Abstract: Differentially Private Synthetic Data Generation (DP-SDG) is a key enabler of private and secure tabular-data sharing, producing artificial data that carries through the underlying statistical properties of the input data. This typically involves adding carefully calibrated statistical noise to guarantee individual privacy, at the cost of synthetic data quality. Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data. These methods study a horizontal public-private partitioning which assumes access to a small number of public rows that can be used for model initialization, providing a small utility gain. However, realistic datasets often naturally consist of public and private attributes, making a vertical public-private partitioning relevant for practical synthetic data deployments. We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting. We compare this framework against our alternative approach that uses conditional generation, highlighting initial limitations of public-data assisted methods and proposing future research directions to address these challenges.

* Accepted to the Synthetic Data x Data Access Problem (SynthData) workshop @ ICLR 2025
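
The abstract above mentions that DP-SDG typically adds calibrated noise to statistics of the input data. As a rough illustration only (not the method proposed in the paper), the sketch below applies the Laplace mechanism to a single one-way marginal and then samples synthetic values from it. The function names, the add/remove-one-record neighbouring relation (giving L1 sensitivity 1), and the clamp-and-normalise post-processing are all assumptions made for this example.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginal(values, domain, epsilon, rng):
    """Hypothetical helper: epsilon-DP estimate of a one-way marginal."""
    counts = Counter(values)
    # Assumption: neighbouring datasets differ by adding or removing one
    # record, so the histogram has L1 sensitivity 1.
    scale = 1.0 / epsilon
    noisy = {v: counts.get(v, 0) + laplace_noise(scale, rng) for v in domain}
    # Post-processing (costs no extra privacy): clamp negatives, normalise.
    clipped = {v: max(c, 0.0) for v, c in noisy.items()}
    total = sum(clipped.values())
    if total == 0:
        return {v: 1.0 / len(domain) for v in domain}
    return {v: c / total for v, c in clipped.items()}


def sample_synthetic(probs, n, rng):
    """Draw n synthetic values from the noisy marginal."""
    values = list(probs)
    weights = [probs[v] for v in values]
    return rng.choices(values, weights=weights, k=n)


if __name__ == "__main__":
    rng = random.Random(0)
    private_column = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
    probs = noisy_marginal(private_column, ["a", "b", "c"], epsilon=1.0, rng=rng)
    print(sample_synthetic(probs, 5, rng))
```

A smaller `epsilon` gives a larger noise scale, which is the privacy-versus-quality trade-off the abstract refers to. Practical generators measure many (often multi-way) marginals and fit a joint model to them, which this sketch does not attempt.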

FLAIM: AIM-based Synthetic Data Generation in the Federated Setting

Oct 05, 2023

Samuel Maddock, Graham Cormode, Carsten Maple

Abstract: Preserving individual privacy while enabling collaborative data sharing is crucial for organizations. Synthetic data generation is one solution, producing artificial data that mirrors the statistical properties of private data. While numerous techniques have been devised under differential privacy, they predominantly assume data is centralized. However, data is often distributed across multiple clients in a federated manner. In this work, we initiate the study of federated synthetic tabular data generation. Building upon a SOTA central method known as AIM, we present DistAIM and FLAIM. We show it is straightforward to distribute AIM, extending a recent approach based on secure multi-party computation which necessitates additional overhead, making it less suited to federated scenarios. We then demonstrate that naively federating AIM can lead to substantial degradation in utility under the presence of heterogeneity. To mitigate both issues, we propose an augmented FLAIM approach that maintains a private proxy of heterogeneity. We simulate our methods across a range of benchmark datasets under different degrees of heterogeneity and show this can improve utility while reducing overhead.

* 21 pages

CANIFE: Crafting Canaries for Empirical Privacy Measurement in Federated Learning

Oct 06, 2022

Samuel Maddock, Alexandre Sablayrolles, Pierre Stock

Abstract: Federated Learning (FL) is a setting for training machine learning models in distributed environments where the clients do not share their raw data but instead send model updates to a server. However, model updates can be subject to attacks and leak private information. Differential Privacy (DP) is a leading mitigation strategy which involves adding noise to clipped model updates, trading off performance for strong theoretical privacy guarantees. Previous work has shown that the threat model of DP is conservative and that the obtained guarantees may be vacuous or may not directly translate to information leakage in practice. In this paper, we aim to achieve a tighter measurement of the model exposure by considering a realistic threat model. We propose a novel method, CANIFE, that uses canaries, samples carefully crafted by a strong adversary, to evaluate the empirical privacy of a training round. We apply this attack to vision models trained on CIFAR-10 and CelebA and to language models trained on Sent140 and Shakespeare. In particular, in realistic FL scenarios, we demonstrate that the empirical epsilon obtained with CANIFE is 2-7x lower than the theoretical bound.
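
The abstract describes DP in FL as adding noise to clipped model updates. The sketch below illustrates that generic clip-then-noise aggregation step and is not the CANIFE attack itself. The names `clip_update`, `private_aggregate`, `clip_norm` and `noise_multiplier` are hypothetical, and the choice of Gaussian noise with standard deviation `noise_multiplier * clip_norm` added to the summed updates is an assumption based on common DP-FedAvg practice rather than anything stated in the abstract.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale a client update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * factor for x in update]


def private_aggregate(updates, clip_norm, noise_multiplier, rng):
    """Average clipped client updates after adding Gaussian noise to the sum."""
    dim = len(updates[0])
    clipped = [clip_update(u, clip_norm) for u in updates]
    total = [sum(u[i] for u in clipped) for i in range(dim)]
    # Assumption: noise is calibrated to the clipping norm, which bounds
    # how much any single client can change the sum.
    std = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, std) for t in total]
    return [x / len(updates) for x in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    client_updates = [[3.0, 4.0], [0.0, 0.5], [-1.0, 2.0]]
    print(private_aggregate(client_updates, clip_norm=1.0,
                            noise_multiplier=1.0, rng=rng))
```

Clipping caps each client's influence on the aggregate, and the noise multiplier then determines the theoretical epsilon that an empirical measurement such as the one in the paper would be compared against.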

Federated Boosted Decision Trees with Differential Privacy

Oct 06, 2022

Samuel Maddock, Graham Cormode, Tianhao Wang, Carsten Maple, Somesh Jha

Abstract: There is great demand for scalable, secure, and efficient privacy-preserving machine learning models that can be trained over distributed data. While deep learning models typically achieve the best results in a centralized non-secure setting, different models can excel when privacy and communication constraints are imposed. In particular, tree-based approaches such as XGBoost have attracted much attention for their high performance and ease of use, and they often achieve state-of-the-art results on tabular data. Consequently, several recent works have focused on translating Gradient Boosted Decision Tree (GBDT) models like XGBoost into federated settings, via cryptographic mechanisms such as Homomorphic Encryption (HE) and Secure Multi-Party Computation (MPC). However, these do not always provide formal privacy guarantees, or consider the full range of hyperparameters and implementation settings. In this work, we implement the GBDT model under Differential Privacy (DP). We propose a general framework that captures and extends existing approaches for differentially private decision trees. Our framework of methods is tailored to the federated setting, and we show that with a careful choice of techniques it is possible to achieve very high utility while maintaining strong levels of privacy.

* Full version of a paper to appear at ACM CCS'22
