Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Graham Cormode

Synthetic Tabular Data: Methods, Attacks and Defenses

Jun 06, 2025

Graham Cormode, Samuel Maddock, Enayat Ullah, Shripad Gade

Abstract:Synthetic data is often positioned as a solution to replace sensitive fixed-size datasets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade, leveraging corresponding advances in machine learning and data analytics. In this survey, we cover the key developments and the main concepts in tabular synthetic data generation, including paradigms based on probabilistic graphical models and on deep learning. We provide background and motivation, before giving a technical deep-dive into the methodologies. We also address the limitations of synthetic data, by studying attacks that seek to retrieve information about the original sensitive data. Finally, we present extensions and open problems in this area.

* Survey paper for accepted lecture-style tutorial at ACM KDD 2025

Via

Access Paper or Ask Questions

Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation

Apr 15, 2025

Samuel Maddock, Shripad Gade, Graham Cormode, Will Bullock

Abstract:Differentially Private Synthetic Data Generation (DP-SDG) is a key enabler of private and secure tabular-data sharing, producing artificial data that carries through the underlying statistical properties of the input data. This typically involves adding carefully calibrated statistical noise to guarantee individual privacy, at the cost of synthetic data quality. Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data. These methods study a horizontal public-private partitioning which assumes access to a small number of public rows that can be used for model initialization, providing a small utility gain. However, realistic datasets often naturally consist of public and private attributes, making a vertical public-private partitioning relevant for practical synthetic data deployments. We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting. We compare this framework against our alternative approach that uses conditional generation, highlighting initial limitations of public-data assisted methods and proposing future research directions to address these challenges.

* Accepted to the Synthetic Data x Data Access Problem (SynthData) workshop @ ICLR 2025

Via

Access Paper or Ask Questions

Federated Analytics in Practice: Engineering for Privacy, Scalability and Practicality

Dec 03, 2024

Harish Srinivas, Graham Cormode, Mehrdad Honarkhah, Samuel Lurye, Jonathan Hehir, Lunwen He, George Hong, Ahmed Magdy, Dzmitry Huba, Kaikai Wang(+2 more)

Figure 1 for Federated Analytics in Practice: Engineering for Privacy, Scalability and Practicality

Figure 2 for Federated Analytics in Practice: Engineering for Privacy, Scalability and Practicality

Figure 3 for Federated Analytics in Practice: Engineering for Privacy, Scalability and Practicality

Figure 4 for Federated Analytics in Practice: Engineering for Privacy, Scalability and Practicality

Abstract:Cross-device Federated Analytics (FA) is a distributed computation paradigm designed to answer analytics queries about and derive insights from data held locally on users' devices. On-device computations combined with other privacy and security measures ensure that only minimal data is transmitted off-device, achieving a high standard of data protection. Despite FA's broad relevance, the applicability of existing FA systems is limited by compromised accuracy; lack of flexibility for data analytics; and an inability to scale effectively. In this paper, we describe our approach to combine privacy, scalability, and practicality to build and deploy a system that overcomes these limitations. Our FA system leverages trusted execution environments (TEEs) and optimizes the use of on-device computing resources to facilitate federated data processing across large fleets of devices, while ensuring robust, defensible, and verifiable privacy safeguards. We focus on federated analytics (statistics and monitoring), in contrast to systems for federated learning (ML workloads), and we flag the key differences.

Via

Access Paper or Ask Questions

Distributed, communication-efficient, and differentially private estimation of KL divergence

Nov 25, 2024

Mary Scott, Sayan Biswas, Graham Cormode, Carsten Maple

Abstract:A key task in managing distributed, sensitive data is to measure the extent to which a distribution changes. Understanding this drift can effectively support a variety of federated learning and analytics tasks. However, in many practical settings sharing such information can be undesirable (e.g., for privacy concerns) or infeasible (e.g., for high communication costs). In this work, we describe novel algorithmic approaches for estimating the KL divergence of data across federated models of computation, under differential privacy. We analyze their theoretical properties and present an empirical study of their performance. We explore parameter settings that optimize the accuracy of the algorithm catering to each of the settings; these provide sub-variations that are applicable to real-world tasks, addressing different context- and application-specific trust level requirements. Our experimental results confirm that our private estimators achieve accuracy comparable to a baseline algorithm without differential privacy guarantees.

* 28 pages, 5 figures

Via

Access Paper or Ask Questions

Towards Robust Federated Analytics via Differentially Private Measurements of Statistical Heterogeneity

Nov 07, 2024

Mary Scott, Graham Cormode, Carsten Maple

Abstract:Statistical heterogeneity is a measure of how skewed the samples of a dataset are. It is a common problem in the study of differential privacy that the usage of a statistically heterogeneous dataset results in a significant loss of accuracy. In federated scenarios, statistical heterogeneity is more likely to happen, and so the above problem is even more pressing. We explore the three most promising ways to measure statistical heterogeneity and give formulae for their accuracy, while simultaneously incorporating differential privacy. We find the optimum privacy parameters via an analytic mechanism, which incorporates root finding methods. We validate the main theorems and related hypotheses experimentally, and test the robustness of the analytic mechanism to different heterogeneity levels. The analytic mechanism in a distributed setting delivers superior accuracy to all combinations involving the classic mechanism and/or the centralized setting. All measures of statistical heterogeneity do not lose significant accuracy when a heterogeneous sample is used.

* 26 pages, 6 tables, 1 figure

Via

Access Paper or Ask Questions

FLAIM: AIM-based Synthetic Data Generation in the Federated Setting

Oct 05, 2023

Samuel Maddock, Graham Cormode, Carsten Maple

Abstract:Preserving individual privacy while enabling collaborative data sharing is crucial for organizations. Synthetic data generation is one solution, producing artificial data that mirrors the statistical properties of private data. While numerous techniques have been devised under differential privacy, they predominantly assume data is centralized. However, data is often distributed across multiple clients in a federated manner. In this work, we initiate the study of federated synthetic tabular data generation. Building upon a SOTA central method known as AIM, we present DistAIM and FLAIM. We show it is straightforward to distribute AIM, extending a recent approach based on secure multi-party computation which necessitates additional overhead, making it less suited to federated scenarios. We then demonstrate that naively federating AIM can lead to substantial degradation in utility under the presence of heterogeneity. To mitigate both issues, we propose an augmented FLAIM approach that maintains a private proxy of heterogeneity. We simulate our methods across a range of benchmark datasets under different degrees of heterogeneity and show this can improve utility while reducing overhead.

* 21 pages

Via

Access Paper or Ask Questions

Pruning Compact ConvNets for Efficient Inference

Jan 11, 2023

Sayan Ghosh, Karthik Prasad, Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Graham Cormode, Peter Vajda

Abstract:Neural network pruning is frequently used to compress over-parameterized networks by large amounts, while incurring only marginal drops in generalization performance. However, the impact of pruning on networks that have been highly optimized for efficient inference has not received the same level of attention. In this paper, we analyze the effect of pruning for computer vision, and study state-of-the-art ConvNets, such as the FBNetV3 family of models. We show that model pruning approaches can be used to further optimize networks trained through NAS (Neural Architecture Search). The resulting family of pruned models can consistently obtain better performance than existing FBNetV3 models at the same level of computation, and thus provide state-of-the-art results when trading off between computational complexity and generalization performance on the ImageNet benchmark. In addition to better generalization performance, we also demonstrate that when limited computation resources are available, pruning FBNetV3 models incur only a fraction of GPU-hours involved in running a full-scale NAS.

Via

Access Paper or Ask Questions

Federated Calibration and Evaluation of Binary Classifiers

Oct 22, 2022

Graham Cormode, Igor Markov

Abstract:We address two major obstacles to practical use of supervised classifiers on distributed private data. Whether a classifier was trained by a federation of cooperating clients or trained centrally out of distribution, (1) the output scores must be calibrated, and (2) performance metrics must be evaluated -- all without assembling labels in one place. In particular, we show how to perform calibration and compute precision, recall, accuracy and ROC-AUC in the federated setting under three privacy models (i) secure aggregation, (ii) distributed differential privacy, (iii) local differential privacy. Our theorems and experiments clarify tradeoffs between privacy, accuracy, and data efficiency. They also help decide whether a given application has sufficient data to support federated calibration and evaluation.

* 24 pages

Via

Access Paper or Ask Questions

Federated Boosted Decision Trees with Differential Privacy

Oct 06, 2022

Samuel Maddock, Graham Cormode, Tianhao Wang, Carsten Maple, Somesh Jha

Figure 1 for Federated Boosted Decision Trees with Differential Privacy

Figure 2 for Federated Boosted Decision Trees with Differential Privacy

Figure 3 for Federated Boosted Decision Trees with Differential Privacy

Figure 4 for Federated Boosted Decision Trees with Differential Privacy

Abstract:There is great demand for scalable, secure, and efficient privacy-preserving machine learning models that can be trained over distributed data. While deep learning models typically achieve the best results in a centralized non-secure setting, different models can excel when privacy and communication constraints are imposed. Instead, tree-based approaches such as XGBoost have attracted much attention for their high performance and ease of use; in particular, they often achieve state-of-the-art results on tabular data. Consequently, several recent works have focused on translating Gradient Boosted Decision Tree (GBDT) models like XGBoost into federated settings, via cryptographic mechanisms such as Homomorphic Encryption (HE) and Secure Multi-Party Computation (MPC). However, these do not always provide formal privacy guarantees, or consider the full range of hyperparameters and implementation settings. In this work, we implement the GBDT model under Differential Privacy (DP). We propose a general framework that captures and extends existing approaches for differentially private decision trees. Our framework of methods is tailored to the federated setting, and we show that with a careful choice of techniques it is possible to achieve very high utility while maintaining strong levels of privacy.

* Full version of a paper to appear at ACM CCS'22

Via

Access Paper or Ask Questions

Reconciling Security and Communication Efficiency in Federated Learning

Jul 26, 2022

Karthik Prasad, Sayan Ghosh, Graham Cormode, Ilya Mironov, Ashkan Yousefpour, Pierre Stock

Figure 1 for Reconciling Security and Communication Efficiency in Federated Learning

Figure 2 for Reconciling Security and Communication Efficiency in Federated Learning

Figure 3 for Reconciling Security and Communication Efficiency in Federated Learning

Figure 4 for Reconciling Security and Communication Efficiency in Federated Learning

Abstract:Cross-device Federated Learning is an increasingly popular machine learning setting to train a model by leveraging a large population of client devices with high privacy and security guarantees. However, communication efficiency remains a major bottleneck when scaling federated learning to production environments, particularly due to bandwidth constraints during uplink communication. In this paper, we formalize and address the problem of compressing client-to-server model updates under the Secure Aggregation primitive, a core component of Federated Learning pipelines that allows the server to aggregate the client updates without accessing them individually. In particular, we adapt standard scalar quantization and pruning methods to Secure Aggregation and propose Secure Indexing, a variant of Secure Aggregation that supports quantization for extreme compression. We establish state-of-the-art results on LEAF benchmarks in a secure Federated Learning setup with up to 40$\times$ compression in uplink communication with no meaningful loss in utility compared to uncompressed baselines.

Via

Access Paper or Ask Questions