Title: Pre-training Differentially Private Models with Limited Public Data

Feb 28, 2024

Zhiqi Bu, Xinwei Zhang, Mingyi Hong, Sheng Zha, George Karypis

Abstract: The superior performance of large foundation models relies on the use of massive amounts of high-quality data, which often contain sensitive, private and copyrighted material that requires formal protection. While differential privacy (DP) is a prominent method to gauge the degree of security provided to the models, its application is commonly limited to the model fine-tuning stage, due to the performance degradation when applying DP during the pre-training stage. Consequently, DP is not yet capable of protecting a substantial portion of the data used during the initial pre-training process. In this work, we first provide a theoretical understanding of the efficacy of DP training by analyzing the per-iteration loss improvement. We make a key observation that DP optimizers' performance degradation can be significantly mitigated by the use of limited public data, which leads to a novel DP continual pre-training strategy. Empirically, using only 10\% of public data, our strategy can achieve DP accuracy of 41.5\% on ImageNet-21k (with $\epsilon=8$), as well as non-DP accuracy of 55.7\% and 60.0\% on downstream tasks Places365 and iNaturalist-2021, respectively, on par with state-of-the-art standard pre-training and substantially outperforming existing DP pre-trained models.
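For readers unfamiliar with what a "DP optimizer" does in each iteration, the following is a minimal sketch of the standard DP-SGD gradient step (per-sample clipping followed by Gaussian noise). It is not the authors' implementation and it omits privacy accounting. The function name `privatize_gradient` and the parameters `clip_norm` and `noise_multiplier` are illustrative assumptions that do not appear in the abstract.

```python
import math
import random


def privatize_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Return a noisy averaged gradient in the style of standard DP-SGD.

    per_sample_grads: list of per-example gradients, each a list of floats.
    clip_norm: maximum L2 norm allowed for each per-example gradient (assumed name).
    noise_multiplier: Gaussian noise std as a multiple of clip_norm (assumed name).
    rng: a random.Random instance, passed in for reproducibility.
    """
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for grad in per_sample_grads:
        # Rescale each example's gradient so its L2 norm is at most clip_norm.
        norm = math.sqrt(sum(x * x for x in grad))
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        for i, x in enumerate(grad):
            summed[i] += x * scale
    # Add independent Gaussian noise to the clipped sum, then average.
    sigma = noise_multiplier * clip_norm
    batch_size = len(per_sample_grads)
    return [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(privatize_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping and noise are the two operations that separate this step from ordinary SGD, and they are the usual source of the performance degradation the abstract refers to. A gradient computed on public data would skip both.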
