Abstract:We propose TVineSynth, a vine copula based synthetic tabular data generator, which is designed to balance privacy and utility, using the vine tree structure and its truncation to do the trade-off. Contrary to synthetic data generators that achieve DP by globally adding noise, TVineSynth performs a controlled approximation of the estimated data generating distribution, so that it does not suffer from poor utility of the resulting synthetic data for downstream prediction tasks. TVineSynth introduces a targeted bias into the vine copula model that, combined with the specific tree structure of the vine, causes the model to zero out privacy-leaking dependencies while relying on those that are beneficial for utility. Privacy is here measured with membership (MIA) and attribute inference attacks (AIA). Further, we theoretically justify how the construction of TVineSynth ensures AIA privacy under a natural privacy measure for continuous sensitive attributes. When compared to competitor models, with and without DP, on simulated and on real-world data, TVineSynth achieves a superior privacy-utility balance.
Abstract:We propose a type of generalised additive models with of model components based on pair-copula constructions, with prediction as a main aim. The model components are designed such that our model may capture potentially complex interaction effects in the relationship between the response covariates. In addition, our model does not require discretisation of continuous covariates, and is therefore suitable for problems with many such covariates. Further, we have designed a fitting algorithm inspired by gradient boosting, as well as efficient procedures for model selection and evaluation of the model components, through constraints on the model space and approximations, that speed up time-costly computations. In addition to being absolutely necessary for our model to be a realistic alternative in higher dimensions, these techniques may also be useful as a basis for designing efficient models selection algorithms for other types of copula regression models. We have explored the characteristics of our method in a simulation study, in particular comparing it to natural alternatives, such as logic regression, classic boosting models and penalised logistic regression. We have also illustrated our approach on the Wisconsin breast cancer dataset and on the Boston housing dataset. The results show that our method has a prediction performance that is either better than or comparable to the other methods, even when the proportion of discrete covariates is high.