Inferentially Valid, Partially Synthetic Data: Generating from Posterior Predictive Distributions not Necessary

2012 · 37 citations · Journal Article

Authors

Satkartar K. Kinney · École Supérieure de Psychologie

Abstract

To limit the risks of disclosures when releasing public use data on individual records, statistical agencies and other data disseminators can release multiply-imputed, partially synthetic data (Little, 1993; Reiter, 2003). These comprise the units originally surveyed with some collected values, e.g. sensitive values at high risk of disclosure or values of quasi-identifiers, replaced with multiple imputations. Partially synthetic data can protect confidentiality, since identification of units and their sensitive data can be difficult when select values in the released data are not actual, collected values. And, with appropriate estimation methods based on the concepts of multiple imputation (Rubin, 1987), they enable data users to make valid inferences for a variety of estimands using standard, complete-data statistical methods and software. Because of these appealing features, partially synthetic data products have been developed for several major data sources in the U.S., including the Longitudinal Business Database (Kinney et al., 2011), the Survey of Income and Program Participation (Abowd et al., 2006), the American Community Survey group quarters data (Hawala, 2008), and the OnTheMap database of where people live and work (Machanavajjhala et al., 2008). Other examples of partially synthetic data are described in Abowd and Woodcock (2004), Little et al. (2004), Drechsler et al. (2008), and Drechsler and Reiter (2010). In the statistical theory underlying the generation of partially synthetic data, as well as typical implementations in practice, replacement values are sampled from posterior predictive distributions. That is, the agency repeatedly draws values of the model parameters from their posterior distributions, and generates a set of replacement values based on each parameter draw. 
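The two-step procedure just described (draw the parameters, then generate replacement values from each draw) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: a single normally distributed variable, the noninformative prior p(mu, sigma2) proportional to 1/sigma2, a randomly chosen set of records to replace, and the hypothetical names `synthesize`, `combine`, and `draw_params`. The combining step follows the partially synthetic rules of Reiter (2003): the point estimate is the average of the per-dataset estimates, and the variance estimate is the average within-dataset variance plus the between-dataset variance divided by m.

```python
import numpy as np


def synthesize(y, replace, m, rng, draw_params=True):
    """Return m copies of y in which the entries flagged by `replace` are simulated."""
    n = y.size
    ybar, s2 = y.mean(), y.var(ddof=1)
    copies = []
    for _ in range(m):
        if draw_params:
            # Posterior draw of (mu, sigma2) under the assumed prior 1/sigma2:
            # sigma2 | y is (n - 1) * s2 / chi-square(n - 1), mu | sigma2, y is normal.
            sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
            mu = rng.normal(ybar, np.sqrt(sigma2 / n))
        else:
            # No parameter draw: fix (mu, sigma2) at the maximum likelihood estimates.
            mu, sigma2 = ybar, y.var(ddof=0)
        y_syn = y.copy()
        y_syn[replace] = rng.normal(mu, np.sqrt(sigma2), size=int(replace.sum()))
        copies.append(y_syn)
    return copies


def combine(copies):
    """Combine estimates of the mean across the m synthetic datasets (Reiter, 2003)."""
    m, n = len(copies), copies[0].size
    q = np.array([c.mean() for c in copies])           # per-dataset point estimates
    u = np.array([c.var(ddof=1) / n for c in copies])  # per-dataset variance estimates
    q_bar = q.mean()
    b = q.var(ddof=1)                                  # between-dataset variance
    t_p = u.mean() + b / m
    return q_bar, t_p


# Hypothetical confidential data: 500 records, about 30% flagged for replacement.
rng = np.random.default_rng(0)
y = rng.normal(50.0, 10.0, size=500)
replace = rng.random(500) < 0.3

copies = synthesize(y, replace, m=5, rng=rng)
q_bar, t_p = combine(copies)
print(f"estimate of the mean: {q_bar:.2f}, variance estimate: {t_p:.4f}")
```

Records that are not flagged keep their collected values in every copy, which is what makes the data partially rather than fully synthetic. Passing `draw_params=False` skips the parameter draw and simulates from the model with the parameters fixed at their maximum likelihood estimates.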
The motivation for sampling from posterior predictive distributions derives from multiple imputation of missing data, in which drawing the parameters is necessary to enable approximately unbiased variance estimation (Rubin, 1987, Chapter 4). In this article, we argue that it is not necessary to draw parameters to enable valid inferences with partially synthetic data. Instead, data disseminators can estimate posterior modes or maximum likelihood estimates of parameters in synthesis models, and simulate replacement values after plugging those modes into the models. Using a simple but informative case, we show mathematically that point and variance estimates based on the plug-in

Topics & Keywords

Statistical Methods and Bayesian Inference · Survey Methodology and Nonresponse · Census and Population Estimation

UN Sustainable Development Goals

No poverty

Publication Details

Volume 28, Issue 4, pp. 583-590

Field-Weighted Citation Impact: 2.76