Journal of Machine Learning Research 26 (2025) 1-65. Submitted 10/23; Revised 12/24; Published 4/25.

On Consistent Bayesian Inference from Synthetic Data

Ossi Räisä (ossi.raisa@helsinki.fi), Joonas Jälkö (joonas.jalko@helsinki.fi), Antti Honkela (antti.honkela@helsinki.fi)
Department of Computer Science, University of Helsinki, P.O. Box 68 (Pietari Kalmin katu 5), 00014 University of Helsinki, Finland

Editor: Mohammad Emtiyaz Khan

Abstract

Generating synthetic data, with or without differential privacy, has attracted significant attention as a potential solution to the dilemma between making data easily available and protecting the privacy of data subjects. Several works have shown that consistency of downstream analyses from synthetic data, including accurate uncertainty estimation, requires accounting for the synthetic data generation. There are very few methods of doing so, most of them for frequentist analysis. In this paper, we study how to perform consistent Bayesian inference from synthetic data. We prove that mixing posterior samples obtained separately from multiple large synthetic data sets, sampled from a posterior predictive distribution, converges to the posterior of the downstream analysis under standard regularity conditions when the analyst's model is compatible with the data provider's model. We also present several examples showing how the theory works in practice, and how Bayesian inference can fail when the compatibility assumption is not met, or when the synthetic data set is not significantly larger than the original.

Keywords: synthetic data, Bayesian inference, Bernstein-von Mises theorem, differential privacy

1. Introduction

Synthetic data has the potential of opening privacy-sensitive data sets for widespread analysis. The idea is to train a generative model with real data, and release synthetic data that has been generated from the model.
The synthetic data does not contain records from real people, and ideally it preserves the population-level properties of the real data, making it useful for analysis. Synthetic data can remain vulnerable to modern privacy attacks (van Breugel et al., 2023b), but these can be provably mitigated with differential privacy (DP; Dwork et al., 2006b).

The most convenient and straightforward way for downstream analysts to analyse synthetic data is to use the same method that would be used with real data. However, ignoring the additional stochasticity arising from the synthetic data generation will yield biased results and overconfident uncertainty estimates (Raghunathan et al., 2003; Wilde et al., 2021; Räisä et al., 2023). This is especially problematic under DP, which requires adding extra noise, which will be ignored if the synthetic data is treated like real data. This problem creates the need for noise-aware analyses that account for the synthetic data generation.

(c) 2025 Ossi Räisä, Joonas Jälkö and Antti Honkela. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-1428.html.

For frequentist downstream analyses, it is possible to account for the synthetic data generation by generating and analysing multiple synthetic data sets (Raghunathan et al., 2003). Recent work has extended this to DP synthetic data (Räisä et al., 2023), which allows generating multiple synthetic data sets without compromising privacy. These methods reuse the analysis method for the real data, and only require simple combining rules to merge the results from the analyses on each synthetic data set, making them easy to apply.

For Bayesian downstream analyses, Wilde et al. (2021) have shown that the analyst can use additional samples of public real data to correct their analysis.
However, their method requires targeting a generalised notion of the posterior (Bissiri et al., 2016) and needs additional public data for calibration. Ghalebikesabi et al. (2022) propose a correction using importance sampling to avoid the need for public data, but they only prove convergence to a generalised posterior and do not clearly address the noise-awareness of the method.

The simple frequentist methods using multiple synthetic data sets were derived from methods in missing data imputation (Rubin, 1987), so our starting point is the method of Gelman et al. (2004, 2014) for Bayesian inference with missing data. They proposed inferring the downstream posterior by imputing multiple completed data sets, inferring the analysis posterior for each completed data set separately, and mixing the posteriors together. We investigate whether this method is also applicable to synthetic data, generated with or without DP by sampling a posterior predictive distribution. With DP synthetic data, we require that the posterior predictive accounts for the DP noise, i.e. it must be noise-aware (Bernstein and Sheldon, 2018; Räisä et al., 2023).

1.1 Contributions

1. We study inferring the downstream analysis posterior by generating multiple synthetic data sets from a posterior predictive distribution, inferring the analysis posterior for each synthetic data set as if it were the real data set, and mixing the posteriors together. We find two important conditions for consistent Bayesian inference with this method: synthetic data sets that are larger than the original one, and a notion of compatibility between the data provider's and analyst's models called congeniality (Meng, 1994).

2. We prove that when congeniality is met and the Bernstein-von Mises theorem applies, this method converges to the true posterior as the number of synthetic data sets and the size of the synthetic data sets grow.
Under stronger assumptions, we prove a convergence rate for this method in the synthetic data set size, which we expect to match the rate that usually applies in the Bernstein-von Mises theorem (Hipp and Michel, 1976). These results are presented in Section 3.

3. We evaluate this method with two examples in Sections 4 and 5: non-private univariate Gaussian mean or variance estimation, and DP Bayesian logistic regression. In the first example, we use the tractability of the model to derive further theoretical properties of the method, and in both examples, we verify that the method works in practice when the assumptions are met, and examine what can happen when they are not. Our code is available under an open-source license.¹

1.2 Related Work

Generating synthetic data to preserve privacy was, as far as we know, originally proposed by Liew et al. (1985). Rubin (1993) proposed accounting for the synthetic data generation in frequentist downstream analyses by adapting multiple imputation (Rubin, 1987), which involves generating multiple synthetic data sets, analysing each of them, and combining the results with so-called Rubin's rules (Raghunathan et al., 2003; Reiter, 2002). Recently, Räisä et al. (2023) have shown that multiple imputation also works for synthetic data generated under DP when the data generation algorithm is noise-aware.

More recently, van Breugel et al. (2023a) have studied uncertainty quantification for prediction tasks when training on synthetic data. They propose generating multiple synthetic data sets, like the multiple imputation line of work, and experimentally show that aggregating predictions from the multiple synthetic data sets improves generalisation performance and uncertainty quantification. They use a similar Bayesian framework as we do to justify using multiple synthetic data sets, but their theoretical study of the framework is very light.
In particular, they do not theoretically consider the effect of the synthetic data set size.

Wilde et al. (2021) study downstream Bayesian inference from DP synthetic data by considering the analyst's model to be misspecified, and targeting a generalised notion of the posterior (Bissiri et al., 2016) to deal with the misspecification, which makes their method more difficult to apply than standard Bayesian inference. They also assume that the analyst has additional public data available to calibrate their method. Ghalebikesabi et al. (2022) use importance sampling to correct for bias with DP synthetic data, and have Bayesian inference as an example application. However, they also target a generalised variant (Bissiri et al., 2016) of the posterior instead of the noise-aware posterior we target, and they do not evaluate uncertainty estimation, so the noise-awareness of their method is not clear.

We are not aware of any existing work adapting multiple imputation for Bayesian downstream analysis in the synthetic data setting. In the missing data setting without DP, where multiple imputation was originally developed (Rubin, 1987), Gelman et al. (2004, 2014) have proposed sampling the downstream posterior by mixing samples of the downstream posteriors from each of the multiple imputed data sets. We find that this is not sufficient in the synthetic data setting, and add one extra component: our synthetic data sets are larger than the original data set. We compare the two cases in more detail in Section 3.6, and in particular explain why large synthetic data sets are not needed in the missing data setting.

A line of work considers mixing posteriors from multiple bootstrap samples of the real data set (Bühlmann, 2014; Huggins and Miller, 2020, 2023) in an approach called Bayes Bag.
Bayes Bag is almost a special case of our approach, as bootstrapping is a form of generating synthetic data, but the standard bootstrap does not sample from a posterior predictive distribution.²

1. https://github.com/DPBayes/NAPSU-MQ-bayesian-downstream-experiments
2. The Bayesian bootstrap (Rubin, 1981) does sample from a posterior predictive, and is very close to the standard bootstrap.

Noise-aware DP Bayesian inference is critical for taking the DP noise in synthetic data into account, but only a few works address this even without synthetic data. Bernstein and Sheldon (2018) present an inference method for simple exponential family models. Their approach was extended to linear models (Bernstein and Sheldon, 2019) and generalised linear models (Kulkarni et al., 2021). Recently, Ju et al. (2022) developed an MCMC sampler that can sample the noise-aware posterior using a noisy summary statistic.

2. Background

In this section, we introduce the background needed for the rest of our work. We start with Bayesian inference and the Bernstein-von Mises theorem in Section 2.1, and then introduce differential privacy in Section 2.2 and noise-aware synthetic data generation in Section 2.3.

2.1 Bayesian Inference

Bayesian inference is a paradigm of statistical inference where the data analyst's uncertainty in a quantity Q after observing data X is represented using the posterior distribution $p(Q\mid X)$ (Gelman et al., 2014). The posterior is given by Bayes' rule:
$$p(Q\mid X) = \frac{p(X\mid Q)\,p(Q)}{\int p(X\mid Q')\,p(Q')\,\mathrm{d}Q'},$$
where $p(X\mid Q)$ is the likelihood of observing the data X for a given value of Q, and $p(Q)$ is the analyst's prior on Q. Computing the denominator is typically intractable, so analysts often use numerical methods to sample $p(Q\mid X)$ (Gelman et al., 2014). It turns out that in many typical settings, the prior's influence on the posterior vanishes when the data set X is large.
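As a concrete numerical illustration of this vanishing prior influence, the following self-contained sketch (our own toy example, not code from the paper) computes grid posteriors for a Bernoulli parameter under two different Beta priors, and checks that the total variation distance between them shrinks as the data set grows:

```python
import numpy as np

def grid_posterior(data, prior_pdf, grid):
    """Discretised posterior density for a Bernoulli parameter on a grid."""
    k, n = data.sum(), len(data)
    log_lik = k * np.log(grid) + (n - k) * np.log1p(-grid)
    post = np.exp(log_lik - log_lik.max()) * prior_pdf(grid)
    dx = grid[1] - grid[0]
    return post / (post.sum() * dx)  # normalise to integrate to 1

def tv_on_grid(p1, p2, grid):
    # TV distance between densities: (1/2) * integral of |p1 - p2|.
    dx = grid[1] - grid[0]
    return 0.5 * np.abs(p1 - p2).sum() * dx

rng = np.random.default_rng(0)
grid = np.linspace(1e-6, 1 - 1e-6, 10_000)
flat_prior = lambda q: np.ones_like(q)        # Beta(1, 1)
skewed_prior = lambda q: 12 * q**2 * (1 - q)  # Beta(3, 2)

tvs = []
for n in [10, 100, 1000]:
    data = rng.binomial(1, 0.3, size=n)
    tvs.append(tv_on_grid(grid_posterior(data, flat_prior, grid),
                          grid_posterior(data, skewed_prior, grid), grid))

# The two posteriors agree more and more closely as n grows,
# even though the priors differ.
assert tvs[0] > tvs[1] > tvs[2]
```

The shrinking gap is exactly the informal content of the Bernstein-von Mises theorem discussed next: for large n, the posterior is dominated by the likelihood.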
A basic example of this is the Bernstein-von Mises theorem (van der Vaart, 1998), which informally states that under regularity conditions, the posterior approaches a Gaussian that does not depend on the prior as the size of the data set increases. A crucial component of the theorem, and also of our theory, is the notion of total variation distance, which is used to measure the difference between two random variables or probability distributions.

Definition 1. The total variation distance between random variables (or distributions) $P_1$ and $P_2$ is
$$\mathrm{TV}(P_1, P_2) = \sup_{A} \left| \Pr(P_1 \in A) - \Pr(P_2 \in A) \right|,$$
where A is any measurable set.

As a slight abuse of notation, we allow the arguments of $\mathrm{TV}(\cdot, \cdot)$ to be random variables, probability distributions, or probability density functions interchangeably. We list some properties of total variation distance that we use in Lemma 18 in Appendix A.2. Now we can state the theorem.

Theorem 2 (Bernstein-von Mises, van der Vaart, 1998). Let n denote the size of the data set $X_n$. Under regularity conditions stated in Condition 21 in Appendix A.3, for true parameter value $Q_0$, the posterior $Q(X_n) \sim p(Q\mid X_n)$ satisfies
$$\mathrm{TV}\left(\sqrt{n}\,\big(Q(X_n) - Q_0\big),\; \mathcal{N}(\mu(X_n), \Sigma)\right) \xrightarrow{P} 0$$
as $n \to \infty$ for some $\mu(X_n)$ and $\Sigma$ that do not depend on the prior, where the convergence in probability is over sampling $X_n \sim p(X_n\mid Q_0)$.

2.2 Differential Privacy and Noise-Aware Synthetic Data

Differential privacy (DP) (Dwork et al., 2006b) quantifies the privacy loss from releasing the results of analysing data. The quantification is done by looking at the output distributions of the analysis algorithm for two data sets that differ in a single data subject (Dwork and Roth, 2014):

Definition 3. An algorithm $\mathcal{M}$ is $(\epsilon, \delta)$-DP if
$$\Pr(\mathcal{M}(X) \in S) \le e^{\epsilon} \Pr(\mathcal{M}(X') \in S) + \delta$$
for all measurable sets S and all data sets $X, X'$ that differ in one data subject.

The choice of $\epsilon$ and $\delta$ is a matter of policy (Dwork, 2008).
One should set $\delta \ll 1/n$ for n data points, as $\delta \geq 1/n$ permits mechanisms that clearly violate privacy (Dwork and Roth, 2014). A common primitive for making an algorithm DP is the Gaussian mechanism (Dwork et al., 2006a), which simply adds Gaussian noise to the output of a function:

Definition 4. The Gaussian mechanism with noise variance $\sigma^2_{DP}$ and function f outputs $f(X) + \mathcal{N}(0, \sigma^2_{DP} I)$ for input X.

For a given $(\epsilon, \delta)$-bound and function f, the required value of $\sigma^2_{DP}$ can be computed tightly using the analytical Gaussian mechanism (Balle and Wang, 2018).

2.3 Noise-Aware Private Synthetic Data

To solve the uncertainty estimation problem for frequentist analyses from DP synthetic data, Räisä et al. (2023) developed a noise-aware algorithm for generating synthetic data called NAPSU-MQ. NAPSU-MQ takes discrete data and summarises it with marginal queries, which count how many data points have each possible combination of values for given variables. See Appendix A.1 for a precise definition. NAPSU-MQ then releases the query values under DP with the Gaussian mechanism, and finally generates multiple synthetic data sets. The downstream analysis is done on each synthetic data set, and the results are combined using Rubin's rules for synthetic data (Raghunathan et al., 2003; Rubin, 1993), which use the multiple analysis results to account for the extra uncertainty coming from the synthetic data generation.

The synthetic data is generated by sampling the posterior predictive distribution
$$p(X^* \mid \tilde{s}) = \int p(X^* \mid \theta)\, p(\theta \mid \tilde{s})\, \mathrm{d}\theta, \tag{1}$$

Figure 1: Left: random variables in noise-aware uncertainty estimation from synthetic data:
- $\theta$: data generating model parameters
- $X$: real data
- $X^*$: hypothetical data
- $Z$: observed summary of X ($Z = X$ without DP)
- $X^{Syn}$: synthetic data, $X^{Syn} \sim p(X^* \mid Z, I_S)$
- $Q$: estimated quantity in downstream analysis
- $I_S$: synthetic data provider's background information
- $I_A$: analyst's background information
Right: a Bayesian network describing the dependencies of the random variables. This network is conditional on either $I_S$ or $I_A$, depending on whether it is viewed by the data provider or the analyst.

where $\theta$ denotes the parameters of the synthetic data generator and $\tilde{s}$ denotes the noisy marginal query values. The conditioning on $\tilde{s}$, with the Gaussian mechanism included in the model, is what makes NAPSU-MQ noise-aware, and allows Rubin's rules to accurately account for the synthetic data generation and DP noise in the downstream analysis.

As an alternative to NAPSU-MQ, one can use prior noise-aware Bayesian inference methods (Bernstein and Sheldon, 2018; Kulkarni et al., 2021) to generate synthetic data from a noise-aware posterior predictive distribution. However, these methods heavily restrict the possible models, so they are unlikely to be flexible enough to accurately model realistic data.

3. Bayesian Inference from Synthetic Data

When the downstream analysis is Bayesian, the analyst would ultimately want to obtain the posterior $p(Q\mid X, I_A)$ of some quantity Q given real data X, where $I_A$ denotes the background knowledge, such as priors, of the analyst. We assume the analyst has a method to sample $p(Q\mid X, I_A)$ if they had access to real data, and study what they can do when they only have access to synthetic data.

In order to introduce the synthetic data into the posterior of interest, we can decompose the posterior using the law of total probability as
$$p(Q\mid X, I_A) = \int p(Q\mid X, X^*, I_A)\, p(X^*\mid X, I_A)\, \mathrm{d}X^*. \tag{2}$$
We are using a standard abuse of notation here: inside the integral, $X^*$ is the variable to integrate over, while outside the integral, $X^*$ is a random variable. The decomposition in (2) means that we can sample from $p(Q\mid X, I_A)$ by first sampling synthetic data $X^{Syn}$ from the posterior predictive $p(X^*\mid X, I_A)$, and then sampling Q from $p(Q\mid X, X^* = X^{Syn}, I_A)$.
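The two-step sampling scheme above can be sketched in a toy conjugate Gaussian model. This is a hypothetical illustration of ours, not the paper's code: the `posterior` helper, the parameter choices, and the use of the mean as the downstream quantity Q are all our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: provider and analyst both model the data as N(mu, sigma_k^2)
# with known sigma_k and a N(0, 1) prior on mu; Q is the mean mu.
sigma_k, n = 1.0, 200
X = rng.normal(0.7, sigma_k, size=n)  # real data with true mean 0.7

def posterior(data, prior_mean=0.0, prior_var=1.0, lik_var=sigma_k**2):
    """Conjugate Gaussian posterior N(mean, var) over the mean."""
    var = 1.0 / (1.0 / prior_var + len(data) / lik_var)
    mean = var * (prior_mean / prior_var + data.sum() / lik_var)
    return mean, var

def sample_q_given_syn(n_syn, n_q):
    # Step 1: sample synthetic data from the posterior predictive p(X*|X):
    # draw mu ~ p(mu|X), then a synthetic data set of size n_syn.
    m, v = posterior(X)
    mu = rng.normal(m, np.sqrt(v))
    X_syn = rng.normal(mu, sigma_k, size=n_syn)
    # Step 2: sample Q from the downstream posterior, treating the
    # synthetic data as if it were real data.
    m_syn, v_syn = posterior(X_syn)
    return rng.normal(m_syn, np.sqrt(v_syn), size=n_q)

# One large synthetic data set, many downstream posterior samples.
q_samples = sample_q_given_syn(n_syn=20 * n, n_q=1000)
```

In Step 2 the sketch conditions only on the synthetic data rather than on both X and $X^{Syn}$; Section 3.2 shows this substitution is what requires the synthetic data sets to be large.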
To make using the decomposition in (2) possible, the synthetic data provider must generate the synthetic data by sampling the posterior predictive $X^{Syn} \sim p(X^*\mid X, I_A)$. The most common way to accomplish this is to model X with a parametric model with parameters $\theta$, take a sample $\theta_s \sim p(\theta\mid X, I_A)$ from the posterior of this model, and then sample $X^{Syn} \sim p(X^*\mid \theta = \theta_s, I_A)$.

Note that $X^*$ and $X^{Syn}$ are not the same random variable. In fact, their relationship is analogous to that of $\theta$ and $\theta_s$ above. The random variable $X^*$ represents a hypothetical real data set that could be obtained if more data were collected, as seen in Figure 1. The synthetic data set $X^{Syn}$ is a sample from the conditional distribution of $X^*$ given X, analogous to how $\theta_s$ is a sample from a conditional distribution of $\theta$. For this reason, $p(Q\mid X, X^*, I_A) \neq p(Q\mid X, I_A)$.

To make our notation less cluttered, from now on we use a standard abuse of notation and write $\theta \sim p(\theta\mid \cdot)$, suppressing the separate $\theta_s$ symbol when the meaning is clear. We will also use this convention with other random variables. However, due to the importance of $X^{Syn}$ and $X^*$ in this work, we will keep them separate when needed for clarity.

The parameterisations of Q and $\theta$ may be very different. For example, in the example of Section 5.3, Q consists of the coefficients of a logistic regression, while $\theta$ consists of the parameters of a Markov network (Räisä et al., 2023).

Under DP, the exact posterior $p(Q\mid X, I_A)$ is unobtainable, so we assume that X is only available through a noisy summary $\tilde{s}$ (Ju et al., 2022; Räisä et al., 2023), with the posterior $p(Q\mid \tilde{s}, I_A)$. It would be ideal for $p(Q\mid \tilde{s}, I_A)$ to be as close to $p(Q\mid X, I_A)$ as possible, but they cannot be equal due to the noise that must be added to $\tilde{s}$. For this reason, our goal will be consistent inference of $p(Q\mid \tilde{s}, I_A)$, which must account for the noise that is added to $\tilde{s}$.
We can write the same decomposition as in (2) for $p(Q\mid \tilde{s}, I_A)$:
$$p(Q\mid \tilde{s}, I_A) = \int p(Q\mid \tilde{s}, X^*, I_A)\, p(X^*\mid \tilde{s}, I_A)\, \mathrm{d}X^*, \tag{3}$$
and use synthetic data in the same way to sample $p(Q\mid \tilde{s}, I_A)$. Note that the synthetic data will need to be sampled from the posterior predictive $p(X^*\mid \tilde{s}, I_A)$, which will need to account for the noise that is added to $\tilde{s}$, i.e. it needs to be noise-aware. This can be accomplished with the methods introduced in Section 2.3: NAPSU-MQ (Räisä et al., 2023) or more restricted methods (Bernstein and Sheldon, 2018; Kulkarni et al., 2021).

Since the only difference between the non-DP and DP cases is the symbol for the observed values, we unify the notations by using Z to denote the observed values, so $Z = X$ in the non-DP case and $Z = \tilde{s}$ in the DP case, and the posterior of interest is $p(Q\mid Z, I_A)$. The decomposition of interest is then
$$p(Q\mid Z, I_A) = \int p(Q\mid Z, X^*, I_A)\, p(X^*\mid Z, I_A)\, \mathrm{d}X^*. \tag{4}$$
We summarise these random variables and their dependencies in Figure 1.

There are still two major issues with the decomposition in (4):

1. Sampling $p(Q\mid Z, X^*, I_A)$ requires access to Z, which defeats the purpose of using synthetic data.

2. The synthetic data needs to be sampled conditionally on the analyst's background information $I_A$, while the synthetic data provider could have different background information $I_S$.

To solve the first issue, in Section 3.2 we show that if we replace $p(Q\mid Z, X^*, I_A)$ inside the integral of (4) with $p(Q\mid X^*, I_A)$, the resulting distribution converges to the desired posterior,
$$\int p(Q\mid X^*, I_A)\, p(X^*\mid Z, I_A)\, \mathrm{d}X^* \to p(Q\mid Z, I_A), \tag{5}$$
in total variation distance as the size of $X^*$ grows. It should be noted that many synthetic data sets $X^{Syn} \sim p(X^*\mid Z, I_A)$ will be needed to account for the integral over $X^*$.

The second issue is known as congeniality in the multiple imputation literature (Meng, 1994; Xie and Meng, 2016).
We look at congeniality in the context of Bayesian inference from synthetic data in Section 3.1, and find that we can obtain $p(Q\mid Z, I_A)$ under appropriate assumptions on the relationship between $I_A$ and $I_S$.

Exactly sampling the LHS of (5) requires generating a synthetic data set for each sample of $p(Q\mid Z, I_A)$, which is not practical. However, we can perform a Monte Carlo approximation of $p(Q\mid Z, I_A)$ by independently generating m synthetic data sets $X^{Syn}_1, \ldots, X^{Syn}_m \sim p(X^*\mid Z, I_A)$, drawing multiple samples from each of the $p(Q\mid X^* = X^{Syn}_i, I_A)$, and mixing these samples, which allows us to obtain more than one sample of $p(Q\mid Z, I_A)$ per synthetic data set. We look at some properties of this approximation in Section 3.4, but we use the integral form in (5) in the rest of our theory.

3.1 Congeniality

In the decomposition (4) of the analyst's posterior, we need to sample the conditional distribution of $X^*$ while conditioning on the analyst's background information $I_A$, while in reality the synthetic data provider could have different background information $I_S$. A similar distinction has been studied in the context of missing data (Meng, 1994; Xie and Meng, 2016), where the imputer of missing data has a similar role as the synthetic data provider. Meng (1994) found that Rubin's rules implicitly assume that the probability models of both parties are compatible in a certain sense, which Meng (1994) defined as congeniality.

As our examples with Gaussian distributions in Section 4 show, some notion of congeniality is also required in our setting. However, because we study synthetic data instead of imputation, and Bayesian instead of frequentist downstream analysis, we need a different formal definition. As the analyst only makes inferences on Q, it suffices that both the analyst and the synthetic data provider make the same inferences on Q:

Definition 5. The background information sets $I_S$ and $I_A$ are congenial for observation Z if
$$p(Q\mid X^*, I_S) = p(Q\mid X^*, I_A) \tag{6}$$
for all $X^*$, and
$$p(Q\mid Z, I_S) = p(Q\mid Z, I_A). \tag{7}$$

In the non-DP case, (7) is redundant, as it is implied by (6), but in the DP case, both are needed, as the parties may draw different conclusions on X given $Z = \tilde{s}$. Combining congeniality and (5),
$$\int p(Q\mid X^*, I_A)\, p(X^*\mid Z, I_S)\, \mathrm{d}X^* = \int p(Q\mid X^*, I_S)\, p(X^*\mid Z, I_S)\, \mathrm{d}X^* \to p(Q\mid Z, I_S) = p(Q\mid Z, I_A), \tag{8}$$
where the convergence is in total variation distance as the size of $X^*$ grows.

Congeniality is a strict condition, since it requires both parties to make exactly the same inference on Q. We will also study the following relaxation, where the parties only need to make the same inferences asymptotically. In the definition, we consider a countably infinite population $X^*_\infty$. We use $X^*_{n_{X^*}} \subset X^*_\infty$ to denote the first $n_{X^*}$ members of the population. We also use this notation with $Z_\infty$ and $Z_{n_Z}$, which denote the observed, potentially noisy, value computed from either the infinite population or its first $n_Z$ members.

Definition 6. The background information sets $I_S$ and $I_A$ are asymptotically congenial for population $Z_\infty$ if, for all populations $X^*_\infty$,
$$\mathrm{TV}\big(p(Q\mid X^*_{n_{X^*}}, I_S),\; p(Q\mid X^*_{n_{X^*}}, I_A)\big) \to 0 \tag{9}$$
as $n_{X^*} \to \infty$, and
$$\mathrm{TV}\big(p(Q\mid Z_{n_Z}, I_S),\; p(Q\mid Z_{n_Z}, I_A)\big) \to 0 \tag{10}$$
as $n_Z \to \infty$.

If both parties make reasonable assumptions, they should arrive at the same posterior of Q with an infinite population, since the infinite population determines the value of Q. They should also arrive at the same posterior if given Z from an infinite population, even with added noise, since the process of adding the noise is known to both parties, and the signal-to-noise ratio grows to infinity when increasing n in the DP applications we are interested in. As a result, we can expect asymptotic congeniality to hold in most practical cases with reasonable parties.

3.2 Consistency Proof

To recap, we want to prove that the posterior from synthetic data,
$$\bar{p}_n(Q) = \int p(Q\mid X^*_n)\, p(X^*_n\mid Z)\, \mathrm{d}X^*_n, \tag{11}$$
converges in total variation distance to $p(Q\mid Z)$ as the size n of $X^*_n$ grows.
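The mixture $\bar{p}_n(Q)$ and its Monte Carlo approximation with m synthetic data sets can be sketched end-to-end in a toy conjugate Gaussian model. This is our own illustrative code, not the paper's: the model, helper names, and parameter choices are all assumptions. It highlights why the mixture matters: samples from a single synthetic data set are overconfident, while mixing over many synthetic data sets restores the synthetic-data uncertainty.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting: data ~ N(mu, 1), prior mu ~ N(0, 1); Q is the mean mu.
X = rng.normal(0.5, 1.0, size=100)  # real data

def posterior_params(data):
    """Conjugate Gaussian posterior N(mean, var) over mu."""
    var = 1.0 / (1.0 + len(data))
    mean = var * data.sum()
    return mean, var

def mixture_of_posteriors(m, n_syn, samples_per_set):
    """Mix posterior samples from m synthetic data sets, each drawn
    from the posterior predictive of the provider's model."""
    q_samples = []
    post_mean, post_var = posterior_params(X)
    for _ in range(m):
        mu = rng.normal(post_mean, np.sqrt(post_var))   # theta ~ p(theta|X)
        X_syn = rng.normal(mu, 1.0, size=n_syn)         # X_syn ~ p(X*|theta)
        m_i, v_i = posterior_params(X_syn)              # downstream posterior
        q_samples.append(rng.normal(m_i, np.sqrt(v_i), size=samples_per_set))
    return np.concatenate(q_samples)

# Mixing over many large synthetic data sets yields a wider posterior than
# naively using a single synthetic data set, which is overconfident.
mixed = mixture_of_posteriors(m=200, n_syn=2000, samples_per_set=50)
single = mixture_of_posteriors(m=1, n_syn=2000, samples_per_set=10_000)
assert mixed.std() > single.std()
```
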
We prove this in Theorem 9, which requires that both $p(Q\mid Z, X^*_n)$ and $p(Q\mid X^*_n)$ approach the same distribution as n grows. We formally state this in Condition 7. In Lemma 8, we show that Condition 7 is a consequence of the Bernstein-von Mises theorem (Theorem 2) under some additional assumptions, so we expect it to hold in typical settings. To make the notation more compact, let $Q^+_n \sim p(Q\mid Z, X^*_n)$ and $Q_n \sim p(Q\mid X^*_n)$.

Condition 7. For the observed Z and all Q, there exist random variables $D_n$ such that
$$\mathrm{TV}(Q^+_n, D_n) \xrightarrow{P} 0 \quad \text{and} \quad \mathrm{TV}(Q_n, D_n) \xrightarrow{P} 0$$
as $n \to \infty$, where the convergence in probability is over sampling $X^*_n \sim p(X^*_n\mid Z, Q)$.

Theorem 2 implies Condition 7 with some additional assumptions:

Lemma 8. Let the assumptions of Theorem 2 (stated in Condition 21) hold for the downstream analysis for all $Q_0$, and let the following assumptions hold: (1) Z and $X^*$ are conditionally independent given Q; and (2) $p(Z\mid Q) > 0$ for all Q. Then Condition 7 holds.

Proof The full proof is in Appendix B.1. Proof idea: when Z and $X^*$ are conditionally independent given Q,
$$p(Q\mid Z, X^*) \propto p(X^*\mid Q)\, p(Z\mid Q)\, p(Q),$$
so $p(Q\mid Z, X^*)$ can equivalently be seen as the result of Bayesian inference with observed data $X^*$ and prior $p(Q\mid Z)$. As the only difference to $p(Q\mid X^*)$ is the prior, the Bernstein-von Mises theorem implies that both $p(Q\mid Z, X^*)$ and $p(Q\mid X^*)$ converge in total variation distance to the same distribution.

Assumption (1) of Lemma 8 will hold if the downstream analysis treats its input data as an i.i.d. sample from some distribution. Note that this i.i.d. assumption is on the real data, so using a DP mechanism with correlated noise does not affect it. Assumption (2) holds when the likelihood is always positive, and in the DP case when every possible output has positive probability regardless of the private data, which is the case for common DP mechanisms like the Gaussian and Laplace mechanisms (Dwork and Roth, 2014).

Next is the main theorem of this work: (5) holds under Condition 7.

Theorem 9.
Under congeniality and Condition 7,
$$\mathrm{TV}\big(p(Q\mid Z),\, \bar{p}_n(Q)\big) \to 0$$
as $n \to \infty$.

Proof The full proof is in Appendix B.1. Proof idea: the proof consists of three steps. The first two are in Lemma 22 and the third is in Lemma 23 in the Appendix. The first step is showing that $\mathrm{TV}(Q_n, Q^+_n) \xrightarrow{P} 0$ when $X^*_n \sim p(X^*_n\mid Z, Q)$ for fixed Z and Q. This is a simple consequence of the triangle inequality and Condition 7, as total variation distance is a metric. In the second step, we show that $\mathrm{TV}(Q_n, Q^+_n) \xrightarrow{P} 0$ also holds when $X^*_n \sim p(X^*_n\mid Z)$. In the final step, we show that this implies the claim.

3.3 Convergence Rate

Under a stronger regularity condition, we can obtain a convergence rate for Theorem 9. The regularity condition depends on uniform integrability:

Definition 10. A sequence of random variables $X_n$ is uniformly integrable if
$$\lim_{M\to\infty} \sup_n \mathbb{E}\big(|X_n|\, \mathbb{1}_{|X_n| > M}\big) = 0.$$

Now we can state the regularity condition for a convergence rate $O(R_n)$:

Condition 11. For the observed Z, there exist random variables $D_n$ such that for a sequence $R_1, R_2, \ldots > 0$ with $R_n \to 0$ as $n \to \infty$,
$$\frac{1}{R_n} \mathrm{TV}(Q^+_n, D_n) \quad \text{and} \quad \frac{1}{R_n} \mathrm{TV}(Q_n, D_n)$$
are uniformly integrable when $X^*_n \sim p(X^*_n\mid Z)$.

Note that $X^*_n \sim p(X^*_n\mid Z)$ conditions only on Z, not on both Q and Z as in Condition 7. Condition 11 is not a standard regularity condition, so it is not clear in which settings it applies. To ensure that it at least applies in some setting, we show that it is met in multivariate Gaussian mean estimation with $R_n = \frac{1}{\sqrt{n}}$. This is the rate that commonly appears in the Bernstein-von Mises theorem (Hipp and Michel, 1976).

Theorem 12. When the up- and downstream models are d-dimensional Gaussian mean estimations with known covariance $\Sigma_k$, and $D_n \sim \mathcal{N}(\bar{X}^*_n, n^{-1}\Sigma_k)$, the sequences $\sqrt{n}\, \mathrm{TV}(Q^+_n, D_n)$ and $\sqrt{n}\, \mathrm{TV}(Q_n, D_n)$ are uniformly integrable when $X^*_n \sim p(X^*_n\mid X)$.

Proof The idea of the proof is to use Pinsker's inequality (Lemma 18) to upper bound the total variation distance with the KL divergence, and to prove the required uniform integrability for the KL divergence upper bound.
This is a fairly lengthy exercise in upper bounding and computing the limit in the definition of uniform integrability for the various terms that appear in the KL divergence between the two Gaussians in question. We defer the full proof to Appendix B.2.

Condition 11 implies an $O(R_n)$ convergence rate:

Theorem 13. Under congeniality and Condition 11,
$$\mathrm{TV}\big(p(Q\mid Z),\, \bar{p}_n(Q)\big) = O(R_n).$$

Proof The full proof is in Appendix B.2. Proof idea: first, we prove the uniform integrability of $\frac{1}{R_n}\mathrm{TV}(Q_n, Q^+_n)$ when $X^*_n \sim p(X^*_n\mid Z)$ by using the triangle inequality and properties of uniform integrability. Second, we prove that this implies the claimed convergence rate.

3.4 Finite Number of Synthetic Data Sets

We have now shown that the mixture of posteriors
$$\bar{p}_n(Q) = \int p(Q\mid X^*_n)\, p(X^*_n\mid Z)\, \mathrm{d}X^*_n$$
converges to the target posterior $p(Q\mid Z)$ as n grows. However, sampling $\bar{p}_n(Q)$ exactly requires one synthetic data set per sample, which is not practical in realistic settings. We can further approximate by generating a fixed number m of synthetic data sets and using a Monte Carlo approximation of the integral:
$$p(Q\mid Z) \approx \int p(Q\mid X^*_n)\, p(X^*_n\mid Z)\, \mathrm{d}X^*_n \approx \frac{1}{m} \sum_{i=1}^m p(Q\mid X^*_n = X^{Syn}_{i,n}), \tag{12}$$
with $X^{Syn}_{i,n} \sim p(X^*_n\mid Z)$. In the following, we shorten $p(Q\mid X^*_n = X^{Syn}_{i,n}) = p(Q\mid X^{Syn}_{i,n})$.

Total variation distance is a metric, so
$$\mathrm{TV}\Big(\frac{1}{m}\sum_{i=1}^m p(Q\mid X^{Syn}_{i,n}),\; p(Q\mid Z)\Big) \le \mathrm{TV}\Big(\frac{1}{m}\sum_{i=1}^m p(Q\mid X^{Syn}_{i,n}),\; \bar{p}_n(Q)\Big) + \mathrm{TV}\big(\bar{p}_n(Q),\, p(Q\mid Z)\big).$$
Theorem 9 gives $\lim_{n\to\infty} \mathrm{TV}(\bar{p}_n(Q), p(Q\mid Z)) = 0$. If, for all n, $p(Q\mid X^*_n)$ is continuous for all $X^*_n$, $p(Q\mid X^*_n) \le h_n(X^*_n)$ for an integrable function $h_n(X^*_n)$, and the parameter set $\mathcal{Q} \subseteq \mathbb{R}^d$ is compact, the uniform law of large numbers (Jennrich, 1969, Theorem 2) gives
$$\sup_{Q} \Big|\frac{1}{m}\sum_{i=1}^m p(Q\mid X^{Syn}_{i,n}) - \bar{p}_n(Q)\Big| \to 0 \tag{13}$$
almost surely as $m \to \infty$. If $\mathcal{Q} = \mathbb{R}^d$, we can represent $\mathcal{Q}$ as a countable union of compact sets $\mathcal{Q}_k$, apply the uniform law of large numbers on each $\mathcal{Q}_k$, and use the union bound to obtain (13) without the supremum for all $Q \in \mathcal{Q}$. A similar decomposition of $\mathcal{Q}$ can be done for many other constrained parameter sets encountered in practice.
This pointwise convergence implies (van der Vaart, 1998, Corollary 2.30)
$$\lim_{m\to\infty} \mathrm{TV}\Big(\frac{1}{m}\sum_{i=1}^m p(Q\mid X^{Syn}_{i,n}),\; \bar{p}_n(Q)\Big) = 0$$
for almost all $X^{Syn}_{i,n}$, so
$$\lim_{n\to\infty} \lim_{m\to\infty} \mathrm{TV}\Big(\frac{1}{m}\sum_{i=1}^m p(Q\mid X^{Syn}_{i,n}),\; p(Q\mid Z)\Big) = 0$$
almost surely when $X^{Syn}_{i,n} \sim p(X^*_n\mid Z)$.

In Figure 3, we see that the mixture of posteriors becomes very spiky with small m and large n, and does not fully converge to the target posterior when n grows but m is kept small. This suggests that
$$\lim_{m\to\infty} \lim_{n\to\infty} \mathrm{TV}\Big(\frac{1}{m}\sum_{i=1}^m p(Q\mid X^{Syn}_{i,n}),\; p(Q\mid Z)\Big) \neq 0,$$
because the distributions $p(Q\mid X^{Syn}_{i,n})$ become narrower as n increases, so a fixed number of them is not enough to cover $p(Q\mid Z)$. This means that in practice, the number of synthetic data sets should be increased along with the size of the synthetic data sets.

3.5 Convergence with Asymptotic Congeniality

Recall that with asymptotic congeniality, we consider an infinite population $X_\infty$, and we observe a value $Z_{n_Z}$ computed from the first $n_Z$ elements of $X_\infty$. We also consider hypothetical infinite populations $X^*_\infty$, and their finite samples $X^*_{n_{X^*}} \subset X^*_\infty$ of size $n_{X^*}$. In this setting, we obtain an analogue of Theorem 9, with the difference that the size of the real data $n_Z$ also needs to approach infinity.

Theorem 14. If asymptotic congeniality holds, and Condition 7 holds for all $Z_{n_Z}$ with the probabilities of Condition 7 taken conditional on $I_S$, then
$$\lim_{n_Z\to\infty} \lim_{n_{X^*}\to\infty} \mathrm{TV}\big(p(Q\mid Z_{n_Z}, I_A),\; \bar{p}_{n_{X^*}}(Q)\big) = 0. \tag{16}$$

Proof The full proof is in Appendix B.3. The idea is similar to the proof of Theorem 9, but we need to use the asymptotic congeniality assumption to show that the difference between the posteriors with $I_A$ and $I_S$ approaches zero in the limit.

3.6 Relation to Missing Data Imputation

Combining inferences by mixing posteriors from multiple data sets in the style of (4) was originally proposed by Gelman et al. (2004, 2014) for Bayesian inference with missing data, with completed data sets corresponding to the synthetic data sets of our setting.
Large completed data sets are not required in the missing data setting. Next, we explain where this difference between the two settings arises. In the missing data setting, only a part $X_{obs}$ of the complete data set $X$ is observed, while a part $X_{mis}$ is missing (Rubin, 1987). To facilitate downstream analysis, the missing data are imputed by sampling $X_{mis} \sim p(X_{mis}|X_{obs}, I_I)$. Analogously with synthetic data, $I_I$ represents the imputer's background knowledge. Like with synthetic data, we have the decomposition (Gelman et al., 2014)
$$p(Q|X_{obs}, I_A) = \int p(Q|X_{obs}, X_{mis}, I_A)\, p(X_{mis}|X_{obs}, I_A)\, \mathrm{d}X_{mis}. \qquad (17)$$
If the analyst's and imputer's models are congenial in the sense that $p(Q|X_{obs}, I_A) = p(Q|X_{obs}, I_I)$ and $p(Q|X, I_A) = p(Q|X, I_I)$ for any complete data set $X$, then
$$p(Q|X_{obs}, I_A) = p(Q|X_{obs}, I_I) = \int p(Q|X_{obs}, X_{mis}, I_I)\, p(X_{mis}|X_{obs}, I_I)\, \mathrm{d}X_{mis} = \int p(Q|X_{obs}, X_{mis}, I_A)\, p(X_{mis}|X_{obs}, I_I)\, \mathrm{d}X_{mis}, \qquad (18)$$
so sampling $p(Q|X_{obs}, I_A)$ can be done by sampling $X_{mis} \sim p(X_{mis}|X_{obs}, I_I)$ multiple times, sampling $p(Q|X_{obs}, X_{mis}, I_A)$ for each $X_{mis}$, and combining the samples. Unlike with synthetic data, where sampling $p(Q|X, X^*, I_A)$ would require the original data and defeat the purpose of using synthetic data, sampling $p(Q|X_{obs}, X_{mis}, I_A)$ is simply the analysis for a complete data set, so generating large imputed data sets is not required.

Table 1: Notation for the known variance model in Section 4.1 (data provider / analyst): known variance $\bar\sigma^2_k$ / $\hat\sigma^2_k$; prior mean $\bar\mu_0$ / $\hat\mu_0$; prior variance $\bar\sigma^2_0$ / $\hat\sigma^2_0$; data $X$ / $X^*$; single datapoint $x$ / $x^*$; data size $n_X$ / $n_{X^*}$; data average $\bar{X}$ / $\bar{X}^*$; posterior mean $\bar\mu_{n_X}$ / $\hat\mu_{n_{X^*}}$; posterior variance $\bar\sigma^2_{n_X}$ / $\hat\sigma^2_{n_{X^*}}$; posterior sample $\bar\mu$ / $\hat\mu$; mixture of posteriors sample $\mu^*$.

4. Non-private Gaussian Examples

In this section, we look at the Bayesian inference of a univariate Gaussian mean or variance from mixing the posteriors from multiple synthetic data sets, which are also generated from the same model.
This allows us to analytically examine various aspects of Bayesian inference from multiple synthetic data sets which are not visible in our high-level theory in Section 3, as all of the posteriors are analytically tractable and relatively simple. In particular, we find that the synthetic data set needs to be larger than the original data set (Section 4.3) and find a variance correction formula for the Gaussian setting that gets around this requirement (Section 4.4). We also look at two forms of uncongeniality, and find that they cause very different effects: estimating the mean with an incorrect known variance converges to the data provider's posterior, but estimating the variance with an incorrect known mean does not converge to either party's posterior. For reference, we list the posteriors for these Gaussian settings in Appendix A.4.

4.1 Gaussian Mean Estimation with Known Variance

Our first example is very simple: $x_i \sim N(\mu, \sigma^2)$, $X = (x_1, \dots, x_{n_X})$, and the analyst infers the mean $\mu$ of a univariate Gaussian distribution with known variance from synthetic data that has been generated from the same model. To differentiate the variables for the analyst and the data provider, we use bars for the data provider (like $\bar\sigma^2_0$) and hats for the analyst (like $\hat\sigma^2_0$). Table 1 summarises the notation used in this section.

When the synthetic data is generated from the model with known variance $\bar\sigma^2_k$, we sample from the posterior predictive $p(X^*|X)$ as
$$\bar\mu|X \sim N(\bar\mu_{n_X}, \bar\sigma^2_{n_X}), \quad X^*|\bar\mu \sim N^{n_{X^*}}(\bar\mu, \bar\sigma^2_k),$$
$$\bar\mu_{n_X} = \frac{\frac{1}{\bar\sigma^2_0}\bar\mu_0 + \frac{n_X}{\bar\sigma^2_k}\bar{X}}{\frac{1}{\bar\sigma^2_0} + \frac{n_X}{\bar\sigma^2_k}}, \quad \frac{1}{\bar\sigma^2_{n_X}} = \frac{1}{\bar\sigma^2_0} + \frac{n_X}{\bar\sigma^2_k}.$$
Here $N^{n_{X^*}}$ denotes a Gaussian distribution over $n_{X^*}$ i.i.d. samples and $\bar{X}$ is the mean of $X$. When the downstream analysis is the model with known variance $\hat\sigma^2_k$, we have
$$\hat\mu|X^* \sim N(\hat\mu_{n_{X^*}}, \hat\sigma^2_{n_{X^*}}), \quad \hat\mu_{n_{X^*}} = \frac{\frac{1}{\hat\sigma^2_0}\hat\mu_0 + \frac{n_{X^*}}{\hat\sigma^2_k}\bar{X}^*}{\frac{1}{\hat\sigma^2_0} + \frac{n_{X^*}}{\hat\sigma^2_k}}, \quad \frac{1}{\hat\sigma^2_{n_{X^*}}} = \frac{1}{\hat\sigma^2_0} + \frac{n_{X^*}}{\hat\sigma^2_k}.$$
The following proposition gives the behavior of the mixture of posteriors from synthetic data $p_n(\mu)$ from (11) in the $n_{X^*} \to \infty$ limit.
Proposition 15. If $\mu^*$ is a sample from $p_n(\mu)$ in Gaussian mean estimation with known variance, $\mu^*$ has a Gaussian distribution, and as $n_{X^*} \to \infty$,
$$E_{\mu^*}(\mu^*) \to \bar\mu_{n_X}, \quad \mathrm{Var}_{\mu^*}(\mu^*) \to \bar\sigma^2_{n_X}. \qquad (19)$$

Proof See Appendix B.4.

This means that $p(\mu^*) \to_d p(\mu|X, I_S)$. This holds regardless of congeniality, which in this setting corresponds to both parties having equal known variances and priors ($\bar\sigma^2_k = \hat\sigma^2_k$, $\bar\mu_0 = \hat\mu_0$, $\bar\sigma^2_0 = \hat\sigma^2_0$). If the parties had equal known variances, but different priors, we would have asymptotic congeniality instead, and if they had different known variances, we would not have even asymptotic congeniality. These follow from upper and lower bounds on the total variation distance between one-dimensional Gaussian distributions (Devroye et al., 2022, Theorem 1.3).

We test the theory with a numerical simulation in Figure 2. We generated the real data $X$ of size $n_X = 100$ by i.i.d. sampling from $N(1, 4)$. Both the analyst and the data provider use $N(0, 10^2)$ as the prior. The data provider uses the correct known variance ($\bar\sigma^2_k = 4$), and the analyst either uses the correct known variance ($\hat\sigma^2_k = 4$), or a too small known variance ($\hat\sigma^2_k = 1$), which is an example of uncongeniality. In the congenial case in the left panel of Figure 2, both parties have the same posterior given the real data $X$, and the mixture of posteriors from synthetic data is very close to that. In the uncongenial case in the right panel, where the analyst underestimates the variance, the parties have different posteriors given $X$, but the mixture of synthetic data posteriors is still close to the data provider's posterior.

In Figure 3, we examine the convergence of the mixture of posteriors from synthetic data under congeniality. We see that setting $n_{X^*} = n_X$ is not enough, as the mixture of posteriors is significantly wider than the analyst's posterior for all values of $m$.
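Proposition 15 is easy to check numerically. The sketch below is our own illustration, using setup values matching the simulation described above (real data from $N(1,4)$ with $n_X = 100$, prior $N(0, 10^2)$, known variance 4 for both parties): it mixes downstream posteriors from many large synthetic data sets and compares the mixture's moments to the data provider's posterior $N(\bar\mu_{n_X}, \bar\sigma^2_{n_X})$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup: X ~ N(1, 4) with n_X = 100, prior N(0, 10^2) with mean 0,
# known variance 4 for both parties (the congenial case).
n_X, sk2, s02 = 100, 4.0, 100.0
X = rng.normal(1.0, 2.0, size=n_X)

def known_var_posterior(xbar, n):
    prec = 1 / s02 + n / sk2          # conjugate known-variance update
    return n * xbar / sk2 / prec, 1 / prec

mu_n, s2_n = known_var_posterior(X.mean(), n_X)   # data provider's posterior

m, n_syn = 300, 100 * n_X                          # n_X* >> n_X
mix = []
for _ in range(m):
    mu = rng.normal(mu_n, np.sqrt(s2_n))           # mu ~ p(mu | X)
    X_syn = rng.normal(mu, np.sqrt(sk2), size=n_syn)
    mhat, vhat = known_var_posterior(X_syn.mean(), n_syn)
    mix.append(rng.normal(mhat, np.sqrt(vhat), size=50))
mix = np.concatenate(mix)
```

With $n_{X^*} = 100\, n_X$, the mixture's mean and variance land close to $\bar\mu_{n_X}$ and $\bar\sigma^2_{n_X}$, as (19) predicts.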
The synthetic data set needs to be larger than the original, with $n_{X^*} = 5 n_X$ already giving a decent approximation and $n_{X^*} = 20 n_X$ a rather good one with the larger values of $m$. We also see that $m$ must be sufficiently large, as otherwise the method produces very jagged posteriors, for example in the top right corner.

Figure 2: Simulation results for the Gaussian mean estimation example, showing that the mixture of posteriors from synthetic data (in green) converges. In the left panel, both the analyst and the data provider have the correct known variance. The blue and orange lines overlap, as both parties have the same $p(\mu|X)$. On the right, the analyst's known variance is too small ($\hat\sigma^2_k = \frac{1}{4}\bar\sigma^2_k$), so congeniality is not met, but the mixture of posteriors from synthetic data, $p_n(\mu)$, still converges to the data provider's posterior. In both panels, $m = 400$.

Table 2: Notation for the unknown variance upstream, known variance downstream model in Section 4.2 (data provider / analyst): variance prior $\text{Inv-}\chi^2(\bar\nu_0, \bar\sigma^2_0)$ / known variance $\hat\sigma^2_k$; mean prior $N(\bar\mu_0, \bar\sigma^2/\bar\kappa_0)$ / $N(\hat\mu_0, \hat\sigma^2_0)$; data $X$ / $X^*$; single datapoint $x$ / $x^*$; data size $n_X$ / $n_{X^*}$; data average $\bar{X}$ / $\bar{X}^*$; data sample variance $s^2$ / –; posterior parameters $\bar\mu_{n_X}, \bar\kappa_{n_X}, \bar\nu_{n_X}, \bar\sigma^2_{n_X}$ / $\hat\mu_{n_{X^*}}, \hat\sigma^2_{n_{X^*}}$; posterior sample $\bar\mu, \bar\sigma^2$ / $\hat\mu$; mixture of posteriors sample $\mu^*$.

4.2 Gaussian with Unknown Variance Upstream, Known Variance Downstream

In this section, we look at a slightly more complex version of Gaussian mean estimation, where the data provider does not assume a known variance, but the analyst still assumes a known variance. We summarise the notation used in this section in Table 2.
Figure 3: Convergence of the mixture of synthetic data posteriors (in orange) with different values of $m$ (increasing top-to-bottom, from 25 to 400) and $n_{X^*}$ (increasing left-to-right, from $n_{X^*}/n_X = 1$ to 20) in Gaussian mean estimation with known variance.

In this case, the synthetic data is generated from the Gaussian mean estimation with unknown variance model, so $p(X^*|X)$ is
$$\bar\sigma^2|X \sim \text{Inv-}\chi^2(\bar\nu_{n_X}, \bar\sigma^2_{n_X}), \quad \bar\mu|\bar\sigma^2, X \sim N\Big(\bar\mu_{n_X}, \frac{\bar\sigma^2}{\bar\kappa_{n_X}}\Big), \quad X^*|\bar\mu, \bar\sigma^2 \sim N^{n_{X^*}}(\bar\mu, \bar\sigma^2),$$
with
$$s^2 = \frac{1}{n_X - 1}\sum_{i=1}^{n_X}(x_i - \bar{X})^2, \quad \bar\mu_{n_X} = \frac{\bar\kappa_0}{\bar\kappa_0 + n_X}\bar\mu_0 + \frac{n_X}{\bar\kappa_0 + n_X}\bar{X}, \quad \bar\kappa_{n_X} = \bar\kappa_0 + n_X,$$
$$\bar\nu_{n_X} = \bar\nu_0 + n_X, \quad \bar\nu_{n_X}\bar\sigma^2_{n_X} = \bar\nu_0\bar\sigma^2_0 + (n_X - 1)s^2 + \frac{\bar\kappa_0 n_X}{\bar\kappa_0 + n_X}(\bar{X} - \bar\mu_0)^2.$$
When the downstream analysis is the model with known variance $\hat\sigma^2_k$, $p(\mu^*|X^*)$ is
$$\mu^*|X^* \sim N(\hat\mu_{n_{X^*}}, \hat\sigma^2_{n_{X^*}}), \quad \hat\mu_{n_{X^*}} = \frac{\frac{1}{\hat\sigma^2_0}\hat\mu_0 + \frac{n_{X^*}}{\hat\sigma^2_k}\bar{X}^*}{\frac{1}{\hat\sigma^2_0} + \frac{n_{X^*}}{\hat\sigma^2_k}}, \quad \frac{1}{\hat\sigma^2_{n_{X^*}}} = \frac{1}{\hat\sigma^2_0} + \frac{n_{X^*}}{\hat\sigma^2_k}.$$
The following proposition shows how the mean and variance of the mixture of posteriors from synthetic data behave as the size of the synthetic data grows.

Proposition 16. If $\mu^*$ is a sample from $p_n(\mu)$ in Gaussian mean estimation with synthetic data generation assuming unknown variance, but downstream analysis assuming known variance, as $n_{X^*} \to \infty$,
$$E_{\mu^*}(\mu^*) \to \bar\mu_{n_X}, \quad \mathrm{Var}_{\mu^*}(\mu^*) \to \frac{\bar\nu_{n_X}}{\bar\nu_{n_X} - 2}\frac{\bar\sigma^2_{n_X}}{\bar\kappa_{n_X}}. \qquad (20)$$

Proof See Appendix B.4.
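The posterior predictive above can be sampled ancestrally: draw $\bar\sigma^2$ from the scaled inverse-$\chi^2$, then $\bar\mu$ given $\bar\sigma^2$, then the synthetic data. A minimal sketch of this normal-inverse-$\chi^2$ update and sampling scheme (the hyperparameter values are our hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def nix_posterior_params(X, mu0, kappa0, nu0, s0_sq):
    # Normal-Inv-chi^2 conjugate update for (mu, sigma^2).
    n = len(X)
    xbar, s_sq = X.mean(), X.var(ddof=1)
    kappa_n, nu_n = kappa0 + n, nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    s_n_sq = (nu0 * s0_sq + (n - 1) * s_sq
              + kappa0 * n / kappa_n * (xbar - mu0) ** 2) / nu_n
    return mu_n, kappa_n, nu_n, s_n_sq

def sample_posterior_predictive(X, n_syn, mu0=0.0, kappa0=1.0, nu0=1.0, s0_sq=1.0):
    mu_n, kappa_n, nu_n, s_n_sq = nix_posterior_params(X, mu0, kappa0, nu0, s0_sq)
    sigma_sq = nu_n * s_n_sq / rng.chisquare(nu_n)       # scaled Inv-chi^2 draw
    mu = rng.normal(mu_n, np.sqrt(sigma_sq / kappa_n))   # mu | sigma^2, X
    return rng.normal(mu, np.sqrt(sigma_sq), size=n_syn) # X* | mu, sigma^2

X = rng.normal(1.0, 2.0, size=100)
X_syn = sample_posterior_predictive(X, n_syn=2000)
```

Repeating the `sample_posterior_predictive` call $m$ times yields the synthetic data sets whose downstream posteriors are mixed.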
This means that $\mu^*$ asymptotically has the same mean and variance as the marginal posterior $p(\mu|X, I_S)$ of $\mu$ in the synthetic data model, which is not the same as the downstream posterior distribution $p(\mu|X, I_A)$ on the real data.

Figure 4: Results when the synthetic data is generated from the unknown variance Gaussian mean estimation model, and the analyst uses the model with known variance. On the left, the analyst's known variance is correct; on the right it is incorrect. In both cases, the mixture of synthetic data posteriors converges to the data provider's posterior. In both panels, $m = 400$.

We verify this with the simulation in Figure 4, where the synthetic data is generated from the model with unknown variance, while the analyst uses the known variance model. The setting is otherwise identical to the case where both used the known variance model in Figure 2. The mixture of synthetic data posteriors converges to the data provider's posterior, even when the analyst uses an incorrect value for the known variance $\hat\sigma^2_k$.

4.3 Size of the Synthetic Data Set

In the preceding analysis, most of the approximations hold when $n_{X^*}$ is large, even when $n_{X^*} \approx n_X$. However, based on the experiment with different values of $n_{X^*}$ and $m$ in Figure 3, $n_{X^*} \gg n_X$ is needed for all of the approximations to hold. This is explained by looking at $\mathrm{Var}_{X^*}(\bar{X}^*)$. In the case where both parties use the known variance model,
$$\mathrm{Var}_{X^*}(\bar{X}^*) = \frac{1}{n_{X^*}} E_{\bar\mu}\big(\mathrm{Var}_{x^*|\bar\mu}(x^*_i)\big) + \mathrm{Var}_{\bar\mu}(\bar\mu) = \frac{\bar\sigma^2_k}{n_{X^*}} + \bar\sigma^2_{n_X}.$$
If $n_{X^*} \approx n_X$ and both are large, $\bar\sigma^2_{n_X} = \big(\frac{1}{\bar\sigma^2_0} + \frac{n_X}{\bar\sigma^2_k}\big)^{-1} \approx \frac{\bar\sigma^2_k}{n_X}$, so
$$\mathrm{Var}_{X^*}(\bar{X}^*) \approx \frac{\bar\sigma^2_k}{n_{X^*}} + \frac{\bar\sigma^2_k}{n_X} \approx \frac{2\bar\sigma^2_k}{n_X}.$$
With these approximations, $\mathrm{Var}_{\bar\mu}(\bar\mu) \approx \frac{\bar\sigma^2_k}{n_X}$, so $\mathrm{Var}_{X^*}(\bar{X}^*) \approx 2\,\mathrm{Var}_{\bar\mu}(\bar\mu)$, while the $n_{X^*} \to \infty$ limit is $\mathrm{Var}_{X^*}(\bar{X}^*) \to \mathrm{Var}_{\bar\mu}(\bar\mu)$. This means that $n_{X^*} \gg n_X$ is required. The same happens when the synthetic data is generated from the unknown variance model:
$$\mathrm{Var}_{X^*}(\bar{X}^*) = \frac{1}{n_{X^*}} E_{\bar\sigma^2}(\bar\sigma^2) + \mathrm{Var}_{\bar\mu}(\bar\mu) = \frac{1}{n_{X^*}}\frac{\bar\nu_0 + n_X}{\bar\nu_0 + n_X - 2}\bar\sigma^2_{n_X} + \frac{\bar\sigma^2_{n_X}}{\bar\kappa_0 + n_X}.$$
If $n_{X^*} \approx n_X$ and both are large, $\frac{\bar\nu_0 + n_X}{\bar\nu_0 + n_X - 2} \approx 1$ and $\bar\kappa_0 + n_X \approx n_X$, so
$$\mathrm{Var}_{X^*}(\bar{X}^*) \approx \frac{\bar\sigma^2_{n_X}}{n_{X^*}} + \frac{\bar\sigma^2_{n_X}}{n_X} \approx \frac{2\bar\sigma^2_{n_X}}{n_X}.$$
With these approximations, $\mathrm{Var}_{\bar\mu}(\bar\mu) \approx \frac{\bar\sigma^2_{n_X}}{n_X}$, so $\mathrm{Var}_{X^*}(\bar{X}^*) \approx 2\,\mathrm{Var}_{\bar\mu}(\bar\mu)$. This leads to the same conclusion as the previous case: $n_{X^*} \gg n_X$ is required.

4.4 Approximate Variance Correction

With similar approximations, we can find a correction term for the variance. When $n_{X^*}$ is large,
$$\mathrm{Var}_{\mu^*}(\mu^*) \approx E_{X^*}(\hat\sigma^2_{n_{X^*}}) + \mathrm{Var}_{X^*}(\bar{X}^*).$$
If $n_{X^*} = c\, n_X$ for some $c > 0$, from the analyses in Section 4.3, we get
$$\mathrm{Var}_{X^*}(\bar{X}^*) \approx \Big(1 + \frac{1}{c}\Big)\mathrm{Var}_{\bar\mu}(\bar\mu), \quad \text{so} \quad \mathrm{Var}_{\mu^*}(\mu^*) \approx E_{X^*}(\hat\sigma^2_{n_{X^*}}) + \Big(1 + \frac{1}{c}\Big)\mathrm{Var}_{\bar\mu}(\bar\mu).$$
Solving for $\mathrm{Var}_{\bar\mu}(\bar\mu)$ gives
$$\mathrm{Var}_{\bar\mu}(\bar\mu) \approx \Big(1 + \frac{1}{c}\Big)^{-1}\Big(\mathrm{Var}_{\mu^*}(\mu^*) - E_{X^*}(\hat\sigma^2_{n_{X^*}})\Big), \qquad (21)$$
which gives a Rubin's rules-like (Rubin, 1987) approximation of $\mathrm{Var}_{\bar\mu}(\bar\mu)$ that can be computed from smaller synthetic data sets with $n_{X^*} \approx n_X$. We validate this with the experiment in Figure 5, which shows that approximating $p(\mu|X, I_A)$ with a Gaussian with variance from (21) is closer to the real data posterior than the mixed posterior approximation from Section 3.

Figure 5: Results with the Gaussian approximation with $n_{X^*} = n_X$, showing that the Gaussian approximation is closer to the real data posterior than the mixture of synthetic data posteriors. On the left, the synthetic data is generated from the known variance model, and on the right, the synthetic data is generated from the unknown variance model.
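The correction (21) only needs, per synthetic data set, the downstream posterior mean and variance, so it can be computed Rubin's-rules-style from synthetic sets of the same size as the original. The sketch below is our own illustration under the known-variance model with $c = 1$ and hypothetical values: the between-set variance of posterior means plus the mean within-set variance estimates $\mathrm{Var}_{\mu^*}(\mu^*)$, and the correction recovers the real-data posterior variance.

```python
import numpy as np

rng = np.random.default_rng(4)

n_X, sk2, s02, c = 100, 4.0, 100.0, 1.0
n_syn = int(c * n_X)                      # n_X* = c * n_X, here equal sizes
X = rng.normal(1.0, 2.0, size=n_X)

def known_var_posterior(xbar, n):
    prec = 1 / s02 + n / sk2              # prior N(0, s02), known variance sk2
    return n * xbar / sk2 / prec, 1 / prec

mu_n, s2_n = known_var_posterior(X.mean(), n_X)   # target: Var(mu | X) = s2_n

m = 2000
means, within = [], []
for _ in range(m):
    mu = rng.normal(mu_n, np.sqrt(s2_n))
    X_syn = rng.normal(mu, np.sqrt(sk2), size=n_syn)
    mhat, vhat = known_var_posterior(X_syn.mean(), n_syn)
    means.append(mhat)
    within.append(vhat)

# Variance of the mixture = between-set variance of means + mean within-set variance.
var_mix = np.var(means) + np.mean(within)
# Correction (21): subtract E(sigma-hat^2) and rescale by (1 + 1/c)^(-1).
var_corrected = (var_mix - np.mean(within)) / (1 + 1 / c)
```

With $c = 1$ the uncorrected mixture variance is roughly twice the target, while `var_corrected` is close to $\bar\sigma^2_{n_X}$.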
In both cases, the known variances for both parties are correct, and $m = 400$.

Table 3: Notation for the known mean, unknown variance model in Section 4.5 (data provider / analyst): known mean $\bar\mu_k$ / $\hat\mu_k$; prior parameters $\bar\nu_0, \bar\sigma^2_0$ / $\hat\nu_0, \hat\sigma^2_0$; data $X$ / $X^*$; single datapoint $x$ / $x^*$; data size $n_X$ / $n_{X^*}$; data variance $\bar{v}$ / $\hat{v}$; posterior parameters $\bar\nu_{n_X}, \bar\sigma^2_{n_X}$ / $\hat\nu_{n_{X^*}}, \hat\sigma^2_{n_{X^*}}$; posterior sample $\bar\sigma^2$ / $\hat\sigma^2$; mixture of posteriors sample $\sigma^{2*}$.

4.5 Gaussian with Known Mean, Unknown Variance

To assess the effects of uncongeniality when the downstream posterior is not Gaussian, we look at Bayesian estimation of the variance of a Gaussian with known mean. The notation for this section is summarised in Table 3. In this case, the data provider's conjugate prior is $\bar\sigma^2 \sim \text{Inv-}\chi^2(\bar\nu_0, \bar\sigma^2_0)$, and their known mean is $\bar\mu_k$. The synthetic data is generated from
$$x^*_i|\bar\sigma^2 \sim N(\bar\mu_k, \bar\sigma^2), \quad \bar\sigma^2|X \sim \text{Inv-}\chi^2(\bar\nu_{n_X}, \bar\sigma^2_{n_X}),$$
$$\bar{v} = \frac{1}{n_X}\sum_{i=1}^{n_X}(x_i - \bar\mu_k)^2, \quad \bar\nu_{n_X} = \bar\nu_0 + n_X, \quad \bar\sigma^2_{n_X} = \frac{\bar\nu_0\bar\sigma^2_0 + n_X\bar{v}}{\bar\nu_0 + n_X}.$$
The analyst's conjugate prior is $\hat\sigma^2 \sim \text{Inv-}\chi^2(\hat\nu_0, \hat\sigma^2_0)$, their known mean is $\hat\mu_k$, and the downstream posterior is
$$\hat{v} = \frac{1}{n_{X^*}}\sum_{i=1}^{n_{X^*}}(x^*_i - \hat\mu_k)^2, \quad \hat\nu_{n_{X^*}} = \hat\nu_0 + n_{X^*}, \quad \hat\sigma^2_{n_{X^*}} = \frac{\hat\nu_0\hat\sigma^2_0 + n_{X^*}\hat{v}}{\hat\nu_0 + n_{X^*}}, \quad \hat\sigma^2|X^* \sim \text{Inv-}\chi^2(\hat\nu_{n_{X^*}}, \hat\sigma^2_{n_{X^*}}).$$
The behavior of the mixture of posteriors from synthetic data is given in the following proposition.

Proposition 17. If $\sigma^{2*}$ is a sample from $p_n(\sigma^2)$ in Gaussian variance estimation with known mean, as $n_{X^*} \to \infty$,
$$E_{\sigma^{2*}}(\sigma^{2*}) \to E_{\bar\sigma^2}(\bar\sigma^2) + (\bar\mu_k - \hat\mu_k)^2. \qquad (22)$$

Proof See Appendix B.4.

This means that mixing the downstream posteriors can only recover the data provider's posterior when both parties have equal known means.

Figure 6: Posteriors from estimating a Gaussian variance $\sigma^2$ with known mean. On the left, both the data provider and the analyst use the correct known mean, so $p_n(\sigma^2)$ converges as expected by our theory.
On the right, the analyst's known mean is incorrect, so the model is not congenial. In this case, $p_n(\sigma^2)$ does not converge to either the analyst's or the data provider's posterior. After correcting the mean of $p_n(\sigma^2)$ as in (22), it appears to have the same variance and shape as the data provider's posterior. The gray line shows the true parameter value. In both panels, $m = 400$.

We verify this with a simulation shown in Figure 6. Both the data provider and the analyst use the Gaussian with unknown variance and known mean as their model. Otherwise, the setting is identical with the other Gaussian examples. When both parties have the correct known mean, $p_n(\sigma^2)$ converges as expected, but when the analyst has an incorrect known mean, $p_n(\sigma^2)$ converges to neither party's posterior. However, after applying the mean correction from (22), $p_n(\sigma^2)$ appears to have the same variance and shape as the data provider's posterior.

5. Differentially Private Logistic Regression

Our second example is logistic regression with DP synthetic data. We consider two settings used by Räisä et al. (2023) with frequentist logistic regression, and change the downstream task to Bayesian logistic regression. The first setting uses a simple toy dataset, and the second uses the UCI Adult dataset (Kohavi and Becker, 1996). We also consider a hierarchical logistic regression task on the Adult dataset.

Under DP, $Z$ is a noisy summary $\tilde{s}$ of the real data. We need synthetic data sampled from the posterior predictive $p(X^*|\tilde{s})$, which is exactly what the NAPSU-MQ algorithm of Räisä et al. (2023) provides. In NAPSU-MQ, $\tilde{s}$ contains the values of user-selected marginal queries with added Gaussian noise. We used the open-source implementation of NAPSU-MQ³ by Räisä et al. (2023), and describe NAPSU-MQ in Section 2.3. To demonstrate our theory outside of the NAPSU-MQ algorithm, we include the synthpop (Nowok et al., 2016) synthetic data generator, which does not provide DP, in the Adult
3. https://github.com/DPBayes/NAPSU-MQ-experiments

logistic regression experiment. Like NAPSU-MQ, synthpop attempts to generate synthetic data from a posterior predictive distribution, which is done by sequentially training predictive models to predict one column from the previous ones. Synthetic data is generated column-by-column by sampling the first column from the real data, and sampling the successive columns from the predictive models conditioned on the already sampled columns. Note that both synthpop and NAPSU-MQ generate the covariates and the target variable jointly.

We assess the quality of posteriors from different methods by the coverages and widths of their credible intervals. Recall that a credible interval is an interval that contains a given amount of posterior probability mass, for example 95%, which we call the confidence level of the interval. The coverage is the fraction of times, in repeated experiments, the credible interval contained the true parameter value. We want the coverage to be at or above the confidence level for the interval to be valid. We also compare marginals of the posterior visually.

5.1 Posterior Sampling Background

For the DP experiments, we rely on Markov chain Monte Carlo (MCMC) algorithms (Gelman et al., 2014), in particular the No-U-Turn sampler (NUTS; Hoffman and Gelman, 2014), to sample the posterior in NAPSU-MQ. Recall that MCMC algorithms sample a distribution whose density is only known up to a normalising constant by obtaining many samples from a Markov chain. We discard the initial elements of the chain, where it has not yet converged (Gelman et al., 2014). We call these discarded elements warmup samples, and the rest kept samples. We run the sampler multiple times from different starting points, forming multiple chains, and mix the kept samples from each chain to form the final posterior sample. We check for convergence of the sampler with the $\hat{R}$ statistic (Gelman et al., 2014).
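The credible-interval evaluation described above can be sketched directly from posterior samples: compute an equal-tailed interval per run, then count how often it contains the true value. The toy posteriors below are our own hypothetical construction, built to be well calibrated so the coverage should match the confidence level.

```python
import numpy as np

rng = np.random.default_rng(5)

def credible_interval(samples, level):
    # Equal-tailed interval containing `level` posterior probability mass.
    alpha = (1 - level) / 2
    return np.quantile(samples, alpha), np.quantile(samples, 1 - alpha)

def coverage(runs, true_value, level):
    # Fraction of repeated experiments whose interval contains the truth.
    hits = sum(lo <= true_value <= hi
               for lo, hi in (credible_interval(s, level) for s in runs))
    return hits / len(runs)

# Calibrated toy posteriors: theta_hat = truth + N(0, 1) noise, posterior N(theta_hat, 1).
true_q = 0.0
runs = [rng.normal(rng.normal(true_q, 1.0), 1.0, size=4000) for _ in range(500)]
cov = coverage(runs, true_q, level=0.9)
```

An overconfident method would produce systematically narrower intervals and a coverage below the confidence level.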
For downstream posteriors, we use Laplace approximations (Gelman et al., 2014), also known as Laplace's method. The Laplace approximation is simply a Gaussian distribution centered at the mode of the posterior, with covariance given by the inverse of the negated Hessian of the log posterior density at the mode.

5.2 Toy Data Logistic Regression

The first logistic regression setting we consider uses a simple toy data set of three binary variables, with $n_X = 2000$ samples. The first two variables are sampled with independent coinflips, and the third is sampled from logistic regression on the other two, with coefficients $(1, 0)$. The prior for the downstream logistic regression is $N(0, 10 I)$. We generate synthetic data with the NAPSU-MQ algorithm (Räisä et al., 2023), instructing the algorithm to generate $m$ synthetic data sets of size $n_{X^*}$. For the privacy bounds, we vary $\epsilon$, and set $\delta = n_X^{-2} = 2.5 \times 10^{-7}$. Because of the simplicity of this model, it is possible to use the exact posterior decomposition (4) as a baseline, by using $p(X|\tilde{s})$ instead of $p(X^*|\tilde{s})$ to generate synthetic data. We give a detailed description of this process in Appendix C. We have also included the DP-GLM algorithm (Kulkarni et al., 2021) that does not use synthetic data, and the non-DP posterior from the real data as baselines. We obtained the code for DP-GLM from Kulkarni et al. (2021) upon request.

5.2.1 Hyperparameters

For NAPSU-MQ, we use the hyperparameters of Räisä et al. (2023), except we used NUTS (Hoffman and Gelman, 2014) with 200 warmup samples and 500 kept samples per chain for $\epsilon \in \{0.5, 1\}$, and 1500 kept samples per chain for $\epsilon = 0.1$, as the posterior sampling algorithm. The NAPSU-MQ prior is $N(0, 10^2 I)$, and the summary is the single 3-way marginal query over all three variables.
The hyperparameters of DP-GLM are the L2-norm upper bound R for the covariates of the logistic regression, a coefficient norm upper bound s, and the parameters of the posterior sampling algorithm DP-GLM uses. We set R = 2 so that the covariates do not get clipped, and set s = 5 after some preliminary runs. The posterior sampling algorithm is NUTS (Hoffman and Gelman, 2014) with 1000 warmup samples and 1000 kept samples from 4 parallel chains. 5.2.2 Results Figure 7 compares the mixture of posteriors from synthetic data pn(Q) from (11) that uses p(Q|X ), with n X /n X = 20 and m = 400 synthetic data sets, to the baselines. The mixture pn(Q) is very close to the posterior p(Q| s) from (4). The DP-GLM posterior that does not use synthetic data is somewhat wider. We ran the experiment 100 times and also with ϵ = 0.1 and ϵ = 0.5, and plot coverages and widths of credible intervals in Figure 8. With ϵ = 1 and ϵ = 0.5, the coverages are accurate and DP-GLM consistently produces wider intervals. With ϵ = 0.1, the mixture of synthetic data posteriors likely needs more and larger synthetic data sets to converge, as it produced wider and slightly overconfident intervals for one coefficient. Figures 9, 10 and 11 look at how pn(Q) converges to p(Q| s) both visually and in terms of total variation distance as n X and m increase. They show that both need to be increased together for pn(Q) to converge to the target, otherwise the total variation distance plateaus at some point, and there is a clear visual difference. 5.2.3 Plotting Details The plotted density of DP-GLM in Figure 7 is a kernel density estimate from the posterior samples DP-GLM returns. The non-DP density is a Laplace approximation. Both synthetic data methods use Laplace approximations in the downstream analysis, so their posteriors are mixtures of these Laplace approximations for each synthetic data set. This was also used in Figure 9. 
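The Laplace approximations used for the downstream posteriors can be illustrated in one dimension: Newton-iterate to the posterior mode, then set the variance to the inverse negated second derivative of the log density there. The finite-difference sketch below is our own minimal illustration, not the implementation used in the experiments; it is checked on a log density where the Laplace approximation is exact.

```python
def laplace_approximation(log_post, x0, h=1e-4, iters=50):
    """1-D Laplace approximation: Gaussian centred at the posterior mode,
    with variance -1 / (second derivative of the log density at the mode)."""
    x = x0
    for _ in range(iters):
        g = (log_post(x + h) - log_post(x - h)) / (2 * h)      # central difference
        H = (log_post(x + h) - 2 * log_post(x) + log_post(x - h)) / h**2
        x = x - g / H                                          # Newton step to the mode
    H = (log_post(x + h) - 2 * log_post(x) + log_post(x - h)) / h**2
    return x, -1.0 / H

# Sanity check on a density where Laplace is exact: log N(3, 2^2) up to a constant.
mode, var = laplace_approximation(lambda t: -0.5 * (t - 3.0)**2 / 4.0, x0=0.0)
```

In the multivariate case the same recipe applies with a gradient, a Hessian matrix, and its negated inverse as the covariance.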
5.3 UCI Adult Logistic Regression

To test our theory on real data, we used the UCI Adult data set (Kohavi and Becker, 1996) setting that was used to test NAPSU-MQ (Räisä et al., 2023). We generate synthetic data with either NAPSU-MQ under DP, or with the synthpop algorithms (Nowok et al., 2016) without DP. In this setting, the synthetic data set is generated from a subset of 10 columns⁴, with the continuous columns age and hours-per-week discretised to 5 categories, and capital-loss and capital-gain binarised according to whether they are greater than 0 or not. The income column is already binarised in the original data to denote whether it is over $50000 or not. All rows with missing values in the original data set are deleted, which results in $n_X = 46043$ datapoints. The downstream task is logistic regression predicting income using age, race and gender, with age converted back to a continuous value by picking the midpoint of each category. The reference value for race is "white" and for gender is "female".

4. age, workclass, education, marital-status, race, gender, capital-gain, capital-loss, hours-per-week and income

Figure 7: Posteriors in the DP logistic regression experiment, where $Q$ are the regression coefficients. The mixture of posteriors from synthetic data, $p_n(Q)$ (with $n_{X^*}/n_X = 20$, $m = 400$), is very close to the private posterior $p(Q|\tilde{s})$ computed using (4). Computing the posterior without synthetic data with DP-GLM gives a somewhat wider posterior. The true parameter values are highlighted by the grey dashed lines and shown in the panel titles. The privacy bounds are $\epsilon = 1$, $\delta = n_X^{-2} = 2.5\times10^{-7}$.
These subsets were originally used to make the runtime of NAPSU-MQ manageable, and to make sure that enough relevant information for the downstream task can be included in the input queries for NAPSU-MQ (Räisä et al., 2023). We take bootstrap samples of the data to simulate draws from a population.

5.3.1 Algorithms and Hyperparameters

The target distribution $p(Q|Z)$ is not tractable in this setting, so we used the non-DP Laplace approximation from the original data set as an approximate ground truth posterior, and the DP variational inference (DPVI) algorithm (Jälkö et al., 2017; Prediger et al., 2022) as a baseline. We also tried running DP-GLM (Kulkarni et al., 2021), but we were not able to get useful results out of it in this setting. We have also included the Gaussian approximation to the mixture of synthetic data posteriors discussed in Section 4.4, which is called variance correction in the figures. Unlike the mixture of posteriors, the variance correction approximation uses synthetic data sets of the same size as the real data set.

The prior for the downstream Bayesian logistic regression is $N(0, 10)$, i.i.d. for each coefficient. The privacy parameters are $\epsilon \in \{0.25, 0.5, 1\}$, and $\delta = n_X^{-2} \approx 4.7 \times 10^{-10}$. We repeat the experiment 20 times.

Figure 8: (a) Coverages of credible intervals in the toy data experiment. The mixture of synthetic data posteriors is accurate, except with $\epsilon = 0.1$, where it may not have converged yet.
(b) Widths of credible intervals in the toy data experiment. DP-GLM produces much wider intervals than other methods, except with $\epsilon = 0.1$.

Figure 9: Convergence of the mixture of synthetic data posteriors (in blue) with different values of $m$ (from 25 to 400) and $n_{X^*}$ ($n_{X^*}/n_X$ from 1 to 20) in the toy data logistic regression experiment.

Figure 10: Total variation distance (TVD) between $p_n(Q)$ and the target $p(Q|\tilde{s})$ for both 1D marginals (blue and orange) in the toy data experiment, for $\epsilon \in \{0.1, 0.5, 1\}$ and $m$ from 25 to 400. For $\epsilon \in \{0.5, 1\}$, increasing the size of the synthetic data sets $n_{X^*}$ decreases the total variation distance at a steady rate, until hitting a point where the decrease stops. This point moves further as the number of synthetic data sets $m$ increases.

Figure 11: Total variation distance (TVD) between $p_n(Q)$ and the target $p(Q|\tilde{s})$ for both 1D marginals (blue and orange) in the toy data experiment, with the roles of $n_{X^*}$ and $m$ swapped from Figure 10. Increasing $m$ decreases the total variation distance at a rate which depends on $n_{X^*}$ and $\epsilon$.

The hyperparameters, prior, and selected queries for NAPSU-MQ are the same as in the original paper (Räisä et al., 2023). The synthetic data set size and number are $n_{X^*}/n_X = 10$, $m = 100$. For the variance correction approximation, $n_{X^*}/n_X = 1$. For the predictive models used by synthpop, we used the parametric option, which chooses from various parametric models based on the types of the involved variables. We chose the parametric option after observing that the default of using decision trees as the predictor resulted in biased downstream estimates of several coefficients.

DPVI runs DP-SGD (Rajkumar and Agarwal, 2012; Song et al., 2013; Abadi et al., 2016), specifically DP-Adam, under the hood, so it inherits the clip bound, learning rate, number of iterations, and subsampling (without replacement) ratio hyperparameters from DP-SGD. We tuned these with the Optuna library (Akiba et al., 2019), using the bounds $[0.1, 50]$ for the clip bound, $[10^{-4}, 10^{-1}]$ for the learning rate, $[10^4, 10^5]$ for the number of iterations and $[0.001, 1]$ for the subsampling ratio. We used the distance of the DPVI posterior mean from the non-DP real data Laplace approximation as the optimisation criterion. We also tried using KL divergence, which gave hyperparameters that produced much wider posteriors.
We used 100 trials for the tuning, and repeated it independently for all values of ϵ. The privacy cost of the hyperparameter tuning is not reflected in the final results. As the variational posterior, we used a mean-field Gaussian. 5.3.2 Results Figure 12 compares posteriors from one of the 20 runs with ϵ = 1. The mixture of synthetic data posteriors pn(Q) from NAPSU-MQ is fairly close to the ground truth posterior from the real data set, with the exception of two coefficients. The Gaussian approximation to pn(Q) from Section 4.4 is very close to pn(Q). The mixture of posteriors from synthpop follows ground truth posterior fairly well. The posteriors from DPVI are close to ground truth posterior, but for some coefficients, they are too narrow to overlap the ground truth posterior. The logistic regression coefficients for which pn(Q) does not work well correspond to the two races with the smallest number of people in the original data set. These posteriors are very wide due to the fact that NAPSU-MQ adds noise uniformly to all queries, which means that the queries with small values, corresponding to minority groups in the data, get relatively larger amounts of noise. In contrast, synthpop is able to approximate these posteriors well, since it does not need to add noise for DP. Figure 13 shows credible interval coverages from the experiment for the DP algorithms, computed from 20 runs, with different bootstrap samples each run to simulate draws from a population. The mixture pn(Q) does not achieve perfect coverages, as some information is lost due to not running NAPSU-MQ with all marginal queries. The coverages are still much better than DPVI. The coverages of the variance correction approximation to pn(Q) are close to the coverages of pn(Q). Figure 14 shows the widths of credible intervals from the same 20 runs. 
DPVI produces narrower posteriors than $p_n(Q)$, but the width of the $p_n(Q)$ posteriors drops as $\epsilon$ increases, reflecting the reduced uncertainty from DP noise, which is not the case for DPVI. The variance correction approximation produces narrower posteriors than $p_n(Q)$, especially with $\epsilon = 0.25$.

Figure 12: Posteriors from one run on the Adult logistic regression experiment with $\epsilon = 1$ (methods: non-DP Laplace, syn. data $p_n(Q)$, variance correction ($n_{X^*} = n_X$), DPVI, synthpop (non-DP)).

In summary, while DPVI is able to find the posterior mean fairly well, it fails to accurately reflect the additional uncertainty from DP, making the posteriors overconfident, which would lead to spurious findings if applied in practice. In contrast, $p_n(Q)$ accounts for the DP noise very well, at the cost of producing very wide posteriors for the coefficients with a large amount of noise. Estimating uncertainty reliably is much more important than producing a narrow uncertainty estimate: a very wide posterior containing the correct value signals uncertainty, but a narrow posterior in the wrong place is confidently incorrect. The variance correction approximation performed well, despite using smaller synthetic data sets: it had valid coverage in most cases while producing somewhat narrower posteriors than the mixture of posteriors. The narrower posteriors are likely a result of the mixture of posteriors having heavier tails than the Gaussian that the variance correction uses. Figure 15 shows the credible interval coverages and widths for synthpop, along with NAPSU-MQ results with $\epsilon = 1$ from Figures 13 and 14.
Synthpop has similar coverages to NAPSU-MQ, but the credible interval widths demonstrate the price of DP: synthpop results in much narrower intervals due to not needing noise, especially for the two small minority races that NAPSU-MQ struggles with.

5.3.3 Plotting Details

The plotted densities for the non-DP posterior and the mixture of synthetic data posteriors use Laplace approximations like in the toy data experiment. The posterior from DPVI is a multivariate Gaussian, which is plotted as is.

5.4 Hierarchical Logistic Regression

We test our theory on a downstream posterior with more complicated geometry with a hierarchical logistic regression model on the UCI Adult (Kohavi and Becker, 1996) dataset.

Figure 13: Credible interval coverages on the Adult logistic regression experiment.

Figure 14: Credible interval widths on the Adult logistic regression experiment.

Figure 15: Results on the Adult dataset with NAPSU-MQ (ϵ = 1) and synthpop (non-DP). The panels show (a) credible interval coverages, (b) credible interval widths.

We use the same preprocessing steps as in Section 5.3, except we drop two columns (see footnote 5) from the synthetic data generation to reduce the runtime. The remaining columns are shown as the graph nodes in Figure 16. As in Section 5.3, we take bootstrap samples of the real data before generating synthetic data to simulate draws from a population. We repeat the whole experiment 5 times with different initial bootstrap samples. The downstream task is logistic regression predicting the binary income variable from age, but with different coefficients for the two genders represented in the dataset. This type of model could be used if the analyst suspects the relationship between age and income to differ between men and women. Specifically, the model is
$$
\begin{aligned}
\mu &\sim \mathcal{N}_2(0,\, 10 I_2) \\
v &\sim \mathcal{N}_2(0,\, I_2) \\
\tau &= e^v \\
\theta_F &\sim \mathcal{N}_2(\mu,\, \operatorname{diag}(\tau^2)) \\
\theta_M &\sim \mathcal{N}_2(\mu,\, \operatorname{diag}(\tau^2)) \\
\alpha &= X^T (\mathbb{I}_{G=F}\, \theta_F + \mathbb{I}_{G=M}\, \theta_M) \\
y &\sim \mathrm{Bernoulli}\left( \frac{1}{1 + e^{-\alpha}} \right)
\end{aligned}
$$
where X = (Age, 1) is the covariate age and a constant, y is the income, G ∈ {F, M} is the gender, θF and θM are the coefficients for the two genders, µ ∈ R² and τ ∈ R²₊ are a common mean and standard deviation for the coefficients, and v ∈ R² is an auxiliary variable that avoids the need to directly sample τ from a constrained space.
In addition, diag(τ²) is a diagonal matrix with τ² on the diagonal, $\mathbb{I}_{G=g}$ is an indicator of the gender having a specific value g, and the operations $e^v$ and $\tau^2$ on the vectors v and τ are elementwise. Hierarchical models are known to have difficult, non-Gaussian posterior geometry (Betancourt and Girolami, 2013). We see that this is the case for this hierarchical logistic regression model in Figure 18, which shows that the (µ₁, v₁) and (µ₂, v₂) marginal distributions have funnel-like geometry.

5.4.1 Algorithms and Hyperparameters

We ran a preliminary experiment with the same input queries to NAPSU-MQ that we used in Section 5.3, but we found those to give biased estimates of the coefficients in the hierarchical setting. To correct the bias, we added the 3-way marginal query (age, gender, income) to the input queries. We then removed two columns from the generated synthetic data to reduce runtime. Figure 16 shows the input queries and remaining columns. Otherwise, we used the same hyperparameters for NAPSU-MQ as in Section 5.3. In particular, the privacy parameters are $\epsilon = 1$, $\delta = n_X^{-2} \approx 4.7 \times 10^{-10}$. We keep the size of the synthetic datasets at $n_{X^*} / n_X = 10$, but increase the number of synthetic datasets to m = 200 due to the increased complexity of the hierarchical downstream posterior. We use NUTS (Hoffman and Gelman, 2014) to sample the downstream posterior, with 2 chains, 1000 kept samples, and 500 warmup samples.

Footnote 5: These are workclass and marital-status.

Figure 16: The graph representing the queries given to NAPSU-MQ in the hierarchical logistic regression experiment. The clique (age, gender, income) represents a 3-way marginal query, while the other edges represent 2-way marginal queries.

5.4.2 Results

We compare the posterior from mixing the individual posteriors from multiple DP synthetic datasets to the non-DP posterior from the real data.
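The hierarchical model described above can equivalently be written as an unnormalised log joint density, which is the quantity a sampler like NUTS needs. The following is a minimal NumPy sketch of that density, our own illustration rather than the paper's code; we take τ = e^v elementwise (as implied by the auxiliary-variable construction) and drop additive constants:

```python
import numpy as np

def log_joint(mu, v, theta_f, theta_m, age, gender_is_f, income):
    """Unnormalised log joint density of the hierarchical logistic
    regression model; mu, v, theta_f, theta_m are length-2 arrays,
    age, gender_is_f, income are length-N data arrays."""
    tau2 = np.exp(2.0 * v)                      # tau = e^v elementwise
    lp = -0.5 * np.sum(mu ** 2) / 10.0          # mu ~ N_2(0, 10 I_2)
    lp += -0.5 * np.sum(v ** 2)                 # v ~ N_2(0, I_2)
    # theta_F, theta_M ~ N_2(mu, diag(tau^2)), up to 2*pi constants
    lp += -0.5 * np.sum((theta_f - mu) ** 2 / tau2) - 0.5 * np.sum(np.log(tau2))
    lp += -0.5 * np.sum((theta_m - mu) ** 2 / tau2) - 0.5 * np.sum(np.log(tau2))
    X = np.stack([age, np.ones_like(age)], axis=1)          # X = (age, 1)
    theta = np.where(gender_is_f[:, None], theta_f, theta_m)
    alpha = np.sum(X * theta, axis=1)
    # y ~ Bernoulli(sigmoid(alpha)); log1p(exp(.)) is the log-partition
    lp += np.sum(income * alpha - np.log1p(np.exp(alpha)))
    return lp

# Evaluate at the prior mean with two hypothetical data points
val = log_joint(np.zeros(2), np.zeros(2), np.zeros(2), np.zeros(2),
                np.array([0.1, 0.5]), np.array([True, False]),
                np.array([1.0, 0.0]))
```

The funnel geometry mentioned above arises because the scale of θF and θM is itself a sampled quantity, e^v, so the coefficients' plausible range shrinks and grows with v.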
In Figure 17, we plot the one-dimensional marginals of all downstream posterior components with the exception of v, which is a deterministic transform of τ. Each of the 5 repeats has a different line for synthetic data. In panel (a), we see that the female coefficient posteriors have a much larger variance in the synthetic data posterior compared to the real data posterior, which is to be expected due to DP noise and the fact that only about one third of the datapoints in the Adult dataset are women. The posteriors of the male coefficients are fairly similar between the synthetic and real data. In panels (b) and (c), we see that the posteriors for µ and τ are very similar between real and synthetic data. In Figure 18, we plot selected 2-dimensional posterior marginals from the first repeat of the experiment. We selected 3 pairs of components between µ and v for their non-Gaussian geometry, and the pairs comparing the same coefficient between men and women, which could be of interest to the analyst. In the 3 leftmost columns, we see that the posterior marginals with non-Gaussian geometry are very similar between real and synthetic data. In the rightmost 2 columns, we see the same result as with the one-dimensional marginals: the female coefficients have a larger variance in the synthetic data posterior due to the DP noise, but the synthetic data posterior still covers the bulk of the real data posterior. We checked the convergence of the downstream posterior samples with the $\hat{R}$ statistic (Gelman et al., 2014). We find that in each of the 5 repeats, there are between 6 and 13 synthetic datasets out of m = 200 with $\hat{R} > 1.05$ for at least one coefficient, which indicates a convergence issue for these synthetic datasets. Since this is a very small fraction of the synthetic datasets, these convergence issues do not affect the results significantly.
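The $\hat{R}$ diagnostic used above can be computed from the chains directly. A minimal sketch of the potential scale reduction factor of Gelman et al. (2014), without the split-chain refinement; the function and variable names are ours:

```python
import numpy as np

def r_hat(chains):
    """Potential scale reduction factor R-hat (Gelman et al., 2014) for an
    (m, n) array of m chains with n draws each of a single parameter."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (n - 1) / n * W + B / n         # marginal variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(1)
mixed_well = rng.normal(0.0, 1.0, size=(2, 1000))  # both chains on target
stuck = np.stack([rng.normal(0.0, 1.0, 1000),      # chains in different
                  rng.normal(3.0, 1.0, 1000)])     # modes: R-hat >> 1
```

Chains that explore the same distribution give $\hat{R} \approx 1$; chains stuck in different regions inflate the between-chain variance and push $\hat{R}$ well above the 1.05 threshold used in the experiment.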
Figure 17: Marginal posterior distributions from hierarchical logistic regression on the Adult dataset, with 5 repeats of generating synthetic data. The coefficients relating to the female gender have wider posteriors due to the DP noise and the lower number of women in the dataset. For the other coefficients, the DP synthetic data posterior is very close to the non-DP real data posterior. The privacy parameters are $\epsilon = 1$, $\delta \approx 4.7 \times 10^{-10}$.

Figure 18: Selected pairwise posterior marginals from hierarchical logistic regression on the Adult dataset from the first run of synthetic data generation. The leftmost 3 pairs with non-Gaussian geometry are very similar between the DP synthetic data and non-DP real data. The gender-related coefficients have higher variance in the synthetic data posterior due to the DP noise. The privacy parameters are $\epsilon = 1$, $\delta \approx 4.7 \times 10^{-10}$.

5.4.3 Plotting Details

The plots in Figure 17 use kernel density estimates obtained from posterior samples. The plots in Figure 18 are scatterplots of the posterior samples.
The synthetic data scatterplots show a subsample of the posterior samples with an equal number of samples to the real data posterior.

6. Discussion

Synthetic data is often considered a substitute for sensitive real data. Since the data generation process is based on having access to Z, one might ask why the synthetic data is needed in the first place: why can we not simply perform the downstream posterior analysis directly using Z? Our analysis allows Z to be an arbitrary, even noisy, representation of the data, and it might be difficult for the analyst to specify a model for Q under such a generative process. In most applications, the analyst does have a model for Q arising from the data. Therefore, using the synthetic data as a proxy for Z allows the analyst to use existing models and inference methods to perform the analysis.

6.1 Limitations

A clear limitation of mixing posteriors from multiple synthetic data sets is the computational cost of analysing many large synthetic data sets. This may be substantial for more complex Bayesian downstream models, where even a single analysis can be computationally expensive. However, the separate analyses can be run in parallel. We also expect that the information gained from sampling the posteriors from a few synthetic data sets could be used to speed up sampling the others, for example by using importance sampling, as they likely will not be too far from the sampled ones. Under DP, we need noise-aware synthetic data generation, which limits the settings in which the method can currently be applied. However, if new noise-aware methods are developed in the future, the method can immediately be used with them. Condition 7 limits the applicability of our theory to downstream analyses where the prior's influence vanishes as the sample size grows.
This does not hold for some models, such as some infinite-dimensional models, models where the number of parameters increases with the data set size, and models with a support that heavily depends on the parameters. The method also requires congeniality, which essentially requires the analyst's prior to be compatible with the data provider's. Our Gaussian examples in Section 4 show that it is sometimes possible to recover useful inferences even without congeniality. This is not always the case, so an important direction for future research is separating these two cases, and finding out what can be done in the latter case.

6.2 Conclusion

We considered the problem of consistent Bayesian inference using multiple, potentially DP, synthetic data sets, and studied an inference method that mixes the posteriors from multiple large synthetic data sets while re-using existing analysis methods designed for real data. We proved, under congeniality and the general and well-understood regularity conditions of the Bernstein-von Mises theorem, that the method is asymptotically exact as the sizes of the synthetic data sets grow. We studied the method in two examples: non-private Gaussian mean or variance estimation and DP logistic regression. In the former, we were able to use the analytically tractable structure of the setting to derive additional properties of the method, in particular examining what can happen without congeniality. In both settings, we experimentally validated our theory, and showed that the method works in practice. When examining what can go wrong when our assumptions are not met, we found that the method can still give sensible results in some cases, but not all, showing that the method should be applied with care. This greatly expands the understanding of Bayesian inference from synthetic data, filling a major gap in the synthetic data analysis literature.
Acknowledgments

This work was supported by the Research Council of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI and Grant 356499), the Strategic Research Council at the Research Council of Finland (Grant 358247) as well as the European Union (Project 101070617). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them. The authors wish to thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources. We thank Tejas Kulkarni for providing the DP-GLM code.

Appendix A. Additional Background

In this section, we collect the background material we use in the rest of the Appendix.

A.1 Definition of Marginal Query

Let the data X consist of n points $x_1, \dots, x_n$, where each $x_i$ is a d-tuple $(x_{i,1}, \dots, x_{i,d})$, and each $x_{i,j} \in \mathcal{X}_j$, with $\mathcal{X}_j$ finite. Let $q \subseteq \{1, \dots, d\}$ be a set of indices into $x_i$, and let the possible values the elements of $x_i$ indexed by q can jointly take be $v_1, \dots, v_K$. A marginal query for variables q is a vector $s \in \mathbb{N}^K$ with
$$ s_j = \sum_{i=1}^{n} \mathbb{I}_{x_{i,q} = v_j}. $$
If q has k elements, the query is called a k-way marginal query. In other words, a marginal query counts, for given variables, how many data points have each of the possible values for those variables. Note that Räisä et al. (2023) use the term marginal query for a single element of s, and use the term full set of marginal queries for what we call a marginal query.

A.2 Total Variation Distance Properties

Recall the definition of total variation distance:

Definition 1. The total variation distance between random variables (or distributions) $P_1$ and $P_2$ is
$$ \mathrm{TV}(P_1, P_2) = \sup_A \left| \Pr(P_1 \in A) - \Pr(P_2 \in A) \right|, $$
where A is any measurable set.

Lemma 18 (Kelbert, 2023). Properties of total variation distance:

1.
For probability densities $p_1$ and $p_2$,
$$ \mathrm{TV}(p_1, p_2) = \frac{1}{2} \int |p_1(x) - p_2(x)| \,\mathrm{d}x. $$

2. Total variation distance is a metric.

3. Pinsker's inequality: for distributions $P_1$ and $P_2$,
$$ \mathrm{TV}(P_1, P_2) \leq \sqrt{\tfrac{1}{2} \mathrm{KL}(P_1 \,\|\, P_2)}. $$

4. Invariance to bijections: if f is a bijection and $P_1$ and $P_2$ are random variables, $\mathrm{TV}(f(P_1), f(P_2)) = \mathrm{TV}(P_1, P_2)$.

We also occasionally write $\mathrm{TV}(p_1, p_2)$ for probability densities $p_1$ and $p_2$ as
$$ \mathrm{TV}(p_1, p_2) = \sup_h \left| \int h(x) p_1(x) \,\mathrm{d}x - \int h(x) p_2(x) \,\mathrm{d}x \right|, $$
where h is an indicator function of some measurable set.

A.3 Bernstein von Mises Theorem Regularity Conditions

The version of the Bernstein von Mises theorem we use is from van der Vaart (1998). To state the regularity conditions, we need two definitions:

Definition 19. A parametric probability density $p_Q$ is differentiable in quadratic mean at $Q_0$ if there exists a measurable vector-valued function $\dot\ell_{Q_0}$ such that, as $Q \to Q_0$,
$$ \int \left( \sqrt{p_Q(x)} - \sqrt{p_{Q_0}(x)} - \tfrac{1}{2} (Q - Q_0)^T \dot\ell_{Q_0}(x) \sqrt{p_{Q_0}(x)} \right)^2 \mathrm{d}x = o\left( \|Q - Q_0\|_2^2 \right). $$

Definition 20. A randomised test is a function $\varphi\colon \mathcal{X} \to [0, 1]$. The interpretation of $\varphi(X)$ is the probability of rejecting some null hypothesis after observing data X.

Now we can state the regularity conditions of Theorem 2:

Condition 21 (van der Vaart, 1998). For true parameter value $Q_0$ and observed data $X_n$:
1. The datapoints of $X_n$ are i.i.d.
2. The likelihood $p(x|Q)$ for a single datapoint x is differentiable in quadratic mean at $Q_0$.
3. The Fisher information matrix of $p(x|Q)$ is nonsingular at $Q_0$.
4. For every $\beta > 0$, there exists a sequence of randomised tests $\varphi_n$ such that
$$ \mathbb{E}_{p(X_n|Q_0)}\, \varphi_n(X_n) \to 0, \qquad \sup_{\|Q - Q_0\|_2 \geq \beta} \mathbb{E}_{p(X_n|Q)} \left( 1 - \varphi_n(X_n) \right) \to 0. $$
5. The prior $p(Q)$ is absolutely continuous (as a measure) in a neighbourhood of $Q_0$ with a continuous positive density at $Q_0$.

A.4 Bayesian Inference with Gaussian Models

In this section, we collect well-known results on Bayesian inference of a Gaussian mean. See Gelman et al. (2014) for proofs.
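For a finite sample space, the equivalence of Definition 1 and property 1, as well as Pinsker's inequality (property 3), are easy to check numerically; a small sketch of our own:

```python
import numpy as np

# Two distributions on a three-element sample space
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])

tv_density = 0.5 * np.sum(np.abs(p - q))        # property 1, discrete case
# Definition 1: supremum of |P(A) - Q(A)| over all subsets A
masks = [np.array([i, j, k]) for i in (0, 1) for j in (0, 1) for k in (0, 1)]
tv_sup = max(abs(np.sum((p - q) * m)) for m in masks)

kl = np.sum(p * np.log(p / q))                  # KL(P || Q)
pinsker_bound = np.sqrt(0.5 * kl)               # property 3 upper bound
```

Here the supremum over events and the half-L1 formula both give 0.1, which sits below the Pinsker bound of roughly 0.123.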
A.4.1 Scaled Inverse-Chi-Square Distribution

This parameterisation of the inverse gamma distribution is convenient in this setting:
$$ \text{Inv-}\chi^2(\nu, s^2) = \text{Inv-Gamma}\left( \alpha = \frac{\nu}{2},\ \beta = \frac{\nu s^2}{2} \right). $$
If $\theta \sim \text{Inv-}\chi^2(\nu, s^2)$, $\theta > 0$,
$$ \mathbb{E}(\theta) = \frac{\nu}{\nu - 2} s^2, \quad \nu > 2, \qquad \mathrm{Var}(\theta) = \frac{2\nu^2}{(\nu - 2)^2 (\nu - 4)} s^4, \quad \nu > 4. $$

A.4.2 Gaussian Model with Known Variance

When the variance of the data is known to be $\sigma_k^2$, and only the mean is unknown, the conjugate prior is another Gaussian, and we get the following inference problem:
$$ \mu \sim \mathcal{N}(\mu_0, \sigma_0^2), \qquad x_i \mid \mu \sim \mathcal{N}(\mu, \sigma_k^2). $$
The posterior with n datapoints with sample mean $\bar{X}$ is:
$$ \mu \mid X \sim \mathcal{N}(\mu_n, \sigma_n^2), \qquad \mu_n = \sigma_n^2 \left( \frac{1}{\sigma_0^2}\mu_0 + \frac{n}{\sigma_k^2}\bar{X} \right), \qquad \frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma_k^2}. $$

A.4.3 Gaussian Model with Unknown Variance

When the variance of the data is also unknown, the conjugate prior is an inverse-chi-squared distribution for the variance, and a Gaussian for the mean, which gives the following inference problem:
$$ \sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2), \qquad \mu \mid \sigma^2 \sim \mathcal{N}\left( \mu_0, \frac{\sigma^2}{\kappa_0} \right), \qquad x_i \mid \mu, \sigma^2 \sim \mathcal{N}(\mu, \sigma^2). $$
The joint posterior of $\mu$ and $\sigma^2$ for n datapoints is:
$$ \sigma^2 \mid X \sim \text{Inv-}\chi^2(\nu_n, \sigma_n^2), \qquad \mu \mid \sigma^2, X \sim \mathcal{N}\left( \mu_n, \frac{\sigma^2}{\kappa_n} \right), $$
where, with $s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{X})^2$,
$$ \mu_n = \frac{\kappa_0}{\kappa_0 + n}\mu_0 + \frac{n}{\kappa_0 + n}\bar{X}, \qquad \kappa_n = \kappa_0 + n, \qquad \nu_n = \nu_0 + n, $$
$$ \nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + (n - 1) s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\bar{X} - \mu_0)^2. $$
The marginal posterior of $\mu$ is
$$ \mu \mid X \sim t_{\nu_n}\left( \mu_n, \frac{\sigma_n^2}{\kappa_n} \right). $$

Appendix B. Missing Proofs

This section contains the missing proofs from the main text. The proof of our main Theorem 9 is in Appendix B.1, and the proofs of the convergence rate-related Theorems 13 and 26 are in Appendix B.2.

B.1 Consistency Proof

For ease of reference, we repeat Theorem 2 and Condition 7:

Theorem 2 (Bernstein von Mises, van der Vaart, 1998). Let n denote the size of the data set $X_n$. Under the regularity conditions stated in Condition 21 in Appendix A.3, for true parameter value $Q_0$, the posterior $\bar{Q}(X_n) \sim p(Q|X_n)$ satisfies
$$ \mathrm{TV}\left( \sqrt{n}\left( \bar{Q}(X_n) - Q_0 \right),\ \mathcal{N}(\mu(X_n), \Sigma) \right) \overset{P}{\to} 0 $$
as $n \to \infty$ for some $\mu(X_n)$ and $\Sigma$ that do not depend on the prior, where the convergence in probability is over sampling $X_n \sim p(X_n|Q_0)$.

Recall that $\bar{Q}_n^+ \sim p(Q|Z, X_n^*)$ and $\bar{Q}_n \sim p(Q|X_n^*)$.

Condition 7.
For the observed Z and all Q, there exist random variables $D_n$ such that
$$ \mathrm{TV}(\bar{Q}_n^+, D_n) \overset{P}{\to} 0 \quad \text{and} \quad \mathrm{TV}(\bar{Q}_n, D_n) \overset{P}{\to} 0 $$
as $n \to \infty$, where the convergence in probability is over sampling $X_n^* \sim p(X_n^*|Z, Q)$.

Lemma 8. Let the assumptions of Theorem 2 (stated in Condition 21) hold for the downstream analysis for all $Q_0$, and let the following assumptions hold: (1) Z and X are conditionally independent given Q; and (2) $p(Z|Q) > 0$ for all Q. Then Condition 7 holds.

Proof. Under Assumption (1),
$$ p(Q|Z, X_n^*) \propto p(X_n^*|Q)\, p(Z|Q)\, p(Q), $$
so we can view both $p(Q|Z, X_n^*)$ and $p(Q|X_n^*)$ as the posteriors for the same Bayesian inference problem with observed data $X_n^*$, and priors $p(Q|Z) \propto p(Z|Q) p(Q)$ and $p(Q)$, respectively. Due to Condition 21 (5) and Assumption (2), $p(Q|Z)$ has an everywhere positive density. Recall that $\bar{Q}_n^+ \sim p(Q|Z, X_n^*)$ and $\bar{Q}_n \sim p(Q|X_n^*)$. Now, Theorem 2 gives
$$ \mathrm{TV}\left( \sqrt{n}(\bar{Q}_n^+ - Q_0),\ \mathcal{N}(\mu_n, \Sigma) \right) \overset{P}{\to} 0 $$
and
$$ \mathrm{TV}\left( \sqrt{n}(\bar{Q}_n - Q_0),\ \mathcal{N}(\mu_n, \Sigma) \right) \overset{P}{\to} 0 $$
as $n \to \infty$, where $\mu_n$ and $\Sigma$ are equal in the two cases because they do not depend on the prior. The probability is over $X_n^* \sim p(X_n^*|Q_0)$. Because of Assumption (1), $p(X_n^*|Q_0) = p(X_n^*|Z, Q_0)$, so the convergence also holds with probability over $X_n^* \sim p(X_n^*|Z, Q_0)$. These hold for any $Q_0$. Because the function $f_n(q) = \sqrt{n}(q - Q_0)$ is a bijection and total variation distance is invariant to bijections, Condition 7 holds with $D_n$ being the pushforward distribution of $\mathcal{N}(\mu_n, \Sigma)$ under $f_n^{-1}$, with the Q of Condition 7 being $Q_0$. Note that $D_n$ is allowed to depend on Q in Condition 7 due to the order of quantifiers.

Lemma 22. Under Condition 7,
$$ \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) \overset{P}{\to} 0 $$
as $n \to \infty$, with the probability over $X_n^* \sim p(X_n^*|Z)$.

Proof. Total variation distance is a metric, so
$$ \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) \leq \mathrm{TV}(\bar{Q}_n^+, D_n) + \mathrm{TV}(\bar{Q}_n, D_n). $$
Now, by Condition 7,
$$ \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) \overset{P}{\to} 0 \tag{23} $$
as $n \to \infty$, with the probability over $X_n^* \sim p(X_n^*|Z, Q)$. It remains to show (23) with the probability over $X_n^* \sim p(X_n^*|Z)$ instead of $X_n^* \sim p(X_n^*|Z, Q)$. With $X_n^* \sim p(X_n^*|Z)$, for any $\epsilon > 0$,
$$ \Pr_{X_n^*|Z}\left( \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) > \epsilon \right) = \int \Pr_{X_n^*|Z,Q}\left( \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) > \epsilon \right) p(Q|Z) \,\mathrm{d}Q.
$$
Since (23) holds for any Q,
$$ \lim_{n\to\infty} \Pr_{X_n^*|Z,Q}\left( \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) > \epsilon \right) = 0. $$
The dominated convergence theorem then implies that
$$ \lim_{n\to\infty} \Pr_{X_n^*|Z}\left( \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) > \epsilon \right) = 0, $$
so $\mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) \overset{P}{\to} 0$ as $n \to \infty$, with the probability over $X_n^* \sim p(X_n^*|Z)$.

Lemma 23. Let $y_n \sim U_n$ be an arbitrary sequence of random variables with densities $f_{U_n}$ with regards to some measure $\mu$. Let $S(y_n), T(y_n)$ be random variables that depend on $y_n$, with respective density functions $f_{S(y_n)}$ and $f_{T(y_n)}$, with regards to some measure $\nu$. If
$$ \mathrm{TV}(S(y_n), T(y_n)) \overset{P}{\to} 0 $$
as $n \to \infty$, where the probability is over $y_n \sim U_n$, then
$$ \mathrm{TV}\left( \int f_{S(y_n)}(x) f_{U_n}(y_n) \,\mu(\mathrm{d}y_n),\ \int f_{T(y_n)}(x) f_{U_n}(y_n) \,\mu(\mathrm{d}y_n) \right) \to 0. $$

Proof. Let h be an indicator function of some measurable set and let $\epsilon > 0$. By Fubini's theorem,
$$
\begin{aligned}
& \left| \int h(x) \int f_{S(y_n)}(x) f_{U_n}(y_n) \,\mu(\mathrm{d}y_n) \,\nu(\mathrm{d}x) - \int h(x) \int f_{T(y_n)}(x) f_{U_n}(y_n) \,\mu(\mathrm{d}y_n) \,\nu(\mathrm{d}x) \right| \\
&\quad \leq \int f_{U_n}(y_n) \left| \int h(x) f_{S(y_n)}(x) \,\nu(\mathrm{d}x) - \int h(x) f_{T(y_n)}(x) \,\nu(\mathrm{d}x) \right| \mu(\mathrm{d}y_n).
\end{aligned}
$$
Because $\mathrm{TV}(S(y_n), T(y_n)) \overset{P}{\to} 0$, for large enough n there is a set $\mathcal{Y}_n$ with $\mathrm{TV}(S(y_n), T(y_n)) < \epsilon/2$ for all $y_n \in \mathcal{Y}_n$, and $\Pr(y_n \in \mathcal{Y}_n^C) < \epsilon/2$. Since
$$ \left| \int h(x) f_{S(y_n)}(x) \,\nu(\mathrm{d}x) - \int h(x) f_{T(y_n)}(x) \,\nu(\mathrm{d}x) \right| \leq \mathrm{TV}(S(y_n), T(y_n)) \leq 1, $$
we now have
$$ \int f_{U_n}(y_n) \left| \int h(x) f_{S(y_n)}(x) \,\nu(\mathrm{d}x) - \int h(x) f_{T(y_n)}(x) \,\nu(\mathrm{d}x) \right| \mu(\mathrm{d}y_n) \leq \int_{\mathcal{Y}_n} f_{U_n}(y_n) \frac{\epsilon}{2} \,\mu(\mathrm{d}y_n) + \int_{\mathcal{Y}_n^C} f_{U_n}(y_n) \,\mu(\mathrm{d}y_n) \leq \epsilon $$
for large enough n. Taking the supremum over indicator functions h gives
$$ \mathrm{TV}\left( \int f_{S(y_n)}(x) f_{U_n}(y_n) \,\mu(\mathrm{d}y_n),\ \int f_{T(y_n)}(x) f_{U_n}(y_n) \,\mu(\mathrm{d}y_n) \right) \leq \epsilon $$
for any $\epsilon > 0$ with large enough n.

Theorem 9. Under congeniality and Condition 7,
$$ \mathrm{TV}\left( p(Q|Z), p_n(Q) \right) \to 0 $$
as $n \to \infty$.

Proof. The claim follows from Lemma 23 with $y_n = X_n^*$, $U_n = p(X_n^*|Z)$, $S(y_n) \sim p(Q|X_n^*)$ and $T(y_n) \sim p(Q|Z, X_n^*)$. These meet the condition for Lemma 23 due to Lemma 22.

B.2 Convergence Rate

Definition 10.
A sequence of random variables $X_n$ is uniformly integrable if
$$ \lim_{M\to\infty} \sup_n \mathbb{E}\left( |X_n| \mathbb{I}_{|X_n| > M} \right) = 0. $$

Lemma 24. If $|X_n| \leq Y_n$ almost surely and $Y_n$ is uniformly integrable, $X_n$ is uniformly integrable.

Proof.
$$ 0 \leq \lim_{M\to\infty} \sup_n \mathbb{E}\left( |X_n| \mathbb{I}_{|X_n| > M} \right) \leq \lim_{M\to\infty} \sup_n \mathbb{E}\left( Y_n \mathbb{I}_{Y_n > M} \right) = 0. $$

Lemma 25 (Billingsley, 1995, Section 16). If $X_n$ and $Y_n$ are uniformly integrable, $X_n + Y_n$ is uniformly integrable.

Condition 11. For the observed Z, there exist random variables $D_n$ such that for a sequence $R_1, R_2, \dots > 0$ with $R_n \to 0$ as $n \to \infty$,
$$ \frac{1}{R_n} \mathrm{TV}(\bar{Q}_n^+, D_n) \quad \text{and} \quad \frac{1}{R_n} \mathrm{TV}(\bar{Q}_n, D_n) $$
are uniformly integrable when $X_n^* \sim p(X_n^*|Z)$.

Theorem 13. Under congeniality and Condition 11,
$$ \mathrm{TV}\left( p(Q|Z), p_n(Q) \right) = O(R_n). $$

Proof. Total variation distance is a metric, so
$$ \frac{1}{R_n} \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) \leq \frac{1}{R_n} \mathrm{TV}(\bar{Q}_n^+, D_n) + \frac{1}{R_n} \mathrm{TV}(\bar{Q}_n, D_n). $$
Now Condition 11 and Lemmas 24 and 25 imply that $\frac{1}{R_n} \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n)$ is uniformly integrable with $X_n^* \sim p(X_n^*|Z)$. Recall that
$$ \frac{1}{R_n} \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) = \frac{1}{R_n} \sup_h \left| \int h(Q)\, p(Q|Z, X_n^*) \,\mathrm{d}Q - \int h(Q)\, p(Q|X_n^*) \,\mathrm{d}Q \right| $$
and
$$ \frac{1}{R_n} \mathrm{TV}\left( p(Q|Z), p_n(Q) \right) = \frac{1}{R_n} \sup_h \left| \int h(Q) \int p(Q|Z, X_n^*)\, p(X_n^*|Z) \,\mathrm{d}X_n^* \,\mathrm{d}Q - \int h(Q) \int p(Q|X_n^*)\, p(X_n^*|Z) \,\mathrm{d}X_n^* \,\mathrm{d}Q \right|, $$
where h is an indicator function of some measurable set. For any indicator function h, using the start of the proof of Lemma 23 gives
$$ \frac{1}{R_n} \left| \int h(Q) \int p(Q|Z, X_n^*)\, p(X_n^*|Z) \,\mathrm{d}X_n^* \,\mathrm{d}Q - \int h(Q) \int p(Q|X_n^*)\, p(X_n^*|Z) \,\mathrm{d}X_n^* \,\mathrm{d}Q \right| \leq \int p(X_n^*|Z) \frac{1}{R_n} \left| \int h(Q)\, p(Q|Z, X_n^*) \,\mathrm{d}Q - \int h(Q)\, p(Q|X_n^*) \,\mathrm{d}Q \right| \mathrm{d}X_n^*. $$
Because $\frac{1}{R_n} \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n)$ is uniformly integrable when $X_n^* \sim p(X_n^*|Z)$, there exists an M such that for all n,
$$ \int_{\mathcal{Y}_n} p(X_n^*|Z) \frac{1}{R_n} \left| \int h(Q)\, p(Q|Z, X_n^*) \,\mathrm{d}Q - \int h(Q)\, p(Q|X_n^*) \,\mathrm{d}Q \right| \mathrm{d}X_n^* \leq 1, $$
where $\mathcal{Y}_n = \left\{ X_n^* \mid \frac{1}{R_n} \mathrm{TV}(\bar{Q}_n^+, \bar{Q}_n) > M \right\}$. Now, for all n,
$$
\begin{aligned}
& \frac{1}{R_n} \left| \int h(Q) \int p(Q|Z, X_n^*)\, p(X_n^*|Z) \,\mathrm{d}X_n^* \,\mathrm{d}Q - \int h(Q) \int p(Q|X_n^*)\, p(X_n^*|Z) \,\mathrm{d}X_n^* \,\mathrm{d}Q \right| \\
&\quad \leq \int_{\mathcal{Y}_n} p(X_n^*|Z) \frac{1}{R_n} \left| \int h(Q)\, p(Q|Z, X_n^*) \,\mathrm{d}Q - \int h(Q)\, p(Q|X_n^*) \,\mathrm{d}Q \right| \mathrm{d}X_n^* \\
&\qquad + \int_{\mathcal{Y}_n^C} p(X_n^*|Z) \frac{1}{R_n} \left| \int h(Q)\, p(Q|Z, X_n^*) \,\mathrm{d}Q - \int h(Q)\, p(Q|X_n^*) \,\mathrm{d}Q \right| \mathrm{d}X_n^* \\
&\quad \leq 1 + \int_{\mathcal{Y}_n^C} p(X_n^*|Z)\, M \,\mathrm{d}X_n^* \leq 1 + M,
\end{aligned}
$$
so $\mathrm{TV}\left( p(Q|Z), p_n(Q) \right) = O(R_n)$.
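For the conjugate Gaussian mean model of Appendix A.4.2, the rate $R_n = 1/\sqrt{n}$ established for this setting in Theorem 26 can be illustrated numerically: the TV distance between the downstream posterior $\mathcal{N}(\mu_n, \sigma_n^2)$ and the approximating Gaussian $\mathcal{N}(\bar{X}, \sigma_k^2/n)$ stays bounded after scaling by $\sqrt{n}$. A sketch with our own parameter choices:

```python
import numpy as np

def tv_gaussians(m1, s1, m2, s2, num=400001):
    """TV distance between N(m1, s1^2) and N(m2, s2^2), computed by
    numerical integration of 0.5 * |p1 - p2| on a wide grid."""
    lo = min(m1 - 10 * s1, m2 - 10 * s2)
    hi = max(m1 + 10 * s1, m2 + 10 * s2)
    x, dx = np.linspace(lo, hi, num, retstep=True)
    p1 = np.exp(-0.5 * ((x - m1) / s1) ** 2) / (s1 * np.sqrt(2 * np.pi))
    p2 = np.exp(-0.5 * ((x - m2) / s2) ** 2) / (s2 * np.sqrt(2 * np.pi))
    return 0.5 * np.sum(np.abs(p1 - p2)) * dx

# Prior N(0, 1), known data variance 1, fixed sample mean 0.5
mu0, var0, var_k, xbar = 0.0, 1.0, 1.0, 0.5
scaled_tv = []
for n in [100, 1000, 10000]:
    var_n = 1.0 / (1.0 / var0 + n / var_k)           # posterior variance
    mu_n = var_n * (mu0 / var0 + n * xbar / var_k)   # posterior mean
    tv = tv_gaussians(mu_n, np.sqrt(var_n), xbar, np.sqrt(var_k / n))
    scaled_tv.append(np.sqrt(n) * tv)                # should stay bounded
```

The scaled distances remain of roughly constant magnitude as n grows by two orders, consistent with a $O(1/\sqrt{n})$ decay of the unscaled TV distance.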
B.2.1 Convergence Rate in Gaussian Mean Estimation

We start by proving the one-dimensional case.

Theorem 26. When the up- and downstream models are Gaussian mean estimations with known variance, and $D_n = \mathcal{N}(\bar{X}^*, n^{-1}\sigma_k^2)$, where $\bar{X}^*$ is the sample mean of $X_n^*$, then $\sqrt{n}\, \mathrm{TV}(\bar{Q}_n^+, D_n)$ and $\sqrt{n}\, \mathrm{TV}(\bar{Q}_n, D_n)$ are uniformly integrable when $X_n^* \sim p(X_n^*|X)$.

Proof. When the downstream model is Gaussian mean estimation with known variance,
$$ \bar{Q}_n = \mathcal{N}(\mu_n, \sigma_n^2), \qquad \mu_n = \sigma_n^2 \left( \frac{1}{\sigma_0^2}\mu_0 + \frac{n}{\sigma_k^2}\bar{X}^* \right), \qquad \frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma_k^2}. $$
We start with the proof for $\sqrt{n}\, \mathrm{TV}(\bar{Q}_n, D_n)$. By Pinsker's inequality and the formula for the KL divergence between Gaussians,
$$
\begin{aligned}
\sqrt{n}\, \mathrm{TV}(\bar{Q}_n, D_n) &\leq \sqrt{\frac{1}{2} n\, \mathrm{KL}(D_n \,\|\, \bar{Q}_n)} = \sqrt{\frac{1}{4} n \left( \frac{\sigma_k^2}{n\sigma_n^2} + \frac{(\mu_n - \bar{X}^*)^2}{\sigma_n^2} - 1 + \ln \frac{n\sigma_n^2}{\sigma_k^2} \right)} \\
&\leq \sqrt{\frac{1}{4} n \left| \frac{\sigma_k^2}{n\sigma_n^2} - 1 \right|} + \sqrt{\frac{1}{4} n \frac{(\mu_n - \bar{X}^*)^2}{\sigma_n^2}} + \sqrt{\frac{1}{4} n \left| \ln \frac{n\sigma_n^2}{\sigma_k^2} \right|}.
\end{aligned}
$$
The last inequality can be deduced from the fact that the $L^2$-norm is upper bounded by the $L^1$-norm. Denote
$$ C_1(n) = n \left( \frac{\sigma_k^2}{n\sigma_n^2} - 1 \right) = n \left( \frac{\sigma_k^2}{n} \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_k^2} \right) - 1 \right) = \frac{\sigma_k^2}{\sigma_0^2}, $$
which does not depend on n, and
$$ C_2(n) = n \ln \frac{n\sigma_n^2}{\sigma_k^2} = -n \ln \frac{\sigma_k^2}{n\sigma_n^2} = -n \ln \left( \frac{\sigma_k^2}{n\sigma_0^2} + 1 \right) = -\frac{u\sigma_k^2}{\sigma_0^2} \ln \left( \frac{1}{u} + 1 \right) $$
with $n = u\sigma_k^2/\sigma_0^2$. Because
$$ \lim_{u\to\infty} \frac{u\sigma_k^2}{\sigma_0^2} \ln \left( \frac{1}{u} + 1 \right) = \frac{\sigma_k^2}{\sigma_0^2}, $$
$C_2(n)$ is bounded. Furthermore, denoting $s_n = \frac{1}{2}\sqrt{n/\sigma_n^2}$,
$$ \sqrt{\frac{1}{4} n \frac{(\mu_n - \bar{X}^*)^2}{\sigma_n^2}} = s_n |\mu_n - \bar{X}^*|. $$
Note that $s_n = O(n)$. Then
$$ s_n |\mu_n - \bar{X}^*| = s_n \left| \frac{\frac{1}{\sigma_0^2}\mu_0 + \frac{n}{\sigma_k^2}\bar{X}^*}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma_k^2}} - \bar{X}^* \right| = s_n \frac{\frac{1}{\sigma_0^2} |\mu_0 - \bar{X}^*|}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma_k^2}} \leq C_3(n) + C_4(n)|\bar{X}^*|, $$
where
$$ C_3(n) = s_n \frac{\frac{1}{\sigma_0^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma_k^2}} |\mu_0|, \qquad C_4(n) = s_n \frac{\frac{1}{\sigma_0^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma_k^2}}. $$
Because $s_n = O(n)$ and $\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_k^2} \right)^{-1} = O(1/n)$, we have $C_3(n) = O(1)$ and $C_4(n) = O(1)$, so $C_3(n)$ and $C_4(n)$ are bounded. We now have
$$ \sqrt{n}\, \mathrm{TV}(\bar{Q}_n, D_n) \leq \sqrt{\tfrac{1}{4}|C_1(n)|} + \sqrt{\tfrac{1}{4}|C_2(n)|} + C_3(n) + C_4(n)|\bar{X}^*|. $$
By Lemmas 24 and 25, it suffices to show that each of the terms on the right is uniformly integrable. The terms containing $C_1$, $C_2$ and $C_3$ are non-random and bounded in n, so they are uniformly integrable. It remains to show that $C_4(n)|\bar{X}^*|$ is uniformly integrable. $C_4(n)$ is bounded, so we only need to show that $|\bar{X}^*|$ is uniformly integrable.
To bound the expectation in the definition of uniform integrability for $|\bar{X}^*|$, we need some background facts. For geometric series, with $a \in \mathbb{R}$ and $|r| < 1$,
$$ \sum_{i=0}^{\infty} a r^i = \frac{a}{1 - r}, $$
and differentiating both sides with regards to r gives
$$ \sum_{i=0}^{\infty} a(i + 1) r^i = \frac{a}{(1 - r)^2}. $$
For a Gaussian random variable Y with mean $\mu$ and variance $\sigma^2$,
$$ \Pr(Y > \mu + t) \leq e^{-\frac{t^2}{2\sigma^2}}. $$
Given the sampled upstream parameter $\tilde\mu$, we have $\bar{X}^* \sim \mathcal{N}\left( \tilde\mu, \frac{1}{n}\sigma_k^2 \right)$, so this tail bound gives
$$ \Pr(\bar{X}^* > \tilde\mu + t) \leq e^{-\frac{nt^2}{2\sigma_k^2}}, \qquad \Pr(\bar{X}^* < \tilde\mu - t) \leq e^{-\frac{nt^2}{2\sigma_k^2}}. $$
Now
$$
\begin{aligned}
\lim_{M\to\infty} \sup_n \mathbb{E}\left( |\bar{X}^*| \mathbb{I}_{|\bar{X}^*|>M} \right) &= \lim_{M\to\infty} \sup_n \mathbb{E}_{\tilde\mu} \mathbb{E}\left( |\bar{X}^*| \mathbb{I}_{|\bar{X}^*|>M} \mid \tilde\mu \right) \\
&= \lim_{M\to\infty} \sup_n \mathbb{E}_{\tilde\mu} \sum_{i=0}^{\infty} \mathbb{E}\left( |\bar{X}^*| \mathbb{I}_{M+i < |\bar{X}^*| \leq M+i+1} \mid \tilde\mu \right) \\
&\leq \lim_{M\to\infty} \sup_n \mathbb{E}_{\tilde\mu} \sum_{i=0}^{\infty} (M+i+1) \Pr\left( |\bar{X}^*| > M+i \mid \tilde\mu \right) \\
&= \lim_{M\to\infty} \sup_n \mathbb{E}_{\tilde\mu} \sum_{i=0}^{\infty} (M+i+1) \left( \Pr\left( \bar{X}^* > M+i \mid \tilde\mu \right) + \Pr\left( \bar{X}^* < -M-i \mid \tilde\mu \right) \right) \\
&\leq \lim_{M\to\infty} \sup_n \mathbb{E}_{\tilde\mu} \sum_{i=0}^{\infty} (M+i+1) \left( e^{-\frac{n(M+i-\tilde\mu)^2}{2\sigma_k^2}} + e^{-\frac{n(\tilde\mu+M+i)^2}{2\sigma_k^2}} \right) \\
&\leq \lim_{M\to\infty} \mathbb{E}_{\tilde\mu} \sum_{i=0}^{\infty} (M+i+1) \left( e^{-\frac{(M+i-\tilde\mu)^2}{2\sigma_k^2}} + e^{-\frac{(\tilde\mu+M+i)^2}{2\sigma_k^2}} \right),
\end{aligned}
$$
so
$$ \lim_{M\to\infty} \sup_n \mathbb{E}\left( |\bar{X}^*| \mathbb{I}_{|\bar{X}^*|>M} \right) \leq \lim_{M\to\infty} \mathbb{E}_{\tilde\mu} \sum_{i=0}^{\infty} (M+i+1) e^{-\frac{(M+i-\tilde\mu)^2}{2\sigma_k^2}} + \lim_{M\to\infty} \mathbb{E}_{\tilde\mu} \sum_{i=0}^{\infty} (M+i+1) e^{-\frac{(\tilde\mu+M+i)^2}{2\sigma_k^2}}. \tag{24} $$
Looking at the first term on the RHS of (24), when $|M+i-\tilde\mu| \geq 1$, $(M+i-\tilde\mu)^2 \geq M+i-\tilde\mu$. It is possible that $|M+i-\tilde\mu| < 1$ for at most two values of i that depend on $\tilde\mu$. Let $i_{\tilde\mu 1}$ and $i_{\tilde\mu 2}$ be those values. We know that $i_{\tilde\mu j} < 1 + \tilde\mu - M$ for $j \in \{1, 2\}$. Now
$$ \sum_{i=0}^{\infty} (M+i+1) e^{-\frac{(M+i-\tilde\mu)^2}{2\sigma_k^2}} = \sum_{\substack{i=0 \\ i \neq i_{\tilde\mu 1}, i_{\tilde\mu 2}}}^{\infty} (M+i+1) e^{-\frac{(M+i-\tilde\mu)^2}{2\sigma_k^2}} + \sum_{j=1}^{2} (M + i_{\tilde\mu j} + 1) e^{-\frac{(M+i_{\tilde\mu j}-\tilde\mu)^2}{2\sigma_k^2}}. \tag{25} $$
We can upper bound the series using $(M+i-\tilde\mu)^2 \geq M+i-\tilde\mu$ and the formulas for geometric series:
$$
\begin{aligned}
\sum_{\substack{i=0 \\ i \neq i_{\tilde\mu 1}, i_{\tilde\mu 2}}}^{\infty} (M+i+1) e^{-\frac{(M+i-\tilde\mu)^2}{2\sigma_k^2}} &\leq \sum_{i=0}^{\infty} (M+i+1) e^{-\frac{M+i-\tilde\mu}{2\sigma_k^2}} \\
&= \sum_{i=0}^{\infty} M e^{-\frac{M-\tilde\mu}{2\sigma_k^2}} \left( e^{-\frac{1}{2\sigma_k^2}} \right)^i + \sum_{i=0}^{\infty} (i+1) e^{-\frac{M-\tilde\mu}{2\sigma_k^2}} \left( e^{-\frac{1}{2\sigma_k^2}} \right)^i \\
&= e^{-\frac{M-\tilde\mu}{2\sigma_k^2}} \left( \frac{M}{1 - e^{-\frac{1}{2\sigma_k^2}}} + \frac{1}{\left( 1 - e^{-\frac{1}{2\sigma_k^2}} \right)^2} \right).
\end{aligned}
$$
For the expectation, we have
$$ \mathbb{E}_{\tilde\mu}\, e^{-\frac{M-\tilde\mu}{2\sigma_k^2}} = e^{-\frac{M}{2\sigma_k^2}}\, \mathbb{E}_{\tilde\mu}\, e^{\frac{\tilde\mu}{2\sigma_k^2}}.
$$
$\mathbb{E}_{\tilde\mu}\, e^{\frac{\tilde\mu}{2\sigma_k^2}}$ is finite, as it is an evaluation of the moment generating function of $\tilde\mu$, which is Gaussian. Hence
$$ \lim_{M\to\infty} \mathbb{E}_{\tilde\mu}\, e^{-\frac{M-\tilde\mu}{2\sigma_k^2}} \left( \frac{M}{1 - e^{-\frac{1}{2\sigma_k^2}}} + \frac{1}{\left( 1 - e^{-\frac{1}{2\sigma_k^2}} \right)^2} \right) = 0. $$
For the two other terms on the RHS of (25),
$$ (M + i_{\tilde\mu j} + 1) e^{-\frac{(M+i_{\tilde\mu j}-\tilde\mu)^2}{2\sigma_k^2}} \leq (M + 1 + \tilde\mu - M + 1) e^{-\frac{(M+i_{\tilde\mu j}-\tilde\mu)^2}{2\sigma_k^2}} = (\tilde\mu + 2) e^{-\frac{(M+i_{\tilde\mu j}-\tilde\mu)^2}{2\sigma_k^2}} \to 0 $$
in expectation as $M \to \infty$ by the dominated convergence theorem, as
$$ (\tilde\mu + 2) e^{-\frac{(M+i_{\tilde\mu j}-\tilde\mu)^2}{2\sigma_k^2}} \leq (\tilde\mu + 2) e^{-\frac{0}{2\sigma_k^2}}, $$
and the right-hand side has a finite expectation. We have now shown that the first limit on the RHS of (24) is 0. For the other limit, setting $\tilde\mu' = -\tilde\mu$ and using the reasoning above with $\tilde\mu$ replaced by $\tilde\mu'$, we have
$$ \sum_{i=0}^{\infty} (M+i+1) e^{-\frac{(\tilde\mu+M+i)^2}{2\sigma_k^2}} = \sum_{i=0}^{\infty} (M+i+1) e^{-\frac{(M+i-\tilde\mu')^2}{2\sigma_k^2}}, $$
so the RHS of (24) is 0. We have now shown that
$$ \lim_{M\to\infty} \sup_n \mathbb{E}\left( |\bar{X}^*| \mathbb{I}_{|\bar{X}^*|>M} \right) = 0, $$
or, in other words, that $|\bar{X}^*|$ is uniformly integrable. As shown earlier, this concludes the proof that $\sqrt{n}\, \mathrm{TV}(\bar{Q}_n, D_n)$ is uniformly integrable when $X_n^* \sim p(X_n^*|X)$. To show that $\sqrt{n}\, \mathrm{TV}(\bar{Q}_n^+, D_n)$ is uniformly integrable, as in the proof of Lemma 8, we have
$$ p(Q|X, X^*) \propto p(X^*|Q)\, p(X|Q)\, p(Q), $$
so we can view both $p(Q|X, X_n^*)$ and $p(Q|X_n^*)$ as the posteriors for the same Bayesian inference problem with observed data $X^*$, and priors $p(Q|X) \propto p(X|Q) p(Q)$ and $p(Q)$, respectively. $p(Q|X)$ is Gaussian, so the uniform integrability of $\sqrt{n}\, \mathrm{TV}(\bar{Q}_n^+, D_n)$ follows from the previous case with different values for $\mu_0$ and $\sigma_0^2$.

Lemma 27. Let A, B be positive-definite matrices. Then $\|(A + nB)^{-1}\|_F = O(1/n)$, where $\|\cdot\|_F$ is the Frobenius norm.

Proof. It suffices to show that $n\|(A + nB)^{-1}\|_F$ is bounded. Since A and B are positive-definite, the inverse $(A + nB)^{-1}$ exists for all $n \geq 0$. By the continuity of the matrix inverse and the Frobenius norm,
$$ n\left\| (A + nB)^{-1} \right\|_F = \left\| \left( \tfrac{1}{n}A + B \right)^{-1} \right\|_F \to \left\| B^{-1} \right\|_F $$
as $n \to \infty$. The required boundedness follows from continuity.

Theorem 12. When the up- and downstream models are d-dimensional Gaussian mean estimations with known covariance $\Sigma_k$, and $D_n \sim \mathcal{N}(\bar{X}_n^*, n^{-1}\Sigma_k)$, $\sqrt{n}\, \mathrm{TV}(\bar{Q}_n^+, D_n)$ and $\sqrt{n}\, \mathrm{TV}(\bar{Q}_n, D_n)$ are uniformly integrable when $X_n^* \sim p(X_n^*|X)$.
Proof. When the downstream model is Gaussian mean estimation with known covariance,
$$ \bar{Q}_n = \mathcal{N}(\mu_n, \Sigma_n), \qquad \mu_n = \left( \Sigma_0^{-1} + n\Sigma_k^{-1} \right)^{-1} \left( \Sigma_0^{-1}\mu_0 + n\Sigma_k^{-1}\bar{X}_n^* \right), \qquad \Sigma_n^{-1} = \Sigma_0^{-1} + n\Sigma_k^{-1}. $$
We start with the proof for $\sqrt{n}\, \mathrm{TV}(\bar{Q}_n, D_n)$. By Pinsker's inequality and the formula for the KL divergence between Gaussians, we have
$$
\begin{aligned}
\sqrt{n}\, \mathrm{TV}(\bar{Q}_n, D_n) &\leq \sqrt{\frac{1}{2} n\, \mathrm{KL}(D_n \,\|\, \bar{Q}_n)} \\
&= \sqrt{\frac{1}{4} n \left( \mathrm{tr}(n^{-1}\Sigma_n^{-1}\Sigma_k) + (\mu_n - \bar{X}_n^*)^T \Sigma_n^{-1} (\mu_n - \bar{X}_n^*) - d + \ln \frac{\det(\Sigma_n)}{\det(n^{-1}\Sigma_k)} \right)} \\
&\leq \sqrt{\tfrac{1}{4}|C_1(n)|} + \sqrt{\frac{1}{4} n (\mu_n - \bar{X}_n^*)^T \Sigma_n^{-1} (\mu_n - \bar{X}_n^*)} + \sqrt{\tfrac{1}{4}|C_2(n)|},
\end{aligned}
$$
where
$$ C_1(n) = n\left( \mathrm{tr}(n^{-1}\Sigma_n^{-1}\Sigma_k) - d \right), \qquad C_2(n) = n \ln \frac{\det(\Sigma_n)}{\det(n^{-1}\Sigma_k)}. $$
For $C_1(n)$,
$$ C_1(n) = n\left( \mathrm{tr}(n^{-1}\Sigma_n^{-1}\Sigma_k) - d \right) = \mathrm{tr}\left( (\Sigma_0^{-1} + n\Sigma_k^{-1})\Sigma_k \right) - nd = \mathrm{tr}(\Sigma_0^{-1}\Sigma_k) + n\, \mathrm{tr}(I_d) - nd = \mathrm{tr}(\Sigma_0^{-1}\Sigma_k), $$
which does not depend on n, so $C_1(n)$ is bounded. For $C_2(n)$,
$$ C_2(n) = n \ln \frac{\det(\Sigma_n)}{\det(n^{-1}\Sigma_k)} = n \ln \frac{n^d}{\det(\Sigma_k) \det(\Sigma_0^{-1} + n\Sigma_k^{-1})} = -n \ln \frac{\det(\Sigma_k\Sigma_0^{-1} + nI_d)}{n^d}. $$
$\det(\Sigma_k\Sigma_0^{-1} + nI_d)$ is the characteristic polynomial of $\Sigma_k\Sigma_0^{-1}$ with n as the variable, so we have
$$ \det(\Sigma_k\Sigma_0^{-1} + nI_d) = \prod_{j=1}^{d} (a_j + n), $$
where $a_1, \dots, a_d$ are the eigenvalues of $\Sigma_k\Sigma_0^{-1}$. Plugging this into $C_2(n)$,
$$ C_2(n) = -n \ln \prod_{j=1}^{d} \frac{a_j + n}{n} = -\sum_{j=1}^{d} n \ln\left( \frac{a_j}{n} + 1 \right). $$
Substituting $u_j = n/a_j$,
$$ C_2(n) = -\sum_{j=1}^{d} u_j a_j \ln\left( \frac{1}{u_j} + 1 \right). $$
Since $\lim_{u_j\to\infty} \left( \frac{1}{u_j} + 1 \right)^{u_j} = e$,
$$ \lim_{n\to\infty} C_2(n) = -\sum_{j=1}^{d} a_j, $$
so $C_2(n)$ is bounded. For the remaining term, we have
$$ (\mu_n - \bar{X}_n^*)^T \Sigma_n^{-1} (\mu_n - \bar{X}_n^*) = \left( \Sigma_n^{-\frac{1}{2}}(\mu_n - \bar{X}_n^*) \right)^T \left( \Sigma_n^{-\frac{1}{2}}(\mu_n - \bar{X}_n^*) \right) = \left\| \Sigma_n^{-\frac{1}{2}}(\mu_n - \bar{X}_n^*) \right\|_2^2, $$
where $\Sigma_n^{-\frac{1}{2}}$ is the upper triangular part of the Cholesky decomposition of $\Sigma_n^{-1}$. Now
$$ \sqrt{\frac{1}{4} n (\mu_n - \bar{X}_n^*)^T \Sigma_n^{-1} (\mu_n - \bar{X}_n^*)} = \frac{\sqrt{n}}{2} \left\| \Sigma_n^{-\frac{1}{2}}(\mu_n - \bar{X}_n^*) \right\|_2. $$
Using $\|M\|_2$ for a matrix M to denote the spectral 2-norm, we have
$$ \left\| \Sigma_n^{-\frac{1}{2}}(\mu_n - \bar{X}_n^*) \right\|_2 \leq \left\| \Sigma_n^{-\frac{1}{2}} \right\|_2 \|\mu_n - \bar{X}_n^*\|_2, $$
and using $\|\cdot\|_F$ to denote the Frobenius norm, we have
$$ \left\| \Sigma_n^{-\frac{1}{2}} \right\|_2 \leq \left\| \Sigma_n^{-\frac{1}{2}} \right\|_F = \sqrt{\mathrm{tr}(\Sigma_n^{-1})} = \sqrt{\mathrm{tr}(\Sigma_0^{-1}) + n\, \mathrm{tr}(\Sigma_k^{-1})}. $$
Denote $s_n = \frac{\sqrt{n}}{2} \sqrt{\mathrm{tr}(\Sigma_0^{-1}) + n\, \mathrm{tr}(\Sigma_k^{-1})}$. Clearly $s_n = O(n)$.
Next,
$$\mu_n - \bar X_n = (\Sigma_0^{-1} + n\Sigma_k^{-1})^{-1}\Sigma_0^{-1}\mu_0 + (\Sigma_0^{-1} + n\Sigma_k^{-1})^{-1}n\Sigma_k^{-1}\bar X_n - \bar X_n = (\Sigma_0^{-1} + n\Sigma_k^{-1})^{-1}\Sigma_0^{-1}\mu_0 + \left((\Sigma_0^{-1} + n\Sigma_k^{-1})^{-1}n\Sigma_k^{-1} - I_d\right)\bar X_n,$$
and
$$(\Sigma_0^{-1} + n\Sigma_k^{-1})^{-1}n\Sigma_k^{-1} - I_d = (\Sigma_0^{-1} + n\Sigma_k^{-1})^{-1}\left(n\Sigma_k^{-1} - (\Sigma_0^{-1} + n\Sigma_k^{-1})\right) = -(\Sigma_0^{-1} + n\Sigma_k^{-1})^{-1}\Sigma_0^{-1},$$
so, denoting $M_n = (\Sigma_0^{-1} + n\Sigma_k^{-1})^{-1}\Sigma_0^{-1}$,
$$\|\mu_n - \bar X_n\|_2 \le \|M_n\|_F\|\mu_0\|_2 + \|M_n\|_F\|\bar X_n\|_2.$$
Let
$$C_3(n) = s_n\|M_n\|_F\|\mu_0\|_2, \qquad C_4(n) = s_n\|M_n\|_F.$$
By Lemma 27, $\|M_n\|_F = O(\frac{1}{n})$, so $C_3(n)$ and $C_4(n)$ are bounded. We now have
$$\sqrt{n}\,\mathrm{TV}(Q_n, D_n) \le \sqrt{\tfrac14 C_1(n)} + \sqrt{\tfrac14 C_2(n)} + C_3(n) + C_4(n)\|\bar X_n\|_2.$$
By Lemmas 24 and 25, it suffices to show that each of the terms on the right is uniformly integrable. The terms containing $C_1$, $C_2$ and $C_3$ are non-random and bounded in $n$, so they are uniformly integrable. Additionally, $C_4(n)$ is bounded, so we only need to show that $\|\bar X_n\|_2$ is uniformly integrable. We have
$$\|\bar X_n\|_2 \le \sum_{j=1}^{d}|\bar X_{n,j}|, \tag{26}$$
so it suffices to show uniform integrability for the absolute value $|\bar X_{n,j}|$ of a univariate Gaussian, which is done in the proof of Theorem 26.

To show that $\sqrt{n}\,\mathrm{TV}(Q_n^+, D_n)$ is uniformly integrable, as in the proof of Lemma 8, we have $p(Q|X, X^*) \propto p(X^*|Q)p(X|Q)p(Q)$, so we can view both $p(Q|X, X^*_n)$ and $p(Q|X^*_n)$ as the posteriors for the same Bayesian inference problem with observed data $X^*$, and priors $p(Q|X) \propto p(X|Q)p(Q)$ and $p(Q)$, respectively. $p(Q|X)$ is Gaussian, so the uniform integrability of $\sqrt{n}\,\mathrm{TV}(Q_n^+, D_n)$ follows from the previous case with different values for $\mu_0$ and $\Sigma_0$.

B.3 Convergence with Asymptotic Congeniality

Theorem 14. If asymptotic congeniality holds, and Condition 7 holds for all $Z_{n_Z}$ with the probabilities of Condition 7 taken conditional on $I_S$, then
$$\lim_{n_Z\to\infty}\lim_{n_{X^*}\to\infty} \mathrm{TV}\left(p(Q|Z_{n_Z}, I_A),\, \bar p_{n_{X^*}}(Q)\right) = 0. \tag{16}$$

Proof By the triangle inequality of the total variation distance,
$$\mathrm{TV}(p(Q|Z_{n_Z}, I_A),\, \bar p_{n_{X^*}}(Q)) \le \mathrm{TV}(p(Q|Z_{n_Z}, I_A),\, p(Q|Z_{n_Z}, I_S)) + \mathrm{TV}(p(Q|Z_{n_Z}, I_S),\, \bar p_{n_{X^*}}(Q)).$$
Asymptotic congeniality implies
$$\lim_{n_Z\to\infty} \mathrm{TV}(p(Q|Z_{n_Z}, I_A),\, p(Q|Z_{n_Z}, I_S)) = 0.$$
Next, we examine $p(Q|Z_{n_Z}, X^*_{n_{X^*}}, I_S)$ and $p(Q|X^*_{n_{X^*}}, I_A)$. We have
$$\mathrm{TV}(p(Q|Z_{n_Z}, X^*_{n_{X^*}}, I_S),\, p(Q|X^*_{n_{X^*}}, I_A)) \le \mathrm{TV}(p(Q|Z_{n_Z}, X^*_{n_{X^*}}, I_S),\, p(Q|X^*_{n_{X^*}}, I_S)) + \mathrm{TV}(p(Q|X^*_{n_{X^*}}, I_S),\, p(Q|X^*_{n_{X^*}}, I_A)).$$
Asymptotic congeniality implies that
$$\lim_{n_{X^*}\to\infty} \mathrm{TV}(p(Q|X^*_{n_{X^*}}, I_S),\, p(Q|X^*_{n_{X^*}}, I_A)) = 0$$
for all $X^*$, which means that the limit holds in probability under all distributions for $X^*_{n_{X^*}}$. Combining this with Lemma 22, we get
$$\mathrm{TV}(p(Q|Z_{n_Z}, X^*_{n_{X^*}}, I_S),\, p(Q|X^*_{n_{X^*}}, I_A)) \overset{P}{\to} 0$$
as $n_{X^*}\to\infty$, with the probability over $X^*_{n_{X^*}} \sim p(X^*_{n_{X^*}}|Z_{n_Z}, I_S)$. Now we can apply Lemma 23 to $\mathrm{TV}(p(Q|Z_{n_Z}, I_S),\, \bar p_{n_{X^*}}(Q))$, since
$$p(Q|Z_{n_Z}, I_S) = \int p(Q|Z_{n_Z}, X^*_{n_{X^*}}, I_S)\, p(X^*_{n_{X^*}}|Z_{n_Z}, I_S)\, \mathrm{d}X^*_{n_{X^*}},$$
$$\bar p_{n_{X^*}}(Q) = \int p(Q|X^*_{n_{X^*}}, I_A)\, p(X^*_{n_{X^*}}|Z_{n_Z}, I_S)\, \mathrm{d}X^*_{n_{X^*}}.$$
This yields $\mathrm{TV}(p(Q|Z_{n_Z}, I_S),\, \bar p_{n_{X^*}}(Q)) \to 0$ as $n_{X^*}\to\infty$ for all $Z_{n_Z}$, which concludes the proof.

B.4 Non-private Gaussian Example Proofs

Proposition 15. If $\mu^*$ is a sample from $\bar p_n(\mu)$ in Gaussian mean estimation with known variance, $\mu^*$ has a Gaussian distribution, and as $n_{X^*}\to\infty$,
$$\mathbb{E}_{\mu^*}(\mu^*) \to \mu_{n_X}, \qquad \mathrm{Var}_{\mu^*}(\mu^*) \to \sigma^2_{n_X}. \tag{19}$$

Proof We begin by checking where the mean and variance of $\mu^* \sim \bar p_n(\mu)$ converge when $n_{X^*}\to\infty$:
$$\mathbb{E}_{\mu^*}(\mu^*) = \mathbb{E}_{X^*}\left(\mathbb{E}_{\mu^*|X^*}(\mu^*)\right) = \mathbb{E}_{X^*}(\hat\mu_{n_{X^*}}) = \frac{\frac{1}{\hat\sigma_0^2}\hat\mu_0 + \frac{n_{X^*}}{\hat\sigma_k^2}\mathbb{E}_{X^*}(\bar X^*)}{\frac{1}{\hat\sigma_0^2} + \frac{n_{X^*}}{\hat\sigma_k^2}} \to \mathbb{E}_{X^*}(\bar X^*) = \mu_{n_X}$$
as $n_{X^*}\to\infty$. For the variance,
$$\mathrm{Var}_{\mu^*}(\mu^*) = \mathbb{E}_{X^*}\left(\mathrm{Var}_{\mu^*|X^*}(\mu^*)\right) + \mathrm{Var}_{X^*}\left(\mathbb{E}_{\mu^*|X^*}(\mu^*)\right) = \mathbb{E}_{X^*}(\hat\sigma^2_{n_{X^*}}) + \mathrm{Var}_{X^*}(\hat\mu_{n_{X^*}}),$$
where $\mathbb{E}_{X^*}(\hat\sigma^2_{n_{X^*}}) = \hat\sigma^2_{n_{X^*}} \to 0$, and, since $\hat\mu_{n_{X^*}}$ is an affine function of $\bar X^*$,
$$\mathrm{Var}_{X^*}(\hat\mu_{n_{X^*}}) = \left(\frac{\frac{n_{X^*}}{\hat\sigma_k^2}}{\frac{1}{\hat\sigma_0^2} + \frac{n_{X^*}}{\hat\sigma_k^2}}\right)^2 \mathrm{Var}_{X^*}(\bar X^*),$$
with
$$\mathrm{Var}_{X^*}(\bar X^*) = \mathbb{E}_{\tilde\mu}\left(\mathrm{Var}_{X^*|\tilde\mu}(\bar X^*)\right) + \mathrm{Var}_{\tilde\mu}\left(\mathbb{E}_{X^*|\tilde\mu}(\bar X^*)\right) = \frac{1}{n_{X^*}}\mathbb{E}_{\tilde\mu}\left(\mathrm{Var}_{x^*|\tilde\mu}(x^*)\right) + \mathrm{Var}_{\tilde\mu}(\tilde\mu) \to \mathrm{Var}_{\tilde\mu}(\tilde\mu) = \sigma^2_{n_X}$$
as $n_{X^*}\to\infty$, where $x^*$ is a single element of $X^*$. Putting these together,
$$\mathbb{E}_{\mu^*}(\mu^*) \to \mu_{n_X}, \qquad \mathrm{Var}_{\mu^*}(\mu^*) \to \sigma^2_{n_X}$$
as $n_{X^*}\to\infty$. Next, we show that $\mu^*$ also has a Gaussian distribution. In
$$\mu^* \sim \int p(\mu|X^*_n)\, p(X^*_n|X)\, \mathrm{d}X^*_n,$$
both $p(\mu|X^*_n) = \mathcal{N}(\hat\mu_{n_{X^*}}, \hat\sigma^2_{n_{X^*}})$ and $p(X^*_n|X)$ are Gaussian. Since $\hat\mu_{n_{X^*}}$ is a linear function of $\bar X^*_n$, $p(\hat\mu_{n_{X^*}}|X)$ is also Gaussian.
Since $\hat\sigma^2_{n_{X^*}}$ does not depend on $X^*$, $\mu^*$ is the sum of a random variable with distribution $\mathcal{N}(0, \hat\sigma^2_{n_{X^*}})$ and $\hat\mu_{n_{X^*}}$, which is also Gaussian, meaning that $\mu^*$ is Gaussian.

Proposition 16. If $\mu^*$ is a sample from $\bar p_n(\mu)$ in Gaussian mean estimation with the synthetic data generation assuming unknown variance, but the downstream analysis assuming known variance, then as $n_{X^*}\to\infty$,
$$\mathbb{E}_{\mu^*}(\mu^*) \to \mu_{n_X}, \qquad \mathrm{Var}_{\mu^*}(\mu^*) \to \frac{\sigma^2_0}{\kappa_{n_X}}. \tag{20}$$

Proof Checking where the mean and variance of $\mu^* \sim \bar p_n(\mu)$ converge when $n_{X^*}\to\infty$:
$$\mathbb{E}_{\mu^*}(\mu^*) = \mathbb{E}_{X^*}\left(\mathbb{E}_{\mu^*|X^*}(\mu^*)\right) = \mathbb{E}_{X^*}(\hat\mu_{n_{X^*}}) = \frac{\frac{1}{\hat\sigma_0^2}\hat\mu_0 + \frac{n_{X^*}}{\hat\sigma_k^2}\mathbb{E}_{X^*}(\bar X^*)}{\frac{1}{\hat\sigma_0^2} + \frac{n_{X^*}}{\hat\sigma_k^2}} \to \mathbb{E}_{X^*}(\bar X^*) = \mu_{n_X}$$
as $n_{X^*}\to\infty$. For the variance,
$$\mathrm{Var}_{\mu^*}(\mu^*) = \mathbb{E}_{X^*}\left(\mathrm{Var}_{\mu^*|X^*}(\mu^*)\right) + \mathrm{Var}_{X^*}\left(\mathbb{E}_{\mu^*|X^*}(\mu^*)\right) = \mathbb{E}_{X^*}(\hat\sigma^2_{n_{X^*}}) + \mathrm{Var}_{X^*}(\hat\mu_{n_{X^*}}),$$
where $\mathbb{E}_{X^*}(\hat\sigma^2_{n_{X^*}}) \to 0$ and $\mathrm{Var}_{X^*}(\hat\mu_{n_{X^*}}) \to \mathrm{Var}_{X^*}(\bar X^*)$ as before, and now
$$\mathrm{Var}_{X^*}(\bar X^*) = \mathbb{E}_{\tilde\mu,\tilde\sigma^2}\left(\mathrm{Var}_{X^*|\tilde\mu,\tilde\sigma^2}(\bar X^*)\right) + \mathrm{Var}_{\tilde\mu,\tilde\sigma^2}\left(\mathbb{E}_{X^*|\tilde\mu,\tilde\sigma^2}(\bar X^*)\right) = \frac{1}{n_{X^*}}\mathbb{E}_{\tilde\sigma^2}(\tilde\sigma^2) + \mathrm{Var}_{\tilde\mu}(\tilde\mu) \to \mathrm{Var}_{\tilde\mu}(\tilde\mu) = \frac{\sigma^2_0}{\kappa_{n_X}}$$
as $n_{X^*}\to\infty$. Putting these together,
$$\mathbb{E}_{\mu^*}(\mu^*) \to \mu_{n_X}, \qquad \mathrm{Var}_{\mu^*}(\mu^*) \to \frac{\sigma^2_0}{\kappa_{n_X}}.$$

Proposition 17. If $\sigma^{2*}$ is a sample from $\bar p_n(\sigma^2)$ in Gaussian variance estimation with known mean, then as $n_{X^*}\to\infty$,
$$\mathbb{E}_{\sigma^{2*}}(\sigma^{2*}) \to \mathbb{E}_{\tilde\sigma^2}(\tilde\sigma^2) + (\tilde\mu_k - \hat\mu_k)^2. \tag{22}$$

Proof We have
$$\mathbb{E}_{\sigma^{2*}}(\sigma^{2*}) = \mathbb{E}_{X^*}\left(\mathbb{E}_{\sigma^{2*}|X^*}(\sigma^{2*})\right) = \mathbb{E}_{X^*}\left(\frac{\hat\nu_{n_{X^*}}}{\hat\nu_{n_{X^*}} - 2}\hat\sigma^2_{n_{X^*}}\right) = \frac{\hat\nu_0 + n_{X^*}}{\hat\nu_0 + n_{X^*} - 2}\,\mathbb{E}_{X^*}\left(\hat\sigma^2_{n_{X^*}}\right) = \frac{\hat\nu_0 + n_{X^*}}{\hat\nu_0 + n_{X^*} - 2}\cdot\frac{\hat\nu_0\hat\sigma_0^2 + n_{X^*}\mathbb{E}_{X^*}(\hat v)}{\hat\nu_0 + n_{X^*}} = \frac{\hat\nu_0\hat\sigma_0^2 + n_{X^*}\mathbb{E}_{X^*}(\hat v)}{\hat\nu_0 + n_{X^*} - 2},$$
where
$$\mathbb{E}_{X^*}(\hat v) = \frac{1}{n_{X^*}}\sum_{i=1}^{n_{X^*}}\mathbb{E}_{x^*_i}\left((x^*_i - \hat\mu_k)^2\right) = \frac{1}{n_{X^*}}\sum_{i=1}^{n_{X^*}}\mathbb{E}_{x^*_i}\left((x^*_i)^2 - 2x^*_i\hat\mu_k + \hat\mu_k^2\right) = \frac{1}{n_{X^*}}\sum_{i=1}^{n_{X^*}}\mathbb{E}_{x^*_i}\left((x^*_i)^2\right) - 2\tilde\mu_k\hat\mu_k + \hat\mu_k^2$$
$$= \hat\mu_k^2 - 2\tilde\mu_k\hat\mu_k + \frac{1}{n_{X^*}}\sum_{i=1}^{n_{X^*}}\left(\mathbb{E}_{x^*_i}(x^*_i)^2 + \mathrm{Var}_{x^*_i}(x^*_i)\right) = \hat\mu_k^2 - 2\tilde\mu_k\hat\mu_k + \tilde\mu_k^2 + \mathrm{Var}_{x^*_i}(x^*_i) = (\tilde\mu_k - \hat\mu_k)^2 + \mathrm{Var}_{x^*_i}(x^*_i),$$
and
$$\mathrm{Var}_{x^*_i}(x^*_i) = \mathrm{Var}_{\tilde\sigma^2}\left(\mathbb{E}_{x^*_i|\tilde\sigma^2}(x^*_i)\right) + \mathbb{E}_{\tilde\sigma^2}\left(\mathrm{Var}_{x^*_i|\tilde\sigma^2}(x^*_i)\right) = \mathbb{E}_{\tilde\sigma^2}(\tilde\sigma^2).$$
Putting these together,
$$\mathbb{E}_{\sigma^{2*}}(\sigma^{2*}) \to \mathbb{E}_{\tilde\sigma^2}(\tilde\sigma^2) + (\tilde\mu_k - \hat\mu_k)^2$$
as $n_{X^*}\to\infty$.

Appendix C.
Sampling the Exact Posterior in the Toy Data Experiment

In order to sample the exact posterior $p(Q|\tilde s)$, we use another decomposition:
$$p(Q|\tilde s) = \int p(Q|\tilde s, X)\, p(X|\tilde s)\, \mathrm{d}X = \int p(Q|X)\, p(X|\tilde s)\, \mathrm{d}X, \tag{27}$$
where $p(Q|\tilde s, X) = p(Q|X)$ due to the independencies of the graphical model in Figure 1. It remains to sample $p(X|\tilde s)$. This is not tractable in general, but is possible in the toy data setting due to using the 3-way marginal query that covers all possible values of a datapoint, and the simplicity of the toy data. We can decompose
$$p(X|\tilde s) = \iint p(X|s)\, p(s, \theta|\tilde s)\, \mathrm{d}\theta\, \mathrm{d}s = \int p(X|s)\left(\int p(s, \theta|\tilde s)\, \mathrm{d}\theta\right)\mathrm{d}s,$$
so we can sample $(s, \theta) \sim p(s, \theta|\tilde s)$ and then sample $X \sim p(X|s)$ to obtain a sample from $p(X|\tilde s)$. Due to the simplicity of the toy data setting, sampling both $p(s, \theta|\tilde s)$ and $p(X|s)$ is possible.

NAPSU-MQ uses the following Bayesian inference problem:
$$X \sim \mathrm{MED}^n_\theta, \qquad s = a(X), \qquad \tilde s \sim \mathcal{N}(s, \sigma^2_{DP}),$$
where $a$ are the marginal queries, $\sigma^2_{DP}$ is the noise variance of the Gaussian mechanism, and $\mathrm{MED}^n_\theta$ is the maximum entropy distribution (Räisä et al., 2023) with point probability $p(x) = \exp(\theta^T a(x) - \theta_0(\theta))$, where $\theta_0$ is the log-normalising constant. In the toy data setting, $a$ is the 3-way marginal query for all of the 3 variables. In other words, $a(x)$ is the one-hot coding of $x$, so $s = a(X)$ is a vector of counts of how many times each of the 8 possible values is repeated in $X$. This means that sampling $p(X|s)$ is simple:
1. For each possible value of a datapoint, find the corresponding count from $s$, and repeat that datapoint according to that count.
2. Shuffle the datapoints into a random order.
As the downstream analysis $p(Q|X)$ doesn't depend on the order of the datapoints, the second step is not actually needed.

To sample $p(s, \theta|\tilde s)$, we use a Metropolis-within-Gibbs sampler (Gilks et al., 1995) that sequentially updates $s$ and $\theta$ while keeping the other fixed. The proposal for $\theta$ is obtained from Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011).
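The two steps for sampling $p(X|s)$ can be sketched as follows. This is a minimal illustration, not the NAPSU-MQ implementation; the function and variable names (`sample_X_given_s`, the hard-coded example counts) are ours.

```python
import numpy as np

# The 8 possible values of a datapoint with 3 binary variables,
# in the order their counts appear in s.
values = np.array([[b0, b1, b2]
                   for b0 in (0, 1) for b1 in (0, 1) for b2 in (0, 1)])

def sample_X_given_s(s, rng):
    """Sample X ~ p(X|s) when s counts every possible datapoint value.

    Step 1: repeat each possible value according to its count in s.
    Step 2: shuffle the datapoints into a random order.
    """
    X = np.repeat(values, s, axis=0)  # step 1: expand counts into datapoints
    rng.shuffle(X)                    # step 2: permute the rows in place
    return X

rng = np.random.default_rng(0)
s = np.array([3, 0, 1, 2, 0, 0, 1, 1])  # example counts summing to n = 8
X = sample_X_given_s(s, rng)            # X has shape (8, 3)
```

As noted above, the shuffle in step 2 could be skipped, since the downstream analysis is exchangeable in the datapoints.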
The proposal for $s$ is obtained by repeatedly choosing a random index in $s$ to increment and another to decrement. It is possible to obtain negative values in $s$ from this proposal, but those will always be rejected by the acceptance test, as the likelihood for them is 0. To initialise the sampler, we pick an initial value for $\theta$ from a Gaussian distribution, and pick the initial $s$ by rounding $\tilde s$ to integer values, changing the rounded values such that they sum to $n$ while ensuring that all values are non-negative. The step size for the HMC we used is 0.05, and the number of steps is 20. In the $s$ proposal, we repeat the combination of an increment and a decrement 30 times. We take 20000 samples in total from 4 parallel chains, and drop the first 20% as warmup samples.

The method described in this section is similar to the noise-aware Bayesian inference method of Ju et al. (2022). The difference between the two is that Ju et al. (2022) use $X$ instead of $s$ as the auxiliary variable, and they sample the $X$ proposals from the model, changing one datapoint at a time. This makes their algorithm more generalisable.

References

Martín Abadi, Andy Chu, Ian J. Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, October 24-28, 2016, pages 308–318. ACM, 2016.

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631. ACM, 2019.

Borja Balle and Yu-Xiang Wang. Improving the Gaussian Mechanism for Differential Privacy: Analytical Calibration and Optimal Denoising. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 394–403.
PMLR, 2018.

Garrett Bernstein and Daniel Sheldon. Differentially Private Bayesian Inference for Exponential Families. In Advances in Neural Information Processing Systems, volume 31, pages 2924–2934, 2018.

Garrett Bernstein and Daniel Sheldon. Differentially Private Bayesian Linear Regression. In Advances in Neural Information Processing Systems, volume 32, pages 523–533, 2019.

Michael J. Betancourt and Mark Girolami. Hamiltonian Monte Carlo for Hierarchical Models, 2013. arXiv:1312.0906.

Patrick Billingsley. Probability and Measure. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, NY, 3rd edition, 1995.

P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 78(5):1103–1130, 2016.

Peter Bühlmann. Discussion of Big Bayes Stories and Bayes Bag. Statistical Science, 29(1):91–94, 2014.

Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional Gaussians with the same mean, 2022. arXiv:1810.08693.

Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Cynthia Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.

Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our Data, Ourselves: Privacy Via Distributed Noise Generation. In Advances in Cryptology - EUROCRYPT, volume 4004 of Lecture Notes in Computer Science, pages 486–503. Springer, 2006a.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating Noise to Sensitivity in Private Data Analysis.
In Third Theory of Cryptography Conference, volume 3876 of Lecture Notes in Computer Science, pages 265–284. Springer, 2006b.

Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC Texts in Statistical Science Series. Chapman & Hall/CRC, Boca Raton, FL, 2nd edition, 2004.

Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC Texts in Statistical Science Series. CRC Press, Boca Raton, FL, 3rd edition, 2014.

Sahra Ghalebikesabi, Harry Wilde, Jack Jewson, Arnaud Doucet, Sebastian J. Vollmer, and Chris C. Holmes. Mitigating statistical bias within differentially private synthetic data. In Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, volume 180 of Proceedings of Machine Learning Research, pages 696–705. PMLR, 2022.

W. R. Gilks, N. G. Best, and K. K. C. Tan. Adaptive Rejection Metropolis Sampling Within Gibbs Sampling. Journal of the Royal Statistical Society Series C: Applied Statistics, 44(4):455–472, 1995.

C. Hipp and R. Michel. On the Bernstein-v. Mises Approximation of Posterior Distributions. The Annals of Statistics, 4(5):972–980, 1976.

Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

Jonathan H. Huggins and Jeffrey W. Miller. Robust Inference and Model Criticism Using Bagged Posteriors, 2020. arXiv:1912.07104.

Jonathan H. Huggins and Jeffrey W. Miller. Reproducible Model Selection Using Bagged Posteriors. Bayesian Analysis, 18(1):79–104, 2023.

Joonas Jälkö, Onur Dikmen, and Antti Honkela. Differentially Private Variational Inference for Non-conjugate Models. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2017.

Robert I. Jennrich.
Asymptotic Properties of Non-Linear Least Squares Estimators. The Annals of Mathematical Statistics, 40(2):633–643, 1969.

Nianqiao Ju, Jordan Awan, Ruobin Gong, and Vinayak Rao. Data Augmentation MCMC for Bayesian Inference from Privatized Data. In Advances in Neural Information Processing Systems, volume 35, pages 12732–12743, 2022.

Mark Kelbert. Survey of Distances between the Most Popular Distributions. Analytics, 2(1):225–245, 2023.

Ronny Kohavi and Barry Becker. Adult. UCI Machine Learning Repository, 1996.

Tejas Kulkarni, Joonas Jälkö, Antti Koskela, Samuel Kaski, and Antti Honkela. Differentially Private Bayesian Inference for Generalized Linear Models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5838–5849. PMLR, 2021.

Chong K. Liew, Uinam J. Choi, and Chung J. Liew. A data distortion by probability distribution. ACM Transactions on Database Systems, 10(3):395–411, 1985.

Xiao-Li Meng. Multiple-Imputation Inferences with Uncongenial Sources of Input. Statistical Science, 9(4), 1994.

Radford M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo. Chapman & Hall / CRC Press, 2011.

Beata Nowok, Gillian M. Raab, and Chris Dibben. Synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software, 74:1–26, 2016.

Lukas Prediger, Niki Loppi, Samuel Kaski, and Antti Honkela. d3p - A Python Package for Differentially-Private Probabilistic Programming. Proceedings on Privacy Enhancing Technologies, 2022(2):407–425, 2022.

Trivellore E. Raghunathan, Jerome P. Reiter, and Donald B. Rubin. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1):1, 2003.

Ossi Räisä, Joonas Jälkö, Samuel Kaski, and Antti Honkela. Noise-aware statistical inference with differentially private synthetic data.
In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 3620–3643. PMLR, 2023.

Arun Rajkumar and Shivani Agarwal. A differentially private stochastic gradient descent algorithm for multiparty classification. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of JMLR Proceedings, pages 933–941. JMLR.org, 2012.

Jerome P. Reiter. Satisfying disclosure restrictions with synthetic data sets. Journal of Official Statistics, 18(4):531, 2002.

Donald B. Rubin. The Bayesian Bootstrap. The Annals of Statistics, 9(1):130–134, 1981.

Donald B. Rubin. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York, 1987.

Donald B. Rubin. Discussion: Statistical disclosure limitation. Journal of Official Statistics, 9(2):461–468, 1993.

Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pages 245–248. IEEE, 2013.

Boris van Breugel, Zhaozhi Qian, and Mihaela van der Schaar. Synthetic data, real errors: How (not) to publish and use synthetic data. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 34793–34808. PMLR, 2023a.

Boris van Breugel, Hao Sun, Zhaozhi Qian, and Mihaela van der Schaar. Membership Inference Attacks against Synthetic Data through Overfitting Detection. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pages 3493–3514. PMLR, 2023b.

A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, repr. 2000 edition, 1998.

Harrison Wilde, Jack Jewson, Sebastian J. Vollmer, and Chris Holmes.
Foundations of Bayesian Learning from Synthetic Data. In The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 541–549. PMLR, 2021.

Xianchao Xie and Xiao-Li Meng. Dissecting multiple imputation from a multi-phase inference perspective: What happens when God's, imputer's and analyst's models are uncongenial? Statistica Sinica, 2016.