# Bayesian Pseudocoresets

Dionysis Manousakas (Department of Computer Science & Technology, University of Cambridge; dm754@cam.ac.uk), Zuheng Xu (Department of Statistics, University of British Columbia; zuheng.xu@stat.ubc.ca), Cecilia Mascolo (Department of Computer Science & Technology, University of Cambridge; cm542@cam.ac.uk), Trevor Campbell (Department of Statistics, University of British Columbia; trevor@stat.ubc.ca)

*34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.*

## Abstract

Standard Bayesian inference algorithms are prohibitively expensive in the regime of modern large-scale data. Recent work has found that a small, weighted subset of data (a coreset) may be used in place of the full dataset during inference, taking advantage of data redundancy to reduce computational cost. However, this approach has limitations in the increasingly common setting of sensitive, high-dimensional data. Indeed, we prove that there are situations in which the Kullback-Leibler (KL) divergence between the optimal coreset and the true posterior grows with data dimension; and as coresets include a subset of the original data, they cannot be constructed in a manner that preserves individual privacy. We address both of these issues with a single unified solution, Bayesian pseudocoresets: a small weighted collection of synthetic pseudodata, along with a variational optimization method to select both pseudodata and weights. The use of pseudodata (as opposed to the original datapoints) enables both the summarization of high-dimensional data and the differentially private summarization of sensitive data. Real and synthetic experiments on high-dimensional data demonstrate that Bayesian pseudocoresets achieve significant improvements in posterior approximation error compared to traditional coresets, and that pseudocoresets provide privacy without a significant loss in approximation quality.

## 1 Introduction

Large-scale data, which has become the norm in many scientific and commercial applications of statistical machine learning, creates an inherently difficult setting for the modern data analyst. Exploring such data is difficult because it cannot all be obtained and directly visualized at once; one is typically limited to accessing potentially nonrepresentative random subsets of data. Exploring models is similarly hard, as training even a single model can be a computationally expensive, slow, and unreliable process. And as many sources of large-scale data contain sensitive information about individuals (e.g., electronic health records and social network data), these challenges are coupled with growing privacy concerns that preclude direct access to individual datapoints completely.

Large-scale data does offer one reprieve to the analyst: it often exhibits a significant degree of redundancy. Most data are not unique or particularly informative for modelling and exploration. Based on this notion, data summarization methods have been developed that provide the practitioner with a compressed but still statistically representative version of the large dataset for analysis. Summarizations have been developed for a variety of purposes, e.g., reducing the cost of computing with kernel matrices via Nyström-type approximations [11, 32, 3] or sparse pseudo-input parameterizations [37], Bayesian inference [23, 22, 8, 9, 7], maximum likelihood parameter estimation
[12, 30], linear regression [43, 20], geometric shape approximation [2], clustering [16, 29, 4, 6], and dimensionality reduction [18]. A common form of summarization is that of a sparse, weighted subset of the original dataset, a coreset [2]. Coresets have two distinct advantages over other possible summarization modalities: they are easily interpreted, and can often be used as the input to standard data analysis algorithms without modification. But as the dimensionality of a dataset grows, its constituent datapoints tend to become more unique and cannot represent one another well. Indeed, in the context of Bayesian inference (the focus of the present work) we show that the optimal coreset posterior approximation to the true posterior has KL divergence that scales with the dimension of the data in a simple problem setting (Proposition 1). Furthermore, directly releasing a subset of the original data precludes any possibility of individual privacy under the current standard of differential privacy [14, 15]. Past work addresses this issue in the context of clustering and computational geometry [17, 19], while the idea of releasing private dataset compressions has also been pursued in kernel mean embeddings [5], sparse regression [43], and compressive learning [36], with the remarkable property that the privatized compressed data may be queried ad infinitum without loss of privacy; but no such method exists for Bayesian posterior inference.

In this work, we develop a novel technique for data summarization in the context of Bayesian inference under the constraints that the method is scalable and easy to use, creates an intuitive summarization, applies to high-dimensional data, and enables privacy control. Inspired by past work [30, 43, 37], instead of using constituent datapoints, we use synthetic pseudodata to summarize the large dataset, resulting in a pseudocoreset. We show that in the high-dimensional problem setting of Proposition 1, the optimal pseudocoreset with just one pseudodata point recovers the exact posterior, a significant improvement upon the optimal standard coreset of any size. As in past work on Bayesian coresets [7], we formulate pseudocoreset construction as variational inference, and provide a stochastic optimization method. As a consequence of the use of pseudodata, as well as privacy-preserving stochastic gradient descent mechanisms [1, 34, 27], we show that our method can easily be modified to output a privatized pseudocoreset. The paper concludes with experimental results demonstrating the performance of pseudocoresets on real and synthetic data.

## 2 Bayesian Coresets

In this work, the goal is to approximate expectations under a density π(θ), θ ∈ Θ, expressed as the product of N potentials $(f(x_n, \theta))_{n=1}^N$ and a base density π₀(θ):

$$\pi(\theta) := \frac{1}{Z} \exp\left(\sum_{n=1}^{N} f(x_n, \theta)\right) \pi_0(\theta).$$

In the setting of Bayesian inference with conditionally independent data, the potentials are data log-likelihoods, i.e., $f(x_n, \theta) := \log \pi(x_n \mid \theta)$, π₀ is the prior density, π is the posterior, and Z is the marginal likelihood of the data. Rather than working directly with π(θ) for posterior inference (which requires a Θ(N) computation per evaluation), a Bayesian coreset approximation of the form

$$\pi_w(\theta) := \frac{1}{Z(w)} \exp\left(\sum_{n=1}^{N} w_n f(x_n, \theta)\right) \pi_0(\theta), \qquad w \in \mathbb{R}^N, \; w \geq 0,$$

may be used in most popular posterior inference schemes [33, 28, 35]. If the number of nonzero entries $\|w\|_0$ of $w$ is small, this results in a significant reduction in computational burden.
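
To make the computational claim concrete, the sketch below evaluates the (unnormalized) coreset log-density using only the datapoints with nonzero weight. It is a minimal illustration rather than the paper's implementation; the helper names (`loglike`, `logprior`) and the toy Gaussian model are our own assumptions.

```python
import numpy as np

def coreset_log_density(theta, x, w, loglike, logprior):
    """Unnormalized coreset log-density log pi_w(theta) + log Z(w).

    Only datapoints with nonzero weight contribute, so the cost is
    O(||w||_0) per evaluation rather than Theta(N) for the full posterior.
    """
    nz = np.flatnonzero(w)                      # indices of the coreset points
    return logprior(theta) + np.sum(w[nz] * loglike(x[nz], theta))

# Toy Gaussian-mean model (these potentials are assumptions for illustration):
loglike = lambda x, th: -0.5 * np.sum((x - th) ** 2, axis=1)   # N(theta, I) log-lik, up to const
logprior = lambda th: -0.5 * np.sum(th ** 2)                   # N(0, I) log-prior, up to const
x = np.random.randn(1000, 5)
w = np.zeros(1000); w[[3, 17, 256]] = 1000 / 3                 # a size-3 coreset
print(coreset_log_density(np.zeros(5), x, w, loglike, logprior))
```

Since only ‖w‖₀ potentials are evaluated, each density (and gradient) evaluation inside an MCMC or variational routine costs O(‖w‖₀) instead of Θ(N).
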
Recent work has formulated the problem of constructing a Bayesian coreset of size M ≪ N as sparse variational inference [7],

$$w^\star = \operatorname*{arg\,min}_{w \in \mathbb{R}^N} \; D_{\mathrm{KL}}\left(\pi_w \,\|\, \pi\right) \quad \text{s.t.} \quad w \geq 0, \;\; \|w\|_0 \leq M, \tag{1}$$

and showed that the objective can be minimized using stochastic estimates of $\nabla_w D_{\mathrm{KL}}(\pi_w \| \pi)$ based on samples from the coreset posterior $\pi_w$.

### 2.1 High-dimensional data

Coresets, as formulated in Eq. (1), are limited to using the original datapoints themselves to summarize the whole dataset. Proposition 1 shows that this is problematic when summarizing high-dimensional data; in the common setting of posterior inference for a Gaussian mean, the KL divergence $D_{\mathrm{KL}}(\pi_{w^\star} \| \pi)$ of the optimal coreset of any size scales with the dimension of the data. The proof may be found in Supp. A.

Figure 1: Gaussian mean inference under pseudocoreset (PSVI) against standard coreset (Sparse VI) summarization for N = 1,000 datapoints. (a) Progression of PSVI vs. Sparse VI construction for coreset sizes M = 0, 1, 5, 12, 30, 100, in 500 dimensions (displayed are datapoint projections on 2 random dimensions). PSVI and Sparse VI coreset predictive 3σ ellipses are displayed in red and blue respectively, while the true posterior 3σ ellipse is shown in black. PSVI has the ability to immediately move pseudopoints towards the true posterior mean, while Sparse VI has to add a larger number of existing points in order to obtain a good posterior approximation. See Fig. 2b for the quantitative KL comparison. (b) Optimal coreset KL divergence lower bound from Proposition 1 as a function of dimension with δ = 0.5, and coreset size M evenly spaced from 0 to 100 in increments of 5.

**Proposition 1.** *Suppose we use $(X_n)_{n=1}^N \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I)$ in $\mathbb{R}^d$ to perform posterior inference in a Bayesian model with prior $\mu \sim \mathcal{N}(0, I)$ and likelihood $(X_n)_{n=1}^N \mid \mu \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, I)$. Then for all $M < d$ and $\delta \in [0, 1]$, with probability at least $1 - \delta$ the optimal size-M coreset $w^\star$ satisfies*

$$D_{\mathrm{KL}}\left(\pi_{w^\star} \,\|\, \pi\right) \;\geq\; \frac{1}{2}\,\frac{N}{1+N}\, F^{-1}_{d-M}(\delta),$$

*where $F_k$ is the CDF of a $\chi^2$ random variable with k degrees of freedom.*

The bound in Proposition 1 depends on d through the χ² distribution inverse CDF. Although difficult to see directly, the bound is reasonably large for typical values of N, M, d, δ, and increases linearly in d; Fig. 1b visualizes the value of the lower bound as a function of dimension d for various coreset sizes M. Note that the above bound requires the data to be high-dimensional such that d > M; if d ≤ M, the proof technique in Supp. A results in a vacuous $D_{\mathrm{KL}}(\pi_{w^\star} \| \pi) \geq 0$ lower bound.

## 3 Bayesian Pseudocoresets

Proposition 1 shows that there is room for improvement in coreset construction in the high-dimensional data regime. Indeed, consider again the same problem setting; the coreset posterior distribution is a Gaussian with mean $\mu_w$ and covariance $\Sigma_w$,

$$\Sigma_w = \left(1 + \sum_{n=1}^N w_n\right)^{-1} I, \qquad \mu_w = \Sigma_w \sum_{n=1}^N w_n X_n. \tag{2}$$

Examining Eq. (2), we can replicate any coreset posterior exactly by using a single synthetic pseudodata point $U = \left(\sum_{n=1}^N w_n\right)^{-1} \sum_{n=1}^N w_n X_n$ with weight $\sum_{n=1}^N w_n$. In particular, the true posterior is equivalent to the posterior conditioned on the single pseudodata point $U = \frac{1}{N}\sum_{n=1}^N X_n$ with weight N (with corresponding KL divergence equal to 0). This is not surprising; the mean of the data is precisely a sufficient statistic for the data in this simple setting. However, it does illustrate that carefully chosen pseudodata may be able to represent the overall dataset, as approximate sufficient statistics, far better than any reasonably small collection of the original data.
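
The argument above can be checked numerically. The sketch below, a minimal illustration under the stated model (prior $\mathcal{N}(0, I)$, likelihood $\mathcal{N}(\mu, I)$) with helper names of our own choosing, builds the exact posterior from all N points and the posterior induced by a single pseudodata point placed at the data mean with weight N; their KL divergence is zero up to round-off.

```python
import numpy as np

def weighted_gaussian_posterior(points, weights):
    """Posterior N(mu, s2*I) for a Gaussian mean with prior N(0, I) and
    likelihood N(mu, I), given (pseudo)data `points` with `weights`."""
    W = weights.sum()
    s2 = 1.0 / (1.0 + W)
    mu = s2 * (weights[:, None] * points).sum(axis=0)
    return mu, s2

def kl_isotropic_gaussians(mu_a, s2_a, mu_b, s2_b, d):
    """KL( N(mu_a, s2_a*I) || N(mu_b, s2_b*I) ) in d dimensions."""
    return 0.5 * (d * s2_a / s2_b - d + d * np.log(s2_b / s2_a)
                  + np.sum((mu_a - mu_b) ** 2) / s2_b)

N, d = 1000, 500
X = np.random.randn(N, d)

# Exact posterior conditioned on all N points (unit weights).
mu_full, s2_full = weighted_gaussian_posterior(X, np.ones(N))

# A single pseudodata point at the data mean, with weight N,
# reproduces the exact posterior (KL divergence ~0).
U = X.mean(axis=0, keepdims=True)
mu_pc, s2_pc = weighted_gaussian_posterior(U, np.array([float(N)]))
print(kl_isotropic_gaussians(mu_pc, s2_pc, mu_full, s2_full, d))  # ~0
```
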
This intuition has been used before, e.g., for scalable Gaussian process inference [37, 38], privacy-preserving compression in linear regression [43], herding [42, 10, 25], and deep generative models [39]. In this section, we extend the realm of applicability of pseudopoint compression methods to the general class of Bayesian posterior inference problems with conditionally independent data, resulting in Bayesian pseudocoresets. Building on recent work [7], we formulate pseudocoreset construction as a variational inference problem where both the weights and pseudopoint locations are parameters of the variational posterior approximation, and develop a stochastic algorithm to solve the optimization.

### 3.1 Pseudocoreset variational inference

A Bayesian pseudocoreset takes the form

$$\pi_{u,w}(\theta) = \frac{1}{Z(u, w)} \exp\left(\sum_{m=1}^{M} w_m f(u_m, \theta)\right) \pi_0(\theta),$$

where $u := (u_m)_{m=1}^M$ are M pseudodata points $u_m \in \mathbb{R}^d$, $(w_m)_{m=1}^M$ are nonnegative weights, $f : \mathbb{R}^d \times \Theta \to \mathbb{R}$ is a potential function parametrized by a pseudodata point, and $Z(u, w)$ is the corresponding normalization constant rendering $\pi_{u,w}$ a probability density. In the setting of Bayesian posterior inference, $u_m$ will take the same form as the data, while the potentials are the log-likelihood functions, i.e., $f(u_m, \theta) = \log \pi(u_m \mid \theta)$. We construct a coreset by minimizing the KL divergence over both the pseudodata locations and weights,

$$u^\star, w^\star = \operatorname*{arg\,min}_{u \in \mathbb{R}^{d \times M},\, w \in \mathbb{R}^M_+} D_{\mathrm{KL}}\left(\pi_{u,w} \,\|\, \pi\right). \tag{3}$$

As opposed to previous Bayesian coreset construction optimization problems [8, 9, 7], we do not need an explicit sparsity constraint; the coreset size is limited to M directly through the selection of the number of pseudodata and weights.

Denote the vectors of original data potentials $f(\theta) \in \mathbb{R}^N$ and synthetic pseudodata potentials $\tilde f(\theta) \in \mathbb{R}^M$ as $f(\theta) := [f_1(\theta) \, \ldots \, f_N(\theta)]^T$ and $\tilde f(\theta) := [f(u_1, \theta) \, \ldots \, f(u_M, \theta)]^T$ respectively, where we suppress the $(\theta)$ for brevity where clear from context. Denote $\mathbb{E}_{u,w}$ and $\mathrm{Cov}_{u,w}$ to be the expectation and covariance operators for the pseudocoreset posterior $\pi_{u,w}$. Then we may write the KL divergence in Eq. (3) as

$$D_{\mathrm{KL}}\left(\pi_{u,w} \,\|\, \pi\right) = \mathbb{E}_{u,w}[\log \pi_{u,w}(\theta)] - \mathbb{E}_{u,w}[\log \pi(\theta)] = \log Z(1) - \log Z(u, w) - 1^T \mathbb{E}_{u,w}[f] + w^T \mathbb{E}_{u,w}[\tilde f], \tag{4}$$

where $1 \in \mathbb{R}^N$ is the vector of all-ones and $w \in \mathbb{R}^M$ is the vector of pseudocoreset weights. As we will employ gradient descent steps as part of our algorithm to minimize the variational objective over the parameters u, w, we need to evaluate the derivative of the KL divergence Eq. (4). Despite the presence of the intractable normalization constants and expectations, we show in Supp. B that gradients can be expressed using moments of the pseudodata and original data potential vectors. In particular, the gradients of the KL divergence with respect to the weights w and to a single pseudodata location $u_m$ are

$$\nabla_w D_{\mathrm{KL}} = -\mathrm{Cov}_{u,w}\left[\tilde f,\; f^T 1 - \tilde f^T w\right], \qquad \nabla_{u_m} D_{\mathrm{KL}} = -w_m\, \mathrm{Cov}_{u,w}\left[h(u_m),\; f^T 1 - \tilde f^T w\right], \tag{5}$$

where $h(\cdot, \theta) := \nabla_u f(\cdot, \theta)$, and the θ argument is again suppressed for brevity.

### 3.2 Stochastic optimization

The gradients in Eq. (5) involve expectations of (gradient) log-likelihoods from the model. Although there are a few particular Bayesian models where these can be evaluated in closed form (e.g., the synthetic experiment in Section 4; see also Supp. C.1), this is not usually the case. In order to make the proposed pseudocoreset method broadly applicable, in this section we develop a black-box stochastic optimization scheme (Alg. 1) for Eq. (3).
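
Before turning to the full algorithm, the sketch below shows how the covariance-form gradients in Eq. (5) can be estimated from S posterior samples. The array layout and function name are our own assumptions, and in Alg. 1 the full-data potential sum is further replaced by an N/B-scaled minibatch sum.

```python
import numpy as np

def psvi_gradients(F, Ft, H, w):
    """Monte Carlo estimates of the Eq. (5) gradients from S posterior samples.

    F  : (S, N) array, F[s, n]  = f(x_n, theta_s)          (data potentials)
    Ft : (S, M) array, Ft[s, m] = f(u_m, theta_s)          (pseudodata potentials)
    H  : (S, M, d) array, H[s, m] = grad_u f(u_m, theta_s) (pseudodata potential grads)
    w  : (M,) pseudocoreset weights
    """
    S = F.shape[0]
    # Center across samples: these are the ingredients of sample covariances.
    Fc, Ftc = F - F.mean(0), Ft - Ft.mean(0)
    Hc = H - H.mean(0)
    # Centered residual potential r_s = f_s^T 1 - ftilde_s^T w.
    r = Fc.sum(axis=1) - Ftc @ w                                  # (S,)
    grad_w = -(Ftc * r[:, None]).sum(0) / S                       # (M,)   -Cov[ftilde, r]
    grad_u = -w[:, None] * (Hc * r[:, None, None]).sum(0) / S     # (M, d) -w_m Cov[h(u_m), r]
    return grad_w, grad_u
```
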
To initialize the pseudocoreset, we subsample M datapoints from the large dataset and reweight them to match the overall weight of the full dataset,

$$u_m \leftarrow x_{b_m}, \quad w_m \leftarrow N/M, \quad m = 1, \ldots, M, \qquad \mathcal{B} \sim \mathrm{UnifSubset}([N], M), \; \mathcal{B} := \{b_1, \ldots, b_M\}.$$

After initializing the pseudodata locations and weights, we simultaneously optimize Eq. (3) over both. Each optimization iteration t ∈ {1, ..., T} consists of a stochastic gradient descent step with a learning rate $\gamma_t \propto t^{-1}$,

$$w_m \leftarrow \max\left(0,\; w_m - \gamma_t (\hat\nabla_w)_m\right), \qquad u_m \leftarrow u_m - \gamma_t \hat\nabla_{u_m}, \qquad 1 \leq m \leq M.$$

```
Algorithm 1  Pseudocoreset Variational Inference
 1: procedure PSVI(f(·,·), π0, x, M, B, S, T, (γt)_{t=1}^T)
      ▷ Initialize the pseudocoreset using a uniformly chosen subset of the full dataset
 2:   N ← number of datapoints in x;  B ∼ UnifSubset([N], M),  B := {b1, ..., bM}
 3:   um ← x_{bm},  wm ← N/M,  m = 1, ..., M
 4:   for t = 1, ..., T do              ▷ Take S samples from the current pseudocoreset posterior
 5:     (θs)_{s=1}^S  i.i.d.∼  πu,w,  where  πu,w(θ) ∝ exp(Σ_{m=1}^M wm f(um, θ)) π0(θ)
 6:     B ∼ UnifSubset([N], B)          ▷ Obtain a minibatch of B datapoints from the full dataset
 7:     for s = 1, ..., S do            ▷ Compute (gradient) log-likelihood discretizations
 8:       gs ← f(θs)|_B − (1/S) Σ_{s'=1}^S f(θs')|_B
 9:       g̃s ← f̃(θs) − (1/S) Σ_{s'=1}^S f̃(θs')                                ∈ R^M
10:       for m = 1, ..., M do
11:         h_{m,s} ← ∇u f(um, θs) − (1/S) Σ_{s'=1}^S ∇u f(um, θs')            ∈ R^d
12:     ∇̂w ← −(1/S) Σ_{s=1}^S g̃s ((N/B) gs^T 1 − g̃s^T w)    ▷ Compute Monte Carlo gradients for w
13:     for m = 1, ..., M do                                  ▷ and for (um)_{m=1}^M
14:       ∇̂_{um} ← −wm (1/S) Σ_{s=1}^S h_{m,s} ((N/B) gs^T 1 − g̃s^T w)
15:     w ← max(w − γt ∇̂w, 0)                                ▷ Take stochastic gradient step in w
16:     for m = 1, ..., M do                                  ▷ and in (um)_{m=1}^M
17:       um ← um − γt ∇̂_{um}
18:   return w, (um)_{m=1}^M
```

The stochastic gradient estimates $\hat\nabla_w \in \mathbb{R}^M$ and $\hat\nabla_{u_m} \in \mathbb{R}^d$ are based on $S \ll N$ samples $\theta_s \overset{\text{i.i.d.}}{\sim} \pi_{u,w}$ from the coreset approximation and a minibatch of $B \ll N$ datapoints from the full dataset,

$$\hat\nabla_w := -\frac{1}{S}\sum_{s=1}^S \tilde g_s\left(\frac{N}{B}\, g_s^T 1 - \tilde g_s^T w\right), \qquad \hat\nabla_{u_m} := -w_m \frac{1}{S}\sum_{s=1}^S h_{m,s}\left(\frac{N}{B}\, g_s^T 1 - \tilde g_s^T w\right),$$

where

$$h_{m,s} := \nabla_u f(u_m, \theta_s) - \frac{1}{S}\sum_{s'=1}^S \nabla_u f(u_m, \theta_{s'}), \qquad \tilde g_s := \tilde f(\theta_s) - \frac{1}{S}\sum_{s'=1}^S \tilde f(\theta_{s'}), \qquad g_s := f(\theta_s)\big|_{\mathcal{B}} - \frac{1}{S}\sum_{s'=1}^S f(\theta_{s'})\big|_{\mathcal{B}},$$

$\mathcal{B} \sim \mathrm{UnifSubset}([N], B)$, and $(\cdot)|_{\mathcal{B}}$ denotes restriction of a vector to only those indices in $\mathcal{B} \subseteq [N]$. Crucially, note that this computation does not scale with N, but rather with the number of coreset points M, the sample and minibatch sizes S and B, and the dimension d. Obtaining $\theta_s \overset{\text{i.i.d.}}{\sim} \pi_{u,w}$ efficiently via Markov chain Monte Carlo sampling algorithms [21, 26] is (roughly) O(M) per sample, because the coreset is always of size M; and we need not compute the entire vector $g_s \in \mathbb{R}^N$ per sample s, but rather only those $B \ll N$ indices in the minibatch $\mathcal{B}$, resulting in a cost of O(B). Aside from that, all computations involving $\tilde g_s \in \mathbb{R}^M$ and $h_{m,s} \in \mathbb{R}^d$ are at most O(Md). Each of these computations is repeated S times over the coreset posterior samples.

### 3.3 Differentially Private Scheme

Beyond better summarizations of high-dimensional data, pseudocoresets enable the generation of a data summarization that ensures the statistical privacy of individual datapoints under the model of (approximate) differential privacy. In this setting, a trusted curator holds an aggregate dataset of N datapoints, $x \in \mathcal{X}^N$, $\mathcal{X} \subseteq \mathbb{R}^d$, and builds and releases a pseudocoreset $(u, w)$, $u \in \mathcal{X}^M$, $w \in \mathbb{R}^M_+$, via a randomized mechanism satisfying Definition 2 [14, 13].

**Definition 2** ((ε, δ)-Differentially Private Coreset). *Fix ε ≥ 0, δ ∈ [0, 1]. A pseudocoreset construction algorithm $\mathcal{M} : \mathcal{X}^N \to \mathbb{R}^M_+ \times \mathcal{X}^M$ is (ε, δ)-differentially private if for every pair of adjacent datasets $x \simeq x'$ and all events $A \subseteq \mathbb{R}^M_+ \times \mathcal{X}^M$,*

$$\mathbb{P}[\mathcal{M}(x) \in A] \leq e^{\varepsilon}\, \mathbb{P}[\mathcal{M}(x') \in A] + \delta.$$

We consider two datasets $x, x'$ as adjacent (denoted $x \simeq x'$) if $x'$ can be obtained from $x$ by adding or removing an element.
ε controls the effect that removal or addition of an element can have on the output distribution of $\mathcal{M}$, while δ captures the failure probability, and is preferably o(1/N). In this section, we develop a differentially private version of pseudocoreset construction. Beyond modifying our initialization scheme, private pseudocoreset construction comes as a natural extension of Alg. 1, obtained by replacing the gradient computations that involve points of the true dataset with their differentially private counterparts.

**Pseudodata point initialization.** In the standard (nonprivate) pseudocoreset construction (Alg. 1), pseudopoints are initialized from the dataset itself, incurring a privacy penalty. In differentially private pseudocoreset construction, we simply initialize pseudopoints by generating synthetic data from the statistical model, at no privacy cost.

**Optimization.** Examining the optimization loop of Alg. 1 (lines 4–18), the only steps that involve handling the original data occur at lines 8, 12, and 14, when we use the minibatch subsample to compute log-likelihoods and gradients. Due to the post-processing property of differential privacy [15], all of the other computations in Alg. 1 (e.g., sampling from the pseudocoreset posterior, computing pseudopoint log-likelihoods, etc.) incur no privacy cost. Therefore, we need only control the influence of private data entering the gradient computation through the vector of $(g_s^T 1)_{s=1}^S$ terms. To accomplish this we apply repeated instances of the subsampled Gaussian mechanism, since this also allows us to use a moments accountant technique to keep tight estimates of the privacy parameters [1, 41]. As in the nonprivate scheme, in each optimization step we uniformly subsample a minibatch $\mathcal{B} = \{x_1, \ldots, x_B\}$ of private datapoints. We then replace the $g_s^T 1$ term in lines 12 and 14 with a randomized privatization:

$$\text{replace } (g_s^T 1)_{s=1}^S \text{ with } \sum_{x_i \in \mathcal{B}} \frac{G_i}{\max\left(1, \frac{\|G_i\|_2}{C}\right)} + Z, \qquad Z \sim \mathcal{N}(0, \sigma^2 C^2 I), \tag{6}$$

where $G_i := \left(f(x_i, \theta_s) - \frac{1}{S}\sum_{s'=1}^S f(x_i, \theta_{s'})\right)_{s=1}^S \in \mathbb{R}^S$ for $x_i \in \mathcal{B}$, and $C, \sigma > 0$ are parameters controlling the amount of privacy. This modification to Alg. 1 has been shown in past work to obtain the privacy guarantee provided in Corollary 3; crucially, the privacy cost of our construction is independent of the pseudocoreset size. It also does not introduce any significant amount of additional computation. No sensitivity computation for privatization noise calibration is required, as boundedness is enforced via clipping in Eq. (6). Finally, only a manageable number of privacy-specific hyperparameters is introduced: the clipping bound C and the noise level σ.

**Corollary 3** ([1]). *There exist constants $c_1, c_2$ such that Alg. 1, modified per Eq. (6), is (ε, δ)-differentially private for any $\varepsilon < c_1 q^2 T$, $\delta > 0$, and $\sigma \geq c_2 q \sqrt{T \log(1/\delta)}/\varepsilon$, where $q := B/N$ is the fraction of data in a minibatch and T is the number of optimization steps.*

## 4 Experimental Results

In this section, we evaluate the posterior approximation quality achieved by pseudocoreset sparse VI (PSVI) compared against uniform random subsampling (Uniform), Hilbert coresets (GIGA [8]), and Sparse VI greedy coreset construction [7]. For the black-box constructions of Sparse VI and PSVI we used S = 100 Monte Carlo samples per gradient estimation.
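
These S Monte Carlo samples per gradient step are drawn from the current pseudocoreset posterior $\pi_{u,w}$ (Alg. 1, line 5), for which the paper relies on MCMC samplers [21, 26]. As a rough stand-in, the sketch below uses a simple random-walk Metropolis chain; the function and tuning parameters (`loglike`, `logprior`, `step`, `n_steps`) are our own assumptions, and a tuned sampler such as NUTS [21] would be preferable in practice.

```python
import numpy as np

def sample_pseudocoreset_posterior(u, w, loglike, logprior, d, S,
                                   n_steps=2000, step=0.1, seed=None):
    """Draw S approximate samples from pi_{u,w} with random-walk Metropolis."""
    rng = np.random.default_rng(seed)
    # Target density involves only the M weighted pseudodata potentials plus the prior.
    logp = lambda th: logprior(th) + np.sum(w * loglike(u, th))
    theta = np.zeros(d)
    cur = logp(theta)
    draws = []
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal(d)   # Gaussian random-walk proposal
        new = logp(prop)
        if np.log(rng.random()) < new - cur:           # Metropolis accept/reject
            theta, cur = prop, new
        draws.append(theta.copy())
    # Keep S draws from the tail of the chain (crude burn-in/thinning).
    idx = np.linspace(n_steps // 2, n_steps - 1, S).astype(int)
    return np.asarray(draws)[idx]
```

Each log-density evaluation touches only the M weighted pseudopoints and the prior, so the per-sample cost does not grow with N, consistent with the O(M)-per-sample discussion in Section 3.2.
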
For GIGA we used a 100-dimensional random projection from a Gaussian approximate posterior π̂ with two choices of mean and covariance: one set to the exact posterior (Optimal), which is not tractable to obtain in practice and forms an optimistic estimate of achievable approximation quality; and one with mean and covariance set to a random point on the interpolant between the prior and the exact posterior point estimates, subsequently corrupted with 75% additive relative noise (Realistic). Notably, Hilbert coresets and Sparse VI develop incremental schemes for construction, while PSVI relies on batch optimization with random initialization (Alg. 1) and does not use any information from pseudocoresets of smaller size. An incremental scheme for PSVI is included in Supp. C. Code for the presented experiments is available at https://github.com/trevorcampbell/pseudocoresets-experiments.

**Gaussian mean inference.** We first evaluate the performance of PSVI on a synthetic dataset of $N = 10^3$ datapoints, where we aim to infer the mean $\theta \sim \mathcal{N}(\mu_0, \Sigma_0)$ of a d-dimensional Gaussian conditioned on Gaussian observations $(X_n)_{n=1}^N \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, \Sigma)$. In this example, the exact pseudocoreset posterior for any set of weights $(w_m)_{m=1}^M$ and pseudopoint locations $(u_m)_{m=1}^M$ is available in closed form:

$$\Sigma_{u,w} = \left(\Sigma_0^{-1} + \sum_{m=1}^M w_m \Sigma^{-1}\right)^{-1}, \qquad \mu_{u,w} = \Sigma_{u,w}\left(\Sigma_0^{-1}\mu_0 + \Sigma^{-1}\sum_{m=1}^M w_m u_m\right).$$

Figure 2: Comparison of (pseudo)coreset approximate posterior quality for experiments on synthetic datasets over 10 trials: (a) Gaussian mean inference, d = 200; (b) Gaussian mean inference, d = 500; (c) Bayesian linear regression, d = 100. Solid lines display the median KL divergence, with shaded areas showing the 25th and 75th percentiles of KL divergence. In Fig. 2c, KL divergence is normalized by the prior.

Using the exact posterior, we derive the exact moments used in the gradient formulae from Eq. (5) in closed form (see Supp. C.1),

$$\mathrm{Cov}_{u,w}[f_n, \tilde f_m] = v_n^T \Psi \tilde v_m + \tfrac{1}{2}\mathrm{tr}\,\Psi^T\Psi, \qquad \mathrm{Cov}_{u,w}[\tilde f_n, \tilde f_m] = \tilde v_n^T \Psi \tilde v_m + \tfrac{1}{2}\mathrm{tr}\,\Psi^T\Psi,$$
$$\mathrm{Cov}_{u,w}[h(u_i), f_n] = Q^{-T}\Psi v_n, \qquad \mathrm{Cov}_{u,w}[h(u_i), \tilde f_n] = Q^{-T}\Psi \tilde v_n,$$

where Q is the lower triangular matrix of the Cholesky decomposition of Σ (i.e., $\Sigma = QQ^T$), $\Psi := Q^{-1}\Sigma_{u,w}Q^{-T}$, $v_n := Q^{-1}(x_n - \mu_{u,w})$, and $\tilde v_m := Q^{-1}(u_m - \mu_{u,w})$.

We vary the pseudocoreset size from M = 1 to 200 and set the total number of iterations to T = 500. We use learning rates $\gamma_t(M) = \alpha(M)\, t^{-1}$, where $\alpha(M) = 1$ for Sparse VI and $\alpha(M) = \max(1.1 - 0.005M, 0.2)$ for PSVI. As verified in Figs. 2a and 2b, Hilbert coresets provide poor-quality summarizations in the high-dimensional regime, even for large coreset sizes. Despite showing a faster decrease of approximation error over a larger range of coreset sizes, Sparse VI is also fundamentally limited by the use of the original datapoints, per Proposition 1. Furthermore, we observe that the quality of all previous coreset methods at d = 500 is significantly lower compared to d = 200. On the other hand, the KL divergence for PSVI decreases significantly more quickly, giving a near-perfect approximation of the true posterior with a single pseudodata point, regardless of data dimension. As shown earlier in Fig. 1a, PSVI has the capacity to move the pseudodata points in order to capture the true posterior very efficiently.

**Bayesian linear regression.** In the second experiment, we use a set of N = 2,000 101-dimensional datapoints $(x_n, y_n)_{n=1}^N$ generated as follows: $(x_n)_{n=1}^N \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I)$, $y_n = [1, x_n]^T \theta + \epsilon_n$, $(\epsilon_n)_{n=1}^N \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$; we aim to infer $\theta \in \mathbb{R}^{101}$.
We assume a prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2 I)$, where $\mu_0, \sigma_0^2$ are the dataset empirical mean and second moment, and set the noise parameter σ to the variance of $(y_n)_{n=1}^N$. We apply stochastic optimization for the PSVI construction (see also Supp. C.2.1). We use learning rates $\gamma_t = t^{-1}$ for Sparse VI and $\gamma_t = 0.1\,t^{-1}$ for PSVI, with B = 200 and T = 1000, while the selection step for Sparse VI is carried out over the full dataset. Fig. 2c shows that Hilbert coresets cannot improve the posterior approximation in this setting with 100 random projections (see Supp. C.2.2), while PSVI achieves the fastest decay rate over sizes 100 ≤ M < 250, surpassing Sparse VI.

**Bayesian logistic regression.** Finally, we compare (pseudo)coreset construction methods on Bayesian logistic regression applied to 3 large datasets (8.4–100K datapoints, 50–237 dimensions). For brevity, equations and gradients for the logistic regression model, additional experiments on 3 smaller-scale datasets, full dataset descriptions, hyperparameter selection, time performance evaluation, and results on an incremental scheme for pseudocoreset construction are deferred to Supp. C.3. For PSVI and Sparse VI we use minibatch size B = 200, number of gradient updates T = 500, and learning rate schedules $\gamma_t = \alpha t^{-1}$. For TRANSACTIONS, CHEMREACT100, and MUSIC, α is respectively set to 0.1, 0.1, 10 for Sparse VI, and to 1, 10, 10 for PSVI. In the selection step of Sparse VI we use a uniform subsample of 1,000 datapoints. For the differentially private pseudocoreset constructions (DP-PSVI), we use a subsampling ratio $q = 2 \times 10^{-3}$. At each iteration we adapt the clipping norm value C to the median norm of $\left(f(u_m, \theta_s) - \frac{1}{S}\sum_{s'=1}^S f(u_m, \theta_{s'})\right)_{s=1}^S$ computed over the pseudodata point values $u_m$, and use noise level σ = 5. Our hyperparameter choices imply privacy parameters ε = 0.2 and δ = 1/N for each of the datasets. We initialize each pseudocoreset of size M by sampling $(x_m)_{m=1}^M \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I)$ and sampling $\theta, (y_m)_{m=1}^M$ from the statistical model.

Figure 3: Comparison of (pseudo)coreset approximate posterior quality vs. coreset size for logistic regression over 10 trials on 3 large-scale datasets: (a) TRANSACTIONS (d = 50), (b) CHEMREACT100 (d = 100), (c) MUSIC (d = 237). Presented differentially private pseudocoresets correspond to (0.2, 1/N)-DP. Reverse KL divergence is displayed normalized by the prior.

Figure 4: Approximate posterior quality over decreasing differential privacy guarantees for private pseudocoresets of varying size (DP-PSVI), plotted against private variational inference (DP-VI, [27]), on (a) TRANSACTIONS and (b) CHEMREACT100. δ is always kept fixed at 1/N. Markers on the right end of each plot display the errorbar of the approximation achieved by the corresponding nonprivate posteriors. Results are displayed over 5 trials for each construction.

Results presented in Fig. 3 demonstrate that PSVI consistently achieves the smallest posterior approximation error in the small coreset size regime, offering an improvement over Sparse VI and being competitive with GIGA (Optimal), without the requirement of specifying a weighting function. In Fig. 3a, for M ≥ d, GIGA (Optimal) follows a much steeper decrease in KL divergence, reflecting the dependence of its approximation quality on dataset dimension per Proposition 1. In contrast, PSVI typically reaches its minimum at M < d. The difference in approximation quality becomes clearer in higher dimensions (e.g., MUSIC, where d = 237).
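
The DP-PSVI constructions above rely on the clip-and-noise replacement of Eq. (6): each minibatch datapoint's vector of centered potentials across the S samples is clipped to ℓ2 norm C, summed, and perturbed with Gaussian noise of scale σC. A minimal numpy sketch of that privatization step, with function and variable names of our own choosing:

```python
import numpy as np

def privatized_sum(G, C, sigma, seed=None):
    """Privatize the (g_s^T 1)_{s=1}^S term per Eq. (6).

    G : (B, S) array, G[i] = (f(x_i, theta_s) - mean_{s'} f(x_i, theta_s'))_{s=1..S}
    C : clipping bound on each row's l2 norm; sigma : noise multiplier.
    """
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(G, axis=1, keepdims=True)      # (B, 1) per-datapoint norms
    clipped = G / np.maximum(1.0, norms / C)              # per-example clipping
    return clipped.sum(axis=0) + sigma * C * rng.standard_normal(G.shape[1])
```

The privatized sum then stands in for the $g_s^T 1$ terms in lines 12 and 14 of Alg. 1; by the post-processing property of differential privacy, the remaining computations add no further privacy cost.
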
Perhaps surprisingly, the private pseudocoreset construction has only marginally worse approximation quality compared to nonprivate PSVI, and generally achieves better performance than the other state-of-the-art nonprivate coreset constructions. In Fig. 4 we present the posterior approximation quality achieved by DP-PSVI against a competitive state-of-the-art method for general-purpose private inference (DP-VI, [27]). The plots display the behaviour of the methods over a wide range of ε values, achieved using varying levels of privatization noise, with δ always set to 1/N. For logistic regression, DP-VI infers an approximate posterior from the family of Gaussians with diagonal covariance via ADVI [28], followed by an additional Laplace approximation. Note that by design, DP-VI is constrained by the usual Gaussian variational approximation, while DP-PSVI is more flexible and can approach the true posterior as M increases; this effect is reflected in the nonprivate posteriors as well, as data dimensionality grows (see for example Fig. 4c). Indeed, we verify that in the high-privacy regime DP-PSVI, for sufficient pseudocoreset size (which is typically small for the tested real-world datasets), offers a posterior approximation with better KL divergence than DP-VI. Our findings indicate that private PSVI offers efficient releases of big data via informative pseudopoints, which enable arbitrary post-processing (e.g., running any nonprivate black-box algorithm for Bayesian inference), under strong privacy guarantees and without reducing the quality of inference.

## 5 Conclusion

We introduced a new variational formulation for Bayesian coreset construction, which yields efficient summarizations of big and high-dimensional datasets by simultaneously learning pseudodata point locations and weights. We proved limitations of existing variational formulations for coresets and demonstrated that they can be resolved with our new methodology. We proposed an efficient construction scheme via black-box stochastic optimization and showed how it can be adapted for differentially private Bayesian summarization. Finally, we demonstrated the applicability of our methodology on synthetic and real-world datasets and practical statistical models.

## Broader Impact

Pseudocoreset variational inference is a general-purpose Bayesian inference algorithm, and hence shares implications mostly encountered in approximate inference methods. For example, replacing the full dataset with a pseudocoreset has the potential to cause inferential errors; these can be partially tempered by using a pseudocoreset of larger size. Note also that the optimization algorithm in this work aims to reduce KL divergence; however, the proposed variational objective might be misleading in many applications and lead to incorrect conclusions in certain statistical models (e.g., point estimates and uncertainties might be far off despite the KL being almost zero [24]). Moreover, Bayesian inference in general is prone to model misspecification. Therefore, a pseudocoreset summarization based on a wrong statistical model will lead to a non-representative compression for inferential purposes. Constructing the coreset on a statistical model suited for robust inference instead of the original one [31, 40] can offer protection against modelling mismatches.
Naturally, the utility of the generated dataset summary becomes task-dependent, as it has been optimized for a specific learning objective, and it is not fully transferable to multiple different inference tasks on the same dataset. Our learnable pseudodata are also generally not as interpretable as the points of previous coreset methods, as they are not real data, and the level of interpretability is model-specific. This creates a risk of misinterpretation of pseudocoreset points in practice. On the other hand, our optimization framework does allow the introduction of interpretability constraints (e.g., pseudodata sparsity) to explicitly capture interpretability requirements.

Pseudocoreset-based summarization is susceptible to reproducing potential biases and unfairness existing in the original dataset. Majority-group datapoints in the full dataset which capture information relevant to the statistical task of interest are expected to remain over-represented in the learned summary, while minority-group datapoints might be eliminated if their distinguishing features are not related to inference. Amending the initialization step to contain such datapoints, or using a prior that strongly favors a debiased version of the dataset, could both mitigate these concerns; but more study is warranted.

## Acknowledgments and Disclosure of Funding

ZX and TC were supported by a National Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant and an NSERC Discovery Launch Supplement. DM and CM were partially supported by Nokia Bell Labs through their donation for the Centre of Mobile, Wearable Systems and Augmented Intelligence to the University of Cambridge. DM also gratefully acknowledges financial support from the Lundgren Fund and Darwin College Cambridge. The authors thank Rik Sarkar for useful conversations in the early stages of this work.

## References

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In ACM SIGSAC Conference on Computer and Communications Security, 2016.

[2] P. Agarwal, S. Har-Peled, and K. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52, 2005.

[3] R. Agrawal, T. Campbell, J. Huggins, and T. Broderick. Data-dependent compression of random features for large-scale kernel approximation. In International Conference on Artificial Intelligence and Statistics, 2019.

[4] O. Bachem, M. Lucic, and A. Krause. Coresets for nonparametric estimation: the case of DP-means. In International Conference on Machine Learning, 2015.

[5] M. Balog, I. Tolstikhin, and B. Schölkopf. Differentially private database release via kernel mean embeddings. In International Conference on Machine Learning, 2018.

[6] V. Braverman, D. Feldman, and H. Lang. New frameworks for offline and streaming coreset constructions. arXiv:1612.00889, 2016.

[7] T. Campbell and B. Beronov. Sparse variational inference: Bayesian coresets from scratch. In Advances in Neural Information Processing Systems, 2019.

[8] T. Campbell and T. Broderick. Bayesian coreset construction via greedy iterative geodesic ascent. In International Conference on Machine Learning, 2018.

[9] T. Campbell and T. Broderick. Automated scalable Bayesian inference via Hilbert coresets. Journal of Machine Learning Research, 20(15), 2019.

[10] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Uncertainty in Artificial Intelligence, 2010.

[11] P. Drineas and M. Mahoney.
On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2005.

[12] W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon. Squashing flat files flatter. In ACM Conference on Knowledge Discovery and Data Mining, 1999.

[13] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In International Conference on the Theory and Applications of Cryptographic Techniques, 2006.

[14] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Conference on Theory of Cryptography, 2006.

[15] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 2014.

[16] D. Feldman, M. Faulkner, and A. Krause. Scalable training of mixture models via coresets. In Advances in Neural Information Processing Systems, 2011.

[17] D. Feldman, A. Fiat, H. Kaplan, and K. Nissim. Private coresets. In ACM Symposium on Theory of Computing, 2009.

[18] D. Feldman, M. Volkov, and D. Rus. Dimensionality reduction of massive sparse datasets using coresets. In Advances in Neural Information Processing Systems, 2016.

[19] D. Feldman, C. Xiang, R. Zhu, and D. Rus. Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In International Conference on Information Processing in Sensor Networks, 2017.

[20] R. Guhaniyogi and D. Dunson. Bayesian compressed regression. Journal of the American Statistical Association, 110(512), 2015.

[21] M. D. Hoffman and A. Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

[22] J. Huggins, R. Adams, and T. Broderick. PASS-GLM: Polynomial approximate sufficient statistics for scalable Bayesian GLM inference. In Advances in Neural Information Processing Systems, 2017.

[23] J. Huggins, T. Campbell, and T. Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, 2016.

[24] J. Huggins, T. Campbell, M. Kasprzak, and T. Broderick. Validated variational inference via practical posterior error bounds. In International Conference on Artificial Intelligence and Statistics, 2020.

[25] F. Huszár and D. Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Uncertainty in Artificial Intelligence, 2012.

[26] P. Jacob, J. O'Leary, and Y. Atchadé. Unbiased Markov chain Monte Carlo with couplings. arXiv:1708.03625, 2017.

[27] J. Jälkö, O. Dikmen, and A. Honkela. Differentially private variational inference for non-conjugate models. In Uncertainty in Artificial Intelligence, 2017.

[28] A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14), 2017.

[29] M. Lucic, O. Bachem, and A. Krause. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In International Conference on Artificial Intelligence and Statistics, 2016.

[30] D. Madigan, N. Raghavan, and W. DuMouchel. Likelihood-based data squashing: a modeling approach to instance construction. Data Mining and Knowledge Discovery, 6, 2002.

[31] J. W. Miller and D. B. Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527), 2019.

[32] C. Musco and C. Musco.
Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems, 2017.

[33] R. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte Carlo, chapter 5. CRC Press, 2011.

[34] M. Park, J. R. Foulds, K. Chaudhuri, and M. Welling. Variational Bayes in private settings (VIPS). Journal of Artificial Intelligence Research, 68, 2020.

[35] R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, 2014.

[36] V. Schellekens, A. Chatalic, F. Houssiau, Y.-A. De Montjoye, L. Jacques, and R. Gribonval. Differentially private compressive k-means. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2019.

[37] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 2005.

[38] M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, 2009.

[39] J. M. Tomczak and M. Welling. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, 2018.

[40] Y. Wang, A. Kucukelbir, and D. M. Blei. Robust probabilistic modeling with Bayesian data reweighting. In International Conference on Machine Learning, 2017.

[41] Y.-X. Wang, B. Balle, and S. P. Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. In International Conference on Artificial Intelligence and Statistics, 2019.

[42] M. Welling. Herding dynamical weights to learn. In International Conference on Machine Learning, 2009.

[43] S. Zhou, J. Lafferty, and L. Wasserman. Compressed regression. In Advances in Neural Information Processing Systems, 2008.