# Differentially Private Sliced Wasserstein Distance

Alain Rakotomamonjy¹ ², Liva Ralaivola¹

¹Criteo AI Lab, Paris, France. ²LITIS EA4108, Université de Rouen Normandie, Saint-Étienne-du-Rouvray, France. Correspondence to: Alain Rakotomamonjy. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## Abstract

Developing machine learning methods that are privacy preserving is today a central topic of research, with huge practical impact. Among the numerous ways to address privacy-preserving learning, we here take the perspective of computing divergences between distributions under the Differential Privacy (DP) framework; being able to compute divergences between distributions is pivotal for many machine learning problems, such as learning generative models or domain adaptation. Instead of resorting to the popular gradient-based sanitization method for DP, we tackle the problem at its roots by focusing on the Sliced Wasserstein Distance and seamlessly making it differentially private. Our main contribution is as follows: we analyze the property of adding a Gaussian perturbation to the intrinsic randomized mechanism of the Sliced Wasserstein Distance, and we establish the sensitivity of the resulting differentially private mechanism. One of our important findings is that this DP mechanism transforms the Sliced Wasserstein Distance into another distance, which we call the Smoothed Sliced Wasserstein Distance. This new differentially private distribution distance can be plugged into generative models and domain adaptation algorithms in a transparent way, and we empirically show that it yields highly competitive performance compared with gradient-based DP approaches from the literature, with almost no loss in accuracy for the domain adaptation problems that we consider.

## 1. Introduction

Healthcare and computational advertising are examples of domains that could find a tremendous benefit from the continuous advances made in Machine Learning (ML). However, as ethical and regulatory concerns become prominent in these areas, there is a need to devise privacy-preserving mechanisms allowing i) to prevent access to individual and critical data and ii) to still leave the door open to the use of elaborate ML methods. Differential privacy (DP) offers a sound privacy-preserving framework to tackle both issues, and effective DP mechanisms have been designed for, e.g., logistic regression and Support Vector Machines (Rubinstein et al., 2009; Chaudhuri et al., 2011).

Here, we address the problem of devising a differentially private distribution distance with, in hindsight, tasks such as learning generative models and domain adaptation, which both may rely on a relevant distribution distance (Lee et al., 2019; Deshpande et al., 2018). In particular, we propose and analyze a mechanism that transforms the Sliced Wasserstein Distance (SWD) (Rabin et al., 2011) into a differentially private distance while retaining the scalability advantages and metric properties of the base SWD. The key ingredient of our contribution is to take advantage of the combination of the embedded sampling process of SWD and the so-called Gaussian mechanism.
Our contributions are as follows: i) we analyze the effect of a Gaussian mechanism on the Sliced Wasserstein distance and we establish the DP-compliance of the resulting mechanism, DP-SWD; ii) we show that DP-SWD boils down to what we call the Gaussian smoothed SWD, which inherits some of the key properties of a distance, a novel result that has value on its own; iii) extensive empirical analyses on domain adaptation and generative modeling tasks show that the proposed DP-SWD is competitive: we achieve DP guarantees with almost no loss in accuracy in domain adaptation, while being the first to present a DP generative model on the 64×64 RGB CelebA dataset.

**Outline.** Section 2 states the problem we are interested in and provides background on differential privacy and the Sliced Wasserstein distance. In Section 3, we analyze the DP guarantees of random direction projections and we characterize the resulting Gaussian Smoothed Sliced Wasserstein distance. Section 4 discusses how this distance can be plugged into domain adaptation and generative modeling algorithms. After discussing related works in Section 5, Section 6 presents empirical results showing our ability to effectively learn under DP constraints.

## 2. Problem Statement and Background

### 2.1. Privacy, Gaussian Mechanism and Random Direction Projections

We start by stating the main problem we are interested in: to show the privacy properties of the random mechanism $M(X) = XU + V$, where $X \in \mathbb{R}^{n \times d}$ is a matrix (a dataset), $U \in \mathbb{R}^{d \times k}$ a random matrix made of $k$ uniformly distributed unit-norm vectors of $\mathbb{R}^d$, and $V \in \mathbb{R}^{n \times k}$ a matrix of $k$ zero-mean Gaussian vectors (the addition of which is the so-called Gaussian mechanism). We show that $M$ is differentially private and that it is the core component of the Sliced Wasserstein Distance (SWD) computed through random projection directions (the unit-norm matrix $U$); in turn, SWD inherits¹ the differential privacy of $M$. Along the way, we show that the population version of the resulting differentially private SWD is a distance, which we dub the Gaussian Smoothed SWD.

¹This is a slight abuse of vocabulary, as the Sliced Wasserstein Distance takes two inputs and not only one.
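To make the objects at stake concrete, here is a minimal NumPy sketch of the mechanism $M(X) = XU + V$; the function names and constants are ours, for illustration only, and $\sigma$ must be calibrated as discussed in Section 3 for the output to carry an actual DP guarantee.

```python
import numpy as np

def random_directions(d, k, rng):
    """k unit-norm columns of R^d, uniformly distributed on the sphere."""
    U = rng.standard_normal((d, k))
    return U / np.linalg.norm(U, axis=0, keepdims=True)

def gaussian_mechanism(X, U, sigma, rng):
    """M(X) = XU + V, with V an i.i.d. N(0, sigma^2) noise matrix."""
    V = sigma * rng.standard_normal((X.shape[0], U.shape[1]))
    return X @ U + V

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 784))    # n = 100 records in dimension d = 784
U = random_directions(784, 200, rng)   # k = 200 random projection directions
MX = gaussian_mechanism(X, U, sigma=1.0, rng=rng)  # the released output
```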
### 2.2. Differential Privacy (DP)

DP is a theoretical framework to analyze the privacy guarantees of algorithms. It rests on the following definitions.

**Definition 1** (Neighboring datasets). Let $\mathcal{X}$ (e.g., $\mathcal{X} = \mathbb{R}^d$) be a domain and $\mathcal{D} := \bigcup_{n=1}^{+\infty} \mathcal{X}^n$. $D, D' \in \mathcal{D}$ are neighboring datasets if $|D| = |D'|$ and they differ in one record.

**Definition 2** (Dwork, 2008). Let $\varepsilon, \delta > 0$ and let $A : \mathcal{D} \to \mathrm{Im}\,A$ be a randomized algorithm, where $\mathrm{Im}\,A$ is the image of $\mathcal{D}$ through $A$. $A$ is $(\varepsilon, \delta)$-differentially private, or $(\varepsilon, \delta)$-DP, if for all neighboring datasets $D, D' \in \mathcal{D}$ and for all sets of outputs $O \subseteq \mathrm{Im}\,A$, the following inequality holds:
$$\mathbb{P}[A(D) \in O] \le e^{\varepsilon}\, \mathbb{P}[A(D') \in O] + \delta,$$
where the probability relates to the randomness of $A$.

**Remark 1.** Note that given $D \in \mathcal{D}$ and a randomized algorithm $A : \mathcal{D} \to \mathrm{Im}\,A$, $A(D)$ defines a distribution $\pi_D : \mathrm{Im}\,A \to [0, 1]$ on (a subspace of) $\mathrm{Im}\,A$ with, for $O \subseteq \mathrm{Im}\,A$, $\pi_D(O) \propto \mathbb{P}[A(D) \in O]$, where $\propto$ means equality up to a normalizing factor.

The following notion of privacy, proposed by Mironov (2017), which is based on Rényi α-divergences, together with its connections to $(\varepsilon, \delta)$-differential privacy, will ease the exposition of our results (see also Asoodeh et al. (2020); Balle & Wang (2018); Wang et al. (2019)):

**Definition 3** (Mironov, 2017). Let $\varepsilon > 0$ and $\alpha > 1$. A randomized algorithm $A$ is $(\alpha, \varepsilon)$-Rényi differentially private, or $(\alpha, \varepsilon)$-RDP, if for any neighboring datasets $D, D' \in \mathcal{D}$,
$$D_\alpha\big(A(D)\,\|\,A(D')\big) \le \varepsilon,$$
where $D_\alpha(\cdot\|\cdot)$ is the Rényi α-divergence (Rényi, 1961) between two distributions (cf. Remark 1).

**Proposition 1** (Mironov (2017), Prop. 3). An $(\alpha, \varepsilon)$-RDP mechanism is also $\big(\varepsilon + \frac{\log(1/\delta)}{\alpha - 1},\ \delta\big)$-DP, for any $\delta \in (0, 1)$.

**Remark 2.** A folk method to make up an (R)DP algorithm based on a function $f : \mathcal{D} \to \mathbb{R}^d$ is the Gaussian mechanism $M_\sigma$, defined as $M_\sigma f(\cdot) = f(\cdot) + v$ where $v \sim \mathcal{N}(0, \sigma^2 I_d)$. If $f$ has 2- (or $\ell_2$-) sensitivity $\Delta_2 f := \max_{D, D' \text{ neighbors}} \|f(D) - f(D')\|_2$, then $M_\sigma$ is $\big(\alpha, \frac{\alpha\, (\Delta_2 f)^2}{2\sigma^2}\big)$-RDP.

As we shall see, the role of $f$ will be played by the Random Direction Projections operation of the Sliced Wasserstein Distance (SWD), a randomized algorithm itself, and the mechanism to be studied is the composition of two random algorithms, SWD and the Gaussian mechanism. Proving the (R)DP nature of this mechanism will rely on a high-probability bound on the sensitivity of the Random Direction Projections/SWD, combined with the result of Remark 2.

### 2.3. Sliced Wasserstein Distance

Let $\Omega \subset \mathbb{R}^d$ and let $P(\Omega)$ be the set of all probability measures over $\Omega$. The Wasserstein distance between two measures $\mu, \nu \in P(\Omega)$ is based on the so-called Kantorovich relaxation of the optimal transport problem, which consists in finding a joint probability distribution $\gamma \in P(\Omega \times \Omega)$ such that
$$\gamma^\star := \underset{\gamma \in \Pi(\mu, \nu)}{\arg\min} \int_{\Omega \times \Omega} c(x, x')\, d\gamma(x, x') \tag{1}$$
where $c(\cdot, \cdot)$ is a metric on $\Omega$ known as the ground cost (which in our case will be the Euclidean distance), $\Pi(\mu, \nu) := \{\gamma \in P(\Omega \times \Omega) \,|\, \pi_{1\#}\gamma = \mu,\ \pi_{2\#}\gamma = \nu\}$, and $\pi_1, \pi_2$ are the marginal projectors of $\gamma$ on each of its coordinates. The minimizer of this problem is the optimal transport plan and, for $q \ge 1$, the $q$-Wasserstein distance is
$$W_q(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \int_{\Omega \times \Omega} c(x, x')^q\, d\gamma(x, x') \right)^{1/q}.$$

A case of prominent interest for our work is that of one-dimensional measures, for which it was shown by Rabin et al. (2011); Bonneel et al. (2015) that the Wasserstein distance admits the closed-form solution
$$W_q(\mu, \nu) = \left( \int_0^1 \big|F_\mu^{-1}(z) - F_\nu^{-1}(z)\big|^q\, dz \right)^{1/q}$$
where $F^{-1}$ is the inverse cumulative distribution function of the related distribution. This combines well with the idea of projecting high-dimensional probability distributions onto random 1-dimensional spaces and then computing the Wasserstein distance, an operation which can be theoretically formalized through the Radon transform (Bonneel et al., 2015), leading to the so-called Sliced Wasserstein Distance:
$$SWD_q^q(\mu, \nu) := \int_{\mathbb{S}^{d-1}} W_q^q(R_u\mu, R_u\nu)\, u_d(u)\, du$$
where $R_u$ is the Radon transform of a probability distribution, so that
$$R_u\mu(t) = \int \mu(s)\, \delta(t - s^\top u)\, ds \tag{3}$$
with $u \in \mathbb{S}^{d-1} := \{u \in \mathbb{R}^d : \|u\|_2 = 1\}$, the unit hypersphere of $\mathbb{R}^d$, and $u_d$ the uniform distribution on $\mathbb{S}^{d-1}$.

In practice, we only have access to $\mu$ and $\nu$ through samples, and the proxy distributions of $\mu$ and $\nu$ to handle are $\hat\mu := \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ and $\hat\nu := \frac{1}{m}\sum_{i=1}^m \delta_{x'_i}$. By plugging those distributions into Equation (3), it is easy to show that the Radon transform depends only on the projection of $x$ on $u$. Hence, computing the Sliced Wasserstein distance amounts to averaging 1D Wasserstein distances over a set of random directions $\{u_j\}_{j=1}^k$, with each 1D probability distribution obtained by projecting a sample (of $\hat\mu$ or $\hat\nu$) on $u_j$ via $x \mapsto x^\top u_j$. This gives the following empirical approximation of SWD:
$$\widehat{SWD}_q^q(\hat\mu, \hat\nu) = \frac{1}{k} \sum_{j=1}^k W_q^q\!\left( \frac{1}{n}\sum_{i=1}^n \delta_{x_i^\top u_j},\ \frac{1}{m}\sum_{i=1}^m \delta_{x_i'^\top u_j} \right) \tag{4}$$
given $U$, a matrix of $\mathbb{R}^{d \times k}$ with unit-norm columns $u_j$.
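The approximation (4) is cheap to compute: for empirical measures, each 1D Wasserstein distance reduces to a comparison of sorted projections. Below is a minimal NumPy sketch, assuming equal sample sizes $n = m$ and uniform weights; the function name is ours.

```python
import numpy as np

def sliced_wasserstein(X, Y, U, q=2):
    """Empirical SWD_q^q (Eq. 4) between the rows of X and Y (both n x d),
    averaged over the unit-norm direction columns of U (d x k)."""
    PX = np.sort(X @ U, axis=0)  # per-direction sorted projections = quantiles
    PY = np.sort(Y @ U, axis=0)
    # 1D W_q^q between sorted samples, then average over the k directions
    return np.mean(np.abs(PX - PY) ** q)
```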
## 3. Private and Smoothed Sliced Wasserstein Distance

We now introduce how we obtain a differentially private approximation of the Sliced Wasserstein Distance. To achieve this goal, we take advantage of the intrinsic randomization process embedded in the Sliced Wasserstein distance.

### 3.1. Sensitivity of Random Direction Projections

In order to uncover its $(\varepsilon, \delta)$-DP properties, we analyze the sensitivity of the random direction projections in SWD.

**Algorithm 1** Private and Smoothed Sliced Wasserstein Distance
Input: a public matrix $X_s$ and a private matrix $X_t$, both in $\mathbb{R}^{n \times d}$; $\sigma$, the standard deviation of a Gaussian distribution; $k$, the number of directions in SWD; $q$, the power in the SWD.
1: // random projection
2: construct a random projection matrix $U \in \mathbb{R}^{d \times k}$ with unit-norm columns
3: construct two random Gaussian noise matrices $V_s$ and $V_t$ of size $n \times k$, with standard deviation $\sigma$
4: // Gaussian mechanism
5: compute $M(X_s) = X_s U + V_s$ and $M(X_t) = X_t U + V_t$
6: $DP_\sigma SWD_q^q \leftarrow$ compute Equation (4) using the entries of $M(X_s)$ and $M(X_t)$ as the locations of the Diracs
7: return $DP_\sigma SWD_q^q$

Let us consider the matrix $X \in \mathbb{R}^{n \times d}$ representing a dataset composed of $n$ examples in dimension $d$, organized in rows (each sample being randomly drawn from the distribution $\mu$). One mechanism of interest is
$$M_u(X) = X\frac{u}{\|u\|_2} + v,$$
where $v$ is a vector whose entries are drawn from a zero-mean Gaussian distribution. Let $X$ and $X'$ be two matrices in $\mathbb{R}^{n \times d}$ that differ only in one row, say $i$, and such that $\|X_{i,:} - X'_{i,:}\|_2 \le 1$, where $X_{i,:} \in \mathbb{R}^d$ and $X'_{i,:} \in \mathbb{R}^d$ are the $i$-th rows of $X$ and $X'$, respectively. For ease of notation, we will from now on use $z := (X_{i,:} - X'_{i,:})^\top$.

**Lemma 1.** Assume that $z \in \mathbb{R}^d$ is a unit-norm vector and $u \in \mathbb{R}^d$ a vector whose entries are drawn independently from $\mathcal{N}(0, \sigma_u^2)$. Then
$$Y := \frac{(z^\top u)^2}{\|u\|_2^2} \sim B\big(1/2, (d-1)/2\big)$$
where $B(\alpha, \beta)$ is the Beta distribution of parameters $\alpha, \beta$.

*Proof.* See appendix.

Instead of considering a randomized mechanism that projects only along a single random direction, we are interested in the whole set of projected (private) data along the random directions sampled through the Monte Carlo approximation (4) of the Sliced Wasserstein distance. Our key interest is therefore in the mechanism $M(X) = XU + V$ and in the sensitivity of $XU$. Because of its randomness, we are interested in a probabilistic tail bound on $\|XU - X'U\|_F$, where the matrix $U$ has columns independently drawn from $\mathbb{S}^{d-1}$.

**Lemma 2.** Let $X$ and $X'$ be two matrices in $\mathbb{R}^{n \times d}$ that differ only in one row, and for that row, say $i$, $\|X_{i,:} - X'_{i,:}\|_2 \le 1$. Let $U \in \mathbb{R}^{d \times k}$ have columns independently drawn from $\mathbb{S}^{d-1}$. With probability at least $1 - \delta$, we have
$$\|XU - X'U\|_F^2 \le w(k, \delta), \tag{5}$$
with
$$w(k, \delta) := \frac{k}{d} + 2\sqrt{\frac{k(d-1)\log(1/\delta)}{d^2(d+2)}} + \frac{2\log(1/\delta)}{3}.$$

*Proof.* See appendix.

The above bound on the squared sensitivity is obtained by first showing that the random variable $\|XU - X'U\|_F^2$ is the sum of $k$ i.i.d. Beta-distributed random variables (each with mean $1/d$ and variance $\frac{2(d-1)}{d^2(d+2)}$, by Lemma 1), and then by using a Bernstein inequality. This bound, referred to as the Bernstein bound, becomes very conservative as soon as $\delta$ is very small. By calling upon the Central Limit Theorem (CLT), and assuming that $k$ is large enough ($k > 30$), we get under the same hypotheses (proof in the appendix)
$$w(k, \delta) = \frac{k}{d} + z_{1-\delta}\sqrt{\frac{2k(d-1)}{d^2(d+2)}},$$
where $z_{1-\delta} = \Phi^{-1}(1-\delta)$ and $\Phi$ is the cumulative distribution function of a zero-mean, unit-variance Gaussian distribution. This bound is far tighter but is not rigorous, due to the CLT approximation. Figure 1 presents an example of the probability distribution histogram of $\|(X - X')U\|_F^2 = \|(X_{i,:} - X'_{i,:})U\|_2^2$ for two fixed arbitrary $X_{i,:}, X'_{i,:}$ and for 10000 random draws of $U$.

Figure 1. Estimated probability density of $\sum_{i=1}^k Y_i$ and the Normal distribution of same mean and standard deviation. Here, $k = 200$ and $d = 784$, which correspond to the number of random projections we use in the experiments and the dimensionality of MNIST digits. We also illustrate some bounds (on the squared sensitivity) that can be derived from this Normal distribution, as well as our CLT bound. Note that the Bernstein bound is above 1 in this example, and that the CLT bound is numerically equal to the inverse CDF of the Normal distribution at the desired $\delta$.
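For concreteness, both bounds can be evaluated numerically; the sketch below follows the proof outline above (each Beta term has mean $1/d$ and variance $2(d-1)/(d^2(d+2))$), and the exact Bernstein constants reflect our reading of the lemma rather than the authors' code.

```python
import numpy as np
from scipy.stats import norm

def w_bernstein(k, d, delta):
    """Bernstein-type bound on the squared sensitivity (Lemma 2 as
    reconstructed above): sum of k i.i.d. B(1/2, (d-1)/2) variables."""
    var = 2 * (d - 1) / (d**2 * (d + 2))     # variance of one Beta term
    return (k / d + np.sqrt(2 * k * var * np.log(1 / delta))
            + (2 / 3) * np.log(1 / delta))

def w_clt(k, d, delta):
    """CLT approximation of the same tail bound (tighter, not rigorous)."""
    var = 2 * (d - 1) / (d**2 * (d + 2))
    return k / d + norm.ppf(1 - delta) * np.sqrt(k * var)

# The setting of Figure 1: k = 200 projections, d = 784 (MNIST)
print(w_bernstein(200, 784, 1e-5))  # conservative, above 1 here
print(w_clt(200, 784, 1e-5))        # much smaller
```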
As shown in Figure 1, the CLT bound is numerically far smaller than the Bernstein bound of Lemma 2. Using the $w(k, \delta)$-based bounds jointly with the Gaussian mechanism property then gives us the following proposition.

**Proposition 2.** Let $\alpha > 1$ and $\delta \in (0, 1/2]$. Given a random direction projection matrix $U \in \mathbb{R}^{d \times k}$, the Gaussian mechanism $M(X) = XU + V$, where $V$ is a Gaussian matrix in $\mathbb{R}^{n \times k}$ with entries drawn from $\mathcal{N}(0, \sigma^2)$, is $\big(\frac{\alpha\, w(k, \delta/2)}{2\sigma^2} + \frac{\log(2/\delta)}{\alpha - 1},\ \delta\big)$-DP.

*Proof.* The claim derives immediately from the relation between RDP and DP (Proposition 1) and from Lemma 2, both applied with $\delta/2$.

The above DP guarantees apply to the full dataset. Hence, when learning through mini-batches, we benefit from the so-called privacy amplification by subsampling principle, which ensures that a differentially private mechanism run on a random subsample of a population leads to higher privacy guarantees than when run on the full population (Balle et al., 2018). On the contrary, gradient clipping/sanitization acts individually on each gradient and thus does not fully benefit from the subsampling amplification, as its DP property may still depend on the batch size (Chen et al., 2020).

This Gaussian mechanism on the random direction projections $M(X)$ can be related to the definition of the empirical SWD, as each $x_i^\top u_j$ corresponds to one entry of $XU$. Hence, by adding Gaussian noise to each projection, we naturally derive our empirical DP Sliced Wasserstein distance, which inherits the differential privacy of $M(X)$, owing to the post-processing proposition (Dwork et al., 2014).
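Proposition 2 holds for every Rényi order $\alpha > 1$, so the tightest $(\varepsilon, \delta)$ statement is obtained by minimizing over $\alpha$, which here admits a closed form. A small sketch, reusing the hypothetical `w_clt` routine above:

```python
import numpy as np

def dp_epsilon(sigma, k, d, delta, w_fn):
    """Best epsilon from Proposition 2: eps(alpha) = alpha*w/(2 sigma^2)
    + log(2/delta)/(alpha - 1), minimized over alpha > 1."""
    w = w_fn(k, d, delta / 2)          # sensitivity bound taken at delta/2
    alpha = 1 + np.sqrt(2 * sigma**2 * np.log(2 / delta) / w)  # minimizer
    return alpha * w / (2 * sigma**2) + np.log(2 / delta) / (alpha - 1)

# Example: report epsilon for a given noise level (values are ours)
print(dp_epsilon(sigma=5.0, k=200, d=784, delta=1e-5, w_fn=w_clt))
```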
### 3.2. Metric Properties of DP-SWD

We have analyzed the sensitivity of the random direction projection central to SWD, and we have proposed a Gaussian mechanism to obtain a differentially private SWD (DP-SWD), whose steps are depicted in Algorithm 1. In our use-cases, DP-SWD is used in a context of learning to match two distributions (one of them requiring privacy protection). Hence, the utility guarantee of our DP-SWD relates more to the ability of the mechanism to distinguish two different distributions than to the equivalence between SWD and DP-SWD. Our goal in this section is to investigate the impact of adding Gaussian noise to the source $\mu$ and target $\nu$ distributions in terms of distance properties in the population case.

Since $R_u$, as defined in Equation (3), is a push-forward operator on probability distributions, the Gaussian mechanism implies that the Wasserstein distance involved in SWD compares two 1D probability distributions which are, respectively, the convolutions of a Gaussian distribution with $R_u\mu$ and with $R_u\nu$. Hence, DP-SWD uses as a building block the 1D smoothed Wasserstein distance between $R_u\mu$ and $R_u\nu$, with the smoothing ensured by $\mathcal{N}_\sigma$, and its formal definition is
$$DP_\sigma SWD_q^q(\mu, \nu) := \int_{\mathbb{S}^{d-1}} W_q^q\big(R_u\mu * \mathcal{N}_\sigma,\ R_u\nu * \mathcal{N}_\sigma\big)\, u_d(u)\, du.$$
While some works have analyzed the theoretical properties of the smoothed Wasserstein distance (Goldfeld et al., 2020; Goldfeld & Greenewald, 2020), as far as we know, no theoretical result is available for the smoothed Sliced Wasserstein distance, and we provide in the sequel some insights that help its understanding. The following property shows that DP-SWD preserves the identity of indiscernibles.

**Property 1.** For continuous probability distributions $\mu$ and $\nu$, we have, for any $\sigma > 0$, $DP_\sigma SWD_q^q(\mu, \nu) = 0 \iff \mu = \nu$.

*Proof.* Showing that $\mu = \nu \implies DP_\sigma SWD_q^q(\mu, \nu) = 0$ is trivial, as the Radon transform and the convolution are two well-defined maps. We essentially would like to show that $DP_\sigma SWD_q^q(\mu, \nu) = 0$ implies $\mu = \nu$. If $DP_\sigma SWD_q^q(\mu, \nu) = 0$, then $R_u\mu * \mathcal{N}_\sigma = R_u\nu * \mathcal{N}_\sigma$ for almost every $u \in \mathbb{S}^{d-1}$. As convolution corresponds to multiplication in the Fourier domain, and because the Fourier transform of a Gaussian is also a Gaussian and thus always positive, one can show that we have, for all $u$, equality of the Fourier transforms of $R_u\mu$ and $R_u\nu$. Then, owing to the continuity of $\mu$ and $\nu$ and by the Fourier inversion theorem, we have $R_u\mu = R_u\nu$. Finally, as in the SWD proof (Bonnotte, 2013, Prop. 5.1.2), this implies that $\mu = \nu$, owing to the projection nature of the Radon transform and to the injectivity of the Fourier transform.

**Property 2.** $DP_\sigma SWD_q^q(\mu, \nu)$ is symmetric and satisfies the triangle inequality.

*Proof.* The proof easily derives from the metric properties of the smoothed Wasserstein distance (Goldfeld & Greenewald, 2020); details are in the appendix.

These properties are strongly relevant in the context of our machine learning applications. Indeed, while they do not tell us how the value of DP-SWD compares with SWD, at fixed $\sigma > 0$ or when $\sigma \to 0$, they show that DP-SWD can properly act (for any $\sigma > 0$) as a loss function to minimize if we aim to match distributions (at least in the population case). Naturally, there are still several theoretical properties of $DP_\sigma SWD_q^q$ that are worth investigating, but they are beyond the scope of this work.
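Before turning to applications, here is a differentiable PyTorch sketch of Algorithm 1 that can serve directly as a training loss. It assumes the two inputs have the same number of rows, and all names and defaults are ours rather than those of the authors' released code; $\sigma$ is meant to be calibrated through the privacy accountant discussed in Section 4.

```python
import torch

def dp_swd(Xs, Xt, k=200, sigma=1.0, q=2):
    """Smoothed/private empirical SWD_q^q between Xs (public) and Xt
    (private), both (n, d); gradients flow through both arguments."""
    d = Xs.shape[1]
    U = torch.randn(d, k, device=Xs.device)
    U = U / U.norm(dim=0, keepdim=True)        # unit-norm random directions
    # Gaussian mechanism on the projections (Algorithm 1, steps 2-5)
    Ps = Xs @ U + sigma * torch.randn(Xs.shape[0], k, device=Xs.device)
    Pt = Xt @ U + sigma * torch.randn(Xt.shape[0], k, device=Xt.device)
    Ps, _ = torch.sort(Ps, dim=0)              # 1D quantiles per direction
    Pt, _ = torch.sort(Pt, dim=0)
    return (Ps - Pt).abs().pow(q).mean()       # Eq. (4) on noisy projections
```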
## 4. DP-Distribution Matching Problems

There exist several machine learning problems where a distance between distributions is the key part of the loss function to optimize. In domain adaptation, one learns a classifier from a public source dataset but seeks to adapt it to a private target dataset (target domain examples are available only through a privacy-preserving mechanism). In generative modeling, the goal is to generate samples similar to true data which are accessible only through a privacy-preserving mechanism. In the sequel, we describe how our DP-SWD distance, with $q = 1$, can be instantiated in these two learning paradigms, for measuring adaptation or for measuring similarity between generated and true samples.

For unsupervised domain adaptation, given source examples $X_s$ with their labels $y_s$ and unlabeled private target examples $X_t$, the goal is to learn a classifier $h(\cdot)$, trained on the source examples, that generalizes well on the target ones. One usual technique is to learn a representation mapping $g(\cdot)$ that leads to invariant latent representations, invariance being measured as some distance between the empirical distributions of mapped source and target samples. Formally, this leads to the following learning problem:
$$\min_{g, h}\ L_c\big(h(g(X_s)), y_s\big) + DP_\sigma SWD\big(g(X_s), g(X_t)\big) \tag{7}$$
where $L_c$ can be any loss function of interest and $DP_\sigma SWD = DP_\sigma SWD_1^1$. We solve this problem through stochastic gradient descent, similarly to many approaches that use the Sliced Wasserstein Distance as a distribution distance (Lee et al., 2019), except that in our case the gradient of $DP_\sigma SWD$ involving the target dataset is $(\varepsilon, \delta)$-DP. Note that in order to compute $DP_\sigma SWD$, one needs the public dataset $X_s$ and the public generator. In practice, this generator can either be transferred, after each update, from the private client curating $X_t$, or be duplicated on that client. The resulting algorithm is presented in Algorithm 2.

**Algorithm 2** Differentially private DANN with DP-SWD
Input: $\{X_s, y_s\}$ and $\{X_t\}$, respectively the public and private domains; $\sigma$, the standard deviation of the Gaussian mechanism.
1: initialize the representation mapping $g$ and the classifier $h$, with parameters $\theta_g$, $\theta_h$
2: repeat
3: sample minibatches $\{x_B^s, y_B^s\}$ from $\{x_i^s, y_i^s\}$
4: compute $g(x_B^s)$
5: compute the classification loss $L_c = \sum_{i \in B} L\big(y_i^s, h(g(x_i^s))\big)$
6: $\theta_h \leftarrow \theta_h - \alpha_h \nabla_{\theta_h} L_c$
7: // private steps: $g(x_B^t)$ is computed in a private way; $g(\cdot)$ is either transferred or has shared weights between public and private clients
8: sample minibatches $\{x_B^t\}$ from $\{x_i^t\}$
9: compute $g(x_B^t)$
10: normalize each sample of $g(x_B^s)$ w.r.t. $\max_j \|g(x_{B,j}^s)\|_2$
11: normalize each sample of $g(x_B^t)$ w.r.t. $\max_j \|g(x_{B,j}^t)\|_2$
12: compute $DP_\sigma SWD\big(g(x_B^s), g(x_B^t)\big)$
13: publish $\nabla_{\theta_g} DP_\sigma SWD$
14: // public step
15: $\theta_g \leftarrow \theta_g - \alpha_g \nabla_{\theta_g} L_c - \alpha_g \nabla_{\theta_g} DP_\sigma SWD$
16: until a convergence condition is met

In the context of generative modeling, we follow the same steps as Deshpande et al. (2018) but use our $DP_\sigma SWD$ instead of SWD. Assuming that we have some examples of data $X_t$ sampled from a given distribution, the goal of the learning problem is to learn a generator $g(\cdot)$ that outputs samples similar to those of the target distribution, given a noise vector as input. This is usually achieved by solving
$$\min_g\ DP_\sigma SWD\big(X_t, g(z)\big) \tag{8}$$
where $z$ is, for instance, a Gaussian vector. In practice, we solve this problem using a mini-batch stochastic gradient descent strategy, following a similar algorithm to the one for domain adaptation. The main difference is that the private target dataset does not pass through the generator.

**Tracking the privacy loss.** Given that we consider the privacy mechanism within a stochastic gradient descent framework, we keep track of the privacy loss through the RDP accountant proposed by Wang et al. (2019) for composing subsampled private mechanisms. Hence, we used the PyTorch package (Xiang, 2020) that they made available for estimating the noise standard deviation $\sigma$ given the $(\varepsilon, \delta)$ budget, the number of epochs, a fixed batch size, the number of private samples, the dimension $d$ of the distributions to be compared, and the number $k$ of projections used for $DP_\sigma SWD$. Some examples of Gaussian noise standard deviations are reported in Table 4 in the appendix.
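As an illustration, one optimization step implementing Equation (7) in the spirit of Algorithm 2 could look as follows, reusing the `dp_swd` sketch above; for brevity it collapses the separate $\theta_h$ and $\theta_g$ updates of Algorithm 2 into a single optimizer step, and all names are ours.

```python
import torch
import torch.nn.functional as F

def dann_dp_step(g, h, opt, xs, ys, xt, sigma, lam=1.0):
    """One SGD step on Eq. (7); xt lives on the private side and only the
    DP-SWD gradient involving it is published (cf. Algorithm 2)."""
    opt.zero_grad()
    zs, zt = g(xs), g(xt)                  # zt is computed privately
    loss_c = F.cross_entropy(h(zs), ys)    # classification loss L_c
    ns = zs / zs.norm(dim=1).max()         # max row-norm normalization,
    nt = zt / zt.norm(dim=1).max()         # as in Algorithm 2 (steps 10-11)
    loss = loss_c + lam * dp_swd(ns, nt, sigma=sigma, q=1)
    loss.backward()
    opt.step()
    return loss.item()
```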
## 5. Related Works

### 5.1. DP Generative Models

Most recent approaches (Fan, 2020) that propose DP generative models consider them from a GAN perspective and apply DP-SGD (Abadi et al., 2016) for training the model. The main idea for introducing privacy is to appropriately clip the gradients and to add calibrated noise to the model's parameter gradients during training (Torkzadehmahani et al., 2019; Chen et al., 2020; Xie et al., 2018). This added noise makes those models even harder to train. Furthermore, since the DP mechanism applies to each single gradient, those approaches do not fully benefit from the amplification induced by the subsampling (mini-batching) mechanism (Balle et al., 2018). The work of Chen et al. (2020) uses gradient sanitization and achieves privacy amplification by training multiple discriminators, as in (Jordon et al., 2018), and sampling from them for adversarial training. While their approach is competitive in terms of quality of the generated data, it is hardly tractable for large-scale datasets, due to the multiple (up to 1000 in their experiments) discriminator trainings.

Instead of considering adversarial training, some DP generative model works have investigated the use of distances on distributions. Harder et al. (2020) proposed a random-feature-based maximum mean embedding distance for computing the distance between empirical distributions. Cao et al. (2021) considered the Sinkhorn divergence for computing the distance between true and generated data and used gradient clipping and noise addition for privacy preservation. Their approach is thus very similar to DP-SGD in its privacy mechanism. Instead, we perturb the Sliced Wasserstein distance by smoothing the distributions to compare. This yields a privacy mechanism that benefits from subsampling amplification, as its sensitivity does not depend on the number of samples, and that preserves its utility, as the smoothed Sliced Wasserstein distance is still a distance.

### 5.2. Differential Privacy with Random Projections

The Sliced Wasserstein Distance leverages the Radon transform to map high-dimensional distributions into 1D distributions. This is related to projections on random directions, and the sensitivity analysis of those projections on unit-norm random vectors is key. The first use of random projections for differential privacy was introduced by Kenthapadi et al. (2013). Their approach was linked to the distance-preserving property of random projections induced by the Johnson-Lindenstrauss lemma. As a natural extension, Le Tien et al. (2019) and Gondara & Wang (2020) applied this idea in the context of optimal transport and classification. The fact that we project on unit-norm random vectors, instead of arbitrary random vectors as in Kenthapadi et al. (2013), requires a novel sensitivity analysis, and we show that this sensitivity scales gracefully with the ratio of the number of projections to the dimension of the distributions.

## 6. Numerical Experiments

In this section, we provide some numerical results showing how our differentially private Sliced Wasserstein Distance works in practice. The code for reproducing some of the results is available at https://github.com/arakotom/dp_swd.

### 6.1. Toy Experiment

The goal of this experiment is to illustrate the behaviour of DP-SWD compared with SWD in controlled situations. We consider the source and target distributions as isotropic Normal distributions of unit variance, with added privacy-inducing Gaussian noise of different variances. Both distributions are Gaussians of dimension 5, and the means of the source and target are respectively $m_\mu = \mathbf{0}$ and $m_\nu = c\mathbf{1}$ with $c \in [0, 1]$.
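This comparison can be reproduced in a few lines with the NumPy sketches introduced earlier (`random_directions`, `gaussian_mechanism`, `sliced_wasserstein`); the sample size and grid below are ours and serve only as an illustration (Figure 2 additionally averages over several draws).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k, sigma = 5, 500, 50, 1.0
U = random_directions(d, k, rng)
for c in np.linspace(0.0, 1.0, 5):
    X = rng.standard_normal((n, d))          # source mu = N(0, I)
    Y = rng.standard_normal((n, d)) + c      # target nu = N(c*1, I)
    swd = sliced_wasserstein(X, Y, U)
    # DP-SWD: Gaussian mechanism on the projections, then sorted comparison
    PX = np.sort(gaussian_mechanism(X, U, sigma, rng), axis=0)
    PY = np.sort(gaussian_mechanism(Y, U, sigma, rng), axis=0)
    dpswd = np.mean(np.abs(PX - PY) ** 2)
    print(f"c={c:.2f}  SWD={swd:.3f}  DP-SWD={dpswd:.3f}")
```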
Figure 2. Comparing SWD and DP-SWD by measuring the distance between two normal distributions (averaged over 5 draws of all samples), as the distance between the means of the Gaussians increases linearly, for different noise amplitudes of the Gaussian mechanism and different numbers of samples. (left) $\sigma = 1$. (right) $\sigma = 3$.

Figure 2 presents the evolution of the distances, averaged over 5 random draws of the Gaussians and the noise. When the source and target distributions are different, this experiment shows that DP-SWD follows the same increasing trend as SWD. This suggests that the order relation between distributions as evaluated using SWD is preserved by DP-SWD, and that the distance DP-SWD is minimized when $\mu = \nu$, which are important features when using DP-SWD as a loss.

### 6.2. Domain Adaptation

We conduct experiments evaluating our DP-SWD distance in the context of classical unsupervised domain adaptation (UDA) problems, such as handwritten digit recognition (MNIST/USPS), synthetic-to-real object data (VisDA 2017) and the Office31 datasets. Our goal is to analyze how DP-SWD performs compared with its public counterpart SWD (Lee et al., 2019), with a DP deep domain adaptation algorithm based on gradient clipping, DP-DANN (Wang et al., 2020), and with the classical non-private DANN algorithm. Note that we need not compare with (Le Tien et al., 2019), as their algorithm does not learn a representation and does not handle large-scale problems, since the OT transport coupling matrix needs to be computed on the full dataset. For all methods and for each dataset, we used the same neural network architecture for the representation mapping and for the classification; the approaches differ only in how the distance between distributions is computed. Details of the problem configurations, as well as the model architecture and training procedure, can be found in the appendix. Sensitivity has been computed using the Bernstein bound of Lemma 2.

Figure 3. Evolution of the target domain accuracy in UDA with respect to the $\varepsilon$ parameter for a fixed value of $\delta$, for 3 different datasets. The sensitivity of DP-SWD has been computed using the Bernstein bound.

Table 1 presents the accuracy on the target domain for all methods, averaged over 10 iterations. We remark that our private model outperforms the DP-DANN approach on all problems except two difficult ones. Interestingly, our method does not incur a loss of performance despite the private mechanism. This finding is confirmed in Figure 3, where we plot the performance of the model with respect to the noise level $\sigma$ (and thus the privacy parameter $\varepsilon$). Our model is able to keep accuracy almost constant for $\varepsilon \in [3, 10]$.

Table 1. Accuracy on the private target domain for different domain adaptation problems. M-U and U-M refer to MNIST-USPS and USPS-MNIST, the first listed dataset being the source domain. (D, W, A) refer to domains in the Office31 dataset. For all problems, $\varepsilon = 10$ and $\delta$ depends on the number of examples in the target domain; $\delta$ has been respectively set to $10^{-3}$, $10^{-5}$ and $10^{-6}$ for Office31, MNIST-USPS and VisDA.

| Data | DANN | SWD | DP-DANN | DP-SWD |
|------|------|-----|---------|--------|
| M-U | 93.9±0 | 95.5±1 | 87.1±2 | 94.0±0 |
| U-M | 86.2±2 | 84.8±2 | 73.5±2 | 83.4±2 |
| VisDA | 57.4±1 | 53.8±1 | 49.0±1 | 47.0±1 |
| D-W | 90.9±1 | 90.7±1 | 88.0±1 | 90.9±1 |
| D-A | 58.6±1 | 59.4±1 | 56.5±1 | 55.2±2 |
| A-W | 70.4±3 | 74.5±1 | 68.7±1 | 72.6±1 |
| A-D | 78.6±2 | 78.5±1 | 73.7±1 | 79.8±1 |
| W-A | 54.7±3 | 59.1±0 | 56.0±1 | 59.0±1 |
| W-D | 91.1±0 | 95.7±1 | 63.4±3 | 92.6±1 |

### 6.3. Generative Models

In the context of generative models, our first task is to generate synthetic samples for the MNIST and FashionMNIST datasets, to be used afterwards for learning a classifier. We compare with different gradient-sanitization strategies, such as DP-CGAN (Torkzadehmahani et al., 2019) and GS-WGAN (Chen et al., 2020), and with MERF (Harder et al., 2020), a model that uses MMD as the distribution distance. We report results for our DP-SWD using the two ways of computing the sensitivity, i.e., the CLT bound and the Bernstein bound, respectively denoted DP-SWD-c and DP-SWD-b. All models are compared at the same fixed privacy budget $(\varepsilon, \delta) = (10, 10^{-5})$. Our implementation is based on that of MERF (Harder et al., 2020), in which we simply plugged our DP-SWD loss in place of the MMD loss.
The architecture of our and MERF's generative models is composed of a few layers of convolutional neural networks and upsampling layers, with approximately 180K parameters, while that of GS-WGAN is based on a ResNet with about 4M parameters. MERF's and our models have been trained over 100 epochs with the Adam optimizer and a batch size of 100. For our DP-SWD, we have used 1000 random projections, and the output dimension is the classical $28 \times 28 = 784$.

Table 2. Comparison of DP generative models on MNIST and FashionMNIST at privacy level $(\varepsilon, \delta) = (10, 10^{-5})$. The downstream task is a 10-class classification problem using the synthetic generated dataset. We report the accuracy of different classifiers. Results are averaged over 5 runs of generation. SWD is the non-private version of our generative model.

| Method | MNIST MLP | MNIST LogReg | FashionMNIST MLP | FashionMNIST LogReg |
|--------|-----------|--------------|------------------|---------------------|
| SWD | 87 | 82 | 77 | 76 |
| GS-WGAN | 79 | 79 | 65 | 68 |
| DP-CGAN | 60 | 60 | 50 | 51 |
| DP-MERF | 76 | 75 | 72 | 71 |
| DP-SWD-c | 77 | 78 | 67 | 66 |
| DP-SWD-b | 76 | 77 | 67 | 66 |

Table 2 reports some quantitative results on the task. We note that MERF and our DP-SWD perform on par on these problems (with a slight advantage for MERF on FashionMNIST and for DP-SWD on MNIST). Note that our results for MERF are better than those reported in (Chen et al., 2020). We also remark that GS-WGAN performs best, at the expense of a model with 20-fold more parameters and a very expensive training time (a few hours just for training the 1000 discriminators, while our model and MERF's each take less than 10 minutes).

Figure 4. Examples of generated MNIST samples from (top) non-private SWD, (middle) DP-SWD-b, (bottom) MERF.

Figure 5. Examples of generated FashionMNIST samples from (top) non-private SWD, (middle) DP-SWD-b, (bottom) MERF.

Figures 4 and 5 present some examples of generated samples for MNIST and FashionMNIST. We can note that the samples generated by DP-SWD present some diversity and are visually more relevant than those of MERF, although they do not lead to better performance in the classification task. Our samples are a bit blurry compared to the ones generated by the non-private SWD: this is an expected effect of the smoothing.

We also evaluate our DP-SWD distance for training generative models on large RGB datasets such as the $64 \times 64 \times 3$ CelebA dataset. To the best of our knowledge, no DP generative approach has been experimented on such a dataset; for instance, the GS-WGAN of Chen et al. (2020) has been evaluated only on grayscale MNIST-like problems. For training the model, we followed the same approach (architecture and optimizer) as the one described in Nguyen et al. (2020). In that work, in order to reduce the dimension of the problem, distributions are compared in a latent space of dimension $d = 8192$. We have used $k = 2000$ projections, which leads to a ratio $k/d < 0.25$. The noise variance $\sigma$ and the privacy loss over 100 iterations have been evaluated using the PyTorch package of Wang et al. (2019) and have been calibrated for $\varepsilon = 10$ and $\delta = 10^{-6}$, since the number of training samples is of the order of 170K. Details are in the appendix. Figure 6 presents some examples of samples generated from our DP-SWD and SWD.
We note that in this high-dimensional context, the sensitivity bound plays a key role, as we get a FID score of 97 vs. 158 using the CLT bound and the Bernstein bound respectively, the former being smaller than the latter.

Figure 6. Images generated on the CelebA dataset. From top to bottom: non-private SWD, DP-SWD with noise calibrated according to the Gaussian approximation (CLT bound), and DP-SWD with noise calibrated according to the Bernstein bound. The FID scores computed over 10000 generated examples of these three models are respectively 58, 97 and 149.

## 7. Conclusion

This paper presents a differentially private distance on distributions based on the Sliced Wasserstein distance. We applied a Gaussian mechanism to the random projections inherent to SWD and analyzed its properties. We proved a bound (à la Bernstein) on the sensitivity of the mechanism that has an inverse dependence on the problem dimension, and showed that a Central Limit Theorem bound, although approximate, is tighter. One of our key findings is that the privacy-inducing mechanism we propose turns the SWD into a smoothed Sliced Wasserstein distance, which inherits all the properties of a distance. Hence, our privacy-preserving distance can be plugged seamlessly into domain adaptation or generative model algorithms to give effective privacy-preserving learning procedures.

## References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308-318, 2016.

Asoodeh, S., Liao, J., Calmon, F. P., Kosut, O., and Sankar, L. Three variants of differential privacy: Lossless conversion and applications. arXiv preprint arXiv:2008.06529, 2020.

Balle, B. and Wang, Y.-X. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 394-403. PMLR, 2018. URL http://proceedings.mlr.press/v80/balle18a.html.

Balle, B., Barthe, G., and Gaboardi, M. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In Advances in Neural Information Processing Systems, volume 31, pp. 6277-6287. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/3b5020bb891119b9f5130f1fea9bd773-Paper.pdf.

Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22-45, 2015.

Bonnotte, N. Unidimensional and evolution methods for optimal transportation. PhD thesis, Paris 11, 2013.

Cao, T., Bie, A., Kreis, K., and Fidler, S. Differentially private generative models through optimal transport, 2021. URL https://openreview.net/forum?id=zgMPc_48Zb.

Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011.

Chen, D., Orekondy, T., and Fritz, M. GS-WGAN: A gradient-sanitized approach for learning differentially private generators.
In Neural Information Processing Systems (NeurIPS), 2020.

Deshpande, I., Zhang, Z., and Schwing, A. G. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3483-3491, 2018.

Dwork, C. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pp. 1-19. Springer, 2008.

Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.

Fan, L. A survey of differentially private generative adversarial networks. In The AAAI Workshop on Privacy-Preserving Artificial Intelligence, 2020.

Goldfeld, Z. and Greenewald, K. Gaussian-smoothed optimal transport: Metric structure and statistical efficiency. In International Conference on Artificial Intelligence and Statistics, pp. 3327-3337. PMLR, 2020.

Goldfeld, Z., Greenewald, K., and Kato, K. Asymptotic guarantees for generative modeling based on the smooth Wasserstein distance. Advances in Neural Information Processing Systems, 33, 2020.

Gondara, L. and Wang, K. Differentially private small dataset release using random projections. In Conference on Uncertainty in Artificial Intelligence, pp. 639-648. PMLR, 2020.

Harder, F., Adamczewski, K., and Park, M. Differentially private mean embeddings with random features (DP-MERF) for simple & practical synthetic data generation. arXiv preprint arXiv:2002.11603, 2020.

Jordon, J., Yoon, J., and van der Schaar, M. PATE-GAN: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2018.

Kenthapadi, K., Korolova, A., Mironov, I., and Mishra, N. Privacy via the Johnson-Lindenstrauss transform. Journal of Privacy and Confidentiality, 5(1), 2013.

Lee, C.-Y., Batra, T., Baig, M. H., and Ulbricht, D. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10285-10295, 2019.

Le Tien, N., Habrard, A., and Sebban, M. Differentially private optimal transport: Application to domain adaptation. In IJCAI, pp. 2852-2858, 2019.

Mironov, I. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263-275, 2017. doi: 10.1109/CSF.2017.11.

Nguyen, K., Ho, N., Pham, T., and Bui, H. Distributional sliced-Wasserstein and applications to generative modeling. arXiv preprint arXiv:2002.07367, 2020.

Nguyen, K., Ho, N., Pham, T., and Bui, H. Distributional sliced-Wasserstein and applications to generative modeling. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=QYjO70ACDK.

Nietert, S., Goldfeld, Z., and Kato, K. From smooth Wasserstein distance to dual Sobolev norm: Empirical approximation and statistical applications. arXiv preprint arXiv:2101.04039, 2021.

Rabin, J., Peyré, G., Delon, J., and Bernot, M. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435-446. Springer, 2011.

Rubinstein, B. I., Bartlett, P. L., Huang, L., and Taft, N. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. arXiv preprint arXiv:0911.5708, 2009.

Rényi, A. On measures of entropy and information.
In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pp. 547-561, Berkeley, Calif., 1961. University of California Press. URL https://projecteuclid.org/euclid.bsmsp/1200512181.

Torkzadehmahani, R., Kairouz, P., and Paten, B. DP-CGAN: Differentially private synthetic data and label generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.

Tu, S. Differentially private random projections.

Wang, Q., Li, Z., Zou, Q., Zhao, L., and Wang, S. Deep domain adaptation with differential privacy. IEEE Transactions on Information Forensics and Security, 15:3093-3106, 2020. doi: 10.1109/TIFS.2020.2983254.

Wang, Y.-X., Balle, B., and Kasiviswanathan, S. P. Subsampled Rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1226-1235. PMLR, 2019.

Xiang, Y. AutoDP: Automating differential privacy computation, 2020. URL https://github.com/yuxiangw/autodp.

Xie, L., Lin, K., Wang, S., Wang, F., and Zhou, J. Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739, 2018.