# privacy_amplification_by_mixing_and_diffusion_mechanisms__6318e505.pdf Privacy Amplification by Mixing and Diffusion Mechanisms Borja Balle Gilles Barthe MPI for Security and Privacy IMDEA Software Institute Marco Gaboardi Boston University Joseph Geumlek University of California, San Diego A fundamental result in differential privacy states that the privacy guarantees of a mechanism are preserved by any post-processing of its output. In this paper we investigate under what conditions stochastic post-processing can amplify the privacy of a mechanism. By interpreting post-processing as the application of a Markov operator, we first give a series of amplification results in terms of uniform mixing properties of the Markov process defined by said operator. Next we provide amplification bounds in terms of coupling arguments which can be applied in cases where uniform mixing is not available. Finally, we introduce a new family of mechanisms based on diffusion processes which are closed under post-processing, and analyze their privacy via a novel heat flow argument. On the applied side, we generalize the analysis of privacy amplification by iteration in Noisy SGD and show it admits an exponential improvement in the strongly convex case, and study a mechanism based on the Ornstein Uhlenbeck diffusion process which contains the Gaussian mechanism with optimal post-processing on bounded inputs as a special case. 1 Introduction Differential privacy (DP) [15] has arisen in the last decade into a strong de-facto standard for privacypreserving computation in the context of statistical analysis. The success of DP is based, at least in part, on the availability of robust building blocks (e.g., the Laplace, exponential and Gaussian mechanisms) together with relatively simple rules for analyzing complex mechanisms built out of these blocks (e.g., composition and robustness to post-processing). The inherent tension between privacy and utility in practical applications has sparked a renewed interest into the development of further rules leading to tighter privacy bounds. A trend in this direction is to find ways to measure the privacy introduced by sources of randomness that are not accounted for by standard composition rules. Generally speaking, these are referred to as privacy amplification rules, with prominent examples being amplification by subsampling [9, 18, 20, 6, 5, 8, 2, 27], shuffling [16, 10, 3] and iteration [17]. Motivated by these considerations, in this paper we initiate a systematic study of privacy amplification by stochastic post-processing. Specifically, given a DP mechanism M producing (probabilistic) outputs in X and a Markov operator K defining a stochastic transition between X and Y, we are interested in measuring the privacy of the post-processed mechanism K M producing outputs in Y. The standard post-processing property of DP states that K M is at least as private as M. Our goal is to understand under what conditions the post-processed mechanism K M is strictly more private than M. Roughly speaking, this amplification should be non-trivial when the operator K forgets information about the distribution of its input M(D). Our main insight is that, at least when Y = X, 33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada. the forgetfulness of K from the point of view of DP can be measured using similar tools to the ones developed to analyze the speed of convergence, i.e. mixing, of the Markov process associated with K. In this setting, we provide three types of results, each associated with a standard method used in the study of convergence for Markov processes. In the first place, Section 3 provides DP amplification results for the case where the operator K satisfies a uniform mixing condition. These include standard conditions used in the analysis of Markov chains on discrete spaces, including the well-known Dobrushin coefficent and Doeblin s minorization condition [19]. Although in principle uniform mixing conditions can also be defined in more general non-discrete spaces [12], most Markov operators of interest in Rd do not exhibit uniform mixing since the speed of convergence depends on how far apart the initial inputs are. Convergence analyses in this case rely on more sophisticated tools, including Lyapunov functions [22], coupling methods [21] and functional inequalities [1]. Following these ideas, Section 4 investigates the use of coupling methods to quantify privacy amplification by post-processing under Rényi DP [23]. These methods apply to operators given by, e.g., Gaussian and Laplace distributions, for which uniform mixing does not hold. Results in this section are intimately related to the privacy amplification by iteration phenomenon studied in [17] and can be interpreted as extensions of their main results to more general settings. In particular, our analysis unpacks the shifted Rényi divergence used in the proofs from [17] and allows us to easily track the effect of iterating arbitrary noisy Lipschitz maps. As a consequence, we show an exponential improvement on the privacy amplification by iteration of Noisy SGD in the strongly convex case which follows from applying this generalized analysis to strict contractions. Our last set of results concerns the case where K is replaced by a family of operators (Pt)t 0 forming a Markov semigroup [1]. This is the natural setting for continuous-time Markov processes, and includes diffusion processes defined in terms of stochastic differential equations [25]. In Section 5 we associate (a collection of) diffusion mechanisms (Mt)t 0 to a diffusion semigroup. Interestingly, these mechanisms are, by construction, closed under post-processing in the sense that Ps Mt = Ms+t. We show the Gaussian mechanism falls into this family since Gaussian noise is closed under addition and also present a new mechanism based on the Ornstein-Uhlenbeck process which has better mean squared error than the standard Gaussian mechanism (and matches the error of the optimally postprocessed Gaussian mechanism with bounded inputs). Our main result on diffusion mechanisms provides a generic Rényi DP guarantee based on an intrinsic notion of sensitivity derived from the geometry induced by the semigroup. The proof relies on a heat flow argument reminiscent of the analysis of mixing in diffusion processes based on functional inequalities [1]. 2 Background We start by introducing notation and concepts that will be used throughout the paper. We write [n] = {1, . . . , n}, a b = min{a, b} and [a]+ = max{a, 0}. Probability. Let X = (X, Σ, λ) be a measurable space with sigma-algebra Σ and base measure λ. We write P(X) to denote the set of probability distributions on X. Given a probability distribution µ P(X) and a measurable event E X we write µ(E) = P[X E] for a random variable X µ, denote its expectation under f : X Rd by E[f(X)], and can get back its distribution as µ = Law(X). Given two distributions µ, ν (or, in general, arbitrary measures) we write µ ν to denote that µ is absolutely continuous with respect to ν, in which case there exists a Radon-Nikodym derivative dµ dν . We shall reserve the notation pµ = dµ dλ to denote the density of µ with respect to the base measure. We also write C(µ, ν) to denote the set of couplings between µ and ν; i.e. π C(µ, ν) is a distribution on P(X X) with marginals µ and ν. The support of a distribution is supp(µ). Markov Operators. We will use K(X, Y) to denote the set of Markov operators K : X P(Y) defining a stochastic transition map between X and Y and satisfying that x 7 K(x)(E) is measurable for every measurable E Y. Markov operators act on distributions µ P(X) on the left through (µK)(E) = R K(x)(E)µ(dx), and on functions f : Y R on the right through (Kf)(x) = R f(y)K(x, dy), which can also be written as (Kf)(x) = E[f(X)] with X K(x). The kernel of a Markov operator K (with respect to λ) is the function k(x, ) = d K(x) dλ associating with x the density of K(x) with respect to a fixed measure. Divergences. A popular way to measure dissimilarity between distributions is to use Csiszár divergences Dφ(µ ν) = R φ( dµ dν )dν, where φ : R+ R is convex with φ(1) = 0. Taking 2|u 1| yields the total variation distance TV(µ, ν), and the choice φ(u) = [u eε]+ with ε 0 gives the hockey-stick divergence Deε, which satisfies Deε(µ ν) = Z dµ + dν = Z [pµ eεpν]+dλ = sup E X (µ(E) eεν(E)) . It is easy to check that ε 7 Deε(µ ν) is monotonically decreasing and D1 = TV. All Csiszár divergences satisfy joint convexity D((1 γ)µ1 + γµ2 (1 γ)ν1 + γν2) (1 γ)D(µ1 ν1) + γD(µ2 ν2) and the data processing inequality D(µK νK) D(µ ν) for any Markov operator K. Rényi divergences1 are another way to compare distributions. For α > 1 the Rényi divergence of order α is defined as Rα(µ ν) = 1 α 1 log R ( dµ dν )αdν, and also satisfies the data processing inequality. Finally, to measure similarity between µ, ν P(Rd) we sometimes use the -Wasserstein distance: W (µ, ν) = inf π C(µ,ν) inf{w 0 : X Y w holds almost surely for (X, Y ) π} . Differential Privacy. A mechanism M : Dn P(X) is a randomized function that takes a dataset D Dn over some universe of records D and returns a (sample from) distribution M(D). We write D D to denote two databases differing in a single record. We say that M satisfies2 (ε, δ)-DP if sup D D Deε(M(D) M(D )) δ [15]. Furthermore, we say that M satisfies (α, ϵ)-RDP if sup D D Rα(M(D) M(D )) ϵ [23]. 3 Amplification From Uniform Mixing We start our analysis of privacy amplification by stochastic post-processing by considering settings where the Markov operator K satisfies one of the following uniform mixing conditions. Definition 1. Let K K(X, Y) be a Markov operator, γ [0, 1] and ε 0. We say that K is: (1) γ-Dobrushin if supx,x TV(K(x), K(x )) γ, (2) (γ, ε)-Dobrushin if supx,x Deε(K(x) K(x )) γ, (3) γ-Doeblin if there exists a distribution ω P(Y) such that K(x) (1 γ)ω for all x X, (4) γ-ultra-mixing if for all x, x X we have K(x) K(x ) and d K(x) d K(x ) 1 γ. Most of these conditions arise in the context of mixing analyses in Markov chains. In particular, the Dobrushin condition can be tracked back to [13], while Doeblin s condition was introduced earlier [14] (see also [24]). Ultra-mixing is a strengthening of Doeblin s condition used in [12]. The (γ, ε)-Dobrushin is, on the other hand, new and is designed to be a generalization of Dobrushin tailored for amplification under the hockey-stick divergence. It is not hard to see that Dobrushin s is the weakest among these conditions, and in fact we have the implications summarized in Figure 1 (see Lemma 9). This explains why the amplification bounds in the following result are increasingly stronger, and in particular why the first two only provide amplification in δ, while the last two also amplify the ε parameter. Theorem 1. Let M be an (ε, δ)-DP mechanism. For a given Markov operator K, the post-processed mechanism K M satisfies: (1) (ε, δ )-DP with δ = γδ if K is γ-Dobrushin, (2) (ε, δ )-DP with δ = γδ if K is (γ, ε)-Dobrushin with3 ε = log(1 + eε 1 δ ), (3) (ε , δ )-DP with ε = log(1 + γ(eε 1)) and δ = γ(1 eε ε(1 δ)) if K is γ-Doeblin, (4) (ε , δ )-DP with ε = log(1 + γ(eε 1)) and δ = γδeε ε if K is γ-ultra-mixing. A few remarks about this result are in order. First we note that (2) is stronger than (1) since the monotonicity of hockey-stick divergences implies TV = D1 De ε. Also note how in the results above we always have ε ε, and in fact the form of ε is the same as obtained under amplification 1Rényi divergences do not belong to the family of Csiszár divergences. 2This divergence characterization of DP is due to [4]. 3We take the convention ε = whenever δ = 0, in which case the (γ, )-Dobrushin condition is obtained with respect to the divergence D (µ ν) = µ(supp(µ) \ supp(ν)). γ-ultra-mixing γ-Doeblin γ-Dobrushin (γ, ε)-Dobrushin Figure 1: Implications between mixing conditions Mixing Condition Local DP Condition γ-Dobrushin (0, γ)-LDP (γ, ε)-Dobrushin (ε, γ)-LDP γ-Doeblin Blanket condition4 γ-ultra-mixing (log 1 1 γ , 0)-LDP Table 1: Relation between mixing conditions and local DP by subsampling when, e.g., a γ-fraction of the original dataset is kept. This is not a coincidence since the proofs of (3) and (4) leverage the overlapping mixtures technique used to analyze amplification by subsampling in [2]. However, we note that for (3) we can have δ > 0 even with δ = 0. In fact the Doeblin condition only leads to an amplification in δ if γ δeε (1 δ)(eε 1). We conclude this section by noting that the conditions in Definition 1, despite being quite natural, might be too stringent for proving amplification for DP mechanisms on, say, Rd. One way to see this is to interpret the operator K : X P(Y) as a mechanism and to note that the uniform mixing conditions on K can be rephrased in terms of local DP (LDP) [18] properties (see Table 1 for property4 translations)where the supremum is taken over any pair of inputs (instead of neighboring ones). This motivates the results on next section, where we look for finer conditions to prove amplification by stochastic post-processing. 4 Amplification From Couplings In this section we turn to coupling-based proofs of amplification by post-processing under the Rényi DP framework. Our first result is a measure-theoretic generalization of the shift-reduction lemma in [17] which does not require the underlying space to be a normed vector space. The main differences in our proof are to use explicit couplings instead of the shifted Rényi divergence which implicitly relies on the existence of a norm (through the use of W ), and replace the identity U + W W = U between random variables which depends on the vector-space structure with a transport operators Hπ and Hπ which satisfy µHπ Hπ = µ in a general measure-theoretic setting. Given a coupling π C(µ, ν) with µ, ν P(X), we construct a transport Markov operator Hπ : X P(X) with kernel5 hπ(x, y) = pπ(x,y) pµ(x) , where pπ = dπ dλ λ and pµ = dµ dλ. It is immediate to verify from the definition that Hπ is a Markov operator satisfying the transport property µHπ = ν (see Lemma 16). Theorem 2. Let α 1, µ, ν P(X) and K K(X, Y). For any distribution ω P(X) and coupling π C(ω, µ) we have Rα(µK νK) Rα(ω ν) + sup x supp(ν) Rα((HπK)(x) K(x)) . (1) Note that this result captures the data-processing inequality for Rényi divergences since taking ω = µ and the identity coupling yields Rα(µK νK) Rα(µ ν). The next examples illustrate the use of this theorem to obtain amplification by operators corresponding to the addition of Gaussian and Laplace noise. Example 1 (Iterated Gaussian). We can show that (1) is tight and equivalent to the shift-reduction lemma [17] on Rd by considering the simple scenario of adding Gaussian noise to the output of a Gaussian mechanism. In particular, suppose M(D) = N(f(D), σ2 1I) for some function f with global L2-sensitivity and the Markov operator K is given by K(x) = N(x, σ2 2I). The post-processed mechanism is given by (K M)(D) = N(f(D), (σ2 1 + σ2 2)I), which satisfies (α, α 2 2(σ2 1+σ2 2))-RDP. We now show how this result also follows from Theorem 2. Given two datasets D D we write µ = M(D) = N(u, σ2 1I) and ν = M(D ) = N(v, σ2 1I) with u v . We 4The blanket condition is a necessary condition for LDP introduced in [3] to analyze privacy amplification by shuffling. 5Here we use the convention 0 take ω = N(w, σ2 1I) for some w to be determined later, and couple ω and µ through a translation τ = u w, yielding a coupling π with pπ(x, y) exp( x w 2 2σ2 1 )I[y = x + τ] and a transport operator Hπ with kernel hπ(x, y) = I[y = x + τ]. Plugging these into (1) we get Rα(µK νK) α w v 2 2σ2 1 + sup x Rd Rα(K(x + τ) K(x)) = α σ2 1 + u w 2 Finally, taking w = θu + (1 θ)v with θ = (1 + σ2 2 σ2 1 ) 1 yields Rα(µK νK) α 2 2(σ2 1+σ2 2). Example 2 (Iterated Laplace). To illustrate the flexibility of this technique, we also apply it to get an amplification result for iterated Laplace noise, in which Laplace noise is added to the output of a Laplace mechanism. We begin by noting a negative result that there is no amplification in the (ε, 0)-DP regime. Lemma 3. Let M(D) = Lap(f(D), λ1) for some function f : D R with global L1-sensitivity and let the Markov operator K be given by K(x) = Lap(x, λ2). The post-processed mechanism (K M) does not achieve (ε, 0)-DP for any ε < max{λ1,λ2}. Note that M achieves ( λ1 , 0)-DP and K(f(D)) achieves ( λ2 , 0)-DP. However, the iterated Laplace mechanism K M above still offers additional privacy in the relaxed RDP setting. An application of (1) allows us to identify some of this improvement. Recall from [23, Corollary 2] that M satisfies (α, 1 α 1 log gα( λ1 ))-RDP with gα(z) = α 2α 1 exp(z(α 1)) + α 1 2α 1 exp( zα). As in Example 1, we take ω = Lap(w, λ1) for some w to be determined later, and couple ω and µ through a translation τ = u w. Through (1) we obtain Rα(µK νK) 1 α 1 log gα + sup x R Rα(K(x + τ) K(x)) = 1 α 1 log gα In the simple case where λ1 = λ2, an amplification result is observed from the log-convexity of gα, since gα(a)gα(b) gα(a + b). When λ1 = λ2, certain values of w still result in amplification, but they depend nontrivially on α. However, we also observe that this improvement vanishes as α , since the necessary convexity also vanishes. In the limit, the lowest upper bound offered by (1) for R (which reduces to (ε, 0)-DP) matches the max{λ1,λ2} result of Lemma 3. Example 3 (Lipschitz Kernel). As a warm-up for the results in Section 4.1, we now re-work Example 1 with a slightly more complex Markov operator. Suppose ψ is an L-Lipschitz map6 and let K(x) = N(ψ(x), σ2 2I). Taking M to be the Gaussian mechanism from Example 1, we will show that the post-processed mechanism K M satisfies (α, α 2 2σ2 )-RDP with σ2 = σ2 1 + σ2 2 L2 . To prove this bound, we instantiate the notation from Example 1, and use the same coupling strategy to obtain Rα(µK νK) α σ2 1 + sup x Rd ψ(x + τ) ψ(x) 2 σ2 1 + L2 u w 2 where the second inequality uses the Lipschitz property. As before, the result follows from taking w = θu + (1 θ)v with θ = (1 + σ2 2 L2σ2 1 ) 1. This example shows that we get amplification (i.e. σ2 > σ2 1) for any L < and σ2 > 0, although the amount of amplification decreases as L grows. On the other hand, for L < 1 the amplification is stronger than just adding Gaussian noise (Example 1). 4.1 Amplification by Iteration in Noisy Projected SGD with Strongly Convex Losses Now we use Theorem 2 and the computations above to show that the proof of privacy amplification by iteration [17, Theorem 22] can be extended to explicitly track the Lipschitz coefficients in a noisy iteration algorithm. In particular, this allows us to show an exponential improvement on the rate of privacy amplification by iteration in noisy SGD when the loss is strongly convex. To obtain this result we first provide an iterated version of Theorem 2 in Rd with Lipschitz Gaussian 6That is, ψ(x) ψ(y) L x y for any pair x, y. kernels. This version of the analysis introduces an explicit dependence on the W distances along an interpolating path between the initial distributions µ, ν P(Rd) which could later be optimized for different applications. In our view, this helps to clarify the intuition behind the previous analysis of amplification by iteration. Theorem 4. Let α 1, µ, ν P(Rd) and let K Rd be a convex set. Suppose K1, . . . , Kr K(Rd, Rd) are Markov operators where Yi Ki(x) is obtained as7 Yi = ΠK(ψi(x)+Zi) with Zi N(0, σ2I), where the maps ψi : K Rd are L-Lipschitz for all i [r]. For any µ0, µ1, . . . , µr P(Rd) with µ0 = µ and µr = ν we have Rα(µK1 Kr νK1 Kr) αL2 i=1 L2(r i)W (µi, µi 1)2 . (2) Furthermore, if L 1 and W (µ, ν) = , then Rα(µK1 Kr νK1 Kr) α 2Lr+1 Note how taking L = 1 in the bound above we obtain α 2 2rσ2 = O(1/r), which matches [17, Theorem 1]. On the other hand, for L strictly smaller than 1, the analysis above shows that the amplification rate is O(Lr+1/r) as a consequence of the maps ψi being strict contractions, i.e. ψi(x) ψi(y) < x y . For L > 1 this result is not useful since the sum will diverge; however, the proof could easily be adapted to handle the case where each ψi is Li-Lipschitz with some Li > 1 and some Li < 1. We now apply this result to improve the per-person privacy guarantees of noisy projected SGD (Algorithm 1) in the case where the loss function is smooth and strongly convex. Algorithm 1: Noisy Projected Stochastic Gradient Descent Noisy Proj SGD(D, ℓ, η, σ, ξ0) Input: Dataset D = (z1, . . . , zn), loss function ℓ: K D R, learning rate η, noise parameter σ, initial distribution ξ0 P(K) Sample x0 ξ0 for i [n] do xi ΠK (xi 1 η( xℓ(xi 1, zi) + Z)) with Z N(0, σ2I) return xn A function f : K Rd R defined on a convex set is β-smooth if it is continuously differentiable and f is β-Lipschitz, i.e., f(x) f(y) β x y , and is ρ-strongly convex if the function g(x) = f(x) ρ 2 x 2 is convex. When we say that a loss function ℓ: K D R satisfies a property (e.g. smoothness) we mean the property is satisfied by ℓ( , z) for all z D. Furthermore, we recall from [17] that a mechanism M : Dn X satisfies (α, ϵ)-RDP at index i if Rα(M(D) M(D )) ϵ holds for any pair of databases D and D differing on the ith coordinate. Theorem 5. Let ℓ: K D R be a C-Lipschitz, β-smooth, ρ-strongly convex loss function. If η 2 β+ρ, then Noisy Proj SGD(D, ℓ, η, σ, ξ0) satisfies (α, αϵi)-RDP at index i, where ϵn = 2C2 σ2 and ϵi = 2C2 (n i)σ2 (1 2ηβρ β+ρ ) n i+1 2 for 1 i n 1. Since [17, Theorem 23] shows that for smooth Lipschitz loss functions the guarantee at index i of Noisy Proj SGD is given by ϵi = O( C2 (n i)σ2 ), our result provides an exponential improvement in the strongly convex case. This implies, for example, that using the technique in [17, Corollary 31] one can show that, in the strongly convex setting, running Θ(log(d)) additional iterations of Noisy Proj SGD on public data is enough to attain (up to constant factors) the same optimization error as non-private SGD while providing privacy for all individuals. 5 Diffusion Mechanisms Now we go beyond the analysis from previous sections and simultaneously consider a family of Markov operators P = (Pt)t 0 indexed by a continuous parameter t and satisfying the semigroup 7Here ΠK(x) = arg miny K x y denotes the projection operator onto the convex set K Rd. property Pt Ps = Pt+s. Such P is called a Markov semigroup and can be used to define a family of output perturbation mechanisms M f t (D) = Pt(f(D)) which are closed under post-processing by P in the sense that Ps M f t = M f t+s. The semigroup property greatly simplifies the analysis of privacy amplification by post-processing, since, for example, if we show that M f t satisfies (α, ϵ(t))-RDP, then this immediately provides RDP guarantees for any post-processing of Mt by any number of operators in P. The main result of this section provides such privacy analysis for mechanisms arising from symmetric diffusion Markov semigroups in Euclidean space. We will show this class includes the well-known Gaussian mechanism, and also identify another interesting mechanism in this class arising from the Ornstein-Uhlenbeck diffusion process. Roughly speaking, a diffusion Markov semigroup P = (Pt)t 0 on Rd corresponds to the case where Xt Pt(x) defines a Markov process (Xt)t 0 arising from a (time-homogeneous Itô) stochastic differential equation (SDE) of the form X0 = x and d Xt = u(Xt)dt + v(Xt)d Wt, where Wt is a standard d-dimensional Wiener process, and the drift u : Rd Rd and diffusion v : Rd Rd d coefficients satisfy appropriate regularity assumptions.8 In this paper, however, we shall follow [1] and take a more abstract approach to Markov diffusion semigroups. We synthesize this approach by making a number of hypotheses on P that we discuss after introducing two core concepts from the theory of Markov semigroups. In the context of a Markov semigroup P, the action of the Markov operators Pt on functions can be used to define the generator L of the semigroup as the operator given by Lf = d dt(Ptf)|t=0. In particular, for a diffusion semigroup arising from the SDE d Xt = u(Xt)dt + v(Xt)d Wt it is well-known that one can write the generator as Lf = u, f + 1 2 vv , H(f) , where H(f) is the Hessian of f and the second term is a Frobenius inner product. Using the generator one also defines the so-called carré du champ operator Γ(f, g) = 1 2(L(fg) f Lg g Lf). This operator is bilinear and non-negative in the sense that Γ(f) Γ(f, f) 0. The carré du champ operator operator can be interpreted as a device to measure how far L is from being a first-order differential operator, since, e.g., if L = P xi then L(fg) = f Lg + g Lf and therefore Γ(f, g) = 0. The operator Γ can also be related to notions of curvature/contractivity of the underlying semigroup [1]. Below we illustrate these concepts with the example of Brownian motion; but first we formally state our assumptions on the semigroup. Assumption 1. Suppose the Markov semigroup P = (Pt)t 0 K(Rd, Rd) satisfies the following: (1) There exists a unique non-negative invariant measure λ; that is, λPt = λ for all t 0. When the invariant measure is finite we normalize it to be a probability measure. (2) The operators Pt admit a symmetric kernel pt(x, y) = pt(y, x) with respect to the invariant measure. Equivalently, the invariant measure λ is reversible for the Markov process Xt. (3) The generator L satisfies the diffusion property Lφ(f) = φ (f)Lf + φ (f)Γ(f) for any differentiable φ : R R. This is a chain rule property saying that L is a second-order differential operator without constant terms. Example 4 (Brownian Motion). The simplest diffusion process is the Brownian motion given by the simple SDE d Xt = 2d Wt., which corresponds to the semigroup P given by Pt(x) = N(x, 2t). In this case, the mechanism M f t (D) = Pt(f(D)) is a Gaussian mechanism with variance σ2 = 2t and therefore satisfies (α, α 2 4t )-RDP, where is the global L2-sensitivity of f. A direct substitution with u = 0 and v = 2I shows that the semigroup s generator is the standard Laplacian in Rd, L = 2 = Pd i=1 2 x2 i , and a simple calculation yields the expression Γ(f, g) = f, g for the carré du champ operator. Now we check that P satisfies the conditions in Assumption 1. First, we recall that Brownian motion has the Lebesgue measure λ on Rd as its unique invariant measure; this happens to be a non-finite measure. With respect to λ, the semigroup has kernel pt(x, y) exp( x y 2 4t ) which is clearly symmetric. Finally, we use the chain rule for the gradient to verify that Lf = 2φ(f) = (φ (f) f) = φ (f) f, f + φ (f) 2f = φ (f)Γ(f) + φ (f)Lf . Now we turn to the main result of this section, which provides a privacy analysis for the diffusion mechanism M f t associated with an arbitrary symmetric diffusion Markov semigroup. The key insight 8The details are not relevant here since we work directly with semigroups satisfying Assumption 1. We refer to [25] for details. behind this result is that the carré du champ operator of the semigroup provides a measure Λ(t) of intrinsic sensitivity for the mechanism M f t defined as: Λ(t) = sup D D t κf(D),f(D )(s)ds , where κx,x (t) = sup y Rd Γ log pt(x, y) Theorem 6. Let f : Dn Rd and let P = (Pt)t 0 by a Markov semigroup on Rd satisfying Assumption 1. If the mechanism M f t (D) = Pt(f(D)) has intrinsic sensitivity Λ(t), then it satisfies (α, αΛ(t))-RDP for any α > 1 and t > 0. Example 5 (Brownian Motion, Continued). To illustrate the use of Theorem 6 we show how it can be used to recover the privacy guarantees of the Gaussian mechanism through its connection with Brownian motion. We let P be the semigroup from Example 4 and start by using Γ(f) = f 2 to compute κx,x (t) as follows: Γ log pt(x, y) x y 2 x y 2 Now we use R t 1 s2 ds = 1 t and 2 = sup D D f(D) f(D ) 2 to see that the mechanism associated with P has intrinsic sensitivity Λ(t) = 2 4t , yielding the privacy guarantee from Example 4. 5.1 The Ornstein-Uhlenbeck Mechanism Beyond Brownian motion, another well-known diffusion process is the Ornstein-Uhlenbeck process with parameters θ, ρ > 0 given by the SDE d Xt = θXtdt + 2ρd Wt. This diffusion process is associate with the semigroup P = (Pt)t 0 given by Pt(x) = N(e θtx, ρ2 θ (1 e 2θt)I). One interpretation of this diffusion process is to think of Xt as a Brownian motion with variance ρ2 applied to a mean reverting flow that pulls a particle towards the origin at a rate θ. In particular, the mechanism M f t (D) is given by releasing e θtf(D) + N(0, ρ2 θ (1 e 2θt)). Taking the limit t one sees that the (unique) invariant measure of P is the Gaussian distribution λ = N(0, ρ2 θ I). From the SDE characterization of this process it is easy to check that its generator is Lf = ρ2 2f θ x, f and the associated carré du champ operator is Γ(f, g) = ρ2 f, g . Thus, P satisfies conditions (1) and (3) in Assumption 1. To check the symmetry condition we apply a change of measure to the Gaussian density pt(x, y) of Pt with respect to the Lebesgue measure to get its density w.r.t. λ: pt(x, y) = pt(x, y) pλ(y) exp θ y e θtx 2 2ρ2(1 e 2θt) 2ρ2 = exp θ x 2 2eθt x, y + y 2 2ρ2(e2θt 1) where pλ is the density of λ w.r.t. the Lebesgue measure. Thus, Theorem 6 yields the following. Corollary 7. Let f : Dn Rd have global L2-sensitivity and P = (Pt)t 0 be the Ornstein Uhlenbeck semigroup with parameters θ, ρ. For any α > 1 and t > 0 the mechanism M f t (D) = Pt(f(D)) satisfies (α, αΛ(t))-RDP with Λ(t) = θ 2 2ρ2(e2θt 1). The Ornstein-Uhlenbeck mechanism is not an unbiased mechanism since E[M f t (D)] = e θtf(D). This bias is the reason why the privacy guarantee in Corollary 7 exhibits a rate O(e 2θt), while, for example, the Brownian motion mechanism only exhibits a rate O(t 1). In particular, the Ornstein Uhlenbeck mechanism achieves its privacy not only by introducing noise, but also by shrinking f(D) towards a data-independent point (the origin in this case); this effectively corresponds to reducing the sensitivity of f from to e θt . This provides a way to trade-off variance and bias in the mean-squared error (MSE) incurred by privately releasing f(D) in a similar way that can be achieved by post-processing the Gaussian mechanism when f(D) is known to be bounded. To formalize this result we define the mean squared error EOU(θ, ρ, t) of the Ornstein-Uhlenbeck mechanism with parameters θ, ρ at time t, which is given by: EOU(θ, ρ, t) E[ f(D) M f t (D) 2] = (1 e θt)2 f(D) 2 + dρ2 θ (1 e 2θt) . (4) Similarly, we define EGM(θ, ρ, t) as the mean squared error of a Gaussian mechanism with the same privacy guarantees as M f t with parameters θ, ρ. In particular, we have EGM(θ, ρ, t) = d σ2, where σ2 ρ2(e2θt 1) θ (cf. Corollary 7). We also note the post-processed Gaussian mechanism (PGM) D 7 β(f(D) + N(0, σ2I)) which multiplies the output by a scalar β optimized to minimize the MSE under the condition f(D) R yields error EPGM(θ, ρ, t) EGM(θ, ρ, t)(1 + d σ2 Theorem 8. Suppose f : Dn Rd has global L2-sensitivity and satisfies sup D f(D) R. If θR2 4dρ2 then we have EOU(θ,ρ,t) EGM(θ,ρ,t) 1 for all t 0 and limt EOU(θ,ρ,t) EGM(θ,ρ,t) = 0. In particular, taking θ = log 1 + d 2 2ϵR2 and ρ2 = θ 2 2ϵ(e2θ 1) with ϵ > 0, the mechanism M f t satisfies (α, αϵ)-RDP at time t = 1 and we have EOU(θ,ρ,1) EGM(θ,ρ,1) 1 + d 2 This result not only shows that the Ornstein-Uhlenbeck mechanism is uniformly better than the Gaussian mechanism for any level of privacy, but also shows that in this mechanism the error always stays bounded and can attain the same level of error as the Gaussian mechanism with optimal postprocessing. To see this note that with the choices of parameters made in the second statement give EGM(θ, ρ, 1) = d 2 2ϵ and therefore EOU(θ, ρ, 1) d 2R2 2ϵR2+d 2 , which behaves like O(R2) with constant and either ϵ 0 or d . 6 Conclusion We have undertaken a systematic study of amplification by post-processing. Our results yield improvements over recent work on amplification by iteration, and introduce a new Ornstein-Uhlenbeck mechanism which is more accurate than the Gaussian mechanism. In the future it would be interesting to study applications of amplification by post-processing. One promising application is Hierarchical Differential Privacy, where information is released under increasingly strong privacy constraints (e.g. to a restricted group within a company, globally within a company, and finally to outside parties). Acknowledgements MG was partially supported by NSF grant CCF-1718220. [1] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diffusion operators, volume 348. Springer Science & Business Media, 2013. [2] Borja Balle, Gilles Barthe, and Marco Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, 3-8 December 2018, Montréal, Canada., pages 6280 6290, 2018. [3] Borja Balle, James Bell, Adrià Gascón, and Kobbi Nissim. The privacy blanket of the shuffle model. Co RR, abs/1903.02837, 2019. [4] Gilles Barthe and Federico Olmedo. Beyond differential privacy: Composition theorems and relational logic for f-divergences between probabilistic programs. In International Colloquium on Automata, Languages, and Programming, pages 49 60. Springer, 2013. [5] Amos Beimel, Hai Brenner, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. Bounds on the sample complexity for private learning and private data release. Machine learning, 94(3):401 437, 2014. [6] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Characterizing the sample complexity of private learners. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 97 110. ACM, 2013. [7] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends R in Machine Learning, 8(3-4):231 357, 2015. [8] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. Differentially private release and learning of threshold functions. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 634 649. IEEE, 2015. [9] Kamalika Chaudhuri and Nina Mishra. When random sampling preserves privacy. In Annual International Cryptology Conference, pages 198 213. Springer, 2006. [10] Albert Cheu, Adam D. Smith, Jonathan Ullman, David Zeber, and Maxim Zhilyaev. Distributed differential privacy via shuffling. In Advances in Cryptology - EUROCRYPT 2019 - 38th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Darmstadt, Germany, May 19-23, 2019, Proceedings, Part I, pages 375 403, 2019. [11] Joel E Cohen, Yoh Iwasa, Gh Rautu, Mary Beth Ruskai, Eugene Seneta, and Gh Zbaganu. Relative entropy under mappings by stochastic matrices. Linear algebra and its applications, 179:211 235, 1993. [12] P Del Moral, M Ledoux, and L Miclo. On contraction properties of Markov kernels. Probability theory and related fields, 126(3):395 420, 2003. [13] Roland L Dobrushin. Central limit theorem for nonstationary Markov chains. I. Theory of Probability & Its Applications, 1(1):65 80, 1956. [14] W. Doeblin. Sur les proprietes asymptotiques de mouvements rÉgis par certains types de chaÎnes simples (suite et fin). Bulletin mathématique de la Société Roumaine des Sciences, 39(2):3 61, 1937. [15] Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography, pages 265 284. Springer, 2006. [16] Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2468 2479. SIAM, 2019. [17] Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 521 532. IEEE, 2018. [18] Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793 826, 2011. [19] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017. [20] Ninghui Li, Wahbeh Qardaji, and Dong Su. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, pages 32 33. ACM, 2012. [21] Torgny Lindvall. Lectures on the coupling method. Courier Corporation, 2002. [22] Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012. [23] Ilya Mironov. Rényi differential privacy. In 30th IEEE Computer Security Foundations Symposium, CSF 2017, Santa Barbara, CA, USA, August 21-25, 2017, pages 263 275, 2017. [24] Esa Nummelin. General irreducible Markov chains and non-negative operators, volume 83. Cambridge University Press, 2004. [25] Bernt Øksendal. Stochastic differential equations. In Stochastic differential equations, pages 65 84. Springer, 2003. [26] Maxim Raginsky. Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels. IEEE Transactions on Information Theory, 62(6):3355 3389, 2016. [27] Yu-Xiang Wang, Borja Balle, and Shiva Kasiviswanathan. Subsampled rényi differential privacy and analytical moments accountant. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.