# Exactly Minimax-Optimal Locally Differentially Private Sampling

Hyun-Young Park (School of Electrical Engineering, KAIST) phy811@kaist.ac.kr
Shahab Asoodeh (Department of Computing and Software, McMaster University) asoodeh@mcmaster.ca
Si-Hyeon Lee (School of Electrical Engineering, KAIST) sihyeon@kaist.ac.kr

**Abstract.** The sampling problem under local differential privacy has recently been studied with potential applications to generative models, but a fundamental analysis of its privacy-utility trade-off (PUT) remains incomplete. In this work, we define the fundamental PUT of private sampling in the minimax sense, using the f-divergence between the original and sampling distributions as the utility measure. We characterize the exact PUT for both finite and continuous data spaces under some mild conditions on the data distributions, and propose sampling mechanisms that are universally optimal for all f-divergences. Our numerical experiments demonstrate the superiority of our mechanisms over baselines, in terms of theoretical utilities for finite data spaces and of empirical utilities for continuous data spaces.

## 1 Introduction

Privacy leakage is a pressing concern in the realm of machine learning (ML), spurring extensive research into privacy protection techniques [1-4]. Among these, local differential privacy (LDP) [5] stands out as a standard model and has been deployed in industry, e.g., by Google [6, 7], Apple [8], and Microsoft [9]. In the LDP framework, individual clients randomize their data on their own devices and send it to a potentially untrusted aggregator for analysis, thus preventing the user data from being inferred. However, this perturbation inherently diminishes data utility. Consequently, the central challenge in privacy mechanism design lies in optimizing utility while preserving the desired level of privacy protection. This goal involves characterizing the optimal balance between the privacy parameter and utility, referred to as the privacy-utility trade-off (PUT). The analysis of the PUT and the design of privacy mechanisms have been actively conducted for various settings of statistical inference and machine learning [10-27].

Most research in this field focuses on scenarios where each client has only a single data point. However, there are increasingly more applications where each client has a large local dataset with multiple data records. One can formulate the privacy requirement in these cases by assuming that clients have datasets of the same size, generated independently from an underlying distribution [28-34]. This probabilistic assumption, however, restricts practical flexibility. The work [35] explored a scenario where clients have large datasets that may vary in size and seek to privately release another dataset that closely resembles their original dataset. In this scenario, local datasets can be represented by an empirical distribution, allowing each client to be seen as holding a probability distribution and generating a private sample from it. This setup, which is called private sampling, is the main focus of this paper.

Figure 1: Original Gaussian ring distribution and the sampling distributions of the baseline [35] and our proposed mechanism for privacy budget $\epsilon = 0.5$. The implementation details are in Appendix F.
Private sampling has recently found applications in the private fine-tuning of large language models [36]. Additionally, private sampling is connected to the challenge of learning private generative models, a topic often explored in the central DP model [37-44]. While there exist studies on private generative models within the local model [45-48], all of these works assume a single data point per client. Very recently, the work [49] considered a setup where each client holds a probability distribution, but in the different context of query estimation.

The private sampling mechanism in [35] can be described as follows. Initially, given a probability distribution P representing a local dataset, the mechanism assumes a fixed reference distribution. It then constructs what is termed the *relative mollifier*, a closed ball centered around the reference distribution with a radius equal to half of the privacy budget, within the space of probability distributions. Subsequently, the mechanism computes the projection of P onto the relative mollifier using the Kullback-Leibler (KL) divergence. This projected distribution serves as the sampling distribution for generating a sample (see Section 2.3 for more details). However, this mechanism has a notable shortcoming: the sampling distribution is only locally optimal within the relative mollifier. This, in turn, implies that the optimality of the sampling distribution depends on the choice of the reference distribution. A more fundamentally intriguing goal would be to formulate and characterize the PUT without such an ambiguity in the choice of reference distribution. In this paper, we establish the optimality of locally private sampling in the minimax sense and identify optimal private samplers. Our primary contributions are summarized as follows:

- The fundamental PUT of private sampling is rigorously defined in terms of minimax utility, which is commonly used in the literature on private estimation [10, 11, 13, 30]. We impose some minimal assumptions on clients' distributions, as in [38] (which studies sampling under central DP, a weaker privacy model than the local model [50]). As the utility measure, we use the f-divergence [51, 52] between the original and the sampling distributions, which includes KL divergence, total variation distance, squared Hellinger distance, and χ²-divergence as special cases.
- We characterize the exact PUT for both finite and continuous data spaces, and present optimal sampling mechanisms achieving the PUT. Surprisingly, our mechanisms are universally optimal under any choice of f-divergence as the utility measure.
- We numerically demonstrate that our proposed mechanism outperforms the baseline method presented in [35]. Specifically, for finite data spaces, we derive a closed-form expression for the utility of both our mechanism and the baseline, allowing for an exact comparison of their utilities. In the case of continuous data spaces, a closed-form expression for the baseline is not available, so we use empirical utility for the comparison. Figure 1 illustrates that our proposed mechanism outputs a distribution closer to the original than the baseline does.

All code for experiments and figures is attached as supplementary material and can be found in the online repository¹. The instructions to reproduce the results in the paper are in Appendix H.

¹ https://github.com/phy811/Optimal-LDP-Sampling

## 2 Problem Formulation

### 2.1 Notations and preliminaries
**Notations.** For a sample space $\mathcal{X}$, let $\mathcal{P}(\mathcal{X})$ be the set of all probability distributions on $\mathcal{X}$. For each $n \in \mathbb{N}$, let $\mathcal{C}(\mathbb{R}^n)$ be the set of all continuous probability distributions on $\mathbb{R}^n$. For each positive integer $k \in \mathbb{N}$, let $[k] := \{1, 2, \ldots, k\}$. For a subset $A \subseteq \mathcal{X}$, $\mathbb{1}_A : \mathcal{X} \to \{0,1\}$ denotes the indicator function, defined as $\mathbb{1}_A(x) = 1$ for $x \in A$ and $\mathbb{1}_A(x) = 0$ for $x \notin A$. Also, for $s_1 \le s_2$ and $x \in \mathbb{R}$, let $\mathrm{clip}(x; s_1, s_2) := \max\{s_1, \min\{s_2, x\}\}$. We refer to Appendix A for the rigorous measure-theoretic assumptions underlying the paper.

**f-divergence.** For a convex function $f : (0,\infty) \to \mathbb{R}$ satisfying $f(1) = 0$ and two probability distributions $P, Q \in \mathcal{P}(\mathcal{X})$ on the same sample space $\mathcal{X}$, let $D_f(P\|Q)$ denote the f-divergence [51, 52]. The general definition of the f-divergence is given in Appendix B. For $\mathcal{X} = \mathbb{R}^n$ and $P, Q \in \mathcal{C}(\mathbb{R}^n)$ with $P \ll Q$ (that is, $P(A) = 0$ whenever $Q(A) = 0$), it is defined as

$$D_f(P\|Q) = \int_{x:\,q(x)>0} q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx, \quad (1)$$

where $p, q$ are the pdfs of $P, Q$, respectively, and we define $f(0) = \lim_{x\to0^+} f(x) \in (-\infty, \infty]$. For finite $\mathcal{X}$, we can replace the integral with a sum and replace $p, q$ with $P, Q$. Several well-known distance measures between distributions are examples of f-divergences with different convex functions. For instance, KL divergence (relative entropy), total variation distance, squared Hellinger distance, and χ²-divergence are f-divergences with $f(x) = x\log x$, $f(x) = |x-1|/2$, $f(x) = (1-\sqrt{x})^2$, and $f(x) = x^2 - 1$, respectively. Two properties of general f-divergences are key to this work. First, $D_f(P\|Q) \ge 0$, and equality holds if $P = Q$. Furthermore, we have

$$D_f(P\|Q) \le M_f := \lim_{x\to0^+} \left(f(x) + x f(1/x)\right), \quad (2)$$

where equality holds if $P$ and $Q$ are mutually singular (that is, they have disjoint supports). For a more comprehensive list of such f-divergences and their properties, we refer the readers to [52]. We denote the KL divergence and the total variation distance as $D_{\mathrm{KL}}(P\|Q)$ and $D_{\mathrm{TV}}(P, Q)$, respectively. We note that the total variation distance is in fact a metric on $\mathcal{P}(\mathcal{X})$.

### 2.2 System model

Suppose a client has access to a distribution $P \in \mathcal{P}(\mathcal{X})$ over a sample space $\mathcal{X}$, and wants to produce a sample in $\mathcal{X}$ which looks as if it were drawn from $P$ and to send it to a data curator. We assume that there are some constraints on the possible data distribution $P$, so that $P$ is restricted to lie in some subset $\mathcal{P} \subseteq \mathcal{P}(\mathcal{X})$, and both the client and the curator know $\mathcal{X}$ and $\mathcal{P}$. However, it is required that a sampled element does not leak the privacy of the original distribution $P$. For this purpose, the client and the curator agree on a private sampling mechanism $Q$, which is a conditional distribution from $\mathcal{P}$ to $\mathcal{X}$. After that, the client produces a sample following the distribution $Q(\cdot|P)$. To guarantee privacy protection, we require $Q$ to satisfy local differential privacy (LDP) [10, 35].

**Definition 2.1.** Let $\epsilon > 0$. A private sampling mechanism $Q$ is said to satisfy $\epsilon$-LDP, or $Q$ is an $\epsilon$-LDP mechanism, if for any $P, P' \in \mathcal{P}$ and $A \subseteq \mathcal{X}$, we have

$$Q(A|P) \le e^\epsilon Q(A|P'). \quad (3)$$

For convenience, for each $P \in \mathcal{P}$, let $Q(P) \in \mathcal{P}(\mathcal{X})$ denote the distribution of $X$ given $P$ through $Q$, that is, $Q(P)(A) = Q(A|P)$ for each $A \subseteq \mathcal{X}$. In this way, we equivalently view $Q$ as a function $Q : \mathcal{P} \to \mathcal{P}(\mathcal{X})$. Let $\mathcal{Q}_{\mathcal{X},\mathcal{P},\epsilon}$ denote the set of all $\epsilon$-LDP mechanisms $Q : \mathcal{P} \to \mathcal{P}(\mathcal{X})$. As the utility loss of private sampling, we use the f-divergence between the original distribution and the sampling distribution, $D_f(P\|Q(P))$. Since the sampling procedure can be performed across many clients who may have different data distributions, we measure the utility loss of $Q$ by the worst-case f-divergence,

$$R_f(Q) = \sup_{P\in\mathcal{P}} D_f\left(P\,\|\,Q(P)\right). \quad (4)$$
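To make the utility measure concrete, the following minimal NumPy sketch (ours, not from the paper's repository; all function names are our own) evaluates $D_f(P\|Q)$ for the three generators used throughout the paper, and the worst-case loss $R_f(Q)$ over a finite family of client distributions.

```python
import numpy as np

# Generators f and their limits f(0) = lim_{x->0+} f(x) (Section 2.1).
DIVS = {
    "KL":         (lambda x: x * np.log(x),        0.0),
    "TV":         (lambda x: np.abs(x - 1) / 2,    0.5),
    "Hellinger2": (lambda x: (1 - np.sqrt(x))**2,  1.0),
}

def f_divergence(P, Q, name):
    """D_f(P||Q) for pmfs on [k], assuming Q(x) > 0 for all x."""
    f, f0 = DIVS[name]
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    ratio = np.maximum(P / Q, 1e-300)   # guard: the P(x)=0 branch uses f0 anyway
    return float(np.sum(np.where(P > 0, Q * f(ratio), Q * f0)))

def worst_case_loss(mechanism, clients, name):
    """Empirical version of R_f(Q) in (4) over a finite set of client pmfs."""
    return max(f_divergence(P, mechanism(P), name) for P in clients)
```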
Given $\mathcal{X}$, $\mathcal{P}$, $\epsilon$, and $f$, our goal is to find the smallest possible worst-case f-divergence,

$$R(\mathcal{X}, \mathcal{P}, \epsilon, f) = \inf_{Q \in \mathcal{Q}_{\mathcal{X},\mathcal{P},\epsilon}} R_f(Q), \quad (5)$$

and to find a mechanism $Q \in \mathcal{Q}_{\mathcal{X},\mathcal{P},\epsilon}$ achieving it. We say that $Q \in \mathcal{Q}_{\mathcal{X},\mathcal{P},\epsilon}$ is optimal for $(\mathcal{X}, \mathcal{P}, \epsilon)$ under $D_f$ if $R_f(Q) = R(\mathcal{X}, \mathcal{P}, \epsilon, f)$.

### 2.3 Related work

The work most closely related to ours is [35]. Its system model is the same as in this paper, except for the formulation of the PUT. They first fix a reference probability distribution $Q_0 \in \mathcal{P}(\mathcal{X})$, and only consider mechanisms $Q$ satisfying $e^{-\epsilon/2} Q_0(A) \le Q(A|P) \le e^{\epsilon/2} Q_0(A)$ for all $P \in \mathcal{P}$ and $A \subseteq \mathcal{X}$. In other words, let $\mathcal{M}_{\epsilon,Q_0} = \{Q \in \mathcal{P}(\mathcal{X}) : e^{-\epsilon/2} Q_0(A) \le Q(A) \le e^{\epsilon/2} Q_0(A),\ \forall A \subseteq \mathcal{X}\}$. Then, they only consider $Q$ such that $Q(P) \in \mathcal{M}_{\epsilon,Q_0}$. Note that this guarantees $\epsilon$-LDP. For each $P \in \mathcal{P}$, they sought to find the $Q \in \mathcal{M}_{\epsilon,Q_0}$ which minimizes $D_{\mathrm{KL}}(P\|Q)$, and set this $Q$ to be $Q(P)$. First, they claimed to find a closed-form expression of such a minimizer $Q$ for finite $\mathcal{X}$, given by

$$Q(x) = \mathrm{clip}\left(P(x)/r_P;\ e^{-\epsilon/2} Q_0(x),\ e^{\epsilon/2} Q_0(x)\right), \quad (6)$$

where $r_P > 0$ is a constant depending on $P$ ensuring $\sum_{x\in\mathcal{X}} Q(x) = 1$. Second, they presented an algorithm, called Mollified Boosted Density Estimation (MBDE), to approximate the optimal solution for continuous $\mathcal{X}$. However, the utility varies with the choice of the reference distribution $Q_0$, and they left open the question of choosing the best $Q_0$ to achieve the best performance, both in practice and in theory. Moreover, we found that the closed form in (6) is incomplete: for some $(P, Q_0)$, there may be no $r_P > 0$ such that the RHS of (6) sums to one. As an example, when $P$ and $Q_0$ are point masses at different points, we can easily see that the sum of (6) equals $e^{-\epsilon/2}$ for any $r_P > 0$.

## 3 Main Results

### 3.1 Optimal private sampling over finite space

First, we consider the finite case, where $\mathcal{X} = [k]$ for some $k \in \mathbb{N}$. A natural setup for $\mathcal{P}$ is $\mathcal{P} = \mathcal{P}([k])$, i.e., there is no restriction on the client distribution $P \in \mathcal{P}([k])$. In this case, we completely characterize the optimal worst-case f-divergence $R(\mathcal{X}, \mathcal{P}, \epsilon, f)$ and find an optimal private sampling mechanism. Surprisingly, for each $k \in \mathbb{N}$ and $\epsilon > 0$, we find a single mechanism which is universally optimal for every f-divergence.

**Theorem 3.1.** For each $k \in \mathbb{N}$, $\epsilon > 0$, and f-divergence $D_f$, we have

$$R([k], \mathcal{P}([k]), \epsilon, f) = \frac{e^\epsilon}{e^\epsilon + k - 1}\, f\!\left(\frac{e^\epsilon + k - 1}{e^\epsilon}\right) + \frac{k-1}{e^\epsilon + k - 1}\, f(0). \quad (7)$$

Moreover, the mechanism $Q^*_{k,\epsilon}$ constructed below satisfies $\epsilon$-LDP and is optimal for $(\mathcal{X} = [k], \mathcal{P} = \mathcal{P}([k]), \epsilon)$ under any $D_f$:

$$Q^*_{k,\epsilon}(x|P) = \max\left\{\frac{1}{r_P} P(x),\ \frac{1}{e^\epsilon + k - 1}\right\}, \quad x \in [k],\ P \in \mathcal{P}([k]), \quad (8)$$

where $r_P > 0$ is a constant depending on $P$ so that $\sum_{x=1}^k Q^*_{k,\epsilon}(x|P) = 1$. Furthermore, $r_P$ can be chosen such that $1 \le r_P \le (e^\epsilon + k - 1)/e^\epsilon$.

By definition, we have $Q^*_{k,\epsilon}(x|P) \ge \frac{1}{e^\epsilon+k-1}$ for all $x \in \mathcal{X}$. This also implies that $Q^*_{k,\epsilon}(x|P) = 1 - \sum_{x'\in\mathcal{X}\setminus\{x\}} Q^*_{k,\epsilon}(x'|P) \le 1 - \frac{k-1}{e^\epsilon+k-1} = \frac{e^\epsilon}{e^\epsilon+k-1}$. Hence, $\frac{1}{e^\epsilon+k-1} \le Q^*_{k,\epsilon}(x|P) \le \frac{e^\epsilon}{e^\epsilon+k-1}$, which clearly implies that $Q^*_{k,\epsilon}$ satisfies $\epsilon$-LDP.

**Behaviors of the optimal mechanism.** Let us observe some behaviors of the proposed mechanism with respect to the system parameters, whose formal proofs are in Appendix E. We visualize how the mechanism $Q^*_{k,\epsilon}$ works for different $\epsilon$ in Figure 2. Here, we write $R$ to mean $R([k], \mathcal{P}([k]), \epsilon, f)$ for simplicity. If $f(0) = \infty$, then $R = \infty$, which means that $R_f(Q) = \infty$ for any $\epsilon$-LDP sampling mechanism $Q$. (Such a phenomenon occurs for general $(\mathcal{X}, \mathcal{P})$ whenever $\mathcal{P}$ contains two mutually singular distributions.) Hence, from now on in this paragraph, we assume $f(0) < \infty$. We can observe that $R$ is decreasing in $\epsilon$ and increasing in $k$. For a fixed $k$, we have $R \to 0$ as $\epsilon \to \infty$, which makes sense since $\epsilon \to \infty$ corresponds to the non-private case.
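The mechanism of Theorem 3.1 is straightforward to implement. Below is a minimal sketch (ours; the function names are our own) of (8), which finds $r_P$ by the bisection described in the remarks after Figure 2, using the endpoints from the "Furthermore" part of the theorem.

```python
import numpy as np

def optimal_finite_mechanism(P, eps, iters=60):
    """Sampling pmf Q*_{k,eps}(.|P) of (8): max{P(x)/r, 1/(e^eps+k-1)},
    with r found by bisection so that the pmf sums to one."""
    P = np.asarray(P, dtype=float)
    k = len(P)
    floor = 1.0 / (np.exp(eps) + k - 1)
    g = lambda r: np.maximum(P / r, floor)               # g_r(x) in the text
    lo, hi = 1.0, (np.exp(eps) + k - 1) / np.exp(eps)    # sum g_lo >= 1 >= sum g_hi
    for _ in range(iters):                               # bisection on r
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if g(mid).sum() >= 1 else (lo, mid)
    q = g((lo + hi) / 2)
    return q / q.sum()   # tiny renormalization guards against bisection tolerance

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(10))   # a random client pmf on [10]
sample = rng.choice(10, p=optimal_finite_mechanism(P, eps=1.0))
```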
In the opposite regime, as $\epsilon \to 0$, we have $Q^*_{k,\epsilon}(x|P) \to 1/k$ for every $P \in \mathcal{P}([k])$ and $x \in [k]$; that is, $Q^*_{k,\epsilon}(P)$ tends to the uniform distribution over $[k]$ for every $P \in \mathcal{P}([k])$. This can also be observed in Figure 2.

Figure 2: A visualization of the mechanism $Q^*_{k,\epsilon}$: the original pmf $P(x)$ on $[5]$ together with $Q^*_{k,\epsilon}(x|P)$ for $\epsilon \in \{1.0, 0.5, 0.1\}$ (x-axis: data value $x$; y-axis: probability).

**Remarks about the constant $r_P$.** The value of $r_P$ may not be unique, but the mechanism $Q^*_{k,\epsilon}$ does not depend on the choice of $r_P$. To see this, let us fix $P$, and let $g_r(x) = \max\left\{\frac{1}{r} P(x), \frac{1}{e^\epsilon+k-1}\right\}$. Suppose that $\sum_{x=1}^k g_r(x) = \sum_{x=1}^k g_{r'}(x) = 1$ for $r \le r'$. Since $g_r(x) \ge g_{r'}(x)$ for each $x \in [k]$, the equality $\sum_{x=1}^k g_r(x) = \sum_{x=1}^k g_{r'}(x)$ implies that $g_r(x) = g_{r'}(x)$ for all $x \in [k]$. Hence $Q^*_{k,\epsilon}$ is uniquely determined. Since $r \mapsto \sum_{x=1}^k g_r(x)$ is non-increasing and continuous, we can use the bisection method to find $r_P$. The "Furthermore" part of the theorem statement precisely means that for $r = 1$ and $r = (e^\epsilon+k-1)/e^\epsilon$, the value of $\sum_{x=1}^k g_r(x)$ is at least and at most 1, respectively, so that we can perform the bisection method with these two initial endpoints to find $r$ such that $\sum_{x=1}^k g_r(x) = 1$.

**Comparison with the previous work [35].** The expression of the optimal mechanism in (8) is similar to (6), the expression of the KL divergence projection onto $\mathcal{M}_{\epsilon,Q_0}$ derived by Husain et al. [35]. Since $\frac{1}{e^\epsilon+k-1} \le Q^*_{k,\epsilon}(x|P) \le \frac{e^\epsilon}{e^\epsilon+k-1}$, we can alternatively write $Q^*_{k,\epsilon}(x|P) = \mathrm{clip}\left(\frac{1}{r_P}P(x);\ \frac{1}{e^\epsilon+k-1},\ \frac{e^\epsilon}{e^\epsilon+k-1}\right)$. Hence, our optimal mechanism can be viewed as an instance of a generalized version of (6), where $Q_0$ is a positive measure, not necessarily a probability measure summing to one, given by $Q_0(x) = \frac{e^{\epsilon/2}}{e^\epsilon+k-1}$. A natural question is whether $Q^*_{k,\epsilon}(P)$ is a projection of $P$ onto $\mathcal{M}_{\epsilon,Q_0}$, that is, whether $Q^*_{k,\epsilon}(P)$ is a minimizer of $D_{\mathrm{KL}}(P\|Q)$ among $Q \in \mathcal{M}_{\epsilon,Q_0}$, where $\mathcal{M}_{\epsilon,Q_0}$ is defined as in Section 2.3. As we shall discuss in Section 3.3, this statement is, quite surprisingly, true even when we replace $D_{\mathrm{KL}}$ with any other f-divergence. However, our analysis is more involved, as we need to show the optimality of the proposed mechanism over all other possible mechanisms, including minimizers with respect to other choices of $Q_0$. Also, in Section 5, we compare the worst-case f-divergence of our optimal mechanism with that of the mechanism proposed in [35], which restricts $Q_0$ to be a probability distribution.

### 3.2 Optimal private sampling over continuous space

Next, we consider the continuous case, where $\mathcal{X} = \mathbb{R}^n$ for some $n \in \mathbb{N}$. Some natural setups for $\mathcal{P}$ are (i) $\mathcal{P} = \mathcal{P}(\mathbb{R}^n)$, or (ii) $\mathcal{P} = \mathcal{C}(\mathbb{R}^n)$. We can also consider more restrictive but still reasonable setups, such as (iii) $\mathcal{P}$ being the set of empirical distributions supported on some non-empty open subset of $\mathbb{R}^n$, or (iv) $\mathcal{P}$ being the set of continuous distributions on $[-1,1]^n$ with smooth pdf and zero mean. However, we show that for a general class of $\mathcal{P}$ including these four cases, any $\epsilon$-LDP sampling mechanism has worst-case f-divergence equal to the maximum value $M_f$ of the f-divergence defined in (2). In the following proposition, $\mathcal{X}$ may be a general sample space, not necessarily $\mathbb{R}^n$ or a finite space. The proof is in Appendix D.

**Proposition 3.2.** Suppose that $\mathcal{P}$ contains infinitely many distributions which are pairwise mutually singular. Then, for any $\epsilon > 0$, any $\epsilon$-LDP mechanism $Q \in \mathcal{Q}_{\mathcal{X},\mathcal{P},\epsilon}$, and any f-divergence, we have $R_f(Q) = M_f$, where $M_f$ is defined in (2).

Hence, we need to consider sufficiently regular but practical setups for $\mathcal{P}$.
**Our setup.** In this subsection, when $P, Q \in \mathcal{C}(\mathbb{R}^n)$, the corresponding lowercase letters $p, q$ denote their pdfs. In this paper, we consider the case that

$$\mathcal{P} = \mathcal{P}_{c_1,c_2,h} := \{P \in \mathcal{C}(\mathbb{R}^n) : c_1 h(x) \le p(x) \le c_2 h(x),\ \forall x \in \mathbb{R}^n\} \quad (9)$$

for some pre-known $h : \mathcal{X} \to [0,\infty)$ such that $\int_{\mathbb{R}^n} h(x)\,dx < \infty$ and $c_2 > c_1 \ge 0$. Some of the sampling tasks and generative models in the literature [53, 35] assume that the set of possible data distributions $\mathcal{P}$ satisfies $\mathcal{P} \subseteq \mathcal{P}_{c_1,c_2,h}$ for some $c_1, c_2, h$ satisfying the aforementioned condition; hence (9) is a moderate assumption. One example is sampling from a Gaussian mixture, which is one of the canonical sampling tasks in the literature [53, 35]. Suppose that $\mathcal{P}$ consists of Gaussian mixtures, where each Gaussian has mean within the unit ball centered at the origin and unit covariance. That is, $\mathcal{P} = \left\{\sum_{i=1}^k \lambda_i N(\mu_i, I_n) : k \in \mathbb{N},\ \lambda_i \ge 0,\ \sum_{i=1}^k \lambda_i = 1,\ \|\mu_i\| \le 1\right\}$, where $I_n$ is the identity matrix of size $n \times n$. In this case, we can observe that $\mathcal{P} \subseteq \mathcal{P}_{0,1,h}$ for $h(x) = (2\pi)^{-n/2}\exp\left(-(\max(0, \|x\| - 1))^2/2\right)$, and it can be easily shown that $\int_{\mathbb{R}^n} h(x)\,dx < \infty$.

Without loss of generality, we may assume the following normalization condition on $c_1, c_2, h, \epsilon$:

$$\int_{\mathbb{R}^n} h(x)\,dx = 1, \quad c_1 < 1 < c_2, \quad c_2 > c_1 e^\epsilon. \quad (10)$$

The reason is as follows. First, if any one of the three inequalities $\int_{\mathbb{R}^n} h(x)\,dx > 0$, $c_1\int_{\mathbb{R}^n} h(x)\,dx < 1$, and $c_2\int_{\mathbb{R}^n} h(x)\,dx > 1$ fails, then $\mathcal{P}_{c_1,c_2,h}$ is either an empty set or a singleton consisting of the distribution with pdf $c_1 h(x)$ or $c_2 h(x)$, which makes the problem trivial. Hence we impose all three inequalities. Then, we can normalize $c_1, c_2, h$ so that $\int_{\mathbb{R}^n} h(x)\,dx = 1$ and $c_1 < 1 < c_2$. Furthermore, if $c_2 \le e^\epsilon c_1$, then for any $P_1, P_2 \in \mathcal{P}_{c_1,c_2,h}$, we have $p_1(x)/p_2(x) \le c_2/c_1 \le e^\epsilon$, hence the mechanism $Q$ defined as $Q(P) = P$ for all $P \in \mathcal{P}_{c_1,c_2,h}$ satisfies $\epsilon$-LDP and $R_f(Q) = 0$, so the problem also becomes trivial, giving $R(\mathbb{R}^n, \mathcal{P}_{c_1,c_2,h}, \epsilon, f) = 0$. Hence, we may assume (10).

**Minimax utility and optimal mechanism.** For the aforementioned setup, we can completely characterize $R(\mathcal{X}, \mathcal{P}, \epsilon, f)$ and find a mechanism which is universally optimal for every f-divergence. The formula is similar to the discrete case, with a carefully chosen clipping bound.

**Theorem 3.3.** For each $c_2 > c_1 \ge 0$, $\epsilon > 0$, and $h : \mathbb{R}^n \to [0,\infty)$ satisfying the normalization condition (10), let us define the following constants determined by $c_1, c_2, \epsilon$:

$$b = \frac{c_2 - c_1}{(e^\epsilon - 1)(1 - c_1) + c_2 - c_1}, \quad r_1 = \frac{c_1}{b}, \quad r_2 = \frac{c_2}{b e^\epsilon}. \quad (11)$$

Then, we have

$$R(\mathbb{R}^n, \mathcal{P}_{c_1,c_2,h}, \epsilon, f) = \frac{1 - r_1}{r_2 - r_1}\, f(r_2) + \frac{r_2 - 1}{r_2 - r_1}\, f(r_1). \quad (12)$$

Moreover, the mechanism $Q^*_{c_1,c_2,h,\epsilon}$ constructed below satisfies $\epsilon$-LDP and is optimal for $(\mathcal{X} = \mathbb{R}^n, \mathcal{P} = \mathcal{P}_{c_1,c_2,h}, \epsilon)$ under any $D_f$: for each $P \in \mathcal{P}$, $Q^*_{c_1,c_2,h,\epsilon}(P) =: Q$ is defined as the continuous distribution with pdf

$$q(x) = \mathrm{clip}\left(\frac{1}{r_P}\, p(x);\ b\, h(x),\ b e^\epsilon h(x)\right), \quad (13)$$

where $r_P > 0$ is a constant depending on $P$ so that $\int_{\mathbb{R}^n} q(x)\,dx = 1$. Furthermore, $r_P$ can be chosen such that $r_1 < r_P \le r_2$.

It is also clear that $Q^*_{c_1,c_2,h,\epsilon}$ satisfies $\epsilon$-LDP. Also, we note that $c_1 < b < 1 < be^\epsilon < c_2$ and $0 \le r_1 < 1 < r_2$, which is shown during the proof of Theorem 3.3 in Appendix C. In practical scenarios, it may be hard to know $\mathcal{P} = \mathcal{P}_{c_1,c_2,h}$ exactly, and we may only know $\mathcal{P} \subseteq \mathcal{P}_{c_1,c_2,h}$ for some $c_1, c_2, h$ satisfying the aforementioned conditions. In such a case, we still propose to use $Q^*_{c_1,c_2,h,\epsilon}$, and in Section 5, we numerically show that this proposed mechanism is better than the previously proposed mechanism [35] in terms of the worst-case f-divergence.
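For concreteness, here is a small sketch (ours) computing the constants $b, r_1, r_2$ of (11) and evaluating the minimax value (12); when $r_1 = 0$, the caller must supply $f(0) = \lim_{x\to0^+} f(x)$ explicitly.

```python
import numpy as np

def constants(c1, c2, eps):
    """Constants b, r1, r2 of (11), under the normalization (10)."""
    assert c1 < 1 < c2 and c2 > c1 * np.exp(eps)
    b = (c2 - c1) / ((np.exp(eps) - 1) * (1 - c1) + c2 - c1)
    return b, c1 / b, c2 / (b * np.exp(eps))   # b, r1, r2

def minimax_risk(c1, c2, eps, f, f0=None):
    """RHS of (12); pass f0 = lim_{x->0+} f(x) when r1 = 0."""
    b, r1, r2 = constants(c1, c2, eps)
    f_r1 = f0 if (r1 == 0 and f0 is not None) else f(r1)
    return (1 - r1) / (r2 - r1) * f(r2) + (r2 - 1) / (r2 - r1) * f_r1

# Example: total variation utility for c1 = 0.5, c2 = 2, eps = 0.5.
tv = lambda x: abs(x - 1) / 2
print(minimax_risk(0.5, 2.0, 0.5, tv))
```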
**Behavior of the optimal mechanism.** We also observe some behaviors of the proposed mechanism with respect to the system parameters, whose formal proofs are in Appendix E. Again, we write $R$ to mean $R(\mathbb{R}^n, \mathcal{P}_{c_1,c_2,h}, \epsilon, f)$ for simplicity. For a fixed $(c_1, c_2)$, $R$ is decreasing in $\epsilon$. If $c_1 = 0$ (which implies $r_1 = 0$) and $f(0) = \infty$, then $R = \infty$, which means $R_f(Q) = \infty$ for any $\epsilon$-LDP sampling mechanism $Q$. As for the behavior as $\epsilon \to \infty$ for a fixed $(c_1, c_2)$: if $c_1 > 0$, then for sufficiently large $\epsilon$, we have $c_1 e^\epsilon \ge c_2$, so we fall into the aforementioned trivial case where $R = 0$. If $c_1 = 0$ and $f(0) < \infty$, then as $\epsilon \to \infty$, we have $R \to 0$, which again corresponds to the non-private case.

**Remarks on the constant $r_P$.** For the same reason as in the finite space case, the value of $r_P$ may not be unique, but $Q^*_{c_1,c_2,h,\epsilon}$ does not depend on the choice of $r_P$, and $r_P$ can be found by the bisection method with a numerical integration of (13). Note that the continuity of $r \mapsto \int g_r(x)\,dx$, where $g_r(x) = \mathrm{clip}\left(\frac{1}{r}p(x);\ bh(x),\ be^\epsilon h(x)\right)$, follows from the dominated convergence theorem [54], since we assume $\int_{\mathbb{R}^n} h(x)\,dx < \infty$. The meaning of the "Furthermore" part is also similar to the finite space case; that is, the value of $\int g_r(x)\,dx$ at $r = r_1$ and $r = r_2$ is at least and at most 1, respectively, so that we can perform the bisection method with initial endpoints $(r_1, r_2)$ (when $r = 0$, we define $g_r(x) = be^\epsilon h(x)$ whenever $p(x) > 0$ and $g_r(x) = bh(x)$ whenever $p(x) = 0$). A corner case is that when $r_1 = 0$, the continuity of $r \mapsto \int g_r(x)\,dx$ does not suffice to guarantee the existence of a strictly positive $r$ such that $\int g_r(x)\,dx = 1$. However, in the proof, we actually show that $\int g_{r_1}(x)\,dx = 1$ implies $\int g_r(x)\,dx = 1$ for every $r \in (r_1, r_2]$, which in particular implies that even when $r_1 = 0$, there is a strictly positive $r$ such that $\int g_r(x)\,dx = 1$. This is the reason why we state the strict inequality $r_1 < r_P$.

### 3.3 Proof sketch of the theorems

The full proofs of the main theorems, Theorems 3.1 and 3.3, are presented in Appendix C. There, we present a generalized theorem which includes Theorems 3.1 and 3.3 as special cases, where $\mathcal{X}$ can be a general sample space and $\mathcal{P}$ is defined similarly to the continuous space case. The key idea for the proofs and the proposed mechanisms is to focus on the behavior of $Q(P)$ when $P$ is an extreme case in $\mathcal{P}$. In finite space, point masses are the extreme cases, and for continuous space with $\mathcal{P} = \mathcal{P}_{c_1,c_2,h}$, the cases where $p(x) \in \{c_1 h(x), c_2 h(x)\}$ for all $x \in \mathcal{X}$ are the extreme cases. As implied by the proof, the worst-case f-divergence of the proposed optimal mechanism is attained when $P$ is one of these extreme cases. Such an approach using extreme cases is a frequently used technique in PUT analysis [55-58]. Our proof consists of two parts, the achievability part and the converse part. The achievability part shows that the worst-case f-divergence $R_f(Q^*)$ of our proposed mechanism $Q^*$ is upper-bounded by the RHS of (7) or (12). The converse part shows that $R_f(Q)$ of any $\epsilon$-LDP mechanism $Q$ is lower-bounded by the RHS of (7) or (12). We now briefly describe the proof idea of each part. Here, we omit the subscripts $(k,\epsilon)$ or $(c_1,c_2,h,\epsilon)$ for notational convenience.

**Achievability part.** Let $\mathcal{M} = \{Q^*(P) : P \in \mathcal{P}\}$. For finite space, $\mathcal{M}$ consists of all distributions $Q \in \mathcal{P}([k])$ such that $\frac{1}{e^\epsilon+k-1} \le Q(x) \le \frac{e^\epsilon}{e^\epsilon+k-1}$ for every $x \in [k]$. For continuous space, $\mathcal{M}$ consists of all continuous distributions $Q \in \mathcal{C}(\mathbb{R}^n)$ whose pdf $q$ satisfies $bh(x) \le q(x) \le be^\epsilon h(x)$ for every $x \in \mathbb{R}^n$. We construct a mechanism $Q'$ such that $R_f(Q')$ is upper-bounded by the RHS of (7) or (12). The construction is as follows. First, we set a reference distribution $\mu \in \mathcal{P}(\mathcal{X})$ and a constant $\gamma \in [0,1]$ according to a certain rule specified in Appendix C.
Then, for each given $P$, we generate a private sample by sampling from the original $P$ with probability $\gamma$, and sampling from the reference distribution $\mu$ with probability $1-\gamma$. In other words, we have $Q'(P) = \gamma P + (1-\gamma)\mu$. Our choice of $\mu$ and $\gamma$ makes $Q'(P) \in \mathcal{M}$ for every $P \in \mathcal{P}$, which in particular implies that $Q'$ also satisfies $\epsilon$-LDP. Furthermore, we can find a bound on the ratio of the pmf or pdf of the original distribution to that of the sampling distribution. Then, invoking [59, Theorem 2.1], which bounds f-divergences given bounds on the ratio between pmfs or pdfs, we show that $R_f(Q')$ is upper-bounded by the RHS of (7) or (12). Next, we demonstrate a non-trivial generalization of the main result of [35]: $Q^*(P)$ is the f-divergence projection of $P$ onto $\mathcal{M}$ for every $P \in \mathcal{P}$ and every f-divergence.

**Proposition 3.4.** Assuming the setups of $(\mathcal{X}, \mathcal{P}, \epsilon)$ in either Theorem 3.1 or 3.3, let $Q^*$ denote the proposed mechanism $Q^*_{k,\epsilon}$ or $Q^*_{c_1,c_2,h,\epsilon}$. Also, let $\mathcal{M}$ be as described above. Then, for every $P \in \mathcal{P}$ and every f-divergence $D_f$, we have $D_f(P\|Q^*(P)) = \inf_{Q\in\mathcal{M}} D_f(P\|Q)$.

Notice that this proposition differs from [35] in that it holds for all general f-divergences (as opposed to only the KL divergence) and general sample spaces, be they discrete or continuous. This result immediately yields $D_f(P\|Q^*(P)) \le D_f(P\|Q'(P))$, and hence $R_f(Q^*) \le R_f(Q') \le$ (RHS of (7) or (12)). We remark that, combined with the converse part (described below), this implies that $Q'$ is also optimal in our minimax sense. Nevertheless, $Q^*$ outperforms $Q'$ in that $D_f(P\|Q^*(P)) \le D_f(P\|Q'(P))$ for every $P \in \mathcal{P}$.

**Remark 3.5.** For the finite case, $\mu$ is the uniform distribution over $[k]$ and $\gamma$ is chosen in such a way that $Q'$ satisfies $\epsilon$-LDP tightly. In this case, an alternative way to implement $Q'$ is as follows. For each given $P$, we first sample from $P$ to get a raw sample and then apply the $k$-ary randomized response [60] to it.

**Converse part.** For each extreme $P$, let $A_P$ be the "high probability set", defined as $A_P = \{x \in \mathcal{X} : P(x) = 1\}$ for finite space and $A_P = \{x \in \mathcal{X} : p(x) = c_2 h(x)\}$ for continuous space. Then, using the data processing inequality of the f-divergence [61], we lower-bound $D_f(P\|Q(P))$ by the f-divergence between the distribution of $\mathbb{1}_{A_P}(X)$ for $X \sim P$ and that for $X \sim Q(P)$. Such a lower bound becomes a decreasing function of $Q(A_P|P)$ in a certain range. Then, we seek an upper bound on $\inf_P Q(A_P|P)$ over extreme $P$, which gives a lower bound on $R_f(Q)$. This involves a novel combinatorial argument. We perform a "packing of $u$ copies of $\mathcal{X}$" by $t$ subsets $A_1, \ldots, A_t$ with $A_i = A_{P_i}$ for some extreme $P_i$ and appropriately chosen $t$ and $u$. Then, we decompose the RHS of $u = \sum_{i=1}^u Q(\mathcal{X}|P_i)$ by an appropriate partition of $\mathcal{X}$ involving the $A_i$'s and use the definition of $\epsilon$-LDP to find an upper bound on $\inf_P Q(A_P|P)$.

## 4 Discussions on the Proposed Mechanism

### 4.1 Effect of the $r_P$ approximation error

In practice, it may not be possible to find the exact value of $r_P$ such that the sum or integral of the RHS of (8) or (13) is 1. For the case of continuous space as in Theorem 3.3, one way to implement the proposed mechanism $Q^*_{c_1,c_2,h,\epsilon}$ in practice is as follows. First, we fix parameters $\delta_1 \in [0,1)$ and $\delta_2 \ge 0$ that quantify the error tolerance. For a given $P$, we define $g_r(x) = \mathrm{clip}\left(\frac{1}{r}p(x);\ bh(x),\ be^\epsilon h(x)\right)$ and find $r_P > 0$ such that $\int_{\mathbb{R}^n} g_{r_P}(x)\,dx \in [1-\delta_1, 1+\delta_2]$; we delineated a numerical algorithm for this task in Section 3.2, based on the bisection method and a numerical integration method. Then, we get a private sample by sampling from the distribution with pdf $\hat{q}(x) = g_{r_P}(x) / \int_{\mathbb{R}^n} g_{r_P}(x)\,dx$.
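A minimal one-dimensional sketch (ours) of this implementation, representing $p$ and $h$ by their values on a grid and using trapezoidal integration; the grid-based integration is our simplification for illustration, not the paper's prescription.

```python
import numpy as np

def trapz(y, xs):
    """Trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum((y[:-1] + y[1:]) / 2 * np.diff(xs)))

def private_pdf_on_grid(p, h, xs, eps, c1, c2, delta=1e-4, iters=60):
    """Section 4.1 procedure (1D sketch): bisect for r such that the integral
    of g_r lies within [1 - delta, 1 + delta], then return the normalized pdf."""
    b = (c2 - c1) / ((np.exp(eps) - 1) * (1 - c1) + c2 - c1)
    r1, r2 = c1 / b, c2 / (b * np.exp(eps))
    g = lambda r: np.clip(p / r, b * h, b * np.exp(eps) * h)   # g_r as in (13)
    lo, hi = max(r1, 1e-12), r2        # integral(g_lo) >= 1 >= integral(g_hi)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if abs(trapz(g(mid), xs) - 1) <= delta:
            break
        lo, hi = (mid, hi) if trapz(g(mid), xs) >= 1 else (lo, mid)
    q_hat = g((lo + hi) / 2)
    return q_hat / trapz(q_hat, xs)    # normalized q-hat; sample from it, e.g., by inverse CDF
```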
For the finite case as in Theorem 3.1, we can implement the mechanism in the same way, except replacing the integral with the sum. It is important to note that $\hat{q}(x) \in \left[\frac{bh(x)}{1+\delta_2},\ \frac{be^\epsilon h(x)}{1-\delta_1}\right]$, indicating that the resulting $Q^*_{k,\epsilon}$ and $Q^*_{c_1,c_2,h,\epsilon}$ satisfy $\left(\epsilon + \log\frac{1+\delta_2}{1-\delta_1}\right)$-LDP as opposed to $\epsilon$-LDP. Thus, the above implementation yields $\epsilon$-LDP if it is used to implement $Q^*_{k,\epsilon'}$ or $Q^*_{c_1,c_2,h,\epsilon'}$, with $\epsilon' = \epsilon - \log\frac{1+\delta_2}{1-\delta_1}$ and sufficiently small $\delta_1, \delta_2$ such that $\epsilon' > 0$.

### 4.2 Continuity of the proposed mechanism

In some practical scenarios, the client may not have full access to their distribution $P$. One example is that the client can only access samples from $P$. In such a case, the client may first estimate the true distribution, and then perturb the estimated distribution through the optimal mechanism. The question is how much the perturbation using the estimated distribution deviates from that using the true distribution. To answer this, we show that the proposed mechanism satisfies a pointwise Lipschitz property with respect to the total variation distance, where the Lipschitz constant is closely related to the factor $r_P$ introduced in Theorems 3.1 and 3.3.

**Proposition 4.1.** Assuming the setups of $(\mathcal{X}, \mathcal{P}, \epsilon)$ in either Theorem 3.1 or 3.3, let $Q^*$ denote the proposed mechanism $Q^*_{k,\epsilon}$ or $Q^*_{c_1,c_2,h,\epsilon}$. Then, for any $P, P' \in \mathcal{P}$, we have

$$D_{\mathrm{TV}}(Q^*(P), Q^*(P')) \le \frac{2}{\max(r_P, r_{P'})}\, D_{\mathrm{TV}}(P, P'), \quad (14)$$

where $r_P > 0$ is as in Theorem 3.1 or 3.3.

This guarantees that for each given true $P$ and given $\delta > 0$, whenever the approximated $P'$ satisfies $D_{\mathrm{TV}}(P, P') \le \delta r_P / 2$, the perturbed distribution $Q^*(P')$ satisfies $D_{\mathrm{TV}}(Q^*(P), Q^*(P')) \le \delta$. From a theoretical perspective, this proposition implies that $Q^*$ is continuous when $\mathcal{P}$ and $\mathcal{P}(\mathcal{X})$ are endowed with the metric topology induced by the total variation distance. The proof of Proposition 4.1 is in Appendix D.

## 5 Numerical Results

In this section, we numerically compare the worst-case f-divergence of our proposed mechanism with that of the previously proposed sampling mechanism. To the best of our knowledge, the only work on private sampling under LDP is [35], hence we set the baseline as the mechanism proposed in [35]. In all cases, we perform the comparison across three canonical f-divergences, KL divergence, total variation distance, and squared Hellinger distance, as well as across five values of $\epsilon$: 0.1, 0.5, 1, 2, and 5.

### 5.1 Comparison for finite data space

In this subsection, we compare the mechanisms in the finite space, $\mathcal{X} = [k]$ and $\mathcal{P} = \mathcal{P}([k])$. As mentioned in Sections 2.3 and 3.1, the baseline mechanism has a hyper-parameter, a reference probability distribution $Q_0 \in \mathcal{P}(\mathcal{X})$. We set the baseline as a generalized f-divergence projection onto the relative mollifier. That is, for each given f-divergence, we set the baseline to satisfy $Q(P) \in \arg\min_{Q\in\mathcal{M}_{\epsilon,Q_0}} D_f(P\|Q)$, where $\mathcal{M}_{\epsilon,Q_0}$ is defined in Section 2.3. As expected by symmetry, for any $f$, choosing $Q_0$ to be the uniform distribution minimizes the worst-case f-divergence $R_f(Q)$ among all choices of $Q_0$ for the baseline. Also, even though we do not obtain a closed-form expression of $Q(x|P)$ for the baseline, we do obtain the value of $R_f(Q)$ when $Q_0$ is the uniform distribution. The proof of this fact, together with the precise value of $R_f(Q)$ for uniform $Q_0$, is in Appendix F.1. Hence, we always set $Q_0$ to be the uniform distribution in the results for the baseline. Since we have the precise values of $R_f(Q)$ for both our proposed mechanism and the baseline, we plot these values in Figure 3. For simplicity, we only provide the plot for $k = 10$.
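For reference, the proposed mechanism's curves in Figure 3 are direct evaluations of the closed form (7); the following small sketch (ours) reproduces those values for $k = 10$.

```python
import numpy as np

# Closed form (7) for the proposed mechanism's worst-case f-divergence.
def R_finite(k, eps, f, f0):
    w = np.exp(eps) / (np.exp(eps) + k - 1)   # w = e^eps / (e^eps + k - 1)
    return w * f(1 / w) + (1 - w) * f0        # f0 = lim_{x->0+} f(x)

divs = {"KL": (lambda x: x * np.log(x), 0.0),
        "TV": (lambda x: abs(x - 1) / 2, 0.5),
        "H2": (lambda x: (1 - np.sqrt(x)) ** 2, 1.0)}
for name, (f, f0) in divs.items():
    print(name, [round(R_finite(10, e, f, f0), 4) for e in (0.1, 0.5, 1, 2, 5)])
```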
More plots for other values of $k$ can be found in Appendix G. As shown in the figure, the proposed mechanism has lower worst-case f-divergence than the baseline for all choices of f-divergence and $\epsilon$ in the experiment, with a significant gap in the medium privacy regime $\epsilon \in [0.5, 2]$.

Figure 3: Theoretical worst-case f-divergences of the proposed and previously proposed baseline mechanisms (with uniform $Q_0$) over finite space ($k = 10$), as a function of the privacy budget $\epsilon \in \{0.1, 0.5, 1.0, 2.0, 5.0\}$ (Left: KL divergence, Center: Total variation distance, Right: Squared Hellinger distance).

### 5.2 Comparison for 1D Gaussian mixture

In this subsection, we conduct an experiment to compare the mechanisms when the client distributions are Gaussian mixtures over the real line $\mathcal{X} = \mathbb{R}$, which is an instance of the continuous space case. We consider the case where each client has a Gaussian mixture distribution in $\mathbb{R}$, where each Gaussian has a mean bounded by 1 in absolute value and unit variance. To avoid an arbitrarily large number of Gaussian distributions being mixed, we set an upper bound $K$ on the number of Gaussian distributions mixed per client. Also, to make the numerical integration tractable, we truncate the domain of the distributions to lie inside the interval $[-4, 4]$. Unlike the finite space case, there is no known closed-form expression of the worst-case f-divergence for the mechanism in [35]. The set of Gaussian mixtures is not exactly of the form $\mathcal{P} = \mathcal{P}_{c_1,c_2,h}$, hence our proposed mechanism also lacks a known closed-form expression of the worst-case f-divergence. Hence, we instead compare the mechanisms by an empirical worst-case f-divergence. For an experiment, we randomly construct $N$ Gaussian mixture distributions $P_1, P_2, \ldots, P_N \in \mathcal{P}$, where each $P_j$ is generated independently according to rules specified in Appendix F.2. After that, we plot the value of the empirical maximum f-divergence $\max_{j\in[N]} D_f(P_j\|Q(P_j))$ for the baseline and our proposed $Q$. For the baseline mechanism, we use MBDE with the same hyperparameter setup as [35, Section 5], except for a slight modification of the reference distribution to account for the truncation of the domain. The implementation details are provided in Appendix F.2. In Figure 4, we present the results for $N = 100$ and $K = 10$. We can see that the proposed mechanism has much lower worst-case f-divergence than the baseline for all choices of f-divergence and $\epsilon$.

Figure 4: Empirical worst-case f-divergences of the proposed and baseline mechanisms over 100 experiments of 1D Gaussian mixture, as a function of the privacy budget $\epsilon \in \{0.1, 0.5, 1.0, 2.0, 5.0\}$ (Left: KL divergence, Center: Total variation distance, Right: Squared Hellinger distance).

## 6 Conclusion

In this paper, we characterized the optimal privacy-utility trade-off for private sampling under LDP and found the optimal private sampling mechanism in terms of the minimax f-divergence between original and sampling distributions, for both finite and continuous data spaces. Compared to the previous work [35] based on the relative mollifier with an arbitrarily chosen reference distribution, our work characterizes the PUT without dependency on external information other than the original distribution, and the mechanism we found is shown to be universally optimal under any f-divergence. For future work, there may be other $\mathcal{P}$ and other measures of utility more appropriate to practical scenarios, which are not handled in this paper.
For example, the f-divergence may be an inappropriate utility loss because it depends only on the σ-algebra structure and does not consider additional information about the geometry of $\mathcal{X}$, such as an underlying metric on $\mathcal{X}$. Using utility measures involving the geometry, such as the Wasserstein distance [62, 63, 49], may be more appropriate for some scenarios. Also, one could consider the Bayesian approach instead of the worst-case approach. Furthermore, we only consider the task of releasing a single sample per client in this paper; the case of releasing multiple samples per client is another direction. The limitations and broader impacts of this work are discussed in Appendices I and J, respectively.

## Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for their valuable discussions and comments, particularly regarding the identification of an alternative optimal mechanism, the advantage of the proposed mechanism over the alternative, and the effect of the approximation error on $r_P$. This work was supported in part by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (MSIT) under Grant 2022R1A2C2092151 and in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the 6G Cloud Research and Education Open Hub (IITP-2024-RS-2024-00428780) grant funded by the Korea government (MSIT).

## References

[1] Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable Extraction of Training Data from (Production) Language Models. arXiv preprint arXiv:2311.17035, November 2023.

[2] Xiaodong Wu, Ran Duan, and Jianbing Ni. Unveiling security, privacy, and ethical concerns of ChatGPT. Journal of Information and Intelligence, 2(2):102-115, March 2024. ISSN 2949-7159. doi: 10.1016/j.jiixd.2023.10.007.

[3] Biwei Yan, Kun Li, Minghui Xu, Yueyan Dong, Yue Zhang, Zhaochun Ren, and Xiuzhen Cheng. On Protecting the Data Privacy of Large Language Models (LLMs): A Survey. arXiv preprint arXiv:2403.05156, March 2024.

[4] Zhangheng Li, Junyuan Hong, Bo Li, and Zhangyang Wang. Shake to Leak: Fine-tuning Diffusion Models Can Amplify the Generative Privacy Risk. In 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 18-32, April 2024. doi: 10.1109/SaTML59370.2024.00010.

[5] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What Can We Learn Privately? SIAM Journal on Computing, 40(3):793-826, January 2011. ISSN 0097-5397. doi: 10.1137/090756090.

[6] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054-1067. ACM, 2014.

[7] Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pages 441-459. Association for Computing Machinery, 2017. ISBN 9781450350853.

[8] Differential Privacy Team, Apple. Learning with privacy at scale. Technical report, Apple, 2017.

[9] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting Telemetry Data Privately. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[10] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local Privacy and Statistical Minimax Rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429-438, October 2013. doi: 10.1109/FOCS.2013.53.

[11] Min Ye and Alexander Barg. Optimal Schemes for Discrete Distribution Estimation Under Locally Differential Privacy. IEEE Transactions on Information Theory, 64(8):5662-5676, August 2018. ISSN 0018-9448, 1557-9654. doi: 10.1109/TIT.2018.2809790.

[12] Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, and Ryan Rogers. Protection Against Reconstruction and Its Applications in Private Federated Learning. arXiv preprint arXiv:1812.00984, June 2019.

[13] Hilal Asi, Vitaly Feldman, and Kunal Talwar. Optimal Algorithms for Mean Estimation under Local Differential Privacy. In Proceedings of the 39th International Conference on Machine Learning, pages 1046-1056. PMLR, June 2022.

[14] Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, and Adam Sealfon. On computing pairwise statistics with local differential privacy. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 27129-27146. Curran Associates, Inc., 2023.

[15] Bonwoo Lee, Jeongyoun Ahn, and Cheolwoo Park. Minimax risks and optimal procedures for estimation under functional local differential privacy. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 57964-57975. Curran Associates, Inc., 2023.

[16] Hilal Asi, Vitaly Feldman, Jelani Nelson, Huy Nguyen, and Kunal Talwar. Fast optimal locally private mean estimation via random projections. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 16271-16282. Curran Associates, Inc., 2023.

[17] Berivan Isik, Wei-Ning Chen, Ayfer Ozgur, Tsachy Weissman, and Albert No. Exact optimality of communication-privacy-utility tradeoffs in distributed mean estimation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 37761-37785. Curran Associates, Inc., 2023.

[18] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464-473, October 2014. doi: 10.1109/FOCS.2014.56.

[19] Stacey Truex, Ling Liu, Ka-Ho Chow, Mehmet Emre Gursoy, and Wenqi Wei. LDP-Fed: Federated learning with local differential privacy. In Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking, EdgeSys '20, pages 61-66, New York, NY, USA, May 2020. Association for Computing Machinery. ISBN 978-1-4503-7132-2. doi: 10.1145/3378679.3394533.

[20] Badih Ghazi, Pritish Kamath, Ravi Kumar, Ethan Leeman, Pasin Manurangsi, Avinash V. Varadarajan, and Chiyuan Zhang. Regression with Label Differential Privacy. arXiv preprint arXiv:2212.06074, October 2023.

[21] Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy Amplification by Iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 521-532, October 2018. doi: 10.1109/FOCS.2018.00056.

[22] Mengchu Li, Tom Berrett, and Yi Yu. Network change point localisation under local differential privacy. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 15013-15026. Curran Associates, Inc., 2022.
[23] Shuzhen Chen, Dongxiao Yu, Yifei Zou, Jiguo Yu, and Xiuzhen Cheng. Decentralized Wireless Federated Learning With Differential Privacy. IEEE Transactions on Industrial Informatics, 18(9):6273-6282, September 2022. ISSN 1941-0050. doi: 10.1109/TII.2022.3145010.

[24] Youssef Allouah, Rachid Guerraoui, Nirupam Gupta, Rafael Pinot, and John Stephan. On the Privacy-Robustness-Utility Trilemma in Distributed Learning. In Proceedings of the 40th International Conference on Machine Learning, pages 569-626. PMLR, July 2023.

[25] Chuan Guo, Kamalika Chaudhuri, Pierre Stock, and Michael Rabbat. Privacy-Aware Compression for Federated Learning Through Numerical Mechanism Design. In Proceedings of the 40th International Conference on Machine Learning, pages 11888-11904. PMLR, July 2023.

[26] Yi Liu, Qirui Hu, Lei Ding, and Linglong Kong. Online Local Differential Private Quantile Inference via Self-normalization. In Proceedings of the 40th International Conference on Machine Learning, pages 21698-21714. PMLR, July 2023.

[27] Jin Sima, Changlong Wu, Olgica Milenkovic, and Wojciech Szpankowski. Online Distribution Learning with Local Privacy Constraints. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, pages 460-468. PMLR, April 2024.

[28] Antonious M. Girgis, Deepesh Data, and Suhas Diggavi. Distributed User-Level Private Mean Estimation. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 2196-2201, June 2022. doi: 10.1109/ISIT50566.2022.9834713.

[29] Ruiquan Huang, Huanyu Zhang, Luca Melis, Milan Shen, Meisam Hejazinia, and Jing Yang. Federated Linear Contextual Bandits with User-level Differential Privacy. In Proceedings of the 40th International Conference on Machine Learning, pages 14060-14095. PMLR, July 2023.

[30] Jayadev Acharya, Yuhan Liu, and Ziteng Sun. Discrete Distribution Estimation under User-level Local Differential Privacy. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, pages 8561-8585. PMLR, April 2023.

[31] Raef Bassily and Ziteng Sun. User-level Private Stochastic Convex Optimization with Optimal Rates. In Proceedings of the 40th International Conference on Machine Learning, pages 1838-1851. PMLR, July 2023.

[32] Yulian Mao, Qingqing Ye, Haibo Hu, Qi Wang, and Kai Huang. PrivShape: Extracting Shapes in Time Series under User-Level Local Differential Privacy. arXiv preprint arXiv:2404.03873, April 2024.

[33] Kangkang Sun, Jun Wu, Ali Kashif Bashir, Jianhua Li, Hansong Xu, Qianqian Pan, and Yasser D. Al-Otaibi. Personalized Privacy-Preserving Distributed Artificial Intelligence for Digital-Twin-Driven Vehicle Road Cooperation. IEEE Internet of Things Journal, pages 1-1, 2024. ISSN 2327-4662. doi: 10.1109/JIOT.2024.3389656.

[34] Alexander Kent, Thomas B. Berrett, and Yi Yu. Rate Optimality and Phase Transition for User-Level Local Differential Privacy. arXiv preprint arXiv:2405.11923, May 2024.

[35] Hisham Husain, Borja Balle, Zac Cranko, and Richard Nock. Local Differential Privacy for Sampling. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pages 3404-3413. PMLR, June 2020.

[36] James Flemings, Meisam Razaviyayn, and Murali Annavaram. Differentially Private Next-Token Prediction of Large Language Models. arXiv preprint arXiv:2403.15638, April 2024.
[37] Sofya Raskhodnikova, Satchit Sivakumar, Adam Smith, and Marika Swanberg. Differentially private sampling from distributions. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 28983-28994. Curran Associates, Inc., 2021.

[38] Badih Ghazi, Xiao Hu, Ravi Kumar, and Pasin Manurangsi. On differentially private sampling from Gaussian and product distributions. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 77783-77809. Curran Associates, Inc., 2023.

[39] Hamid Ebadi, Thibaud Antignac, and David Sands. Sampling and partitioning for differential privacy. In 2016 14th Annual Conference on Privacy, Security and Trust (PST), pages 664-673, February 2016. doi: 10.1109/PST.2016.7906954.

[40] Nazmiye Ceren Abay, Yan Zhou, Murat Kantarcioglu, Bhavani Thuraisingham, and Latanya Sweeney. Privacy Preserving Synthetic Data Release Using Deep Learning. In Michele Berlingerio, Francesco Bonchi, Thomas Gärtner, Neil Hurley, and Georgiana Ifrim, editors, Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pages 510-526, Cham, 2019. Springer International Publishing. ISBN 978-3-030-10925-7. doi: 10.1007/978-3-030-10925-7_31.

[41] Liyang Xie, Kaixiang Lin, Shu Wang, Fei Wang, and Jiayu Zhou. Differentially Private Generative Adversarial Network. arXiv preprint arXiv:1802.06739, February 2018.

[42] Bangzhou Xin, Wei Yang, Yangyang Geng, Sheng Chen, Shaowei Wang, and Liusheng Huang. Private FL-GAN: Differential Privacy Synthetic Data Generation Based on Federated Learning. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2927-2931, May 2020. doi: 10.1109/ICASSP40776.2020.9054559.

[43] Reihaneh Torkzadehmahani, Peter Kairouz, and Benedict Paten. DP-CGAN: Differentially Private Synthetic Data and Label Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.

[44] Yi Liu, Jialiang Peng, James J. Q. Yu, and Yi Wu. PPGAN: Privacy-Preserving Generative Adversarial Network. In 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS), pages 985-989, February 2019. doi: 10.1109/ICPADS47876.2019.00150.

[45] Teddy Cunningham, Konstantin Klemmer, Hongkai Wen, and Hakan Ferhatosmanoglu. GeoPointGAN: Synthetic Spatial Data with Local Label Differential Privacy. arXiv preprint arXiv:2205.08886, May 2022.

[46] Hua Zhang, Kaixuan Li, Teng Huang, Xin Zhang, Wenmin Li, Zhengping Jin, Fei Gao, and Minghui Gao. Publishing locally private high-dimensional synthetic data efficiently. Information Sciences, 633:343-356, July 2023. ISSN 0020-0255. doi: 10.1016/j.ins.2023.03.014.

[47] Hisaichi Shibata, Shouhei Hanaoka, Yang Cao, Masatoshi Yoshikawa, Tomomi Takenaga, Yukihiro Nomura, Naoto Hayashi, and Osamu Abe. Local Differential Privacy Image Generation Using Flow-Based Deep Generative Models. Applied Sciences, 13(18):10132, January 2023. ISSN 2076-3417. doi: 10.3390/app131810132.

[48] Hansle Gwon, Imjin Ahn, Yunha Kim, Hee Jun Kang, Hyeram Seo, Heejung Choi, Ha Na Cho, Minkyoung Kim, Ji Ye Han, Gaeun Kee, Seohyun Park, Kye Hwa Lee, Tae Joon Jun, and Young-Hak Kim. LDP-GAN: Generative adversarial networks with local differential privacy for patient medical records synthesis. Computers in Biology and Medicine, 168:107738, January 2024. ISSN 0010-4825. doi: 10.1016/j.compbiomed.2023.107738.
[49] Jacob Imola, Amrita Roy Chowdhury, and Kamalika Chaudhuri. Metric Differential Privacy at the User-Level. arXiv preprint arXiv:2405.02665, May 2024.

[50] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In Shai Halevi and Tal Rabin, editors, Theory of Cryptography, Lecture Notes in Computer Science, pages 265-284, Berlin, Heidelberg, 2006. Springer. ISBN 978-3-540-32732-5. doi: 10.1007/11681878_14.

[51] Igal Sason. On f-Divergences: Integral Representations, Local Behavior, and Inequalities. Entropy, 20(5):383, May 2018. ISSN 1099-4300. doi: 10.3390/e20050383.

[52] Igal Sason and Sergio Verdú. f-Divergence Inequalities. IEEE Transactions on Information Theory, 62(11):5973-6006, January 2016. ISSN 1557-9654. doi: 10.1109/TIT.2016.2603151.

[53] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139-144, October 2020. ISSN 0001-0782. doi: 10.1145/3422622.

[54] Elias M. Stein and Rami Shakarchi. Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press, 2005. ISBN 978-0-691-11386-9. doi: 10.2307/j.ctvd58v18.

[55] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal Mechanisms for Local Differential Privacy. Journal of Machine Learning Research, 17(17):1-51, 2016. ISSN 1533-7928.

[56] Naoise Holohan, Douglas J. Leith, and Oliver Mason. Extreme points of the local differential privacy polytope. Linear Algebra and its Applications, 534:78-96, December 2017. ISSN 0024-3795. doi: 10.1016/j.laa.2017.08.011.

[57] Ankit Pensia, Amir R. Asadi, Varun Jog, and Po-Ling Loh. Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints. arXiv preprint arXiv:2301.03566, December 2023.

[58] Seung-Hyun Nam, Vincent Y. F. Tan, and Si-Hyeon Lee. Optimal Private Discrete Distribution Estimation With 1-bit Communication. IEEE Transactions on Information Forensics and Security, 19:6514-6528, 2024. ISSN 1556-6021. doi: 10.1109/TIFS.2024.3419721.

[59] Andrew Rukhin. Information-type divergence when the likelihood ratios are bounded. Applicationes Mathematicae, 4(24):415-423, 1997. ISSN 1233-7234.

[60] Stanley L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63-69, 1965.

[61] F. Liese and I. Vajda. On Divergences and Informations in Statistics and Information Theory. IEEE Transactions on Information Theory, 52(10):4394-4412, October 2006. ISSN 1557-9654. doi: 10.1109/TIT.2006.881731.

[62] Cédric Villani. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009. ISBN 978-3-540-71049-3.

[63] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, December 2017.

[64] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK; New York, 2004. ISBN 978-0-521-83378-3.

[65] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970. ISBN 978-0-691-01586-6.

[66] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087-1092, June 1953. ISSN 0021-9606. doi: 10.1063/1.1699114.
[67] Christian P. Robert. The Metropolis-Hastings algorithm. arXiv preprint arXiv:1504.01896, January 2016.

[68] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The Composition Theorem for Differential Privacy. In Proceedings of the 32nd International Conference on Machine Learning, pages 1376-1385. PMLR, June 2015.

[69] Ryan M. Rogers, Aaron Roth, Jonathan Ullman, and Salil Vadhan. Privacy Odometers and Filters: Pay-as-you-Go Composition. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

## A Assumptions about the Measure Theory

The appendices assume familiarity with basic measure theory and real analysis. We refer to standard textbooks on measure theory and real analysis, e.g., [54]. Throughout the main paper and the appendices, we assume the following:

- For each sample space $\mathcal{X}$, a σ-algebra on $\mathcal{X}$ is implicitly given. Unless mentioned otherwise, the discrete σ-algebra is given for finite $\mathcal{X}$, and the Borel σ-algebra is given for $\mathcal{X} = \mathbb{R}^n$.
- A subset of $\mathcal{X}$ always means a *measurable* subset, and similarly $A \subseteq \mathcal{X}$ always means that $A$ is a measurable subset.
- A "continuous" distribution precisely means an "absolutely continuous" distribution (with respect to the Lebesgue measure).

In the appendices, we also introduce the following notations:

- For finite $\mathcal{X}$, $\#$ denotes the counting measure.
- For $\mathcal{X} = \mathbb{R}^n$, $m$ denotes the Lebesgue measure.

## B General Definition and More Properties of f-divergences

In this appendix, we review the general definition of, and additional properties of, f-divergences which are important in our analysis. Let a convex function $f : (0,\infty) \to \mathbb{R}$ with $f(1) = 0$ be given. For $P, Q \in \mathcal{P}(\mathcal{X})$, not necessarily satisfying $P \ll Q$, we first take a dominating measure $\mu$ on $\mathcal{X}$ such that $P, Q \ll \mu$ (e.g., $\mu = P + Q$), and let $p = dP/d\mu$, $q = dQ/d\mu$. The f-divergence $D_f(P\|Q)$ is defined as

$$D_f(P\|Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) d\mu(x). \quad (15)$$

We note that (15) is invariant under the choice of the dominating measure $\mu$. Since it is possible that $p(x) = 0$ or $q(x) = 0$, we need to define the value of the expression $q f(p/q)$ for $p = 0$ or $q = 0$ [51, 52]. It is well-known that any convex function $f : (0,\infty) \to \mathbb{R}$ is continuous, and the function

$$f^*(x) = x f(1/x) \quad (16)$$

is also convex on $(0,\infty)$. Furthermore, any convex function $f : (0,\infty) \to \mathbb{R}$ with $f(1) = 0$ has a limit $\lim_{x\to0^+} f(x)$ in $\mathbb{R}\cup\{+\infty\}$. Hence, we have the continuous extension $f : [0,\infty) \to \mathbb{R}\cup\{+\infty\}$ obtained by setting $f(0) = \lim_{x\to0^+} f(x)$, which is proper convex and continuous. In the same way, we have the continuous extension $f^* : [0,\infty) \to \mathbb{R}\cup\{+\infty\}$. Using these extensions, we define $0\,f(0/0) = 0$, $q f(0/q) = q f(0)$ for $q > 0$, and $0\,f(p/0) = p f^*(0)$ for $p > 0$. In particular, if $f^*(0) = \infty$, then $D_f(P\|Q) = \infty$ whenever $P \ll Q$ does not hold (this is the case for the KL divergence and the χ²-divergence, for example). Similarly, if $f(0) = \infty$, then $D_f(P\|Q) = \infty$ whenever $Q \ll P$ does not hold. Also, the maximum value of the f-divergence $M_f$ presented in (2) can be written as $M_f = f(0) + f^*(0)$. The following additional properties of f-divergences, due to [61], are important in our analysis.

**Theorem B.1.** Any f-divergence is jointly convex; that is, for any $P_1, P_2, Q_1, Q_2 \in \mathcal{P}(\mathcal{X})$ and $0 \le \lambda \le 1$, we have $D_f\left(\lambda P_1 + (1-\lambda)P_2 \,\|\, \lambda Q_1 + (1-\lambda)Q_2\right) \le \lambda D_f(P_1\|Q_1) + (1-\lambda) D_f(P_2\|Q_2)$.

**Theorem B.2 (Data-Processing Inequality).** Let $M$ be a conditional distribution (Markov kernel) from $\mathcal{X}$ to $\mathcal{Y}$. For given $P_1, P_2 \in \mathcal{P}(\mathcal{X})$, let $Q_1, Q_2 \in \mathcal{P}(\mathcal{Y})$ be the push-forward measures of $P_1, P_2$ through $M$, respectively. Then for any f-divergence, we have $D_f(P_1\|P_2) \ge D_f(Q_1\|Q_2)$.
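For concreteness, one can evaluate $M_f = f(0) + f^*(0)$ for the four running examples; the following short derivation is ours, and it recovers the familiar maximal values:

$$\begin{aligned}
\text{KL: } & f(x) = x\log x, & f(0) &= 0, & f^*(0) &= \lim_{x\to0^+} \log\tfrac{1}{x} = \infty, & M_f &= \infty,\\
\text{TV: } & f(x) = \tfrac{|x-1|}{2}, & f(0) &= \tfrac12, & f^*(0) &= \lim_{x\to0^+} \tfrac{|1-x|}{2} = \tfrac12, & M_f &= 1,\\
\text{Hellinger}^2\text{: } & f(x) = (1-\sqrt{x})^2, & f(0) &= 1, & f^*(0) &= \lim_{x\to0^+} (\sqrt{x}-1)^2 = 1, & M_f &= 2,\\
\chi^2\text{: } & f(x) = x^2 - 1, & f(0) &= -1, & f^*(0) &= \lim_{x\to0^+} \left(\tfrac{1}{x} - x\right) = \infty, & M_f &= \infty.
\end{aligned}$$

This is consistent with (2): the total variation distance is at most 1 and the squared Hellinger distance is at most 2, with equality for mutually singular pairs.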
Also, we present several equivalent expressions for the total variation distance [51, 52], which are used in Appendix D. In the expressions below, we assume $P, Q \ll \mu$ for some dominating measure $\mu$ with $p = dP/d\mu$ and $q = dQ/d\mu$:

$$D_{\mathrm{TV}}(P, Q) = \frac{1}{2}\int |p(x) - q(x)|\, d\mu(x) \quad (17)$$
$$= \int_{x:\,p(x) \le q(x)} (q(x) - p(x))\, d\mu(x) \quad (18)$$
$$= \int_{x:\,p(x) \ge q(x)} (p(x) - q(x))\, d\mu(x). \quad (19)$$

We introduce one more notation. For $\lambda_1, \lambda_2 \in [0,1]$, let $D_f^B(\lambda_1\|\lambda_2)$ denote the f-divergence between Bernoulli distributions with $\Pr(1) = \lambda_1$ and $\lambda_2$, respectively. That is,

$$D_f^B(\lambda_1\|\lambda_2) = \lambda_2\, f\!\left(\frac{\lambda_1}{\lambda_2}\right) + (1-\lambda_2)\, f\!\left(\frac{1-\lambda_1}{1-\lambda_2}\right). \quad (20)$$

We should note the following facts:

1. By the joint convexity of the f-divergence and the continuity of $f$, $D_f^B(\lambda_1\|\lambda_2)$ is continuous and jointly convex in $(\lambda_1, \lambda_2)$ (but $D_f^B(\lambda_1\|\lambda_2)$ may be extended real-valued).
2. For a fixed $\lambda_1$, $D_f^B(\lambda_1\|\lambda_2)$ attains its global minimum 0 at $\lambda_2 = \lambda_1$. Together with convexity, we derive that $D_f^B(\lambda_1\|\lambda_2)$ is decreasing in $\lambda_2 \in [0, \lambda_1]$ and increasing in $\lambda_2 \in [\lambda_1, 1]$.

## C Generalized Main Theorem and Proof

In this appendix, we present the formal proofs of the main theorems, Theorems 3.1 and 3.3. As mentioned in Section 3.3, we first state the generalized theorem with its proof, and later we show how this generalized theorem includes the main theorems as special cases.

### C.1 Statement of the generalized main theorem

First, let us define the general setup we consider. Let a (general) sample space $\mathcal{X}$ be given. For a positive measure $\mu$ on $\mathcal{X}$ such that $\mu(\mathcal{X}) < \infty$ and $c_2 > c_1 \ge 0$, let us define

$$\mathcal{P}_{c_1,c_2,\mu} := \{P \in \mathcal{P}(\mathcal{X}) : P \ll \mu,\ c_1 \le dP/d\mu \le c_2\ \ \mu\text{-a.e.}\}. \quad (21)$$

We generally consider the case that $\mathcal{P} = \mathcal{P}_{c_1,c_2,\mu}$ for some $c_1, c_2, \mu$. For example, for the setup of Section 3.2, we have $\mathcal{P}_{c_1,c_2,h} = \mathcal{P}_{c_1,c_2,\mu}$, where $\mu$ is the positive measure with $\mu \ll m$ and $d\mu/dm = h$. For the setup of Section 3.1, we have $\mathcal{P}([k]) = \mathcal{P}_{0,1,\#}$. For the same reason as in Section 3.2, we may impose the following normalization condition on $c_1, c_2, \mu, \epsilon$:

$$\mu(\mathcal{X}) = 1, \quad c_1 < 1 < c_2, \quad c_2 > c_1 e^\epsilon. \quad (22)$$

Note that $\mu(\mathcal{X}) = 1$ means that $\mu$ is a probability measure, i.e., $\mu \in \mathcal{P}(\mathcal{X})$. Also, note that for $\mathcal{X} = [k]$, we can write in normalized form $\mathcal{P}([k]) = \mathcal{P}_{0,k,\mu_k}$, where $\mu_k = \frac{1}{k}\#$ is the uniform distribution on $[k]$.

First, let us define the proposed mechanism, together with the related constants, as in Theorem 3.3. For ease of proof, we introduce an additional constant $\alpha$ beyond those in Theorem 3.3.

**Definition C.1.** Let $c_1, c_2, \mu, \epsilon$ satisfying the normalization condition (22) be given, and let $\mathcal{P} = \mathcal{P}_{c_1,c_2,\mu}$. First, define the following constants determined by $c_1, c_2, \epsilon$:

$$\alpha = \frac{1 - c_1}{c_2 - c_1}, \quad (23)$$
$$b = \frac{c_2 - c_1}{(e^\epsilon - 1)(1 - c_1) + c_2 - c_1} = \frac{1}{\alpha e^\epsilon + 1 - \alpha}, \quad (24)$$
$$r_1 = \frac{c_1}{b} = c_1(\alpha e^\epsilon + 1 - \alpha), \quad (25)$$
$$r_2 = \frac{c_2}{b e^\epsilon} = \frac{c_2}{e^\epsilon}(\alpha e^\epsilon + 1 - \alpha). \quad (26)$$

Also, define a mechanism $Q^*_{c_1,c_2,\mu,\epsilon} \in \mathcal{Q}_{\mathcal{X},\mathcal{P},\epsilon}$ as follows: for each $P \in \mathcal{P}$, $Q^*_{c_1,c_2,\mu,\epsilon}(P) =: Q$ is defined as the probability measure such that $Q \ll \mu$ and

$$\frac{dQ}{d\mu}(x) = \mathrm{clip}\left(\frac{1}{r_P}\frac{dP}{d\mu}(x);\ b,\ be^\epsilon\right), \quad (27)$$

where $r_P > 0$ is a constant depending on $P$ so that $\int \frac{dQ}{d\mu}(x)\, d\mu(x) = 1$. Furthermore, let $\mathcal{M}_{c_1,c_2,\mu,\epsilon} = \mathcal{P}_{b,be^\epsilon,\mu}$, so that $Q^*_{c_1,c_2,\mu,\epsilon}(P) \in \mathcal{M}_{c_1,c_2,\mu,\epsilon}$ for every $P \in \mathcal{P}$.

We should note that $1 = c_2\alpha + c_1(1-\alpha) = be^\epsilon\alpha + b(1-\alpha)$. Also, $Q^*_{c_1,c_2,\mu,\epsilon}$ clearly satisfies $\epsilon$-LDP. For the same reason as in Sections 3.1 and 3.2, the value of $r_P$ may not be unique, but the mechanism $Q^*_{c_1,c_2,\mu,\epsilon}$ does not depend on the choice of $r_P$. Furthermore, we show that the value of $\int \mathrm{clip}\left(\frac{1}{r}\frac{dP}{d\mu}(x); b, be^\epsilon\right) d\mu(x)$ at $r = r_1$ and $r = r_2$ is at least and at most 1, respectively.

**Proposition C.2.** Let $c_1, c_2, \mu, \epsilon, \mathcal{P}, \alpha, b, r_1, r_2$ be as in Definition C.1.
Furthermore, we show that the value of $\int \mathrm{clip}\big(\frac{1}{r}\frac{dP}{d\mu}(x); b, be^\epsilon\big)\, d\mu(x)$ at $r = r_1$ and at $r = r_2$ is at least and at most $1$, respectively.

Proposition C.2. Let $c_1, c_2, \mu, \epsilon, \mathcal{P}, \alpha, b, r_1, r_2$ be as in Definition C.1. Then, for any $P \in \mathcal{P}$, we have
$$\int \mathrm{clip}\!\left(\frac{1}{r_2}\frac{dP}{d\mu}(x);\ b,\ be^\epsilon\right) d\mu(x) \le 1 \le \int \mathrm{clip}\!\left(\frac{1}{r_1}\frac{dP}{d\mu}(x);\ b,\ be^\epsilon\right) d\mu(x), \qquad (28)$$
where, in the case of $r_1 = 0$, we define
$$\mathrm{clip}\!\left(\frac{1}{r_1}\frac{dP}{d\mu}(x);\ b,\ be^\epsilon\right) = \begin{cases} b, & \text{if } \frac{dP}{d\mu}(x) = 0, \\ be^\epsilon, & \text{otherwise.} \end{cases} \qquad (29)$$
Furthermore, if $\int \mathrm{clip}\big(\frac{1}{r_1}\frac{dP}{d\mu}(x); b, be^\epsilon\big)\, d\mu(x) = 1$, then $\int \mathrm{clip}\big(\frac{1}{r}\frac{dP}{d\mu}(x); b, be^\epsilon\big)\, d\mu(x) = 1$ for every $r \in [r_1, r_2]$.

This proposition, together with the fact that $r \mapsto \int \mathrm{clip}\big(\frac{1}{r}\frac{dP}{d\mu}(x); b, be^\epsilon\big)\, d\mu(x)$ is continuous and monotone decreasing, implies that $r_P$ can be chosen such that $r_1 < r_P \le r_2$, as stated in Theorem 3.3. (Again, the continuity follows from $\mu(\mathcal{X}) < \infty$ and the dominated convergence theorem.) Also, we should remark that $0 < \alpha < 1$ and $c_1 < b < 1 < be^\epsilon < c_2$. The first claim easily follows from $c_1 < 1 < c_2$ and the definition of $\alpha$. For the second, first we have $1 < \alpha e^\epsilon + 1 - \alpha < e^\epsilon$, as $\alpha e^\epsilon + 1 - \alpha$ is a proper convex combination of $1$ and $e^\epsilon$. This directly implies that $b < 1 < be^\epsilon$. Next, by calculation, we can observe that
$$c_1\big((e^\epsilon-1)(1-c_1) + c_2-c_1\big) - (c_2-c_1) = (1-c_1)(c_1 e^\epsilon - c_2) < 0, \qquad (30)$$
which implies $c_1 < b$, and
$$c_2\big((e^\epsilon-1)(1-c_1) + c_2-c_1\big) - e^\epsilon(c_2-c_1) = (c_2-1)(c_2 - e^\epsilon c_1) > 0, \qquad (31)$$
which implies $be^\epsilon < c_2$. In particular, these inequalities imply that $0 \le r_1 < 1 < r_2$.

Now, we show that under a mild decomposability condition, the proposed mechanism $Q^*_{c_1,c_2,\mu,\epsilon}$ is universally optimal under any $f$-divergence for $(\mathcal{X}, \mathcal{P} = \mathcal{P}_{c_1,c_2,\mu}, \epsilon)$. After that, we show that this mild condition holds for the setups of both of the main theorems, which finishes the proofs of the main theorems. First, let us state the decomposability condition. This condition is a formal definition of the concept of "packing of $u$ copies of $\mathcal{X}$ by $t$ subsets $A_1, \ldots, A_t$", which is briefly mentioned in Section 3.3.

Definition C.3. Let $\alpha \in (0,1)$ and $t, u \in \mathbb{N}$, $t > u$. We say that a probability measure $\mu \in \mathcal{P}(\mathcal{X})$ is $(\alpha, t, u)$-decomposable if there exist $t$ subsets $A_1, A_2, \ldots, A_t \subseteq \mathcal{X}$ such that $\mu(A_i) = \alpha$ for all $i \in [t]$, and for every $x \in \mathcal{X}$, we have $|\{i \in [t] : x \in A_i\}| \le u$. We say that $\mu \in \mathcal{P}(\mathcal{X})$ is $\alpha$-decomposable if for any $\delta > 0$, there exist $t, u \in \mathbb{N}$, $t > u$, such that $\alpha \le u/t < \alpha + \delta$ and $\mu$ is $(\alpha, t, u)$-decomposable.

We remark that $(\alpha, t, 1)$-decomposability means that there are $t$ disjoint subsets $B_1, B_2, \ldots, B_t$ such that $\mu(B_i) = \alpha$ for each $i \in [t]$. Also, if $\alpha$ is a rational number with $\alpha = u/t$, $u, t \in \mathbb{N}$, and $\mu$ is $(\alpha, t, u)$-decomposable, then $\mu$ is $\alpha$-decomposable. Then, we state the generalized theorem.

Theorem C.4. Let $c_1, c_2, \mu, \epsilon, \mathcal{P}, \alpha, b, r_1, r_2$ be as in Definition C.1. If $\mu$ is $\alpha$-decomposable, then for any $f$-divergence $D_f$, we have
$$R(\mathcal{X}, \mathcal{P}, \epsilon, f) = \frac{1-r_1}{r_2-r_1} f(r_2) + \frac{r_2-1}{r_2-r_1} f(r_1), \qquad (32)$$
and furthermore, the mechanism $Q^*_{c_1,c_2,\mu,\epsilon}$ as in Definition C.1 is optimal for $(\mathcal{X}, \mathcal{P}, \epsilon)$ under any $f$-divergence $D_f$.

As outlined in Section 3.3, the proof of this theorem is broken into two parts, the achievability part and the converse part.

Proposition C.5 (Achievability part). Let $c_1, c_2, \mu, \epsilon, \mathcal{P}, \alpha, b, r_1, r_2$ be as in Definition C.1. Then, for any $f$-divergence $D_f$, we have
$$R_f(Q^*_{c_1,c_2,\mu,\epsilon}) \le \frac{1-r_1}{r_2-r_1} f(r_2) + \frac{r_2-1}{r_2-r_1} f(r_1). \qquad (33)$$
(Here, we do not need to assume that $\mu$ is $\alpha$-decomposable.)

Proposition C.6 (Converse part). Let $c_1, c_2, \mu, \epsilon, \mathcal{P}, \alpha, b, r_1, r_2$ be as in Definition C.1, and suppose that $\mu$ is $\alpha$-decomposable. Then for any $f$-divergence $D_f$ and for any $\epsilon$-LDP mechanism $Q \in \mathcal{Q}_{\mathcal{X},\mathcal{P},\epsilon}$, we have
$$R_f(Q) \ge \frac{1-r_1}{r_2-r_1} f(r_2) + \frac{r_2-1}{r_2-r_1} f(r_1). \qquad (34)$$

The remainder of this appendix is organized as follows. We first present the proofs of Propositions C.5 and C.6 in Appendices C.2 and C.3, in which we grant Proposition C.2 and some intermediate lemmas. After that, in Appendix C.4, we show that Theorem C.4 contains the main theorems, Theorems 3.1 and 3.3, as special cases. Finally, Appendix C.5 presents the proof of Proposition C.2, and Appendices C.6 and C.7 prove the intermediate lemmas.
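For concreteness, the right-hand side of (32) is straightforward to evaluate numerically; the helper below (our sketch, not part of the released code) computes it from $(c_1, c_2, \epsilon)$ for a few common generators $f$, each written with its continuous extension at $0$.

    import numpy as np

    # common generators f, written with the continuous extension f(0) = lim f
    GENERATORS = {
        "TV":           lambda x: 0.5 * abs(x - 1),
        "KL":           lambda x: x * np.log(x) if x > 0 else 0.0,
        "chi2":         lambda x: (x - 1) ** 2,
        "sq_Hellinger": lambda x: (np.sqrt(x) - 1) ** 2,
    }

    def minimax_risk(c1, c2, eps, f):
        """Right-hand side of (32), using the constants of Definition C.1."""
        D = (np.exp(eps) - 1) * (1 - c1) + c2 - c1
        b = (c2 - c1) / D                              # eq. (24)
        r1, r2 = c1 / b, c2 / (b * np.exp(eps))        # eqs. (25)-(26)
        return (1 - r1) / (r2 - r1) * f(r2) + (r2 - 1) / (r2 - r1) * f(r1)

    # e.g., the finite setup of Theorem 3.1 corresponds to (c1, c2) = (0, k)
    print({name: minimax_risk(0.0, 10, 1.0, f) for name, f in GENERATORS.items()})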
C.2 Proof of achievability part (Proposition C.5)

As mentioned in Section 3.3, the proof of the achievability part consists of two steps: first presenting an alternative mechanism, and then proving that $Q^*_{c_1,c_2,\mu,\epsilon}$ performs the $f$-divergence projection for any general $f$-divergence.

C.2.1 An alternative mechanism

Let $\widetilde{Q}_{c_1,c_2,\mu,\epsilon} \in \mathcal{Q}_{\mathcal{X},\mathcal{P},\epsilon}$ be a mechanism defined as follows:
$$\widetilde{Q}_{c_1,c_2,\mu,\epsilon}(P) = \gamma P + (1-\gamma)\mu, \qquad (35)$$
$$\gamma = \frac{e^\epsilon - 1}{(e^\epsilon-1)(1-c_1) + c_2 - c_1}. \qquad (36)$$
Since $c_2 > e^\epsilon c_1$, it follows that $0 < \gamma < 1$. Also, a direct computation shows that
$$b = \gamma c_1 + (1-\gamma), \qquad (37)$$
$$be^\epsilon = \gamma c_2 + (1-\gamma). \qquad (38)$$
Notice that since $c_1 \le \frac{dP}{d\mu} \le c_2$ and $\frac{d\widetilde{Q}_{c_1,c_2,\mu,\epsilon}(P)}{d\mu} = \gamma \frac{dP}{d\mu} + (1-\gamma)$, we have
$$\frac{d\widetilde{Q}_{c_1,c_2,\mu,\epsilon}(P)}{d\mu}(x) \ge \gamma c_1 + (1-\gamma) = b, \qquad (39)$$
$$\frac{d\widetilde{Q}_{c_1,c_2,\mu,\epsilon}(P)}{d\mu}(x) \le \gamma c_2 + (1-\gamma) = be^\epsilon. \qquad (40)$$
Hence, $\widetilde{Q}_{c_1,c_2,\mu,\epsilon}(P) \in \mathcal{M}_{c_1,c_2,\mu,\epsilon}$ for every $P \in \mathcal{P}$, implying that $\widetilde{Q}_{c_1,c_2,\mu,\epsilon}$ also satisfies $\epsilon$-LDP. Now, we show that
$$R_f(\widetilde{Q}_{c_1,c_2,\mu,\epsilon}) \le \frac{1-r_1}{r_2-r_1} f(r_2) + \frac{r_2-1}{r_2-r_1} f(r_1). \qquad (41)$$
To this end, we need to show that for each $P \in \mathcal{P}$, we have
$$D_f\big(P \,\|\, \widetilde{Q}_{c_1,c_2,\mu,\epsilon}(P)\big) \le \frac{1-r_1}{r_2-r_1} f(r_2) + \frac{r_2-1}{r_2-r_1} f(r_1). \qquad (42)$$
Fix $P \in \mathcal{P}$. Let $p = dP/d\mu$ and $q = d\widetilde{Q}_{c_1,c_2,\mu,\epsilon}(P)/d\mu$. First, we claim that
$$r_1 \le \frac{p(x)}{q(x)} \le r_2 \qquad (43)$$
for $\mu$-almost every $x \in \mathcal{X}$. We have
$$\frac{p(x)}{q(x)} = \frac{p(x)}{\gamma p(x) + (1-\gamma)} \qquad (44)$$
$$= \frac{1}{\gamma}\left(1 - \frac{1-\gamma}{\gamma p(x) + (1-\gamma)}\right), \qquad (45)$$
which is increasing in $p(x) \ge 0$. Since $c_1 \le p(x) \le c_2$, we have
$$\frac{c_1}{\gamma c_1 + (1-\gamma)} \le \frac{p(x)}{q(x)} \le \frac{c_2}{\gamma c_2 + (1-\gamma)} \qquad (46)$$
for $\mu$-almost every $x \in \mathcal{X}$. From (37), (38), and Definition C.1, we conclude that
$$r_1 \le \frac{p(x)}{q(x)} \le r_2, \qquad (47)$$
which proves the claim. For the remainder of the proof, we need the following lemma.

Lemma C.7 ([59], Theorem 2.1). Let $P, Q \in \mathcal{P}(\mathcal{X})$. Suppose that $P, Q \ll \mu$ for some reference measure $\mu$ on $\mathcal{X}$, and there exist $r_1, r_2 \in \mathbb{R}$ with $0 \le r_1 < 1 < r_2$ such that the densities $p = \frac{dP}{d\mu}$, $q = \frac{dQ}{d\mu}$ satisfy $q(x) > 0$ and $r_1 \le \frac{p(x)}{q(x)} \le r_2$ for $\mu$-almost every $x \in \mathcal{X}$. Then for any $f$-divergence $D_f$, we have
$$D_f(P\|Q) \le \frac{1-r_1}{r_2-r_1} f(r_2) + \frac{r_2-1}{r_2-r_1} f(r_1). \qquad (48)$$
This lemma directly implies (42), and consequently (41).
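Operationally, the alternative mechanism (35) is just a two-component mixture: sample from $P$ with probability $\gamma$ and from $\mu$ otherwise. A minimal sketch (ours, with caller-supplied samplers):

    import numpy as np

    def gamma_coeff(eps, c1, c2):
        """Mixing weight gamma of eq. (36)."""
        return (np.exp(eps) - 1) / ((np.exp(eps) - 1) * (1 - c1) + c2 - c1)

    def sample_tilde_Q(sample_P, sample_mu, eps, c1, c2,
                       rng=np.random.default_rng()):
        """One draw from Q~(P) = gamma P + (1 - gamma) mu, eq. (35);
        sample_P and sample_mu are caller-supplied samplers."""
        gamma = gamma_coeff(eps, c1, c2)
        return sample_P() if rng.random() < gamma else sample_mu()

Since $b = \gamma c_1 + (1-\gamma)$ and $be^\epsilon = \gamma c_2 + (1-\gamma)$ by (37)-(38), the resulting density automatically lies in $[b, be^\epsilon]$.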
C.2.2 f-divergence projection of the proposed mechanism

Next, we prove that $Q^*_{c_1,c_2,\mu,\epsilon}(P)$ is the projection of $P$ onto $\mathcal{M}_{c_1,c_2,\mu,\epsilon}$ for every $f$-divergence. Proposition 3.4 can be stated in a more general way as follows.

Proposition C.8. For any $P \in \mathcal{P}$ and any $f$-divergence $D_f$, we have
$$D_f\big(P \,\|\, Q^*_{c_1,c_2,\mu,\epsilon}(P)\big) = \inf_{Q \in \mathcal{M}_{c_1,c_2,\mu,\epsilon}} D_f(P\|Q). \qquad (49)$$
This proposition implies that
$$D_f\big(P \,\|\, Q^*_{c_1,c_2,\mu,\epsilon}(P)\big) \le D_f\big(P \,\|\, \widetilde{Q}_{c_1,c_2,\mu,\epsilon}(P)\big) \qquad (50)$$
for every $P \in \mathcal{P}$, and hence
$$R_f(Q^*_{c_1,c_2,\mu,\epsilon}) \le R_f(\widetilde{Q}_{c_1,c_2,\mu,\epsilon}) \le \frac{1-r_1}{r_2-r_1} f(r_2) + \frac{r_2-1}{r_2-r_1} f(r_1), \qquad (51)$$
which completes the proof of the achievability part. Next, we prove Proposition C.8. Fix $P \in \mathcal{P}$ and $p = dP/d\mu$. If $f(0) = \infty$ and $\mu(\{x : p(x) = 0\}) > 0$, then $D_f(P\|Q) = \infty$ for every $Q \in \mathcal{M}_{c_1,c_2,\mu,\epsilon}$, thus the proposition holds trivially. Consequently, we assume either $f(0) < \infty$ or $\mu(\{x : p(x) = 0\}) = 0$. The optimization problem $\inf_{Q \in \mathcal{M}_{c_1,c_2,\mu,\epsilon}} D_f(P\|Q)$ can be cast as follows:
$$\inf_{q : \mathcal{X} \to (0,\infty)} \int q(x) f\!\left(\frac{p(x)}{q(x)}\right) d\mu(x) \qquad (52)$$
$$\text{such that}\quad q(x) \ge b,\ \forall x, \qquad (53)$$
$$q(x) \le be^\epsilon,\ \forall x, \qquad (54)$$
$$\int q(x)\, d\mu(x) = 1, \qquad (55)$$
where $q = dQ/d\mu$. Our goal is to show that $q^* = dQ^*_{c_1,c_2,\mu,\epsilon}(P)/d\mu$ is an optimal solution of the above optimization problem.

Motivation for the optimality proof. To provide the motivation for the proof of the optimality of $q^*$ for the above optimization problem, we first consider the following analogous finite-dimensional optimization problem:
$$\inf_{q \in (0,\infty)^n} \sum_{i=1}^n q_i f\!\left(\frac{p_i}{q_i}\right) \qquad (56)$$
$$\text{such that}\quad q_i \ge b,\ \forall i, \qquad (57)$$
$$q_i \le be^\epsilon,\ \forall i, \qquad (58)$$
$$\sum_{i=1}^n q_i = 1, \qquad (59)$$
for a convex function $f : (0,\infty) \to \mathbb{R}$ and $p_i > 0$. The convexity of $f^*(x) = x f(1/x)$ (see Appendix B) implies that the above is a convex optimization problem. For simplicity, we assume that $f^*$ is differentiable. However, note that this assumption is only for simplifying the motivation for the proof, and the result holds for general $f$ or $f^*$. We formulate the Lagrangian for the above optimization as
$$L(q, \phi, \psi, \nu) = \sum_{i=1}^n \big(p_i f^*(q_i/p_i) + \phi_i(b - q_i) + \psi_i(q_i - be^\epsilon)\big) + \nu\left(1 - \sum_{i=1}^n q_i\right) \qquad (60)$$
with dual variables $\phi, \psi \in [0,\infty)^n$ and $\nu \in \mathbb{R}$. The Karush-Kuhn-Tucker (KKT) conditions yield:
$$(f^*)'(q_i/p_i) - \phi_i + \psi_i - \nu = 0,\ \forall i, \qquad (61)$$
$$(57), (58), (59), \qquad (62)$$
$$\phi_i, \psi_i \ge 0,\ \forall i, \qquad (63)$$
$$\phi_i(q_i - b) = \psi_i(be^\epsilon - q_i) = 0,\ \forall i. \qquad (64)$$
Now, suppose that there is a feasible point $q^* \in (0,\infty)^n$ satisfying (57), (58), and (59) such that $q^*_i = \mathrm{clip}(p_i/r;\, b,\, be^\epsilon)$ for some $r > 0$. We show that $q^*$ satisfies the KKT conditions for some feasible dual variables $(\phi, \psi, \nu)$. For $(q^*, \phi, \psi, \nu)$ to satisfy the KKT conditions, the following should hold:
- If $b < p_i/r < be^\epsilon$, then we have $q^*_i = p_i/r$. From (61) and (64), we must have $\phi_i = \psi_i = 0$ and $\nu = (f^*)'(1/r)$.
- If $p_i/r \le b$, then $q^*_i = b$. Then, $\psi_i = 0$ and $\phi_i = (f^*)'(b/p_i) - \nu$.
- If $p_i/r \ge be^\epsilon$, then $q^*_i = be^\epsilon$. Then $\phi_i = 0$ and $\psi_i = \nu - (f^*)'(be^\epsilon/p_i)$.

Now, since $f^*$ is convex, $(f^*)'$ is monotonically increasing. Hence, the following are satisfied:
- If $p_i/r \le b$, then $(f^*)'(b/p_i) - (f^*)'(1/r) \ge 0$, and
- If $p_i/r \ge be^\epsilon$, then $(f^*)'(1/r) - (f^*)'(be^\epsilon/p_i) \ge 0$.

It can thus be verified that $(q^*, \phi, \psi, \nu)$ satisfies the KKT conditions with
$$\nu = (f^*)'(1/r), \qquad \phi_i = \begin{cases} (f^*)'(b/p_i) - (f^*)'(1/r), & \text{if } p_i/r \le b, \\ 0, & \text{otherwise,} \end{cases} \qquad \psi_i = \begin{cases} (f^*)'(1/r) - (f^*)'(be^\epsilon/p_i), & \text{if } p_i/r \ge be^\epsilon, \\ 0, & \text{otherwise.} \end{cases}$$
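The KKT reasoning above can be sanity-checked numerically: on a finite space, the clipped solution should coincide with what a generic constrained solver returns for (56)-(59). The following sketch (ours; assumes scipy is available and uses the KL generator $f(t) = t\log t$) performs this comparison.

    import numpy as np
    from scipy.optimize import brentq, minimize

    k, eps = 6, 1.0
    b = k / (np.exp(eps) + k - 1)                    # (c1, c2) = (0, k)
    p = np.random.dirichlet(np.ones(k)) * k          # densities w.r.t. uniform mu

    def clipped(r):
        return np.clip(p / r, b, b * np.exp(eps))

    # choose r so that the integral mean(clipped) equals 1
    r = brentq(lambda r: clipped(r).mean() - 1.0, 1e-9, 10.0)
    q_star = clipped(r)

    # generic solver for (56)-(59); for KL, q f(p/q) = p log(p/q)
    obj = lambda q: np.mean(p * np.log(p / q))
    res = minimize(obj, np.full(k, 1.0), method="SLSQP",
                   bounds=[(b, b * np.exp(eps))] * k,
                   constraints=[{"type": "eq", "fun": lambda q: q.mean() - 1.0}])
    print(np.abs(res.x - q_star).max())    # should be ~0 up to solver tolerance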
Optimality proof. Next, we present a proof of the optimality of $q^*$ for the optimization problem
$$\inf_{q : \mathcal{X} \to (0,\infty)} \int q(x) f\!\left(\frac{p(x)}{q(x)}\right) d\mu(x) \qquad (65)$$
$$\text{such that}\quad q(x) \ge b,\ \forall x, \qquad (66)$$
$$q(x) \le be^\epsilon,\ \forall x, \qquad (67)$$
$$\int q(x)\, d\mu(x) = 1. \qquad (68)$$
Here, note that we do not assume that $f^*$ is differentiable. To this goal, we first review the following basic facts about a general convex function $f : (0,\infty) \to \mathbb{R}$, which may not be differentiable [64, 65]:
- The left derivative $f'_-(x) := \lim_{h \to 0^-} \frac{f(x+h)-f(x)}{h}$ and the right derivative $f'_+(x) := \lim_{h \to 0^+} \frac{f(x+h)-f(x)}{h}$ exist and are finite for every $x \in (0,\infty)$, regardless of whether $f$ is differentiable or not.
- For every $0 < x < y$, we have $f'_-(x) \le f'_+(x) \le f'_-(y) \le f'_+(y)$.
- For every $x, y \in (0,\infty)$ and any $g \in [f'_-(x), f'_+(x)]$, we have
$$f(y) \ge f(x) + g(y-x). \qquad (69)$$
By continuous extension, this holds for $y = 0$ also; that is, $f(0) \ge f(x) - gx$ for every $x \in (0,\infty)$ and $g \in [f'_-(x), f'_+(x)]$.

Let $q : \mathcal{X} \to (0,\infty)$ be any feasible function satisfying (66)-(68). Recall that $f^*(x) := x f(1/x)$ is convex, and we can express $q(x) f(p(x)/q(x)) = p(x) f^*(q(x)/p(x))$ whenever $p(x) \ne 0$. Also, whenever $p(x) \ne 0$, we bound $f^*(q(x)/p(x))$ from below by the linear approximation of $f^*$ at $q^*(x)/p(x)$ using (69), as follows:
$$q(x) f\!\left(\frac{p(x)}{q(x)}\right) = p(x) f^*\!\left(\frac{q(x)}{p(x)}\right) \qquad (70)$$
$$\ge q(x)\zeta(x) + \xi(x), \qquad (71)$$
where
$$\zeta(x) = (f^*)'_+\!\left(\frac{q^*(x)}{p(x)}\right), \qquad (72)$$
$$\xi(x) = p(x) f^*\!\left(\frac{q^*(x)}{p(x)}\right) - q^*(x)\,(f^*)'_+\!\left(\frac{q^*(x)}{p(x)}\right). \qquad (73)$$
Also, whenever $p(x) = 0$, we have $q(x) f(p(x)/q(x)) = q(x) f(0)$. Hence, we set $\zeta(x) = f(0)$ and $\xi(x) = 0$ when $p(x) = 0$, so that
$$q(x) f\!\left(\frac{p(x)}{q(x)}\right) \ge q(x)\zeta(x) + \xi(x) \qquad (74)$$
holds for all $x \in \mathcal{X}$. We note that $\zeta(x)$ and $\xi(x)$ do not depend on the choice of $q(x)$. Also, equality holds if $q(x) = q^*(x)$. Next, as an analogue of $\nu$ in the above motivation for the proof, let $\nu = (f^*)'_+(1/r_P)$. Since $\int q(x)\, d\mu(x) = 1$, we can write
$$\int q(x) f\!\left(\frac{p(x)}{q(x)}\right) d\mu(x) = \int q(x)\left[f\!\left(\frac{p(x)}{q(x)}\right) - \nu\right] d\mu(x) + \nu \qquad (75)$$
$$\ge \int \big[q(x)(\zeta(x) - \nu) + \xi(x)\big]\, d\mu(x) + \nu. \qquad (76)$$
Now, we define the following sets, which form a partition of $\mathcal{X}$:
$$L = \left\{x \in \mathcal{X} : \tfrac{1}{r_P} p(x) < b\right\}, \qquad (77)$$
$$M = \left\{x \in \mathcal{X} : b \le \tfrac{1}{r_P} p(x) \le be^\epsilon\right\}, \qquad (78)$$
$$U = \left\{x \in \mathcal{X} : \tfrac{1}{r_P} p(x) > be^\epsilon\right\}. \qquad (79)$$
For each case of $x \in L, M, U$, we have $q^*(x) = b$, $\frac{1}{r_P}p(x)$, $be^\epsilon$, respectively. We observe the following:
- If $x \in M$, then $\zeta(x) = \nu$ clearly.
- If $x \in U$, then $q^*(x)/p(x) < 1/r_P$. Since $(f^*)'_+$ is monotone increasing, we have $\zeta(x) \le \nu$.
- If $x \in L$ and $p(x) \ne 0$, then $q^*(x)/p(x) > 1/r_P$. Again, since $(f^*)'_+$ is monotone increasing, we have $\zeta(x) \ge \nu$.
- Finally, if $p(x) = 0$ (which implies $x \in L$ also), then from the definition $f^*(t) = t f(1/t)$, we have $(f^*)'_+(t) = f(1/t) - \frac{1}{t} f'_-(1/t)$. From (69) with $y = 0$, we have $\nu = (f^*)'_+(1/r_P) = f(r_P) - r_P f'_-(r_P) \le f(0) = \zeta(x)$.

Hence, $\zeta(x) \ge \nu$ for every $x \in L$. Therefore, since $b \le q(x) \le be^\epsilon$, we can write
$$\int q(x) f\!\left(\frac{p(x)}{q(x)}\right) d\mu(x) \ge \int \big[q(x)(\zeta(x)-\nu) + \xi(x)\big]\, d\mu(x) + \nu \qquad (80)$$
$$\ge \int \big[b(\zeta(x)-\nu)\mathbf{1}_L(x) + be^\epsilon(\zeta(x)-\nu)\mathbf{1}_U(x) + \xi(x)\big]\, d\mu(x) + \nu. \qquad (81)$$
The last expression does not depend on the choice of $q(x)$. Also, we observe that all inequalities become equalities if $q(x) = q^*(x)$. This completes the proof of the optimality of $q^*$, and hence completes the proof of the achievability part.

C.3 Proof of converse part (Proposition C.6)

Let $Q \in \mathcal{Q}_{\mathcal{X},\mathcal{P},\epsilon}$ be given. Let $\mathcal{A} = \{A \subseteq \mathcal{X} : \mu(A) = \alpha\}$. For each $A \in \mathcal{A}$, let
$$p_A(x) = \begin{cases} c_2, & \text{if } x \in A, \\ c_1, & \text{if } x \in \mathcal{X}\setminus A. \end{cases}$$
Since $c_2\alpha + c_1(1-\alpha) = 1$, we have $\int p_A(x)\, d\mu(x) = 1$. Hence for each $A \in \mathcal{A}$, we can define a probability measure $P_A \in \mathcal{P}$ by $\frac{dP_A}{d\mu} = p_A$. Also, note that $P_A(A) = c_2\alpha$. For each $A \in \mathcal{A}$, let $\beta_A = Q(A|P_A)$. Then, the push-forward measures of $P_A$ and $Q(P_A)$ by the indicator function $\mathbf{1}_A$ are Bernoulli distributions with $\Pr(1) = c_2\alpha$ and $\beta_A$, respectively. By the data-processing inequality (Theorem B.2), we have
$$D_f(P_A \| Q(P_A)) \ge D^B_f(c_2\alpha \,\|\, \beta_A). \qquad (82)$$
The main lemma to proceed is the following.

Lemma C.9. Let $\mathcal{X}$ be a sample space. Let $t, u \in \mathbb{N}$, $t > u$. Let $A_1, A_2, \ldots, A_t \subseteq \mathcal{X}$ be subsets such that for each $x \in \mathcal{X}$, we have $|\{i \in [t] : x \in A_i\}| \le u$. Then for any $t$ probability measures $Q_1, \ldots, Q_t \in \mathcal{P}(\mathcal{X})$ satisfying $Q_i(A) \le e^\epsilon Q_j(A)$ for all $i, j \in [t]$ and $A \subseteq \mathcal{X}$, we have
$$\min_{i \in [t]} Q_i(A_i) \le \frac{(u/t)e^\epsilon}{(u/t)e^\epsilon + 1 - (u/t)}. \qquad (83)$$

Now, by the assumption that $\mu$ is $\alpha$-decomposable, there exist sequences $\{t_j\}_{j=1}^\infty, \{u_j\}_{j=1}^\infty \subseteq \mathbb{N}$ of positive integers such that $t_j > u_j$, $\alpha \le u_j/t_j$, $\lim_{j\to\infty} u_j/t_j = \alpha$, and for each $j \in \mathbb{N}$, there exist $t_j$ subsets $A_{j,1}, A_{j,2}, \ldots, A_{j,t_j} \subseteq \mathcal{X}$ such that $\mu(A_{j,i}) = \alpha$ for all $i \in [t_j]$, and for every $x \in \mathcal{X}$, we have $|\{i \in [t_j] : x \in A_{j,i}\}| \le u_j$. By applying Lemma C.9 to $Q_i = Q(P_{A_{j,i}})$, we obtain $\min_{i \in [t_j]} \beta_{A_{j,i}} \le \frac{(u_j/t_j)e^\epsilon}{(u_j/t_j)e^\epsilon + 1 - (u_j/t_j)}$. This implies that $\inf_{A \in \mathcal{A}} \beta_A \le \frac{(u_j/t_j)e^\epsilon}{(u_j/t_j)e^\epsilon + 1 - (u_j/t_j)}$ for all $j \in \mathbb{N}$, and by taking the limit $j \to \infty$, we have $\inf_{A \in \mathcal{A}} \beta_A \le \frac{\alpha e^\epsilon}{\alpha e^\epsilon + 1 - \alpha} = be^\epsilon\alpha$. Since $be^\epsilon < c_2$, we have $be^\epsilon\alpha < c_2\alpha$. By the continuity of $D^B_f$ and the fact that $D^B_f(\lambda_1\|\lambda_2)$ is decreasing in $\lambda_2 \in [0, \lambda_1]$, we have
$$\sup_{A \in \mathcal{A}} D_f(P_A \| Q(P_A)) \ge \sup_{A \in \mathcal{A}} D^B_f(c_2\alpha \,\|\, \beta_A) \ge D^B_f(c_2\alpha \,\|\, be^\epsilon\alpha). \qquad (84)$$
It follows that
$$R_f(Q) = \sup_{P \in \mathcal{P}} D_f(P \| Q(P)) \ge \sup_{A \in \mathcal{A}} D_f(P_A \| Q(P_A)) \ge D^B_f(c_2\alpha \,\|\, be^\epsilon\alpha). \qquad (85)$$
Furthermore, we have
$$D^B_f(c_2\alpha \,\|\, be^\epsilon\alpha) = be^\epsilon\alpha\, f\!\left(\frac{c_2\alpha}{be^\epsilon\alpha}\right) + (1 - be^\epsilon\alpha)\, f\!\left(\frac{1 - c_2\alpha}{1 - be^\epsilon\alpha}\right) \qquad (86)$$
$$= be^\epsilon\alpha\, f\!\left(\frac{c_2\alpha}{be^\epsilon\alpha}\right) + b(1-\alpha)\, f\!\left(\frac{c_1(1-\alpha)}{b(1-\alpha)}\right) \qquad (87)$$
$$= be^\epsilon\alpha f(r_2) + b(1-\alpha) f(r_1), \qquad (88)$$
and we can derive $be^\epsilon\alpha = \frac{1-r_1}{r_2-r_1}$ and $b(1-\alpha) = \frac{r_2-1}{r_2-r_1}$, as follows. From (30), (31), and the definitions of $r_1, r_2$, we have
$$1 - r_1 = \frac{(1-c_1)(c_2 - c_1 e^\epsilon)}{c_2-c_1}, \qquad (89)$$
$$r_2 - 1 = \frac{(c_2-1)(c_2 - e^\epsilon c_1)}{e^\epsilon(c_2-c_1)}. \qquad (90)$$
From this, we have
$$\frac{1-r_1}{r_2-r_1} = \frac{1-r_1}{(r_2-1) + (1-r_1)} \qquad (91)$$
$$= \frac{(1-c_1)e^\epsilon}{(1-c_1)e^\epsilon + (c_2-1)} \qquad (92)$$
$$= \frac{(1-c_1)e^\epsilon}{(e^\epsilon-1)(1-c_1) + c_2 - c_1} \qquad (93)$$
$$= be^\epsilon\,\frac{1-c_1}{c_2-c_1} \qquad (94)$$
$$= be^\epsilon\alpha. \qquad (95)$$
Also, from the above and $1 = be^\epsilon\alpha + b(1-\alpha)$, we have
$$\frac{r_2-1}{r_2-r_1} = 1 - \frac{1-r_1}{r_2-r_1} = 1 - be^\epsilon\alpha = b(1-\alpha). \qquad (96)$$
This ends the proof of the converse part.

C.4 Deduction to main theorems

In this subsection, we show that Theorem C.4 contains the main theorems, Theorems 3.1 and 3.3, as special cases.

C.4.1 Deduction to Theorem 3.1

Recall that in this setup, $\mathcal{X} = [k]$, $\mathcal{P} = \mathcal{P}([k]) = \mathcal{P}_{0,k,\mu_k}$, where $\mu_k$ is the uniform distribution on $[k]$. That is, $(c_1, c_2) = (0, k)$.
The values of the constants $\alpha, b, r_1, r_2$ are
$$\alpha = 1/k, \qquad (97)$$
$$b = \frac{k}{e^\epsilon + k - 1}, \qquad (98)$$
$$r_1 = 0, \qquad (99)$$
$$r_2 = \frac{e^\epsilon + k - 1}{e^\epsilon}. \qquad (100)$$
We can easily observe that $\mu_k$ is $(\alpha, k, 1)$-decomposable, because the sets $\{i\}$, $i \in [k]$, are disjoint and $\mu_k(\{i\}) = \frac{1}{k}$. It follows that $\mu_k$ is $\alpha$-decomposable. Hence, we can apply Theorem C.4. By a direct calculation of $R(\mathcal{X}, \mathcal{P}, \epsilon, f) = \frac{1-r_1}{r_2-r_1} f(r_2) + \frac{r_2-1}{r_2-r_1} f(r_1)$, we can derive the formula of $R([k], \mathcal{P}([k]), \epsilon, f)$ presented in Theorem 3.1. It remains to show that the mechanism $Q^*_{k,\epsilon}$ presented in Theorem 3.1 is the same as $Q^*_{0,k,\mu_k,\epsilon}$, and to show the claim about the range of $r_P$. Recall that
$$Q^*_{k,\epsilon}(x|P) = \max\left(\frac{1}{r_P} P(x),\ \frac{1}{e^\epsilon + k - 1}\right), \quad x \in [k],\ P \in \mathcal{P}([k]), \qquad (101)$$
and we claim that $r_P$ can be chosen such that $1 \le r_P \le (e^\epsilon + k - 1)/e^\epsilon$. First, as explained in Section 3.1, we can alternatively write
$$Q^*_{k,\epsilon}(x|P) = \mathrm{clip}\left(\frac{1}{r_P} P(x);\ \frac{1}{e^\epsilon + k - 1},\ \frac{e^\epsilon}{e^\epsilon + k - 1}\right). \qquad (102)$$
Since $P(x) = \frac{1}{k}\frac{dP}{d\mu_k}(x)$ for each $x \in [k]$ and $P \in \mathcal{P}([k])$, (102) can be written as
$$\frac{1}{k}\frac{dQ^*_{k,\epsilon}(P)}{d\mu_k}(x) = \mathrm{clip}\left(\frac{1}{r_P}\cdot\frac{1}{k}\frac{dP}{d\mu_k}(x);\ \frac{1}{e^\epsilon + k - 1},\ \frac{e^\epsilon}{e^\epsilon + k - 1}\right) \qquad (103)$$
$$= \frac{1}{k}\,\mathrm{clip}\left(\frac{1}{r_P}\frac{dP}{d\mu_k}(x);\ \frac{k}{e^\epsilon + k - 1},\ \frac{ke^\epsilon}{e^\epsilon + k - 1}\right), \qquad (104)$$
that is,
$$\frac{dQ^*_{k,\epsilon}(P)}{d\mu_k}(x) = \mathrm{clip}\left(\frac{1}{r_P}\frac{dP}{d\mu_k}(x);\ \frac{k}{e^\epsilon + k - 1},\ \frac{ke^\epsilon}{e^\epsilon + k - 1}\right) \qquad (105)$$
$$= \mathrm{clip}\left(\frac{1}{r_P}\frac{dP}{d\mu_k}(x);\ b,\ be^\epsilon\right). \qquad (106)$$
Hence, $Q^*_{k,\epsilon} = Q^*_{0,k,\mu_k,\epsilon}$. Second, to prove the claim about the range of $r_P$, let us fix $P \in \mathcal{P}([k])$. We need to show that
$$\sum_{x=1}^k \max\left(P(x),\ \frac{1}{e^\epsilon + k - 1}\right) \ge 1 \ge \sum_{x=1}^k \max\left(\frac{e^\epsilon}{e^\epsilon + k - 1} P(x),\ \frac{1}{e^\epsilon + k - 1}\right). \qquad (107)$$
The left inequality is easily derived by
$$\sum_{x=1}^k \max\left(P(x),\ \frac{1}{e^\epsilon + k - 1}\right) \ge \sum_{x=1}^k P(x) = 1. \qquad (108)$$
For the right inequality, we recall that $b = \frac{k}{e^\epsilon+k-1}$. Since $P(x) \le 1$, we have $\frac{e^\epsilon}{e^\epsilon+k-1} P(x) \le \frac{e^\epsilon}{e^\epsilon+k-1}$. Hence, again using $P(x) = \frac{1}{k}\frac{dP}{d\mu_k}(x)$, we have
$$\max\left(\frac{e^\epsilon}{e^\epsilon+k-1} P(x),\ \frac{1}{e^\epsilon+k-1}\right) = \mathrm{clip}\left(\frac{e^\epsilon}{e^\epsilon+k-1} P(x);\ \frac{1}{e^\epsilon+k-1},\ \frac{e^\epsilon}{e^\epsilon+k-1}\right) \qquad (109\text{-}110)$$
$$= \mathrm{clip}\left(\frac{e^\epsilon}{e^\epsilon+k-1}\cdot\frac{1}{k}\frac{dP}{d\mu_k}(x);\ \frac{1}{e^\epsilon+k-1},\ \frac{e^\epsilon}{e^\epsilon+k-1}\right) \qquad (111)$$
$$= \frac{1}{k}\,\mathrm{clip}\left(\frac{1}{r_2}\frac{dP}{d\mu_k}(x);\ b,\ be^\epsilon\right), \qquad (112)$$
and thus
$$\sum_{x=1}^k \max\left(\frac{e^\epsilon}{e^\epsilon+k-1} P(x),\ \frac{1}{e^\epsilon+k-1}\right) = \sum_{x=1}^k \frac{1}{k}\,\mathrm{clip}\left(\frac{1}{r_2}\frac{dP}{d\mu_k}(x);\ b,\ be^\epsilon\right) \qquad (113\text{-}114)$$
$$= \int \mathrm{clip}\left(\frac{1}{r_2}\frac{dP}{d\mu_k}(x);\ b,\ be^\epsilon\right) d\mu_k(x) \le 1. \qquad (115)$$
The desired inequality follows from Proposition C.2.

C.4.2 Deduction to Theorem 3.3

Recall that in this setup, $\mathcal{X} = \mathbb{R}^n$, $\mathcal{P} = \mathcal{P}_{c_1,c_2,h} = \mathcal{P}_{c_1,c_2,\mu}$, where $\mu \ll m$ and $d\mu/dm = h$. Note that the normalization condition (10) on $(c_1, c_2, h, \epsilon)$ implies the normalization condition (22) on $(c_1, c_2, \mu, \epsilon)$. It can be directly observed that $Q^*_{c_1,c_2,h,\epsilon} = Q^*_{c_1,c_2,\mu,\epsilon}$, because for each $P \in \mathcal{P}$ with corresponding pdf $p(x)$, the chain rule of the Radon-Nikodym derivative shows that $p(x) = \frac{dP}{dm}(x) = \frac{dP}{d\mu}(x)\, h(x)$. Hence, once we show that $\mu$ is $\alpha$-decomposable, Theorem C.4 and Proposition C.2 directly contain Theorem 3.3 as a special case. Thus, it remains to show that $\mu$ is $\alpha$-decomposable. In fact, we prove the stronger statement that $\mu$ is $(\alpha, t, u)$-decomposable for any $t, u \in \mathbb{N}$ such that $t > u$ and $\alpha \le u/t$. Then, since the set of rational numbers is dense in $\mathbb{R}$, this also implies that $\mu$ is $\alpha$-decomposable. To prove this, let us first introduce the following lemma.

Lemma C.10. Let $\alpha \in (0,1)$ and $t, u \in \mathbb{N}$, $t > u$, $\alpha \le u/t$. If $\mu \in \mathcal{P}(\mathcal{X})$ is $(\alpha/u, t, 1)$-decomposable, then $\mu$ is also $(\alpha, t, u)$-decomposable.

Proof. In this proof, assume that the addition and subtraction operations performed in subscripts are modulo-$t$ operations, with the identification $0 = t$. By $(\alpha/u, t, 1)$-decomposability, there are $t$ disjoint subsets $B_1, B_2, \ldots, B_t$ such that $\mu(B_i) = \alpha/u$ for each $i \in [t]$. Using this, for each $i \in [t]$, define $A_i$ as $A_i = \bigcup_{j=0}^{u-1} B_{i+j}$. As the $B_i$'s are disjoint, we have $\mu(A_i) = \sum_{j=0}^{u-1} \mu(B_{i+j}) = u \cdot (\alpha/u) = \alpha$ for all $i \in [t]$. Also, for each $x \in B_i$, $x$ is contained in exactly $u$ sets among $A_1, \ldots, A_t$, which are $A_i, A_{i-1}, \ldots, A_{i-u+1}$. Furthermore, if $x \notin B_i$ for all $i \in [t]$, then $x$ is contained in none of the $A_i$. Hence $|\{i \in [t] : x \in A_i\}| \le u$ for all $x \in \mathcal{X}$. Thus $\mu$ is $(\alpha, t, u)$-decomposable.
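The cyclic-union construction in the proof of Lemma C.10 is easy to express in code. The toy sketch below (ours) builds the sets $A_i$ from disjoint $B_j$ on a finite space with uniform measure and verifies the coverage bound of Definition C.3.

    def build_cover(B, t, u):
        """A_i = B_i ∪ B_{i+1} ∪ ... ∪ B_{i+u-1} (indices mod t), as in the
        proof of Lemma C.10; each x in some B_j lands in exactly u of the A_i."""
        return [set().union(*(B[(i + j) % t] for j in range(u))) for i in range(t)]

    # toy check: uniform measure on 20 points, t = 5 disjoint B_j of mass 1/5
    t, u, m = 5, 2, 20
    B = [set(range(i * (m // t), (i + 1) * (m // t))) for i in range(t)]
    A = build_cover(B, t, u)
    assert all(sum(x in Ai for Ai in A) <= u for x in range(m))  # Definition C.3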
By this lemma, it suffices to show that for any $t \in \mathbb{N}$ such that $t \ge 2$ and $\alpha \le 1/t$, $\mu$ is $(\alpha, t, 1)$-decomposable. As $\mu \ll m$, the map $s \in \mathbb{R} \mapsto \mu\big((-\infty, s] \times \mathbb{R}^{n-1}\big)$ is continuous, and as $s \to -\infty$ and $s \to \infty$, it converges to $0$ and $1$, respectively. Hence by the intermediate value theorem, for each $i \in [t]$, there exists $s_i \in \mathbb{R}$ such that $\mu\big((-\infty, s_i] \times \mathbb{R}^{n-1}\big) = \alpha i$. Then, setting $A_1 = (-\infty, s_1] \times \mathbb{R}^{n-1}$ and $A_i = (s_{i-1}, s_i] \times \mathbb{R}^{n-1}$ for $i \ge 2$ gives the desired $A_i$'s in the definition of decomposability. In conclusion, $\mu$ is $\alpha$-decomposable, and hence Theorem 3.3 can be deduced from Theorem C.4.
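For $n = 1$, this intermediate-value construction amounts to cutting at quantiles of $\mu$. A small sketch (ours; assumes scipy and a numerical cdf for $\mu$):

    from scipy.optimize import brentq
    from scipy.stats import norm

    def slab_cuts(cdf, alpha, t, lo=-50.0, hi=50.0):
        """Cut points s_1 < ... < s_t with mu((-inf, s_i]) = i * alpha
        (requires t * alpha <= 1).  Then A_1 = (-inf, s_1] and
        A_i = (s_{i-1}, s_i] are disjoint slabs with mu(A_i) = alpha,
        witnessing (alpha, t, 1)-decomposability."""
        return [brentq(lambda s, i=i: cdf(s) - i * alpha, lo, hi)
                for i in range(1, t + 1)]

    # example: mu = standard Gaussian on R, four disjoint slabs of mass 0.2
    cuts = slab_cuts(norm.cdf, 0.2, 4)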
C.5 Proof of Proposition C.2

Let $P \in \mathcal{P}$ be given. Let $p = \frac{dP}{d\mu}$, and again assume that $c_1 \le p(x) \le c_2$ for all $x \in \mathcal{X}$. Similar to the proof of the achievability part, for each $r > 0$, let us define the following sets, which form a partition of $\mathcal{X}$:
$$L_r = \left\{x \in \mathcal{X} : \tfrac{1}{r} p(x) < b\right\}, \qquad (116)$$
$$M_r = \left\{x \in \mathcal{X} : b \le \tfrac{1}{r} p(x) \le be^\epsilon\right\}, \qquad (117)$$
$$U_r = \left\{x \in \mathcal{X} : \tfrac{1}{r} p(x) > be^\epsilon\right\}. \qquad (118)$$
For $r = 0$, we let $L_0 = \{x \in \mathcal{X} : p(x) = 0\}$, $U_0 = \{x \in \mathcal{X} : p(x) > 0\}$, and $M_0 = \emptyset$. Also, let $q_r(x) = \mathrm{clip}\big(\frac{1}{r}p(x);\, b,\, be^\epsilon\big)$. Then $q_r(x) = b$, $\frac{1}{r}p(x)$, $be^\epsilon$ for $x \in L_r, M_r, U_r$, respectively. First, we show that $\int q_{r_1}(x)\, d\mu(x) \ge 1$. We divide into the cases $c_1 = 0$ and $c_1 > 0$. Suppose first that $c_1 = 0$. Then $r_1 = 0$, $\alpha = \frac{1}{c_2}$, and $b = \frac{c_2}{e^\epsilon + c_2 - 1}$. Observe that
$$1 = \int p(x)\, d\mu(x) = \int_{U_0} p(x)\, d\mu(x) \le c_2\,\mu(U_0), \qquad (119)$$
hence $\mu(U_0) \ge 1/c_2$. It follows that
$$\int q_{r_1}(x)\, d\mu(x) = b\,\mu(L_0) + be^\epsilon\mu(U_0) \qquad (120)$$
$$= b(1 - \mu(U_0)) + be^\epsilon\mu(U_0) \qquad (121)$$
$$= b(e^\epsilon - 1)\mu(U_0) + b \qquad (122)$$
$$\ge \frac{b(e^\epsilon - 1)}{c_2} + b \qquad (123)$$
$$= b\cdot\frac{e^\epsilon + c_2 - 1}{c_2} = 1. \qquad (124)$$
Next, suppose that $c_1 > 0$. Then $r_1 > 0$. From $p(x) \ge c_1$, we have $\frac{1}{r_1} p(x) \ge \frac{c_1}{r_1} = b$. Hence $L_{r_1} = \emptyset$. Thus
$$\int q_{r_1}(x)\, d\mu(x) = \frac{1}{r_1}\int_{M_{r_1}} p(x)\, d\mu(x) + be^\epsilon\mu(U_{r_1}) \qquad (125)$$
$$= \frac{1}{r_1}\left(\int_{M_{r_1}} p(x)\, d\mu(x) + c_1 e^\epsilon\mu(U_{r_1})\right). \qquad (126)$$
Let $S_1 = \int_{M_{r_1}} p(x)\, d\mu(x)$ and $T_1 = \mu(U_{r_1})$, so that
$$\int q_{r_1}(x)\, d\mu(x) = \frac{1}{r_1}(S_1 + c_1 e^\epsilon T_1). \qquad (127)$$
As $c_1 \le p(x) \le c_2$, we have
$$S_1 = \int_{M_{r_1}} p(x)\, d\mu(x) \ge c_1\mu(M_{r_1}) = c_1(1 - \mu(U_{r_1})) = c_1(1 - T_1), \qquad (128)$$
$$S_1 = 1 - \int_{U_{r_1}} p(x)\, d\mu(x) \ge 1 - c_2\mu(U_{r_1}) = 1 - c_2 T_1. \qquad (129)$$
From these, we get
$$S_1 + c_1 T_1 \ge c_1, \qquad (130)$$
$$S_1 + c_2 T_1 \ge 1. \qquad (131)$$
As $c_1 < c_1 e^\epsilon < c_2$, we can express $c_1 e^\epsilon$ as a convex combination of $c_1$ and $c_2$:
$$c_1 e^\epsilon = \frac{c_2 - c_1 e^\epsilon}{c_2 - c_1}\, c_1 + \frac{c_1 e^\epsilon - c_1}{c_2 - c_1}\, c_2. \qquad (132)$$
Hence, by taking $\frac{c_2 - c_1 e^\epsilon}{c_2 - c_1} \times (130) + \frac{c_1 e^\epsilon - c_1}{c_2 - c_1} \times (131)$, we get
$$S_1 + c_1 e^\epsilon T_1 \ge \frac{c_2 - c_1 e^\epsilon}{c_2 - c_1}\, c_1 + \frac{c_1 e^\epsilon - c_1}{c_2 - c_1} \qquad (133)$$
$$= \frac{c_2 - c_1 e^\epsilon + e^\epsilon - 1}{c_2 - c_1}\, c_1 \qquad (134)$$
$$= \frac{(e^\epsilon - 1)(1 - c_1) + c_2 - c_1}{c_2 - c_1}\, c_1 \qquad (135)$$
$$= r_1. \qquad (136)$$
Thus we have $\int q_{r_1}(x)\, d\mu(x) \ge 1$. Similarly, we show that $\int q_{r_2}(x)\, d\mu(x) \le 1$. From $p(x) \le c_2$, we have $\frac{1}{r_2} p(x) \le \frac{c_2}{r_2} = be^\epsilon$. Hence $U_{r_2} = \emptyset$. Thus
$$\int q_{r_2}(x)\, d\mu(x) = b\,\mu(L_{r_2}) + \frac{1}{r_2}\int_{M_{r_2}} p(x)\, d\mu(x) \qquad (137)$$
$$= \frac{1}{r_2}\left(c_2 e^{-\epsilon}\mu(L_{r_2}) + \int_{M_{r_2}} p(x)\, d\mu(x)\right). \qquad (138)$$
Similarly as above, let $S_2 = \mu(L_{r_2})$ and $T_2 = \int_{M_{r_2}} p(x)\, d\mu(x)$, so that
$$\int q_{r_2}(x)\, d\mu(x) = \frac{1}{r_2}(c_2 e^{-\epsilon} S_2 + T_2), \qquad (139)$$
and we have
$$T_2 = \int_{M_{r_2}} p(x)\, d\mu(x) \le c_2\mu(M_{r_2}) = c_2(1 - \mu(L_{r_2})) = c_2(1 - S_2), \qquad (140)$$
$$T_2 = 1 - \int_{L_{r_2}} p(x)\, d\mu(x) \le 1 - c_1\mu(L_{r_2}) = 1 - c_1 S_2, \qquad (141)$$
that is,
$$c_2 S_2 + T_2 \le c_2, \qquad (142)$$
$$c_1 S_2 + T_2 \le 1. \qquad (143)$$
As $c_1 \le c_2 e^{-\epsilon} \le c_2$, we have
$$c_2 e^{-\epsilon} = \frac{c_2 - c_2 e^{-\epsilon}}{c_2 - c_1}\, c_1 + \frac{c_2 e^{-\epsilon} - c_1}{c_2 - c_1}\, c_2, \qquad (144)$$
and by the same reasoning, we have
$$c_2 e^{-\epsilon} S_2 + T_2 \le \frac{c_2 - c_2 e^{-\epsilon}}{c_2 - c_1} + \frac{c_2 e^{-\epsilon} - c_1}{c_2 - c_1}\, c_2 \qquad (145)$$
$$= \frac{1 - e^{-\epsilon} + c_2 e^{-\epsilon} - c_1}{c_2 - c_1}\, c_2 \qquad (146)$$
$$= \frac{(e^\epsilon - 1)(1 - c_1) + c_2 - c_1}{c_2 - c_1}\, c_2 e^{-\epsilon} \qquad (147)$$
$$= r_2. \qquad (148)$$
Thus we have $\int q_{r_2}(x)\, d\mu(x) \le 1$. Finally, we show that $\int q_{r_1}(x)\, d\mu(x) = 1$ implies $\int q_r(x)\, d\mu(x) = 1$ for all $r \in [r_1, r_2]$. Let us assume $\int q_{r_1}(x)\, d\mu(x) = 1$. We first claim that $p(x) = c_2$ for $\mu$-a.e. $x \in U_{r_1}$, and $p(x) = c_1$ for $\mu$-a.e. $x \in \mathcal{X}\setminus U_{r_1}$. Again, we divide into the cases $c_1 = 0$ and $c_1 > 0$. Suppose first that $c_1 = 0$. By tracking the proof of $\int q_{r_1}(x)\, d\mu(x) \ge 1$, we can observe that the equality $\int q_{r_1}(x)\, d\mu(x) = 1$ holds if and only if $\mu(U_0) = 1/c_2$, if and only if $p(x) = c_2$ for $\mu$-a.e. $x \in U_0$. By definition of $U_0$, we have $p(x) = 0 = c_1$ for all $x \in \mathcal{X}\setminus U_0$. Hence we get the claim. Next, suppose that $c_1 > 0$. Again, by tracking the proof of $\int q_{r_1}(x)\, d\mu(x) \ge 1$, we can observe that the equality $\int q_{r_1}(x)\, d\mu(x) = 1$ is equivalent to any of the following statements:
- Both $S_1 + c_1 T_1 = c_1$ and $S_1 + c_2 T_1 = 1$ hold.
- Both $\int_{M_{r_1}} p(x)\, d\mu(x) = c_1\mu(M_{r_1})$ and $\int_{U_{r_1}} p(x)\, d\mu(x) = c_2\mu(U_{r_1})$ hold.
- $p(x) = c_1$ for $\mu$-a.e. $x \in M_{r_1}$, and $p(x) = c_2$ for $\mu$-a.e. $x \in U_{r_1}$.

Since $L_{r_1} = \emptyset$, we again get the claim. WLOG, assume that $p(x) = c_2$ for all $x \in U_{r_1}$, and $p(x) = c_1$ for all $x \in \mathcal{X}\setminus U_{r_1}$. Then, for every $r \in (r_1, r_2]$, we have the following:
- For each $x \in U_{r_1}$, we have $\frac{1}{r} p(x) \ge \frac{c_2}{r_2} = be^\epsilon$, hence $q_r(x) = be^\epsilon$.
- For each $x \in \mathcal{X}\setminus U_{r_1}$: if $c_1 = 0$, then $p(x) = 0$ and $q_r(x) = b$; and if $c_1 > 0$, then $r_1 > 0$ and $\frac{1}{r} p(x) \le \frac{1}{r_1} c_1 = b$, hence again $q_r(x) = b$.

Also, we have $1 = \int p(x)\, d\mu(x) = c_2\mu(U_{r_1}) + c_1(1 - \mu(U_{r_1}))$, hence $\mu(U_{r_1}) = \frac{1-c_1}{c_2-c_1} = \alpha$. It follows that for every $r \in (r_1, r_2]$, we have $\int q_r(x)\, d\mu(x) = be^\epsilon\mu(U_{r_1}) + b(1 - \mu(U_{r_1})) = be^\epsilon\alpha + b(1-\alpha) = 1$. This concludes the proof.

C.6 Proof of Lemma C.7

Although the proof can be found in [59], we present it here for completeness. If $r_1 = 0$ and $f(0) = \infty$, then the RHS of the inequality we want to show is $\infty$, thus it becomes trivial. Hence, we may assume that either $r_1 > 0$ or $f(0) < \infty$. By the assumption, $p(x)/q(x)$ is the following convex combination of $r_1$ and $r_2$ for $\mu$-a.e. $x \in \mathcal{X}$:
$$\frac{p(x)}{q(x)} = \frac{(p(x)/q(x)) - r_1}{r_2 - r_1}\, r_2 + \frac{r_2 - (p(x)/q(x))}{r_2 - r_1}\, r_1. \qquad (149)$$
Hence, by the convexity of $f$ and $\int p(x)\, d\mu(x) = \int q(x)\, d\mu(x) = 1$, we have
$$D_f(P\|Q) = \int q(x) f(p(x)/q(x))\, d\mu(x) \qquad (150)$$
$$\le \int q(x)\left[\frac{(p(x)/q(x)) - r_1}{r_2 - r_1} f(r_2) + \frac{r_2 - (p(x)/q(x))}{r_2 - r_1} f(r_1)\right] d\mu(x) \qquad (151)$$
$$= \int \left[\frac{p(x) - r_1 q(x)}{r_2 - r_1} f(r_2) + \frac{r_2 q(x) - p(x)}{r_2 - r_1} f(r_1)\right] d\mu(x) \qquad (152)$$
$$= \frac{1 - r_1}{r_2 - r_1} f(r_2) + \frac{r_2 - 1}{r_2 - r_1} f(r_1). \qquad (153)$$

C.7 Proof of Lemma C.9

First, we claim that for each $i \in [t-u]$, we can construct a partition $\{B_{i,1}, B_{i,2}, \ldots, B_{i,u}\}$ of $A_{u+i}$ into $u$ (measurable) sets, such that for each $j \in [u]$, the sets $A_j, B_{1,j}, B_{2,j}, \ldots, B_{t-u,j}$ are disjoint. The construction is inductive, as follows: given $i \in [t-u]$, suppose that we have constructed such partitions of $A_{u+1}, \ldots, A_{u+i-1}$ such that for each $j \in [u]$, the sets $A_j, B_{1,j}, B_{2,j}, \ldots, B_{i-1,j}$ are disjoint. Let $C_{i,j} = A_j \cup \bigcup_{k=1}^{i-1} B_{k,j}$. Then for each $x \in A_{u+i}$, at least one of $C_{i,1}, C_{i,2}, \ldots, C_{i,u}$ does not contain $x$, because
$$\sum_{j=1}^u \mathbf{1}_{\mathcal{X}\setminus C_{i,j}}(x) = \sum_{j=1}^u \big(1 - \mathbf{1}_{C_{i,j}}(x)\big) = u - \sum_{j=1}^u \mathbf{1}_{C_{i,j}}(x) \qquad (154)$$
$$\ge u - \sum_{j=1}^u \left(\mathbf{1}_{A_j}(x) + \sum_{k=1}^{i-1} \mathbf{1}_{B_{k,j}}(x)\right) \qquad (155)$$
$$= u - \sum_{j=1}^u \mathbf{1}_{A_j}(x) - \sum_{k=1}^{i-1} \mathbf{1}_{A_{u+k}}(x) \qquad (156)$$
$$\ge u - (u - 1) = 1, \qquad (157)$$
where the last line follows since $x \in A_{u+i}$, so at most $u-1$ sets among $A_1, A_2, \ldots, A_{u+i-1}$ can contain $x$. Hence, setting
$$B_{i,j} = \big\{x \in A_{u+i} : j = \min\{j' \in [u] : x \notin C_{i,j'}\}\big\} \qquad (158)$$
gives a partition $\{B_{i,1}, B_{i,2}, \ldots, B_{i,u}\}$ of $A_{u+i}$ such that for each $j \in [u]$, $C_{i,j}$ and $B_{i,j}$ are disjoint. This implies that $A_j, B_{1,j}, B_{2,j}, \ldots, B_{i,j}$ are disjoint. Now, let $\ell = \min_{i \in [t]} Q_i(A_i)$. Using these partitions and the disjointness for each $j \in [u]$, we can now show that
$$u \ge \sum_{j=1}^u \left(Q_j(A_j) + \sum_{k=1}^{t-u} Q_j(B_{k,j})\right) \qquad (159)$$
$$= \sum_{j=1}^u Q_j(A_j) + \sum_{k=1}^{t-u} \sum_{j=1}^u Q_j(B_{k,j}) \qquad (160)$$
$$\ge \sum_{j=1}^u Q_j(A_j) + \sum_{k=1}^{t-u} e^{-\epsilon} \sum_{j=1}^u Q_{u+k}(B_{k,j}) \qquad (161)$$
$$= \sum_{j=1}^u Q_j(A_j) + \sum_{k=1}^{t-u} e^{-\epsilon} Q_{u+k}(A_{u+k}) \qquad (162)$$
$$\ge \ell\big(u + e^{-\epsilon}(t-u)\big). \qquad (163)$$
Hence,
$$\ell \le \frac{u}{u + e^{-\epsilon}(t-u)} = \frac{(u/t)e^\epsilon}{(u/t)e^\epsilon + 1 - (u/t)}. \qquad (164)$$
D Proofs of Remaining Propositions

We present the proofs of the remaining propositions in the paper, Propositions 3.2 and 4.1.

D.1 Proof of Proposition 3.2

By the assumption, there are subsets $\{A_i\}_{i=1}^\infty$ of $\mathcal{X}$ which are pairwise disjoint and $P_i(A_i) = 1$ for all $i \in \mathbb{N}$. Let us pick $P_0 \in \mathcal{P}$, and let $Q_0 = Q(P_0)$. Since the $A_i$'s are disjoint, we have $\sum_{i=1}^\infty Q_0(A_i) = Q_0\big(\bigcup_{i=1}^\infty A_i\big) \le 1 < \infty$. Hence, we have $\lim_{i\to\infty} Q_0(A_i) = 0$. Also, by the definition of $\epsilon$-LDP, for any $P \in \mathcal{P}$ and $i \in \mathbb{N}$, we have $Q(P)(A_i) \le e^\epsilon Q_0(A_i)$. In particular, this implies $Q(P_i)(A_i) \le e^\epsilon Q_0(A_i)$, and thus $\lim_{i\to\infty} Q(P_i)(A_i) = 0$. Now, similar to the converse proof in Appendix C.3, let $\beta_i = Q(P_i)(A_i)$. Then, the push-forward measures of $P_i$ and $Q(P_i)$ by the indicator function $\mathbf{1}_{A_i}$ are Bernoulli distributions with $\Pr(1) = 1$ and $\beta_i$, respectively. By the data-processing inequality (Theorem B.2), we have
$$D_f(P_i \| Q(P_i)) \ge D^B_f(1 \,\|\, \beta_i). \qquad (165)$$
Since $\lim_{i\to\infty} \beta_i = 0$, by the continuity of $D^B_f$, we have
$$R_f(Q) \ge \limsup_{i\to\infty} D_f(P_i \| Q(P_i)) \ge \lim_{i\to\infty} D^B_f(1 \,\|\, \beta_i) = D^B_f(1 \,\|\, 0) = M_f, \qquad (166)$$
where the last equality is because two Bernoulli distributions with $\Pr(1) = 1$ and $\Pr(1) = 0$, respectively, are mutually singular. Hence, we must have $R_f(Q) = M_f$.

D.2 Proof of Proposition 4.1

For generality, we prove that the statement of Proposition 4.1 holds in the general setup of Definition C.1. That is, for the setup in Definition C.1, the mechanism $Q^* = Q^*_{c_1,c_2,\mu,\epsilon}$ satisfies
$$D_{TV}(Q^*(P), Q^*(P')) \le \frac{2}{\max(r_P, r_{P'})}\, D_{TV}(P, P') \qquad (167)$$
for all $P, P' \in \mathcal{P}$. As the mechanisms $Q^*_{k,\epsilon}$ and $Q^*_{c_1,c_2,h,\epsilon}$ in the paper are special cases of $Q^*_{c_1,c_2,\mu,\epsilon}$, a proof in this general setup implies Proposition 4.1. Now, let us assume the setup in Definition C.1. Let $P, P' \in \mathcal{P}$ be given. WLOG, assume that $r_P \ge r_{P'}$. Let $p = dP/d\mu$, $p' = dP'/d\mu$, and
$$q(x) = \mathrm{clip}\left(\frac{1}{r_P} p(x);\ b,\ be^\epsilon\right), \qquad q'(x) = \mathrm{clip}\left(\frac{1}{r_{P'}} p'(x);\ b,\ be^\epsilon\right),$$
so that $q = dQ^*(P)/d\mu$ and $q' = dQ^*(P')/d\mu$. For simplicity, we denote $\mathrm{clip}(x) := \mathrm{clip}(x;\, b,\, be^\epsilon)$. We first note that $\mathrm{clip}(x)$ is monotone increasing and 1-Lipschitz in $x$; in particular, since $r_{P'} \le r_P$, we have $\mathrm{clip}\big(\frac{1}{r_{P'}} p'(x)\big) \ge \mathrm{clip}\big(\frac{1}{r_P} p'(x)\big)$. From this and the equivalent expressions for the total variation distance in Appendix B, we have
$$D_{TV}(Q^*(P), Q^*(P')) \qquad (168)$$
$$= \int_{x:\, q(x) \ge q'(x)} \big(q(x) - q'(x)\big)\, d\mu(x) \qquad (169)$$
$$= \int_{x:\, q(x) \ge q'(x)} \left[\mathrm{clip}\left(\frac{1}{r_P} p(x)\right) - \mathrm{clip}\left(\frac{1}{r_{P'}} p'(x)\right)\right] d\mu(x) \qquad (170)$$
$$\le \int_{x:\, q(x) \ge q'(x)} \left[\mathrm{clip}\left(\frac{1}{r_P} p(x)\right) - \mathrm{clip}\left(\frac{1}{r_P} p'(x)\right)\right] d\mu(x) \qquad (171)$$
$$\le \int_{x:\, q(x) \ge q'(x)} \frac{1}{r_P}\big|p(x) - p'(x)\big|\, d\mu(x) \qquad (172)$$
$$\le \frac{1}{r_P}\int \big|p(x) - p'(x)\big|\, d\mu(x) = \frac{2}{r_P}\, D_{TV}(P, P'). \qquad (173)$$
This ends the proof.
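A quick numerical spot-check of (167) on a finite space (our sketch, reusing the clipped form of Definition C.1 with $(c_1, c_2) = (0, k)$):

    import numpy as np
    from scipy.optimize import brentq

    k, eps = 8, 1.0
    b = k / (np.exp(eps) + k - 1)

    def mechanism(P):
        """Clipped mechanism of Definition C.1 on [k] with mu uniform."""
        p = k * P                                  # dP/dmu
        r = brentq(lambda r: np.clip(p / r, b, b * np.exp(eps)).mean() - 1,
                   1e-9, 10.0)
        return np.clip(p / r, b, b * np.exp(eps)) / k, r

    tv = lambda P, Q: 0.5 * np.abs(P - Q).sum()
    P1, P2 = np.random.dirichlet(np.ones(k)), np.random.dirichlet(np.ones(k))
    (Q1, rP1), (Q2, rP2) = mechanism(P1), mechanism(P2)
    assert tv(Q1, Q2) <= 2 / max(rP1, rP2) * tv(P1, P2) + 1e-9   # eq. (167)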
E Behaviors of Proposed Mechanisms

In this appendix, we present formal proofs for the behaviors of the proposed mechanisms presented in Sections 3.1 and 3.2. We first observe that the formula of the optimal worst-case $f$-divergence in the general case,
$$\frac{1-r_1}{r_2-r_1} f(r_2) + \frac{r_2-1}{r_2-r_1} f(r_1), \qquad (174)$$
is the $y$-coordinate value at $x = 1$ of the line segment joining $(r_1, f(r_1))$ and $(r_2, f(r_2))$. As $f$ is convex, this formula is increasing in $r_2$ and decreasing in $r_1$, provided that $r_1 < 1 < r_2$. Now, let us present the proofs.

If $f(0) = \infty$ and $\mathcal{P}$ contains two mutually singular distributions, then $R(\mathcal{X}, \mathcal{P}, \epsilon, f) = \infty$.

Proof. Let $P_1, P_2 \in \mathcal{P}$ be mutually singular distributions with disjoint supports $A_1, A_2 \subseteq \mathcal{X}$, respectively (that is, $P_1(A_1) = P_2(A_2) = 1$ and $A_1 \cap A_2 = \emptyset$). Suppose that $Q$ is an $\epsilon$-LDP sampling mechanism for $(\mathcal{X}, \mathcal{P})$ such that $R_f(Q) < \infty$. Since $f(0) = \infty$, $D_f(P\|Q) < \infty$ implies $Q \ll P$. Hence, as $D_f(P_i\|Q(P_i)) < \infty$ and $P_i(A_i^c) = 0$, we have $Q(A_i^c|P_i) = 0$ for each $i = 1, 2$. As $Q$ satisfies $\epsilon$-LDP, we have $Q(A_1^c|P_2) \le e^\epsilon Q(A_1^c|P_1) = 0$, so $Q(A_1^c|P_2) = 0$. Now, since $A_1 \cap A_2 = \emptyset$, we have $A_1^c \cup A_2^c = \mathcal{X}$. But by the union bound, $1 = Q(\mathcal{X}|P_2) \le Q(A_1^c|P_2) + Q(A_2^c|P_2) = 0$, which is a contradiction. Hence, $R_f(Q) = \infty$ for every $\epsilon$-LDP sampling mechanism $Q$.

From now on, assume $f(0) < \infty$. Let us first prove the behaviors for the case of finite $\mathcal{X}$ in Section 3.1.

$R([k], \mathcal{P}([k]), \epsilon, f)$ is decreasing in $\epsilon$ and increasing in $k$.

Proof. Recall from Appendix C.4.1 that the case of $\mathcal{X} = [k]$, $\mathcal{P} = \mathcal{P}([k])$ fits into the general case with
$$r_1 = 0, \qquad (175)$$
$$r_2 = \frac{e^\epsilon + k - 1}{e^\epsilon}. \qquad (176)$$
Here, $r_2$ is decreasing in $\epsilon$ and increasing in $k$. As (174) is increasing in $r_2$, we get the desired claim.

For a fixed $k$, we have $R([k], \mathcal{P}([k]), \epsilon, f) \to 0$ as $\epsilon \to \infty$.

Proof. As $\epsilon \to \infty$, we have $r_2 \to 1$. As $f$ is continuous, $f(1) = 0$, and $f(0) < \infty$, we obtain from (174) that
$$R([k], \mathcal{P}([k]), \epsilon, f) \to \frac{1-0}{1-0} f(1) + \frac{1-1}{1-0} f(0) = 0. \qquad (177)$$

For a fixed $k$, as $\epsilon \to 0$, we have $Q^*_{k,\epsilon}(x|P) \to 1/k$ for every $P \in \mathcal{P}([k])$ and $x \in [k]$.

Proof. We know that $\frac{1}{e^\epsilon+k-1} \le Q^*_{k,\epsilon}(x|P) \le \frac{e^\epsilon}{e^\epsilon+k-1}$. As $\epsilon \to 0$, both $\frac{1}{e^\epsilon+k-1}$ and $\frac{e^\epsilon}{e^\epsilon+k-1}$ converge to $\frac{1}{k}$, hence we get the desired claim.

Next, let us prove the behaviors for the continuous case in Section 3.2.

If $c_1 = 0$ and $f(0) = \infty$, then $R(\mathbb{R}^n, \mathcal{P}_{c_1,c_2,h}, \epsilon, f) = \infty$.

Proof. Since $r_2 > 1$ and $r_1 = 0$, we have $\frac{r_2-1}{r_2-r_1} = \frac{r_2-1}{r_2} > 0$. Hence $\frac{r_2-1}{r_2-r_1} f(r_1) = \frac{r_2-1}{r_2-r_1} f(0) = \infty$, which proves $R(\mathbb{R}^n, \mathcal{P}_{c_1,c_2,h}, \epsilon, f) = \infty$.

For a fixed $(c_1, c_2)$, $R(\mathbb{R}^n, \mathcal{P}_{c_1,c_2,h}, \epsilon, f)$ is decreasing in $\epsilon$.

Proof. We can observe that $r_1$ is increasing in $\epsilon$ and $r_2$ is decreasing in $\epsilon$. Since (174) is increasing in $r_2$ and decreasing in $r_1$, we get the desired claim.

For a fixed $(c_1, c_2)$ with $c_1 = 0$, as $\epsilon \to \infty$, we have $R(\mathbb{R}^n, \mathcal{P}_{c_1,c_2,h}, \epsilon, f) \to 0$.

Proof. We can observe that $r_1 = 0$ and $r_2 \to 1$. Hence, by the same argument as (177), we have the desired claim.
F Detailed Explanation of Setups in Numerical Results

We present details about the setups used to produce the numerical results in Section 5 and to generate Figure 1 in Section 1. In this appendix, for each $\mu \in \mathbb{R}^n$,
$$N_{\mu,\Sigma}[x] = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\Sigma)}}\exp\left(-\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{2}\right)$$
is the pdf of the $n$-dimensional Gaussian distribution with mean $\mu$ and covariance $\Sigma$. Note that we write $N(\mu, \Sigma)$ to refer to the Gaussian distribution itself with mean $\mu$ and covariance matrix $\Sigma$.

F.1 Explanation for numerical result for finite data space

In this appendix, we show that among the choices of $Q_0$ in the baseline for $\mathcal{X} = [k]$, choosing $Q_0$ to be the uniform distribution minimizes $R_f(Q)$, and we present the precise value of $R_f(Q)$ for the baseline with uniform $Q_0$. From now on, let us fix $k, \epsilon, f$, and write $Q_{Q_0}$ for the baseline with reference distribution $Q_0$. Also, let $\delta_x \in \mathcal{P}([k])$ be the point mass at $x \in [k]$. Recall that
$$\mathcal{M}_{\epsilon,Q_0} = \{Q \in \mathcal{P}(\mathcal{X}) : e^{-\epsilon/2} Q_0(A) \le Q(A) \le e^{\epsilon/2} Q_0(A),\ \forall A \subseteq \mathcal{X}\}, \qquad (178)$$
and we set the baseline to satisfy
$$D_f(P \| Q_{Q_0}(P)) = \inf_{Q \in \mathcal{M}_{\epsilon,Q_0}} D_f(P\|Q). \qquad (179)$$
First, if $Q_0(x) = 0$ for some $x \in [k]$, then $Q_{Q_0}(x|P) \le e^{\epsilon/2} Q_0(x) = 0$, so $Q_{Q_0}(x|P) = 0$ for every $P \in \mathcal{P}([k])$. In particular, $Q_{Q_0}(x|\delta_x) = 0$, and this implies that $Q_{Q_0}(\delta_x)$ and $\delta_x$ are mutually singular. Hence,
$$R_f(Q_{Q_0}) \ge D_f(\delta_x \| Q_{Q_0}(\delta_x)) = M_f, \qquad (180)$$
concluding that $R_f(Q_{Q_0}) = M_f$. Hence, to minimize $R_f(Q_{Q_0})$, it suffices to set $Q_0(x) > 0$ for all $x \in [k]$, and from now on, we only consider such $Q_0$. We note that for every $x \in [k]$ and $P \in \mathcal{P}([k])$, we have $Q_{Q_0}(x|P) \le e^{\epsilon/2} Q_0(x)$, and furthermore
$$Q_{Q_0}(x|P) = 1 - \sum_{y \in [k]\setminus\{x\}} Q_{Q_0}(y|P) \qquad (181)$$
$$\le 1 - \sum_{y \in [k]\setminus\{x\}} e^{-\epsilon/2} Q_0(y) \qquad (182)$$
$$= 1 - e^{-\epsilon/2}(1 - Q_0(x)) \qquad (183)$$
$$= e^{-\epsilon/2} Q_0(x) + (1 - e^{-\epsilon/2}). \qquad (184)$$
Letting
$$B(t) = \min\{e^{\epsilon/2} t,\ e^{-\epsilon/2} t + (1 - e^{-\epsilon/2})\}, \qquad (185)$$
we therefore have
$$Q_{Q_0}(x|P) \le B(Q_0(x)). \qquad (186)$$
Note that $B(t)$ is increasing in $t$. Now, for any $x \in [k]$, we have
$$R_f(Q_{Q_0}) \ge D_f(\delta_x \| Q_{Q_0}(\delta_x)) \qquad (187)$$
$$= Q_{Q_0}(x|\delta_x)\, f\!\left(\frac{1}{Q_{Q_0}(x|\delta_x)}\right) + \sum_{y \in [k]\setminus\{x\}} Q_{Q_0}(y|\delta_x)\, f(0) \qquad (188)$$
$$= Q_{Q_0}(x|\delta_x)\, f\!\left(\frac{1}{Q_{Q_0}(x|\delta_x)}\right) + \big(1 - Q_{Q_0}(x|\delta_x)\big) f(0). \qquad (189)$$
We can observe that the last term can be written in the form of the optimal worst-case $f$-divergence (174) with $r_1 = 0$ and $r_2 = 1/Q_{Q_0}(x|\delta_x)$. In other words, let
$$R(r_1, r_2) = \frac{1-r_1}{r_2-r_1} f(r_2) + \frac{r_2-1}{r_2-r_1} f(r_1). \qquad (190)$$
Then we have $R_f(Q_{Q_0}) \ge R(0, 1/Q_{Q_0}(x|\delta_x))$. As noted in Appendix E, $R(r_1, r_2)$ is increasing in $r_2$ for $0 \le r_1 < 1 < r_2$. Since $Q_{Q_0}(x|\delta_x) \le B(Q_0(x))$, we have
$$R_f(Q_{Q_0}) \ge R(0, 1/B(Q_0(x))). \qquad (191)$$
Since this should hold for all $x \in [k]$, we have
$$R_f(Q_{Q_0}) \ge \max_{x \in [k]} R(0, 1/B(Q_0(x))). \qquad (192)$$
Since $\min_{x \in [k]} Q_0(x) \le 1/k$ for all $Q_0 \in \mathcal{P}([k])$, and $B(t)$ is increasing in $t$, we have
$$R_f(Q_{Q_0}) \ge R(0, 1/B(1/k)). \qquad (193)$$
Now, let $\mu_k$ be the uniform distribution over $[k]$. We will show that
$$R_f(Q_{\mu_k}) = R(0, 1/B(1/k)), \qquad (194)$$
which suffices to prove that $Q_0 = \mu_k$ minimizes $R_f(Q_{Q_0})$. We observe that $\mathcal{M}_{\epsilon,\mu_k}$ is a convex set in $\mathcal{P}([k])$. Since $D_f(P\|Q)$ is jointly convex in $(P, Q)$, we obtain that $D_f(P \| Q_{\mu_k}(P)) = \min_{Q \in \mathcal{M}_{\epsilon,\mu_k}} D_f(P\|Q)$ is convex in $P$ by [64, Section 3.2.5]. Hence, the maximum of $D_f(P \| Q_{\mu_k}(P))$ occurs when $P$ is one of the extreme points of $\mathcal{P}([k])$, that is, the point masses $\delta_x$. By the same arguments as in (187)-(189), we have $D_f(\delta_x \| Q) = R(0, 1/Q(x))$, and hence
$$R_f(Q_{\mu_k}) = \sup_{P \in \mathcal{P}([k])} D_f(P \| Q_{\mu_k}(P)) \qquad (195)$$
$$= \max_{x \in [k]} D_f(\delta_x \| Q_{\mu_k}(\delta_x)) \qquad (196)$$
$$= \max_{x \in [k]} \inf_{Q \in \mathcal{M}_{\epsilon,\mu_k}} D_f(\delta_x \| Q) \qquad (197)$$
$$= \max_{x \in [k]} \inf_{Q \in \mathcal{M}_{\epsilon,\mu_k}} R(0, 1/Q(x)) \qquad (198)$$
$$= R\left(0,\ \frac{1}{\min_{x \in [k]} \sup_{Q \in \mathcal{M}_{\epsilon,\mu_k}} Q(x)}\right). \qquad (199)$$
Also, by the same arguments as in (181)-(184), we have
$$Q(x) \le B(1/k), \quad \forall Q \in \mathcal{M}_{\epsilon,\mu_k},\ x \in [k]. \qquad (200)$$
Now, we show that $\sup_{Q \in \mathcal{M}_{\epsilon,\mu_k}} Q(x) = B(1/k)$ for any $x \in [k]$. To show this, we prove that there is a distribution $Q \in \mathcal{P}([k])$ such that $Q(x) = B(1/k)$ and $Q(y) = \frac{1 - B(1/k)}{k-1}$ for all $y \in [k]\setminus\{x\}$, and that this $Q$ is contained in $\mathcal{M}_{\epsilon,\mu_k}$. It suffices to show the following:
$$\frac{1}{k} e^{-\epsilon/2} \le B(1/k) \le \min\left(1,\ \frac{1}{k} e^{\epsilon/2}\right), \qquad (201)$$
$$\frac{1}{k} e^{-\epsilon/2} \le \frac{1 - B(1/k)}{k-1} \le \min\left(1,\ \frac{1}{k} e^{\epsilon/2}\right). \qquad (202)$$
We note that whenever $0 \le t \le 1$, we have $B(t) = \min\{e^{\epsilon/2} t,\ e^{-\epsilon/2} t + (1 - e^{-\epsilon/2})\} \ge t$, and $B(t) \le e^{-\epsilon/2} t + (1 - e^{-\epsilon/2}) = 1 - (1-t)e^{-\epsilon/2} \le 1$. From these, we can easily observe that $B(1/k) \le \min\big(1, \frac{1}{k}e^{\epsilon/2}\big)$ and
$$\frac{1}{k} e^{-\epsilon/2} \le \frac{1}{k} \le B(1/k). \qquad (203)$$
Also,
$$\frac{1 - B(1/k)}{k-1} \le \frac{1 - (1/k)}{k-1} = \frac{1}{k} \le \frac{1}{k} e^{\epsilon/2}, \qquad (204)$$
$$\frac{1 - B(1/k)}{k-1} \ge \frac{1 - [e^{-\epsilon/2}(1/k) + (1 - e^{-\epsilon/2})]}{k-1} = \frac{1}{k} e^{-\epsilon/2}. \qquad (205)$$
This shows that $\sup_{Q \in \mathcal{M}_{\epsilon,\mu_k}} Q(x) = B(1/k)$ for any $x \in [k]$. Thus,
$$\min_{x \in [k]} \sup_{Q \in \mathcal{M}_{\epsilon,\mu_k}} Q(x) = B(1/k), \qquad (206)$$
$$R_f(Q_{\mu_k}) = R\left(0,\ \frac{1}{B(1/k)}\right). \qquad (207)$$
This ends the proof that $Q_0 = \mu_k$ minimizes $R_f(Q_{Q_0})$, with $R_f(Q_{\mu_k}) = R(0, 1/B(1/k))$.
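Given the closed forms above, the finite-space comparison of Section 5.1 reduces to a few lines; the sketch below (ours) evaluates the worst-case utility of the proposed mechanism, $R(0, (e^\epsilon+k-1)/e^\epsilon)$, against the baseline's $R(0, 1/B(1/k))$ for the total variation generator.

    import numpy as np

    def R_chord(r1, r2, f, f0):
        """Eq. (174)/(190): chord of f through (r1, f(r1)), (r2, f(r2)) at x = 1."""
        fe = lambda x: f0 if x == 0 else f(x)
        return (1 - r1) / (r2 - r1) * fe(r2) + (r2 - 1) / (r2 - r1) * fe(r1)

    def finite_space_utilities(k, eps, f, f0):
        proposed = R_chord(0.0, (np.exp(eps) + k - 1) / np.exp(eps), f, f0)
        B = min(np.exp(eps / 2) / k,
                np.exp(-eps / 2) / k + 1 - np.exp(-eps / 2))   # B(1/k), eq. (185)
        baseline = R_chord(0.0, 1.0 / B, f, f0)                # eq. (194)
        return proposed, baseline

    # total variation: f(x) = |x - 1| / 2, with f(0) = 1/2
    print(finite_space_utilities(10, 1.0, lambda x: 0.5 * abs(x - 1), 0.5))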
F.2 Explanation for the experiment in 1D Gaussian mixture

The precise description of the set of possible client distributions in our experimental setup is as follows:
$$\mathcal{P} = \left\{P :\ p(x) = \frac{\sum_{i=1}^k \lambda_i N_{\mu_i,1}[x]}{\int_{-4}^4 \sum_{i=1}^k \lambda_i N_{\mu_i,1}[x]\, dx}\,\mathbf{1}_{[-4,4]}(x),\quad k \in [K],\ \lambda_i \ge 0,\ \sum_{i=1}^k \lambda_i = 1,\ |\mu_i| \le 1\right\}. \qquad (208)$$
Each $P_j \in \mathcal{P}$ is generated by choosing $k$, $\lambda_i$, and $\mu_i$ in (208) as follows: (i) first, choose the number of Gaussian components $k$ by sampling $\tilde{k}$ from the Poisson distribution with mean $k_0$ (which is chosen beforehand) and letting $k = \min(\tilde{k}+1, K)$; and (ii) after that, sample each of $\mu_1, \ldots, \mu_k$ independently from the uniform distribution on $[-1, 1]$, and sample $(\lambda_1, \ldots, \lambda_k)$ from the uniform distribution on $\mathcal{P}([k])$.

The implementation of our proposed mechanism is as follows. We can observe that
$$\int_{-4}^4 \sum_{i=1}^k \lambda_i N_{\mu_i,1}[x]\, dx \ge \inf_{\mu \in [-1,1]} \int_{-4}^4 N_{\mu,1}[x]\, dx = \int_{-4}^4 N_{1,1}[x]\, dx = \Phi(3) - \Phi(-5), \qquad (209)$$
where $\Phi$ is the cdf of the 1D standard Gaussian distribution, and for each $x \in [-4, 4]$, we have
$$\sum_{i=1}^k \lambda_i N_{\mu_i,1}[x] \le \sup_{\mu \in [-1,1]} N_{\mu,1}[x] = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{[\max(|x|-1, 0)]^2}{2}\right). \qquad (210)$$
Hence, we have $\mathcal{P} \subseteq \mathcal{P}_{0,1,\tilde{h}} = \mathcal{P}_{0,c_2,h}$, where
$$\tilde{h}(x) = \frac{\exp\left(-[\max(|x|-1, 0)]^2/2\right)}{\sqrt{2\pi}\,(\Phi(3) - \Phi(-5))}\,\mathbf{1}_{[-4,4]}(x) \qquad (211)$$
and $c_2 = \int \tilde{h}(x)\, dx$, $h(x) = \tilde{h}(x)/c_2$. When implementing our proposed mechanism, we use the bisection method to find the constant $r_P$. We predetermine error tolerances $\delta_1, \delta_2 \ge 0$, $\delta_1 < 1$, and we terminate the bisection method for $r_P$ once the integral in (13) lies in the interval $[1-\delta_1, 1+\delta_2]$. As mentioned in Section 4.1, to account for these error tolerances, we actually implement $Q^*_{0,c_2,h,\tilde{\epsilon}}(P)$, with $\tilde{\epsilon} = \epsilon - \log\frac{1+\delta_2}{1-\delta_1}$. In the experiment in the paper, we use $k_0 = 2$ and $\delta_1 = \delta_2 = 10^{-5}$. For the baseline, we use the same hyperparameter setup as in [35, Section 5, Paragraph "Architectures"], except for a slight modification of the reference distribution $Q_0$. In [35], they set the standard Gaussian as the reference distribution, $Q_0 = N(0, 1)$. To account for the truncated domain $[-4, 4]$, we instead set $Q_0$ to be the truncation of the standard Gaussian; that is, $Q_0$ has the pdf
$$q_0(x) = \frac{N_{0,1}[x]}{\int_{-4}^4 N_{0,1}[x]\, dx}\,\mathbf{1}_{[-4,4]}(x). \qquad (212)$$
The baseline has many other hyperparameters not specified in [35], such as the proposal distribution used for the Metropolis-Hastings algorithm [66, 67], initial model parameters for the weak learner, etc. We noticed that the code for [35] is available at https://github.com/karokaram/PrivatedBoostedDensities/tree/master, and hence we made every effort to faithfully reproduce the baseline with exactly the same hyperparameters, including those not mentioned in the paper [35]. However, subtle variations may arise due to differences in the programming languages employed; we used Python, while [35] used Julia.
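For reference, the generation procedure above can be sketched as follows (ours; the cap $K$ on the number of components is a parameter whose paper value we do not restate here, so the default below is only a placeholder):

    import numpy as np

    def sample_client_distribution(k0=2, K=10, rng=None):
        """Draw the parameters of one client distribution P_j as in (208):
        k = min(Poisson(k0) + 1, K), means uniform on [-1, 1], and weights
        uniform on the simplex P([k]).  K=10 is a placeholder default."""
        rng = rng or np.random.default_rng()
        k = min(rng.poisson(k0) + 1, K)
        mus = rng.uniform(-1.0, 1.0, size=k)
        lams = rng.dirichlet(np.ones(k))   # Dirichlet(1,...,1) = uniform on simplex
        return lams, mus

    def pdf_grid(lams, mus, n=4001):
        """Mixture density on a grid over [-4, 4], renormalized per (208)."""
        xs = np.linspace(-4.0, 4.0, n)
        vals = sum(l * np.exp(-(xs - m) ** 2 / 2) / np.sqrt(2 * np.pi)
                   for l, m in zip(lams, mus))
        z = vals.sum() * (xs[1] - xs[0])   # Riemann approximation of the integral
        return xs, vals / z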
F.3 Explanation for Figure 1

For $k \in \mathbb{N}$ and $\sigma > 0$, the Gaussian ring distribution with $k$ modes and component-wise variance $\sigma^2$ is the mixture of $k$ Gaussian distributions in $\mathbb{R}^2$ with equal weights, where each Gaussian distribution has covariance $\sigma^2 I_2$ and equally spaced means on the unit circle, with one of the Gaussian distributions having mean $(1, 0)$. That is, it has the density $\frac{1}{k}\sum_{i=1}^k N_{\mu_i,\sigma^2 I_2}[x]$, with $\mu_i = \big(\cos\frac{2\pi i}{k}, \sin\frac{2\pi i}{k}\big)$. In Figure 1, the left image is the pdf of the Gaussian ring distribution with $k = 3$ and $\sigma^2 = 0.5$. It is used as a client's original distribution. For our proposed mechanism, similar to Section 3.2, we use the setup in which $\mathcal{P}$ consists of Gaussian mixtures, where each Gaussian has mean within the unit ball centered at the origin and covariance $0.5 I_2$; that is, $\mathcal{P} = \{\sum_{i=1}^k \lambda_i N(x_i, \sigma^2 I_2) : k \in \mathbb{N},\ \lambda_i \ge 0,\ \sum_{i=1}^k \lambda_i = 1,\ \|x_i\| \le 1\}$, with $\sigma^2 = 0.5$. We can also observe that $\mathcal{P} \subseteq \mathcal{P}_{0,1,\tilde{h}}$ for
$$\tilde{h}(x) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{[\max(0, \|x\|-1)]^2}{2\sigma^2}\right). \qquad (213)$$
As in Appendix F.2, we have $\mathcal{P}_{0,1,\tilde{h}} = \mathcal{P}_{0,c_2,h}$, where $c_2 = \int \tilde{h}(x)\, dx$ and $h(x) = \tilde{h}(x)/c_2$. Hence, we use $Q^*_{0,c_2,h,\epsilon}$. The implementation of $Q^*_{c_1,c_2,h,\epsilon}(P)$ is the same as described in Appendix F.2, with $\delta = 10^{-5}$. For the baseline, we also use MBDE with the same hyperparameter setup as [35, Section 5, Paragraph "Architectures"], with the (untruncated) 2D standard Gaussian as the reference distribution, $Q_0 = N(0, I_2)$.

G Additional Numerical Results for Finite Space

In this appendix, we present more numerical results for the finite-space case explained in Section 5.1, for other values of $k$. We present the results for $k = 5$, $20$, and $100$ in Figures 5-7. As shown in the figures, the proposed mechanism always has a smaller worst-case $f$-divergence than the baseline, as optimality is proved for the proposed mechanism. However, the performance gap between the two mechanisms decreases as $\epsilon$ becomes smaller and $k$ becomes larger.

[Figure 5: Theoretical worst-case f-divergences of the proposed and baseline mechanisms (with uniform $Q_0$) over finite space ($k = 5$); x-axis: privacy budget $\epsilon$, y-axis: worst-case f-divergence.]

[Figure 6: Theoretical worst-case f-divergences of the proposed and baseline mechanisms (with uniform $Q_0$) over finite space ($k = 20$).]

[Figure 7: Theoretical worst-case f-divergences of the proposed and baseline mechanisms (with uniform $Q_0$) over finite space ($k = 100$).]

H Instructions for Reproducing Results

In this appendix, we provide instructions for reproducing the experiments and figures in the paper. For a detailed document about the code, refer to the provided file README.md. For tasks consuming a large amount of time, we also specify their running times; we do not specify the running times of short tasks consuming less than 5 seconds. All experiments are performed on our simulation PC with the following specifications:
- OS: Ubuntu 22.04.1
- CPU: Intel(R) Core(TM) i9-9900X
- Memory: 64GB

There are a few remarks:
- Although we expect that the codes can run and reproduce our results on sufficiently recent versions of the Python libraries, we provide the information about the anaconda environment used in the experiments in environment.yaml for completeness.
- The provided codes contain some lines that make the figures use TeX fonts. Running these lines requires LaTeX to be installed in the experimental environment. The following lines can be removed to disable the use of TeX:

    matplotlib.use("pgf")
    "pgf.texsystem": "pdflatex",
    "font.family": "serif",
    "text.usetex": True,
    "pgf.rcfonts": False,
    "font.serif": "Computer Modern Roman",

H.1 Instruction for producing Figure 1

Figure 1 can be obtained by running the code plot_GaussRing.py, without any program arguments. We measured the running time for initializing the mechanism and calculating the sampling distribution for the baseline (MBDE [35]) and our proposed mechanism. The measured running times in our environment are as follows (unit: seconds):
- Initializing the mechanism — Baseline: 0.80; Proposed: 1.22
- Calculating the sampling distribution — Baseline: 35.19; Proposed: 664.18

H.2 Instruction for producing Figure 2

Figure 2 can be obtained by running the code visualize_finiteSpace.py, without any program arguments.

H.3 Instruction for producing results for finite space

The results on the theoretical worst-case f-divergence for finite space, Figures 3, 5, 6, and 7, can be obtained by running the code plot_finite.py with a program argument --k specifying the value of k. For example, in the command line, the aforementioned four figures can be generated by the following commands, respectively:

    python plot_finite.py --k 10
    python plot_finite.py --k 5
    python plot_finite.py --k 20
    python plot_finite.py --k 100

H.4 Instruction for producing results for 1D Gaussian mixture

The experiment of the 1D Gaussian mixture in Section 5.2 consists of the following two codes:
1. exp_1DGaussMix.py — This code performs an experiment for a single ϵ. We can provide two program arguments, --eps and --seed, to specify the values of ϵ and the random seed, respectively.
2. plot_1DGaussMix.py — This code generates a plot like Figure 4 from the results of exp_1DGaussMix.py.
The results in the paper, Figure 4, can be obtained as follows:
1. First, run the following commands:

    python exp_1DGaussMix.py --eps 0.1 --seed 1
    python exp_1DGaussMix.py --eps 0.5 --seed 2
    python exp_1DGaussMix.py --eps 1.0 --seed 3
    python exp_1DGaussMix.py --eps 2.0 --seed 4
    python exp_1DGaussMix.py --eps 5.0 --seed 5

These can be run in any order or in parallel. Check that all five result files data_1DGaussMix_eps{eps}.npy corresponding to the five values of ϵ are generated. In our environment, running all five commands in parallel consumes 3h 13m 35s.
2. Then, run the code plot_1DGaussMix.py without any program arguments.

I Limitations

Our main contribution lies in proposing a minimax-optimal mechanism, and we present several experimental results based on synthetic data to support the superiority of our mechanism. However, since we have not conducted experiments based on real datasets, the analysis of performance in real-world scenarios is insufficient. Also, our PUT formulation in the minimax sense provides optimal utility in the worst case, which may result in reduced average utility when prior information is given. Finally, our implementation of the proposed mechanism requires a large amount of running time due to numerical integration, which makes experiments in multidimensional spaces challenging.

J Broader Impacts

Our proposed mechanism can be utilized for privacy protection in the field of generative models, which has recently received significant attention. A major deterrent to the adoption of privacy protection algorithms in real-world scenarios is the potential performance degradation. Our PUT-optimal mechanism, which minimizes the loss of utility given the privacy budget, can alleviate such concerns. We should note that an LDP mechanism provides a certain level of privacy protection but cannot guarantee complete privacy protection without completely sacrificing utility. Additionally, clients typically provide multiple data points through various channels, and when these data are combined, it can lead to greater privacy leakage [68, 69].

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We clearly stated the main contributions and scope of our paper in the abstract and introduction (Section 1).

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We stated the limitations of our work in Appendix I.
The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: All theorems, propositions, and lemmas in the paper include all the necessary assumptions, and are accompanied with either the proofs in the appendices or relevant references. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We provided instructions to reproduce all the figures and experimental results in Appendix H. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. 
If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We plan to make the code attached in the supplementary material publicly open after the review period. Also, the instructions to reproduce the results are fully specified in Appendix H. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. 
If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We presented enough details to implement our proposed mechanism, and we attached the code for implementing both the proposed mechanism and the baseline in the supplementary material. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No] Justification: As the main contribution is theoretically characterizing the worst-case utility, we focused on extracting the worst-case utility in the experiment. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provided the information about the computer resources and running times for the experiments in Appendix H. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. 
The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: Our research conforms with the Neur IPS Code of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: We discussed both the positive and negative societal impacts in Appendix J. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: We do not recognize any risk for misuse. Guidelines: The answer NA means that the paper poses no such risks. 
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: Our code relies only on standard Python libraries such as numpy, scipy, and torch; no other external assets are used.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: As mentioned in Appendix H, we documented the details of our code in the file README.md, as prescribed by the NeurIPS Code and Data Submission Guidelines.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: Our paper involves neither crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: Our paper involves neither crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.