# Private Geometric Median

Mahdi Haghifam (Khoury College of Computer Sciences, Northeastern University; supported by a Khoury College of Computer Sciences Distinguished Postdoctoral Fellowship; m.haghifam@northeastern.edu) · Thomas Steinke (Google DeepMind) · Jonathan Ullman (Khoury College of Computer Sciences, Northeastern University; supported by NSF awards CNS-2232692 and CNS-2247484)

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

**Abstract.** In this paper, we study differentially private (DP) algorithms for computing the geometric median (GM) of a dataset: given $n$ points $x_1, \ldots, x_n$ in $\mathbb{R}^d$, the goal is to find a point $\theta$ that minimizes the sum of the Euclidean distances to these points, i.e., $\sum_{i=1}^{n} \|\theta - x_i\|_2$. Off-the-shelf methods, such as DP-GD, require strong a priori knowledge locating the data within a ball of radius $R$, and the excess risk of the algorithm depends linearly on $R$. In this paper, we ask: can we design an efficient and private algorithm with an excess error guarantee that scales with the (unknown) radius containing the majority of the datapoints? Our main contribution is a pair of polynomial-time DP algorithms for the task of private GM with an excess error guarantee that scales with the effective diameter of the datapoints. Additionally, we propose an inefficient algorithm based on the inverse smooth sensitivity mechanism, which satisfies the more restrictive notion of pure DP. We complement our results with a lower bound and demonstrate the optimality of our polynomial-time algorithms in terms of sample complexity.

## 1 Introduction

Differentially private (DP) convex optimization is a fundamental task where we approximately minimize a data-dependent convex loss function while limiting what can be learned about individual data points. The predominant algorithm for DP convex optimization is DP (stochastic) gradient descent, or DP-(S)GD for short [SCS13; BST14]. Given a dataset $X^{(n)}$ which contains private information and a loss function $F(\theta; X^{(n)})$, DP-(S)GD starts with an initial value $\theta_0 \in \mathbb{R}^d$ and iteratively updates it using
$$\theta_{t+1} = \Pi_\Theta\big(\theta_t - \eta\big(\nabla_\theta F(\theta_t; X^{(n)}) + \xi_t\big)\big),$$
where $\eta > 0$ is the step size, $\xi_t$ is noise to ensure DP, $\Theta \subseteq \mathbb{R}^d$ is a closed convex feasible set, and $\Pi_\Theta$ is the Euclidean projection operator (a minimal code sketch of this update is given below). In the most general setting of Lipschitz convex functions, the excess error depends linearly on the radius of the set $\Theta$, and this linear dependence is necessary in the worst case [BST14]. This linear dependence is problematic because we can think of the diameter of the set $\Theta$ as capturing a measure of the uncertainty we have about the location of the minimizer, and we want our algorithm to perform well even with a high degree of uncertainty. This linear dependence can be improved under certain unrealistically strong assumptions, such as strong convexity, but it is unclear whether we can improve the dependence on the radius under weaker, more natural conditions. In this paper, as a step towards answering this question, we identify a simple optimization task, computing the geometric median, where we can exponentially improve the dependence on the radius.
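As a concrete point of reference for the update rule above, the following is a minimal sketch of one DP-GD step for a generic Lipschitz loss. This is illustrative code, not the paper's implementation; `grad_F`, `sigma`, and the other names are ours, and the noise calibration is left to the caller (one concrete calibration appears in Appendix A.2).

```python
import numpy as np

def project_ball(theta, center, radius):
    """Euclidean projection onto the ball B_d(center, radius)."""
    diff = theta - center
    nrm = np.linalg.norm(diff)
    return theta if nrm <= radius else center + radius * diff / nrm

def dpgd_step(theta, grad_F, eta, sigma, R, rng):
    """One DP-GD update: theta <- Proj(theta - eta * (gradient + Gaussian noise)).

    grad_F: callable returning a (sub)gradient of the loss at theta.
    sigma:  Gaussian noise scale, calibrated to the gradient sensitivity
            and the privacy budget by the caller.
    """
    xi = rng.normal(0.0, sigma, size=theta.shape)  # privacy noise
    return project_ball(theta - eta * (grad_F(theta) + xi),
                        np.zeros_like(theta), R)
```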
| Algorithm | Privacy | Utility $\mathbb{E}[F(\mathcal{A}_n(X^{(n)}); X^{(n)})]$ | Run-time | Samples |
|---|---|---|---|---|
| LocDPGD (Section 3.1) | Approx | $\big(1 + \tilde{O}\big(\tfrac{\sqrt{d}}{n\varepsilon}\big)\big)\,\mathrm{OPT}$ | $n^2 \log(R/r) + n^2 d$ | $\tfrac{\sqrt{d}}{\varepsilon}$ |
| LocDPCuttingPlane (Section 3.2) | Approx | $\big(1 + \tilde{O}\big(\tfrac{\sqrt{d}}{n\varepsilon}\big)\big)\,\mathrm{OPT}$ | $n^2 \log(R/r) + nd^2 + d^{2+\omega}$ | $\tfrac{\sqrt{d}}{\varepsilon}$ |
| SInvS (Section 4) | Pure | $\big(1 + \tilde{O}\big(\tfrac{d\log(R/r)}{n\varepsilon}\big)\big)\,\mathrm{OPT}$ | Exponential | $\tfrac{d\log(R/r)}{\varepsilon}$ |
| Baseline: DP-(S)GD | Approx | $\mathrm{OPT} + \tilde{O}\big(\tfrac{R\sqrt{d}}{\varepsilon}\big)$ | $n^2 d$ | N/A |

Table 1: Summary of our results. Here $\mathrm{OPT} = \min_{\theta \in \mathbb{R}^d} F(\theta; X^{(n)})$ denotes the optimal loss and $\omega$ is the matrix-multiplication exponent. The $n^2 \log(R/r)$ term is the runtime of the warm-up phase, which is shared by LocDPGD and LocDPCuttingPlane. We also assume that $\max_{i \in [n]} |X^{(n)} \cap B_d(x_i, r)| < 3n/4$. (See Section 3.1, Section 3.2, and Section 4 for the general results without this restriction.) For readability, we omit logarithmic factors that depend on $n$ and $d$.

We study private algorithms for computing the geometric median (GM) of a dataset: we are given a set of $n$ data points $X^{(n)} \triangleq (x_1, \ldots, x_n) \in (\mathbb{R}^d)^n$, where $x_i$ represents the private information of one individual, and we are interested in approximately solving the following optimization problem:
$$\theta^\star \in \mathrm{GM}(X^{(n)}) \triangleq \arg\min_{\theta \in \mathbb{R}^d} F(\theta; X^{(n)}), \quad \text{where} \quad F(\theta; X^{(n)}) \triangleq \sum_{i \in [n]} \|\theta - x_i\|_2. \tag{1}$$

The geometric median generalizes the standard one-dimensional median. It is a useful tool for robust estimation and aggregation because it is less sensitive to outliers than the mean of the data; i.e., it is a nontrivial estimator even when 49% of the input data is arbitrarily corrupted. These properties make the GM a popular tool for designing robust versions of distributed optimization methods [CSX17; WLCG20; FGGPS22; AHJSDT22; PKH22; EFGH23], boosting the confidence of weakly concentrated estimators [Min15], clustering [BMM03], etc.

**Baseline for Private GM.** Since the geometric median is the minimizer of a Lipschitz convex loss function, we can privately approximate it using the standard approach of DP-(S)GD. In particular, if we know a priori that all the data points lie in a known ball of radius $R$ (without loss of generality this ball is centered at the origin, i.e., $\|x_i\|_2 \le R$ for every $i \in [n]$), then DP-(S)GD guarantees $(\varepsilon, \delta)$-DP with the following excess error [BST14]:
$$F(\mathrm{DPGD}_n(X^{(n)}); X^{(n)}) - F(\theta^\star; X^{(n)}) = O\!\left(\frac{R\sqrt{d\log(1/\delta)}}{\varepsilon}\right).$$
As discussed at the beginning of this section, this guarantee has a significant drawback: the excess error of the algorithm depends linearly on the radius $R$ of the a priori bound on the data. This bound could be very loose; it does not scale with the data. Can we do better? What quantity should the excess error guarantee scale with? It is known that the GM lies inside the convex hull of the datapoints. However, this convex hull can have a very large diameter due to a small number of outliers, while most of the datapoints live in a ball with a small diameter. A key property of the GM is robustness to outliers, so we want our accuracy guarantee to also be robust to some outliers. Specifically, if 51% of the points lie in a ball, then the geometric median is within a constant multiple of that ball's diameter from it (see Lemma C.6 for a more precise statement). Thus, we aim to design a DP algorithm whose error is proportional to the actual scale of the majority of the data, rather than the a priori worst-case bound. However, the algorithm designer does not have a priori knowledge of the location or diameter of a ball that contains most of the data; the algorithm must discover this information from the data. This prompts the following question:

*Can we design an efficient and private algorithm with an excess error guarantee that scales with the radius that contains the majority of the datapoints?*

Our results provide a positive answer.
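For context, the non-private objective in Equation (1) can be minimized to high accuracy with the classical Weiszfeld fixed-point iteration (nearly-linear-time solvers also exist [CLMPS16]). The paper's private algorithms do not use this routine; the following is only a baseline sketch, and the smoothing constant `eps` is an implementation detail of ours to avoid division by zero.

```python
import numpy as np

def gm_loss(theta, X):
    """F(theta; X) = sum_i ||theta - x_i||_2."""
    return np.linalg.norm(X - theta, axis=1).sum()

def weiszfeld(X, iters=200, eps=1e-12):
    """Non-private geometric median via Weiszfeld's fixed-point iteration."""
    theta = X.mean(axis=0)  # initialize at the centroid
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(X - theta, axis=1), eps)
        w = 1.0 / d
        theta = (w[:, None] * X).sum(axis=0) / w.sum()
    return theta

X = np.random.default_rng(0).normal(size=(100, 5))
theta_star = weiszfeld(X)
print(gm_loss(theta_star, X) <= gm_loss(X.mean(axis=0), X))  # GM beats the mean
```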
### 1.1 Contributions

Our main contribution is a pair of polynomial-time DP algorithms for approximating the geometric median with an excess error guarantee that scales with the effective diameter of the datapoints. Moreover, the sample complexity and the runtime of our algorithms depend only logarithmically on the a priori bound $R$. Both of our algorithms achieve the same excess error bounds up to logarithmic factors, but have incomparable running times. We also give a simple numerical experiment on synthetic data as a proof of concept that our algorithm improves over DP-(S)GD, as predicted by the theory. In terms of optimality, we show that our proposed algorithms are optimal in terms of sample complexity. Furthermore, we propose an algorithm based on the inverse smooth sensitivity mechanism for the private geometric median problem that satisfies the more restrictive notion of pure DP. Below, we give an overview of these algorithms and the techniques involved.

**Polynomial-Time Algorithms.** Both of our algorithms for the private geometric median problem are two-phase algorithms: in the first phase, which we refer to as warm-up, the algorithm shrinks the feasible set to a ball whose diameter is proportional to what we call the quantile radius, in time that depends logarithmically on $R$. The second phase, which we call fine-tuning, uses the output of the warm-up algorithm to further improve the error. First, we formalize the notion of the quantile radius as the radius of the smallest ball containing sufficiently many points.

**Definition 1.1 (Quantile Radius).** Fix a dataset $X^{(n)} = (x_1, \ldots, x_n) \in (\mathbb{R}^d)^n$ and $\theta \in \mathbb{R}^d$. For every $\gamma \in [0, 1]$, define
$$\Delta_{\gamma n}(\theta) \triangleq \min\big\{r \ge 0 : |\{i \in [n] : \|x_i - \theta\| \le r\}| \ge \gamma n\big\}.$$

To motivate the idea behind our algorithms, assume the algorithm designer knew a ball with center $\theta_0$ and radius $\hat{\Delta}$ such that $\|\theta_0 - \theta^\star\| \le O(\hat{\Delta})$ and $\hat{\Delta} = \tilde{O}(\Delta_{4n/5}(\theta^\star))$. Then, running DP-(S)GD over this ball would give excess error $\tilde{O}(\Delta_{4n/5}(\theta^\star)\sqrt{d}/\varepsilon)$. This guarantee is particularly interesting because the excess error scales with the quantile radius and not with the largest possible norm of any point. Also, by the definition of the quantile radius and the geometric median loss function, we have $F(\theta^\star; X^{(n)}) \ge (1-\gamma)\, n\, \Delta_{\gamma n}(\theta^\star)$: at least $(1-\gamma)n$ points lie at distance at least $\Delta_{\gamma n}(\theta^\star)$ from $\theta^\star$. This inequality shows that an algorithm whose excess error depends on $\Delta_{\gamma n}(\theta^\star)$ has a multiplicative guarantee rather than the standard additive guarantee of DP-(S)GD. This type of guarantee is particularly desirable for the geometric median, since an algorithm with a multiplicative guarantee is scale free and adapts to the niceness of the dataset. However, since we do not know such a pair $\theta_0$ and $\hat{\Delta}$ a priori, the objective of the warm-up algorithm is to privately find these quantities.

The warm-up algorithm is based on the following structural result: given a point $\theta$ that satisfies $\|\theta - \theta^\star\| \gtrsim \Delta_{3n/4}(\theta^\star)$, we have $F(\theta; X^{(n)}) - F(\theta^\star; X^{(n)}) \gtrsim \|\theta - \theta^\star\|$. (See Lemma 2.6 for a formal statement.) This result implies that, even though $F(\theta; X^{(n)})$ is not a strongly convex function, we have a growth condition such that the excess error increases with the distance to the global minimizer, at least when the excess error is large enough. (In contrast, strong convexity would imply quadratic growth $F(\theta; X^{(n)}) - F(\theta^\star; X^{(n)}) \gtrsim \|\theta - \theta^\star\|^2$, rather than linear growth.) Intuitively, this growth condition allows us to take larger step sizes and make progress faster, consuming less of the privacy budget.
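Definition 1.1 is easy to evaluate non-privately, which is useful for building intuition about how it ignores outliers. A minimal sketch (the names are ours):

```python
import numpy as np

def quantile_radius(X, theta, gamma):
    """Delta_{gamma n}(theta): radius of the smallest ball centered at theta
    containing at least ceil(gamma * n) of the points of X."""
    n = X.shape[0]
    dists = np.sort(np.linalg.norm(X - theta, axis=1))
    m = int(np.ceil(gamma * n))
    return dists[m - 1]  # m-th smallest distance

# Example: with one far outlier, the 4n/5-quantile radius stays small.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(99, 3)), [[1e6, 0.0, 0.0]]])
print(quantile_radius(X, X[:99].mean(axis=0), 0.8))  # outlier-robust scale
```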
This growth condition, however, only holds for $\theta$ that is sufficiently far (on the order of $\Delta_{3n/4}(\theta^\star)$) from the minimizer, which is a data-dependent property, so we first need a private algorithm to estimate $\Delta_{3n/4}(\theta^\star)$ in order to make use of it. In Section 2.1, we develop an efficient algorithm, RadiusFinder, for this task, which is inspired by [NSV16]. Our procedure assumes that we know some potentially very small lower bound $r \le \Delta_{3n/4}(\theta^\star)$, which is necessary by the impossibility results of [BNSV15]. Since the sample complexity of this procedure depends only on $\log(1/r)$, we can choose this parameter to be very small. In Section 2.1, we also show how to eliminate this assumption at the cost of a small additive error. With high probability, RadiusFinder (see Theorem 2.4) outputs $\hat{\Delta}$ such that $\Delta_{3n/4}(\theta^\star) \lesssim \hat{\Delta} \le O(\Delta_{4n/5}(\theta^\star))$.

Having obtained $\hat{\Delta}$, the second step of the warm-up algorithm is finding a good initialization point. In Section 2.2, we propose Localization, based on DP-GD with geometrically decaying step sizes, to perform this task. Due to the growth condition, DP-GD makes fast progress towards some point that is within $O(\Delta_{4n/5}(\theta^\star))$ of the optimizer: in $\log(R)$ iterations, with high probability, it outputs $\theta_0$ such that $\theta^\star$ lies in the ball of radius $O(\hat{\Delta}) = \tilde{O}(\Delta_{4n/5}(\theta^\star))$ centered at $\theta_0$.

**DP Cutting Plane Method for Private GM.** The main drawback of using DP-SGD for the fine-tuning stage is that its run-time can be large when $n \gg d$. To address this, we design a second fine-tuning algorithm, LocDPCuttingPlane, based on a private variant of the cutting plane method, which runs faster when $n$ is large. There are two challenges in the analysis. First, because we use noisy gradients, we cannot argue that the optimal point always lies in the intersection of the cutting planes, which is a crucial part of the standard analysis. Second, the cutting plane method is not a descent method, in the sense that the loss is not decreasing over the iterations, so we need to privately select an iterate with small loss. The difficulty in developing a private variant here is that the loss $F(\theta; X^{(n)})$ has worst-case sensitivity proportional to $R$, so running the exponential mechanism in the natural way incurs loss proportional to $R$. We address both of these challenges and develop an algorithm whose excess error is proportional to $\Delta_{4n/5}(\theta^\star)$.

**Pure DP Algorithm for the Private Geometric Median Problem.** In Section 4, we propose a pure $(\varepsilon, 0)$-DP algorithm for the geometric median problem, albeit a computationally inefficient one. Our algorithm is based on the inverse smooth sensitivity mechanism of [AD20]. At a high level, the algorithm outputs $\theta \in \mathbb{R}^d$ with probability proportional to $\exp(-\varepsilon\,\mathrm{len}(X^{(n)}, \theta)/2)$, where $\mathrm{len}(X^{(n)}, \theta)$ is the minimum number of data points of $X^{(n)}$ that need to be modified to obtain a dataset $\tilde{X}^{(n)}$ whose geometric median equals $\theta$. Our analysis shows that the proposed mechanism outputs $\hat{\theta} = \mathrm{GM}(\tilde{X}^{(n)})$ such that $X^{(n)}$ and $\tilde{X}^{(n)}$ differ in at most $k = O(d\log(R)/\varepsilon)$ datapoints with high probability. Then, by a careful sensitivity analysis, we show that $\|\hat{\theta} - \theta^\star\|$ can be upper bounded in terms of $F(\theta^\star; X^{(n)})$. Using this result, we provide an algorithm with a multiplicative guarantee. Moreover, we show $\|\hat{\theta} - \theta^\star\|$ is upper bounded by $O(\Delta_{\gamma n}(\theta^\star))$ for some $\gamma \in (1/2, 1]$.

**Lower Bound on the Sample Complexity.** We show that every $(\varepsilon, \delta)$-DP algorithm requires $\Omega(\sqrt{d}/\varepsilon)$ samples to satisfy $\mathbb{E}_{\hat{\theta} \sim \mathcal{A}_n(X^{(n)})}[F(\hat{\theta}; X^{(n)})] \le (1 + \alpha) \min_{\theta \in \mathbb{R}^d} F(\theta; X^{(n)})$ for a constant $\alpha$.
This result shows that the sample complexity of our polynomial-time algorithms is nearly optimal. A summary of the results is provided in Table 1, comparing the proposed algorithms in terms of privacy, utility, runtime, and sample complexity. As discussed earlier, algorithms with error adaptive to the quantile radius can achieve a nearly multiplicative guarantee. The utility column in Table 1 compares the algorithms based on the achievable $\alpha_{\mathrm{mul}}$ and $\alpha_{\mathrm{add}}$ in $F(\mathcal{A}_n(X^{(n)}); X^{(n)}) \le (1 + \alpha_{\mathrm{mul}})\, F(\theta^\star; X^{(n)}) + \alpha_{\mathrm{add}}$ with high probability.

### 1.2 Related Work

DP convex optimization is a well-studied problem [CMS11; KST12; BST14; ACGMMTZ16; STU17; FKT20]. There has been significant interest in developing new algorithms that offer improved guarantees compared to DP-(S)GD for specific problem classes or by leveraging additional information. For instance, [LUZ20; SSTT21; ABGMU22; BMS22] demonstrate that for linear models the dependence of the excess error on the dimension can be improved, [GHST24; ABL23] study the impact of second-order information on convergence, and [KDRT21; AGMRSSSTT22; GHNOSTTW23] explore the impact of public data. The current paper addresses a drawback of DP-(S)GD, namely the linear dependence of the excess error on the distance from the initializer to the optimal point in non-strongly-convex settings. Another line of work related to our warm-up strategy is private averaging [NSV16; NS18; CKMST21; TCKMS22]. The advantage of the algorithm proposed in this work is its simplicity while being optimal in terms of sample complexity: we exploit a structural property of the geometric median and show that running DP-GD with geometrically decaying stepsizes can yield a suitable initialization point without the need for preprocessing steps such as filtering [CKMST21; TCKMS22], coordinate-wise discretization [NSV16], hashing [NS18], etc. The proposed quantile radius can be seen as a robust version of the notion of radius proposed in [BHI02]. In one dimension (i.e., $d = 1$), private versions of the median are well studied [DNPR10; BNS13; BNSV15; DNRR15; BDRS18; ALMM19; KLMNS20; ASSU23; CLNSS23]. In particular, these works improve the dependence on the a priori bound $R$ to $\log^\ast R$, rather than the $\log R$ in our results.

### 1.3 Notation

Let $d \in \mathbb{N}$. For a vector $x \in \mathbb{R}^d$, $\|x\|$ denotes the $\ell_2$ norm of $x$. We use the following notation for the ball of radius $R$: $B_d(a, R) = \{x \in \mathbb{R}^d : \|x - a\| \le R\}$. Also, $B^\circ_d(a, R)$ denotes the open ball $\{x \in \mathbb{R}^d : \|x - a\| < R\}$. We abbreviate $B_d(0, R) = B_d(R)$ and, similarly, $B^\circ_d(0, R) = B^\circ_d(R)$. Let $\langle \cdot, \cdot \rangle$ denote the standard inner product in $\mathbb{R}^d$. For a convex and closed subset $\Theta \subseteq \mathbb{R}^d$, let $\Pi_\Theta : \mathbb{R}^d \to \Theta$ be the Euclidean projection operator, given by $\Pi_\Theta(x) = \arg\min_{y \in \Theta} \|y - x\|_2$. For a (measurable) space $\mathcal{R}$, $\mathcal{M}_1(\mathcal{R})$ denotes the set of all probability measures on $\mathcal{R}$. Let $\mathcal{Z}$ be the data space and let $\Theta \subseteq \mathbb{R}^d$ be the parameter space. Let $f : \Theta \times \mathcal{Z} \to \mathbb{R}$ be a loss function. We say $f$ is $L$-Lipschitz iff there exists $L \in \mathbb{R}$ such that $\forall z \in \mathcal{Z},\ \forall w, v \in \Theta : |f(w, z) - f(v, z)| \le L\,\|w - v\|$.

### 1.4 Notions of DP

**Definition 1.2.** Let $\varepsilon > 0$ and $\delta \in [0, 1)$. A randomized mechanism $\mathcal{A}_n : \mathcal{Z}^n \to \mathcal{M}_1(\Theta)$ is $(\varepsilon, \delta)$-DP iff, for every pair of neighbouring datasets (i.e., under replacement) $X \in \mathcal{Z}^n$ and $X' \in \mathcal{Z}^n$, and for every measurable subset $M \subseteq \Theta$, it holds that $\Pr_{\theta \sim \mathcal{A}_n(X)}(\theta \in M) \le e^\varepsilon\, \Pr_{\theta \sim \mathcal{A}_n(X')}(\theta \in M) + \delta$.

For some of our privacy analyses, we use concentrated differential privacy [DR16; BS16], as it provides a simpler composition theorem: the privacy parameter $\rho$ adds up under composition.

**Definition 1.3 ([BS16, Def. 1.1]).** A randomized mechanism $\mathcal{A} : \mathcal{Z}^n \to \mathcal{M}_1(\mathcal{R})$ is $\rho$-zCDP iff, for every pair of neighbouring datasets (i.e., under replacement) $X \in \mathcal{Z}^n$ and $X' \in \mathcal{Z}^n$, and for every $\alpha \in (1, \infty)$, it holds that $D_\alpha(\mathcal{A}_n(X) \,\|\, \mathcal{A}_n(X')) \le \rho\alpha$, where $D_\alpha(\mathcal{A}_n(X) \,\|\, \mathcal{A}_n(X'))$ is the $\alpha$-Rényi divergence between $\mathcal{A}_n(X)$ and $\mathcal{A}_n(X')$.
We should think of $\rho \approx \varepsilon^2$: to attain $(\varepsilon, \delta)$-DP, it suffices to set $\rho = \frac{\varepsilon^2}{4\log(1/\delta) + 4\varepsilon}$ [BS16, Lem. 3.5].

**Lemma 1.4 ([BS16, Prop. 1.3]).** Assume we have a randomized mechanism $\mathcal{A} : \mathcal{Z} \to \mathcal{M}_1(\mathcal{R})$ that satisfies $\rho$-zCDP. Then, for every $\delta > 0$, $\mathcal{A}$ is $(\rho + 2\sqrt{\rho\log(1/\delta)}, \delta)$-DP.

## 2 Private Localization

In this section, we present the proposed algorithm for the warm-up stage; it has two steps: private estimation of the quantile radius, and private localization.

### 2.1 Step 1: Private Estimation of the Quantile Radius

Algorithm 1 describes our private algorithm RadiusFinder for quantile radius estimation.

**Algorithm 1** RadiusFinder$_n$
1: **Inputs:** dataset $X^{(n)} \in (B_d(R))^n$, fraction $\gamma \in (1/2, 1]$, privacy budget $\rho$-zCDP, failure probability $\beta$, discretization error $0 < r < R$.
2: $m = \lceil \gamma n \rceil$.
3: For every $\nu \ge 0$ and $i \in [n]$, let $N_i(\nu) \triangleq |X^{(n)} \cap B_d(x_i, \nu)|$.  ▷ number of datapoints within distance $\nu$ of $x_i$
4: For every $\nu \ge 0$, define $N(\nu) \triangleq \frac{1}{m}\max_{\text{distinct } \{i_1, \ldots, i_m\} \subseteq [n]} \{N_{i_1}(\nu) + \cdots + N_{i_m}(\nu)\}$.
5: $\mathrm{Grid} = \{r, 2r, 4r, \ldots, 2^{\lceil \log(2R/r) \rceil} r\}$.
6: $\mathrm{Queries} = \{N(\nu) : \nu \in \mathrm{Grid}\}$.
7: $\hat{i} = \mathrm{AboveThreshold}\Big(\mathrm{Queries},\, \rho,\, m + \tfrac{18}{\sqrt{2\rho}}\log\tfrac{2(\lceil\log(2R/r)\rceil + 1)}{\beta}\Big)$  ▷ Algorithm 7
8: Output $\hat{\Delta} = 2^{\hat{i}} r$ if $\hat{i} \ne \mathrm{Fail}$; else output Fail.

**Remark 2.1.** The runtime of RadiusFinder is $\Theta((n^2 + n\log(n))\log(R/r))$: first, we compute the pairwise distances, which takes $n^2$ time. Then, for a fixed $\nu$, we can compute the counts $\{N_i(\nu)\}_{i \in [n]}$ from the pairwise distances in time $\Theta(n^2)$; to compute $N(\nu)$, we sort $\{N_i(\nu)\}_{i \in [n]}$ in $\Theta(n\log(n))$ time and pick the top $m$. Finally, we repeat this for each $\nu \in \{r, 2r, \ldots, 2^{\lceil \log(2R/r) \rceil} r\}$.

Notice that Algorithm 1 uses the datapoints themselves as centers when counting the number of datapoints within a given distance. The privacy proof of Algorithm 1 is based on the following lemma.

**Lemma 2.2.** Fix $n \in \mathbb{N}$. For every dataset $X^{(n)}$, every $1/2 \le \gamma \le 1$, and every fixed $\nu$, the query $N(\nu) \triangleq \frac{1}{m}\max_{\{i_1, \ldots, i_m\} \subseteq [n]}\{N_{i_1}(\nu) + \cdots + N_{i_m}(\nu)\}$ has sensitivity upper-bounded by $3$, where $m = \lceil\gamma n\rceil$ and $N_i(\nu) \triangleq |X^{(n)} \cap B_d(x_i, \nu)|$. Here $B_d(x, \nu) := \{y \in \mathbb{R}^d : \|y - x\| \le \nu\}$.

The objective of Algorithm 1 is to privately approximate $\Delta_{\gamma n}(\theta^\star)$. Nonetheless, Algorithm 1 relies on computing the pairwise distances between datapoints. The following lemma elucidates why these pairwise distances serve as an effective proxy for $\Delta_{\gamma n}(\theta^\star)$.

**Lemma 2.3.** Fix $n \in \mathbb{N}$, $1 \le m \le n$, $\gamma_1, \gamma_2 \in (1/2, 1]$ such that $\gamma_2 \ge \gamma_1$, and a dataset $X^{(n)}$. For every $\nu \ge 0$, define $N(\nu) \triangleq \frac{1}{m}\max_{\{i_1, \ldots, i_m\} \subseteq [n]}\{N_{i_1}(\nu) + \cdots + N_{i_m}(\nu)\}$, where $N_i(\nu) \triangleq |X^{(n)} \cap B_d(x_i, \nu)|$. Let $\theta^\star = \mathrm{GM}(X^{(n)})$. For every $\hat{\nu}$ such that $N(\hat{\nu}) \ge \gamma_1 n$ and $N(\hat{\nu}/2) < \gamma_2 n$, we have
$$\frac{2\gamma_1 - 1}{4\gamma_1 - 1}\,\Delta_{\gamma_1 n}(\theta^\star) \;\le\; \hat{\nu} \;\le\; 4\,\Delta_{\gamma_2 n}(\theta^\star).$$

Using these two lemmas, in the next theorem we present the privacy and utility guarantees of Algorithm 1. As we are interested in finding the smallest such radius, we use the standard AboveThreshold mechanism from [DNRRV09; DR+14] as a subroutine in Algorithm 1. The algorithmic description of AboveThreshold is provided in Appendix B for completeness.
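Non-privately, the query $N(\nu)$ from Lemmas 2.2 and 2.3 can be evaluated exactly along the lines of Remark 2.1. The following sketch (our naming, illustrative only) makes the $\Theta(n^2)$ recipe concrete; Algorithm 1 then privatizes the scan over the geometric grid with AboveThreshold rather than releasing these values directly.

```python
import numpy as np

def query_N(X, nu, gamma):
    """N(nu) = (1/m) * max over m distinct centers of N_{i_1}(nu) + ... + N_{i_m}(nu),
    where N_i(nu) = #{points of X within distance nu of x_i} and m = ceil(gamma*n)."""
    n = X.shape[0]
    m = int(np.ceil(gamma * n))
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # n x n
    counts = (pairwise <= nu).sum(axis=1)   # N_i(nu) for each center i (includes x_i)
    top_m = np.sort(counts)[-m:]            # the m largest counts
    return top_m.sum() / m

# Non-private scan over a geometric grid {r, 2r, 4r, ...}, as in Algorithm 1.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
for nu in [0.5, 1.0, 2.0, 4.0]:
    print(nu, query_N(X, nu, gamma=3 / 4))
```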
**Theorem 2.4.** Let RadiusFinder$_n$ denote Algorithm 1. Fix $d \in \mathbb{N}$, $R > 0$, $r > 0$, $\beta \in (0, 1]$, and $\rho > 0$. Then, for every $n \in \mathbb{N}$ and every dataset $X^{(n)} \in (B_d(R))^n$, the output of RadiusFinder$_n$ satisfies $\rho$-zCDP. Also, the output $\hat{\Delta}$ of RadiusFinder$_n$ satisfies the following utility guarantees:

1. Given $n > \frac{18}{(1-\gamma)\sqrt{2\rho}}\log(4/\beta)$, we have $\Pr\Big[\frac{2\gamma - 1}{4\gamma - 1}\,\Delta_{\gamma n}(\theta^\star) \le \hat{\Delta}\Big] \ge 1 - \beta$.
2. Assume that the data points satisfy $N(r) < m$. Let $\tilde{\gamma} \triangleq \min\big\{\gamma + \frac{36}{n\sqrt{2\rho}}\log\big(2(\lceil\log(2R/r)\rceil + 1)/\beta\big),\, 1\big\}$. Then, given $n > \frac{18}{(1-\gamma)\sqrt{2\rho}}\log(4/\beta)$, we have $\Pr\Big[\frac{2\gamma - 1}{4\gamma - 1}\,\Delta_{\gamma n}(\theta^\star) \le \hat{\Delta} \le 4\,\Delta_{\tilde{\gamma} n}(\theta^\star)\Big] \ge 1 - \frac{5\beta}{2}$.
3. Let $\tilde{\gamma} \triangleq \min\big\{\gamma + \frac{36}{n\sqrt{2\rho}}\log\big(2(\lceil\log(2R/r)\rceil + 1)/\beta\big),\, 1\big\}$. Given $n > \frac{18}{(1-\gamma)\sqrt{2\rho}}\log(4/\beta)$, we have
$$\Pr\Big[\tfrac{2\gamma - 1}{4\gamma - 1}\,\Delta_{\gamma n}(\theta^\star) \le \hat{\Delta}\ \text{and}\ \big\{\hat{\Delta} \le 4\,\Delta_{\tilde{\gamma} n}(\theta^\star)\ \text{or}\ \hat{\Delta} = r\big\}\Big] \ge 1 - 2\beta.$$

**Remark 2.5.** A sufficient condition for $N(r) < m$ in Item 2 is that $\max_{i \in [n]} |X^{(n)} \cap B_d(x_i, r)| < m = \lceil\gamma n\rceil$. Intuitively, this means that no data point should have a significant portion of the other data points within a ball of radius $r$ centered on it.

### 2.2 Step 2: Fast Localization

In the second step of the warm-up phase, we develop a fast algorithm for finding a good initialization point using the private estimate of the quantile radius. The main structural result that we use for the algorithm design is stated in the next lemma.

**Lemma 2.6.** Fix $n \in \mathbb{N}$, $X^{(n)} \in (\mathbb{R}^d)^n$, and $\theta_1, \theta_0 \in \mathbb{R}^d$. For every $\gamma \in [0, 1]$, define $\Delta_{\gamma n}(\theta_0) \triangleq \min\{r \ge 0 : |\{i \in [n] : \|x_i - \theta_0\| \le r\}| \ge \gamma n\}$. Assume there exists $\zeta \ge 0$ such that $F(\theta_1; X^{(n)}) - F(\theta_0; X^{(n)}) \le \zeta n$. Then, for every $\gamma \in (1/2, 1]$, we have
$$(2\gamma - 1)\,\|\theta_1 - \theta_0\| \le 2\gamma\,\Delta_{\gamma n}(\theta_0) + \zeta.$$

To gain some intuition behind Lemma 2.6, let us instantiate $\theta_0 = \theta^\star$. The lemma then implies that for a $\theta \in \mathbb{R}^d$ such that $\|\theta - \theta^\star\| \gtrsim \Delta_{\gamma n}(\theta^\star)$, the loss function of the geometric median satisfies $F(\theta; X^{(n)}) - F(\theta^\star; X^{(n)}) \gtrsim \|\theta - \theta^\star\|$. Using this result, we propose Algorithm 2 for finding a good initialization.

**Algorithm 2** Localization$_n$
1: **Inputs:** dataset $X^{(n)} \in (B_d(R))^n$, privacy parameter $\rho$-zCDP, discretization error $r$, failure probability $\beta$.
2: $\gamma = 3/4$
3: $\hat{\Delta} = \mathrm{RadiusFinder}_n(X^{(n)}, \gamma, \rho/2, r)$  ▷ Algorithm 1
4: **if** $\hat{\Delta} = \mathrm{Fail}$ **then**
5:  Output Fail and halt.
6: $k_{wu} = \lceil \frac{1}{\log(2)}\log(R/\hat{\Delta}) \rceil$  ▷ $k_{wu} \le \lceil \frac{1}{\log(2)}\log(R/r) \rceil$ with probability one
7: $\theta_0 = 0 \in \mathbb{R}^d$, $T_{wu} = 500$, $\mathrm{rad}_0 = R$
8: **for** $t \in \{0, \ldots, k_{wu} - 1\}$ **do**
9:  $\Theta_t = \{\theta \in B_d(R) : \|\theta - \theta_t\| \le \mathrm{rad}_t\}$
10:  $\eta_t = \frac{\mathrm{rad}_t}{T_{wu}}\sqrt{\frac{\rho n^2}{24\, k_{wu}\, d}}$
11:  $\theta_{t+1} = \mathrm{DPGD}\big(\theta_t, X^{(n)}, \frac{\rho}{2k_{wu}}, \Theta_t, \eta_t, T_{wu}\big)$  ▷ Algorithm 6
12:  $\mathrm{rad}_{t+1} = \frac{1}{2}\mathrm{rad}_t + 12\hat{\Delta}$
13: Output $\theta_{k_{wu}}$ and $\hat{\Delta}$.

The next theorem states the privacy and utility guarantees of Algorithm 2.

**Theorem 2.7.** Let Localization$_n$ denote Algorithm 2. Fix $d \in \mathbb{N}$, $R > 0$, $r > 0$, $\rho > 0$, and $\beta \in (0, 1)$. Then, for every dataset $X^{(n)} \in (B_d(R))^n$, the output of Localization$_n$ satisfies $\rho$-zCDP. Moreover, let $(\hat{\theta}, \hat{\Delta}) = \mathrm{Localization}_n(X^{(n)}, \rho, r, \beta)$ and define the random set $\Theta_{loc} = \{\theta \in B_d(R) : \|\theta - \hat{\theta}\| \le 25\hat{\Delta}\}$. Then, given
$$n = \Omega\!\left(\sqrt{\frac{d\log(R/r)}{\rho}}\,\log\log(R/r) + \sqrt{\frac{1}{\rho}}\,\log(R/r)\right),$$
we have
$$\Pr\Big[\theta^\star \in \Theta_{loc}\ \text{and}\ \Delta_{0.75n}(\theta^\star) \le 4\hat{\Delta}\ \text{and}\ \big\{\hat{\Delta} \le 4\,\Delta_{0.8n}(\theta^\star)\ \text{or}\ \hat{\Delta} = r\big\}\Big] \ge 1 - 2\beta.$$
Also, assuming the datapoints satisfy $\max_{i \in [n]} |X^{(n)} \cap B_d(x_i, r)| < 3n/4$, we have
$$\Pr\Big[\theta^\star \in \Theta_{loc}\ \text{and}\ \Delta_{0.75n}(\theta^\star) \le 4\hat{\Delta}\ \text{and}\ \hat{\Delta} \le 4\,\Delta_{0.8n}(\theta^\star)\Big] \ge 1 - 2\beta.$$
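To see why Algorithm 2 needs only logarithmically many rounds, note that the recursion $\mathrm{rad}_{t+1} = \mathrm{rad}_t/2 + 12\hat{\Delta}$ in Line 12 contracts toward its fixed point $24\hat{\Delta}$, which is consistent with the radius $25\hat{\Delta}$ of $\Theta_{loc}$ in Theorem 2.7. A small sketch of just this schedule (illustrative, not the private algorithm):

```python
import numpy as np

def warmup_radii(R, Delta_hat):
    """Radius schedule of Algorithm 2: rad_{t+1} = rad_t/2 + 12*Delta_hat.
    After k_wu ~ log2(R / Delta_hat) rounds, the radius is O(Delta_hat)."""
    k_wu = int(np.ceil(np.log2(R / Delta_hat)))
    rads = [float(R)]
    for _ in range(k_wu):
        rads.append(rads[-1] / 2 + 12 * Delta_hat)
    return rads

rads = warmup_radii(R=1e9, Delta_hat=1.0)
print(len(rads) - 1, rads[-1])  # ~30 rounds; final radius < 25 = 25*Delta_hat
```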
## 3 Private Fine-tuning

In Section 2, we developed an algorithm for the warm-up stage. The output of the warm-up stage is $\theta_0$ and a radius $\hat{\Delta}$ such that $\|\theta_0 - \theta^\star\| \le O(\hat{\Delta})$ and $\hat{\Delta} = \tilde{O}(\Delta_{4n/5}(\theta^\star))$, as formalized in Theorem 2.7. In this section, we build upon the output of the warm-up algorithm to develop two polynomial-time algorithms for the fine-tuning stage.

### 3.1 Fine-tuning Using DPGD

Our first algorithm is based on DP-GD [BST14]. The main ideas behind Algorithm 3 are as follows: 1) by the utility guarantee of the warm-up phase in Theorem 2.7, the distance between the initialization and $\theta^\star$ depends only on $\hat{\Delta}$, i.e., it does not depend on $R$; 2) by the definition of the quantile radius in Definition 1.1 and Equation (1), we have $F(\theta^\star; X^{(n)}) \ge (1-\gamma)\, n\, \Delta_{\gamma n}(\theta^\star)$; 3) in the case that the data satisfies some regularity conditions, we have $\hat{\Delta} \le 4\,\Delta_{0.8n}(\theta^\star)$ from Theorem 2.7.

**Algorithm 3** LocDPGD$_n$
1: **Inputs:** dataset $X^{(n)} \in (B_d(R))^n$, privacy parameter $\rho$-zCDP, discretization error $r$, failure probability $\beta$.
2: $(\theta_0, \hat{\Delta}) = \mathrm{Localization}_n(X^{(n)}, \rho/2, r, \beta)$  ▷ Algorithm 2
3: $\Theta_0 = \{\theta \in B_d(R) : \|\theta - \theta_0\| \le 25\hat{\Delta}\}$
4: $\eta_{ft} = 50\hat{\Delta}\sqrt{\frac{d}{6\rho n^2}}$ and $T_{ft} = \frac{n^2\rho}{256 d}$
5: $\hat{\theta} = \mathrm{DPGD}\big(\theta_0, X^{(n)}, \rho/2, \Theta_0, \eta_{ft}, T_{ft}\big)$  ▷ Algorithm 6
6: Output $\hat{\theta}$

The next theorem summarizes the utility and privacy guarantees of this algorithm.

**Theorem 3.1.** Let LocDPGD$_n$ denote Algorithm 3. For every $d \in \mathbb{N}$, $R > 0$, $r > 0$, $\rho > 0$, and $\beta \in (0, 1]$, $\mathcal{A} = \{\mathrm{LocDPGD}_n\}_{n \ge 1}$ satisfies the following: for every $n \in \mathbb{N}$ and every dataset $X^{(n)} \in (B_d(R))^n$, the output of LocDPGD$_n$ satisfies $\rho$-zCDP. Also, given
$$n = \Omega\!\left(\sqrt{\frac{d\log(R/r)}{\rho}}\,\log\log(R/r) + \sqrt{\frac{1}{\rho}}\,\log(R/r)\right),$$
with $\alpha = \tilde{O}\big(\frac{\sqrt{d}}{n\sqrt{\rho}}\big)$ we have
$$\Pr\Big[F(\hat{\theta}; X^{(n)}) \le (1 + \alpha)\, F(\theta^\star; X^{(n)}) + \alpha n r\Big] \ge 1 - 3\beta.$$
Moreover, given that the datapoints satisfy $\max_{i \in [n]} |X^{(n)} \cap B_d(x_i, r)| < 3n/4$, we have
$$\Pr\Big[F(\hat{\theta}; X^{(n)}) \le (1 + \alpha)\, F(\theta^\star; X^{(n)})\Big] \ge 1 - 3\beta,$$
where $\hat{\theta}$ is the output of Algorithm 3.

### 3.2 Fine-tuning Using the Noisy Cutting Plane Method

In this section, we present the second fine-tuning algorithm, LocDPCuttingPlane (Algorithm 4). This algorithm is based on the well-known cutting plane method [New65; Lev65; Nes98].

**Algorithm 4** LocDPCuttingPlane$_n$
1: **Inputs:** dataset $X^{(n)} \in (B_d(R))^n$, privacy parameters $(\varepsilon, \delta)$-DP, discretization error $r$, failure probability $\beta$.
2: $\rho = \frac{\varepsilon^2}{16\log(2/\delta) + 8\varepsilon}$
3: $(\theta_0, \hat{\Delta}) = \mathrm{Localization}_n\big(X^{(n)}, \rho/2, r, \min\{\beta, \delta/2\}\big)$  ▷ Algorithm 2
4: $\Theta_0 = \{\theta \in B_d(R) : \|\theta - \theta_0\| \le 25\hat{\Delta}\}$
5: $k_{ft} = \Theta\big(\frac{d}{\tau}\log\big(\frac{n\tau\sqrt{\rho}}{d}\big)\big)$  ▷ see Assumption 1 for the definition of $\tau$
6: **for** $t \in \{0, \ldots, k_{ft} - 1\}$ **do**
7:  $\theta_t = \mathrm{Centre}(\Theta_t)$  ▷ see Assumption 1
8:  $\xi_{dir,t} \sim \mathcal{N}\big(0, \frac{4 k_{ft}}{\rho} I_d\big)$
9:  $\Theta_{t+1} = \big\{\theta \in \Theta_t : \big\langle \nabla F(\theta_t; X^{(n)}) + \xi_{dir,t},\, \theta - \theta_t \big\rangle < 0\big\}$
10: Define the probability measure $\pi(t) \propto \exp\big(-\frac{\varepsilon}{448\hat{\Delta}}\, F(\theta_t; X^{(n)})\big)$ for $t \in \{0, \ldots, k_{ft} - 1\}$
11: Output $\theta_{\hat{t}}$ where $\hat{t} \sim \pi$

Similar to the non-private cutting plane method, LocDPCuttingPlane is not a descent algorithm. As a result, we need a mechanism for selecting an iterate with small loss. In the next lemma, we provide a bespoke analysis of the exponential mechanism with the score function $F(\theta; X^{(n)})$ defined in Equation (1). Note that the worst-case sensitivity of $F(\theta; X^{(n)})$ is $R$. However, the next result demonstrates that, through a more careful analysis of the sensitivity of $F(\theta; X^{(n)})$, the noise scale due to privacy can be significantly reduced. The proof can be found in Appendix E.

**Lemma 3.2.** Let $\varepsilon \in \mathbb{R}$, $k \in \mathbb{N}$, and $d \in \mathbb{N}$ be constants. Let $\Theta \subseteq \mathbb{R}^d$ be a set with bounded diameter $\mathrm{diam}$. Let $X^{(n)} \in (\mathbb{R}^d)^n$ be a dataset and $\theta^\star \in \mathrm{GM}(X^{(n)})$. Let $\{\theta_1, \ldots, \theta_k\} \subseteq \Theta$ be $k$ fixed vectors. Also, assume that $\theta^\star \in \Theta$. Let $\Delta$ be such that $\Delta \ge 3\,\Delta_{3n/4}(\theta^\star) + 2\,\mathrm{diam}$. Consider the following probability measure over $\{1, \ldots, k\}$:
$$\pi(i; X^{(n)}) = \frac{\exp\big(-\frac{\varepsilon}{2\Delta}\, F(\theta_i; X^{(n)})\big)}{\sum_{j \in [k]} \exp\big(-\frac{\varepsilon}{2\Delta}\, F(\theta_j; X^{(n)})\big)}, \quad i \in [k].$$
1. Let $\hat{i} \sim \pi(\cdot\,; X^{(n)})$ and $\mathrm{OPT} \triangleq \min_{i \in [k]}\{F(\theta_i; X^{(n)}) - F(\theta^\star; X^{(n)})\}$. Then, for every $\beta \in (0, 1]$, we have $\Pr\big[F(\theta_{\hat{i}}; X^{(n)}) - F(\theta^\star; X^{(n)}) \le \mathrm{OPT} + \frac{2\Delta}{\varepsilon}\log(k/\beta)\big] \ge 1 - \beta$.
2. Let $X'^{(n)}$ be a dataset of size $n$ that differs in one sample from $X^{(n)}$. Then, for every $i \in [k]$, we have $\exp(-\varepsilon)\,\pi(i; X'^{(n)}) \le \pi(i; X^{(n)}) \le \exp(\varepsilon)\,\pi(i; X'^{(n)})$.

The next theorem provides the privacy guarantee of Algorithm 4. The privacy analysis differs from that of the other algorithms in the paper. This deviation arises because, to analyze the privacy guarantee of Line 10 of Algorithm 4, we use Lemma 3.2; the guarantee in Lemma 3.2 holds provided that $\Theta_0$, defined in Line 4 of Algorithm 4, satisfies $\theta^\star \in \Theta_0$. Hence, the privacy guarantee of Algorithm 4 only satisfies approximate DP.

**Theorem 3.3.** Let LocDPCuttingPlane$_n$ denote Algorithm 4. Fix $d \in \mathbb{N}$, $R > 0$, $r > 0$, $\varepsilon > 0$, $\delta \in (0, 1]$, and $\beta \in (0, 1]$. Then, for every $n \in \mathbb{N}$ and every dataset $X^{(n)} \in (B_d(R))^n$, the output of LocDPCuttingPlane$_n$ satisfies $(\varepsilon, \delta)$-DP.
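A minimal sketch of the selection step in Line 10 of Algorithm 4, i.e., the exponential mechanism with the rescaled score from Lemma 3.2. Here `Delta` plays the role of the scale $\Delta$ from the lemma and is supplied by the caller; the naming is ours.

```python
import numpy as np

def select_iterate(losses, eps, Delta, rng):
    """Sample an index with probability proportional to exp(-eps * F(theta_t) / (2*Delta)).

    losses: array of F(theta_t; X) over the iterates.
    Delta:  scale such that the effective sensitivity of F over the feasible
            set is at most Delta (Lemma 3.2), replacing the worst-case R.
    """
    scores = -eps * np.asarray(losses, dtype=float) / (2 * Delta)
    scores -= scores.max()            # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return rng.choice(len(losses), p=probs)

rng = np.random.default_rng(3)
losses = np.array([10.0, 3.0, 3.5, 9.0])
print(select_iterate(losses, eps=1.0, Delta=0.5, rng=rng))  # usually index 1 or 2
```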
We also make the following assumption about the performance of the Centre subroutine in Algorithm 4.

**Assumption 1.** There exists some $\tau \in (0, 1]$ such that for all $t \in \{0, \ldots, k_{ft} - 1\}$, the subroutine Centre in Algorithm 4 satisfies $\mathrm{vol}(\Theta_{t+1}) \le (1 - \tau)\,\mathrm{vol}(\Theta_t)$. Furthermore, the time for calling the routine Centre is $T_c$.

Using the John ellipsoid [Joh14] as the Centre makes $\tau$ a dimension-independent constant and $T_c = O(d^{1+\omega})$ (by [LSW15]). Now we are ready to state the utility guarantee of Algorithm 4.

**Theorem 3.4.** Let LocDPCuttingPlane$_n$ denote Algorithm 4. For every $d \in \mathbb{N}$, $R > 0$, $r > 0$, $\varepsilon > 0$, $\delta \in (0, 1]$, and $\beta \in (0, 1]$, $\mathcal{A} = \{\mathrm{LocDPCuttingPlane}_n\}_{n \ge 1}$ satisfies the following: for every $n \in \mathbb{N}$ and every dataset $X^{(n)} \in (B_d(R))^n$, given
$$n = \Omega\!\left(\sqrt{\frac{d\log(R/r)}{\rho}}\,\log\log(R/r) + \sqrt{\frac{1}{\rho}}\,\log(R/r)\right), \quad \text{where } \rho = \frac{\varepsilon^2}{16\log(2/\delta) + 8\varepsilon},$$
we have the following. Let $\kappa \triangleq \frac{n\sqrt{\tau\rho}}{\sqrt{d}}$ and $\alpha = \tilde{O}\big(\frac{\sqrt{d}\log(\kappa)}{n\sqrt{\rho}}\big)$. Then
$$\Pr\Big[F(\hat{\theta}; X^{(n)}) \le (1 + \alpha)\, F(\theta^\star; X^{(n)}) + \alpha n r\Big] \ge 1 - 3\beta.$$
Moreover, assuming the datapoints satisfy $\max_{i \in [n]} |X^{(n)} \cap B_d(x_i, r)| < 3n/4$, we have
$$\Pr\Big[F(\hat{\theta}; X^{(n)}) \le (1 + \alpha)\, F(\theta^\star; X^{(n)})\Big] \ge 1 - 3\beta,$$
where $\hat{\theta}$ is the output of Algorithm 4.

## 4 Pure-DP Algorithm for the Geometric Median

In this section, we propose an algorithm based on the assumption that we have access to an oracle that outputs an exact $\mathrm{GM}(X^{(n)})$. Before presenting the algorithm, we need a definition: for two sequences $a = (a_1, \ldots, a_n) \in (\mathbb{R}^d)^n$ and $b = (b_1, \ldots, b_n) \in (\mathbb{R}^d)^n$, we define the Hamming distance as $d_H(a, b) = \sum_{i=1}^n \mathbb{1}[a_i \ne b_i]$. The proposed algorithm is shown in Algorithm 5, and its utility and privacy guarantees are presented in the following theorem.

**Algorithm 5** SInvS$_n$
1: **Input:** dataset $X^{(n)} \in (B_d(R))^n$, privacy parameter $\varepsilon$-DP, discretization error $r$.
2: For every $y \in B_d(R)$:
$$\mathrm{len}_r(X, y) \triangleq \min_{\tilde{X}^{(n)} \in (\mathbb{R}^d)^n}\Big\{d_H(X^{(n)}, \tilde{X}^{(n)})\ \text{such that}\ \exists z \in B_d(y, r)\ \text{with}\ \mathrm{GM}(\tilde{X}^{(n)}) = z\Big\}$$
3: Define the density
$$\mathrm{d}\pi(y) = \frac{\exp\big(-\frac{\varepsilon}{2}\,\mathrm{len}_r(X, y)\big)}{\int_{y' \in B_d(R)} \exp\big(-\frac{\varepsilon}{2}\,\mathrm{len}_r(X, y')\big)\,\mathrm{d}y'}\;\mathbb{1}[y \in B_d(R)]$$
4: Output $\hat{\theta} \sim \pi$

**Theorem 4.1.** Let SInvS$_n$ denote Algorithm 5. Fix $d \in \mathbb{N}$, $R > 0$, $r > 0$, and $\varepsilon > 0$. Then, for every $n \in \mathbb{N}$ and every dataset $X^{(n)} \in (B_d(R))^n$, the output of SInvS$_n$ satisfies $\varepsilon$-DP. Also, for every $\beta \in (0, 1)$ and every $n > 2k$, where $k \triangleq 2\big\lfloor \frac{2}{\varepsilon}\big(\log(1/\beta) + d\log(R/r)\big) \big\rfloor$, with probability at least $1 - \beta$ we have:
1. The value of the cost function satisfies $F(\hat{\theta}; X^{(n)}) \le \big(1 + \frac{4k}{n - 2k}\big)\, F(\theta^\star; X^{(n)}) + nr$.
2. In terms of distance, $\|\hat{\theta} - \theta^\star\| \le r + \min_{\gamma \in (1/2, 1] :\, \gamma > \frac{1}{2} + \frac{k}{n}}\ \frac{2(\gamma - k/n)}{2(\gamma - k/n) - 1}\,\Delta_{\gamma n}(\theta^\star)$.

The proof of Theorem 4.1 is provided in Appendix F. The proof is based on showing that the output $\hat{\theta} = \mathrm{GM}(\tilde{X}^{(n)})$ is such that $X^{(n)}$ and $\tilde{X}^{(n)}$ differ in at most $k = O(d\log(R)/\varepsilon)$ datapoints with high probability. Then, we use the properties of the geometric median to show that the sensitivity of the GM to changing $k < n/2$ points can be bounded by the value of the optimal loss at $\theta^\star = \mathrm{GM}(X^{(n)})$.

**Lemma 4.2.** For every $n \in \mathbb{N}$, every $k < \frac{n}{2}$, and every $(x_1, \ldots, x_n, y_1, \ldots, y_k) \in (\mathbb{R}^d)^{n+k}$, define $\theta_0 = \mathrm{GM}((x_1, \ldots, x_n))$ and $\theta_k = \mathrm{GM}((x_1, \ldots, x_{n-k}, y_1, \ldots, y_k))$. Then,
$$\|\theta_k - \theta_0\| \le \frac{2}{n - 2k}\, F(\theta_0; (x_1, \ldots, x_n)).$$
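As a quick numerical sanity check of Lemma 4.2, one can compare the geometric medians before and after corrupting $k$ points, using Weiszfeld's iteration as a stand-in for the exact GM oracle. All names and constants here are illustrative choices of ours.

```python
import numpy as np

def weiszfeld(X, iters=500, eps=1e-12):
    theta = X.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(X - theta, axis=1), eps)
        w = 1.0 / d
        theta = (w[:, None] * X).sum(axis=0) / w.sum()
    return theta

rng = np.random.default_rng(4)
n, k, d = 200, 20, 5
X = rng.normal(size=(n, d))
theta0 = weiszfeld(X)
F0 = np.linalg.norm(X - theta0, axis=1).sum()

# Replace the last k points by arbitrary far-away points y_1, ..., y_k.
Xk = X.copy()
Xk[-k:] = rng.normal(size=(k, d)) * 1e3
thetak = weiszfeld(Xk)

lhs = np.linalg.norm(thetak - theta0)
rhs = 2 / (n - 2 * k) * F0   # bound from Lemma 4.2
print(lhs <= rhs, lhs, rhs)  # the bound holds despite the large corruption
```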
## 5 Lower Bound on the Sample Complexity

In this section, we prove a lower bound on the sample complexity of any $(\varepsilon, \delta)$-DP algorithm for the task of private geometric median with a multiplicative error.

**Theorem 5.1.** Let $\varepsilon_0, \alpha_0, d_0$ be universal constants. Then, for every $\varepsilon \le \varepsilon_0$, $\alpha \le \alpha_0$, and $d \ge d_0$, and every $(\varepsilon, \delta)$-DP algorithm $\mathcal{A}_n : (\mathbb{R}^d)^n \to \mathcal{M}_1(\mathbb{R}^d)$ (with $\delta = O(\sqrt{d}/n)$) such that for every dataset $X^{(n)} \in (\mathbb{R}^d)^n$ its output satisfies $\mathbb{E}_{\hat{\theta} \sim \mathcal{A}_n(X^{(n)})}\big[F(\hat{\theta}; X^{(n)})\big] \le (1 + \alpha)\min_{\theta \in B^\circ_d(1)} F(\theta; X^{(n)})$, we require $n = \Omega\big(\frac{\sqrt{d}}{\varepsilon}\big)$.

This result, whose proof can be found in Appendix G, shows that the sample complexity of the proposed polynomial-time algorithms is tight in terms of the dependence on $\varepsilon$ and $d$.

## 6 Numerical Example

[Figure 1: $F(\hat{\theta}; X^{(n)})/F(\theta^\star; X^{(n)})$ as a function of $R \in [10^3, 10^{10}]$, for $d = 200$, $n = 3000$, $\varepsilon = 2$, $\delta = 1/n$; curves: Ours, DPGD.]

In this section, we numerically compare LocDPGD$_n$ (Algorithm 3) and DPGD on a synthetic dataset. The dataset consists of two subsets: one tightly clustered at a random location in $B_d(R)$, and the other uniformly distributed over $B_d(R)$. We plot $F(\hat{\theta}; X^{(n)})/F(\theta^\star; X^{(n)})$ for both algorithms as $R$ varies. The results show that LocDPGD$_n$'s performance degrades more gracefully than DP-GD's as $R$ increases. See Appendix H for experimental details and more results.

## 7 Conclusion and Limitations

In this paper, we presented three private algorithms for the geometric median task, ensuring an excess error guarantee that scales with the effective data scale. Our results open up many directions: we believe our warm-up algorithm has broader applications, and finding other problems where it can be used as a subroutine is interesting. Another direction is to characterize the optimal run-time: is it possible to develop a linear-time algorithm, i.e., $\Theta(nd)$, with optimal excess error?

**Acknowledgments.** The authors would like to thank Jad Silbak, Eliad Tsfadia, and Mohammad Yaghini for helpful discussions.

## References

[ACGMMTZ16] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. "Deep learning with differential privacy". In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 2016, pp. 308–318.

[AHJSDT22] A. Acharya, A. Hashemi, P. Jain, S. Sanghavi, I. S. Dhillon, and U. Topcu. "Robust training in high dimensions via block coordinate geometric median descent". In: International Conference on Artificial Intelligence and Statistics. PMLR, 2022, pp. 11145–11168.

[ASSU23] M. Aliakbarpour, R. Silver, T. Steinke, and J. Ullman. "Differentially Private Medians and Interior Points for Non-Pathological Data". arXiv preprint arXiv:2305.13440 (2023).

[ALMM19] N. Alon, R. Livni, M. Malliaris, and S. Moran. "Private PAC learning implies finite Littlestone dimension". In: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. 2019, pp. 852–860.

[AGMRSSSTT22] E. Amid, A. Ganesh, R. Mathews, S. Ramaswamy, S. Song, T. Steinke, V. M. Suriyakumar, O. Thakkar, and A. Thakurta. "Public data-assisted mirror descent for private model training". In: International Conference on Machine Learning. PMLR, 2022, pp. 517–535.

[ABGMU22] R. Arora, R. Bassily, C. Guzmán, M. Menart, and E. Ullah. "Differentially private generalized linear models revisited". Advances in Neural Information Processing Systems 35 (2022), pp. 22505–22517.

[AD20] H. Asi and J. C. Duchi. "Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms". Advances in Neural Information Processing Systems 33 (2020), pp. 14106–14117.

[ABL23] M. Avella-Medina, C. Bradshaw, and P.-L. Loh. "Differentially private inference via noisy optimization". The Annals of Statistics 51.5 (2023), pp. 2067–2092.

[BHI02] M. Bădoiu, S. Har-Peled, and P. Indyk. "Approximate clustering via core-sets". In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing. 2002, pp. 250–257.

[BMS22] R. Bassily, M. Mohri, and A. T. Suresh. "Differentially private learning with margin guarantees". Advances in Neural Information Processing Systems 35 (2022), pp. 32127–32141.

[BST14] R. Bassily, A. Smith, and A. Thakurta. "Private empirical risk minimization: Efficient algorithms and tight error bounds". In: 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE,
2014, pp. 464–473.

[BNS13] A. Beimel, K. Nissim, and U. Stemmer. "Private learning and sanitization: Pure vs. approximate differential privacy". In: International Workshop on Approximation Algorithms for Combinatorial Optimization. Springer, 2013, pp. 363–378.

[BMM03] P. Bose, A. Maheshwari, and P. Morin. "Fast approximations for sums of distances, clustering and the Fermat–Weber problem". Computational Geometry 24.3 (2003), pp. 135–146.

[BDRS18] M. Bun, C. Dwork, G. N. Rothblum, and T. Steinke. "Composable and versatile privacy via truncated CDP". In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. 2018, pp. 74–86.

[BNSV15] M. Bun, K. Nissim, U. Stemmer, and S. Vadhan. "Differentially private release and learning of threshold functions". In: 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. IEEE, 2015, pp. 634–649.

[BS16] M. Bun and T. Steinke. "Concentrated differential privacy: Simplifications, extensions, and lower bounds". In: Theory of Cryptography Conference. Springer, 2016, pp. 635–658.

[CMS11] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. "Differentially private empirical risk minimization". Journal of Machine Learning Research 12.Mar (2011), pp. 1069–1109.

[CSX17] Y. Chen, L. Su, and J. Xu. "Distributed statistical machine learning in adversarial settings: Byzantine gradient descent". Proceedings of the ACM on Measurement and Analysis of Computing Systems 1.2 (2017), pp. 1–25.

[CKMST21] E. Cohen, H. Kaplan, Y. Mansour, U. Stemmer, and E. Tsfadia. "Differentially-private clustering of easy instances". In: International Conference on Machine Learning. PMLR, 2021, pp. 2049–2059.

[CLNSS23] E. Cohen, X. Lyu, J. Nelson, T. Sarlós, and U. Stemmer. "Optimal differentially private learning of thresholds and quasi-concave optimization". In: Proceedings of the 55th Annual ACM Symposium on Theory of Computing. 2023, pp. 472–482.

[CLMPS16] M. B. Cohen, Y. T. Lee, G. Miller, J. Pachocki, and A. Sidford. "Geometric median in nearly linear time". In: Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing. 2016, pp. 9–21.

[DNPR10] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. "Differential privacy under continual observation". In: Proceedings of the Forty-Second ACM Symposium on Theory of Computing. 2010, pp. 715–724.

[DNRR15] C. Dwork, M. Naor, O. Reingold, and G. N. Rothblum. "Pure differential privacy for rectangle queries via private partitions". In: International Conference on the Theory and Application of Cryptology and Information Security. Springer, 2015, pp. 735–751.

[DNRRV09] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. Vadhan. "On the complexity of differentially private data release: efficient algorithms and hardness results". In: Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing. 2009, pp. 381–390.

[DR+14] C. Dwork, A. Roth, et al. "The algorithmic foundations of differential privacy". Foundations and Trends in Theoretical Computer Science 9.3–4 (2014), pp. 211–407.

[DR16] C. Dwork and G. N. Rothblum. "Concentrated differential privacy". arXiv preprint arXiv:1603.01887 (2016).

[FGGPS22] S. Farhadkhani, R. Guerraoui, N. Gupta, R. Pinot, and J. Stephan. "Byzantine machine learning made easy by resilient averaging of momentums". In: International Conference on Machine Learning. PMLR, 2022, pp. 6246–6283.

[FKT20] V. Feldman, T. Koren, and K. Talwar. "Private stochastic convex optimization: optimal rates in linear time".
In: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. 2020, pp. 439–449.

[GHNOSTTW23] A. Ganesh, M. Haghifam, M. Nasr, S. Oh, T. Steinke, O. Thakkar, A. Thakurta, and L. Wang. "Why is public pretraining necessary for private model training?" In: International Conference on Machine Learning. PMLR, 2023, pp. 10611–10627.

[GHST24] A. Ganesh, M. Haghifam, T. Steinke, and A. Thakurta. "Faster differentially private convex optimization via second-order methods". Advances in Neural Information Processing Systems 36 (2024).

[Joh14] F. John. "Extremum problems with inequalities as subsidiary conditions". Traces and Emergence of Nonlinear Programming (2014), pp. 197–215.

[KDRT21] P. Kairouz, M. R. Diaz, K. Rush, and A. Thakurta. "(Nearly) Dimension Independent Private ERM with AdaGrad Rates via Publicly Estimated Subspaces". In: Proceedings of the Thirty-Fourth Conference on Learning Theory. Ed. by M. Belkin and S. Kpotufe. Vol. 134. Proceedings of Machine Learning Research. PMLR, 15–19 Aug 2021, pp. 2717–2746.

[KLSU19] G. Kamath, J. Li, V. Singhal, and J. Ullman. "Privately learning high-dimensional distributions". In: Conference on Learning Theory. PMLR, 2019, pp. 1853–1902.

[KLMNS20] H. Kaplan, K. Ligett, Y. Mansour, M. Naor, and U. Stemmer. "Privately learning thresholds: Closing the exponential gap". In: Conference on Learning Theory. PMLR, 2020, pp. 2263–2285.

[KST12] D. Kifer, A. Smith, and A. Thakurta. "Private convex empirical risk minimization and high-dimensional regression". In: Conference on Learning Theory. 2012, pp. 25.1–25.40.

[Ksc17] F. R. Kschischang. "The complementary error function". Online, April (2017).

[LM00] B. Laurent and P. Massart. "Adaptive estimation of a quadratic functional by model selection". Annals of Statistics (2000), pp. 1302–1338.

[LUZ20] H. Lê Nguyen, J. Ullman, and L. Zakynthinou. "Efficient private algorithms for learning large-margin halfspaces". In: Algorithmic Learning Theory. PMLR, 2020, pp. 704–724.

[LSW15] Y. T. Lee, A. Sidford, and S. C.-w. Wong. "A faster cutting plane method and its implications for combinatorial and convex optimization". In: 2015 IEEE 56th Annual Symposium on Foundations of Computer Science. IEEE, 2015, pp. 1049–1065.

[Lev65] A. J. Levin. "An algorithm for minimizing convex functions". Dokl. Akad. Nauk SSSR 160 (1965), pp. 1244–1247. ISSN: 0002-3264.

[MT07] F. McSherry and K. Talwar. "Mechanism design via differential privacy". In: 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07). IEEE, 2007, pp. 94–103.

[EFGH23] E.-M. El-Mhamdi, S. Farhadkhani, R. Guerraoui, and L.-N. Hoang. "On the strategyproofness of the geometric median". In: International Conference on Artificial Intelligence and Statistics. PMLR, 2023, pp. 2603–2640.

[Min15] S. Minsker. "Geometric median and robust estimation in Banach spaces" (2015).

[Nes98] Y. Nesterov. "Introductory lectures on convex programming, volume I: Basic course". Lecture notes 3.4 (1998), p. 5.

[New65] D. J. Newman. "Location of the maximum on unimodal surfaces". Journal of the ACM (JACM) 12.3 (1965), pp. 395–398.

[NS18] K. Nissim and U. Stemmer. "Clustering algorithms for the centralized and local models". In: Algorithmic Learning Theory. PMLR, 2018, pp. 619–653.

[NSV16] K. Nissim, U. Stemmer, and S. Vadhan. "Locating a small cluster privately". In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 2016, pp. 413–427.

[PKH22] K. Pillutla, S. M. Kakade, and Z. Harchaoui. "Robust aggregation for federated learning".
IEEE Transactions on Signal Processing 70 (2022), pp. 1142–1154.

[Sha11] O. Shamir. "A variant of Azuma's inequality for martingales with subgaussian tails". arXiv preprint arXiv:1110.2392 (2011).

[STU17] A. Smith, A. Thakurta, and J. Upadhyay. "Is interaction necessary for distributed private learning?" In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 58–77.

[SCS13] S. Song, K. Chaudhuri, and A. D. Sarwate. "Stochastic gradient descent with differentially private updates". In: 2013 IEEE Global Conference on Signal and Information Processing. IEEE, 2013, pp. 245–248.

[SSTT21] S. Song, T. Steinke, O. Thakkar, and A. Thakurta. "Evading the curse of dimensionality in unconstrained private GLMs". In: International Conference on Artificial Intelligence and Statistics. PMLR, 2021, pp. 2638–2646.

[TCKMS22] E. Tsfadia, E. Cohen, H. Kaplan, Y. Mansour, and U. Stemmer. "FriendlyCore: Practical differentially private aggregation". In: International Conference on Machine Learning. PMLR, 2022, pp. 21828–21863.

[WLCG20] Z. Wu, Q. Ling, T. Chen, and G. B. Giannakis. "Federated variance-reduced stochastic gradient descent with robustness to Byzantine attacks". IEEE Transactions on Signal Processing 68 (2020), pp. 4583–4596.

## NeurIPS Paper Checklist

1. **Claims.** Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes]. Justification: The abstract and the introduction completely summarize our findings.

2. **Limitations.** Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes]. Justification: In Section 7, we discussed two limitations of our work.
3. **Theory Assumptions and Proofs.** Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes]. Justification: In our problem setup, we completely discussed all the assumptions. Also, a complete proof of every claim is presented in the appendix.

4. **Experimental Result Reproducibility.** Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes]. Justification: In Appendix H, we discussed all the details behind our implementation. Also, we release the code. Since the dataset considered is synthetic, there is no concern regarding the dataset.
5. **Open access to data and code.** Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes]. Justification: We have released the code along with a Colab notebook.

6. **Experimental Setting/Details.** Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes]. Justification: In our experiments, we compare our proposed algorithm with a well-known baseline. We implemented our algorithm from scratch. Also, all the details are included in the code and Appendix H.
7. **Experiment Statistical Significance.** Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes]. Justification: We have included the error bars in the plots.

8. **Experiments Compute Resources.** Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes]. Justification: Our results can be produced using public Google Colab.

9. **Code of Ethics.** Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)? Answer: [NA]. Justification: This question is not applicable to our paper.

10. **Broader Impacts.** Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA]. Justification: Our work does not have any societal impact.
11. **Safeguards.** Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA]. Justification: It is not applicable to our work.

12. **Licenses for existing assets.** Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA]. Justification: Not applicable to our work.
• For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
• If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: It is not applicable to our work.

Guidelines:
• The answer NA means that the paper does not release new assets.
• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
• The paper should discuss whether and how consent was obtained from people whose asset is used.
• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: It is not applicable to our work.

Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: It is not applicable to our work.

Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.

A Preliminaries

A.1 Gradient of the Geometric Loss

For a fixed x ∈ R^d, the loss θ ↦ ‖θ − x‖₂ is differentiable everywhere except at θ = x, where 0 is a valid subgradient:

∇_θ ‖θ − x‖₂ = (θ − x)/‖θ − x‖₂ if θ ≠ x, and 0 if θ = x. (3)
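To make the preliminaries concrete, here is a minimal sketch (our own illustration, not from the paper; the function names are ours) of the objective F(θ; X⁽ⁿ⁾) from Equation (1) and the subgradient in Equation (3):

```python
import numpy as np

def geometric_loss(theta, X):
    """F(theta; X) = sum_i ||theta - x_i||_2 over the rows x_i of X."""
    return np.linalg.norm(X - theta, axis=1).sum()

def geometric_loss_subgradient(theta, X):
    """A subgradient of F at theta, per Equation (3): each point x_i
    contributes (theta - x_i)/||theta - x_i||, or 0 when theta == x_i."""
    diffs = theta - X                        # shape (n, d)
    norms = np.linalg.norm(diffs, axis=1)    # shape (n,)
    safe = norms > 0                         # skip points equal to theta
    return (diffs[safe] / norms[safe, None]).sum(axis=0)
```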
A.2 DP Gradient Descent (DPGD)

In this section, we provide the algorithmic description of DPGD and its privacy and utility analysis for completeness.

Algorithm 6 DPGD
1: Inputs: initialization point θ₁ ∈ R^d, dataset X⁽ⁿ⁾ ∈ (R^d)ⁿ, privacy budget ρ, feasible set Θ, stepsize η, number of iterations T.
2: σ² = T/(2ρn²)
3: for t ∈ {1, . . . , T} do: θ_{t+1} = Π_Θ(θ_t − η(∇F(θ_t; X⁽ⁿ⁾) + ξ_t)), where ξ_t ∼ N(0, σ²I_d).
4: Output (1/T) Σ_{t=1}^T θ_t.

Lemma A.1. Let Θ ⊆ R^d be a closed and convex set with a finite diameter diam(Θ); write R ≜ diam(Θ). Let ℓ: Θ × Z → R be a loss function such that for every z ∈ Z, ℓ(·, z) is convex and L-Lipschitz. Let X⁽ⁿ⁾ = (z₁, . . . , z_n) ∈ Zⁿ and L̂_n(θ) = (1/n) Σ_{i∈[n]} ℓ(θ, z_i). Consider the DP gradient descent algorithm θ_{t+1} = Π_Θ(θ_t − η(∇L̂_n(θ_t) + ξ_t)), where ξ_t ∼ N(0, σ²I_d). Then, for every T ∈ N, by setting

η = R√(2ρn²/(12 L² d T²)) and σ² = L²T/(2ρn²),

we have the following: {θ_t}_{t∈[T]} satisfies ρ-zCDP. Also, for every β > 0, given Td ≥ log(4/β) and 1 ≤ √(56 log(2/β)), with probability at least 1 − β, we have

(1/T) Σ_{t∈[T]} L̂_n(θ_t) − min_{θ∈Θ} L̂_n(θ) ≤ L·diam(Θ)·O(√(d log(2/β)/(ρn²)) + n√ρ/(T√d)).

Proof. The privacy proof is based on the zCDP analysis of the Gaussian mechanism and the composition property of zCDP [BS16]. Let g_t ≜ ∇L̂_n(θ_t) + ξ_t and θ* ∈ argmin_{θ∈Θ} L̂_n(θ). Note that we can replace ∇L̂_n(θ_t) by any subgradient at θ_t. By the convexity of ℓ and the first-order convexity condition, we can write

(1/T) Σ_{t∈[T]} [L̂_n(θ_t) − L̂_n(θ*)] ≤ (1/T) Σ_{t∈[T]} ⟨∇L̂_n(θ_t), θ_t − θ*⟩.

Then, by the contraction property of the projection, we can write

‖θ_{t+1} − θ*‖² = ‖Π_Θ(θ_t − ηg_t) − θ*‖² ≤ ‖θ_t − θ*‖² + η²‖g_t‖² − 2η⟨g_t, θ_t − θ*⟩
 ≤ ‖θ_t − θ*‖² + 2η²‖∇L̂_n(θ_t)‖² + 2η²‖ξ_t‖² − 2η⟨g_t, θ_t − θ*⟩
 ≤ ‖θ_t − θ*‖² + 2η²L² + 2η²‖ξ_t‖² − 2η⟨g_t, θ_t − θ*⟩.

Here, we have used that for every a, b ∈ R^d, ‖a + b‖² ≤ 2‖a‖² + 2‖b‖², and that ‖∇L̂_n(θ)‖ ≤ L for every θ. Therefore, we conclude that

⟨g_t, θ_t − θ*⟩ ≤ (1/(2η))(‖θ_t − θ*‖² − ‖θ_{t+1} − θ*‖²) + η‖ξ_t‖² + ηL². (4)

Define the following random variable for every t ∈ [T]:

Y_t = ⟨∇L̂_n(θ_t), θ_t − θ*⟩ − ⟨g_t, θ_t − θ*⟩. (5)

Also, define the following filtration:

F_t = σ(θ₀, . . . , θ_t), (6)

which is the sigma-field generated by θ₀, . . . , θ_t.

Lemma A.2. {Y_t}_{t∈[T]} is a martingale difference sequence adapted to {F_t}_{t∈[T]}.

Proof. Notice that ∇L̂_n(θ_t), θ_t, and θ* are F_t-measurable. Therefore, we can write

E[Y_t|F_t] = ⟨∇L̂_n(θ_t), θ_t − θ*⟩ − ⟨E[g_t|F_t], θ_t − θ*⟩.

By definition, ξ_t is independent of the history up to time t. Therefore, E[ξ_t|F_t] = E[ξ_t] = 0, which gives

E[g_t|F_t] = E[∇L̂_n(θ_t) + ξ_t|F_t] = ∇L̂_n(θ_t). (7)

Therefore, E[Y_t|F_t] = 0. Moreover, by the Cauchy–Schwarz inequality and the boundedness of Θ, we can write

E[|Y_t|] = E[|⟨ξ_t, θ_t − θ*⟩|] ≤ E[‖ξ_t‖‖θ_t − θ*‖] ≤ R·E[‖ξ_t‖] < ∞. (8)

Therefore, {Y_t}_{t∈[T]} is a martingale difference sequence, as was to be shown.

Using Equation (4) and the definition of Y_t in Equation (5), we can write

⟨∇L̂_n(θ_t), θ_t − θ*⟩ ≤ (1/(2η))(‖θ_t − θ*‖² − ‖θ_{t+1} − θ*‖²) + η‖ξ_t‖² + ηL² + Y_t.

Summing it from t = 0 to T − 1 gives

(1/T) Σ_{t∈[T]} ⟨∇L̂_n(θ_t), θ_t − θ*⟩ ≤ ‖θ₀ − θ*‖²/(2ηT) + ηL² + (η/T) Σ_{t∈[T]} ‖ξ_t‖² + (1/T) Σ_{t∈[T]} Y_t
 ≤ R²/(2ηT) + ηL² + (η/T) Σ_{t∈[T]} ‖ξ_t‖² [term (A)] + (1/T) Σ_{t∈[T]} Y_t [term (B)]. (9)

Analyzing (A) in Equation (9): Notice that Σ_{t∈[T]} ‖ξ_t‖² =_d σ²‖Y‖² with Y ∼ N(0, I_{Td}). Therefore, by Corollary C.3, for every β ∈ (0, 1), provided that Td ≥ log(4/β), with probability at least 1 − β/2, we have

(η/T) Σ_{t∈[T]} ‖ξ_t‖² ≤ 5ησ²d. (10)

Analyzing (B) in Equation (9): we use the following lemma.

Lemma A.3 (Shamir [Sha11]). Let M ∈ N. Let {Z_m}_{m∈[M]} be a martingale difference sequence adapted to a filtration {F_m}_{m∈[M]}, and suppose there are constants b ≥ 1 and c > 0 such that for any m and any α > 0, it holds that P(|Z_m| ≥ α|F_{m−1}) ≤ b exp(−cα²). Then for any β > 0, it holds with probability at least 1 − β that

(1/M) Σ_{m∈[M]} Z_m ≤ √(28 b log(1/β)/(c M)).

We can rephrase Equation (5) as Y_t = ⟨∇L̂_n(θ_t), θ_t − θ*⟩ − ⟨g_t, θ_t − θ*⟩ = ⟨ξ_t, θ* − θ_t⟩. Notice that, conditioned on F_t, we have ⟨ξ_t, θ* − θ_t⟩ | F_t ∼ N(0, σ²‖θ_t − θ*‖²). Therefore,

P(|⟨ξ_t, θ* − θ_t⟩| ≥ α | F_t) ≤ 2 exp(−α²/(2σ²R²)).

Note that the bound holds for every t ∈ [T]. Therefore, using Lemma A.3 with b = 2 and c = 1/(2σ²R²), with probability at least 1 − β/2, we have

(1/T) Σ_{t∈[T]} ⟨ξ_t, θ* − θ_t⟩ ≤ 2σR√(28 log(2/β)/T). (11)

From Equation (9), Equation (10), and Equation (11), we have, with probability at least 1 − β,

(1/T) Σ_{t∈[T]} L̂_n(θ_t) − min_{θ∈Θ} L̂_n(θ) ≤ R²/(2ηT) + ηL² + 5ησ²d + 2σR√(28 log(2/β)/T), (12)

provided that Td ≥ log(4/β). Let σ² = L²T/(2ρn²) and η = R√(2ρn²/(12 L² d T²)). Using these parameters, we obtain

(1/T) Σ_{t∈[T]} L̂_n(θ_t) − min_{θ∈Θ} L̂_n(θ) ≤ (RL/(n√ρ))(√(6d) + √(56 log(2/β))) + RLn√(2ρ)/(T√(12d))
 ≤ (4RL/(n√ρ))√(56 d log(2/β)) + RLn√(2ρ)/(T√(12d)),

where the last step is by assuming that 1 ≤ √(56 log(2/β)). This completes the proof.
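As a companion to Algorithm 6 and Lemma A.1, here is a minimal runnable sketch of projected noisy gradient descent (our own illustration; the ball-shaped feasible set and the stepsize are assumptions standing in for the general Π_Θ and η of the lemma):

```python
import numpy as np

def dpgd(X, theta1, rho, center, radius, eta, T, rng=None):
    """Sketch of Algorithm 6: projected noisy gradient descent under rho-zCDP.
    Feasible set: the ball B(center, radius). We run on the average loss, whose
    per-sample gradients have norm at most 1; sigma^2 = T/(2*rho*n^2) matches
    Line 2 of Algorithm 6."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    sigma = np.sqrt(T / (2.0 * rho * n * n))
    theta, iterates = np.asarray(theta1, dtype=float), []
    for _ in range(T):
        diffs = theta - X
        norms = np.linalg.norm(diffs, axis=1)
        mask = norms > 0
        grad = (diffs[mask] / norms[mask, None]).sum(axis=0) / n  # avg subgradient
        step = theta - eta * (grad + rng.normal(0.0, sigma, size=d))
        # Euclidean projection onto the feasible ball.
        offset = step - center
        dist = np.linalg.norm(offset)
        theta = step if dist <= radius else center + offset * (radius / dist)
        iterates.append(theta)
    return np.mean(np.stack(iterates), axis=0)  # average iterate, as in Line 4
```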
B Above Threshold Algorithm

Algorithm 7 Above Threshold
1: Inputs: queries {f₀, . . . , f_{k−1}}, privacy budget ρ-zCDP, threshold T.
2: ξ_thresh ∼ Lap(6/√(2ρ))
3: T̂ = T + ξ_thresh
4: for i ∈ [k] do
5:   ξ_i ∼ Lap(12/√(2ρ))
6:   if f_i + ξ_i > T̂ then
7:     Output î = i.
8:     Halt.
9: Output Fail.

C Technical Lemmas

Lemma C.1. Let σ > 0 and let Y be a random variable with the distribution N(0, σ²). Then, for every β ∈ (0, 1], we have P(|Y| > σ√(2 log(2/β))) ≤ β.

Lemma C.2 (Laurent and Massart [LM00]). Let m ∈ N and consider a random vector Y ∼ N(0, I_m). Then, for every t ≥ 0, P(‖Y‖² ≥ m + 2√(tm) + 2t) ≤ exp(−t).

Corollary C.3. Let β ∈ (0, 1) and m ∈ N with m ≥ log(2/β). Consider Y ∼ N(0, I_m). Then, with probability at least 1 − β, ‖Y‖² ≤ m(1 + 4√(log(2/β)/m)) ≤ 5m.

Lemma C.4. Let n ∈ N with n ≥ 4. Let X⁽ⁿ⁾ ∈ (R^d)ⁿ and X̃⁽ⁿ⁾ ∈ (R^d)ⁿ be two datasets that differ in one sample. Let θ* = GM(X⁽ⁿ⁾) and θ̃* = GM(X̃⁽ⁿ⁾). Let Δ_{3n/4}(θ*) be the radius of the smallest ball around θ* that contains at least 3n/4 of the points of X⁽ⁿ⁾. Then, ‖θ* − θ̃*‖ ≤ (3/2)·Δ_{3n/4}(θ*).

Proof. The proof is by contrapositive. In particular, we show that for every θ ∈ R^d such that ‖θ − θ*‖ > (3/2)·Δ_{3n/4}(θ*), we have θ ∉ GM(X̃⁽ⁿ⁾). Let I = {i ∈ [n]: x_i ∈ B_d(θ*, Δ_{3n/4}(θ*)) and x_i ∈ X̃⁽ⁿ⁾}, so that |I| ≥ ⌈3n/4⌉ − 1. Using the variational representation of ‖·‖₂, we can write

⟨∇F(θ; X̃⁽ⁿ⁾), θ − θ*⟩ ≥ Σ_{i∈I} ⟨(θ − x_i)/‖θ − x_i‖, θ − θ*⟩ − (n − |I|)‖θ − θ*‖,

where the last step follows from the Cauchy–Schwarz inequality applied to the points outside I. For every i ∈ I we have ‖x_i − θ*‖ ≤ Δ_{3n/4}(θ*) < (2/3)‖θ − θ*‖, which gives ⟨(θ − x_i)/‖θ − x_i‖, (θ − θ*)/‖θ − θ*‖⟩ ≥ √(1 − 4/9) = √5/3. Then, we can write

⟨∇F(θ; X̃⁽ⁿ⁾), θ − θ*⟩ ≥ ((√5/3)|I| − (n − |I|))‖θ − θ*‖ > 0,

so θ cannot be a minimizer of F(·; X̃⁽ⁿ⁾). Therefore ‖θ* − θ̃*‖ ≤ (3/2)·Δ_{3n/4}(θ*).

Lemma C.5. For every n ∈ N and every X⁽ⁿ⁾ = (x₁, . . . , x_n), GM(X⁽ⁿ⁾) lies in the convex hull of {x₁, . . . , x_n}.

Lemma C.6. Let (x₁, . . . , x_n) ∈ (R^d)ⁿ be a dataset and θ* = GM((x₁, . . . , x_n)). Let B ⊆ [n] be such that |B| < n/2. Then, for every θ, we have

‖θ* − θ‖ ≤ ((2n − 2|B|)/(n − 2|B|)) · max_{i∉B} ‖θ − x_i‖.

Proof. It is a simple modification of [CLMPS16, Lemma 24].

D Proof of Section 2

Proof of Lemma 2.2. The proof follows closely [NSV16, Lemma 4.5]. Let X and X̃ be two neighboring datasets of size n that differ in the first sample. For a fixed ν and i ∈ [n] with i ≠ 1, N_i(ν) can change by at most one. Also, in the worst case, the new datapoint can be close to the rest of the datapoints, so N₁(ν) can change by as much as n. Therefore, the sensitivity is bounded by

(1/m)((m − 1) + n) ≤ 1 + n/m ≤ 1 + 1/γ ≤ 3,

where the last step follows from γ ≥ 1/2.

Proof of Lemma 2.3. Let ν̂ be such that N(ν̂) ≥ ⌈γ₁n⌉. By the definition of N(·), this means that there exists a datapoint x_i such that the ball of radius ν̂ around it contains at least ⌈γ₁n⌉ datapoints. Let B = {j ∈ [n]: ‖x_i − x_j‖ > ν̂}. By the described argument, |B| ≤ (1 − γ₁)n. Then, we invoke Lemma C.6 with the described B and θ = x_i to obtain

‖θ* − x_i‖ ≤ ((2n − 2(1 − γ₁)n)/(n − 2(1 − γ₁)n)) ν̂ = (2γ₁/(2γ₁ − 1)) ν̂.

The first step follows because the function h: R → R, h(z) = (2n − 2z)/(n − 2z), is increasing for z < n/2. In the next step, we use the triangle inequality to write

Δ_{⌈γ₁n⌉}(θ*) ≤ Δ_{⌈γ₁n⌉}(x_i) + ‖x_i − θ*‖ ≤ ν̂ + (2γ₁/(2γ₁ − 1)) ν̂,

where Δ_{(·)}(·) is defined in Definition 1.1. Next, we turn to proving the upper bound on ν̂. By assumption, N(ν̂/2) < ⌈γ₂n⌉. For the sake of contradiction, assume that ν̂ > 4Δ_{⌈γ₂n⌉}(θ*). Then, consider the set G = {i ∈ [n]: ‖θ* − x_i‖ ≤ Δ_{⌈γ₂n⌉}(θ*)}. By definition, |G| ≥ ⌈γ₂n⌉. Consider an arbitrary subset of G of size ⌈γ₂n⌉. The main observation, which follows from the triangle inequality, is that a ball of radius 2Δ_{⌈γ₂n⌉}(θ*) around every point in G contains at least ⌈γ₂n⌉ datapoints. Therefore,

N(ν̂/2) ≥ N(2Δ_{⌈γ₂n⌉}(θ*)) ≥ ⌈γ₂n⌉,

which contradicts the assumption that N(ν̂/2) < ⌈γ₂n⌉.
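A minimal sketch of Algorithm 7 (our own illustration; query evaluation is abstracted as a list of precomputed values):

```python
import numpy as np

def above_threshold(queries, rho, threshold, rng=None):
    """Sketch of Algorithm 7: return the index of the first query whose
    noisy value exceeds the noisy threshold, or None ("Fail").
    Noise scales 6/sqrt(2*rho) and 12/sqrt(2*rho) match Lines 2 and 5."""
    rng = np.random.default_rng() if rng is None else rng
    noisy_threshold = threshold + rng.laplace(scale=6.0 / np.sqrt(2.0 * rho))
    for i, f_i in enumerate(queries):
        if f_i + rng.laplace(scale=12.0 / np.sqrt(2.0 * rho)) > noisy_threshold:
            return i
    return None  # Fail
```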
Proof of Theorem 2.4. The privacy proof simply follows from the privacy analysis of Above Threshold in [DR+14, Sec. 3.6]. We focus here on the utility guarantees.

Part 1: Let k = ⌈log(2R/r)⌉. It is simple to see that

P(Δ̂ = Fail) ≤ P(N(2^k r) + ξ_k ≤ m + ξ_thresh) = P(n − m ≤ ξ_thresh − ξ_k) ≤ P((1 − γ)n ≤ ξ_thresh − ξ_k),

where the last step follows from the assumption that max_{x_i, x_j ∈ X⁽ⁿ⁾} ‖x_i − x_j‖ ≤ 2R, which gives N(2^k r) = n by the definition of N(·). By a simple tail bound on the Laplace distribution, we have

P(|ξ_thresh| ≥ (6/√(2ρ)) log(4/β)) ≤ β/4 and P(|ξ_k| ≥ (12/√(2ρ)) log(4/β)) ≤ β/4.

Therefore, given n > (1/(1 − γ))·(18/√(2ρ)) log(4/β), we have P(n − m ≤ ξ_thresh − ξ_k) ≤ β/2.

Part 2: Lemma C.6 implies that for every ν such that N(ν) ≥ ⌈γn⌉, we have Δ_{⌈γn⌉}(θ*)·(2γ − 1)/(4γ − 1) ≤ ν. Therefore, we can write

P(Δ_{⌈γn⌉}(θ*)·(2γ − 1)/(4γ − 1) ≤ Δ̂) ≥ P(N(Δ̂) ≥ ⌈γn⌉), i.e., P(Δ_{⌈γn⌉}(θ*)·(2γ − 1)/(4γ − 1) > Δ̂) ≤ P(N(Δ̂) < ⌈γn⌉).

P(N(Δ̂) < ⌈γn⌉) ≤ P(N(Δ̂) < ⌈γn⌉ and Δ̂ ≠ Fail) + P(Δ̂ = Fail). (15)

Under the event that Δ̂ ≠ Fail, there exists i ∈ {0, . . . , k − 1} such that

N(Δ̂) + ξ_i ≥ m + (18/√(2ρ)) log(2(k + 1)/β) + ξ_thresh.

Define the event B = {|ξ_thresh| ≥ (6/√(2ρ)) log(2(k + 1)/β)} ∪ {max_i |ξ_i| ≥ (12/√(2ρ)) log(2(k + 1)/β)}. By a simple tail bound and a union bound, we have

P(B) ≤ β/2, (16)

where k = ⌈log(2R/r)⌉. We further upper bound the first term in Equation (15) as follows:

P(N(Δ̂) < ⌈γn⌉ and Δ̂ ≠ Fail) ≤ P(N(Δ̂) < ⌈γn⌉ and Δ̂ ≠ Fail and B^c) + P(B).

We claim that the first term in this equation is zero. Recall that m = ⌈γn⌉. Under the event B^c, ξ_thresh − ξ_i ≥ −(18/√(2ρ)) log(2(k + 1)/β) and, as a result,

m + (18/√(2ρ)) log(2(k + 1)/β) + ξ_thresh − ξ_i ≥ m,

so N(Δ̂) ≥ m = ⌈γn⌉. Therefore, the probability of the first term is zero. Also, as shown above, P(B) ≤ β/2. Therefore, P(N(Δ̂) < ⌈γn⌉) ≤ P(B) + P(Δ̂ = Fail). Combining it with P(Δ̂ = Fail) ≤ β/2 concludes the proof of this part.

Part 3: Assume that N(r) < m. Let k = ⌈log(2R/r)⌉ and let γ̃ = γ + (1/n)·(18/√(2ρ)) log(2(k + 1)/β). In Part 2, we showed that, given n > (18/√(2ρ)) log(4/β), we have P(Δ_{⌈γn⌉}(θ*)·(2γ − 1)/(4γ − 1) ≤ Δ̂) ≥ 1 − β. We now focus on the upper bound. From Lemma 2.3, we have

P(Δ̂ ≤ 4Δ_{⌈γ̃n⌉}(θ*)) ≥ P(N(Δ̂/2) ≤ ⌈γ̃n⌉), i.e., P(Δ̂ > 4Δ_{⌈γ̃n⌉}(θ*)) ≤ P(N(Δ̂/2) > ⌈γ̃n⌉). (17)

We can write

P(N(Δ̂/2) > ⌈γ̃n⌉) ≤ P(N(Δ̂/2) > ⌈γ̃n⌉ and Δ̂ ∉ {r, Fail}) + P(Δ̂ = r) + P(Δ̂ = Fail),

where the last step follows from the union bound. For the first term, we have

P(N(Δ̂/2) > ⌈γ̃n⌉ and Δ̂ ∉ {r, Fail}) = P(N(Δ̂/2) > ⌈γ̃n⌉ and Δ̂ ∉ {r, Fail} and N(Δ̂/2) + ξ_î < m + (18/√(2ρ)) log(2(k + 1)/β) + ξ_thresh), (18)

where î = ⌈log(Δ̂/(2r))⌉. The last step follows from the following observation: under the event that Δ̂ ∉ {r, Fail}, during the execution of Algorithm 7, both N(Δ̂) and N(Δ̂/2) are compared to the noisy threshold. Using the tail bounds in Equation (16), under the event B^c,

m + (18/√(2ρ)) log(2(k + 1)/β) + ξ_thresh − ξ_î ≤ m + (18/√(2ρ)) log(2(k + 1)/β) + (18/√(2ρ)) log(2(k + 1)/β) = m + (36/√(2ρ)) log(2(k + 1)/β).

This shows that P(N(Δ̂/2) > ⌈γ̃n⌉ and Δ̂ ∉ {r, Fail}) ≤ P(B) ≤ β/2. In the next step, we bound P(Δ̂ = r). Notice that

P(Δ̂ = r) = P(N(r) + ξ₁ > m + (18/√(2ρ)) log(2(k + 1)/β) + ξ_thresh).

Using a simple tail bound, we have P(ξ₁ − ξ_thresh ≥ (18/√(2ρ)) log(2⌈log(2R/r)⌉/β)) ≤ β/2, which shows that P(Δ̂ = r) ≤ β/2, since we assume that N(r) < m. Therefore, combining all the pieces together, we proved

P(Δ_{⌈γn⌉}(θ*)·(2γ − 1)/(4γ − 1) ≤ Δ̂ ≤ 4Δ_{⌈γ̃n⌉}(θ*)) ≥ 1 − 5β/2.

Part 4: In Part 2, we showed that, given n > (1/(1 − γ))·(18/√(2ρ)) log(4/β), we have

P(Δ̂ ≥ Δ_{⌈γn⌉}(θ*)·(2γ − 1)/(4γ − 1)) ≥ 1 − β. (19)

Consider the following event: E = {Δ̂ ≤ 4Δ_{⌈γ̃n⌉}(θ*) or Δ̂ = r}. We have

P(E^c) = P(Δ̂ > 4Δ_{⌈γ̃n⌉}(θ*) and Δ̂ ≠ r) ≤ P(Δ̂ > 4Δ_{⌈γ̃n⌉}(θ*) and Δ̂ ∉ {r, Fail}) + P(Δ̂ = Fail) ≤ P(N(Δ̂/2) > ⌈γ̃n⌉ and Δ̂ ∉ {r, Fail}) + P(Δ̂ = Fail).

Here, the last step follows from Equation (17). Notice that in Equation (18) we analyzed the probability of the first term and showed that it is at most β/2. We also have P(Δ̂ = Fail) ≤ β/2 from Part 1. Therefore, P(E^c) ≤ β. Combining it with Equation (19) concludes the proof.
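To connect Algorithm 7 with the quantities above: our reading of the score N(ν), based on the sensitivity computation (1/m)((m − 1) + n) in the proof of Lemma 2.2, is the average of the m largest ball counts N_i(ν) = |X⁽ⁿ⁾ ∩ B_d(x_i, ν)|. A hedged sketch of the radius search of Theorem 2.4, reusing above_threshold from the previous listing (for simplicity we pass the unshifted threshold m; the theorem shifts it by an O(log(k/β)/√ρ) term):

```python
import numpy as np

def ball_score(X, nu, m):
    """Our reading of N(nu): the average of the m largest counts
    N_i(nu) = |{j : ||x_i - x_j|| <= nu}|; sensitivity at most 1 + n/m."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    counts = np.sort((dists <= nu).sum(axis=1))[::-1]
    return counts[:m].mean()

def private_radius(X, r, R, rho, gamma, rng=None):
    """Theorem 2.4's search: feed the doubling radii r, 2r, ..., 2^k r
    into Above Threshold with threshold ceil(gamma * n)."""
    n = len(X)
    m = int(np.ceil(gamma * n))
    k = int(np.ceil(np.log2(2.0 * R / r))) + 1
    queries = [ball_score(X, (2.0 ** i) * r, m) for i in range(k)]
    i = above_threshold(queries, rho, m, rng=rng)
    return None if i is None else (2.0 ** i) * r  # None = Fail
```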
Proof of Lemma 2.6. Let I = {i ∈ [n]: ‖θ₀ − x_i‖ ≤ Δ_{⌈γn⌉}(θ₀)}. For every i ∈ I, we have ‖x_i − θ₀‖ ≤ Δ_{⌈γn⌉}(θ₀). Using the triangle inequality, for every i ∈ I, we can write

‖θ₁ − x_i‖ ≥ ‖θ₁ − θ₀‖ − ‖θ₀ − x_i‖ ≥ ‖θ₁ − θ₀‖ − (2Δ_{⌈γn⌉}(θ₀) − ‖θ₀ − x_i‖).

The last equation is equivalent to

‖θ₁ − x_i‖ − ‖θ₀ − x_i‖ ≥ ‖θ₁ − θ₀‖ − 2Δ_{⌈γn⌉}(θ₀). (20)

Then, for every i ∉ I, by an application of the triangle inequality,

‖θ₁ − x_i‖ + ‖θ₁ − θ₀‖ ≥ ‖θ₀ − x_i‖, i.e., ‖θ₁ − x_i‖ − ‖θ₀ − x_i‖ ≥ −‖θ₁ − θ₀‖. (21)

Then, by adding both sides of Equation (20) and Equation (21), we have

F(θ₁; X⁽ⁿ⁾) − F(θ₀; X⁽ⁿ⁾) ≥ |I|‖θ₁ − θ₀‖ − (n − |I|)‖θ₁ − θ₀‖ − 2|I|Δ_{⌈γn⌉}(θ₀).

This equation can be represented as

‖θ₁ − θ₀‖ ≤ (F(θ₁; X⁽ⁿ⁾) − F(θ₀; X⁽ⁿ⁾) + 2|I|Δ_{⌈γn⌉}(θ₀))/(2|I| − n) ≤ (ζn + 2|I|Δ_{⌈γn⌉}(θ₀))/(2|I| − n). (22)

Let γ′n = |I|. We know that γ′ ≥ γ. Using this representation, we can write

‖θ₁ − θ₀‖ ≤ (ζ + 2γ′Δ_{⌈γn⌉}(θ₀))/(2γ′ − 1).

For fixed a, b > 0, define h(x) ≜ (a + 2xb)/(2x − 1). For x > 1/2, dh(x)/dx = −2(a + b)/(2x − 1)². This shows that h(x) is decreasing for x > 1/2. Therefore, using this observation,

(ζ + 2γ′Δ_{⌈γn⌉}(θ₀))/(2γ′ − 1) ≤ (ζ + 2γΔ_{⌈γn⌉}(θ₀))/(2γ − 1),

as was to be shown.

Proof of Theorem 2.7. The privacy proof is straightforward. Algorithm 2 uses the data in Line 3 and Line 11. Based on the privacy budget allocation and the composition properties of zCDP, we can show that the output satisfies ρ-zCDP.

For the claim regarding utility, in the first step, consider the recursion in Line 12 of Algorithm 2, i.e., rad_{t+1} = (1/2)rad_t + 12Δ̂, initialized at rad₀ = R. It can easily be shown that rad_m = (1/2^m) rad₀ + 12Δ̂ Σ_{i=0}^{m−1} (1/2)^i for m ≥ 1. In particular, let k_wu = ⌈(1/log 2) log(R/Δ̂)⌉; then we obtain that rad_{k_wu} ≤ 25Δ̂. Let γ = 3/4 and

γ̃ = 3/4 + (18/(n√(2ρ))) log(2(⌈log(2R/r)⌉ + 1)/β) ≤ 0.75 + 0.05 = 0.8,

where the last step follows because n ≥ Ω((1/√ρ) log((⌈log(R/r)⌉ + 1)/β)). Then, define the following event:

G1 = {Δ_{0.75n}(θ*) ≤ 4Δ̂ and (Δ̂ ≤ 4Δ_{0.8n}(θ*) or Δ̂ = r)}.

In the next step, we analyze the probability that θ* ∈ Θ_loc. We claim that P(θ* ∈ Θ_loc | G1) ≥ (1 − β/(2k_wu))^{k_wu}. We prove this by induction. In particular, we claim that for every m ∈ {0, . . . , k_wu}, we have P(θ* ∈ Θ_m | G1) ≥ (1 − β/(2k_wu))^m. Note that Θ_loc = Θ_{k_wu}. For the base case, by the assumption that the datapoints are in B_d(R), we have P(θ* ∈ Θ₀ | G1) = P(θ* ∈ Θ₀) = 1, since Θ₀ is trivially independent of every random variable. Then, we show the claim for m ∈ {1, . . . , k_wu}, assuming the claim holds for m − 1. We can write

P(θ* ∈ Θ_m | G1) ≥ P(‖θ* − θ_m‖ ≤ rad_m | θ* ∈ Θ_{m−1} and G1) · P(θ* ∈ Θ_{m−1} | G1). (23)

We claim that

P(‖θ* − θ_m‖ ≤ rad_m | θ* ∈ Θ_{m−1} and G1) ≥ P(2(F(θ_m; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾))/n ≤ rad_{m−1}/2 | θ* ∈ Θ_{m−1} and G1).

To show this, let us instantiate Lemma 2.6 with θ₀ = θ* and γ = 3/4 to obtain that, for every θ ∈ R^d,

‖θ − θ*‖ ≤ 2(F(θ; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾))/n + 3Δ_{⌈γn⌉}(θ*).

Notice that, conditioned on G1, we have 3Δ_{⌈γn⌉}(θ*) ≤ 12Δ̂. This shows that 2(F(θ_m; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾))/n ≤ rad_{m−1}/2 implies ‖θ* − θ_m‖ ≤ rad_m conditioned on G1, by the definition of rad_m in Line 12. In the next step, we invoke Lemma A.1. Conditioned on θ* ∈ Θ_{m−1} and G1, with probability at least 1 − β/(2k_wu), we have

F(θ_m; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ n · 2rad_{m−1} · O(√(d k_wu log(4k_wu/β)/(ρn²)) + 1/√T_wu).

Notice that k_wu ≤ (1/log 2) log(R/r) almost surely. By setting T_wu = 128 and using the bound on the sample size, we have F(θ_m; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ n · rad_{m−1}/4. Also, notice that the randomness in DPGD is independent of the history. Therefore,

P(2(F(θ_m; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾))/n ≤ rad_{m−1}/2 | θ* ∈ Θ_{m−1} and G1) ≥ 1 − β/(2k_wu). (24)

Therefore, combining Equations (23) and (24), we obtain P(θ* ∈ Θ_m | G1) ≥ (1 − β/(2k_wu))^m, as was to be shown. From Theorem 2.4, given n ≥ Ω((1/√ρ) log((⌈log(R/r)⌉ + 1)/β)), we have

P(θ* ∈ Θ_loc and G1) = P(θ* ∈ Θ_loc | G1) · P(G1) ≥ (1 − β/(2k_wu))^{k_wu} (1 − β) ≥ 1 − 2β. (25)

This proves the first claim. Regarding the second claim, define the following event: G2 = {(1/4)Δ_{0.75n}(θ*) ≤ Δ̂ ≤ 4Δ_{0.8n}(θ*)}. Notice that in the proof of P(θ* ∈ Θ_loc) we only used the fact that, with high probability, Δ_{⌈γn⌉}(θ*) ≤ 4Δ̂. Since G2 ⊆ G1, we can write

P(θ* ∈ Θ_loc and G2) = P(θ* ∈ Θ_loc | G2) · P(G2) ≥ (1 − β/(2k_wu))^{k_wu} (1 − 5β/4) ≥ 1 − 2β,

where the last step follows from Part 3 of Theorem 2.4, which states that P(G2) ≥ 1 − 5β/4.
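The warm-up recursion rad_{t+1} = rad_t/2 + 12Δ̂ contracts from R down to the 25Δ̂ scale in logarithmically many steps; a small illustrative check (our own code):

```python
def warmup_radii(R, delta_hat, k):
    """rad_{t+1} = rad_t / 2 + 12 * delta_hat with rad_0 = R (Line 12 of
    Algorithm 2). Closed form: rad_m = R / 2**m + 12 * delta_hat *
    sum_{i<m} (1/2)**i < R / 2**m + 24 * delta_hat."""
    rads = [float(R)]
    for _ in range(k):
        rads.append(rads[-1] / 2.0 + 12.0 * delta_hat)
    return rads

# Example: R = 1e6, delta_hat = 1.0. After ceil(log2(R / delta_hat)) = 20
# steps the radius drops below 25 * delta_hat, matching rad_{k_wu} <= 25 * delta_hat.
```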
E Proof of Section 3

Proof of Theorem 3.1. For the privacy proof, notice that Algorithm 3 uses the training set in Line 2 and Line 5. By the privacy budget allocation and the composition properties of zCDP in [BS16], it is immediate to see that the output satisfies ρ-zCDP. Next, we prove the utility properties. Define the following event:

G1 = {θ* ∈ Θ_loc and Δ_{0.75n}(θ*) ≤ 4Δ̂ and (Δ̂ ≤ 4Δ_{0.8n}(θ*) or Δ̂ = r)}.

Also, by the non-negativity of ‖·‖₂, we have

F(θ*; X⁽ⁿ⁾) = Σ_{i=1}^n ‖θ* − x_i‖ ≥ 0.2n · Δ_{0.8n}(θ*). (26)

Under the event that θ* ∈ Θ_loc, we can invoke Lemma A.1 over the feasible set Θ_loc, whose diameter is O(Δ̂), to write

F(θ̂; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ n · Δ̂ · O(√(d log(1/β)/(ρn²))), (27)

where this holds because the internal randomness of DPGD is independent of the randomness in the localization step. By the definition of the event G1, either Δ̂ = r or Δ̂ ≤ 4Δ_{0.8n}(θ*). Note that if Δ̂ ≤ 4Δ_{0.8n}(θ*), we can use Equation (26) to turn Equation (27) into a multiplicative guarantee. Therefore, conditioned on the event G1, we have

P(F(θ̂; X⁽ⁿ⁾) ≤ (1 + O(√(d log(1/β)/(ρn²)))) · min_{θ∈R^d} F(θ; X⁽ⁿ⁾) + r · O(√(d log(1/β)/ρ)) | G1) ≥ 1 − β/2.

The first statement then follows because, from Theorem 2.7, we have P(G1) ≥ 1 − β. For the second statement, under the condition that max_{i∈[n]} |X⁽ⁿ⁾ ∩ B_d(x_i, r)| < 3n/4, we can define the following high-probability event:

G2 = {θ* ∈ Θ_loc and Δ_{0.75n}(θ*) ≤ 4Δ̂ and Δ̂ ≤ 4Δ_{0.8n}(θ*)}.

The argument then proceeds in the same way as the argument for the first claim.

Proof of Lemma 3.2. We can write π(·; X⁽ⁿ⁾) as

π(i; X⁽ⁿ⁾) = exp(−(ε/2)(F(θ_i; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾))) / Σ_{j∈[k]} exp(−(ε/2)(F(θ_j; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾))), i ∈ [k]. (28)

This follows because F(θ*; X⁽ⁿ⁾) is a constant independent of i. Then, the first claim follows from the standard utility analysis of the exponential mechanism in [MT07]. In the next step, we provide the proof of the second claim. To this end, because of Equation (28), we analyze the sensitivity of F(θ_i; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) for every i ∈ [k]. Note that θ* is a data-dependent quantity. Let X̃⁽ⁿ⁾ be a dataset that differs in one sample from X⁽ⁿ⁾. Also, let θ̃* = GM(X̃⁽ⁿ⁾) and assume, without loss of generality, that X⁽ⁿ⁾ = (x₁, . . . , x_n) and X̃⁽ⁿ⁾ = (x₁, . . . , x_{n−1}, x̃_n). For a fixed θ ∈ {θ₁, . . . , θ_k}, we can write

[F(θ; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾)] − [F(θ; X̃⁽ⁿ⁾) − F(θ̃*; X̃⁽ⁿ⁾)]
 = ‖θ − x_n‖ − ‖θ − x̃_n‖ + ‖θ̃* − x̃_n‖ − ‖θ* − x_n‖ + Σ_{i=1}^{n−1} (‖θ̃* − x_i‖ − ‖θ* − x_i‖)
 ≤ 2‖θ − θ̃*‖ + ‖θ* − θ̃*‖ + Σ_{i=1}^{n−1} (‖θ̃* − x_i‖ − ‖θ* − x_i‖), (29)

where the last step follows from the triangle inequality. Note that θ̃* is the geometric median of X̃⁽ⁿ⁾. Therefore, by the first-order optimality condition,

Σ_{i=1}^{n−1} (θ̃* − x_i)/‖θ̃* − x_i‖ = −(θ̃* − x̃_n)/‖θ̃* − x̃_n‖. (30)

Using the first-order convexity condition applied to the function h(θ) = ‖θ − x‖ for a fixed x, we can write

Σ_{i=1}^{n−1} (‖θ* − x_i‖ − ‖θ̃* − x_i‖) ≥ Σ_{i=1}^{n−1} ⟨(θ̃* − x_i)/‖θ̃* − x_i‖, θ* − θ̃*⟩ = ⟨−(θ̃* − x̃_n)/‖θ̃* − x̃_n‖, θ* − θ̃*⟩ ≥ −‖θ* − θ̃*‖, (31)

where the second step follows from Equation (30) and the last step follows because ‖∇(‖θ − x̃_n‖)‖ ≤ 1. Therefore, using Equation (29) and Equation (31), we have

[F(θ; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾)] − [F(θ; X̃⁽ⁿ⁾) − F(θ̃*; X̃⁽ⁿ⁾)] ≤ 2‖θ − θ̃*‖ + 2‖θ* − θ̃*‖.

In the next step of the proof, we invoke Lemma C.4 to upper bound the sensitivity as follows:

2‖θ − θ̃*‖ + 2‖θ* − θ̃*‖ ≤ 2diam(Θ) + 3Δ_{3n/4}(θ*),

where the last step follows because ‖θ − θ̃*‖ ≤ diam(Θ) by the assumption. Notice that the sensitivity analysis in the reverse direction is identical. Therefore, the second claim follows from the standard privacy analysis of the exponential mechanism.
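A hedged sketch of the selection step analyzed in Lemma 3.2: an exponential mechanism over candidate iterates, with the loss as a negative score and the sensitivity bound 2diam(Θ) + 3Δ_{3n/4}(θ*) supplied by the caller (our own code; names are ours):

```python
import numpy as np

def exponential_mechanism(losses, epsilon, sensitivity, rng=None):
    """Pick index i with probability proportional to
    exp(-epsilon * loss_i / (2 * sensitivity)). Subtracting the minimum loss
    is a constant shift (cf. Equation (28)) that avoids numerical overflow
    without changing the distribution."""
    rng = np.random.default_rng() if rng is None else rng
    losses = np.asarray(losses, dtype=float)
    scores = -(epsilon / (2.0 * sensitivity)) * (losses - losses.min())
    probs = np.exp(scores) / np.exp(scores).sum()
    return rng.choice(len(losses), p=probs)
```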
Proof of Theorem 3.3. The privacy proof of Algorithm 4 is relatively non-standard. Let A₁(X⁽ⁿ⁾) = (Δ̂, {θ_i}_{i∈{0,...,k_ft−1}}); in particular, A₁(X⁽ⁿ⁾) can be viewed as the first part of Algorithm 4, before Line 10. Also, let A₂(X⁽ⁿ⁾; Δ̂, {θ_i}_{i∈{0,...,k_ft−1}}) = t̂ denote the exponential mechanism in Line 10 of Algorithm 4. Using the conversion between zCDP and DP, the privacy budget allocation, and the composition properties of zCDP, we have that A₁(·) satisfies (ε/2, δ/2)-DP. Define the following event:

G = {θ* ∈ Θ₀ and Δ_{0.75n}(θ*) ≤ 4Δ̂}.

Let µ be a measure on R × (R^d)^{k_ft} such that, for every dataset X⁽ⁿ⁾, the distribution of A₁(X⁽ⁿ⁾) is absolutely continuous with respect to µ, and let P₁ denote the corresponding density. Since A₁ satisfies approximate DP, we assume that for every z ∈ R × (R^d)^{k_ft}, we have P₁(z; X⁽ⁿ⁾) ≤ exp(ε/2) P₁(z; X̃⁽ⁿ⁾) + δ/2. To prove the privacy requirement, let S ⊆ R × (R^d)^{k_ft} × {0, . . . , k_ft − 1}. Also, let X̃⁽ⁿ⁾ be a dataset of size n that differs in one sample from X⁽ⁿ⁾. Then, we can write

P_{A₁(X⁽ⁿ⁾), A₂(·;X⁽ⁿ⁾)}((Δ̂, {θ_i}, t̂) ∈ S)
 = ∫ 1[S] P₁(Δ̂, {θ_i} | X⁽ⁿ⁾) π(t̂ | Δ̂, {θ_i}, X⁽ⁿ⁾) dµ
 = ∫ 1[S] P₁(Δ̂, {θ_i} | X⁽ⁿ⁾) π(t̂ | Δ̂, {θ_i}, X⁽ⁿ⁾) 1[(Δ̂, θ₀) ∈ G] dµ + ∫ 1[S] P₁(Δ̂, {θ_i} | X⁽ⁿ⁾) π(t̂ | Δ̂, {θ_i}, X⁽ⁿ⁾) 1[(Δ̂, θ₀) ∈ G^c] dµ
 ≤ ∫ 1[S] P₁(Δ̂, {θ_i} | X⁽ⁿ⁾) π(t̂ | Δ̂, {θ_i}, X⁽ⁿ⁾) 1[(Δ̂, θ₀) ∈ G] dµ + P(G^c).

Notice that under the event (Δ̂, θ₀) ∈ G, we can invoke Lemma 3.2 to reason about the privacy properties of A₂: under G, we can see that 3Δ_{3n/4}(θ*) + 2diam(Θ₀) ≤ 112Δ̂. Therefore, by Lemma 3.2,

π(t̂ | Δ̂, {θ_i}, X⁽ⁿ⁾) ≤ exp(ε/2) π(t̂ | Δ̂, {θ_i}, X̃⁽ⁿ⁾).

Moreover, by Theorem 2.7, we have P(G^c) ≤ δ/2. Therefore,

P((Δ̂, {θ_i}, t̂) ∈ S) ≤ exp(ε/2) ∫ 1[S] P₁(Δ̂, {θ_i} | X⁽ⁿ⁾) π(t̂ | Δ̂, {θ_i}, X̃⁽ⁿ⁾) dµ + δ/2.

Then, we use the fact that A₁(X⁽ⁿ⁾) satisfies (ε/2, δ/2)-DP:

exp(ε/2) ∫ 1[S] P₁(Δ̂, {θ_i} | X⁽ⁿ⁾) π(t̂ | Δ̂, {θ_i}, X̃⁽ⁿ⁾) dµ + δ/2 ≤ exp(ε) ∫ 1[S] P₁(Δ̂, {θ_i} | X̃⁽ⁿ⁾) π(t̂ | Δ̂, {θ_i}, X̃⁽ⁿ⁾) dµ + δ.

It concludes the proof.

Proof of Theorem 3.4. Define the following event:

G1 = {θ* ∈ Θ_loc and Δ_{0.75n}(θ*) ≤ 4Δ̂ and (Δ̂ ≤ 4Δ_{0.8n}(θ*) or Δ̂ = r)}. (32)

By Theorem 2.7 and the assumption on the minimum number of samples, we have P(G1) ≥ 1 − β. By applying the standard tail bound on the norm of a Gaussian random vector (as outlined in Corollary C.3) and a union bound, we have

G_{n,1} ≜ {∀t ∈ {0, . . . , k_ft − 1}: ‖ξ_dir,t‖² ≤ 3d(k_ft/ρ) log(10k_ft/β)}, with P(G_{n,1}) ≥ 1 − β/5. (33)

Also, we can write

P(∃t ∈ {0, . . . , k_ft − 1}: ⟨ξ_dir,t, θ* − θ_t⟩ > 50Δ̂√((k_ft/ρ) log(10k_ft/β)))
 ≤ P(∃t: ⟨ξ_dir,t, θ* − θ_t⟩ > 50Δ̂√((k_ft/ρ) log(10k_ft/β)) | G1) + P(G1^c)
 ≤ P(∃t: ⟨ξ_dir,t, θ* − θ_t⟩ > 50Δ̂√((k_ft/ρ) log(10k_ft/β)) | G1) + β,

where the last step follows because P(G1^c) ≤ β. Conditioned on G1, for all t ∈ {0, . . . , k_ft − 1} we have ‖θ_t − θ*‖ ≤ 50Δ̂, since θ* ∈ Θ₀, θ_t ∈ Θ₀, and the diameter of Θ₀ is 50Δ̂. Also, notice that ξ_dir,t is independent of (θ_t, Θ₀). Using these observations, the standard tail bound on a Gaussian random variable (as outlined in Lemma C.1), and a union bound, the conditional probability above is at most β/5. Therefore, we conclude

G_{n,2} ≜ {∀t ∈ {0, . . . , k_ft − 1}: ⟨ξ_dir,t, θ* − θ_t⟩ ≤ 50Δ̂√((k_ft/ρ) log(10k_ft/β))}, with P(G_{n,2}) ≥ 1 − 2β. (34)

To prove the claim regarding the suboptimality gap, we consider two cases: (1) there exists t ∈ {0, . . . , k_ft − 1} such that θ* ∈ Θ_t and θ* ∉ Θ_{t+1}; (2) θ* ∈ Θ_{k_ft}. Note that these two events are mutually exclusive and their union covers the whole space. In what follows, we show that in both cases there exists t ∈ {0, . . . , k_ft − 1} such that F(θ_t; X⁽ⁿ⁾) has a small excess loss. For the first case, suppose t is such that θ* ∈ Θ_t and θ* ∉ Θ_{t+1}.
Therefore, we can write

θ* ∉ Θ_{t+1} ⟹ ⟨∇F(θ_t; X⁽ⁿ⁾) + ξ_dir,t, θ* − θ_t⟩ ≥ 0 ⟹ ⟨∇F(θ_t; X⁽ⁿ⁾), θ* − θ_t⟩ ≥ −⟨ξ_dir,t, θ* − θ_t⟩. (35)

Notice that, using the first-order convexity condition, we have F(θ_t; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ ⟨∇F(θ_t; X⁽ⁿ⁾), θ_t − θ*⟩. Therefore, by Equation (35), we have

F(θ_t; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ ⟨∇F(θ_t; X⁽ⁿ⁾), θ_t − θ*⟩ ≤ ⟨ξ_dir,t, θ* − θ_t⟩. (36)

Under the events G1 and G_{n,2}, defined in Equations (32) and (34), we have

F(θ_t; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ ⟨ξ_dir,t, θ* − θ_t⟩ ≤ Δ̂ · O(√((k_ft/ρ) log(k_ft/β))). (37)

For the second case, i.e., θ* ∈ Θ_{k_ft}, we have the following geometric fact [Nes98]: there exists t ∈ {0, . . . , k_ft − 1} such that the distance of θ* to the separating hyperplane at time t satisfies

⟨(∇F(θ_t; X⁽ⁿ⁾) + ξ_dir,t)/‖∇F(θ_t; X⁽ⁿ⁾) + ξ_dir,t‖, θ* − θ_t⟩ ≥ −ν, (38)

where ν is a constant such that ν^d ≥ exp(−τk_ft)(25Δ̂)^d. The values of ν and k_ft will be determined later. Using first-order convexity, we can write

F(θ*; X⁽ⁿ⁾) − F(θ_t; X⁽ⁿ⁾) ≥ ⟨∇F(θ_t; X⁽ⁿ⁾), θ* − θ_t⟩ = ⟨∇F(θ_t; X⁽ⁿ⁾) + ξ_dir,t, θ* − θ_t⟩ − ⟨ξ_dir,t, θ* − θ_t⟩
 ≥ −ν‖∇F(θ_t; X⁽ⁿ⁾) + ξ_dir,t‖ − ⟨ξ_dir,t, θ* − θ_t⟩
 ≥ −ν(2‖∇F(θ_t; X⁽ⁿ⁾)‖ + 2‖ξ_dir,t‖) − ⟨ξ_dir,t, θ* − θ_t⟩
 ≥ −ν(2n + 2‖ξ_dir,t‖) − ⟨ξ_dir,t, θ* − θ_t⟩,

where the second step follows from the triangle inequality, ‖a + b‖ ≤ ‖a‖ + ‖b‖ ≤ 2‖a‖ + 2‖b‖ for every a, b ∈ R^d, and the last step follows because ‖∇F(θ; X⁽ⁿ⁾)‖ ≤ n for every θ. Therefore, under the events G1, G_{n,1}, and G_{n,2}, defined in Equations (32) to (34), we have the following bound on the suboptimality gap:

F(θ_t; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ ν(2n + 2√(3d(k_ft/ρ) log(10k_ft/β))) + 50Δ̂√((k_ft/ρ) log(10k_ft/β)). (39)

Recall that ν satisfies ν^d ≥ exp(−τk_ft)(25Δ̂)^d. It can easily be seen that, by setting k_ft = ⌈(d/τ) log(25κ)⌉ with κ ≜ n/ρ, under the events G1, G_{n,1}, and G_{n,2} we can further upper bound Equation (39) as follows:

F(θ_t; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ Δ̂ · O(√((k_ft/ρ) log(k_ft/β))). (40)

Therefore, from Equations (37) and (40), under the event G1 ∩ G_{n,1} ∩ G_{n,2}, in both cases we showed that there exists t such that

F(θ_t; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ Δ_{0.8n}(θ*) · O(√((k_ft/ρ) log(k_ft/β))) or F(θ_t; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ r · O(√((k_ft/ρ) log(k_ft/β))). (41)

By the non-negativity of ‖·‖₂, we have

F(θ*; X⁽ⁿ⁾) = Σ_{i=1}^n ‖θ* − x_i‖ ≥ 0.2n · Δ_{0.8n}(θ*). (42)

Therefore, we conclude that in both cases there exists t such that

F(θ_t; X⁽ⁿ⁾) ≤ (1 + O((1/n)√((d/(τρ)) log(κ) · log((d log κ)/(τβ))))) F(θ*; X⁽ⁿ⁾) or F(θ_t; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ r · O(√((d/(τρ)) log(κ) · log((d log κ)/(τβ)))), (43)

where κ = n/ρ. Let us define OPT ≜ min_{t∈{0,...,k_ft−1}} [F(θ_t; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾)]. In the next step of the proof, we show that the exponential mechanism in Line 10 identifies, with high probability, an iterate whose suboptimality gap is close to OPT. Using the properties of the exponential mechanism in Line 10, as outlined in Lemma 3.2, we have, with probability at least 1 − β/3 over the randomness of the exponential mechanism,

F(θ_t̂; X⁽ⁿ⁾) − F(θ*; X⁽ⁿ⁾) ≤ OPT + (448Δ̂/ε) log(3k_ft/β). (44)

Notice that under the event G1 and using Equation (42), we have

(448Δ̂/ε) log(3k_ft/β) ≤ F(θ*; X⁽ⁿ⁾) · O(log(k_ft/β)/(nε)) or (448Δ̂/ε) log(3k_ft/β) ≤ r · O(log(k_ft/β)/ε). (45)

Moreover, under the event G1 ∩ G_{n,1} ∩ G_{n,2}, we provided an upper bound on OPT in Equation (43). Combining Equation (43), Equation (44), and Equation (45) proves the first claim. For the second statement, under the condition that max_{i∈[n]} |X⁽ⁿ⁾ ∩ B_d(x_i, r)| < 3n/4, we can define the following high-probability event:

G2 = {θ* ∈ Θ_loc and Δ_{0.75n}(θ*) ≤ 4Δ̂ and Δ̂ ≤ 4Δ_{0.8n}(θ*)}.

The argument then proceeds in the same way as the argument for the first claim.
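A hedged sketch of the noisy cut used in the analysis above: each iteration keeps only the halfspace on the near side of the noisy gradient at the current center (our own code; the representation of Θ_t as a list of halfspaces and the noise scale sigma are assumptions, and a production implementation would also pick θ_t as an approximate centroid, which we do not show):

```python
import numpy as np

def noisy_cut(theta_t, X, sigma, halfspaces, rng=None):
    """Append the cut {theta : <g_t, theta - theta_t> <= 0}, where
    g_t = grad F(theta_t; X) + N(0, sigma^2 I_d), as in Theorem 3.4."""
    rng = np.random.default_rng() if rng is None else rng
    diffs = theta_t - X
    norms = np.linalg.norm(diffs, axis=1)
    mask = norms > 0
    grad = (diffs[mask] / norms[mask, None]).sum(axis=0)
    g = grad + rng.normal(0.0, sigma, size=theta_t.shape[0])
    halfspaces.append((g, theta_t.copy()))  # halfspace: <g, theta - theta_t> <= 0
    return halfspaces

def contains(theta, halfspaces):
    """Check membership of theta in the intersection of recorded cuts."""
    return all(np.dot(g, theta - c) <= 0 for g, c in halfspaces)
```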
F Proof of Section 4

Proof of Lemma 4.2. For i ∈ [n − k], we can write

‖θ_k − x_i‖ ≥ ‖θ_k − θ₀‖ − ‖θ₀ − x_i‖ = ‖θ_k − θ₀‖ − 2‖θ₀ − x_i‖ + ‖θ₀ − x_i‖.

Also, for every j ∈ [k], we have ‖θ_k − y_j‖ ≥ ‖θ₀ − y_j‖ − ‖θ₀ − θ_k‖. Summing both sides of these inequalities, we obtain

Σ_{i=1}^{n−k} ‖θ_k − x_i‖ + Σ_{j=1}^k ‖θ_k − y_j‖ ≥ (n − 2k)‖θ₀ − θ_k‖ − 2Σ_{i=1}^{n−k} ‖θ₀ − x_i‖ + [Σ_{i=1}^{n−k} ‖θ₀ − x_i‖ + Σ_{j=1}^k ‖θ₀ − y_j‖].

Since Σ_{i=1}^{n−k} ‖θ_k − x_i‖ + Σ_{j=1}^k ‖θ_k − y_j‖ − [Σ_{i=1}^{n−k} ‖θ₀ − x_i‖ + Σ_{j=1}^k ‖θ₀ − y_j‖] ≤ 0 by the assumption that θ_k = GM(x₁, . . . , x_{n−k}, y₁, . . . , y_k), we obtain

(n − 2k)‖θ₀ − θ_k‖ − 2Σ_{i=1}^{n−k} ‖θ₀ − x_i‖ ≤ 0
⟹ ‖θ₀ − θ_k‖ ≤ (2/(n − 2k)) Σ_{i=1}^{n−k} ‖θ₀ − x_i‖ ≤ (2/(n − 2k)) (Σ_{i=1}^{n−k} ‖θ₀ − x_i‖ + Σ_{i=n−k+1}^n ‖θ₀ − x_i‖) = (2/(n − 2k)) F(θ₀; (x₁, . . . , x_n)).

Proof of Theorem 4.1. We claim that for every pair of neighbouring datasets X ∈ (B_d(R))ⁿ and X̃ ∈ (B_d(R))ⁿ and for every y ∈ B_d(R), we have |len_r(X, y) − len_r(X̃, y)| ≤ 1. This follows from the fact that, for every dataset X̄, d_H(X, X̄) ≤ d_H(X̃, X̄) + 1. The proof of privacy then follows from the privacy proof of the exponential mechanism [MT07]. Next, we present the utility proof. Let k* ∈ N be a constant to be determined later. Define the following two sets:

A₁ = {y ∈ B_d(R): len_r(X, y) ≥ k*}, A₂ = {y ∈ B_d(R): len_r(X, y) = 0}.

P(θ̂ ∈ A₁)/P(θ̂ ∈ A₂) = ∫_{A₁} exp(−(ε/2) len_r(X, y)) dy / ∫_{A₂} exp(−(ε/2) len_r(X, y)) dy ≤ exp(−(ε/2)k*) V₁R^d / (V₁r^d).

Here we used the following simple facts: ∫_{y∈B_d(R)} dy = V₁R^d, where V₁ is the volume of the ball of radius one in R^d; and for A₂, notice that for all y ∈ B_d(GM(X), r), we have len_r(X, y) = 0, so ∫_{y∈A₂} dy ≥ V₁r^d. Putting these two pieces together,

P(θ̂ ∈ A₁) ≤ exp(−(ε/2)k* + d log(R/r)),

where the last step follows from the fact that P(θ̂ ∈ A₂) ≤ 1. Therefore, we obtain that for every β ∈ (0, 1), with probability at least 1 − β, we have len_r(X, θ̂) ≤ k* ≜ ⌈(2/ε)(d log(R/r) + log(1/β))⌉, where θ̂ ∼ π. Under the above event, θ̂ ∈ B_d(R) satisfies len_r(X, θ̂) ≤ k*. This is equivalent to the following: there exist z ∈ B_d(θ̂, r) and X̃ ∈ (R^d)ⁿ such that z = GM(X̃) and d_H(X, X̃) ≤ k*. Using this observation, we can write

‖θ̂ − GM(X)‖ ≤ ‖z − GM(X)‖ + ‖θ̂ − z‖ ≤ ‖z − GM(X)‖ + r = ‖GM(X̃) − GM(X)‖ + r. (49)

Suboptimality gap: Let θ* = GM(X). Then,

F(θ̂; X) − F(θ*; X) = Σ_{i=1}^n (‖θ̂ − x_i‖ − ‖θ* − x_i‖) ≤ Σ_{i=1}^n (‖z − x_i‖ + r − ‖θ* − x_i‖) = nr + Σ_{i=1}^n (‖GM(X̃) − x_i‖ − ‖θ* − x_i‖). (50)

Define I ⊆ [n] to be the indices of the points at which X and X̃ differ. We know that |I| ≤ k*. Then, we can write

Σ_{i=1}^n (‖GM(X̃) − x_i‖ − ‖θ* − x_i‖) = Σ_{i∈I} (‖GM(X̃) − x_i‖ − ‖θ* − x_i‖) + Σ_{i∉I} (‖GM(X̃) − x_i‖ − ‖θ* − x_i‖).

By the triangle inequality, we can write

Σ_{i∈I} (‖GM(X̃) − x_i‖ − ‖θ* − x_i‖) ≤ |I| ‖θ* − GM(X̃)‖ ≤ k* ‖θ* − GM(X̃)‖. (51)

For i ∈ I, let (X̃)_i = x̃_i, where (X̃)_i denotes the i-th datapoint in X̃. Since GM(X̃) is a geometric median of X̃, by the first-order optimality condition ∇_θ F(GM(X̃); X̃) = 0,

Σ_{i∈[n]∖I} (GM(X̃) − x_i)/‖GM(X̃) − x_i‖ = −Σ_{i∈I} (GM(X̃) − x̃_i)/‖GM(X̃) − x̃_i‖. (52)

To control the second term, notice that ‖θ − x_i‖ is a convex function of θ for every x_i. By the first-order convexity condition, for every θ₁ and θ₂, we have ‖θ₁ − x_i‖ − ‖θ₂ − x_i‖ ≤ ⟨∇(‖θ₁ − x_i‖), θ₁ − θ₂⟩. Therefore, we can write

Σ_{i∉I} (‖GM(X̃) − x_i‖ − ‖θ* − x_i‖) ≤ Σ_{i∉I} ⟨(GM(X̃) − x_i)/‖GM(X̃) − x_i‖, GM(X̃) − θ*⟩. (53)

Then, by Equation (52),

Σ_{i∉I} ⟨(GM(X̃) − x_i)/‖GM(X̃) − x_i‖, GM(X̃) − θ*⟩ = −Σ_{i∈I} ⟨(GM(X̃) − x̃_i)/‖GM(X̃) − x̃_i‖, GM(X̃) − θ*⟩.

Finally, notice that by Equation (3), for every x̃_i, ‖∇(‖GM(X̃) − x̃_i‖)‖ ≤ 1. Therefore, by the Cauchy–Schwarz inequality,

−Σ_{i∈I} ⟨(GM(X̃) − x̃_i)/‖GM(X̃) − x̃_i‖, GM(X̃) − θ*⟩ ≤ |I| ‖GM(X̃) − θ*‖ ≤ k* ‖GM(X̃) − θ*‖. (54)

By Equations (51) and (54), we obtain F(θ̂; X) − F(θ*; X) ≤ nr + 2k* ‖GM(X̃) − θ*‖. Then, we invoke Lemma 4.2, which states that ‖GM(X̃) − GM(X)‖ ≤ (2/(n − 2k*)) F(GM(X); X). Putting all the pieces together,

F(θ̂; X) − F(θ*; X) ≤ nr + 2k* ‖GM(X̃) − θ*‖ ≤ nr + (4k*/(n − 2k*)) F(θ*; X), (55)

as was to be shown.

Figure 1: Graphical intuition behind Equation (57).

Distance to θ*: From Equation (49), we know that ‖θ̂ − GM(X)‖ ≤ ‖GM(X̃) − GM(X)‖ + r, where X̃ is a dataset of size n such that d_H(X, X̃) ≤ k*. The proof is based on characterizing the worst-case distance between the geometric medians of two datasets that differ in at most k* points. For the dataset X⁽ⁿ⁾, recall that θ* = GM(X⁽ⁿ⁾). Also, recall the definition of Δ_{⌈γn⌉}(θ*) from Definition 1.1. Let θ ∈ R^d be such that ‖θ − θ*‖ > Δ_{⌈γn⌉}(θ*). Define m = |X̃ ∩ B_d(θ*, Δ_{⌈γn⌉}(θ*))|.
By the variational representation of ‖·‖₂, we can write

⟨∇F(θ; X̃), θ − θ*⟩ ≥ Σ_{x ∈ X̃ ∩ B_d(θ*, Δ_{⌈γn⌉}(θ*))} ⟨(θ − x)/‖θ − x‖, θ − θ*⟩ − (n − m)‖θ − θ*‖, (56)

where the last step follows from the Cauchy–Schwarz inequality applied to the points of X̃ outside B_d(θ*, Δ_{⌈γn⌉}(θ*)). Then, we claim that for every x ∈ X̃ ∩ B_d(θ*, Δ_{⌈γn⌉}(θ*)), we have

⟨(θ − x)/‖θ − x‖, (θ − θ*)/‖θ − θ*‖⟩ ≥ √(1 − Δ_{⌈γn⌉}(θ*)²/‖θ − θ*‖²). (57)

To gain the intuition behind it, see Figure 1. Therefore, from Equation (56),

⟨∇F(θ; X̃), θ − θ*⟩ ≥ (m√(1 − Δ_{⌈γn⌉}(θ*)²/‖θ − θ*‖²) − (n − m)) ‖θ − θ*‖.

We are interested in characterizing the conditions under which ⟨∇F(θ; X̃), θ − θ*⟩ > 0. A sufficient condition, given n < 2m, is

m²(1 − Δ_{⌈γn⌉}(θ*)²/‖θ − θ*‖²) − (n − m)² > 0 ⟺ Δ_{⌈γn⌉}(θ*)/√(1 − (n/m − 1)²) < ‖θ − θ*‖.

This shows that the distance of GM(X̃) from θ* has to satisfy

‖GM(X̃) − θ*‖ ≤ Δ_{⌈γn⌉}(θ*) · (m/n)/√(2(m/n) − 1).

The function h(x) = x/√(2x − 1) is decreasing in the range x ∈ (1/2, 1]. Also, notice that m = |X̃ ∩ B_d(θ*, Δ_{⌈γn⌉}(θ*))| ≥ ⌈γn⌉ − k*. Therefore,

‖GM(X̃) − θ*‖ ≤ Δ_{⌈γn⌉}(θ*) · (γ − k*/n)/√(2(γ − k*/n) − 1),

as was to be shown.

G Proof of Section 5

Proof of Theorem 5.1. The proof is based on the reduction provided in Lemma G.1 and the lower bound on the sample complexity of mean estimation for a Gaussian distribution with known covariance matrix in [KLSU19, Thm. 6.5].

Lemma G.1. Let ε ≤ 49·10⁻⁵, α ≤ 49·10⁻⁵, δ ≤ 10⁻⁴, and d ≥ 22500 be constants. Let A_n be an arbitrary (ε, δ)-DP algorithm such that for every dataset X⁽ⁿ⁾, its output satisfies

E_{θ̂∼A_n(X⁽ⁿ⁾)}[F(θ̂; X⁽ⁿ⁾)] ≤ (1 + α) min_{θ∈B_d(√d)} F(θ; X⁽ⁿ⁾).

Let µ ∈ B_d(√d) and let X⁽ⁿ⁾ = (X₁, . . . , X_n) ∼ N(µ, I_d)^⊗n. Let θ̂ ∼ A_n(X⁽ⁿ⁾). Then, with probability at least 2/3 over X⁽ⁿ⁾ and the internal randomness of A_n, we have ‖θ̂ − µ‖ ≤ 0.2√d.

Proof. The proof consists of several steps.

Step 1: Bound on the Empirical Error. Let A_n be an arbitrary (ε, δ)-DP algorithm such that for every dataset X⁽ⁿ⁾, its output satisfies F(θ̂; X⁽ⁿ⁾) ≤ (1 + α) min_{θ∈B_d(R√d)} F(θ; X⁽ⁿ⁾) in expectation. Let µ ∈ B_d(R√d) and X⁽ⁿ⁾ = (X₁, . . . , X_n) ∼ N(µ, I_d)^⊗n. The utility guarantee of the algorithm implies that

E[F(θ̂; X⁽ⁿ⁾)] ≤ (1 + α) E[min_{θ∈B_d(R√d)} F(θ; X⁽ⁿ⁾)] ≤ (1 + α) E[F(µ; X⁽ⁿ⁾)].

To further upper bound the last step, we can use Jensen's inequality to write

(1/n) E[F(µ; X⁽ⁿ⁾)] = E[‖X₁ − µ‖] ≤ √(E[‖X₁ − µ‖²]) = √d.

Therefore, in expectation over X⁽ⁿ⁾ ∼ N(µ, I_d)^⊗n and the internal randomness of A_n, we have

E_{X⁽ⁿ⁾, θ̂∼A_n(X⁽ⁿ⁾)}[F(θ̂; X⁽ⁿ⁾) − F(µ; X⁽ⁿ⁾)] ≤ nα√d. (58)

Step 2: Relating Empirical Error to Population Error. Let (X₀, X₁, . . . , X_n) ∼ N(µ, I_d)^⊗(n+1). With an abuse of notation, let θ = A_n((X₁, . . . , X_n)) and, for every i ∈ [n], let θ⁽ⁱ⁾ = A_n((X₁, . . . , X_{i−1}, X₀, X_{i+1}, . . . , X_n)). Let T̄ be a constant that will be determined later. We can write

E[‖θ⁽ⁱ⁾ − X_i‖] = ∫₀^∞ P(‖θ⁽ⁱ⁾ − X_i‖ ≥ t) dt = ∫₀^{T̄} P(‖θ⁽ⁱ⁾ − X_i‖ ≥ t) dt + ∫_{T̄}^∞ P(‖θ⁽ⁱ⁾ − X_i‖ ≥ t) dt. (59)

Consider the first term in Equation (59). Since A_n satisfies (ε, δ)-DP,

P(‖θ⁽ⁱ⁾ − X_i‖ ≥ t) = E[P(‖θ⁽ⁱ⁾ − X_i‖ ≥ t | X₀, . . . , X_n)] ≤ E[exp(ε) P(‖θ − X_i‖ ≥ t | X₀, . . . , X_n) + δ] = exp(ε) P(‖θ − X_i‖ ≥ t) + δ.

Therefore, the first term can be upper bounded as

∫₀^{T̄} P(‖θ⁽ⁱ⁾ − X_i‖ ≥ t) dt ≤ exp(ε) ∫₀^{T̄} P(‖θ − X_i‖ ≥ t) dt + T̄δ ≤ exp(ε) E[‖θ − X_i‖] + T̄δ.

In the next step, we upper bound the second term in Equation (59). Notice that

{(θ⁽ⁱ⁾, X_i): ‖θ⁽ⁱ⁾ − µ − (X_i − µ)‖ ≥ t} ⊆ {(θ⁽ⁱ⁾, X_i): ‖X_i − µ‖ ≥ t − ‖θ⁽ⁱ⁾ − µ‖} ⊆ {X_i: ‖X_i − µ‖ ≥ t − 2R√d},

where the first step follows from the triangle inequality and the last step follows because µ and θ⁽ⁱ⁾ are in B_d(R√d). Using this, we can write

∫_{T̄}^∞ P(‖θ⁽ⁱ⁾ − X_i‖ ≥ t) dt ≤ ∫_{T̄}^∞ P(‖X_i − µ‖ ≥ t − 2R√d) dt = ∫ P(‖X_i − µ‖ ≥ u + √d) du,

where the last step follows from the change of variable u = t − (2R + 1)√d. In the next step, we use concentration bounds for the norm of a multivariate Gaussian random variable. Using Lemma C.2, we can write

P(‖X_i − µ‖ ≥ u + √d) = P(‖X_i − µ‖² ≥ u² + d + 2u√d) ≤ exp(−u²/2).

Let T̄ = 2(2R + 1)√d. Then, using standard bounds on the complementary error function [Ksc17], we can write

∫_{T̄}^∞ P(‖θ⁽ⁱ⁾ − X_i‖ ≥ t) dt ≤ ∫_{(2R+1)√d}^∞ exp(−u²/2) du ≤ (1/((4R + 2)√d)) exp(−(2R + 1)²d/2). (64)
In the last step, we claim that E[‖θ − X₀‖] = E[‖θ⁽ⁱ⁾ − X_i‖] for every i ∈ [n]. This is because θ⁽ⁱ⁾ =_d θ, X_i =_d X₀, θ⁽ⁱ⁾ ⊥ X_i, and θ ⊥ X₀. Ergo, combining the pieces and summing over i ∈ [n], we obtain

E[‖θ − X₀‖] ≤ exp(ε) (1/n) Σ_{i=1}^n E[‖θ − X_i‖] + (4R + 2)√d δ + (1/((4R + 2)√d)) exp(−(2R + 1)²d/2). (65)

This bound implies that

E[‖θ − X₀‖] − E[‖µ − X₀‖] ≤ exp(ε) (1/n) Σ_{i=1}^n (E[‖θ − X_i‖] − E[‖µ − X_i‖]) + (exp(ε) − 1) E[‖µ − X₀‖] + (4R + 2)√d δ + (1/((4R + 2)√d)) exp(−(2R + 1)²d/2).

Using Step 1 for the first term and E[‖µ − X₀‖] ≤ √d, this equation can be rephrased as follows:

E[‖θ − X₀‖] − E[‖µ − X₀‖] ≤ β̄√d, where β̄ = exp(ε)α + (exp(ε) − 1) + (4R + 2)δ + (1/((4R + 2)d)) exp(−(2R + 1)²d/2). (66)

Step 3: Relating Population Error to Distance. In Step 2, we showed that, in expectation over (X₀, . . . , X_n) ∼ N(µ, I_d)^⊗(n+1) and θ̂ ∼ A_n(X⁽ⁿ⁾) with X⁽ⁿ⁾ = (X₁, . . . , X_n),

E[‖θ̂ − X₀‖] − E[‖µ − X₀‖] ≤ β̄√d. (67)

For notational convenience, let h: R^d → R be h(θ) ≜ E_{X∼N(µ,I_d)}[‖θ − X‖]. Equation (67) can be written as

E_{X⁽ⁿ⁾, θ̂∼A_n(X⁽ⁿ⁾)}[h(θ̂) − h(µ)] ≤ β̄√d.

Since µ is the minimizer of h(θ), for every θ ∈ R^d we have h(θ) ≥ h(µ). Therefore, h(θ̂) − h(µ) is a non-negative random variable, and we can invoke Markov's inequality to write

P_{X⁽ⁿ⁾, θ̂∼A_n(X⁽ⁿ⁾)}(h(θ̂) − h(µ) ≥ 3β̄√d) ≤ 1/3. (68)

In the next step, we provide a deterministic argument: for every θ ∈ R^d such that h(θ) − h(µ) ≤ 3β̄√d, we provide an upper bound on ‖θ − µ‖. By subtracting µ, we can write

h(θ) − h(µ) = E_{X∼N(µ,I_d)}[‖θ − X‖ − ‖µ − X‖] = E_{Z∼N(0,I_d)}[‖θ − µ + Z‖ − ‖Z‖]. (69)

Define the following events:

E₁ = {Z: ‖Z‖² ≤ d(1 + 4√(log(2/γ)/d))}, E₂ = {Z: ⟨θ − µ, Z⟩ ≥ −‖θ − µ‖√(2 log(2/γ))}. (70)

Using Corollary C.3 and a simple concentration bound for a Gaussian random variable, we have P(E₁ ∩ E₂) ≥ 1 − γ. Let E = E₁ ∩ E₂. By dropping the positive term, we can write

E[‖θ − µ + Z‖ − ‖Z‖] = E[(‖θ − µ + Z‖ − ‖Z‖) 1[E]] + E[(‖θ − µ + Z‖ − ‖Z‖) 1[E^c]] ≥ E[(‖θ − µ + Z‖ − ‖Z‖) 1[E]] − E[‖Z‖ 1[E^c]]. (71)

Using the Cauchy–Schwarz inequality, E[‖Z‖ 1[E^c]] ≤ √(E[‖Z‖²]) √(P(E^c)) ≤ √(γd). In the next step, we analyze the first term. On the event E,

‖θ − µ + Z‖ − ‖Z‖ = (‖θ − µ‖² + 2⟨θ − µ, Z⟩)/(‖θ − µ + Z‖ + ‖Z‖) ≥ (‖θ − µ‖² − 2√(2 log(2/γ)) ‖θ − µ‖)/(2√(1.1d) + 2R√d). (72)

Let d be large enough such that

1 − 2√(2 log(2/γ)/d) ≥ 0.9 and 1 + 4√(log(2/γ)/d) ≤ 1.1. (73)

Then, combining Equations (68) to (72), we can write

(‖θ − µ‖² − 2√(2 log(2/γ)) ‖θ − µ‖)/(5√d) ≤ 3β̄√d + √(γd),

hence ‖θ − µ‖² ≤ 2√(2 log(2/γ)) ‖θ − µ‖ + 5d(3β̄ + √γ). Simple calculations show that this bound implies

‖θ − µ‖ ≤ √(2 log(2/γ)) + √(2 log(2/γ) + 5d(3β̄ + √γ)).

We would like to set the parameters such that the right-hand side is at most 0.2√d; one can easily see that it suffices to have 3β̄ + √γ ≤ 0.0045 and d large enough. For example, we can pick β̄ = 0.0014 and γ = 9·10⁻⁸. Recall that

β̄ = exp(ε)α + (exp(ε) − 1) + (4R + 2)δ + (1/((4R + 2)d)) exp(−(2R + 1)²d/2). (77)

For instance, by setting ε ≤ 49·10⁻⁵, α ≤ 49·10⁻⁵, δ ≤ (2/3)·10⁻⁴, R = 1, and d ≥ 2, we obtain β̄ ≤ 0.0014. Finally, we need to set d such that Equation (73) holds. We can see that d ≥ 7050 satisfies this condition.

H Details of the Numerical Experiment

Our goal in the experiments is to evaluate the impact of increasing the radius of the initial feasible set, i.e., R, on the performance of our proposed algorithm, and to compare it with DPGD. We also want to show that our method achieves a good excess error without any additional hyperparameter tuning.

Data Generation. Let n denote the number of samples. We assume that 0.9n of the data is distributed as follows: let µ ∈ R^d be a uniformly random vector on S^{d−1}(50). We then sample 0.9n of the data points from N(µ, (0.01)² I_d). The remaining 0.1n of the points are sampled uniformly at random from B_d(100).

Hyperparameters. We set the discretization parameter to r = 0.05 in Algorithm 2 and the failure probability to 5%. Additionally, we repeat each algorithm 10 times and report the mean. For the other hyperparameters, we used exactly the same values as stated in Algorithm 3.
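A sketch of this data-generation recipe (our own code; the seed and names are illustrative):

```python
import numpy as np

def make_dataset(n, d, rng=None):
    """0.9n inliers ~ N(mu, 0.01^2 I_d), mu uniform on the sphere of radius 50;
    0.1n outliers uniform in the ball of radius 100."""
    rng = np.random.default_rng() if rng is None else rng
    mu = rng.normal(size=d)
    mu *= 50.0 / np.linalg.norm(mu)                    # uniform on S^{d-1}(50)
    n_in = int(0.9 * n)
    inliers = mu + 0.01 * rng.normal(size=(n_in, d))
    # Uniform in B_d(100): uniform direction, radius distributed as 100 * U^{1/d}.
    dirs = rng.normal(size=(n - n_in, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = 100.0 * rng.uniform(size=(n - n_in, 1)) ** (1.0 / d)
    return np.vstack([inliers, radii * dirs]), mu
```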
For DPGD, we use the hyperparameters in Lemma A.1, and in particular, we choose T as prescribed there.

Figure 2: Performance for different privacy budgets. Each panel plots the ratio F(θ̂; X⁽ⁿ⁾)/F(θ*; X⁽ⁿ⁾) against the initial radius R (log scale, 10³ to 10¹⁰), with d = 200, n = 3000, and δ = 1/n; the left panel uses ε = 2 and the right panel uses ε = 3.
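For completeness, a hedged end-to-end driver in the spirit of this experiment (our own code; it reuses make_dataset and the dpgd sketch from Appendix A.2, and the hyperparameter values below are placeholders rather than the paper's exact settings):

```python
import numpy as np

# Assumes make_dataset and dpgd from the earlier sketches are in scope.
rng = np.random.default_rng(0)
d, n, rho = 200, 3000, 1.0
X, mu = make_dataset(n, d, rng=rng)

for R in [1e3, 1e5, 1e7]:  # initial feasible-set radii
    theta = dpgd(X, theta1=np.zeros(d), rho=rho, center=np.zeros(d),
                 radius=R, eta=R / np.sqrt(2000), T=2000, rng=rng)
    # We use F(mu) as a stand-in for the optimal loss F(theta*).
    ratio = (np.linalg.norm(X - theta, axis=1).sum()
             / np.linalg.norm(X - mu, axis=1).sum())
    print(f"R={R:.0e}  F(theta)/F(mu) ~ {ratio:.3f}")
```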