# DP-PCA: Statistically Optimal and Differentially Private PCA

Xiyang Liu (Paul Allen School of Computer Science & Engineering, University of Washington, xiyangl@cs.washington.edu), Weihao Kong (Google Research, weihaokong@google.com), Prateek Jain (Google Research, prajain@google.com), Sewoong Oh (Paul Allen School of Computer Science & Engineering, University of Washington, and Google Research, sewoong@cs.washington.edu)

We study the canonical statistical task of computing the principal component from $n$ i.i.d. data in $d$ dimensions under $(\varepsilon,\delta)$-differential privacy. Although extensively studied in the literature, existing solutions fall short on two key aspects: (i) even for Gaussian data, existing private algorithms require the number of samples $n$ to scale super-linearly with $d$, i.e., $n = \Omega(d^{3/2})$, to obtain non-trivial results, while non-private PCA requires only $n = O(d)$; and (ii) existing techniques suffer from a non-vanishing error even when the randomness in each data point is arbitrarily small. We propose DP-PCA, a single-pass algorithm that overcomes both limitations. It is based on a private minibatch gradient ascent method that relies on private mean estimation, which adds the minimal noise required to ensure privacy by adapting to the geometry of a given minibatch of gradients. For sub-Gaussian data, we provide nearly optimal statistical error rates even for $n = O(d)$. Furthermore, we provide a lower bound showing that a sub-Gaussian style assumption is necessary to obtain the optimal error rate.

1 Introduction

Principal Component Analysis (PCA) is a fundamental statistical tool with multiple applications, including dimensionality reduction, data visualization, and noise reduction. Naturally, it is a key part of most standard data analysis and ML pipelines. However, when applied to data collected from numerous individuals, such as the U.S. Census data, the outcome of PCA might reveal highly sensitive personal information. We investigate the design of privacy-preserving PCA algorithms and the privacy/utility trade-offs involved in computing the first principal component, which should serve as the building block of more general rank-$k$ PCA.

Differential privacy (DP) is a widely accepted mathematical notion of privacy introduced in [21], which is a standard in releasing the U.S. Census data [2] and is also deployed in commercial systems [72, 25, 27]. A query to a database is said to be $(\varepsilon,\delta)$-differentially private if a strong adversary who knows all other entries but one cannot infer that one entry from the query output with high confidence. The parameters $\varepsilon$ and $\delta$ restrict this confidence as measured by the Type-I and Type-II errors [49]. Smaller values of $\varepsilon \in [0,\infty)$ and $\delta \in [0,1]$ imply stronger privacy and plausible deniability for the participants.

For non-private PCA with $n$ i.i.d. samples in $d$ dimensions, the popular Oja's algorithm (provided in Algorithm 1) achieves the optimal error of $\sin(\hat v, v_1) = \Theta(\sqrt{d/n})$, where the error is measured by the sine of the angle between the estimate $\hat v$ and the principal component $v_1$ [45]. For differentially private PCA, there is a natural fundamental question: what is the extra cost we pay in the error rate for ensuring $(\varepsilon,\delta)$-DP?
We introduce a novel approach we call DP-PCA (Algorithm 3) and show that it achieves an error bounded by $\sin(\hat v, v_1) = \tilde O(\sqrt{d/n} + d/(\varepsilon n))$ for the sub-Gaussian-like data defined in Assumption 1, a broad class of light-tailed distributions that includes Gaussian data as a special case. The second term characterizes the cost of privacy, and it is tight; we prove a nearly matching information-theoretic lower bound showing that no $(\varepsilon,\delta)$-DP algorithm can achieve a smaller error. This significantly improves upon a long line of existing private algorithms for PCA, e.g., [14, 9, 38, 36, 23]. These existing algorithms are analyzed for fixed, non-stochastic data and achieve sub-optimal error rates of $O(\sqrt{d/n} + d^{3/2}/(\varepsilon n))$ even in the stochastic setting we consider.

A remaining question is whether the sub-Gaussian-like assumption, namely Assumption A.4, is necessary or an artifact of our analysis and our algorithm. It turns out that such an assumption on the lightness of the tail is critical; we prove an algorithm-independent, information-theoretic lower bound (Theorem 5.4) showing that, without such an assumption, the cost of privacy is lower bounded by $\Omega(\sqrt{d/(\varepsilon n)})$. This proves a separation of the error depending on the lightness of the tail.

We start with the formal description of the stochastic setting in Section 2 and present Oja's algorithm for non-private PCA. Our first attempt at making this algorithm private, in Section 3, already achieves near-optimal error if the data is strictly from a Gaussian distribution. However, there are two remaining challenges that we describe in detail in Section 4: (i) the excessive number of iterations of Private Oja's Algorithm (Algorithm 2) prevents using typical values of $\varepsilon$ used in practice, and (ii) for general sub-Gaussian-like distributions, the error does not vanish even when the noise in the data (as measured by a certain fourth moment of a function of the data) vanishes. The first challenge is due to the analysis requiring amplification by shuffling [24], which is restrictive. The second is due to its reliance on gradient norm clipping [1], which does not adapt to the geometry of the current gradients. This drives the design of DP-PCA in Section 5, which critically relies on two techniques to overcome each challenge, respectively. First, minibatch SGD (instead of single-sample SGD) significantly reduces the number of iterations, thus obviating the need for amplification by shuffling. Next, private mean estimation (instead of gradient norm clipping and noise addition) adapts to the geometry of the problem and adds the minimal noise necessary to achieve privacy. The main idea of this geometry-adaptive stochastic gradient update is explained in detail in Section 6, along with a proof sketch.

Notations. For a vector $x \in \mathbb{R}^d$, we use $\|x\|$ to denote the Euclidean norm. For a matrix $X \in \mathbb{R}^{d\times d}$, we use $\|X\|_2 = \max_{\|v\|=1}\|Xv\|_2$ to denote the spectral norm. We use $I_d$ to denote the $d\times d$ identity matrix. For $n \in \mathbb{Z}_+$, let $[n] := \{1, 2, \ldots, n\}$. Let $\mathcal{S}^{d-1}_2$ denote the unit $d$-sphere in $\ell_2$, i.e., $\mathcal{S}^{d-1}_2 := \{x\in\mathbb{R}^d : \|x\| = 1\}$. $\tilde O(\cdot)$ hides logarithmic factors in $n$, $d$, and the failure probability $\zeta$.

2 Problem formulation and background on DP

Typical PCA assumes i.i.d. data $\{x_i\in\mathbb{R}^d\}$ from a distribution and finds the first eigenvector of $\Sigma = E[(x_i - E[x_i])(x_i - E[x_i])^\top] \in \mathbb{R}^{d\times d}$. Our approach allows for a more general class of data $\{A_i\in\mathbb{R}^{d\times d}\}$ that recovers the standard case when $A_i = (x_i - E[x_i])(x_i - E[x_i])^\top$.

Assumption 1 ($(\Sigma, \{\lambda_i\}_{i=1}^d, M, V, K, \kappa, a, \gamma^2)$-model). Let $A_1, A_2, \ldots, A_n \in \mathbb{R}^{d\times d}$ be a sequence of (not necessarily symmetric) matrices sampled independently from the same distribution that satisfy the following, with PSD matrices $\Sigma\in\mathbb{R}^{d\times d}$ and $H_u\in\mathbb{R}^{d\times d}$, and positive scalar parameters $M$, $V$, $K$, $\kappa$, $a$, and $\gamma^2$:

A.1. Let $\Sigma := E[A_i]$ for a symmetric positive semidefinite (PSD) matrix $\Sigma\in\mathbb{R}^{d\times d}$, let $\lambda_i$ denote the $i$-th largest eigenvalue of $\Sigma$, and let $\kappa := \lambda_1/(\lambda_1-\lambda_2)$;

A.2. $\|A_i - \Sigma\|_2 \le \lambda_1 M$ almost surely;

A.3. $\max\big\{\big\|E[(A_i-\Sigma)(A_i-\Sigma)^\top]\big\|_2,\ \big\|E[(A_i-\Sigma)^\top(A_i-\Sigma)]\big\|_2\big\} \le \lambda_1^2 V$;

A.4. $\max_{\|u\|=1,\|v\|=1} E\Big[\exp\Big(\big(\tfrac{|u^\top(A_i-\Sigma)v|^2}{K^2\lambda_1^2\|H_u\|_2}\big)^{1/(2a)}\Big)\Big] \le 2$, where $H_u := (1/\lambda_1^2)\,E[(A_i-\Sigma)uu^\top(A_i-\Sigma)^\top]$. We denote $\gamma^2 := \max_{\|u\|=1}\|H_u\|_2$.

The first three assumptions are required for PCA even if privacy is not needed. The last assumption provides a Gaussian-like tail bound that determines how much noise we need to introduce in the algorithm for $(\varepsilon,\delta)$-DP. The following lemma is useful in the analyses.

Lemma 2.1. Under A.1 and A.4 in Assumption 1, for any unit vectors $u$ and $v$, with probability $1-\zeta$,
$$|u^\top(A_i-\Sigma)v|^2 \le K^2\lambda_1^2\,\|H_u\|_2\,\log^{2a}(2/\zeta). \quad (1)$$

2.1 Oja's algorithm

In a non-private setting, the following streaming algorithm, introduced in [69], achieves optimal sample complexity as analyzed in [45]. It is a projected stochastic gradient ascent on the objective defined on the empirical covariance: $\max_{\|w\|=1} (1/n)\sum_{i=1}^n w^\top A_i w$.

Algorithm 1: (Non-private) Oja's Algorithm
1 Choose $w_0$ uniformly at random from the unit sphere
2 for $t = 1, 2, \ldots, T$ do $w'_t \leftarrow w_{t-1} + \eta_t A_t w_{t-1}$, and $w_t \leftarrow w'_t/\|w'_t\|$
3 Return $w_T$

Central to our analysis is the following error bound on Oja's Algorithm from [45].

Theorem 2.2 ([45, Theorem 4.1]). Under Assumptions A.1-A.3, suppose the step size is $\eta_t = \frac{\alpha}{(\lambda_1-\lambda_2)(\xi+t)}$ for some $\alpha > 1/2$ and $\xi := 20\max\big\{\kappa M\alpha,\ \kappa^2(V+1)\alpha^2/\log(1+(\zeta/100))\big\}$. If $T > \xi$, then there exists a constant $C > 0$ such that Algorithm 1 outputs $w_T$ achieving, with probability $1-\zeta$,
$$\sin^2(w_T, v_1) \le C\log(1/\zeta)\Big(\frac{\alpha^2\kappa^2 V}{(2\alpha-1)T} + d\Big(\frac{\xi}{T}\Big)^{2\alpha}\Big). \quad (2)$$
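For concreteness, the following is a minimal NumPy sketch of Algorithm 1 in the standard rank-one setting $A_i = x_i x_i^\top$. The step-size constants in the example are illustrative placeholders, not the exact choices of Theorem 2.2.

```python
import numpy as np

def oja(A_stream, eta):
    """Non-private Oja's algorithm (Algorithm 1): projected stochastic
    gradient ascent on w -> (1/n) * sum_i w^T A_i w over the unit sphere."""
    d = A_stream[0].shape[0]
    w = np.random.randn(d)
    w /= np.linalg.norm(w)              # w_0 uniform on the unit sphere
    for t, A_t in enumerate(A_stream, start=1):
        w = w + eta(t) * (A_t @ w)      # gradient step with A_t w_{t-1}
        w /= np.linalg.norm(w)          # project back onto the sphere
    return w

# Usage: rank-one updates A_i = x_i x_i^T with Gaussian x_i ~ N(0, Sigma).
rng = np.random.default_rng(0)
d, n = 20, 10_000
Sigma = np.diag([4.0] + [1.0] * (d - 1))   # v_1 = e_1, lambda_1 - lambda_2 = 3
xs = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
w_hat = oja([np.outer(x, x) for x in xs],
            eta=lambda t: 1.0 / (3.0 * (50.0 + t)))  # ~ alpha/((lam1-lam2)(xi+t))
print(abs(w_hat[0]))   # close to 1: w_hat is aligned with v_1 = e_1
```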
2.2 Background on Differential Privacy

Differential privacy (DP), introduced in [21], is a de facto mathematical measure of the privacy leakage of a database accessed via queries. It ensures that even an adversary who knows all other entries cannot identify with high confidence whether a person of interest participated in the database or not.

Definition 2.3 (Differential privacy [21]). Given two multisets $S$ and $S'$, we say the pair $(S, S')$ is neighboring if $|S\setminus S'| + |S'\setminus S| \le 1$. We say a stochastic query $q$ over a dataset $S$ satisfies $(\varepsilon,\delta)$-differential privacy for some $\varepsilon > 0$ and $\delta\in(0,1)$ if $P(q(S)\in A) \le e^{\varepsilon}P(q(S')\in A) + \delta$ for all neighboring $(S, S')$ and all subsets $A$ of the range of $q$.

Small values of $\varepsilon$ and $\delta$ ensure that the adversary cannot identify any single data point with high confidence, thus providing plausible deniability. We provide useful DP lemmas in Appendix B. Within our stochastic gradient descent approach to PCA, we rely on the Gaussian mechanism to privatize each update. The sensitivity of a query $q$ is defined as $\Delta_q := \sup_{\text{neighboring }(S,S')} \|q(S) - q(S')\|$.

Lemma 2.4 (Gaussian mechanism [22]). For a query $q$ with sensitivity $\Delta_q$, $\varepsilon\in(0,1)$, and $\delta\in(0,1)$, the Gaussian mechanism outputs $q(S) + \mathcal{N}\big(0, (\Delta_q\sqrt{2\log(1.25/\delta)}/\varepsilon)^2 I_d\big)$ and achieves $(\varepsilon,\delta)$-DP.

This is a special case of a family of output perturbation mechanisms, which includes the Laplace mechanism [21] and staircase mechanisms [34]. The latter is shown to be optimal in one dimension [35] and for hypothesis testing under local DP [48].
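As an illustration of Lemma 2.4, the following sketch privatizes a norm-clipped empirical mean; the clipping bound $\beta$ and the budget values in the example are arbitrary choices for the illustration.

```python
import numpy as np

def gaussian_mechanism(q_value, sensitivity, eps, delta, rng):
    """Gaussian mechanism (Lemma 2.4): privatize a query with L2-sensitivity
    `sensitivity`, for eps in (0, 1) and delta in (0, 1)."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return q_value + rng.normal(0.0, sigma, size=np.shape(q_value))

# Usage: privatize the mean of B vectors clipped to norm beta. Replacing one
# of the B points moves the clipped mean by at most 2 * beta / B.
rng = np.random.default_rng(1)
B, beta = 1000, 1.0
g = rng.normal(size=(B, 5))
g *= np.minimum(1.0, beta / np.linalg.norm(g, axis=1, keepdims=True))
private_mean_of_g = gaussian_mechanism(g.mean(axis=0), 2 * beta / B,
                                       eps=0.5, delta=1e-6, rng=rng)
```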
Another mechanism we frequently use is the private histogram learner of [56], whose analysis is provided in Appendix B, along with various composition theorems used to provide end-to-end guarantees.

2.3 Comparisons with existing results in private PCA

We briefly discuss the most closely related work and survey more previous work in Appendix A. Most existing results assume fixed data in a deterministic setting where each sample has a bounded norm, $\|x_i\| \le \beta$, and the goal is to find the top eigenvector of $\hat\Sigma := (1/n)\sum_{i=1}^n (x_i-\hat\mu)(x_i-\hat\mu)^\top$ for the empirical mean $\hat\mu$. For the purpose of comparison, consider Gaussian $x_i\sim\mathcal{N}(0,\Sigma)$, for which $\|x_i\| \le \beta = O(\sqrt{\lambda_1 d\log(n/\zeta)})$ for all $i\in[n]$ with probability $1-\zeta$. The first line of approaches, in [9, 14, 23], is a Gaussian mechanism that outputs $\mathrm{PCA}(\hat\Sigma + Z)$, where $Z$ is a symmetric matrix with i.i.d. Gaussian entries with variance $\big((\beta^2/(n\varepsilon))\sqrt{2\log(1.25/\delta)}\big)^2$ to ensure $(\varepsilon,\delta)$-DP. The tightest result, in [23, Theorem 7], achieves
$$\sin(\hat v, v_1) = O\Big(\kappa\Big(\sqrt{\frac{d}{n}} + \frac{d^{3/2}\sqrt{\log(1/\delta)}}{\varepsilon n}\Big)\Big) \quad (3)$$
with high probability, under the strong assumption that the spectral gap is very large: $\lambda_1-\lambda_2 = \omega(d^{3/2}\sqrt{\log(1/\delta)}/(\varepsilon n))$. In a typical scenario with $\lambda_1 = O(1)$, this requires a large sample size of $n = \omega(d^{3/2}/\varepsilon)$. Since this Gaussian mechanism does not exploit the statistical properties of i.i.d. samples, the second term in this upper bound is larger by a factor of $d^{1/2}$ compared to the proposed DP-PCA (Corollary 5.2). The error rate of Eq. (3) is also achieved in [38, 36] by adding Gaussian noise to the standard power method for computing the principal components. When the spectral gap $\lambda_1-\lambda_2$ is smaller, it is possible to trade off the dependence on $\kappa$ and the sampling ratio $d/n$, which we do not address in this work but survey in Appendix A.

3 First attempt: making Oja's Algorithm private

Following the standard recipe in training with DP-SGD, e.g., [1] and the recent work [75], we introduce Private Oja's Algorithm in Algorithm 2. At each gradient update, we first apply gradient norm clipping to limit the contribution of a single data point and then add an appropriately chosen Gaussian noise from Lemma 2.4 to achieve $(\varepsilon,\delta)$-DP end-to-end.

Algorithm 2: Private Oja's Algorithm
Input: $S = \{A_i\in\mathbb{R}^{d\times d}\}_{i=1}^n$, privacy parameters $(\varepsilon,\delta)$, learning rates $\{\eta_t\}_{t=1}^n$
1 Randomly permute $S$ and choose $w_0$ uniformly at random from the unit sphere
2 Set the DP noise multiplier: $\alpha \leftarrow C\log(n/\delta)/(\varepsilon n)$
3 Set the clipping threshold: $\beta \leftarrow C\lambda_1\sqrt{d}\,(K\gamma\log^a(nd/\zeta) + 1)$
4 for $t = 1, 2, \ldots, n$ do
5   Sample $z_t\sim\mathcal{N}(0, I_d)$
6   $w'_t \leftarrow w_{t-1} + \eta_t\,\mathrm{clip}_\beta(A_t w_{t-1}) + 2\eta_t\beta\alpha\, z_t$, where $\mathrm{clip}_\beta(x) = x\cdot\min\{1, \beta/\|x\|_2\}$
7   $w_t \leftarrow w'_t/\|w'_t\|$
8 Return $w_n$

The choice of the clipping threshold $\beta$ ensures that, with high probability under our assumption, we do not clip any gradients. The choice of the noise multiplier $\alpha$ ensures $(\varepsilon,\delta)$-DP. One caveat of streaming algorithms is that we access the data $n$ times, each time with a private mechanism, but accessing only a single data point at a time. To prevent excessive privacy loss due to such a large number of data accesses, we apply random shuffling in line 1 of Algorithm 2, in order to benefit from standard amplification by shuffling [24, 29]. This gives an amplified privacy guarantee that allows us to add a small noise proportional to $\alpha = O(\log(n/\delta)/(\varepsilon n))$. Without the shuffle amplification, we would instead need a larger noise scaling as $\alpha = O(\log(n/\delta)/\varepsilon)$, resulting in a suboptimal utility guarantee. However, this comes with the restriction that the amplification holds only for small values of $\varepsilon = O(\sqrt{\log(n/\delta)/n})$.
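Putting these pieces together, a minimal sketch of Algorithm 2 could look as follows; the constant $C$ in the noise multiplier is dropped, the input stream is assumed to be pre-shuffled, and the guarantee only holds in the small-$\varepsilon$ regime discussed above.

```python
import numpy as np

def private_oja(A_stream, eps, delta, eta, beta, rng):
    """Sketch of Private Oja's Algorithm (Algorithm 2): per-sample gradient
    clipping plus Gaussian noise. The noise multiplier alpha drops the
    constant C and relies on amplification by shuffling, so A_stream is
    assumed pre-shuffled and eps = O(sqrt(log(n/delta)/n))."""
    n, d = len(A_stream), A_stream[0].shape[0]
    alpha = np.log(n / delta) / (eps * n)   # DP noise multiplier (C omitted)
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for t, A_t in enumerate(A_stream, start=1):
        g = A_t @ w
        g *= min(1.0, beta / np.linalg.norm(g))            # clip_beta(A_t w)
        w = w + eta(t) * g + 2.0 * eta(t) * beta * alpha * rng.normal(size=d)
        w /= np.linalg.norm(w)
    return w
```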
Our first contribution in the proposed DP-PCA (Algorithm 3) is to expand this range to $\varepsilon = O(1)$, which includes the practical regime of interest $\varepsilon\in[1/2, 5]$.

Lemma 3.1 (Privacy). If $\varepsilon = O(\sqrt{\log(n/\delta)/n})$ and the noise multiplier is chosen to be $\alpha = \Omega(\log(n/\delta)/(\varepsilon n))$, then Algorithm 2 is $(\varepsilon,\delta)$-DP.

Under Assumption 1, we select the gradient norm clipping threshold $\beta$ such that no gradient exceeds $\beta$.

Lemma 3.2 (Gradient clipping). Let $\beta = C\lambda_1\sqrt{d}\,(K\gamma\log^a(nd/\zeta) + 1)$ for some constant $C > 0$. Then, with probability $1-\zeta$, $\|A_t w_{t-1}\| \le \beta$ for any fixed $w_{t-1}$ independent of $A_t$, for all $t\in[n]$.

We provide proofs of both lemmas and the next theorem in Appendix D. When no clipping is applied, we can use the standard analysis of Oja's Algorithm from [45] to prove the following utility guarantee.

Theorem 3.3 (Utility). Given $n$ i.i.d. samples $\{A_i\in\mathbb{R}^{d\times d}\}_{i=1}^n$ satisfying Assumption 1 with parameters $(\Sigma, M, V, K, \kappa, a, \gamma^2)$, if
$$n = O\Big(\kappa^2 + \kappa M + \kappa^2 V + \frac{d\,\kappa\,(\gamma+1)\log(1/\delta)}{\varepsilon}\Big) \quad (4)$$
with a large enough constant, then there exist a positive universal constant $c_1$ and a choice of learning rate $\eta_t$, depending on $(t, M, V, K, a, \lambda_1, \lambda_1-\lambda_2, n, d, \varepsilon, \delta)$, such that Algorithm 2 with the choice $\zeta = 0.01$ outputs $w_n$ that achieves, with probability 0.99,
$$\sin^2(w_n, v_1) = \tilde O\Big(\kappa^2\Big(\frac{V}{n} + \frac{(\gamma+1)^2 d^2\log^2(1/\delta)}{\varepsilon^2 n^2}\Big)\Big), \quad (5)$$
where $\tilde O(\cdot)$ hides poly-logarithmic factors in $n$, $d$, $1/\varepsilon$, and $\log(1/\delta)$, and polynomial factors in $K$.

The first term in Eq. (5) matches the non-private error rate of Oja's algorithm in Eq. (2) with $\alpha = O(\log n)$ and $T = n$; the second term is the price we pay for ensuring $(\varepsilon,\delta)$-DP.

Remark 3.4. For the canonical setting of Gaussian data with $A_i = x_i x_i^\top$ and $x_i\sim\mathcal{N}(0,\Sigma)$, we have, for example from [70, Lemma 1.12], that $M = O(d\log n)$, $V = O(d)$, $K = 4$, $a = 1$, and $\gamma^2 = O(1)$. Theorem 3.3 implies the following error rate:
$$\sin^2(w_n, v_1) = \tilde O\Big(\kappa^2\Big(\frac{d}{n} + \frac{d^2\log^2(1/\delta)}{\varepsilon^2 n^2}\Big)\Big). \quad (6)$$
Compared to the lower bound in Theorem 5.3, this is already near optimal. However, for general distributions satisfying Assumption 1, Algorithm 2 (in particular the second term in Eq. (5)) can be significantly sub-optimal. We explain this second weakness of Private Oja's Algorithm in the following section (the first weakness being the restriction to $\varepsilon = O(\sqrt{\log(n/\delta)/n})$).

4 Two remaining challenges

We explain the two remaining challenges in Private Oja's Algorithm and propose techniques to overcome these challenges, which lead to the design of DP-PCA (Algorithm 3).

First challenge: restricted range of $\varepsilon = O(\sqrt{\log(n/\delta)/n})$. This is due to the large number, $n$, of iterations, which necessitates the use of amplification by shuffling in Theorem D.1. We reduce the number of iterations with minibatch SGD. For $T = O(\log^2 n)$ and $t = 1, 2, \ldots, T$, we repeat
$$w'_t \leftarrow w_{t-1} + \frac{\eta_t}{B}\sum_{i=1+B(t-1)}^{Bt}\mathrm{clip}_\beta(A_i w_{t-1}) + \frac{2\eta_t\beta\alpha}{B}\, z_t, \quad\text{and}\quad w_t \leftarrow w'_t/\|w'_t\|, \quad (7)$$
where $z_t\sim\mathcal{N}(0, I_d)$ and the minibatch size is $B = \lfloor n/T\rfloor$. Since the dataset is accessed only $T = O(\log^2 n)$ times, the end-to-end privacy can be analyzed with serial composition (Lemma B.3) instead of amplification by shuffling. This ensures $(\varepsilon,\delta)$-DP for any $\varepsilon = O(1)$, resolving the first challenge, and still achieves the utility guarantee of Eq. (5).
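A sketch of one update of Eq. (7) follows; dividing both the averaged gradient and the noise by $B$ reflects that the sensitivity of the minibatch average is $B$ times smaller than that of a single clipped gradient (constants are again illustrative).

```python
import numpy as np

def minibatch_private_oja_step(A_batch, w, eta_t, beta, alpha, rng):
    """One minibatch update of Eq. (7): average of clipped gradients plus
    Gaussian noise whose scale shrinks with the batch size B."""
    B = len(A_batch)
    G = np.stack([A @ w for A in A_batch])              # per-sample gradients
    G *= np.minimum(1.0, beta / np.linalg.norm(G, axis=1))[:, None]
    w_next = (w + (eta_t / B) * G.sum(axis=0)
              + (2.0 * eta_t * beta * alpha / B) * rng.normal(size=w.shape))
    return w_next / np.linalg.norm(w_next)
```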
Second challenge: excessive noise for privacy. This is best explained with an example.

Example 4.1 (Signal and noise separation). Consider a setting with $A_i = x_i x_i^\top$ and $x_i = s_i + n_i$, where $s_i = v$ with probability half and $s_i = -v$ otherwise, for a unit-norm vector $v$, and $n_i\sim\mathcal{N}(0, \sigma^2 I)$. We want to find the principal component of $\Sigma = E[x_i x_i^\top] = vv^\top + \sigma^2 I$, which is $v$. This construction decomposes the signal and the noise. For $A_i = vv^\top + s_i n_i^\top + n_i s_i^\top + n_i n_i^\top$, the signal component is determined by $vv^\top$, which is deterministic since the random sign of $s_i$ cancels. The noise component is $s_i n_i^\top + n_i s_i^\top + n_i n_i^\top$, which is random. We can control the Signal-to-Noise Ratio (SNR), $1/\sigma^2$, by changing $\sigma^2$, and we are particularly interested in the regime where $\sigma^2$ is small. For $\sigma^2 < 1$, this satisfies Assumption 1 with $\lambda_1 = 1+\sigma^2$, $\lambda_2 = \sigma^2$, $V = O(d\sigma^2)$, $K = O(1)$, $a = 1$, and $\gamma^2 = \sigma^2$. Substituting this into Eq. (5), Private Oja's Algorithm achieves
$$\sin^2(w_n, v_1) = \tilde O\Big(\frac{\sigma^2 d}{n} + \frac{d^2\log^2(1/\delta)}{\varepsilon^2 n^2}\Big), \quad (8)$$
where we are interested in $\sigma^2 < 1$.

This is problematic since the second term, due to the DP noise, does not vanish as the randomness $\sigma^2$ in the data decreases. We do not observe this for Gaussian data, where the signal and noise scale proportionally, as shown in Fig. 1.

[Figure 1: 2-d PCA under the Gaussian data from Remark 3.4 (left) shows that the average gradient (red arrow) is smaller than the range of the minibatch of 400 gradients (blue dots). Under Example 4.1 (right), the range can be made arbitrarily smaller than the average gradient.]

We reduce the noise added for privacy by switching from simple norm clipping, which adds noise proportional to the norm of the gradients, to private estimation, which only requires the noise to scale with the range of the gradients, i.e., the maximum distance between two gradients in the minibatch. The toy example above showcases that the range can be arbitrarily smaller than the maximum norm (Fig. 1). We want to emphasize that, although the idea of using private estimation within an optimization has been conceptually proposed in abstract settings, e.g., in [51], DP-PCA is the first setting where (i) such a separation between the norm and the range of the gradients holds under a statistical model, and hence (ii) the long line of recent advances in private estimation provides a significant gain over simple DP-SGD [1].

5 Differentially Private Principal Component Analysis (DP-PCA)

Combining the two ideas of minibatch SGD and private mean estimation, we propose DP-PCA (Algorithm 3). We use minibatch SGD with minibatch size $B = O(n/\log^2 n)$ to allow for the larger range of $\varepsilon = O(1)$. We use Private Mean Estimation to add an appropriate level of noise, chosen adaptively according to Private Eigenvalue Estimation. We describe the details of both sub-routines in Section 6.

Algorithm 3: Differentially Private Principal Component Analysis (DP-PCA)
Input: $S = \{A_i\}_{i=1}^n$, $(\varepsilon,\delta)$, batch size $B\in\mathbb{Z}_+$, learning rates $\{\eta_t\}_{t=1}^{\lfloor n/B\rfloor}$, probability $\zeta\in(0,1)$
1 Choose $w_0$ uniformly at random from the unit sphere
2 for $t = 1, 2, \ldots, T = \lfloor n/B\rfloor$ do
3   Run Private Top Eigenvalue Estimation (Algorithm 4) with $(\varepsilon/2, \delta/2)$-DP and failure probability $\zeta/(2T)$ on $\{A_{B(t-1)+i}\,w_{t-1}\}_{i=1}^{\lfloor B/2\rfloor}$. Let the returned estimate be $\hat\Lambda_t > 0$.
4   Run Private Mean Estimation (Algorithm 5) with $(\varepsilon/2, \delta/2)$-DP, failure probability $\zeta/(2T)$, and the estimated eigenvalue $2\hat\Lambda_t$ on $\{A_{B(t-1)+\lfloor B/2\rfloor+i}\,w_{t-1}\}_{i\in[\lceil B/2\rceil]}$. Let the returned mean gradient estimate be $\hat g_t\in\mathbb{R}^d$.
5   $w'_t \leftarrow w_{t-1} + \eta_t\hat g_t$, and $w_t \leftarrow w'_t/\|w'_t\|$
6 Return $w_T$
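The following sketch shows the structure of Algorithm 3; `private_top_eigenvalue` and `private_mean` are hypothetical stand-ins for Algorithms 4 and 5 (given in Appendix E), sketched in Section 6 below.

```python
import numpy as np

def dp_pca(A, eps, delta, B, eta, zeta, rng):
    """Structural sketch of DP-PCA (Algorithm 3). The helpers
    private_top_eigenvalue and private_mean stand in for Algorithms 4 and 5
    and are sketched in Section 6 below; each consumes half of the minibatch
    and half of the per-round privacy budget."""
    n, d = len(A), A[0].shape[0]
    T = n // B
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for t in range(1, T + 1):
        grads = np.stack([A_i @ w for A_i in A[B * (t - 1): B * t]])
        # Step 1: private estimate of the top eigenvalue of the gradient
        # covariance, from the first half of the minibatch.
        lam_hat = private_top_eigenvalue(grads[: B // 2],
                                         eps / 2, delta / 2, zeta / (2 * T), rng)
        # Step 2: private mean of the second half, truncated using 2 * lam_hat.
        g_hat = private_mean(grads[B // 2:], 2 * lam_hat,
                             eps / 2, delta / 2, zeta / (2 * T), rng)
        w = w + eta(t) * g_hat          # plain (unprojected) gradient step
        w /= np.linalg.norm(w)          # project back onto the unit sphere
    return w
```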
We show an upper bound on the error achieved by DP-PCA under an appropriate choice of the learning rate. We provide a complete proof in Appendix E.1, which includes the explicit choice of the learning rate $\eta_t$ in Eq. (67); a proof sketch is provided in Section 6.1.

Theorem 5.1. For $\varepsilon\in(0, 0.9)$, DP-PCA guarantees $(\varepsilon,\delta)$-DP for all $S$, $B$, $\zeta$, and $\delta$. Given $n$ i.i.d. samples $\{A_i\in\mathbb{R}^{d\times d}\}_{i=1}^n$ satisfying Assumption 1 with parameters $(\Sigma, M, V, K, \kappa, a, \gamma^2)$, if
$$n = O\Big(e^{\kappa^2} + \frac{d^{1/2}(\log(1/\delta))^{3/2}}{\varepsilon} + \kappa M + \kappa^2 V + \frac{d\,\kappa\,\gamma\,(\log(1/\delta))^{1/2}}{\varepsilon} + d\log(1/\delta)\Big) \quad (9)$$
with a large enough constant and $\delta \le 1/n$, then there exist a positive universal constant $c_1$ and a choice of learning rate $\eta_t$, depending on $(t, M, V, K, a, \lambda_1, \lambda_1-\lambda_2, n, d, \varepsilon, \delta)$, such that $T = \lfloor n/B\rfloor$ steps of DP-PCA in Algorithm 3, with the choices $\zeta = 0.01$ and $B = c_1 n/(\log n)^2$, output $w_T$ such that, with probability 0.99,
$$\sin(w_T, v_1) = \tilde O\Big(\kappa\sqrt{\frac{V}{n}} + \frac{\kappa\,\gamma\,d\,\sqrt{\log(1/\delta)}}{\varepsilon n}\Big), \quad (10)$$
where $\tilde O(\cdot)$ hides poly-logarithmic factors in $n$, $d$, $1/\varepsilon$, and $\log(1/\delta)$, and polynomial factors in $K$.

We further interpret this analysis and show that (i) DP-PCA is nearly optimal when the data is from a Gaussian distribution, by comparing against a lower bound (Theorem 5.3); and (ii) DP-PCA significantly improves upon Private Oja's Algorithm under Example 4.1. We discuss the necessity of some of the assumptions at the end of this section, including how to agnostically find an appropriate learning rate schedule.

Near-optimality of DP-PCA under Gaussian distributions. Consider the case of i.i.d. samples $\{x_i\}_{i=1}^n$ from the Gaussian distribution of Remark 3.4.

Corollary 5.2 (Upper bound; Gaussian distribution). Under the hypotheses of Theorem 5.1 and $\{A_i = x_i x_i^\top\}_{i=1}^n$ with Gaussian random vectors $x_i$, after $T = \lfloor n/B\rfloor$ steps, DP-PCA outputs $w_T$ that achieves, with probability 0.99,
$$\sin(w_T, v_1) = \tilde O\Big(\kappa\Big(\sqrt{\frac{d}{n}} + \frac{d\sqrt{\log(1/\delta)}}{\varepsilon n}\Big)\Big). \quad (11)$$

We prove a nearly matching lower bound, up to factors of $\sqrt{\lambda_1/\lambda_2}$ and $\sqrt{\log(1/\delta)}$. One caveat is that the lower bound assumes pure DP, with $\delta = 0$. We do not yet have a lower bound technique for approximate DP that is tight, and all known approximate-DP lower bounds have gaps to the achievable upper bounds in their dependence on $\log(1/\delta)$, e.g., [5, 62]. We provide a proof in Appendix C.1.

Theorem 5.3 (Lower bound; Gaussian distribution). Let $\mathcal{M}_\varepsilon$ be the class of $(\varepsilon, 0)$-DP estimators that map $n$ i.i.d. samples to an estimate $\hat v\in\mathbb{R}^d$. Let $\mathcal{P}_{(\lambda_1,\lambda_2)}$ denote the set of Gaussian distributions with $(\lambda_1, \lambda_2)$ as the first and second eigenvalues of the covariance matrix. For $d > c$, where $c > 0$ is some absolute constant, there exists a universal constant $C > 0$ such that
$$\inf_{\hat v\in\mathcal{M}_\varepsilon}\ \sup_{P\in\mathcal{P}_{(\lambda_1,\lambda_2)}}\ E_{S\sim P^{\otimes n}}\big[\sin(\hat v(S), v_1)\big]\ \ge\ C\,\min\Big\{1,\ \frac{\sqrt{\lambda_1\lambda_2}}{\lambda_1-\lambda_2}\Big(\sqrt{\frac{d}{n}} + \frac{d}{\varepsilon n}\Big)\Big\}. \quad (12)$$

Comparisons with Private Oja's Algorithm. We demonstrate that DP-PCA can significantly improve upon Private Oja's Algorithm with Example 4.1, where DP-PCA achieves an error bound of $\sin(w_T, v_1) = \tilde O\big(\sigma\big(\sqrt{d/n} + d\sqrt{\log(1/\delta)}/(\varepsilon n)\big)\big)$. As the noise power $\sigma^2$ decreases, DP-PCA achieves a vanishing error, whereas Private Oja's Algorithm has a non-vanishing error in Eq. (8). This follows from the fact that the second term in the error bound of Eq. (10) scales as $\gamma$, which can be made arbitrarily smaller than the second term in Eq. (5), which scales as $(\gamma+1)$. Further, the error bound for DP-PCA holds for any $\varepsilon = O(1)$, whereas Private Oja's Algorithm requires a significantly smaller $\varepsilon = O(\sqrt{\log(n/\delta)/n})$.

Remarks on the assumptions of Theorem 5.1. We have an exponential dependence of the sample complexity on the spectral gap, $n \gtrsim e^{\kappa^2}$. This ensures we have a large enough $T = \lfloor n/B\rfloor$ to reduce the non-dominant second term in Eq. (2), in balancing the learning rate $\eta_t$ and $T$ (shown explicitly in Eqs. (69) and (70) in the Appendix). It is possible to get rid of this exponential dependence, at the cost of an extra term of $O(\kappa^4\gamma^2 d^2\log(1/\delta)/(\varepsilon n)^2)$ in the error rate in Eq. (10), by selecting a slightly larger $T = c\,\kappa^2\log^2 n$.
A Gaussian-like tail bound as in Assumption A.4 is necessary to get the desired upper bound scaling as $O(d\sqrt{\log(1/\delta)}/(\varepsilon n))$ in Eq. (11), for example. The next lower bound shows that, without such an assumption on the tail, the error due to privacy scales as $\Omega(\sqrt{d\log(1/\delta)/(\varepsilon n)})$. We believe that the dependence on $\delta$ is loose, and it might be possible to get a tighter lower bound using [52]. We provide a proof and other lower bounds in Appendix C.

Theorem 5.4 (Lower bound without Assumption A.4). Let $\mathcal{M}_\varepsilon$ be the class of $(\varepsilon,\delta)$-DP estimators that map $n$ i.i.d. samples to an estimate $\hat v\in\mathbb{R}^d$. Let $\mathcal{P}$ denote the set of distributions satisfying Assumptions A.1-A.3 with $M = O(d + \sqrt{n\varepsilon/d})$, $V = O(d)$, and $\gamma = O(1)$. For $d \ge 2$, there exists a universal constant $C > 0$ such that
$$\inf_{\hat v\in\mathcal{M}_\varepsilon}\ \sup_{P\in\mathcal{P}}\ E_{S\sim P^{\otimes n}}\big[\sin(\hat v(S), v_1)\big]\ \ge\ C\,\kappa\,\min\Big\{1,\ \sqrt{\frac{d\,\log\big((1-e^{-\varepsilon})/\delta\big)}{\varepsilon n}}\Big\}. \quad (13)$$

Currently, DP-PCA requires choices of the learning rates $\eta_t$ that depend on possibly unknown quantities. Since we can privately evaluate the quality of a solution, one can instead run multiple instances of DP-PCA with varying $\eta_t = c_1/(c_2+t)$ and find the best choice of $c_1 > 0$ and $c_2 > 0$. Let $w_T(c_1, c_2)$ denote the resulting solution for one instance of $\{\eta_t = c_1/(c_2+t)\}_{t=1}^T$. We first set a target error $\zeta$. For each round $i = 1, 2, \ldots$, we run the algorithm for $(c_1, c_2)\in\{2^{i-1}, 2^{-i+1}\}\times\{2^{-i+1}, 2^{-i+2}, \ldots, 2^{i-1}\}$ and $(c_1, c_2)\in\{2^{-i+1}, 2^{-i+2}, \ldots, 2^{i-1}\}\times\{2^{i-1}, 2^{-i+1}\}$, and compute each $\sin(w_T(c_1,c_2), v_1)$ privately, each with privacy budget $\varepsilon_i = \varepsilon/(2^{i+1}(2i-1))$ and $\delta_i = \delta/(2^{i+1}(2i-1))$. We terminate once there is a $w_T(c_1,c_2)$ satisfying $\sin(w_T(c_1,c_2), v_1) \le \zeta$. This search meta-algorithm terminates in logarithmically many rounds, and the total sample complexity blows up only by a poly-logarithmic factor.

6 Private mean estimation for the minibatch stochastic gradients

DP-PCA critically relies on private mean estimation to reduce the variance of the noise required to achieve $(\varepsilon,\delta)$-DP. We follow a common recipe from [56, 50, 54, 8, 15]. First, we privately find an approximate range of the gradients in the minibatch (Algorithm 4). Next, we apply the Gaussian mechanism to the truncated gradients, where the truncation is tailored to the estimated range (Algorithm 5).

Step 1: estimating the range. We need to find an approximate range of the minibatch of gradients in order to adaptively truncate the gradients and bound the sensitivity. Inspired by a private preconditioning mechanism designed for mean estimation with unknown covariance in [53], we propose to use the privately estimated top eigenvalue of the covariance matrix of the gradients. For details on the version of the histogram learner we use in Algorithm 4 in Appendix E.2, we refer to [61, Lemma D.1]. Unlike the private preconditioning of [53], which estimates all eigenvalues and requires $n = \tilde O(d^{3/2}\sqrt{\log(1/\delta)}/\varepsilon)$ samples, we only require the top eigenvalue, and hence the next theorem shows that we only need $n = \tilde O(\sqrt{d\log(1/\delta)}/\varepsilon)$.

Theorem 6.1. Algorithm 4 is $(\varepsilon,\delta)$-DP. Let $g_i = A_i u$ for some fixed vector $u$, where $A_i$ satisfies A.1 and A.4 in Assumption 1, such that the mean is $E[g_i] = \Sigma u$ and the covariance is $E[(g_i-\Sigma u)(g_i-\Sigma u)^\top] = \lambda_1^2 H_u$. With a large enough sample size scaling as
$$B = O\Big(\frac{K^2\sqrt{d}\,\log\big(d\log(1/(\delta\zeta))/(\zeta\varepsilon)\big)\,\log^{2a}(Bd/\zeta)\,\sqrt{\log(1/(\zeta\delta))}}{\varepsilon}\Big) = \tilde O\Big(\frac{K^2\sqrt{d\log(1/\delta)}}{\varepsilon}\Big),$$
Algorithm 4 outputs $\hat\Lambda$ achieving $\hat\Lambda\in\big[(1/\sqrt2)\,\lambda_1^2\|H_u\|_2,\ \sqrt2\,\lambda_1^2\|H_u\|_2\big]$ with probability $1-\zeta$, where the pair $(K>0, a>0)$ parametrizes the tail of the distribution in A.4 and $\tilde O(\cdot)$ hides logarithmic factors in $B$, $d$, $1/\zeta$, $\log(1/\delta)$, and $\varepsilon$. We provide a proof in Appendix E.2.
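Since Algorithm 4 itself is deferred to Appendix E.2, the following is only a hedged structural sketch of the idea, with a simplified Laplace-noised histogram standing in for the stability-based histogram learner of [56, 61].

```python
import numpy as np

def private_top_eigenvalue(grads, eps, delta, zeta, rng):
    """Simplified stand-in for Algorithm 4 (Appendix E.2). Disjoint groups
    of gradients each vote for a bin on a geometric grid according to the
    group's top empirical eigenvalue; one data point changes at most one
    vote, so Laplace noise on the counts suffices for the histogram step.
    The paper instead uses the stability-based histogram learner of
    [56, 61], which is where delta and zeta enter; they are kept here only
    for interface compatibility."""
    m, d = grads.shape
    k = max(2, int(np.sqrt(d)))                 # group size (illustrative)
    votes = []
    for i in range(0, m - k + 1, k):
        C = grads[i:i + k] - grads[i:i + k].mean(axis=0)
        lam = np.linalg.eigvalsh(C.T @ C / k)[-1]   # group's top eigenvalue
        votes.append(int(np.round(np.log2(max(lam, 1e-300)))))
    bins, counts = np.unique(votes, return_counts=True)
    noisy = counts + rng.laplace(scale=2.0 / eps, size=len(counts))
    return float(2.0 ** bins[int(np.argmax(noisy))])
```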
There are other ways to privately estimate the range. Some approaches require known bounds such as $\sigma^2_{\min} \le \lambda_1^2 (H_u)_{ii} \le \sigma^2_{\max}$ for all $i\in[d]$ [56], and other agnostic approaches are more involved, such as the instance-optimal universal estimators of [17].

Step 2: Gaussian mechanism for mean estimation. Once we have a good estimate of the top eigenvalue from the previous step, we use it to select the bin size of the private histogram and compute the truncated empirical mean. Since the truncated empirical mean has bounded sensitivity, we can use the Gaussian mechanism to achieve DP. The algorithm is now standard in DP mean estimation, e.g., [56, 50]. However, the analysis is slightly different, since our assumptions on the $g_i$'s are different. For completeness, we provide Algorithm 5 in Appendix E.3. The next lemma shows that Private Mean Estimation is $(\varepsilon,\delta)$-DP and that, with high probability, clipping does not apply to any of the gradients. The returned private mean, therefore, is distributed as a spherical Gaussian centered at the empirical mean of the gradients. This result requires a good estimate of the top eigenvalue from Algorithm 4, such that $\hat\Lambda \approx \lambda_1^2\|H_u\|_2$. This analysis implies that we get an unbiased estimate of the gradient mean (which is critical in the analysis) with noise scaling as $O(\lambda_1\gamma\sqrt{d\log(1/\delta)}/(\varepsilon B))$, where $\gamma^2 = \max_{u:\|u\|=1}\|H_u\|_2$ (which is critical in getting the tight sample complexity in the second term of the final utility guarantee in Eq. (10)). We provide a proof in Appendix E.3.

Lemma 6.2. For $\varepsilon\in(0, 0.9)$ and any $\delta\in(0,1)$, Algorithm 5 is $(\varepsilon,\delta)$-DP. Let $g_i = A_i u$ for some fixed vector $u$, where $A_i$ satisfies A.1 and A.4 in Assumption 1, such that the mean is $E[g_i] = \Sigma u$ and the covariance is $E[(g_i-\Sigma u)(g_i-\Sigma u)^\top] = \lambda_1^2 H_u$. If $\hat\Lambda\in\big[\lambda_1^2\|H_u\|_2/\sqrt2,\ 2\lambda_1^2\|H_u\|_2\big]$, $\delta \le 1/B$, and $B = \Omega\big((\sqrt{d\log(1/\delta)}/\varepsilon)\log(d/(\zeta\delta))\big)$, then, with probability $1-\zeta$,
$$g_i \in \bar g + \big[-3K\sqrt{\hat\Lambda}\log^a(Bd/\zeta),\ 3K\sqrt{\hat\Lambda}\log^a(Bd/\zeta)\big]^d \quad\text{for all } i\in[B],$$
where $\bar g$ denotes the empirical mean of the gradients.
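A hedged sketch of the corresponding mean estimation step follows; it replaces the private histogram localization of Algorithm 5 with a non-private placeholder, so it should be read as structure only, not as a DP implementation.

```python
import numpy as np

def private_mean(grads, lam_hat, eps, delta, zeta, rng):
    """Simplified stand-in for Algorithm 5 (Appendix E.3): coordinate-wise
    truncation at a radius set by the private eigenvalue estimate, then the
    Gaussian mechanism. The truncation center would be located by the
    private histogram learner of [56]; the non-private median below is a
    placeholder, so this sketch is NOT end-to-end DP as written. K and a
    are set to the Gaussian values of Remark 3.4 for illustration."""
    B, d = grads.shape
    K, a = 4.0, 1.0
    r = 3.0 * K * np.sqrt(lam_hat) * np.log(B * d / zeta) ** a   # Lemma 6.2
    center = np.median(grads, axis=0)
    clipped = np.clip(grads, center - r, center + r)
    # Replacing one point moves the truncated mean by at most 2*r*sqrt(d)/B.
    sens = 2.0 * r * np.sqrt(d) / B
    sigma = sens * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped.mean(axis=0) + rng.normal(0.0, sigma, size=d)
```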
6.1 Proof sketch of Theorem 5.1

We choose $B = \Theta(n/\log^2 n)$ such that we access the dataset only $T = \Theta(\log^2 n)$ times. Hence, we do not need to rely on amplification by shuffling. To add Gaussian noise that scales as the standard deviation of the gradients in each minibatch (as opposed to the potentially excessively large mean of the gradients), DP-PCA adopts techniques from recent advances in private mean estimation. Namely, we first get a private and accurate estimate of the range from Theorem 6.1. Using this estimate $\hat\Lambda_t$, Private Mean Estimation returns an unbiased estimate of the empirical mean of the gradients, as long as no truncation has been applied, as ensured by Lemma 6.2. This gives
$$w'_t \leftarrow w_{t-1} + \eta_t\Big(\frac{1}{B}\sum_{i=1}^{B} A_{B(t-1)+i}\,w_{t-1} + \beta_t z_t\Big) \quad (14)$$
for $z_t\sim\mathcal{N}(0, I)$ and $\beta_t = \frac{8K\sqrt{2\hat\Lambda_t}\,\log^a(Bd/\zeta)\,\sqrt{2d\log(2.5/\delta)}}{\varepsilon B}$. Using the rotation invariance of spherical Gaussian random vectors and the fact that $\|w_{t-1}\| = 1$, we can reformulate this as
$$w'_t \leftarrow w_{t-1} + \eta_t\Big(\frac{1}{B}\sum_{i=1}^{B} A_{B(t-1)+i} + \beta_t G_t\Big) w_{t-1}, \quad (15)$$
where $G_t\in\mathbb{R}^{d\times d}$ has i.i.d. standard Gaussian entries. This process can be analyzed with Theorem 2.2, with $\tilde A_t := \frac{1}{B}\sum_{i=1}^{B} A_{B(t-1)+i} + \beta_t G_t$ substituting for $A_t$.

7 Discussion

Under the canonical task of computing the principal component from i.i.d. samples, we show the first result achieving the optimal error rate. This critically relies on two ideas: minibatch SGD and private mean estimation. In particular, private mean estimation plays a critical role when the range of the gradients is significantly smaller than their norm; we achieve an optimal error rate that cannot be achieved with the standard recipe of gradient clipping.

Assumption A.4 can be relaxed to heavy-tail bounds with bounded $k$-th moments on $A_i$, in which case we expect the second term in Eq. (10) to scale as $O\big(d\,(\sqrt{\log(1/\delta)}/(\varepsilon n))^{1-1/k}\big)$, drawing an analogy from a similar trend in a computationally inefficient DP-PCA without spectral gap [62, Corollary 6.10]. When a fraction of the data is corrupted, recent advances in [84, 58, 46] provide optimal algorithms for PCA. However, the existing approach of [62] for robust and private PCA is computationally intractable. Borrowing ideas from robust and private mean estimation in [61], one can design an efficient algorithm, but at the cost of sub-optimal sample complexity. It is an interesting direction to design an optimal and robust version of DP-PCA. Our lower bounds are loose in their dependence on $\log(1/\delta)$. Recently, a promising lower bound technique was introduced in [52] that might close this gap.

There are two ways to extend our framework to general rank-$r$ PCA, whose analyses are promising research directions. First, applying Hotelling's deflation method [40], we can iteratively find the PCA components one by one, by alternating our DP-PCA and deflation. For example, in one step of the iteration, we only update the current iterate vector in the directions orthogonal to all the previously found PCA components. Repeating this step gives estimates of the top principal components. Secondly, we can directly apply Oja's algorithm: we keep track of an $r$-dimensional subspace in the Oja update rule and perform a QR decomposition to keep the iterates on the Grassmannian manifold. It might be possible to extend the analysis of [42] to analyze the private version.

Acknowledgments and Disclosure of Funding

This work is supported in part by a Google faculty research award and NSF grants CCF-1705007, CNS-2002664, IIS-1929955, DMS-2134012, CCF-2019844 as a part of the NSF Institute for Foundations of Machine Learning (IFML), and CNS-2112471 as a part of the NSF AI Institute for Future Edge Networks and Distributed Intelligence (AI-EDGE).

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.
[2] John M. Abowd. The US Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2867–2867, 2018.
[3] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private Assouad, Fano, and Le Cam. In Algorithmic Learning Theory, pages 48–78. PMLR, 2021.
[4] Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. In Conference on Learning Theory, pages 284–309. PMLR, 2016.
[5] Rina Foygel Barber and John C. Duchi. Privacy and statistical risk: Formalisms and minimax bounds. arXiv preprint arXiv:1412.4451, 2014.
[6] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. Advances in Neural Information Processing Systems, 32, 2019.
[7] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
[8] Sourav Biswas, Yihe Dong, Gautam Kamath, and Jonathan Ullman. CoinPress: Practical private mean and covariance estimation. arXiv preprint arXiv:2006.06618, 2020.
[9] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 128–138, 2005.
[10] Gavin Brown, Marco Gaboardi, Adam Smith, Jonathan Ullman, and Lydia Zakynthinou. Covariance-aware private mean estimation without private covariance estimation. Advances in Neural Information Processing Systems, 34, 2021.
[11] Mark Bun, Kobbi Nissim, and Uri Stemmer. Simultaneous private learning of multiple concepts. Journal of Machine Learning Research, 20(94), 2019.
[12] Mark Bun and Thomas Steinke. Average-case averages: Private algorithms for smooth sensitivity and mean estimation. Advances in Neural Information Processing Systems, 32, 2019.
[13] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv preprint arXiv:1902.04495, 2019.
[14] Kamalika Chaudhuri, Anand D. Sarwate, and Kaushik Sinha. A near-optimal algorithm for differentially-private principal components. The Journal of Machine Learning Research, 14(1):2905–2943, 2013.
[15] Christian Covington, Xi He, James Honaker, and Gautam Kamath. Unbiased statistical estimation and valid confidence intervals under differential privacy. arXiv preprint arXiv:2110.14465, 2021.
[16] Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin I. P. Rubinstein. Robust and private Bayesian inference. In International Conference on Algorithmic Learning Theory, pages 291–305. Springer, 2014.
[17] Wei Dong and Ke Yi. Universal private estimators. arXiv preprint arXiv:2111.02598, 2021.
[18] John Duchi and Ryan Rogers. Lower bounds for locally private estimation via communication complexity. In Conference on Learning Theory, pages 1161–1191. PMLR, 2019.
[19] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182–201, 2018.
[20] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pages 371–380, 2009.
[21] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
[22] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
[23] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 11–20, 2014.
[24] Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2468–2479. SIAM, 2019.
[25] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.
[26] Hossein Esfandiari, Vahab Mirrokni, and Shyam Narayanan. Tight and robust private mean estimation with few users. arXiv preprint arXiv:2110.11876, 2021.
[27] Giulia Fanti, Vasyl Pihur, and Úlfar Erlingsson. Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies, 2016(3):41–61, 2016.
[28] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
[29] Vitaly Feldman, Audra McMillan, and Kunal Talwar. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 954–964. IEEE, 2022.
[30] Vitaly Feldman and Thomas Steinke. Calibrating noise to variance in adaptive data analysis. In Conference On Learning Theory, pages 535–544. PMLR, 2018.
[31] James Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving Bayesian data analysis. arXiv preprint arXiv:1603.07294, 2016.
[32] Marco Gaboardi, Ryan Rogers, and Or Sheffet. Locally private mean estimation: z-test and tight confidence intervals. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2545–2554. PMLR, 2019.
[33] Arpita Gang, Bingqing Xiang, and Waheed U. Bajwa. Distributed principal subspace analysis for partitioned big data: Algorithms, analysis, and implementation. IEEE Transactions on Signal and Information Processing over Networks, 7:699–715, 2021.
[34] Quan Geng, Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The staircase mechanism in differential privacy. IEEE Journal of Selected Topics in Signal Processing, 9(7):1176–1184, 2015.
[35] Quan Geng and Pramod Viswanath. The optimal mechanism in differential privacy. In 2014 IEEE International Symposium on Information Theory, pages 2371–2375. IEEE, 2014.
[36] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. Advances in Neural Information Processing Systems, 27, 2014.
[37] Moritz Hardt and Aaron Roth. Beating randomized response on incoherent matrices. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 1255–1268, 2012.
[38] Moritz Hardt and Aaron Roth. Beyond worst-case analysis in private singular vector computation. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 331–340, 2013.
[39] Samuel B. Hopkins, Gautam Kamath, and Mahbod Majid. Efficient mean estimation with pure differential privacy via a sum-of-squares exponential mechanism. arXiv preprint arXiv:2111.12981, 2021.
[40] Harold Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417, 1933.
[41] Lijie Hu, Shuo Ni, Hanshen Xiao, and Di Wang. High dimensional differentially private stochastic optimization with heavy-tailed data. arXiv preprint arXiv:2107.11136, 2021.
[42] De Huang, Jonathan Niles-Weed, and Rachel Ward. Streaming k-PCA: Efficient guarantees for Oja's algorithm, beyond rank-one updates. In Conference on Learning Theory, pages 2463–2498. PMLR, 2021.
[43] Hafiz Imtiaz and Anand D. Sarwate. Differentially private distributed principal component analysis. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2206–2210. IEEE, 2018.
[44] Hafiz Imtiaz and Anand D. Sarwate. Distributed differentially private algorithms for matrix and tensor factorization. IEEE Journal of Selected Topics in Signal Processing, 12(6):1449–1464, 2018.
[45] Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Conference on Learning Theory, pages 1147–1164. PMLR, 2016.
[46] Arun Jambulapati, Jerry Li, and Kevin Tian. Robust sub-Gaussian principal component analysis and width-independent Schatten packing. Advances in Neural Information Processing Systems, 33, 2020.
[47] Matthew Joseph, Janardhan Kulkarni, Jieming Mao, and Steven Z. Wu. Locally private Gaussian estimation. Advances in Neural Information Processing Systems, 32, 2019.
[48] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local differential privacy. Advances in Neural Information Processing Systems, 27, 2014.
[49] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International Conference on Machine Learning, pages 1376–1385. PMLR, 2015.
[50] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In Conference on Learning Theory, pages 1853–1902. PMLR, 2019.
[51] Gautam Kamath, Xingtu Liu, and Huanyu Zhang. Improved rates for differentially private stochastic convex optimization with heavy-tailed data. arXiv preprint arXiv:2106.01336, 2021.
[52] Gautam Kamath, Argyris Mouzakis, and Vikrant Singhal. New lower bounds for private estimation and a generalized fingerprinting lemma. arXiv preprint arXiv:2205.08532, 2022.
[53] Gautam Kamath, Argyris Mouzakis, Vikrant Singhal, Thomas Steinke, and Jonathan Ullman. A private and computationally-efficient estimator for unbounded Gaussians. arXiv preprint arXiv:2111.04609, 2021.
[54] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. arXiv preprint arXiv:2002.09464, 2020.
[55] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1395–1414. SIAM, 2013.
[56] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. arXiv preprint arXiv:1711.03908, 2017.
[57] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2012.
[58] Weihao Kong, Raghav Somani, Sham Kakade, and Sewoong Oh. Robust meta-learning for mixed linear regression with small batches. Advances in Neural Information Processing Systems, 33, 2020.
[59] Pravesh K. Kothari, Pasin Manurangsi, and Ameya Velingker. Private robust estimation by stabilizing convex relaxations. arXiv preprint arXiv:2112.03548, 2021.
[60] Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimization and stochastic convex optimization in subquadratic steps. arXiv preprint arXiv:2103.15352, 2021.
[61] Xiyang Liu, Weihao Kong, Sham Kakade, and Sewoong Oh. Robust and differentially private mean estimation. Advances in Neural Information Processing Systems, 34, 2021.
[62] Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. In Conference on Learning Theory, pages 1167–1246. PMLR, 2022.
[63] Pascal Massart. Concentration Inequalities and Model Selection: École d'Été de Probabilités de Saint-Flour XXXIII-2003. Springer, 2007.
[64] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), pages 94–103. IEEE, 2007.
[65] Frank D. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 19–30, 2009.
[66] Jason Milionis, Alkis Kalavasis, Dimitris Fotakis, and Stratis Ioannidis. Differentially private regression with unbounded covariates. In International Conference on Artificial Intelligence and Statistics, pages 3242–3273. PMLR, 2022.
[67] Kentaro Minami, Hitomi Arai, Issei Sato, and Hiroshi Nakagawa. Differential privacy without sensitivity. In Advances in Neural Information Processing Systems, pages 956–964, 2016.
[68] Darakhshan J. Mir. Differential Privacy: An Exploration of the Privacy-Utility Landscape. Rutgers, The State University of New Jersey-New Brunswick, 2013.
[69] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.
[70] Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for course 18S997, 813(814):46, 2015.
[71] Or Sheffet. Old techniques in differentially private linear regression. In Algorithmic Learning Theory, pages 789–827. PMLR, 2019.
[72] Jun Tang, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang. Privacy loss in Apple's implementation of differential privacy on macOS 10.12. arXiv preprint arXiv:1709.02753, 2017.
[73] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
[74] Christos Tzamos, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Ilias Zadik. Optimal private median estimation under minimal distributional assumptions. Advances in Neural Information Processing Systems, 33:3301–3311, 2020.
[75] Prateek Varshney, Abhradeep Thakurta, and Prateek Jain. (Nearly) optimal private linear regression via adaptive clipping. arXiv preprint arXiv:2207.04686, 2022.
[76] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
[77] Duy Vu and Aleksandra Slavkovic. Differential privacy for clinical trial data: Preliminary evaluations. In 2009 IEEE International Conference on Data Mining Workshops, pages 138–143. IEEE, 2009.
[78] Vincent Vu and Jing Lei. Minimax rates of estimation for sparse PCA in high dimensions. In Artificial Intelligence and Statistics, pages 1278–1286. PMLR, 2012.
[79] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
[80] Di Wang, Hanshen Xiao, Srinivas Devadas, and Jinhui Xu. On differentially private stochastic convex optimization with heavy-tailed data. In International Conference on Machine Learning, pages 10081–10091. PMLR, 2020.
[81] Sen Wang and J. Morris Chang. Differentially private principal component analysis over horizontally partitioned data. In 2018 IEEE Conference on Dependable and Secure Computing (DSC), pages 1–8. IEEE, 2018.
[82] Yu-Xiang Wang. Revisiting differentially private linear regression: Optimal and adaptive prediction & estimation in unbounded domain. arXiv preprint arXiv:1803.02596, 2018.
[83] Yu-Xiang Wang, Stephen Fienberg, and Alex Smola. Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In International Conference on Machine Learning, pages 2493–2502. PMLR, 2015.
[84] Huan Xu, Constantine Caramanis, and Shie Mannor. Principal component analysis with contaminated data: The high dimensional case. arXiv preprint arXiv:1002.4658, 2010.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 7.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] Our work is theoretical and does not have a direct negative societal impact.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] See Appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A] We don't have experiments.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A] We didn't use existing assets.
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]