# Privately Learning Subspaces

Vikrant Singhal
Cheriton School of Computer Science, University of Waterloo
Waterloo, ON N2L 3G1, Canada
vikrant.singhal@uwaterloo.ca

Thomas Steinke
Google Research, Brain Team
Mountain View, CA, United States of America
subspace@thomas-steinke.net

**Abstract.** Private data analysis suffers a costly curse of dimensionality. However, the data often has an underlying low-dimensional structure. For example, when optimizing via gradient descent, the gradients often lie in or near a low-dimensional subspace. If that low-dimensional structure can be identified, then we can avoid paying (in terms of privacy or accuracy) for the high ambient dimension. We present differentially private algorithms that take input data sampled from a low-dimensional linear subspace (possibly with a small amount of error) and output that subspace (or an approximation to it). These algorithms can serve as a pre-processing step for other procedures.

*The full version of this article is available online at https://arxiv.org/abs/2106.00001.*

*35th Conference on Neural Information Processing Systems (NeurIPS 2021).*

## 1 Introduction

Differentially private algorithms generally have a poor dependence on the dimensionality of their input. That is, their error or sample complexity grows polynomially with the dimension. For example, for the simple task of estimating the mean of a distribution supported on $[0,1]^d$, attaining differential privacy entails per-coordinate error $\Theta(\sqrt{d}/n)$, where $n$ is the number of samples. In contrast, the non-private per-coordinate error is $\Theta(\sqrt{1/n})$. This cost of dimensionality is inherent [BUV14, SU17, DSS+15]: any method with lower error is susceptible to tracing attacks (a.k.a. membership inference attacks). However, these lower bounds only apply when the data distribution is high-entropy. This leaves open the possibility that we can circumvent the curse of dimensionality when the data has an underlying low-dimensional structure.

Data often does possess an underlying low-dimensional structure. For example, the gradients that arise in deep learning tend to be close to a low-dimensional subspace [ACG+16, LXT+17, GARD18, LFLY18, LGZ+20, ZWB20, FT20]. Low dimensionality can arise from meaningful relationships that are at least locally linear, such as income versus tax paid. It can also arise because we are looking at a function of data with relatively few attributes.

A long line of work [BLR08, HT10, HR10, Ull15, BBNS19, BCM+20, ZWB20, KRRT20, etc.] has shown how to exploit structure in the data to attain better privacy and accuracy. However, these approaches assume that this structure is known a priori or that it can be learned from non-private sources. This raises the question: Can we learn low-dimensional structure from the data subject to differential privacy?

We consider the simple setting where the data lies in $\mathbb{R}^d$ but is in, or very close to, a linear subspace of dimension $k$. We focus on the setting where $k \ll d$, and we develop algorithms whose sample complexity does not depend on the ambient dimension $d$; a polynomial dependence on the true dimension $k$ is unavoidable. Our algorithms identify the subspace in question or, if the data is perturbed slightly, an approximation to it. Identifying the subspace structure is interesting in its own right, but it can also be used as a pre-processing step for further analysis: by projecting to the low-dimensional subspace, we ensure subsequent data analysis steps do not need to deal with high-dimensional data.
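To make the dimension dependence concrete, the following is a minimal numpy sketch (ours, not from the paper) of private mean estimation via the standard Gaussian mechanism; the calibration below is the textbook one for $\varepsilon \leq 1$, and all constants are illustrative.

```python
# Sketch: per-coordinate error of private mean estimation on [0,1]^d.
# Illustrative assumptions (not from the paper): Gaussian mechanism with the
# classical calibration sigma = sqrt(2 ln(1.25/delta)) * sensitivity / eps,
# valid for eps <= 1; the L2 sensitivity of the mean over [0,1]^d is sqrt(d)/n.
import numpy as np

def private_mean(X, eps=1.0, delta=1e-6):
    """(eps, delta)-DP estimate of the mean of the rows of X, each in [0,1]^d."""
    n, d = X.shape
    sensitivity = np.sqrt(d) / n  # replacing one row moves the mean by <= sqrt(d)/n in L2
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / eps
    return X.mean(axis=0) + np.random.normal(0, sigma, size=d)

rng = np.random.default_rng(0)
n = 2000
for d in [10, 100, 1000]:
    X = rng.uniform(size=(n, d))
    err = np.abs(private_mean(X) - X.mean(axis=0)).mean()
    print(d, err)  # per-coordinate error grows like sqrt(d)/n
```

Running this shows the per-coordinate error growing like $\sqrt{d}/n$, which is exactly the cost the algorithms in this paper avoid when the data is (nearly) low-dimensional.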
### 1.1 Our Contributions: Privately Learning Subspaces (Exact Case)

We first consider the exact case, where the data $X_1, \dots, X_n \in \mathbb{R}^d$ is assumed to lie in a $k$-dimensional subspace (rather than merely being near it), i.e., $\mathrm{rank}(A) = k$, where $A = \sum_{i=1}^n X_i X_i^T \in \mathbb{R}^{d \times d}$. In this case, we can also recover the subspace exactly. However, we must also make some non-degeneracy assumptions. We want to avoid a pathological input dataset such as the following: suppose $X_1, \dots, X_k$ are linearly independent, but $X_k = X_{k+1} = X_{k+2} = \dots = X_n$. While we can easily reveal the repeated data point, we cannot reveal anything about the other points due to the privacy constraint.

A natural non-degeneracy assumption would be to assume that the data points are in general position, that is, that there are no non-trivial linear dependencies among the data points. This means that every set of $k$ data points spans the subspace or, equivalently, no subspace of dimension $k-1$ contains more than $k-1$ data points. This is a very natural assumption: if the data consists of $n$ samples from a continuous distribution on the subspace, then it holds with probability 1. We relax this assumption slightly and assume that no subspace of dimension $k-1$ contains more than $\ell$ data points. We also assume that all points are non-zero. Note that we define subspaces to pass through the origin; our results can easily be extended to affine subspaces.

**Theorem 1.1** (Main Result: Exact Case). For all $n, d, k, \ell \in \mathbb{N}$ and $\varepsilon, \delta > 0$ satisfying $n \geq O\left(\frac{\ell + \log(1/\delta)}{\varepsilon}\right)$, there exists a randomized algorithm $M : \mathbb{R}^{d \times n} \to \mathcal{S}^d_k$ satisfying the following. Here $\mathcal{S}^d_k$ denotes the set of all $k$-dimensional subspaces of $\mathbb{R}^d$.

- $M$ is $(\varepsilon, \delta)$-differentially private with respect to changing one column of its input.
- Let $X = (X_1, \dots, X_n) \in \mathbb{R}^{d \times n}$. Suppose there exists a $k$-dimensional subspace $S^* \in \mathcal{S}^d_k$ that contains all but $\ell$ of the points, i.e., $|\{i \in [n] : X_i \in S^*\}| \geq n - \ell$. Further suppose that any $(k-1)$-dimensional subspace contains at most $\ell$ points, i.e., for all $S \in \mathcal{S}^d_{k-1}$, we have $|\{i \in [n] : X_i \in S\}| \leq \ell$. Then $\mathbb{P}[M(X) = S^*] = 1$.

The parameter $\ell$ in Theorem 1.1 can be thought of as a robustness parameter. Ideally the data points are in general position, in which case $\ell = k - 1$. If a few points are corrupted, then we increase $\ell$ accordingly; our algorithm can tolerate the corruption of a small constant fraction of the data points. Theorem 1.1 is optimal in the sense that $n \geq \Omega\left(\frac{\ell + \log(1/\delta)}{\varepsilon}\right)$ samples are required.
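The following non-private numpy sketch (illustrative only; this is not the DP algorithm of Theorem 1.1) shows the exact-case setup: points sampled from a continuous distribution on a $k$-dimensional subspace are in general position, and the subspace is recoverable from the top-$k$ singular vectors.

```python
# Sketch of the exact-case setup (non-private baseline, not the DP algorithm):
# points drawn from a continuous distribution on a k-dimensional subspace of
# R^d are in general position with probability 1, so any k of them span it.
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 50, 3, 200
B = np.linalg.qr(rng.normal(size=(d, k)))[0]   # orthonormal basis of the true subspace S*
X = B @ rng.normal(size=(k, n))                # n points lying exactly in S*

A = X @ X.T
print(np.linalg.matrix_rank(A))                # k: the data spans exactly S*

# General position: any k points span S*, so any (k-1)-dim subspace holds <= k-1 points.
subset = X[:, rng.choice(n, size=k, replace=False)]
print(np.linalg.matrix_rank(subset))           # k

# Non-privately, S* is recovered from the top-k left singular vectors of X.
U = np.linalg.svd(X, full_matrices=False)[0][:, :k]
Pi_true, Pi_hat = B @ B.T, U @ U.T
print(np.linalg.norm(Pi_hat - Pi_true, 2))     # ~0 up to numerical error
```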
### 1.2 Our Contributions: Privately Learning Subspaces (Approximate Case)

Next we turn to the substantially more challenging approximate case, where the data $X_1, \dots, X_n \in \mathbb{R}^d$ is assumed to be close to a $k$-dimensional subspace, but is not assumed to be contained within that subspace. Our algorithm for the exact case is robust to changing a few points, but very brittle if we change all the points by a little bit. Tiny perturbations of the data points (due to numerical errors or measurement imprecision) could push a point outside the subspace, which would cause the algorithm to fail. Thus it is important for us to cover the approximate case, and our algorithm for the approximate case is entirely different from our algorithm for the exact case.

The approximate case requires us to precisely quantify how close the input data and our output are to the subspace, and we also need to make quantitative non-degeneracy assumptions. It is easiest to formulate this via a distributional assumption: we will assume that the data comes from a Gaussian distribution whose covariance matrix has a certain eigenvalue gap. This is a strong assumption, and we emphasize that it is only for ease of presentation; our algorithm works under weaker assumptions. Furthermore, we stress that the differential privacy guarantee is worst-case and does not depend on any distributional assumptions.

We assume that the data is drawn from a multivariate Gaussian $\mathcal{N}(0, \Sigma)$. Let $\lambda_1(\Sigma) \geq \lambda_2(\Sigma) \geq \dots \geq \lambda_d(\Sigma)$ be the eigenvalues of $\Sigma \in \mathbb{R}^{d \times d}$. We assume that there are $k$ large eigenvalues $\lambda_1(\Sigma), \dots, \lambda_k(\Sigma)$ (these represent the "signal" we want) and $d - k$ small eigenvalues $\lambda_{k+1}(\Sigma), \dots, \lambda_d(\Sigma)$ (these are the "noise"). Our goal is to recover the subspace spanned by the eigenvectors corresponding to the $k$ largest eigenvalues $\lambda_1(\Sigma), \dots, \lambda_k(\Sigma)$. Our assumption is that there is a large multiplicative gap between the large and small eigenvalues. Namely, we assume $\frac{\lambda_{k+1}(\Sigma)}{\lambda_k(\Sigma)} \leq \frac{1}{\mathrm{poly}(d)}$.

**Theorem 1.2** (Main Result: Approximate Case). For all $n, d, k \in \mathbb{N}$ and $\alpha, \gamma, \varepsilon, \delta > 0$ satisfying
$$n \geq \Theta\!\left(\frac{k\log(1/\delta)}{\varepsilon} + \ln(1/\delta)\,\ln\!\big(\ln(1/\delta)/\varepsilon\big)\right) \quad \text{and} \quad \gamma^2 \leq \Theta\!\left(\frac{\varepsilon\,\alpha^2 n}{d^2 k \log(1/\delta)} \cdot \min\left\{\frac{1}{k},\, \frac{1}{\log(k\log(1/\delta)/\varepsilon)}\right\}\right),$$
there exists an algorithm $M : \mathbb{R}^{d \times n} \to \mathcal{S}^d_k$ satisfying the following. Here $\mathcal{S}^d_k$ is the set of all $k$-dimensional subspaces of $\mathbb{R}^d$ represented as projection matrices, i.e., $\mathcal{S}^d_k = \{\Pi \in \mathbb{R}^{d \times d} : \Pi^2 = \Pi = \Pi^T,\ \mathrm{rank}(\Pi) = k\}$.

- $M$ is $(\varepsilon, \delta)$-differentially private with respect to changing one column of its input.
- Let $X_1, \dots, X_n$ be independent samples from $\mathcal{N}(0, \Sigma)$. Let $\lambda_1(\Sigma) \geq \lambda_2(\Sigma) \geq \dots \geq \lambda_d(\Sigma)$ be the eigenvalues of $\Sigma \in \mathbb{R}^{d \times d}$. Suppose $\lambda_{k+1}(\Sigma) \leq \gamma^2 \cdot \lambda_k(\Sigma)$. Let $\Pi \in \mathcal{S}^d_k$ be the projection matrix onto the subspace spanned by the eigenvectors corresponding to the $k$ largest eigenvalues of $\Sigma$. Then $\mathbb{P}[\|M(X) - \Pi\| \leq \alpha] \geq 0.7$.

The sample complexity of our algorithm, $n = O(k \log(1/\delta)/\varepsilon)$, is independent of the ambient dimension $d$; this is ideal. However, there is a polynomial dependence on $d$ in $\gamma$, which controls the multiplicative eigenvalue gap. This multiplicative eigenvalue gap is a strong assumption, but it is also a necessary one if we want the sample complexity $n$ to be independent of the dimension $d$. In fact, it is necessary even without the differential privacy constraint [CZ16]. That is, if we did not assume an eigenvalue gap that depends polynomially on the ambient dimension $d$, then it would be impossible to estimate the subspace with sample complexity $n$ independent of the ambient dimension $d$, even in the non-private setting.

Our algorithm is based on the subsample-and-aggregate framework [NRS07] and a differentially private histogram algorithm. These methods are generally quite robust, and thus our algorithm is, too. For example, our algorithm can tolerate $o(n/k)$ input points being corrupted arbitrarily. We also believe that our algorithm's utility guarantee is robust to relaxing the Gaussianity assumption. All that we require in the analysis is that the empirical covariance matrix of a few samples from the distribution is sufficiently close to its expectation $\Sigma$ with high probability.

### 1.3 Related Work

To the best of our knowledge, the problem of privately learning subspaces, as we formulate it, has not been studied before. However, a closely related line of work is on private principal component analysis (PCA) and low-rank approximations. We briefly discuss this extensive line of work below, but first we note that, in our setting, all of these techniques have a sample complexity $n$ that grows polynomially with the ambient dimension $d$. Thus, they do not evade privacy's curse of dimensionality. However, we make a stronger assumption than these prior works: namely, we assume a large multiplicative eigenvalue gap. (Many of the prior works consider an additive eigenvalue gap, which is a weaker assumption.)
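The role of the multiplicative eigenvalue gap can be seen in a small non-private simulation; all constants below are illustrative assumptions, not the paper's.

```python
# Sketch: how the eigenvalue gap gamma^2 controls non-private subspace recovery.
# Illustrative numbers only; this is not the private algorithm.
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 100, 5, 60

def subspace_error(gamma2):
    # Covariance with top-k eigenvalues 1 and the rest gamma^2 (cf. Remark 3.2).
    Q = np.linalg.qr(rng.normal(size=(d, d)))[0]
    evals = np.concatenate([np.ones(k), gamma2 * np.ones(d - k)])
    Sigma = (Q * evals) @ Q.T
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n).T  # d x n samples
    Pi = Q[:, :k] @ Q[:, :k].T                                 # true top-k projection
    U = np.linalg.eigh(X @ X.T)[1][:, -k:]                     # empirical top-k eigenvectors
    return np.linalg.norm(U @ U.T - Pi, 2)

for gamma2 in [1e-1, 1e-3, 1e-6]:
    print(gamma2, subspace_error(gamma2))  # ||Pi_hat - Pi|| shrinks as the gap widens
```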
There has been a lot of interest in private PCA, matrix completion, and low-rank approximation. One motivation for this is the infamous Netflix Prize, which can be interpreted as a matrix completion problem. The competition was cancelled after researchers showed that the public training data revealed the private movie-viewing histories of many of Netflix's customers [NS06]. Thus privacy is a real concern for matrix analysis tasks. Many variants of these problems have been considered: some provide approximations to the data matrix $X = (X_1, \dots, X_n) \in \mathbb{R}^{d \times n}$; others approximate the covariance matrix $A = \sum_{i=1}^n X_i X_i^T \in \mathbb{R}^{d \times d}$ (as we do). There are also different forms of approximation: we can either produce a subspace or an approximation to the entire matrix, and the approximation can be measured by different norms (we consider the operator norm between projection matrices). Importantly, we define differential privacy to allow one data point $X_i$ to be changed arbitrarily, whereas most of the prior work assumes a bound on the norm of the change, or even assumes that only one coordinate of one vector can be changed. In the discussion below we focus on the techniques that have been considered for these problems, rather than the specific results and settings.

[DTTZ14] consider the simple algorithm which adds independent Gaussian noise to each of the entries of the covariance matrix $A$, and then performs the analysis on the noisy matrix. (In fact, this algorithm predates the development of differential privacy [BDMN05] and was also analyzed under differential privacy by McSherry and Mironov [MM09] and Chaudhuri, Sarwate, and Sinha [CSS12].) This simple algorithm is versatile, and several bounds are provided for the accuracy of the noisy PCA. The downside is that a polynomial dependence on the ambient dimension $d$ is inherent; indeed, they prove a sample complexity lower bound of $n = \Omega(\sqrt{d})$ for any algorithm that identifies a useful approximation to the top eigenvector of $A$. This lower bound does not contradict our results because the relevant inputs do not satisfy our near-low-rank assumption.

[HR12] and [ABU18] apply techniques from dimensionality reduction to privately compute a low-rank approximation to the input matrix $X$. [HR13] and [HP13] use the power iteration method, with noise injected at each step, to compute low-rank approximations to the input matrix $X$. In all of these, the underlying privacy mechanism is still noise addition, and the results still require the sample complexity to grow polynomially with the ambient dimension to obtain interesting guarantees. (However, the results can be dimension-independent if we define differential privacy so that only one entry, as opposed to one column, of the matrix $X$ can be changed by 1. This is a significantly weaker privacy guarantee.)

[BBDS12] and [She19] also use tools from dimensionality reduction; they approximate the covariance matrix $A$. However, they show that the dimensionality reduction step itself provides a privacy guarantee (whereas the aforementioned results did not exploit this and relied on noise added at a later stage). [She19] analyzes two additional techniques: the addition of Wishart noise (i.e., $YY^T$, where the columns of $Y$ are independent multivariate Gaussians) and sampling from an inverse Wishart distribution (which has a Bayesian interpretation).
[CSS12], [KT13], [WSC+16], and [ADK+18] apply variants of the exponential mechanism [MT07] to privately select a low-rank approximation to the covariance matrix $A$. This method is nontrivial to implement and analyse, but it ultimately requires the sample complexity to grow polynomially in the ambient dimension.

[GGB18] exploit smooth sensitivity [NRS07] to release a low-rank approximation to the matrix $A$. This allows adding less noise than the worst-case sensitivity would require, under an eigenvalue gap assumption. However, the sample complexity $n$ is polynomial in the dimension $d$.

**Limitations of Prior Work.** Given the great variety of techniques and analyses that have been applied to differentially private matrix analysis problems, what is missing? We see that almost all of these techniques are ultimately based on some form of noise addition or the exponential mechanism. With the singular exception of the techniques of Sheffet [She19], all of these prior techniques satisfy pure (a.k.a. pointwise, i.e., $(\varepsilon, \delta)$-differential privacy with $\delta = 0$) or concentrated differential privacy [BS16]. This is enough to conclude that these techniques cannot yield the dimension-independent guarantees that we seek. No amount of postprocessing or careful analysis can avoid this limitation. This is because pure and concentrated differential privacy have strong group privacy properties, which means packing lower bounds [HT10] apply.

We briefly sketch why concentrated differential privacy is incompatible with dimension-independent guarantees. Let the input be $X_1 = X_2 = \dots = X_n = \xi/\sqrt{d}$ for a uniformly random $\xi \in \{-1, +1\}^d$. That is, the input is one random point repeated $n$ times. If $M$ satisfies $O(1)$-concentrated differential privacy, then it satisfies the mutual information bound $I(M(X); X) \leq O(n^2)$ [BS16]. But if $M$ provides a meaningful approximation to $X$ or $A = XX^T$, then we must be able to recover an approximation to $\xi$ from its output, whence $I(M(X); X) \geq \Omega(d)$, as the entropy of $X$ is $d$ bits. This gives a lower bound of $n \geq \Omega(\sqrt{d})$, even though $X$ and $A$ have rank $k = 1$.

The above example shows that, even under the strongest assumptions (i.e., the data lies exactly in a rank-1 subspace), any good approximation to the subspace, to the data matrix $X$, or to the covariance matrix $A = XX^T$ must require the sample complexity $n$ to grow polynomially in the ambient dimension $d$, if we restrict to techniques that satisfy concentrated differential privacy. Almost all of the prior work in this general area is subject to this restriction. To avoid a sample complexity $n$ that grows polynomially with the ambient dimension $d$, we need fundamentally new techniques.

### 1.4 Our Techniques

For the exact case, we construct a score function for subspaces that has low sensitivity, assigns a high score to the correct subspace, and assigns a low score to all other subspaces. Then we can simply apply a GAP-MAX algorithm to privately select the correct subspace [BDRS18]. The GAP-MAX algorithm satisfies $(\varepsilon, \delta)$-differential privacy and outputs the correct subspace as long as the gap between its score and that of any other subspace is larger than $O(\log(1/\delta)/\varepsilon)$. This works even though there are infinitely many subspaces to consider, which would not be possible under concentrated differential privacy. The simplest score function would simply be the number of input points that the subspace contains. This assigns a high score to the correct subspace, but it also assigns a high score to any larger subspace that contains the correct subspace.
To remedy this, we subtract from the score the number of points contained in a strictly smaller subspace. That is, the score of a subspace $S$ is the number of points in $S$ minus the maximum, over all subspaces $S' \subsetneq S$, of the number of points contained in $S'$. This GAP-MAX approach easily solves the exact case, but it does not readily extend to the approximate case: if we count points near to the subspace, rather than in it, then (infinitely) many subspaces will have a high score, which violates the assumptions needed for GAP-MAX to work. Thus we use a completely different approach for the approximate case.

We apply the subsample-and-aggregate paradigm of [NRS07]. That is, we split the dataset $X_1, \dots, X_n$ into $n/O(k)$ sub-datasets, each of size $O(k)$. We use each sub-dataset to compute an approximation to the subspace by doing a (non-private) PCA on the sub-dataset. Let $\Pi$ be the projection matrix onto the correct subspace and $\Pi_1, \dots, \Pi_{n/O(k)}$ the projection matrices onto the approximations derived from the sub-datasets. With high probability, $\|\Pi_j - \Pi\|$ is small for most $j$. (Exactly how small depends on the eigengap.)

Now we must privately aggregate the projection matrices $\Pi_1, \dots, \Pi_{n/O(k)}$ into a single projection matrix. Rather than directly trying to aggregate the projection matrices, we pick a set of reference points, project them onto the subspaces, and then aggregate the projected points. We draw $p_1, \dots, p_{O(k)}$ independently from a standard spherical Gaussian. Then $\|\Pi_j p_i - \Pi p_i\| \leq \|\Pi_j - \Pi\| \cdot O(\sqrt{k})$ is also small for all $i$ and most $j$. We wish to privately approximate $\Pi p_i$, and to do this we have $n/O(k)$ points $\Pi_j p_i$, most of which are close to $\Pi p_i$. This is now a location or mean estimation problem, which we can solve privately. Thus we obtain points $\hat{p}_i$ such that $\|\hat{p}_i - \Pi p_i\|$ is small for all $i$. From a PCA of these points we can obtain a projection $\hat{\Pi}$ with $\|\hat{\Pi} - \Pi\|$ being small, as required.

Finally, we discuss how to privately obtain $(\hat{p}_1, \hat{p}_2, \dots, \hat{p}_{O(k)})$ from $(\Pi_1 p_1, \dots, \Pi_1 p_{O(k)}), \dots, (\Pi_{n/O(k)} p_1, \dots, \Pi_{n/O(k)} p_{O(k)})$. It is better here to treat $(\hat{p}_1, \hat{p}_2, \dots, \hat{p}_{O(k)})$ as a single vector in $\mathbb{R}^{O(kd)}$, rather than as $O(k)$ vectors in $\mathbb{R}^d$. We split $\mathbb{R}^{O(kd)}$ into cells and then run a differentially private histogram algorithm. If we construct the cells carefully, then for most $j$ we have that $(\Pi_j p_1, \dots, \Pi_j p_{O(k)})$ is in the same histogram cell as the desired point $(\Pi p_1, \dots, \Pi p_{O(k)})$. The histogram algorithm will thus identify this cell, and we take an arbitrary point from this cell as our estimate $(\hat{p}_1, \hat{p}_2, \dots, \hat{p}_{O(k)})$. The differentially private histogram algorithm is run over exponentially many cells, which is possible under $(\varepsilon, \delta)$-differential privacy as long as $n/O(k) \geq O(\log(1/\delta)/\varepsilon)$. (Note that under concentrated differential privacy the histogram algorithm's sample complexity $n$ would need to depend on the number of cells and, hence, the ambient dimension $d$.)

The main technical ingredients in the analysis of our algorithm for the approximate case are matrix perturbation and concentration analysis, and the location estimation procedure using differentially private histograms. Our matrix perturbation analysis uses a variant of the Davis-Kahan theorem to show that if the empirical covariance matrix is close to the true covariance matrix, then the subspaces corresponding to the top $k$ eigenvalues of each are also close; this is applied to both the subsamples and the projection of the reference points. The matrix concentration results that we use show that the empirical covariance matrices of all the subsamples are close to the true covariance matrix. This is the only place where the multivariate Gaussian assumption arises; any distribution that concentrates well will work.
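The histogram primitive referred to above can run over exponentially (even infinitely) many cells because only non-empty cells are ever touched. Below is a minimal sketch of such a stability-based histogram; the threshold calibration is a standard choice and an illustrative assumption, not the paper's exact calibration.

```python
# Sketch of a stability-based (eps, delta)-DP histogram over unboundedly many
# cells: only non-empty cells receive noise, and noisy counts below a
# threshold ~ O(log(1/delta)/eps) are suppressed. Constants are illustrative.
import numpy as np
from collections import Counter

def dp_histogram(cells, eps, delta, sensitivity=2):
    """cells: list of hashable cell ids, one per data point.
    Returns {cell: noisy_count} for cells whose noisy count clears the threshold."""
    rng = np.random.default_rng()
    counts = Counter(cells)
    threshold = 1 + (sensitivity / eps) * np.log(1 / delta)
    released = {}
    for cell, c in counts.items():
        noisy = c + rng.laplace(0, sensitivity / eps)
        if noisy >= threshold:
            released[cell] = noisy
    return released
```

In Algorithm 2 below, the cell identifiers are the randomly offset grid cells of $\mathbb{R}^{qd}$ containing the columns of $Q$.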
## 2 Exact Case

Here, we discuss the case where all $n$ points lie exactly in a subspace $s^*$ of dimension $k$ of $\mathbb{R}^d$. Our goal is to privately output that subspace. We do it under the assumption that all strict subspaces of $s^*$ contain at most $\ell$ points. If the points are in general position, then $\ell = k - 1$, as any strictly smaller subspace has dimension $< k$ and cannot contain more points than its dimension. Let $\mathcal{S}^d_k$ be the set of all $k$-dimensional subspaces of $\mathbb{R}^d$, and let $\mathcal{S}^d$ be the set of all subspaces of $\mathbb{R}^d$. We formally define the problem as follows.

**Problem 2.1.** Assume (i) all but at most $\ell$ input points are in some $s^* \in \mathcal{S}^d_k$, and (ii) every subspace of dimension $< k$ contains at most $\ell$ points. (If the points are in general position, aside from being contained in $s^*$, then $\ell = k - 1$.) The goal is to output a representation of $s^*$. We call the $\ell$ points that do not lie in $s^*$ "adversarial points".

With the problem defined in Problem 2.1, we state the main theorem of this section.

**Theorem 2.2.** For any $\varepsilon, \delta > 0$, $\ell \geq k - 1 \geq 0$, and $n \geq O\left(\frac{\ell + \log(1/\delta)}{\varepsilon}\right)$, there exists an $(\varepsilon, \delta)$-DP algorithm $M : \mathbb{R}^{d \times n} \to \mathcal{S}^d_k$ such that, if $X$ is a dataset of $n$ points satisfying the conditions in Problem 2.1, then $M(X)$ outputs a representation of $s^*$ with probability 1.

We prove Theorem 2.2 by proving the privacy and accuracy guarantees of Algorithm 1. The algorithm performs a GAP-MAX. It assigns a score to all the relevant subspaces, that is, the subspaces spanned by the points of the dataset $X$. We show that the only subspace that has a high score is the true subspace $s^*$, and the rest of the subspaces have low scores. Then GAP-MAX outputs the true subspace successfully because of the gap between the scores of the best subspace and the second-best one. For GAP-MAX to work all the time, we define a default option in the output space that has a high score, which we call NULL. Thus, the output space is now $\mathcal{Y} = \mathcal{S}^d \cup \{\text{NULL}\}$. Also, for GAP-MAX to run in finite time, we filter $\mathcal{S}^d$ to select the finite set of subspaces that have score at least 0 on the basis of $X$. Note that this is a preprocessing step and does not violate privacy since, as we will show, all other subspaces already have zero probability of being output.

We define the score function $u : \mathcal{X}^n \times \mathcal{Y} \to \mathbb{N}$ as follows:
$$u(X, s) = \begin{cases} |X \cap s| - \sup\{|X \cap t| : t \in \mathcal{S}^d,\ t \subsetneq s\} & \text{if } s \in \mathcal{S}^d, \\ \ell + \frac{4\log(1/\delta)}{\varepsilon} + 1 & \text{if } s = \text{NULL}, \end{cases}$$
where $|X \cap s|$ denotes the number of data points lying in $s$. Note that this score function can be computed in finite time because, for any $m$ points and $i > 0$, if the points are contained in an $i$-dimensional subspace, then the subspace containing all $m$ points must lie within the (finite) set of subspaces spanned by subsets of at most $i$ of the points.
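To make the score function concrete, here is a brute-force numpy sketch of computing $u(X, s)$. It is exponential-time, and the numerical tolerance handling is an illustrative assumption (the paper works with exact membership).

```python
# Sketch: brute-force computation of the exact-case score
# u(X, s) = |X ∩ s| - max over strict subspaces t ⊊ s of |X ∩ t|.
# The maximizing t is spanned by fewer than dim(s) of the points inside s,
# so enumerating spans of small subsets suffices. Illustrative only.
import numpy as np
from itertools import combinations

def points_in_span(X, basis, tol=1e-9):
    """Indices of columns of X lying in span(basis) (basis: d x r)."""
    Q = np.linalg.qr(basis)[0]                    # orthonormalize the basis
    residual = X - Q @ (Q.T @ X)
    return np.where(np.linalg.norm(residual, axis=0)
                    <= tol * (1 + np.linalg.norm(X, axis=0)))[0]

def score(X, basis):
    """u(X, s) for the subspace s spanned by the columns of `basis`."""
    inside = points_in_span(X, basis)
    r = np.linalg.matrix_rank(basis)
    best_sub = 0
    for size in range(1, r):                      # strict subspaces have dimension < r
        for idx in combinations(inside, size):
            t = X[:, list(idx)]                   # subspace spanned by these points
            best_sub = max(best_sub, len(points_in_span(X, t)))
    return len(inside) - best_sub
```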
### 2.1 Privacy

**Lemma 2.3.** Algorithm 1 is $(\varepsilon, \delta)$-differentially private.

The proof of Lemma 2.3 closely follows the privacy analysis of GAP-MAX by [BDRS18]. The only novelty is that Algorithm 1 may output NULL in the case that the input is malformed (i.e., doesn't satisfy the assumptions of Problem 2.1). The key is that the score $u(X, s)$ has low sensitivity. Thus $\max\{0, u(X, s) - u(X, s_2) - 1\}$ also has low sensitivity. What we gain from subtracting the second-largest score and taking this maximum is that these values are also sparse: only one of them (for $s = s_1$) is nonzero. This means we can add noise to all the values without paying for composition. We prove the privacy guarantees in the full version.

### 2.2 Accuracy

We start by showing that the true subspace $s^*$ has a high score, while the rest of the subspaces have low scores.

**Algorithm 1:** DP Exact Subspace Estimator $\mathrm{DPESE}_{\varepsilon,\delta,k,\ell}(X)$

Input: Samples $X \in \mathbb{R}^{d \times n}$. Parameters $\varepsilon, \delta, k, \ell > 0$.
Output: $\hat{s} \in \mathcal{S}^d_k$.

1. Set $\mathcal{Y} \leftarrow \{\text{NULL}\}$ and sample noise $\xi(\text{NULL})$ from $\mathrm{TLap}(2, \varepsilon, \delta)$.
2. Set score $u(X, \text{NULL}) = \ell + \frac{4\log(1/\delta)}{\varepsilon} + 1$.
3. // Identify candidate outputs. For each subset $S$ of $X$ of size $k$:
   - Let $s$ be the subspace spanned by $S$; set $\mathcal{Y} \leftarrow \mathcal{Y} \cup \{s\}$.
   - Sample noise $\xi(s)$ from $\mathrm{TLap}(2, \varepsilon, \delta)$.
   - Set score $u(X, s) = |X \cap s| - \sup\{|X \cap t| : t \in \mathcal{S}^d,\ t \subsetneq s\}$.
4. // Apply GAP-MAX. Let $s_1 = \arg\max_{s \in \mathcal{Y}} u(X, s)$ be the candidate with the largest score.
5. Let $s_2 = \arg\max_{s \in \mathcal{Y} \setminus \{s_1\}} u(X, s)$ be the candidate with the second-largest score.
6. Let $\hat{s} = \arg\max_{s \in \mathcal{Y}} \left(\max\{0, u(X, s) - u(X, s_2) - 1\} + \xi(s)\right)$, where $\xi \sim \mathrm{TLap}(2, \varepsilon, \delta)$ is truncated Laplace noise.
7. Return $\hat{s}$.

**Lemma 2.4.** Under the assumptions of Problem 2.1, $u(x, s^*) \geq n - 2\ell$, and $u(x, s') \leq \ell$ for all $s' \neq s^*$.

Proof. We have $u(x, s^*) = |x \cap s^*| - |x \cap s'|$ for some $s' \in \mathcal{S}^d$ with $s' \subsetneq s^*$. The dimension of $s'$ is at most $k - 1$ and, by assumption (ii), $|x \cap s'| \leq \ell$; since $|x \cap s^*| \geq n - \ell$, this gives $u(x, s^*) \geq n - 2\ell$.

Now let $s' \in \mathcal{S}^d \setminus \{s^*\}$. There are three cases to analyse:

1. Suppose $s' \supsetneq s^*$. Then $u(x, s') \leq |x \cap s'| - |x \cap s^*| \leq \ell$, because the $\ell$ adversarial points and the $n - \ell$ non-adversarial points may not together lie in a subspace of dimension $k$.
2. Suppose $s' \subsetneq s^*$. Let $k'$ be the dimension of $s'$; clearly $k' < k$. By assumption (ii), $|x \cap s'| \leq \ell$. Then $u(x, s') = |x \cap s'| - |x \cap t| \leq \ell$ for some $t$, because the adversarial points already do not lie in $s^*$, so they will not lie in any subspace of $s^*$.
3. Suppose $s'$ is incomparable to $s^*$. Let $s'' = s' \cap s^*$. Then $u(x, s') \leq |x \cap s'| - |x \cap s''| \leq \ell$, because the adversarial points may not lie in $s''$, but could be in $s' \setminus s''$.

This completes the proof. ∎

Now, we show that the algorithm is accurate.

**Lemma 2.5.** If $n \geq 3\ell + \frac{8\log(1/\delta)}{\varepsilon} + 2$, then Algorithm 1 outputs $s^*$ for Problem 2.1.

Proof. From Lemma 2.4, we know that $s^*$ has a score of at least $n - 2\ell$, and the next best subspace can have a score of at most $\ell$. Also, the score of NULL is defined to be $\ell + \frac{4\log(1/\delta)}{\varepsilon} + 1$. This means that the gap satisfies $\max\{0, u(X, s^*) - u(X, s_2) - 1\} \geq n - 3\ell - \frac{4\log(1/\delta)}{\varepsilon} - 1$. Since the noise is bounded by $\frac{2\log(1/\delta)}{\varepsilon}$, our bound on $n$ implies that $\hat{s} = s^*$. ∎
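For illustration, here is a minimal sketch of the GAP-MAX step of Algorithm 1 with truncated Laplace noise. The rejection-sampling implementation of $\mathrm{TLap}(2, \varepsilon, \delta)$ uses the truncation bound $2\log(1/\delta)/\varepsilon$ from the proof above; the exact calibration is in [BDRS18] (cf. [GDGK20]), so this is an assumption-laden sketch, not the paper's implementation.

```python
# Sketch of GAP-MAX with truncated Laplace noise. The truncation bound
# 2 log(1/delta)/eps matches the noise bound used in the accuracy proof;
# the precise TLap(2, eps, delta) calibration is in [BDRS18, GDGK20].
import numpy as np

def tlap(eps, delta, rng, sensitivity=2.0):
    """Truncated Laplace: Laplace(sensitivity/eps) conditioned on
    |xi| <= 2 log(1/delta)/eps (rejection-sampling sketch)."""
    bound = 2 * np.log(1 / delta) / eps
    while True:
        x = rng.laplace(0, sensitivity / eps)
        if abs(x) <= bound:
            return x

def gap_max(scores, eps, delta, rng):
    """scores: dict candidate -> score (including the NULL candidate).
    Noisily selects the candidate whose gap above the second-largest
    score is largest, as in steps 4-6 of Algorithm 1."""
    second = sorted(scores.values())[-2]
    best, best_val = None, -np.inf
    for cand, sc in scores.items():
        val = max(0.0, sc - second - 1.0) + tlap(eps, delta, rng)
        if val > best_val:
            best, best_val = cand, val
    return best
```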
### 2.3 Extensions of our Algorithm

**Determining the True Dimension k.** Algorithm 1 actually does not need to know the parameter $k$, the dimension of the true subspace. That is, we could modify the algorithm to run over all subspaces, not just those of dimension $k$. Or, if we have an upper bound $k \leq k_{\max}$, then we could run the algorithm over all subspaces of dimension $\leq k_{\max}$. However, the analyst may want to know what $k$ is before spending the privacy budget necessary to run the algorithm: if the dimension $k$ turns out to be large, then it may not be worth running the algorithm. In other words, it would be desirable to compute the dimension $k$ without paying the privacy cost of computing the subspace itself by running Algorithm 1.

We now sketch an algorithm that would help determine the right choice of $k$ when it is unknown, at a low cost in sample complexity (see also the code sketch at the end of this section). We essentially use the exponential mechanism [MT07] to find $k$. For $i = 1, \dots, d$, let $\mathcal{S}_i$ be the set of subspaces of dimension $i$ that could be found by taking all combinations of $i$ points from our dataset. Then we define
$$\mathrm{SCORE}(i) := \max_{s \in \mathcal{S}_i} u(X, s).$$
Since $u$ has sensitivity 1, $\mathrm{SCORE}(i)$ does too, as taking the maximum does not increase sensitivity. We then run the exponential mechanism over all $i$ and output the best $i$. This is private by the guarantees of the exponential mechanism. The intuition for accuracy is that only one subspace will have a high score (from the accuracy argument in the previous subsection), and all other subspaces will have a low score. Therefore, the dimension of that subspace is output with high probability by the mechanism, provided the number of samples is at least $n \geq O(\log(d))/\varepsilon$, since the sensitivity of SCORE is 1. If we know an upper bound on $k$, say $k_{\max}$, we can evaluate over $i = 1, \dots, k_{\max}$ and only pay for $k_{\max}$ instead of $d$. Alternatively, if our goal is simply to determine whether $k \leq k_{\max}$ or $k > k_{\max}$, we could compute $\max_{i \leq k_{\max}} \mathrm{SCORE}(i)$ and add Laplace or Gaussian noise. If this value is large, then probably $k \leq k_{\max}$; if it is small, then probably $k > k_{\max}$. This would have minimal cost and may suffice to determine whether or not it is worth running the full algorithm.

**Computational Efficiency.** Algorithm 1 enumerates all subsets of input points of size $k$, so its running time is $O(n^k)$, which is not efficient unless $k$ is very small. We now sketch a modification to the algorithm that would make it efficient. Rather than considering all subsets, we consider a random collection of subsets. That is, we change the for-loop in Algorithm 1 so that, instead of considering each subset $S$ of $X$ of size $k$, we consider some number of randomly chosen sets of $k$ points. Intuitively, it suffices to consider such a random subset of the subspaces because any subspace $s$ with large $u(x, s)$ will be included in this subset with high probability, and any subspace $s$ with small $u(x, s)$ is not going to be output anyway. Formally, we must show that this new algorithm will, with high probability, output the same thing as the current algorithm. The failure probability is added to the $\delta$ of the $(\varepsilon, \delta)$-differential privacy guarantee. This requires changing the analysis as well as some of the parameter settings in the algorithm. Note that our algorithm for the approximate case is computationally efficient. Thus, if computational efficiency is a concern, we can simply run that algorithm.
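A sketch of the dimension-selection step described above, using the exponential mechanism over candidate dimensions; the `score` helper is the brute-force sketch from earlier in this section, and everything here is illustrative rather than the paper's implementation.

```python
# Sketch: exponential mechanism over candidate dimensions i = 1..k_max using
# SCORE(i) = max over subspaces spanned by i points of u(X, s), which has
# sensitivity 1. `score` is the brute-force helper sketched above.
import numpy as np
from itertools import combinations

def dp_select_dimension(X, k_max, eps, rng):
    """Exponential mechanism: P[output i] proportional to exp(eps * SCORE(i) / 2)."""
    n = X.shape[1]
    scores = []
    for i in range(1, k_max + 1):
        best = -np.inf
        for idx in combinations(range(n), i):      # subspaces spanned by i points
            best = max(best, score(X, X[:, list(idx)]))
        scores.append(best)
    scores = np.array(scores, dtype=float)
    probs = np.exp(eps * (scores - scores.max()) / 2)  # sensitivity-1 utility
    probs /= probs.sum()
    return rng.choice(np.arange(1, k_max + 1), p=probs)
```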
## 3 Approximate Case

In this section, we discuss the case where the data approximately lies in a $k$-dimensional subspace of $\mathbb{R}^d$. We make a Gaussian distributional assumption, where the covariance is approximately $k$-dimensional, though the results could be extended to distributions with heavier tails using the right concentration inequalities. We formally define the problem:

**Problem 3.1.** Let $\Sigma \in \mathbb{R}^{d \times d}$ be a symmetric, PSD matrix of rank at least $k \in \{1, \dots, d\}$, and let $0 < \gamma \leq 1$ be such that $\frac{\lambda_{k+1}}{\lambda_k} \leq \gamma^2$. Suppose $\Pi$ is the projection matrix corresponding to the subspace spanned by the eigenvectors of $\Sigma$ corresponding to the eigenvalues $\lambda_1, \dots, \lambda_k$. Given sample access to $\mathcal{N}(0, \Sigma)$ and $0 < \alpha < 1$, output a projection matrix $\hat{\Pi}$ such that $\|\Pi - \hat{\Pi}\| \leq \alpha$.

We solve Problem 3.1 under the constraint of $(\varepsilon, \delta)$-differential privacy. Throughout this section, we refer to the subspace spanned by the top $k$ eigenvectors of $\Sigma$ as the true or actual subspace, and $\|\cdot\|$ denotes the operator norm.

**Remark 3.2.** We scale the eigenvalues of $\Sigma$ so that $\lambda_k = 1$ and $\lambda_{k+1} \leq \gamma^2$. Also, for the purpose of the analysis, we split $\Sigma = \Sigma_k + \Sigma_{d-k}$, where $\Sigma_k$ is the covariance matrix formed by the top $k$ eigenvalues and the corresponding eigenvectors of $\Sigma$, and $\Sigma_{d-k}$ is the remainder. We also assume knowledge of (an upper bound on) $\gamma$.

Our solution is presented in Algorithm 2. The following theorem is the main result of this section.

**Theorem 3.3.** Let $\Sigma \in \mathbb{R}^{d \times d}$ be an arbitrary, symmetric, PSD matrix of rank at least $k \in \{1, \dots, d\}$, and let $0 < \gamma < 1$. Suppose $\Pi$ is the projection matrix corresponding to the subspace spanned by the eigenvectors of $\Sigma_k$. Then, given
$$\gamma^2 \leq O\!\left(\frac{\varepsilon\,\alpha^2 n}{d^2 k \ln(1/\delta)} \cdot \min\left\{\frac{1}{k},\, \frac{1}{\ln(k\ln(1/\delta)/\varepsilon)}\right\}\right)$$
such that $\lambda_{k+1}(\Sigma) \leq \gamma^2 \lambda_k(\Sigma)$, for every $\varepsilon, \delta > 0$ and $0 < \alpha < 1$, there exists an $(\varepsilon, \delta)$-DP algorithm that takes
$$n \geq O\!\left(\frac{k \log(1/\delta)}{\varepsilon} + \log(1/\delta)\log(\log(1/\delta)/\varepsilon)\right)$$
samples from $\mathcal{N}(0, \Sigma)$ and outputs a projection matrix $\hat{\Pi}$ such that $\|\Pi - \hat{\Pi}\| \leq \alpha$ with probability at least 0.7.

Algorithm 2 is a type of subsample-and-aggregate algorithm [NRS07]. Here, we consider multiple subspaces formed by points from the same Gaussian, and privately find a subspace that is close to all those subspaces. Since the subspaces formed by the points are close to the true subspace, the privately found subspace is close to the true subspace as well. A little more formally, we first sample $q$ public data points (called "reference points") from $\mathcal{N}(0, \mathbb{I})$. Next, we divide the original dataset $X$ into disjoint sub-datasets of $m$ samples each, and project all reference points onto the subspaces spanned by each subset. Now, for every reference point, we do the following. We have $t = n/m$ projections of the reference point. Using a DP histogram over $\mathbb{R}^d$, we aggregate those projections into histogram cells; with high probability, all those projections are close to one another, so they lie within one histogram cell. We output a random point from the histogram cell corresponding to the reference point. With a total of $q$ points output in this way, we finally output the projection matrix of the subspace spanned by these points. In the algorithm, $C_0$, $C_1$, and $C_2$ are universal constants.

We divide the proof of Theorem 3.3 into two parts: privacy (Lemma 3.4) and accuracy.

### 3.1 Privacy

We analyse privacy by bounding the sensitivity of the only sequence of steps invoking a differentially private mechanism, namely the steps involving DP histograms.

**Lemma 3.4.** Algorithm 2 is $(\varepsilon, \delta)$-differentially private.

Proof. Changing one point in $X$ can change only one of the $X^j$'s. This can change only one column of $Q$, which in turn can change the counts in at most two histogram cells, each by 1. Therefore, the sensitivity is 2. Because the sensitivity of the histogram step is bounded by 2, the application of the DP histogram is $(\varepsilon, \delta)$-DP. Outputting a random point from the privately found histogram cell preserves privacy by post-processing. Hence the claim. ∎

### 3.2 Accuracy

The accuracy analysis of Algorithm 2 is relatively complex and is deferred to the full version. The key ingredients come from the literature on matrix concentration bounds and matrix perturbation inequalities. We briefly outline the key steps. First, we apply matrix concentration to show that the empirical covariance matrix $X^j (X^j)^T$ of each subsample is, after rescaling, close to the true covariance matrix $\Sigma$ with high probability. Second, we apply matrix perturbation inequalities to show that the top-$k$ subspace $\Pi_j$ corresponding to the data matrix $X^j$ is close to the true top-$k$ subspace $\Pi$; it follows that most of the projected reference points $p_i^j$ are close to the desired values $\Pi p_i$. Third, we show that the aggregated projections $\hat{p}^i$ are also close to the true projections $\Pi p_i$. Finally, we apply matrix perturbation inequalities again to show that the subspace derived from the aggregated projections, $\hat{\Pi}$, is close to the true subspace $\Pi$.
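Before the pseudocode, here is a compressed end-to-end sketch of this pipeline, reusing the stability-histogram sketch from Section 1.4; the constants and the cell width are illustrative assumptions, and Algorithm 2 below gives the precise steps.

```python
# Compressed sketch of the subsample-and-aggregate pipeline behind Algorithm 2.
# Reuses the dp_histogram sketch from earlier; the cell width `ell` and all
# constants are illustrative assumptions, not the paper's calibration.
import numpy as np

def dp_approx_subspace(X, k, eps, delta, ell, rng):
    d, n = X.shape
    t = max(2, int(np.log(1 / delta) / eps))      # number of subsamples (~ C0 ln(1/delta)/eps)
    m, q = n // t, 2 * k                          # subsample size; q = O(k) reference points
    P = rng.normal(size=(d, q))                   # public reference points p_1..p_q
    cols = []
    for j in range(t):                            # non-private PCA on each subsample
        Xj = X[:, j * m:(j + 1) * m]
        U = np.linalg.eigh(Xj @ Xj.T)[1][:, -k:]  # top-k eigenvectors
        cols.append(((U @ U.T) @ P).T.ravel())    # (Pi_j p_1, ..., Pi_j p_q) in R^{qd}
    lam = rng.uniform()                           # random grid offset
    cell_ids = [tuple(np.floor(v / ell - lam).astype(int)) for v in cols]
    released = dp_histogram(cell_ids, eps, delta) # DP histogram over grid cells of R^{qd}
    if not released:
        return None                               # malformed input: fail gracefully
    winner = max(released, key=released.get)
    p_hat = (np.array(winner) + lam) * ell        # a corner of the winning cell
    Phat = p_hat.reshape(q, d).T                  # d x q matrix of aggregated projections
    U = np.linalg.svd(Phat, full_matrices=False)[0][:, :k]
    return U @ U.T                                # rank-k projection matrix Pi_hat
```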
**Algorithm 2:** DP Approximate Subspace Estimator $\mathrm{DPASE}_{\varepsilon,\delta,\alpha,\gamma,k}(X)$

Input: Samples $X_1, \dots, X_n \in \mathbb{R}^d$. Parameters $\varepsilon, \delta, \alpha, \gamma, k > 0$.
Output: Projection matrix $\hat{\Pi} \in \mathbb{R}^{d \times d}$ of rank $k$.

1. Set parameters: $t \leftarrow C_0 \ln(1/\delta)/\varepsilon$, $m \leftarrow n/t$, $q \leftarrow C_1 k$, $\ell \leftarrow C_2 \gamma$.
2. Sample reference points $p_1, \dots, p_q$ from $\mathcal{N}(0, \mathbb{I})$ independently.
3. // Subsample from X and form projection matrices. For $j = 1, \dots, t$:
   - Let $X^j = (X_{(j-1)m+1}, \dots, X_{jm}) \in \mathbb{R}^{d \times m}$.
   - Let $\Pi_j \in \mathbb{R}^{d \times d}$ be the projection matrix onto the subspace spanned by the eigenvectors of $X^j (X^j)^T \in \mathbb{R}^{d \times d}$ corresponding to the largest $k$ eigenvalues.
   - For $i = 1, \dots, q$: set $p_i^j \leftarrow \Pi_j p_i$.
4. // Create histogram cells with a random offset. Let $\lambda$ be a random number in $[0, 1)$. Divide $\mathbb{R}^{qd}$ into $\Omega = \{\dots, [\lambda\ell + i\ell, \lambda\ell + (i+1)\ell), \dots\}^{qd}$ for $i \in \mathbb{Z}$, and let each disjoint cell of side length $\ell$ be a histogram bucket.
5. // Perform private aggregation of subspaces. For each $i \in [q]$, let $Q_i \in \mathbb{R}^{d \times t}$ be the dataset whose column $j$ is $p_i^j$. Let $Q \in \mathbb{R}^{qd \times t}$ be the vertical concatenation of all the $Q_i$'s in order.
6. Run an $(\varepsilon, \delta)$-DP histogram over $\Omega$ using $Q$ to get a cell $\omega \in \Omega$ that contains at least $t/2$ of the columns of $Q$; if no such $\omega$ exists, return $\perp$.
7. // Return the subspace. Let $\hat{p} = (\hat{p}_1, \dots, \hat{p}_d, \dots, \hat{p}_{(q-1)d+1}, \dots, \hat{p}_{qd})$ be a random point in $\omega$. For each $i \in [q]$, let $\hat{p}^i = (\hat{p}_{(i-1)d+1}, \dots, \hat{p}_{id})$.
8. Let $\hat{\Pi}$ be the projection matrix onto the top-$k$ subspace of $(\hat{p}^1, \dots, \hat{p}^q)$.
9. Return $\hat{\Pi}$.

### 3.3 Handling Unknown Dimension k

Our algorithm requires knowing the parameter $k$, which determines the dimension of the subspace and corresponds to the location of the eigenvalue gap. What if we only have an upper bound $k \leq k_{\max}$? In this case, we can use generic techniques to select the parameter $k$ [LT19, PS21]. These methods repeat our algorithm multiple times with different settings of the parameter $k$ and output the result with the highest self-reported score. The key is that these methods incur only a constant-factor overhead in the parameter $\varepsilon$, despite repeating the algorithm a super-constant number of times. (They do, however, incur a $\mathrm{poly}(k)$-factor overhead in the parameter $\delta$.) To apply these results, we must modify our algorithm to also output a quality score for the subspace it has computed, such that if the right parameter $k$ is chosen we have a high score, and if the wrong parameter is chosen we have a low score. To compute this score we take auxiliary samples (i.e., we add to the sample complexity and ensure that these samples are independent from the generated subspaces). Given the projection matrix $\hat{\Pi}$ returned by our algorithm and an auxiliary sample $x$, we compute the projection $\hat{\Pi}x$ and the orthogonal component $x - \hat{\Pi}x$. Our success metric simply counts how many auxiliary samples satisfy $\|x - \hat{\Pi}x\| \leq \hat{\gamma}\|x\|$ for some threshold $\hat{\gamma}$.
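A sketch of this success metric follows; the threshold $\hat{\gamma}$ is a tuning parameter, and the helper name is ours.

```python
# Sketch of the self-reported quality score used when selecting k: count how
# many held-out samples are nearly captured by the returned subspace.
import numpy as np

def subspace_score(Pi_hat, X_aux, gamma_hat):
    """Number of auxiliary samples x (columns of X_aux) with
    ||x - Pi_hat x|| <= gamma_hat * ||x||."""
    residual = X_aux - Pi_hat @ X_aux
    return int(np.sum(np.linalg.norm(residual, axis=0)
                      <= gamma_hat * np.linalg.norm(X_aux, axis=0)))
```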
## Funding Disclosure

V.S. is currently at the University of Waterloo, supported by an NSERC Discovery Grant, and was previously at Northeastern University, supported by NSF grants CCF-1750640, CNS-1816028, and CNS-1916020. V.S. was an intern at IBM during part of this work. T.S. is currently employed by Google, but was employed by IBM during part of this work.

## References

[ABU18] Raman Arora, Vladimir Braverman, and Jalaj Upadhyay. Differentially private robust low-rank approximation. In Advances in Neural Information Processing Systems, 2018.

[ACG+16] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.

[ADK+18] Kareem Amin, Travis Dick, Alex Kulesza, Andres Munoz Medina, and Sergei Vassilvitskii. Private covariance estimation via iterative eigenvector sampling. In 2018 NIPS Workshop in Privacy-Preserving Machine Learning, volume 250, 2018.

[BBDS12] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. The Johnson-Lindenstrauss transform itself preserves differential privacy. In Proceedings of the 53rd Annual IEEE Symposium on Foundations of Computer Science, FOCS '12, pages 410–419, Washington, DC, USA, 2012. IEEE Computer Society.

[BBNS19] Jaroslaw Błasiok, Mark Bun, Aleksandar Nikolov, and Thomas Steinke. Towards instance-optimal private query release. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2480–2497. SIAM, 2019.

[BCM+20] Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, and Steven Wu. Private query release assisted by public data. In International Conference on Machine Learning, pages 695–703. PMLR, 2020.

[BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '05, pages 128–138, New York, NY, USA, 2005. ACM.

[BDRS18] Mark Bun, Cynthia Dwork, Guy N. Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated CDP. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC '18, pages 74–86, New York, NY, USA, 2018. ACM.

[BLR08] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to non-interactive database privacy. In STOC, 2008.

[BNS16] Mark Bun, Kobbi Nissim, and Uri Stemmer. Simultaneous private learning of multiple concepts. In Proceedings of the 7th Conference on Innovations in Theoretical Computer Science, ITCS '16, pages 369–380, New York, NY, USA, 2016. ACM.

[BS16] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Proceedings of the 14th Conference on Theory of Cryptography, TCC '16-B, pages 635–658, Berlin, Heidelberg, 2016. Springer.

[BUV14] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing, STOC '14, pages 1–10, New York, NY, USA, 2014. ACM.

[CSS12] Kamalika Chaudhuri, Anand Sarwate, and Kaushik Sinha. Near-optimal differentially private principal components. Advances in Neural Information Processing Systems, 25:989–997, 2012.

[CZ16] T. Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46, May 2016.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC '06, pages 265–284, Berlin, Heidelberg, 2006. Springer.

[DSS+15] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. Robust traceability from trace amounts. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science, FOCS '15, pages 650–669, Washington, DC, USA, 2015. IEEE Computer Society.

[DTTZ14] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing, STOC '14, pages 11–20, New York, NY, USA, 2014. ACM.

[FT20] Yu Feng and Yuhai Tu. How neural networks find generalizable solutions: Self-tuned annealing in deep learning. arXiv preprint arXiv:2001.01678, 2020.
[GARD18] Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.

[GDGK20] Quan Geng, Wei Ding, Ruiqi Guo, and Sanjiv Kumar. Tight analysis of privacy and utility tradeoff in approximate differential privacy. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 89–99. PMLR, 26–28 Aug 2020.

[GGB18] Alon Gonem and Ram Gilad-Bachrach. Smooth sensitivity based approach for differentially private PCA. In Algorithmic Learning Theory, ALT '18, pages 438–450. JMLR, Inc., 2018.

[HP13] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. arXiv preprint arXiv:1311.2495, 2013.

[HR10] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 61–70. IEEE, 2010.

[HR12] Moritz Hardt and Aaron Roth. Beating randomized response on incoherent matrices. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 1255–1268, 2012.

[HR13] Moritz Hardt and Aaron Roth. Beyond worst-case analysis in private singular vector computation. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 331–340, 2013.

[HT10] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the 42nd Annual ACM Symposium on the Theory of Computing, STOC '10, pages 705–714, New York, NY, USA, 2010. ACM.

[KRRT20] Peter Kairouz, Mónica Ribero, Keith Rush, and Abhradeep Thakurta. Fast dimension independent private AdaGrad on publicly estimated subspaces, 2020.

[KT13] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '13, pages 1395–1414, Philadelphia, PA, USA, 2013. SIAM.

[LFLY18] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838, 2018.

[LGZ+20] Xinyan Li, Qilong Gu, Yingxue Zhou, Tiancong Chen, and Arindam Banerjee. Hessian based analysis of SGD for deep nets: Dynamics and generalization. In Proceedings of the 2020 SIAM International Conference on Data Mining, pages 190–198. SIAM, 2020.

[LT19] Jingcheng Liu and Kunal Talwar. Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 298–309, 2019.

[LXT+17] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.

[MM09] Frank McSherry and Ilya Mironov. Differentially private recommender systems: Building privacy into the Netflix Prize contenders. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 627–636, 2009.

[MT07] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS '07, pages 94–103, Washington, DC, USA, 2007. IEEE Computer Society.

[NRS07] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on the Theory of Computing, STOC '07, pages 75–84, New York, NY, USA, 2007. ACM.
[NS06] Arvind Narayanan and Vitaly Shmatikov. How to break anonymity of the Netflix Prize dataset. arXiv preprint cs/0610105, 2006.

[PS21] Nicolas Papernot and Thomas Steinke. Hyperparameter tuning with Rényi differential privacy, 2021.

[She19] Or Sheffet. Old techniques in differentially private linear regression. In Algorithmic Learning Theory, pages 789–827. PMLR, 2019.

[SU17] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. The Journal of Privacy and Confidentiality, 7(2):3–22, 2017.

[Ull15] Jonathan Ullman. Private multiplicative weights beyond linear queries. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 303–312, 2015.

[Vad17] Salil Vadhan. The complexity of differential privacy. In Yehuda Lindell, editor, Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, chapter 7, pages 347–450. Springer International Publishing AG, Cham, Switzerland, 2017.

[Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

[WSC+16] Lu Wei, Anand D. Sarwate, Jukka Corander, Alfred Hero, and Vahid Tarokh. Analysis of a privacy-preserving PCA algorithm using random matrix theory. In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 1335–1339. IEEE, 2016.

[ZWB20] Yingxue Zhou, Zhiwei Steven Wu, and Arindam Banerjee. Bypassing the ambient dimension: Private SGD with gradient subspace identification. arXiv preprint arXiv:2007.03813, 2020.