Differentially Private Covariance Revisited

Wei Dong, Yuting Liang, Ke Yi
{wdongac,yliangbs,yike}@cse.ust.hk
Department of Computer Science, Hong Kong University of Science and Technology

Abstract

In this paper, we present two new algorithms for covariance estimation under concentrated differential privacy (zCDP). The first algorithm achieves a Frobenius error of $O(d^{1/4}\sqrt{\mathrm{tr}}/\sqrt{n} + \sqrt{d}/n)$, where $\mathrm{tr}$ is the trace of the covariance matrix. Since $\mathrm{tr} \le 1$, this also implies a worst-case error bound of $O(d^{1/4}/\sqrt{n})$, which improves upon the standard Gaussian mechanism's $O(d/n)$ for the regime $d > \tilde{\Omega}(n^{2/3})$. Our second algorithm offers a tail-sensitive bound that can be much better on skewed data. Both algorithms are simple and efficient. Experimental results show that they offer significant improvements over prior work.

1 Introduction

Consider a dataset represented by a matrix $X \in \mathbb{R}^{d\times n}$, where each column $X_i$, $i = 1, \dots, n$, corresponds to an individual's information. As is standard in the literature, we assume that all the $X_i$'s live in $\mathbb{B}^d$, the $d$-dimensional $\ell_2$ unit ball centered at the origin. In this paper, we revisit the problem of estimating the (empirical) covariance matrix
$$\Sigma(X) := \frac{1}{n}\sum_i X_i X_i^T = \frac{1}{n} XX^T$$
under differential privacy (DP), a fundamental problem in high-dimensional data analytics and machine learning that requires little motivation. We often write $\Sigma(X)$ as $\Sigma$ when the context is clear. As with most prior work, we use the Frobenius norm $\|\tilde\Sigma - \Sigma\|_F$ to measure the error of the estimated covariance $\tilde\Sigma$. To keep the presentation focused, in the introduction we state all results under concentrated differential privacy (zCDP) [10]; extensions of our results to pure-DP are given in Appendix I.

1.1 A Trace-sensitive Algorithm

For any symmetric matrix $A$, we use $P[A]$ and $\Lambda[A]$ to denote its matrices of eigenvectors and eigenvalues, respectively, such that $A = P[A]\,\Lambda[A]\,P[A]^T$; we use $\lambda_i[A]$ to denote its $i$th largest eigenvalue. When $A = \Sigma = \Sigma(X)$, we simply write $P = P[\Sigma]$, $\Lambda = \Lambda[\Sigma]$, $\lambda_i = \lambda_i[\Sigma]$, so that $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ and $\Sigma = P\Lambda P^T$. Let $P = [P_1\ P_2\ \cdots\ P_d]$, where $P_i$ is the orthonormal basis vector corresponding to $\lambda_i$. Rudimentary linear algebra yields $\lambda_k = \frac{1}{n}\sum_i (P_k^T X_i)^2$ for $1 \le k \le d$ and $\|X_i\|_2^2 = \sum_k (P_k^T X_i)^2$ for $1 \le i \le n$. Thus, it follows that
$$\mathrm{tr}[\Sigma] = \mathrm{tr}[\Lambda] = \sum_k \lambda_k = \frac{1}{n}\sum_k\sum_i (P_k^T X_i)^2 = \frac{1}{n}\sum_i \|X_i\|_2^2.$$
That is, $0 \le \mathrm{tr}[\Lambda] \le 1$ is the average squared $\ell_2$ norm of the $X_i$'s, and we simply write it as $\mathrm{tr}$.

Recall that it is assumed that all the $X_i$'s live in $\mathbb{B}^d$. In practice, this is enforced by assuming an upper bound $B$ on the norms and scaling down all $X_i$ by $B$. As one often uses a conservatively large $B$, typical values of $\mathrm{tr}$ can be much smaller than $1$, so a trace-sensitive algorithm would be more desirable. Indeed, Amin et al. [2] take this approach, describing an algorithm with error^1 $O(d^{3/4}\sqrt{\mathrm{tr}}/\sqrt{n} + \sqrt{d}/n)$ under zCDP^2. Note that the $\sqrt{d}/n$ term is inherited from mean estimation, while the first term is the extra difficulty of covariance estimation. In this paper, we improve this term to $d^{1/4}\sqrt{\mathrm{tr}}/\sqrt{n}$ (we have a similar, albeit smaller, improvement under pure-DP; see Appendix I). Our algorithm is very simple: we first estimate $\Lambda$ using the Gaussian mechanism (this is the same as in [2, 22]), and then estimate $P$ by doing an eigendecomposition of $\Sigma$ masked with Gaussian noise. Intuitively, we obtain a $\sqrt{d}$-factor improvement over the iterative methods of [2, 22] because we can obtain all eigenvectors from one noisy $\Sigma$, while the iterative methods must allocate the privacy budget across all $d$ eigenvectors.

Figure 1: Currently known worst-case error bounds (both axes are in log scale).
Our algorithm is also more efficient, performing just two eigendecompositions and one matrix multiplication, whereas the algorithm of [2, 22] needs $O(d)$ such operations.

Implication for worst-case bounds. The covariance matrix has also been studied in the traditional worst-case setting, i.e., where the bound should only depend on $d$ and $n$. Dwork et al. [17] show that the $\ell_2$-sensitivity of $\Sigma$, i.e., $\max_{X\sim X'}\|\Sigma(X) - \Sigma(X')\|_F$, where $X \sim X'$ denotes two neighboring datasets that differ by one column, is $O(1/n)$. Thus, the standard Gaussian mechanism achieves an error of $O(d/n)$ by adding independent Gaussian noise of scale $O(1/n)$ to each of the $d^2$ entries of $\Sigma$. By taking $\mathrm{tr} = 1$, our trace-sensitive bound degenerates into $O(d^{1/4}/\sqrt{n})$. Note that the $\sqrt{d}/n$ term is dominated by $d^{1/4}/\sqrt{n}$ for $d < O(n^2)$, which is the parameter regime that allows non-trivial utility (i.e., the error is less than 1).

To better understand the situation, it is instructive to compare covariance estimation with mean estimation (where data are also drawn from the $\ell_2$ unit ball and the error is measured in $\ell_2$ norm), as the hardness of covariance estimation lies between $d$-dimensional mean estimation (only estimating the diagonal entries of $\Sigma$) and $d^2$-dimensional mean estimation (treating $\Sigma$ as a $d^2$-dimensional vector). This observation implies a lower bound of $\tilde{\Omega}(\sqrt{d}/n)$, following from the same lower bound for mean estimation [19]^3, and an upper bound of $O(d/n)$, attained by the Gaussian mechanism. For $d < O(\sqrt{n})$, Kasiviswanathan et al. [23] prove a higher lower bound^4 of $\Omega(d/n)$, which means that the complexity of covariance estimation is the same as that of $d^2$-dimensional mean estimation in the low-dimensional regime, so one cannot hope to beat the Gaussian mechanism for small $d$. However, in the high-dimensional regime, our result indicates that the covariance problem is strictly easier, due to the correlations among the $d^2$ entries of $\Sigma$. Another interesting consequence is that our error bound retains utility for $d$ up to $O(n^2)$ (utility is lost once the error reaches $\Omega(1)$, as returning a zero matrix already achieves this error). This is the highest $d$ that allows for any utility, since even mean estimation requires $d < O(n^2)$ to have utility under zCDP [19, 10]. We pictorially show the currently known (worst-case) upper and lower bounds in Figure 1. It remains an interesting open problem to close the gap for $\tilde{\Omega}(\sqrt{n}) < d < O(n^2)$.

Through private communication with Aleksandar Nikolov, it has been observed that the projection mechanism [31, 15] can also be shown to have error $O(d^{1/4}/\sqrt{n})$ when applied to the covariance problem. In Appendix E, we make this connection more explicit, while also giving an efficient implementation. However, the projection mechanism is not trace-sensitive.

^1 We use the $O$ notation to suppress the dependency on the privacy parameters and all polylogarithmic factors. We use $e$ as the base of $\log$ (unless stated otherwise) and define $\log(x) = 1$ for any $x \le e$.
^2 Their paper states the error bound under pure-DP and for estimating $XX^T$ (i.e., without the normalization by $1/n$); we show how this bound is derived from their result in Appendix C.
^3 That paper proves the lower bound under the statistical setting; in Appendix D, we show how it implies the claimed lower bound under the empirical setting.
^4 Their lower bound is under approximate-DP, and hence also holds under zCDP.
1.2 A Tail-sensitive Algorithm

A trace-sensitive bound only makes use of the average $\ell_2$ norm, which cannot capture the full distribution. Next, we design an algorithm with an error bound that more closely depends on the distribution of the norms. We characterize this distribution using the $\tau$-tail ($\mathbb{I}(\cdot)$ is the indicator function):
$$\gamma(X, \tau) := \frac{1}{n}\sum_i \|X_i\|_2^2\,\mathbb{I}(\|X_i\|_2 > \tau), \quad \tau \in [0, 1]. \qquad (1)$$
Note that $\gamma(X, \tau)$ decreases as $\tau$ increases. In particular, $\gamma(X, 0) = \mathrm{tr}$ and $\gamma(X, 1) = 0$.

A common technique to reduce noise, at the expense of some bias, is to clip all the $X_i$'s so that they have norms at most $\tau$, for some threshold $\tau$. This yields an error of $\mathrm{Noise}(X, \tau) + \gamma(X, \tau)$, where $\mathrm{Noise}(X, \tau)$ denotes the error bound of the mechanism when all the $X_i$'s have norm bounded by $\tau$, and $\gamma(X, \tau)$ is the (additional) bias caused by clipping. Opting for the better of the Gaussian mechanism and our trace-sensitive mechanism, we have
$$\mathrm{Noise}(X, \tau) = O\!\left(\min\!\left(\frac{\tau^2 d}{n},\ \frac{\tau\, d^{1/4}\sqrt{\mathrm{tr}}}{\sqrt{n}} + \frac{\tau^2\sqrt{d}}{n}\right)\right). \qquad (2)$$
The technical challenge is therefore choosing a good $\tau$ in a differentially private manner. We design a DP mechanism that chooses the optimal $\tau$ up to a polylogarithmic multiplicative factor and an exponentially small additive term. It also adaptively selects the better of the Gaussian mechanism and the trace-sensitive mechanism depending on the relationship between $d$, $n$, and a privatized $\mathrm{tr}$. More precisely, our adaptive mechanism achieves an error of
$$O\!\left(\min_\tau\big(\mathrm{Noise}(X, \tau) + \gamma(X, \tau)\big)\right) + 2^{-dn}. \qquad (3)$$
Note that this tail-sensitive bound is always no worse (modulo the $2^{-dn}$ term) than $\mathrm{Noise}(X, 1)$ (i.e., without clipping), and can be much better for certain norm distributions. In particular, the tail-sensitive bound works very well on many real datasets with skewed distributions, e.g., where most data vectors have small norms and a few have large norms. For example, suppose $d = n^{3/4}$, and a constant number of data vectors have $\ell_2$ norm $1$ while the others have norm $n^{-1/4}$. Then $\sqrt{\mathrm{tr}} = \Theta(n^{-1/4})$, so $\mathrm{Noise}(X, 1)$ takes the trace-sensitive bound, which is $O(n^{-9/16})$. On the other hand, (3) is at most $O(n^{-13/16})$ by taking $\tau = n^{-1/4}$.

2 Related Work

Mean estimation and covariance estimation are perhaps the most fundamental problems in statistics and machine learning, and how to obtain the best estimates while respecting individuals' privacy has attracted a lot of attention in recent years. Mean estimation under differential privacy is now relatively well understood, with the optimal worst-case error being $\Theta(\sqrt{d}/n)$ [19], achieved by the standard Gaussian mechanism [17]. In contrast, the covariance problem is more elusive. As indicated in Figure 1, its complexity is probably a piecewise linear (in the log-log scale) function.

When most data have norms much smaller than the upper bound given a priori, the worst-case bounds above are no longer optimal. In these cases, it is more desirable to have an error bound that is instance-specific. Clipping is a common technique for mean estimation [3, 18, 4, 34, 29], and it is known that running the Gaussian mechanism after clipping $X$ at a certain quantile of the norms of the $X_i$'s achieves instance-optimality in a certain sense [3, 18]. However, for covariance estimation, we show in Appendix F that no quantile can be the optimal clipping threshold achieving the bound in (3). Nevertheless, the bound in (3) only corresponds to the optimal clipping threshold; we cannot claim that it is instance-optimal, since $\mathrm{Noise}(\cdot)$ is not even known to be worst-case optimal.

Closely related to covariance estimation are the PCA problem and low-rank approximation.
Instead of finding all eigenvalues and eigenvectors, they only aim at finding the largest one or a few. For these problems, iterative methods [2, 22, 38, 17, 11, 36] should perform better than the Gaussian mechanism or our algorithm, both of which try to recover the full covariance matrix.

Many covariance estimation algorithms have been proposed under the statistical setting, where the $X_i$'s are i.i.d. samples drawn from a certain distribution $D$, e.g., a multivariate Gaussian [19, 9, 8, 1, 21, 28, 5, 25]. Instead of the Frobenius error, many of them adopt the Mahalanobis error $\|\tilde\Sigma - \Sigma\|_\Sigma := \|\Sigma^{-1/2}\tilde\Sigma\Sigma^{-1/2} - I\|_F$, which can be considered a normalized version of the former. It is known that $\lambda_d\|A - \Sigma\|_\Sigma \le \|A - \Sigma\|_F \le \lambda_1\|A - \Sigma\|_\Sigma$, so when $\Sigma$ is well-conditioned, i.e., $\lambda_1/\lambda_d = O(1)$, any Frobenius error directly translates to a Mahalanobis error. However, for the Mahalanobis error, the more challenging question is how to deal with an ill-conditioned $\Sigma$, for which [19, 8] have provided elegant solutions for the case where $D$ is a multivariate Gaussian. It would be interesting to see if their methods can be combined with the tail-sensitive techniques in this paper to solve this problem for other distribution families, in particular heavy-tailed distributions. For the lower bound, very recently, Kamath et al. [20] proved a lower bound for the low-dimensional regime similar to that of [23], but under the statistical setting.

3 Preliminaries

3.1 Differential Privacy

We say that $X, X' \in \mathbb{R}^{d\times n}$ are neighbors if they differ by one column, denoted $X \sim X'$.

Definition 1 (Differential Privacy (DP) [16]). For $\varepsilon > 0$ and $\delta \ge 0$, a randomized mechanism $M : \mathbb{R}^{d\times n} \to \mathcal{Y}$ satisfies $(\varepsilon, \delta)$-DP if for any $X \sim X'$ and any $S \subseteq \mathcal{Y}$, $\Pr[M(X) \in S] \le e^\varepsilon \Pr[M(X') \in S] + \delta$. In particular, we call it pure-DP if $\delta = 0$, and approximate-DP otherwise.

Definition 2 (Concentrated Differential Privacy (zCDP) [10]). For $\rho > 0$, a randomized mechanism $M : \mathbb{R}^{d\times n} \to \mathcal{Y}$ satisfies $\rho$-zCDP if for any $X \sim X'$, $D_\alpha(M(X)\,\|\,M(X')) \le \rho\alpha$ for all $\alpha > 1$, where $D_\alpha(M(X)\,\|\,M(X'))$ is the $\alpha$-Rényi divergence between $M(X)$ and $M(X')$.

The relationship between these DP definitions is as follows. Pure-DP, also written as $\varepsilon$-DP, implies $\frac{\varepsilon^2}{2}$-zCDP, which in turn implies $\big(\frac{\varepsilon^2}{2} + \varepsilon\sqrt{2\log(1/\delta)},\, \delta\big)$-DP for any $\delta > 0$.

To preserve $\varepsilon$-DP for a query $Q$, a standard mechanism is to add independent Laplace noise, with scale proportional to the (global) $\ell_1$-sensitivity of $Q$, to each dimension.

Lemma 1 (Laplace Mechanism [13]). Given $Q : \mathbb{R}^{d\times n} \to \mathbb{R}^k$, let $GS_Q := \max_{X\sim X'}\|Q(X) - Q(X')\|_1$. The mechanism $M(X) = Q(X) + \frac{GS_Q}{\varepsilon} Y$, where $Y \sim \mathrm{Lap}(1)^k$, preserves $\varepsilon$-DP.

The following composition property of $\varepsilon$-DP allows us to design algorithms in a modular fashion.

Lemma 2 (Basic Composition). If $M$ is an adaptive composition of mechanisms $M_1, M_2, \dots, M_t$, where each $M_i$ satisfies $\varepsilon_i$-DP, then $M$ satisfies $(\sum_i \varepsilon_i)$-DP.

For $\rho$-zCDP, the standard method is the Gaussian mechanism:

Lemma 3 (Gaussian Mechanism [10]). Given $Q : \mathbb{R}^{d\times n} \to \mathbb{R}^k$, let $GS_Q := \max_{X\sim X'}\|Q(X) - Q(X')\|_2$. The mechanism $M(X) = Q(X) + \frac{GS_Q}{\sqrt{2\rho}} Y$, where $Y \sim N(0, I_{k\times k})$, preserves $\rho$-zCDP.

It has been shown that the covariance matrix has an $\ell_2$-sensitivity of $\frac{\sqrt{2}}{n}$ [8]. Thus, the Gaussian mechanism for covariance, denoted GaussCov, simply adds independent Gaussian noise with scale $\frac{1}{\sqrt{\rho}\,n}$ to each entry of $\Sigma$. Considering that $\Sigma$ is symmetric, symmetric noise also suffices, which preserves the symmetry of the privatized $\Sigma$. More precisely, we draw a random noise matrix $W$ where $w_{j,k} \sim N(0,1)$ i.i.d. for $1 \le j \le k \le d$ and $w_{k,j} = w_{j,k}$, denoted $W \sim \mathrm{SGW}(d)$. Then GaussCov outputs $\tilde\Sigma_{\mathrm{Gau}} = \Sigma + \frac{1}{\sqrt{\rho}\,n} W$.
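For concreteness, the following is a minimal NumPy sketch of GaussCov as just described: draw a symmetric noise matrix $W \sim \mathrm{SGW}(d)$ and add it to $\Sigma$ with per-entry scale $1/(\sqrt{\rho}\,n)$. The helper names (`sgw`, `gauss_cov`) are ours, not from the paper's released code.

```python
import numpy as np

def sgw(d, rng):
    """Draw W ~ SGW(d): a symmetric matrix with i.i.d. N(0,1) entries on and above the diagonal."""
    W = np.triu(rng.standard_normal((d, d)))  # keep the entries with j <= k
    return W + np.triu(W, 1).T                # mirror the strict upper triangle

def gauss_cov(X, rho, rng=None):
    """GaussCov: Sigma(X) + W / (sqrt(rho) * n), assuming every column of X has L2 norm <= 1
    (so the L2-sensitivity of Sigma is sqrt(2)/n and the per-entry noise scale is 1/(sqrt(rho)*n))."""
    rng = np.random.default_rng() if rng is None else rng
    d, n = X.shape
    Sigma = X @ X.T / n
    return Sigma + sgw(d, rng) / (np.sqrt(rho) * n)
```

Using a symmetric noise matrix rather than a fully i.i.d. one keeps the privatized $\tilde\Sigma_{\mathrm{Gau}}$ symmetric, so its eigendecomposition is well defined; this matters for SeparateCov below.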
A similar composition property exists for $\rho$-zCDP.

Lemma 4 (Composition Theorem [10]). If $M$ is an adaptive composition of algorithms $M_1, M_2, \dots, M_t$, where each $M_i$ satisfies $\rho_i$-zCDP, then $M$ satisfies $(\sum_i \rho_i)$-zCDP.

3.2 The Sparse Vector Technique

The Sparse Vector Technique (SVT) [14] takes as input a sequence of scalar queries $f_1(X), f_2(X), \dots, f_t(X)$, each of sensitivity 1, and a threshold $T$. It aims to find the first query (if there is one) whose answer is approximately above $T$. See Appendix A for the detailed algorithm. The SVT has been shown to satisfy $\varepsilon$-DP with the following utility guarantee.

Lemma 5 (Extension of Theorem 3.24 in [16]). With probability at least $1 - \beta$, SVT returns a $k$ such that, for any $i < k$, $f_i(X) \le T + \frac{6}{\varepsilon}\log(2t/\beta)$, and if $k \ne t + 1$, then $f_k(X) \ge T - \frac{6}{\varepsilon}\log(2t/\beta)$.

3.3 Concentration Inequalities

Lemma 6 ([26]). Given $Y \sim N(0, I_{d\times d})$, with probability at least $1 - \beta$,
$$\|Y\|_2 \le \eta(d, \beta) := \sqrt{d + 2\sqrt{d\log(1/\beta)} + 2\log(1/\beta)}.$$

Lemma 7 ([8, 26]). Given $W \sim \mathrm{SGW}(d)$, with probability at least $1 - \beta$,
$$\|W\|_2 \le \upsilon(d, \beta) := 2\sqrt{d} + 2d^{1/6}\log^{1/3} d + \frac{6\big(1 + (\log d/d)^{1/3}\big)\sqrt{\log d}}{\sqrt{\log\big(1 + (\log d/d)^{1/3}\big)}} + 2\sqrt{2\log(1/\beta)}.$$
Also, with probability at least $1 - \beta$,
$$\|W\|_F \le \omega(d, \beta) := \sqrt{d + \log(2/\beta)}\,\big(1 + \sqrt{2(d+1)}\big) + 6\log(2/\beta).$$

Ignoring polylogarithmic factors, $\eta(d, \beta)$ and $\upsilon(d, \beta)$ are both in $O(\sqrt{d})$, while $\omega(d, \beta)$ is in $O(d)$. These concentration inequalities are very useful for error analysis. For example, the bound on $\|W\|_F$ immediately implies that GaussCov has error $\frac{1}{\sqrt{\rho}\,n}\,\omega(d, \beta) = O(d/n)$.

4 Trace-sensitive Algorithm

The state-of-the-art trace-sensitive algorithm [2] first obtains an estimate of the eigenvalues, and then iteratively finds the eigenvectors by the exponential mechanism (EM), so we denote this algorithm as EMCov. Under zCDP, it has an error of $O(d^{3/4}\sqrt{\mathrm{tr}}/\sqrt{n} + \sqrt{d}/n)$. Below, we present an algorithm that is simpler, faster, and more accurate, improving the trace-dependent term by a $\sqrt{d}$ factor.

The first step of our algorithm SeparateCov (shown in Algorithm 1) is basically the same as in EMCov, where we obtain an estimate of the eigenvalues with half of the privacy budget. [2] uses the Laplace mechanism for pure-DP; for zCDP, we use the Gaussian mechanism, which relies on the $\ell_2$-sensitivity of $\Lambda$, which we provide in Lemma 10 in Appendix B. For the eigenvectors, we use GaussCov to obtain a privatized $\tilde\Sigma_{\mathrm{Gau}}$ with the other half of the privacy budget, and perform an eigendecomposition. Finally, we assemble the eigenvalues and eigenvectors to obtain a privatized $\Sigma$.

It should be clear that, after computing $\Sigma$, SeparateCov just needs two eigendecompositions and one full matrix multiplication, plus some $O(d^2)$-time operations. On the other hand, EMCov performs $O(d)$ eigendecompositions and matrix multiplications, plus a nontrivial sampling procedure for the EM. That SeparateCov satisfies $\rho$-zCDP easily follows from the privacy of the Gaussian mechanism and the composition property. The utility is given by the following theorem:

Theorem 1. Given any $\rho > 0$, for any $X \in (\mathbb{B}^d)^n$, and any $\beta > 0$, with probability at least $1 - \beta$, SeparateCov returns a $\tilde\Sigma_{\mathrm{Sep}}$ such that
$$\|\tilde\Sigma_{\mathrm{Sep}} - \Sigma\|_F \le \frac{2^{1.25}\sqrt{\mathrm{tr}}}{\rho^{1/4}\sqrt{n}}\sqrt{\upsilon\!\left(d, \tfrac{\beta}{2}\right)} + \frac{\sqrt{2}}{\sqrt{\rho}\,n}\,\eta\!\left(d, \tfrac{\beta}{2}\right).$$

Algorithm 1 SeparateCov
Input: data $X \in (\mathbb{B}^d)^n$; privacy parameter $\rho > 0$.
1: $\Lambda \leftarrow$ the eigenvalues of $\Sigma$
2: $\tilde\Lambda_{\mathrm{Sep}} \leftarrow \Lambda + \frac{\sqrt{2}}{\sqrt{\rho}\,n}\, Y$, where $Y \sim N(0, I_{d\times d})$
3: $\tilde\Sigma_{\mathrm{Gau}} \leftarrow \mathrm{GaussCov}(X, \rho/2)$
4: $\tilde P_{\mathrm{Sep}} \leftarrow P[\tilde\Sigma_{\mathrm{Gau}}]$
5: $\tilde\Sigma_{\mathrm{Sep}} \leftarrow \tilde P_{\mathrm{Sep}}\,\tilde\Lambda_{\mathrm{Sep}}\,\tilde P_{\mathrm{Sep}}^T$
6: return $\tilde\Sigma_{\mathrm{Sep}}$
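As a companion to the pseudo-code, here is a minimal NumPy sketch of Algorithm 1 under the reconstruction above: noisy eigenvalues with scale $\sqrt{2}/(\sqrt{\rho}\,n)$ from half of the budget, eigenvectors from a GaussCov-style noisy $\Sigma$ with the other half, and then reassembly. It is a sketch of the idea, not the authors' implementation; `separate_cov` is our own name.

```python
import numpy as np

def separate_cov(X, rho, rng=None):
    """SeparateCov (Algorithm 1): privatize the eigenvalues and the eigenvectors separately,
    each with half of the zCDP budget rho, then reassemble the covariance."""
    rng = np.random.default_rng() if rng is None else rng
    d, n = X.shape
    Sigma = X @ X.T / n

    # Lines 1-2: noisy eigenvalues (the eigenvalue vector has L2-sensitivity sqrt(2)/n; budget rho/2).
    lam = np.linalg.eigvalsh(Sigma)                       # ascending order
    lam_tilde = lam + (np.sqrt(2) / (np.sqrt(rho) * n)) * rng.standard_normal(d)

    # Lines 3-4: eigenvectors of a Gaussian-masked covariance (GaussCov with budget rho/2,
    # i.e. symmetric noise with per-entry scale sqrt(2)/(sqrt(rho)*n)).
    W = np.triu(rng.standard_normal((d, d)))
    W = W + np.triu(W, 1).T
    _, P_tilde = np.linalg.eigh(Sigma + (np.sqrt(2) / (np.sqrt(rho) * n)) * W)  # also ascending

    # Line 5: reassemble; noisy eigenvalues and noisy eigenvectors are paired by rank.
    return (P_tilde * lam_tilde) @ P_tilde.T
```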
Remark. While SeparateCov strictly improves over EMCov, it does not dominate GaussCov: when $\mathrm{tr} < O(d^{3/2}/n)$, SeparateCov is better; otherwise, GaussCov is better. EMCov is better than GaussCov for a smaller trace range: $\mathrm{tr} < O(\sqrt{d}/n)$.

Theorem 1 implies our worst-case bound by taking $\mathrm{tr} \le 1$:

Theorem 2. Given any $\rho > 0$, for any $X \in (\mathbb{B}^d)^n$, and any $\beta > 0$, with probability at least $1 - \beta$, SeparateCov returns a $\tilde\Sigma_{\mathrm{Sep}}$ such that
$$\|\tilde\Sigma_{\mathrm{Sep}} - \Sigma\|_F = O\!\left(\frac{d^{1/4}}{\rho^{1/4}\sqrt{n}} + \frac{\sqrt{d}}{\sqrt{\rho}\,n}\right).$$

5 Tail-sensitive Algorithm

5.1 Clipped Covariance

Clipping is a common technique to reduce the sensitivity of functions at the expense of some bias. Given $\tau \ge 0$ and a vector $X \in \mathbb{R}^d$, let $\mathrm{Clip}(X, \tau) = \min\!\big(1, \frac{\tau}{\|X\|_2}\big)\cdot X$. Similarly, for any $X \in \mathbb{R}^{d\times n}$, $\mathrm{Clip}(X, \tau)$ denotes the matrix whose columns have been clipped to have norm at most $\tau$. Clipping can be applied to both GaussCov and SeparateCov with a given $\tau$: just run the mechanism on $\frac{1}{\tau}\mathrm{Clip}(X, \tau)$ and scale the result back by $\tau^2$. We denote the clipped versions of the two mechanisms as ClipGaussCov and ClipSeparateCov, respectively. The following lemma bounds the bias caused by clipping in terms of the $\tau$-tail as defined in (1).

Lemma 8. $\|\Sigma(X) - \Sigma(\mathrm{Clip}(X, \tau))\|_F \le \frac{1}{n}\sum_i \big(\|X_i\|_2^2 - \tau^2\big)\,\mathbb{I}(\|X_i\|_2 \ge \tau) \le \gamma(X, \tau)$.

Thus, running the better of ClipGaussCov and ClipSeparateCov yields a total error of $\mathrm{Noise}(X, \tau, \rho, \beta) + \gamma(X, \tau)$, where
$$\mathrm{Noise}(X, \tau, \rho, \beta) := \min\!\left(\frac{\tau^2}{\sqrt{\rho}\,n}\,\omega(d, \beta),\ \frac{2^{1.25}\tau\sqrt{\mathrm{tr}}}{\rho^{1/4}\sqrt{n}}\sqrt{\upsilon\!\left(d, \tfrac{\beta}{2}\right)} + \frac{\sqrt{2}\,\tau^2}{\sqrt{\rho}\,n}\,\eta\!\left(d, \tfrac{\beta}{2}\right)\right), \qquad (4)$$
which is the exact version of (2). Note that the trace-sensitive term is only scaled by $\tau$, which follows from the proof of Theorem 1 when all the $X_i$'s live in $\tau\mathbb{B}^d$.
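The clipping machinery of this subsection is easy to state in code. The sketch below implements the column-wise Clip, the $\tau$-tail $\gamma(X, \tau)$ from (1), and the generic "run on $\mathrm{Clip}(X,\tau)/\tau$, scale back by $\tau^2$" wrapper behind ClipGaussCov and ClipSeparateCov; `base_mechanism` can be either of the earlier sketches (`gauss_cov` or `separate_cov`). The helper names are ours.

```python
import numpy as np

def clip_columns(X, tau):
    """Clip(X, tau): rescale every column of X to have L2 norm at most tau."""
    norms = np.linalg.norm(X, axis=0)
    return X * np.minimum(1.0, tau / np.maximum(norms, 1e-12))

def tau_tail(X, tau):
    """gamma(X, tau) from (1): average squared norm of the columns whose norm exceeds tau."""
    sq_norms = np.sum(X ** 2, axis=0)
    return np.mean(sq_norms * (np.sqrt(sq_norms) > tau))

def clipped_mechanism(X, tau, rho, base_mechanism, rng=None):
    """Run a covariance mechanism on Clip(X, tau)/tau and scale the output back by tau^2,
    as done by ClipGaussCov and ClipSeparateCov."""
    X_clipped = clip_columns(X, tau) / tau
    return tau ** 2 * base_mechanism(X_clipped, rho, rng)
```

By Lemma 8, the extra bias introduced by this clipping step is at most $\gamma(X, \tau)$, which is exactly what `tau_tail` computes.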
Ideally, we would like to find the optimal noise-bias trade-off, i.e., achieve an error of $\min_\tau(\mathrm{Noise}(X, \tau) + \gamma(X, \tau))$. Two issues need to be addressed towards this goal. The first, minor, issue is that $\mathrm{tr}$ is sensitive, so we cannot use it directly to decide whether to use ClipGaussCov or ClipSeparateCov. This can be addressed by using a privatized upper bound of $\mathrm{tr}$. The more challenging problem is how to find the optimal $\tau$ in a DP fashion. This problem has been well studied for the clipped mean estimator [18, 3], where it can be shown that setting $\tau$ to be the $O(\sqrt{d})$-th largest $\|X_i\|_2$ results in the optimal noise-bias trade-off [18]. The problem then boils down to finding a privatized quantile, for which multiple solutions exist [18, 12, 32, 6, 37]. For the clipped mean estimator, using such a quantile of the norms results in the optimal trade-off because $\mathrm{Noise}(X, \tau)$ takes the simple form $O(\tau\sqrt{d}/n)$. In fact, if we only had ClipGaussCov, setting $\tau$ to be the $O(d)$-th largest $\|X_i\|_2$ would also yield an optimal trade-off, as ClipGaussCov is really just clipped mean estimation in $d^2$ dimensions. However, due to the trace-sensitive noise term, this is no longer the case. In Appendix F, we give examples showing that no quantile, whose rank may depend arbitrarily on $d$, $n$, $\mathrm{tr}$, can achieve an optimal trade-off, even ignoring polylogarithmic factors. This calls for a new threshold-finding mechanism, which we describe next.

5.2 Adaptive Covariance: Finding the Optimal Clipping Threshold

Our basic idea is to try successively smaller values $\tau = 1, \frac{1}{2}, \frac{1}{4}, \dots$. As we reduce $\tau$, the noise decreases while the bias increases. We should stop when they are approximately balanced, which would yield a near-optimal $\tau$. To do so in a DP manner, we need to quantify the noise and the bias.

Consider the bias first. Given a $\tau$, we divide the interval $(\tau, 1]$ into sub-intervals $(\tau, 2\tau], (2\tau, 4\tau], \dots, (\frac{1}{2}, 1]$. For any $X \in X$ such that $\|X\|_2 \in (2^s, 2^{s+1}]$, let $\hat X = \mathrm{Clip}(X, \tau)$; then by Lemma 8,
$$\|XX^T - \hat X\hat X^T\|_F \le 2^{2s+2} - \tau^2. \qquad (5)$$
That is, clipping $X$ can lead to at most $\frac{1}{n}(2^{2s+2} - \tau^2)$ bias. Besides, since $\|X\|_2 \in (2^s, 2^{s+1}]$, we have
$$2^{2s+2} - \tau^2 \le 2^{2s+2} \le 4\|X\|_2^2. \qquad (6)$$
Then, given $X$, for any $s \in \mathbb{Z}$, we define $\mathrm{Count}_s(X) := \big|\{X_i : \|X_i\|_2 \in (2^s, 2^{s+1}]\}\big|$. It is easy to see that for any $X \sim X'$, each $\mathrm{Count}_s$ differs by at most 1, and so does the sum of any subset of the $\mathrm{Count}_s$'s. We can thus define an upper bound on the bias:
$$\widehat{\mathrm{Bias}}(X, \tau) := \frac{1}{n}\sum_{\log_2(\tau) \le s < 0} \mathrm{Count}_s\cdot\big(2^{2s+2} - \tau^2\big).$$
Let $\hat X = \mathrm{Clip}(X, \tau)$. By (5) and (6), we have
$$\frac{1}{n}\|XX^T - \hat X\hat X^T\|_F \le \widehat{\mathrm{Bias}}(X, \tau) \le 4\,\gamma(X, \tau). \qquad (7)$$
By the property of the $\mathrm{Count}_s$'s, given any $\tau$, the sensitivity of $\widehat{\mathrm{Bias}}(\cdot, \tau)$ is bounded by $\frac{1}{n}$.

Now we turn to the noise. Recall that $\mathrm{Noise}(X, \tau, \rho, \beta)$ is the smaller of two parts. The first part, $\mathrm{GaussNoise}(\tau, \rho, \beta) := \frac{\tau^2}{\sqrt{\rho}\,n}\,\omega(d, \beta)$, is independent of $X$, so it can be used directly. The second part depends on $\mathrm{tr}$ and is thus sensitive. Since its sensitivity is $\frac{1}{n}$, we can easily privatize it by adding Gaussian noise of scale $\Theta\!\big(\frac{1}{\sqrt{\rho}\,n}\big)$. For technical reasons, we need an upper bound, so we further add $\Theta\!\big(\frac{\sqrt{\log(1/\beta)}}{\sqrt{\rho}\,n}\big)$ so as to obtain a privatized $\widehat{\mathrm{tr}} \ge \mathrm{tr}$. Then we set
$$\mathrm{SeparateNoise}(\widehat{\mathrm{tr}}, \tau, \rho, \beta) := \frac{2^{1.25}\tau\sqrt{\widehat{\mathrm{tr}}}}{\rho^{1/4}\sqrt{n}}\sqrt{\upsilon\!\left(d, \tfrac{\beta}{2}\right)} + \frac{\sqrt{2}\,\tau^2}{\sqrt{\rho}\,n}\,\eta\!\left(d, \tfrac{\beta}{2}\right)$$
and use
$$\widehat{\mathrm{Noise}}(\widehat{\mathrm{tr}}, \tau, \rho, \beta) := \min\big(\mathrm{GaussNoise}(\tau, \rho, \beta),\ \mathrm{SeparateNoise}(\widehat{\mathrm{tr}}, \tau, \rho, \beta)\big)$$
as a DP upper bound of $\mathrm{Noise}(X, \tau, \rho, \beta)$. Note that given $\widehat{\mathrm{tr}}$, $\widehat{\mathrm{Noise}}(\widehat{\mathrm{tr}}, \tau, \rho, \beta)$ is independent of $X$. Finally, we run SVT on the following sequence of sensitivity-1 queries with $T = 0$:
$$\mathrm{Diff}(X, \widehat{\mathrm{tr}}, \tau, \rho, \beta) := n\big(\widehat{\mathrm{Bias}}(X, \tau) - \widehat{\mathrm{Noise}}(\widehat{\mathrm{tr}}, \tau, \rho, \beta)\big), \quad \tau = 1, \tfrac{1}{2}, \dots, 2^{-dn}.$$
The SVT returns a $\tau$ that balances the bias and the noise. After finding such a $\tau$, we choose to run either GaussCov or SeparateCov by comparing $\mathrm{GaussNoise}(\tau, \rho, \beta)$ and $\mathrm{SeparateNoise}(\widehat{\mathrm{tr}}, \tau, \rho, \beta)$. As the sequence consists of $dn$ queries, SVT incurs an error of $O(\log(dn))$, which, as we will show, affects the optimality by a logarithmic factor. Meanwhile, the smallest $\tau$ we search over induces an additive $2^{-dn}$ error.

The algorithm above can almost give us the desired error bound in (3), except that one thing may go wrong: the SVT introduces an error that is a logarithmic factor larger than the optimum, but at least $\tilde\Omega(1/n)$. This would be fine as long as there is one $X_i$ with $\|X_i\|_2 \ge \tilde\Omega(1)$, so that the optimal error is $\tilde\Omega(1/n)$. However, when all the $X_i$'s have very small norms, say $1/n^2$, the $\tilde\Omega(1/n)$ error from the SVT would not preserve optimality. To address this issue, we first find the radius $\mathrm{rad}(X) = \max_i \|X_i\|_2$ and use it to clip $X$. The following lemma shows that, under DP, it is possible to find a 2-approximation of $\mathrm{rad}(X)$, plus an additive $b$, such that only $O(\log\log(1/b))$ vectors are clipped. This allows us to set $b = 2^{-dn}$ while only incurring an $O(\log dn)$ error. Nicely, these match the additive and multiplicative errors that already exist from the SVT, so there is no asymptotic degradation in the optimality.

Algorithm 2 AdaptiveCov
Input: data $X \in (\mathbb{B}^d)^n$; privacy parameter $\rho > 0$; high-probability parameter $\beta$.
1: $r \leftarrow \mathrm{PrivRadius}(X, \sqrt{\rho}/2, \beta/4, 2^{-dn})$
2: $\tilde X \leftarrow \mathrm{Clip}(X, r)$
3: $\widetilde{\mathrm{tr}} \leftarrow \frac{1}{n}\sum_i \|\tilde X_i\|_2^2$
4: $\widehat{\mathrm{tr}} \leftarrow \min\!\big(\widetilde{\mathrm{tr}} + \frac{2r^2}{\sqrt{\rho}\,n}\big(N(0,1) + 2\sqrt{\log(8/\beta)}\big),\ r^2\big)$
5: $t \leftarrow \log_2(r) + 1 - \mathrm{SVT}\big(\{\mathrm{Diff}(\tilde X, \widehat{\mathrm{tr}}, r, \tfrac{\rho}{2}, \beta),\ \mathrm{Diff}(\tilde X, \widehat{\mathrm{tr}}, \tfrac{r}{2}, \tfrac{\rho}{2}, \beta),\ \dots,\ \mathrm{Diff}(\tilde X, \widehat{\mathrm{tr}}, 2^{-dn}, \tfrac{\rho}{2}, \beta)\},\ 0,\ \sqrt{\rho}/2\big)$
6: $\tau \leftarrow \min(2^{t+1}, r)$
7: if $\mathrm{SeparateNoise}(\widehat{\mathrm{tr}}, \tau, \tfrac{\rho}{2}, \beta) \ge \mathrm{GaussNoise}(\tau, \tfrac{\rho}{2}, \beta)$ then
8: $\quad\tilde\Sigma_{\mathrm{Ada}} \leftarrow \mathrm{ClipGaussCov}(\tilde X, \tfrac{\rho}{2}, \tau)$
9: else
10: $\quad\tilde\Sigma_{\mathrm{Ada}} \leftarrow \mathrm{ClipSeparateCov}(\tilde X, \tfrac{\rho}{2}, \tau)$
11: return $\tilde\Sigma_{\mathrm{Ada}}$

Lemma 9 ([12]). For any $\varepsilon > 0$, $\beta > 0$ and $b > 0$, given $X \in (\mathbb{B}^d)^n$, with probability at least $1 - \beta$, PrivRadius returns an $r = \mathrm{PrivRadius}(X, \varepsilon, \beta, b)$ such that $r \le 2\,\mathrm{rad}(X) + b$ and
$$|\{i : \|X_i\|_2 > r\}| = O\!\left(\frac{1}{\varepsilon}\log\log\big(\mathrm{rad}(X)/b\big)\right).$$

The complete algorithm is given in Algorithm 2.
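The following Python sketch illustrates the threshold search at the heart of AdaptiveCov: the histogram-based bias estimate, an order-level stand-in for the noise estimate (constants and polylogarithmic factors dropped), and a textbook AboveThreshold-style SVT over $\tau = 1, 1/2, 1/4, \dots$. It skips the PrivRadius step (i.e., it assumes $r = 1$) and uses an ad hoc budget split, so it is an illustration of the idea rather than Algorithm 2 itself; all function names are ours.

```python
import numpy as np

def bias_hat(norms, tau, n):
    """hat{Bias}(X, tau): histogram-based upper bound on the clipping bias (sensitivity 1/n),
    using the bins (2^s, 2^{s+1}] for log2(tau) <= s < 0."""
    bias, s = 0.0, int(np.floor(np.log2(tau)))
    while 2.0 ** s < 1.0:
        count = np.sum((norms > 2.0 ** s) & (norms <= 2.0 ** (s + 1)))
        bias += count * (2.0 ** (2 * s + 2) - tau ** 2)
        s += 1
    return bias / n

def noise_hat(tr_hat, tau, d, n, rho):
    """Order-level stand-in for hat{Noise}: the better of the clipped Gaussian and
    trace-sensitive error bounds, with constants and polylog factors dropped."""
    gauss = tau ** 2 * d / (np.sqrt(rho) * n)
    separate = (tau * d ** 0.25 * np.sqrt(tr_hat) / (rho ** 0.25 * np.sqrt(n))
                + tau ** 2 * np.sqrt(d) / (np.sqrt(rho) * n))
    return min(gauss, separate)

def find_tau(X, rho, beta=0.1, rng=None):
    """Simplified threshold search: scan tau = 1, 1/2, 1/4, ... and stop, via an
    AboveThreshold-style SVT, at the first tau where the bias roughly exceeds the noise."""
    rng = np.random.default_rng() if rng is None else rng
    d, n = X.shape
    norms = np.linalg.norm(X, axis=0)

    # Ad hoc budget split: rho/4 for tr, rho/4 for the SVT, rho/2 left for the final mechanism.
    rho_tr, rho_svt, rho_mech = rho / 4, rho / 4, rho / 2

    # Privatized upper bound on tr (sensitivity 1/n), clamped to [0, 1].
    sigma = 1.0 / (np.sqrt(2 * rho_tr) * n)
    tr_hat = (norms ** 2).mean() + sigma * rng.standard_normal() + sigma * np.sqrt(2 * np.log(2 / beta))
    tr_hat = min(max(tr_hat, 0.0), 1.0)

    # Textbook AboveThreshold over the sensitivity-1 queries n*(Bias - Noise), threshold T = 0.
    eps = np.sqrt(2 * rho_svt)                 # an eps-DP SVT costs eps^2/2 in zCDP
    noisy_T = rng.laplace(scale=2.0 / eps)
    tau = 1.0
    while tau > 2.0 ** (-d * n):
        q = n * (bias_hat(norms, tau, n) - noise_hat(tr_hat, tau, d, n, rho_mech))
        if q + rng.laplace(scale=4.0 / eps) >= noisy_T:
            break
        tau /= 2.0
    return min(2 * tau, 1.0)                   # back off one step, capped at the (assumed) radius 1
```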
Its privacy follows from the privacy of PrivRadius, SVT, GaussCov, SeparateCov, and the composition theorem of zCDP; its utility is analyzed in the following theorem:

Theorem 3. Given any $\rho > 0$ and $\beta > 0$, for any $X \in (\mathbb{B}^d)^n$, with probability at least $1 - \beta$, AdaptiveCov returns a $\tilde\Sigma_{\mathrm{Ada}}$ such that
$$\|\tilde\Sigma_{\mathrm{Ada}} - \Sigma\|_F = O\!\left(\min_\tau\big(\mathrm{Noise}(X, \tau) + \gamma(X, \tau)\big)\right) + 2^{-dn},$$
where the $O$ notation hides factors that are polylogarithmic in $d$, $n$, $1/\beta$ and polynomial in $1/\rho$.

6 Experiments

We conducted experiments^5 to evaluate our algorithms on both synthetic and real-world datasets. We compare SeparateCov and AdaptiveCov against GaussCov [17] and EMCov [2]. We implemented EMCov in Python following the pseudo-code provided in [2] and the description of the sampling algorithm in [24]. We also tested CoinPress [8], but since it is designed to minimize the Mahalanobis error, it does not perform well when measured in Frobenius error. The two distance measures coincide when $\Sigma$ is well-conditioned, but in this case CoinPress degenerates into GaussCov. Therefore, we omit it from the reported results. As a baseline, we include returning a zero matrix, which has error $O(\mathrm{tr})$ and is hence a trivial trace-sensitive algorithm. When $\mathrm{rad}(X)$ is much smaller than 1, the comparison is unfair to GaussCov and EMCov, so we scale all datasets such that $0.5 \le \mathrm{rad}(X) \le 1$. As a result, we do not need the step that obtains a private radius in AdaptiveCov, either. Each experiment is repeated 50 times, and we report the average error. Furthermore, we have also conducted experiments under pure-DP; the results can be found in Appendix J.

6.1 Synthetic Datasets

We generate synthetic datasets by first following the method in [2] to obtain a matrix $X = ZU$, where $U \in \mathbb{R}^{d\times d}$ is sampled from $U(0, 1)$ and $Z \in \mathbb{R}^{n\times d}$ is sampled from $N(0, I)$. The vectors in $X$ are then adjusted to be centered at 0 and their norms scaled. In [2], the vectors are scaled to have unit $\ell_2$ norm; in our experiments, to better control $\mathrm{tr}$ and data skewness, we scale the norms so that they follow Zipf's law. More precisely, we divide the norms into $N$ bins. The number of vectors in the $k$-th bin is proportional to $1/k^s$ and their norm is $2^{k-N}$. The parameter $s$ characterizes the skewness, which we fix at $s = 3$. Note that $N = 1$ corresponds to the unit-norm case with $\mathrm{tr} = 1$.

Figure 2: Results on synthetic datasets fixing $\mathrm{tr} = 1$ (Frobenius error vs. $d$ for EMCov, GaussCov, SeparateCov, AdaptiveCov, and the zero matrix): (a) $n = 1000$; (b) $n = 4000$; (c) $n = 16000$.

Figure 3: Results on synthetic datasets as $d$, $n$, $N$ or $\rho$ varies: (a) various $\mathrm{tr}$ (via $N$); (b) various $d$; (c) various $n$; (d) various $\rho$.

The results for the $\mathrm{tr} = 1$ case are shown in Figure 2, which corresponds to the worst-case bounds. Here $\rho$ is fixed at 0.1, and we examine the error growth w.r.t. $d$ for $n = 1000, 4000, 16000$. The results generally agree with the theory: for low $d$, GaussCov is (slightly) better than SeparateCov, while the latter is much better for high $d$. AdaptiveCov is able to choose the better of the two adaptively, with a small cost due to allocating some privacy budget to estimating $\mathrm{tr}$. Actually, if AdaptiveCov is given the precondition that all norms are 1, this extra cost can be saved.

^5 The code can be found at https://github.com/hkustDB/PrivateCovariance.
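For reference, here is one way to generate data following the description above (the paper fixes the bin sizes proportionally to $1/k^s$, whereas this sketch samples the bin of each vector with those proportions; `synthetic_dataset` is our own helper, not the paper's released generator).

```python
import numpy as np

def synthetic_dataset(n, d, N, s=3, rng=None):
    """Synthetic data roughly as in Section 6.1: X = Z U with Z ~ N(0, I) entries and
    U ~ U(0, 1) entries, centred at 0, then column norms rescaled so that bin k
    (holding a ~1/k^s fraction of the vectors) has norm 2^{k-N}."""
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal((n, d))
    U = rng.uniform(0.0, 1.0, size=(d, d))
    X = (Z @ U).T                                   # d x n; columns are data points
    X -= X.mean(axis=1, keepdims=True)              # centre the dataset at 0

    # Zipf-distributed bin assignment over the N norm bins.
    weights = 1.0 / np.arange(1, N + 1) ** s
    weights /= weights.sum()
    bins = rng.choice(N, size=n, p=weights)         # bin index k-1 for each column
    target_norms = 2.0 ** (bins + 1 - N)            # norm 2^{k-N} for bin k

    norms = np.linalg.norm(X, axis=0)
    return X * (target_norms / np.maximum(norms, 1e-12))
```

Setting N = 1 recovers the unit-norm case (tr = 1), while larger N and s make the norm distribution increasingly skewed.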
Next, we vary one of the parameters while fixing the others at their default values $d = 200$, $n = 50000$, $N = 4$ and $\rho = 0.1$; the results are reported in Figure 3. The most interesting case is Figure 3(a), where we increase $N$, hence reducing $\mathrm{tr}$, which demonstrates the trace-sensitive bounds. Clearly, GaussCov is not trace-sensitive, while the other four methods are. In fact, returning the zero matrix is the best trace-sensitive algorithm when $\mathrm{tr}$ is sufficiently small. However, this may not be very meaningful in practice, as $N = 2^5$ means that most data have norm $2^{-31}$ while a few have norm 1. These few may be outliers and should be removed anyway. Figures 3(b)–(d) show that higher $d$, smaller $n$, and smaller $\rho$ all have similar effects, i.e., SeparateCov becomes better while GaussCov becomes worse, and AdaptiveCov is able to pick the better one.

Figure 4: Results on the MNIST dataset (Frobenius error vs. $\rho$): (a) $n = 6000$, $\mathrm{tr} < 1$; (b) $n = 20000$, $\mathrm{tr} < 1$; (c) $n = 60000$, $\mathrm{tr} < 1$.

6.2 Real-world Datasets

We also evaluate the algorithms on two real-world datasets. The first is the MNIST [27] dataset, which contains images of handwritten digits. We use its training set, which contains 60,000 images represented as vectors in $\mathbb{Z}_{255}^d$, where $d = 784 = 28 \times 28$. These vectors are normalized by $255\sqrt{d}$ in the experiments. We estimate $\tilde\Sigma$ using samples containing all the digits; we also estimate the $\tilde\Sigma$ corresponding to individual digits (reported in Appendix J). In the first case, $\tilde\Sigma$ can be used for further dimensionality-reduction analysis; in the second case, the individual $\tilde\Sigma$'s can be used for modelling the distributions of individual digits, which together can be used in a collective model for classification (e.g., using a mixture model).

The second dataset contains news commentary data [33] consisting of approximately 15,000 articles, each containing 500–4300 words, which we convert into vectors of various dimensions using the hashing trick implemented in the scikit-learn package. In this case, the estimated $\tilde\Sigma$ can be used, for example, to help with further feature selection for NLP models. These vectors are either normalized to have unit $\ell_2$ norm or normalized by the maximum $\ell_2$ norm.

Figure 5: Results on the news dataset (Frobenius error vs. $\rho$): (a) $d = 32$, $\mathrm{tr} = 1$; (b) $d = 128$, $\mathrm{tr} = 1$; (c) $d = 512$, $\mathrm{tr} = 1$; (d) $d = 32$, $\mathrm{tr} < 1$; (e) $d = 128$, $\mathrm{tr} < 1$; (f) $d = 512$, $\mathrm{tr} < 1$.

The experimental results on these two real datasets are shown in Figures 4 and 5, where we vary $n$, $d$, and $\rho$, respectively. From these results, we see that GaussCov never outperforms SeparateCov, except for a very small advantage in a few cases where we have $\mathrm{tr} = 1$, a low $d$, and a high $\rho$. Another interesting observation is that AdaptiveCov outperforms both GaussCov and SeparateCov in many cases, something that is not apparent on the synthetic datasets. We believe that this is because these real datasets have heavier tails than the Zipf distribution (we used $s = 3$ for Zipf), which makes the adaptive clipping-threshold selection more effective. This really demonstrates the benefits of a tail-sensitive bound.
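For concreteness, the news-commentary preprocessing described above could look as follows with scikit-learn's HashingVectorizer. The exact featurization settings used in the paper are not specified here, so treat this as an assumed, illustrative pipeline; `vectorize_articles` is our own name.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

def vectorize_articles(articles, d=512, unit_norm=True):
    """Hash a list of raw article strings into d-dimensional vectors (scikit-learn's hashing
    trick), then normalize either every vector to unit L2 norm or the whole set by the
    maximum L2 norm, as described for the news dataset."""
    vectorizer = HashingVectorizer(n_features=d, norm=None, alternate_sign=True)
    X = vectorizer.transform(articles).toarray().T        # d x n, columns are articles
    norms = np.linalg.norm(X, axis=0)
    if unit_norm:
        return X / np.maximum(norms, 1e-12)                # tr = 1
    return X / max(norms.max(), 1e-12)                     # tr <= 1
```

The unit-norm option corresponds to the $\mathrm{tr} = 1$ panels of Figure 5, and scaling by the maximum norm to the $\mathrm{tr} < 1$ panels.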
Acknowledgements

This work has been supported by HKRGC under grants 16201819, 16205420, and 16205422. We would like to thank Aleksandar Nikolov for helpful discussions on the projection mechanism, and the anonymous reviewers who have made valuable suggestions on improving the presentation of the paper.

References

[1] Ishaq Aden-Ali, Hassan Ashtiani, and Gautam Kamath. On the sample complexity of privately learning unbounded high-dimensional Gaussians. In Algorithmic Learning Theory, pages 185–216. PMLR, 2021.
[2] Kareem Amin, Travis Dick, Alex Kulesza, Andrés Munoz Medina, and Sergei Vassilvitskii. Differentially private covariance estimation. In NeurIPS, pages 14190–14199, 2019.
[3] Kareem Amin, Alex Kulesza, Andres Munoz, and Sergei Vassilvitskii. Bounding user contributions: A bias-variance trade-off in differential privacy. In International Conference on Machine Learning, pages 263–271. PMLR, 2019.
[4] Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34, 2021.
[5] Hassan Ashtiani and Christopher Liaw. Private and polynomial time algorithms for learning Gaussians and beyond. In Conference on Learning Theory, pages 1075–1076. PMLR, 2022.
[6] Hilal Asi and John C. Duchi. Instance-optimality in differential privacy via approximate inverse sensitivity mechanisms. Advances in Neural Information Processing Systems, 33, 2020.
[7] Afonso S. Bandeira and Ramon van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. The Annals of Probability, 44(4):2479–2506, 2016.
[8] Sourav Biswas, Yihe Dong, Gautam Kamath, and Jonathan Ullman. CoinPress: Practical private mean and covariance estimation. Advances in Neural Information Processing Systems, 33, 2020.
[9] Mark Bun, Gautam Kamath, Thomas Steinke, and Steven Z. Wu. Private hypothesis selection. Advances in Neural Information Processing Systems, 32, 2019.
[10] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
[11] Kamalika Chaudhuri, Anand D. Sarwate, and Kaushik Sinha. A near-optimal algorithm for differentially-private principal components. Journal of Machine Learning Research, 14, 2013.
[12] Wei Dong and Ke Yi. Universal private estimators. arXiv preprint arXiv:2111.02598, 2021.
[13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
[14] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the forty-first annual ACM Symposium on Theory of Computing, pages 381–390, 2009.
[15] Cynthia Dwork, Aleksandar Nikolov, and Kunal Talwar. Efficient algorithms for privately releasing marginals via convex relaxations. Discrete & Computational Geometry, 53(3):650–673, 2015.
[16] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[17] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: optimal bounds for privacy-preserving principal component analysis.
In Proceedings of the forty-sixth annual ACM Symposium on Theory of Computing, pages 11–20, 2014.
[18] Ziyue Huang, Yuting Liang, and Ke Yi. Instance-optimal mean estimation under differential privacy. Advances in Neural Information Processing Systems, 2021.
[19] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In Proceedings of the 32nd Annual Conference on Learning Theory, COLT '19, pages 1853–1902, 2019.
[20] Gautam Kamath, Argyris Mouzakis, and Vikrant Singhal. New lower bounds for private estimation and a generalized fingerprinting lemma. arXiv preprint arXiv:2205.08532, 2022.
[21] Gautam Kamath, Argyris Mouzakis, Vikrant Singhal, Thomas Steinke, and Jonathan Ullman. A private and computationally-efficient estimator for unbounded Gaussians. In Conference on Learning Theory, pages 544–572. PMLR, 2022.
[22] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In Proceedings of the twenty-fourth annual ACM-SIAM Symposium on Discrete Algorithms, pages 1395–1414. SIAM, 2013.
[23] Shiva Prasad Kasiviswanathan, Mark Rudelson, Adam Smith, and Jonathan Ullman. The price of privately releasing contingency tables and the spectra of random matrices with correlated rows. In Proceedings of the forty-second ACM Symposium on Theory of Computing, pages 775–784, 2010.
[24] John T. Kent, Asaad M. Ganeiber, and Kanti V. Mardia. A new unified approach for the simulation of a wide class of directional distributions. Journal of Computational and Graphical Statistics, 27(2):291–301, 2018.
[25] Pravesh Kothari, Pasin Manurangsi, and Ameya Velingker. Private robust estimation by stabilizing convex relaxations. In Conference on Learning Theory, pages 723–777. PMLR, 2022.
[26] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
[27] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits, 1998. Available online at: http://yann.lecun.com/exdb/mnist/. Last accessed: May 2022.
[28] Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. In Conference on Learning Theory, pages 1167–1246. PMLR, 2022.
[29] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.
[30] Aleksandar Nikolov. Private query release via the Johnson-Lindenstrauss transform. arXiv preprint arXiv:2208.07410, 2022.
[31] Aleksandar Nikolov, Kunal Talwar, and Li Zhang. The geometry of differential privacy: the sparse and approximate cases. In Proceedings of the forty-fifth annual ACM Symposium on Theory of Computing, pages 351–360, 2013.
[32] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM Symposium on Theory of Computing, pages 75–84, 2007.
[33] Sixth Conference on Machine Translation (WMT21). News commentary crawl v16, 2021. Available online at: https://www.statmt.org/wmt20/translation-task.html. Last accessed: May 2022.
[34] Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, and Sanjiv Kumar. AdaClip: Adaptive clipping for private SGD. arXiv preprint arXiv:1908.07643, 2019.
[35] Holger Sambale. Some notes on concentration for α-subexponential random variables. arXiv preprint arXiv:2002.10761, 2020.
[36] Vikrant Singhal and Thomas Steinke. Privately learning subspaces. Advances in Neural Information Processing Systems, 34:1312–1324, 2021.
[37] Adam Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the forty-third annual ACM Symposium on Theory of Computing, pages 813–822, 2011.
[38] Jalaj Upadhyay. The price of privacy for low-rank factorization. In NeurIPS, 2018.
[39] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.