# Smoothly Bounding User Contributions in Differential Privacy

Alessandro Epasto, Mohammad Mahdian, Jieming Mao, Vahab Mirrokni, Lijie Ren
Google Research, 111 8th Ave, New York, NY 10011
aepasto@google.com, mahdian@google.com, maojm@google.com, mirrokni@google.com, renlijie@google.com

**Abstract.** A differentially private algorithm guarantees that the input of a single user won't significantly change the output distribution of the algorithm. When a user contributes more data points, more information can be collected to improve the algorithm's performance. But at the same time, more noise might need to be added to the algorithm to keep it differentially private, and this might hurt its performance. [AKMV19] initiated the study of bounding user contributions and proposed a very natural algorithm which limits the number of samples each user can contribute to a threshold. For a better trade-off between utility and privacy, we propose a method which smoothly bounds user contributions by setting appropriate weights on data points, and apply it to estimating the mean/quantiles, linear regression, and empirical risk minimization. We show that our algorithm provably outperforms the sample limiting algorithm. We conclude with experimental evaluations which validate our theoretical results.

## 1 Introduction

The notion of differential privacy, introduced by [DMNS06], aims to capture the requirement that the output of an algorithm should not reveal much about the information provided by a single user. The classical definition of differential privacy assumes each user controls one row in the input data set, and guarantees that the removal (or change) of one row in the data set does not change the output significantly. In many applications of differential privacy, a single user might contribute more than one data point. A prominent example, which is the focus of this paper, is private machine learning, where a user often provides several points in the training data set. While the standard definition of differential privacy can still capture such settings by defining a row as the collection of all data points belonging to the same user, an important and useful nuance is lost in this translation. Most importantly, when a user contributes many data points, the algorithm designer must balance the value of the information contained in these data points against the additional noise it will have to add to the output to make it private with respect to this user.

[AKMV19] initiated the study of this problem, and proposed a natural algorithm which limits the number of samples each user can contribute to a threshold. This threshold is then optimized to strike the right balance between the error due to the noise and the bias introduced by removing the samples. This sample limiting algorithm has two drawbacks: (i) it completely discards some data points from users who have too many, and the information in those points is lost; (ii) some data points may contain more useful information than others, but the sample limiting algorithm treats all data points the same when deciding which to discard.
Our goal in this paper is to answer the following question: is it possible to significantly improve over sample limiting by bounding the contribution of each user in a way that is smoother and more careful about the information contained in each sample? To answer this question, we propose a weighted averaging method to smoothly bound user contributions. The main idea is to set appropriate weights on data points instead of completely discarding some of them.

### 1.1 Our results

In Section 3, as a warm-up, we study a simple problem: estimating the mean. For this problem, finding the optimal algorithm corresponds to finding the right weights when averaging samples. We compute the overall error of the algorithm in terms of these weights, and show how the optimal set of weights can be found. We then compare the error of such an optimal algorithm with that of the best sample limiting approach. We present examples showing that the error of the sample limiting method can be asymptotically 1.5 times higher than that of our algorithm. However, as we prove, this gap cannot exceed 4.

In Section 4, we extend the weighted averaging algorithm to empirical risk minimization by minimizing a weighted version of the empirical risk. Our main technical contribution is a weighted version of uniform convergence, which could be of independent interest. We also extend the weighted averaging algorithm to estimating quantiles in a similar way (in Appendix B). As in the warm-up problem, for ERM and estimating quantiles our weighted averaging algorithm has an advantage over the sample limiting algorithm, but the advantage is limited.

In Section 5, we study linear regression with label privacy (defined in Section 5). We show that label privacy allows us to design the weights better based on the usefulness of data points, so that the weighted averaging algorithm can have a much bigger advantage over the sample limiting algorithm. In particular, we prove that the optimal algorithm can be parameterized by a matrix $C$, and calculate the error as a function of this matrix. Next, we prove that this function is convex, and therefore one can compute the optimal $C$ in polynomial time. In Section 5.3, we study the gap between the error of our algorithm and that of the sample limiting algorithm with the best possible threshold, and prove that this gap can be unbounded. In other words, there are instances where our algorithm has an error that is better than the best sample limiting algorithm by an arbitrary factor.

Finally, in Section 6, we empirically analyze the performance of our algorithm and compare it with sample limiting, using real-world data sets as well as generated data for linear regression with label privacy. This empirical study shows that our algorithms achieve lower loss than the baseline methods (e.g., sample limiting), confirming our theoretical results. We also include in Appendix E experimental results on logistic regression using our ERM algorithm.

### 1.2 Related work

Differential privacy was proposed by the seminal work of [DMNS06]. For a detailed survey on differential privacy, see [DR14]. Differentially private linear regression and its general form, empirical risk minimization, have been well studied [CM08, CMS11, KST12, JKT12, TS13, SCS13, DJW13, JT14, BST14, Ull15, TTZ15, STU17, WLK+17, WYX17, ZZMW17, Wan18, She19a, She19b, INS+19, BFTT19, WX19, FKT20]. In particular, [WX19] studies label privacy, which is similar to the setting we have in Section 5.
These results are for the case when each user has only one data point. Motivated by federated learning, [AKMV19] initiated the study of bounding user contributions in differential privacy. [TAM19, PSY+19] study how to adaptively bound user contributions in differentially private stochastic gradient descent for federated learning. For a detailed survey on federated learning, see [KMA+19]. More broadly, our setting of each user having multiple data points can be considered a special case of personalized/heterogeneous differential privacy [JYC15, SCS15, AGK17] and is closely related to group privacy, which was introduced in [Dwo06].

## 2 Preliminaries

### 2.1 Differential privacy

We define the notion of differential privacy for an algorithm $A$ that takes as input a data set $D$ from the space $\mathcal{D}$ of all possible data sets, and produces an output $A(D)$ in the space of outputs $\mathcal{O}$. Typically, $D$ is a collection of $n$ data points for some $n$. To define differential privacy, we need a notion of neighboring data sets. In the classical setting of differential privacy, two data sets $D, D' \in \mathcal{D}$ are called neighboring data sets, denoted $D \sim D'$, if one is obtained from the other by removing one data point. In Section 2.2, we will discuss a more general notion that captures settings where a user controls more than one data point.

**Definition 1** (Differential privacy [DMNS06]). A randomized algorithm $A$ is $(\varepsilon, \delta)$-differentially private ($(\varepsilon, \delta)$-DP for short) if for all neighboring data sets $D, D' \in \mathcal{D}$ and all subsets of outcomes $S \subseteq \mathcal{O}$, $\Pr[A(D) \in S] \le e^{\varepsilon} \Pr[A(D') \in S] + \delta$. When $\delta = 0$, we say that $A$ is $\varepsilon$-DP.

The Laplace mechanism [DMNS06] is a standard technique to achieve differential privacy by adding Laplace noise of appropriate scale to the outcome of the computation.

**Definition 2** ($\ell_1$-sensitivity). The $\ell_1$-sensitivity of a function $f : \mathcal{D} \to \mathbb{R}^d$ is $\Delta f = \max_{D \sim D'} \|f(D) - f(D')\|_1$.

**Definition 3** (Laplace mechanism [DMNS06]). Given any function $f : \mathcal{D} \to \mathbb{R}^d$, the Laplace mechanism is defined as $f(D) + (W_1, \dots, W_d)$, where the $W_i$'s are i.i.d. random variables drawn from $\mathrm{Lap}(\Delta f / \varepsilon)$. Here $\mathrm{Lap}(b)$ is the Laplace distribution with mean $0$ and variance $2b^2$.

**Theorem 1** ([DMNS06]). The Laplace mechanism is $\varepsilon$-DP.

### 2.2 User-level differential privacy

In this paper, we consider the setting in which there are $m$ users owning $n$ data points, with $m \le n$. Therefore, a single user can have more than one data point. For each user $l \in [m]$, we use $S_l$ to denote the set of indices of data points owned by user $l$, and let $s_l = |S_l|$. We assume the $S_l$'s are publicly known. We focus on the case where user data points are sampled from the same distribution. When user data distributions are heterogeneous, additional bias needs to be dealt with (as in [AKMV19]), and we do not consider this case.

The user-level differential privacy definition mostly follows Definition 1. The only difference is that two data sets $D, D'$ are now considered neighboring if they are the same except for all data points from a single user $l$.

## 3 Warm-up: estimating the mean

As a warm-up, we consider a simple setting where we have $n$ data points $y_1, \dots, y_n$ generated as $y_i = \beta + \xi_i$, and we want to estimate the unknown mean $\beta$ differentially privately. Here the noise terms $\xi_i$ are independent with mean $0$ and variance $\sigma^2$. We assume all $y_i$'s are bounded, i.e., $y_i \in [0, B]$. We want to minimize the expected squared error of our estimate $\tilde\beta$: $\mathbb{E}[(\tilde\beta - \beta)^2]$. This is just the variance of $\tilde\beta$ when $\tilde\beta$ is unbiased ($\mathbb{E}[\tilde\beta] = \beta$). The expectation is over the randomness of our algorithm and the $\xi_i$'s.
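To make the setup concrete, the following is a minimal sketch (our illustration, not code from the paper) of the unweighted baseline: average all points and add Laplace noise calibrated to the user-level sensitivity, which for the plain average is $B \max_l s_l / n$ by the argument of Definition 2. Function and variable names here are ours.

```python
import numpy as np

def private_mean_baseline(y, user_of_point, B, eps, rng=None):
    """Plain average of all points plus user-level Laplace noise.

    Changing all points of one user with s_l points shifts the average
    by at most B * s_l / n, so the l1-sensitivity is B * max_l s_l / n.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    n = len(y)
    # s_l = number of points owned by each user
    _, counts = np.unique(user_of_point, return_counts=True)
    sensitivity = B * counts.max() / n
    return y.mean() + rng.laplace(scale=sensitivity / eps)

# Example: 3 users owning 1, 2, and 4 points; beta = 0.5, B = 1.
rng = np.random.default_rng(0)
users = np.array([0, 1, 1, 2, 2, 2, 2])
y = np.clip(0.5 + 0.1 * rng.standard_normal(7), 0.0, 1.0)
print(private_mean_baseline(y, users, B=1.0, eps=1.0, rng=rng))
```

A single user with many points inflates the sensitivity for everyone; the weighted averaging algorithm below addresses exactly this.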
### 3.1 The weighted averaging algorithm

In this subsection, we propose our weighted averaging algorithm $\mathrm{WA}_c$, which is parameterized by non-negative weights $c_1, \dots, c_n$ with $\sum_i c_i = 1$. This algorithm simply computes the weighted average of the input $y_i$'s and applies the Laplace mechanism to this average:

**Algorithm 1** Weighted Averaging $\mathrm{WA}_c$
**Input:** $y_1, \dots, y_n \in \mathbb{R}$
**Parameters:** $c_1, \dots, c_n \in \mathbb{R}_{\ge 0}$, with $\sum_{i=1}^n c_i = 1$
1. $\hat\beta \leftarrow c_1 y_1 + \cdots + c_n y_n$
2. $\tilde\beta \leftarrow \hat\beta + \mathrm{Lap}\!\left(\frac{B \max_{l=1}^m \sum_{i \in S_l} c_i}{\varepsilon}\right)$
3. **return** $\tilde\beta$

In the following two lemmas, we prove that $\mathrm{WA}_c$ is $\varepsilon$-DP and analyze its variance.

**Lemma 1.** For every $c$, the algorithm $\mathrm{WA}_c$ is $\varepsilon$-DP.

*Proof.* If some user $l$ changes its inputs $y_i$ for $i \in S_l$, then $\hat\beta$ changes additively by at most $B \sum_{i \in S_l} c_i$. Therefore, the $\ell_1$-sensitivity of $\hat\beta$ is $B \max_l \sum_{i \in S_l} c_i$. By Theorem 1, $\tilde\beta$ is $\varepsilon$-DP. ∎

It is easy to check that $\tilde\beta$ is unbiased, so we will just analyze its variance.

**Lemma 2.** For every $c$, the variance of the output of $\mathrm{WA}_c$ can be written as
$$\mathrm{Var}(\tilde\beta) = \sigma^2 (c_1^2 + \cdots + c_n^2) + 2\left(\frac{B \max_{l=1}^m \sum_{i \in S_l} c_i}{\varepsilon}\right)^2.$$

*Proof.* Since $c_1 y_1, \dots, c_n y_n$ and the Laplace noise are independent, we have
$$\mathrm{Var}(\tilde\beta) = \sum_{i=1}^n \mathrm{Var}(c_i y_i) + \mathrm{Var}\!\left(\mathrm{Lap}\!\left(\frac{B \max_{l=1}^m \sum_{i \in S_l} c_i}{\varepsilon}\right)\right) = \sigma^2 (c_1^2 + \cdots + c_n^2) + 2\left(\frac{B \max_{l=1}^m \sum_{i \in S_l} c_i}{\varepsilon}\right)^2. \qquad ∎$$

Next, we characterize the weight vector $c$ that minimizes $\mathrm{Var}(\tilde\beta)$. The proof of Lemma 3 can be found in the supplementary material.

**Lemma 3.** Assume users are indexed so that $s_1 \le \cdots \le s_m$, and let $c^* = (c^*_1, \dots, c^*_n)$ be the vector that minimizes $\mathrm{Var}(\tilde\beta)$. There exists $h$ with $s_1 \le h \le s_m$ such that, for each data point $i$ of user $q$,
$$c^*_i = \frac{\min(h, s_q)}{s_q \sum_{l=1}^m \min(h, s_l)}.$$

Define $n_h = \sum_{l=1}^m \min(h, s_l)$. Using the characterization of Lemma 3, we show in the next claim that minimizing $\mathrm{Var}(\tilde\beta)$ reduces to minimizing a function of a single variable $h$.

**Claim 1.** For the weighted averaging algorithm, the minimum of $\mathrm{Var}(\tilde\beta)$ equals $\min_{h : s_1 \le h \le s_m} v(h)$, where
$$v(h) = \frac{\sigma^2 \sum_{l=1}^m \min(h, s_l)^2 / s_l}{n_h^2} + 2\left(\frac{B h}{\varepsilon n_h}\right)^2.$$

*Proof.* For any $h > 0$, setting $c_i = \frac{\min(h, s_q)}{s_q n_h}$ for every data point $i$ of every user $q$, it is easy to check that $\mathrm{Var}(\tilde\beta) = v(h)$. The claim then follows from Lemma 3. ∎

Regarding weight computation, Lemma 3 and Claim 1 show that finding the optimal weight vector $c$ is equivalent to minimizing a function $v(h)$ of a single parameter $h \in [s_1, s_m]$. Optimizing over a single parameter can be done simply by setting the derivative to $0$ and considering the locations where the derivative is discontinuous.

### 3.2 The sample limiting algorithm

The sample limiting algorithm picks an integer threshold $h$ between $s_1$ and $s_m$. For each user $l$, it arbitrarily selects $\min(s_l, h)$ data points and applies the Laplace mechanism to the average of the $n_h = \sum_{l=1}^m \min(h, s_l)$ selected data points. In other words, if we let $T$ denote the set of selected samples, the sample limiting algorithm outputs
$$\tilde\beta = \frac{1}{n_h} \sum_{i \in T} y_i + \mathrm{Lap}\!\left(\frac{B h}{\varepsilon n_h}\right).$$

It is easy to see that the sample limiting algorithm is a special case of the weighted averaging algorithm with weights $c_i = \mathbb{1}[i \in T] / n_h$. Therefore, by Lemma 1, the output of the sample limiting algorithm is $\varepsilon$-DP, and by Lemma 2 its variance can be written as follows.

**Claim 2.** For the sample limiting algorithm with integer threshold $h$, $\mathrm{Var}(\tilde\beta)$ can be written as
$$v'(h) = \frac{\sigma^2}{n_h} + 2\left(\frac{B h}{\varepsilon n_h}\right)^2.$$

### 3.3 Comparing the variances

In this subsection, we compare the minimum variances of the two algorithms described above. By Claims 1 and 2, we just need to compare $\min_{h : s_1 \le h \le s_m} v(h)$ and $\min_{h : s_1 \le h \le s_m,\, h \in \mathbb{N}} v'(h)$.
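Claims 1 and 2 reduce both minimizations to a one-dimensional search over $h$. The following small numerical sketch (ours, built from the formulas above; a grid search stands in for the derivative-based optimization described in Section 3.1) compares the two minima on a toy instance:

```python
import numpy as np

def v_weighted(h, s, B, eps, sigma):
    """v(h) from Claim 1: variance of WA_c with the Lemma-3 weights."""
    s = np.asarray(s, dtype=float)
    n_h = np.minimum(h, s).sum()
    return sigma**2 * (np.minimum(h, s) ** 2 / s).sum() / n_h**2 \
        + 2.0 * (B * h / (eps * n_h)) ** 2

def v_limiting(h, s, B, eps, sigma):
    """v'(h) from Claim 2: variance of sample limiting with threshold h."""
    n_h = np.minimum(h, np.asarray(s, dtype=float)).sum()
    return sigma**2 / n_h + 2.0 * (B * h / (eps * n_h)) ** 2

# 50 users with 1 point each and 5 users with 20 points each (s_1 <= ... <= s_m).
s = np.array([1] * 50 + [20] * 5)
B, eps, sigma = 1.0, 1.0, 0.5
best_weighted = min(v_weighted(h, s, B, eps, sigma)
                    for h in np.linspace(s.min(), s.max(), 2000))
best_limiting = min(v_limiting(h, s, B, eps, sigma)
                    for h in range(s.min(), s.max() + 1))
print(best_weighted, best_limiting)  # weighted averaging is never worse
```

Since sample limiting is a special case of weighted averaging, `best_weighted` can never exceed `best_limiting`; Theorems 2 and 3 below quantify how far apart they can be.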
Since the sample limiting algorithm is a special case of the weighted averaging algorithm, its variance is at least that of the best weighted averaging algorithm. We now show that there are examples where the variance of the sample limiting algorithm is larger than the variance of the weighted averaging algorithm by a factor that asymptotically converges to $3/2$.

**Theorem 2.** For every $g \in \mathbb{N}$, there is an instance where the variance of the best weighted averaging algorithm is less than $\frac{2g+1}{4g^2}$, while the variance of the best sample limiting algorithm is at least $\frac{3}{4g}$. For every $g$, $\frac{3}{4g} \ge \frac{2g+1}{4g^2}$, and the ratio between these two numbers converges to $3/2$ as $g$ goes to infinity.

On the other hand, we show that the gap between the two minimum variances is at most a factor of $4$.

**Theorem 3.** In every instance, the variance of the best sample limiting algorithm is at most $4$ times the variance of the best weighted averaging algorithm.

## 4 Extension to empirical risk minimization

In this section, we extend the weighted averaging algorithm to empirical risk minimization. Missing proofs, and a similar extension to estimating quantiles, can be found in the supplementary material.

We first give the setting of empirical risk minimization (ERM). We are given $n$ data points $D = (X_1, \dots, X_n)$ from a universe $\mathcal{X}$, sampled independently from an unknown distribution $\mu$. We need to optimize over a closed, convex set $\mathcal{C}$ bounded by $B$ (i.e., $\|\theta\|_2 \le B$ for all $\theta \in \mathcal{C}$), and we are given a loss function $\ell$. For each data point $X \in \mathcal{X}$, $\ell(\cdot, X)$ defines a loss function on $\mathcal{C}$, which we assume is convex and $L$-Lipschitz. Our goal is to minimize the population risk $L_\mu(\theta) = \mathbb{E}_{X \sim \mu}[\ell(\theta, X)]$ over $\theta \in \mathcal{C}$, and we define $\theta^* \in \mathcal{C}$ to be the optimal solution: $\theta^* \in \arg\min_{\theta \in \mathcal{C}} L_\mu(\theta)$.

We now describe our weighted ERM algorithm (Algorithm 2), parameterized by non-negative weights $c = (c_1, \dots, c_n)$. The main idea is to consider the weighted empirical risk $\hat{L}(\theta, c, D) = \sum_{i=1}^n c_i \ell(\theta, X_i)$ for the data set $D = (X_1, \dots, X_n)$. In order to apply standard (record-level) differentially private ERM algorithms and ensure differential privacy with respect to each user, we define a new loss function $\ell'$. Let $M_j$ be the meta-data of user $j$: $M_j = (S_j, \{c_i, X_i\}_{i \in S_j})$. We define $\ell'(\theta, M_j) = \sum_{i \in S_j} c_i \ell(\theta, X_i)$, i.e., the weighted empirical loss of each user. In general, we can use any DP-ERM algorithm for the new loss $\ell'$. For concreteness, we use Algorithm 1 of [BST14], which achieves a nearly optimal empirical risk bound.

**Algorithm 2** Weighted ERM
**Input:** $X_1, \dots, X_n \in \mathcal{X}$, loss function $\ell$
**Parameters:** $c_1, \dots, c_n \in \mathbb{R}_{\ge 0}$, with $\sum_{i=1}^n c_i = 1$
1. Define the user-level weighted loss function $\ell'$ as stated in the above paragraph.
2. Run an $(\varepsilon, \delta)$-DP-ERM algorithm (Algorithm 1 of [BST14]) over the user-level loss $\ell'$ and $m$ users, and obtain its output $\tilde\theta$.
3. **return** $\tilde\theta$

**Theorem 4.** Algorithm 2 is $(\varepsilon, \delta)$-DP and
$$\mathbb{E}_{D \sim \mu^n, \mathrm{alg}}[L_\mu(\tilde\theta)] - L_\mu(\theta^*) \le O\!\left(BL\sqrt{d}\cdot\sqrt{\frac{\log^4(m/\delta)\, \max_{j=1}^m \big(\sum_{i \in S_j} c_i\big)^2}{\varepsilon^2} + \log(d)\log(n) \sum_{i=1}^n c_i^2}\,\right).$$

To prove Theorem 4, we prove a weighted version of the uniform convergence result (Theorem 5 of [SSSS09]) to upper bound the generalization error of our weighted ERM algorithm; this result might be of independent interest.

**Theorem 5** (Weighted uniform convergence). For any $\gamma > 0$ and non-negative weights $c = (c_1, \dots, c_n)$ with $\sum_{i=1}^n c_i = 1$, with probability at least $1 - \gamma$ over $D \sim \mu^n$, we have
$$\sup_{\theta \in \mathcal{C}} \big|\hat{L}(\theta, c, D) - L_\mu(\theta)\big| \le O\!\left(BL\sqrt{d \log(d/\gamma) \log(n) \sum_{i=1}^n c_i^2}\,\right).$$
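For intuition, here is a schematic sketch (ours, not the paper's implementation) of Algorithm 2 for a weighted logistic loss, with a simple noisy projected gradient method standing in for Algorithm 1 of [BST14]. The noise scale `sigma_noise` is illustrative only; a real implementation would calibrate it exactly as in [BST14].

```python
import numpy as np

def weighted_dp_erm(X, y, c, user_of_point, L, B, eps, delta, T=200, seed=0):
    """Sketch of Algorithm 2: minimize the weighted logistic loss
    sum_i c_i * log(1 + exp(-y_i <theta, X_i>)) over the l2-ball of radius B,
    with noisy projected gradient descent standing in for Algorithm 1 of
    [BST14]. Labels y_i are in {-1, +1}; user ids are integers 0..m-1.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    # Lipschitz bound for each user-level loss l'(theta, M_j): L * sum_{i in S_j} c_i.
    user_weight = np.bincount(user_of_point, weights=c)
    lip = L * user_weight.max()
    sigma_noise = lip * np.sqrt(8.0 * T * np.log(1.25 / delta)) / eps  # schematic only
    for t in range(1, T + 1):
        margins = y * (X @ theta)
        # Gradient of the weighted empirical risk sum_j l'(theta, M_j).
        grad = X.T @ (c * (-y) / (1.0 + np.exp(margins)))
        theta = theta - (B / (lip * np.sqrt(t))) * (grad + rng.normal(0.0, sigma_noise, d))
        norm = np.linalg.norm(theta)
        if norm > B:  # project back onto the feasible ball C
            theta *= B / norm
    return theta
```

The key structural point carries over from the real algorithm: the per-user noise calibration depends on $\max_j \sum_{i \in S_j} c_i$, which is exactly the quantity the weights are chosen to control.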
**Tradeoff weights.** In ERM, and also in estimating quantiles (details in the supplementary material), optimizing the algorithm's performance requires picking weights to minimize a formula of the form $\max_{j=1}^m \big(\sum_{i \in S_j} c_i\big)^2 + A \sum_{i=1}^n c_i^2$, where $A$ depends on the parameters of the problem (for example, in Theorem 4, by simply moving terms around, $A$ would be $\Theta(\log(d)\log(n)\varepsilon^2 / \log^4(m/\delta))$). This general form intuitively explains the tradeoff we need to make: $\max_{j=1}^m \big(\sum_{i \in S_j} c_i\big)^2$ measures how hard it is to optimize the weighted objective with a user-level private algorithm, and $A \sum_{i=1}^n c_i^2$ measures how well the weighted objective generalizes. It is not hard to see that $\mathrm{Var}(\tilde\beta)$ in Section 3 is also of this form, so the results characterizing the minimizer and the comparisons to the sample limiting algorithm also apply here, and we can use the same method to compute the weights $c = (c_1, \dots, c_n)$.

## 5 Linear regression with label privacy

In this section, we consider linear regression. We are given $n$ data points of the form $(X_i, y_i)$, where $X_i \in \mathbb{R}^d$ and $y_i \in [0, B]$ for $i = 1, \dots, n$. The $y_i$ values are generated as $y_i = \beta \cdot X_i + \xi_i$, for a vector $\beta \in \mathbb{R}^d$ unknown to us and random variables $\xi_i$ representing noise. These random variables are assumed to be independent, each with mean $0$ and variance $\sigma^2$. We provide detailed preliminaries on linear regression in the supplementary material. We focus on label privacy (introduced in [CH11]), in which we protect the privacy of the labels $y_i$ while the $X_i$'s are public data. In other words, $D$ and $D'$ are considered neighboring databases if they have the same $X_i$'s and their $y_i$'s are also the same except for the data points of a single user $l$. Our goal is an $\varepsilon$-DP algorithm which estimates $\beta$ by outputting an unbiased estimator $\tilde\beta$ and minimizes the squared error $\mathbb{E}\big[\sum_{j=1}^d (\tilde\beta_j - \beta_j)^2\big]$. Since $\tilde\beta$ is unbiased, minimizing the squared error is equivalent to minimizing the variance $\sum_{j=1}^d \mathrm{Var}(\tilde\beta_j)$.

### 5.1 The Algorithm

The weighted averaging algorithm of Section 3.1 generalizes the simple averaging algorithm and is a generic linear unbiased estimator for the simple problem. Similarly, for linear regression, we need a generalization of the OLS (ordinary least squares) estimator that provides a generic linear unbiased estimator of $\beta$. Such a generalization has already been proposed by [Ait34], albeit for a different purpose ([Ait34] proposed the Generalized Least Squares method to solve linear regression when the noise in different observations is correlated). As a fact (see the supplementary material for details), any linear unbiased estimator can be written as $\hat\beta = Cy$, for a $d \times n$ matrix $C = [c_{j,i}]_{d \times n}$ satisfying $CX = I_d$. Our generalization of the weighted averaging algorithm to the higher-dimensional case is to use such an estimator followed by the Laplace mechanism:

**Algorithm 3** Generalized Weighted Averaging $\mathrm{GWA}_C$
**Input:** $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^{n \times 1}$
**Parameters:** $C \in \mathbb{R}^{d \times n}$ satisfying $CX = I_d$
1. $\hat\beta \leftarrow Cy$
2. $b \leftarrow B \varepsilon^{-1} \max_l \sum_{j=1}^d \sum_{i \in S_l} |c_{j,i}|$
3. Independently draw values $W_1, \dots, W_d$ from $\mathrm{Lap}(b)$.
4. $\tilde\beta \leftarrow \hat\beta + (W_1, \dots, W_d)$
5. **return** $\tilde\beta$

In the following theorem, we state the performance of our algorithm; its proof can be found in the supplementary material.

**Theorem 6.** For every $d \times n$ matrix $C$ satisfying $CX = I_d$, the algorithm $\mathrm{GWA}_C$ is $\varepsilon$-DP, and the total variance $\sum_{j=1}^d \mathrm{Var}(\tilde\beta_j)$ of the vector $\tilde\beta$ produced by $\mathrm{GWA}_C$ can be written as
$$\sigma^2 \sum_{j=1}^d \sum_{i=1}^n c_{j,i}^2 + 2d \left(B \varepsilon^{-1} \max_l \sum_{j=1}^d \sum_{i \in S_l} |c_{j,i}|\right)^2.$$
This total variance is a convex function of $C$ and can be minimized in polynomial time.
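Since Theorem 6 shows the objective is convex in $C$, the optimal $C$ can be computed with an off-the-shelf convex solver. Below is a sketch (ours, assuming the `cvxpy` package is available; function and variable names are ours) of both the optimization and the mechanism of Algorithm 3:

```python
import numpy as np
import cvxpy as cp

def optimal_C(X, user_groups, B, eps, sigma):
    """Minimize the Theorem-6 objective over C subject to C X = I_d.
    user_groups is a list of index lists, one per user (the sets S_l)."""
    n, d = X.shape
    C = cp.Variable((d, n))
    # Per-user l1 mass: sum_{j=1}^d sum_{i in S_l} |c_{j,i}|.
    per_user = cp.hstack([cp.sum(cp.abs(C[:, idx])) for idx in user_groups])
    objective = sigma**2 * cp.sum_squares(C) \
        + 2 * d * cp.square((B / eps) * cp.max(per_user))
    cp.Problem(cp.Minimize(objective), [C @ X == np.eye(d)]).solve()
    return C.value

def gwac(X, y, user_groups, B, eps, sigma, rng=None):
    """Algorithm 3: beta_hat = C y, then add i.i.d. Laplace(b) noise per coordinate."""
    rng = np.random.default_rng() if rng is None else rng
    C = optimal_C(X, user_groups, B, eps, sigma)
    b = (B / eps) * max(np.abs(C[:, idx]).sum() for idx in user_groups)
    return C @ y + rng.laplace(scale=b, size=C.shape[0])
```

The constraint $CX = I_d$ keeps the estimator unbiased, while the objective trades the OLS-style variance term against the Laplace noise term, exactly as in Theorem 6.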
### 5.2 The sample limiting algorithm

We generalize the sample limiting algorithm of Section 3.2 to higher dimensions. The sample limiting algorithm is parameterized by an integer threshold $h$. For each user $l$, it randomly picks $\min(h, s_l)$ data points from the user. Let $U$ be the features and $v$ the labels of the sampled data points. The sample limiting algorithm first computes the OLS estimator of the sampled data, $\hat\beta^s = (U^\top U)^{-1} U^\top v$. Let $C^s = (U^\top U)^{-1} U^\top$. For each user $l \in [m]$, let $S^s_l$ be the set of corresponding row numbers in $U$ and $v$. The algorithm finally outputs a vector $\tilde\beta^s$ obtained by adding to each entry of $\hat\beta^s$ a value drawn i.i.d. from $\mathrm{Lap}(b)$, for $b = B \varepsilon^{-1} \max_{l=1}^m \sum_{j=1}^d \sum_{i \in S^s_l} |c^s_{j,i}|$. Similarly to Section 3.2, after fixing the selected points, the sample limiting algorithm can be considered a special case of the $\mathrm{GWA}_C$ algorithm if we expand $C^s$ with zero columns to a $d \times n$ matrix. Therefore, the output of the sample limiting algorithm is $\varepsilon$-DP, and $\sum_{j=1}^d \mathrm{Var}(\tilde\beta^s_j)$ equals
$$\sigma^2 \sum_{j=1}^d \sum_i (c^s_{j,i})^2 + 2d \left(B \varepsilon^{-1} \max_{l=1}^m \sum_{j=1}^d \sum_{i \in S^s_l} |c^s_{j,i}|\right)^2.$$

### 5.3 An unbounded gap

Here we give an example showing that the optimal GWA algorithm can have a much smaller variance than the sample limiting algorithm. This advantage of the GWA algorithm can also be seen in the experiments of Section 6.

In this example (Example 1), data points come from two orthogonal directions (i.e., each $X_i$ has either $X_{i,1} = 0$ or $X_{i,2} = 0$). To control the user contributions, the sample limiting algorithm wants to pick $h$ large for the first dimension and small for the second, but it has to pick the same threshold $h$ for both dimensions. Intuitively, it cannot avoid large user contributions. A formal proof is provided in the supplementary material.

**Example 1.** Let $d = 2$. Set $\sigma = 0$ and $2d(B/\varepsilon)^2 = 1$. Let $g$ be an integer parameter and set the number of users to $m = 2g^2 + 2$. The data points of the users are:
- User 1 has 1 data point with $X_i = (g, 0)$.
- Each of users $2, \dots, g^2 + 1$ has $g$ data points, all with $X_i = (1, 0)$.
- User $g^2 + 2$ has $g$ data points, all with $X_i = (0, 1)$.
- Each of users $g^2 + 3, \dots, 2g^2 + 2$ has 1 data point with $X_i = (0, 1)$.

**Claim 3.** In Example 1, the minimum variance of the sample limiting algorithm is at least $1/(4g^3)$, and the minimum variance of the generalized weighted averaging algorithm is at most $1/g^4$.

Example 1 contains one data point whose norm $\|X_i\|$ is much larger than the others'. We provide a different example in the supplementary material showing that a gap still exists even when $\|X_i\| = 1$ for all data points.

## 6 Experiments

In this section, we perform an empirical evaluation of our algorithm and compare it with the sample limiting algorithm for linear regression in the label-privacy case. In Appendix E, we also provide experimental results on logistic regression using our ERM algorithm of Section 4.

**Datasets.** We evaluated all methods on two publicly available datasets containing real-world data, as well as synthetic datasets with ground truth generated with standard open-source libraries. We stress that no private data has been used in these experiments. We only briefly describe our datasets and experimental setup here; more details are available in Appendix E.

**Synthetic data:** We generated regression problem instances with sklearn's `make_regression` ($n \in [600, 3000]$ samples, $d = 10$ features, `bias=0.0` and `noise=20`). To model user contributions, we used the Zipf (power law) distribution for the number of rows of a user (user contributions are often heavy-tailed [AH02]).
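A plausible reconstruction (ours; the exact settings are in Appendix E) of this data-generation step, pairing `make_regression` with Zipf-distributed user sizes:

```python
import numpy as np
from sklearn.datasets import make_regression

def synthetic_instance(n=3000, d=10, alpha=1.5, noise=20.0, seed=0):
    """Regression instance with Zipf-distributed user contributions."""
    X, y = make_regression(n_samples=n, n_features=d, bias=0.0,
                           noise=noise, random_state=seed)
    rng = np.random.default_rng(seed)
    sizes = []
    while sum(sizes) < n:               # draw user sizes until all points are covered
        sizes.append(int(rng.zipf(alpha)))
    sizes[-1] -= sum(sizes) - n         # trim the last user so sizes sum to exactly n
    user_of_point = np.repeat(np.arange(len(sizes)), sizes)
    return X, y, user_of_point

X, y, users = synthetic_instance()
print(X.shape, y.shape, users.max() + 1, "users")
```

Smaller `alpha` makes the Zipf draws heavier-tailed, i.e., a few users own a large share of the points, which is exactly the regime where bounding user contributions matters.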
**Real-world datasets:** We also used two UCI Machine Learning datasets: drugs [GKMZ18] ($n = 3107$, $d = 8$, $m = 502$ users with min 1 and max 63 samples) and news [MT18] ($n = 3452$, $d = 10$, $m = 297$ users with min 1 and max 878 samples).

*[Figure 1: Average squared error vs. number of samples $n$ on the synthetic dataset ($\alpha = 1.5$, $\varepsilon = 1$); series: our method, sample limiting with $h^*$, and sample limiting with $h = \max$.]*

**Experimental setup.** Experiments are repeated 10 times and we report the mean of each computed metric. For quality, we use the average squared error of the prediction. We evaluate our algorithm of Section 5 using $\varepsilon = 1, 2, 3$. We treat $\sigma^2$ as public knowledge and compute it with OLS regression. For sample limiting, we evaluate both the best threshold $h^*$ and the whole dataset (i.e., $h$ set to the maximum user contribution).

**Results on the synthetic dataset.** We compare all methods on datasets with varying numbers of samples $n$ and different values of the parameter $\alpha$ of the Zipf distribution. Lower $\alpha$ values correspond to more uneven distributions (i.e., some users may have many more data points than others). The results for $\alpha = 1.5$, $\varepsilon = 1$ are plotted in Figure 1. As expected, the larger the number of samples, the lower the loss of all methods; however, in every setting our method has significantly lower squared error, even orders of magnitude lower (notice the y-axis is in log scale). We now fix $\varepsilon = 1$ and $n = 3000$ samples and analyze the effect of the parameter $\alpha$ in Figure 2. Recall that $\alpha$ controls the inequality in the distribution of user contributions. As expected, our method is comparatively much better for low $\alpha$ (i.e., more unequal distributions of user contributions), but it always performs better.

*[Figure 2: Average squared error vs. the Zipf parameter $\alpha$ on the synthetic dataset ($\varepsilon = 1$, $n = 3000$); series: our method, sample limiting with $h^*$, and sample limiting with $h = \max$.]*

| Dataset | $\varepsilon$ | Our method | Sample limit $h^*$ | Sample limit $h_{\max}$ |
|---|---|---|---|---|
| drugs | 1 | 3.1 | 24.8 | 95.4 |
| drugs | 2 | 2.5 | 7.7 | 25.2 |
| drugs | 3 | 2.3 | 4.5 | 12.4 |
| news | 1 | 1696.3 | 96344.4 | 4862670.7 |
| news | 2 | 440.1 | 24110.0 | 1201568.9 |
| news | 3 | 166.2 | 10648.3 | 550989.2 |

Table 1: Average squared errors for our method, sample limiting with the best threshold ($h^*$), and sample limiting using all data ($h_{\max}$).

**Results on real-world datasets.** We now report the results for the real-world datasets, summarized in Table 1. They confirm all the empirical observations made on the synthetic datasets: the loss decreases with increasing $\varepsilon$ for all methods, but our method always has significantly lower loss than sample limiting with both the best and the max threshold. Notice that the squared error is overall larger for news than for drugs; this is explained by the larger range of the $y$ values (in news the values are in $[0, 71]$ vs. $[1, 10]$ for drugs).

## 7 Conclusion

In this paper, we propose the weighted averaging method for smoothly bounding user contributions in differential privacy. We apply this method to estimating the mean and quantiles, empirical risk minimization, and linear regression. We show it has an advantage over the sample limiting algorithm, especially in the label-privacy case.

## Broader Impact

Privacy is a fundamental concern in machine learning. Respecting the privacy of the users is a requirement of any real system, and differential privacy allows one to formalize this requirement. In this paper we provided algorithms with improved trade-offs of utility vs. differential privacy. This may enable better outcomes for the users of a system at the same level of privacy. We stress that privacy is only one of the requirements of a real system.
Any machine learning technology must also responsibly ensure utility of the system and fairness of the system to the users. Privacy requirements may negatively affect utility, and it is known that differential privacy can disparately impact certain users [BPS19]. Such considerations are beyond the scope of this paper, and we refer to the emerging literature on responsible machine learning for addressing them [KR19].

## Funding Transparency Statement

No third-party funding has been used for this research.

## References

[AGK17] Mohammad Alaggan, Sébastien Gambs, and Anne-Marie Kermarrec. Heterogeneous differential privacy. Journal of Privacy and Confidentiality, 7(2), January 2017.

[AH02] Lada A. Adamic and Bernardo A. Huberman. Zipf's law and the internet. Glottometrics, 3(1):143–150, 2002.

[Ait34] A. C. Aitken. On least-squares and linear combinations of observations. Proceedings of the Royal Society of Edinburgh, 55:42–48, 1934.

[AKMV19] Kareem Amin, Alex Kulesza, Andres Munoz, and Sergei Vassilvitskii. Bounding user contributions: A bias-variance trade-off in differential privacy. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 263–271, June 2019.

[BFTT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 11279–11288, 2019.

[BPS19] Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. Differential privacy has disparate impact on model accuracy. In Advances in Neural Information Processing Systems 32, pages 15479–15488, 2019.

[BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS '14, pages 464–473, 2014.

[CH11] Kamalika Chaudhuri and Daniel Hsu. Sample complexity bounds for differentially private learning. In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 155–186, June 2011.

[CM08] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS '08, pages 289–296, 2008.

[CMS11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, July 2011.

[DJW13] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429–438, 2013.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC '06, pages 265–284, 2006.

[DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, August 2014.

[Dwo06] Cynthia Dwork. Differential privacy.
In Automata, Languages and Programming, pages 1–12. Springer Berlin Heidelberg, 2006.

[FKT20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in linear time. In STOC, 2020.

[Gau26] Carl Friedrich Gauss. Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, Parts 1, 2 and suppl. Werke 4, 1–108, 1821, 1823, 1826.

[GKMZ18] Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. Aspect-based sentiment analysis of drug reviews applying cross-domain and cross-data learning. In Proceedings of the 2018 International Conference on Digital Health, pages 121–125, 2018.

[INS+19] R. Iyengar, J. P. Near, D. Song, O. Thakkar, A. Thakurta, and L. Wang. Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP), pages 299–316, 2019.

[JKT12] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 24.1–24.34, June 2012.

[JT14] Prateek Jain and Abhradeep Guha Thakurta. (Near) dimension independent risk bounds for differentially private learning. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 476–484, June 2014.

[JYC15] Z. Jorgensen, T. Yu, and G. Cormode. Conservative or liberal? Personalized differential privacy. In 2015 IEEE 31st International Conference on Data Engineering, pages 1023–1034, 2015.

[KMA+19] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaïd Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konecný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. CoRR, abs/1912.04977, 2019.

[KR19] Michael Kearns and Aaron Roth. The Ethical Algorithm: The Science of Socially Aware Algorithm Design. Oxford University Press, 2019.

[KST12] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pages 25.1–25.40, June 2012.

[MT07] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS '07, pages 94–103, 2007.

[MT18] Nuno Moniz and Luis Torgo.
Multi-source social feedback of online news feeds. CoRR, 2018.

[PSY+19] Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, and Sanjiv Kumar. AdaCliP: Adaptive clipping for private SGD. CoRR, abs/1908.07643, 2019.

[SCS13] S. Song, K. Chaudhuri, and A. D. Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pages 245–248, 2013.

[SCS15] Shuang Song, Kamalika Chaudhuri, and Anand Sarwate. Learning from data with heterogeneous noise using SGD. Volume 38 of Proceedings of Machine Learning Research, pages 894–902, May 2015.

[She19a] Or Sheffet. Differentially private ordinary least squares. Journal of Privacy and Confidentiality, 9(1), March 2019.

[She19b] Or Sheffet. Old techniques in differentially private linear regression. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 789–827, March 2019.

[Smi11] Adam Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, pages 813–822, 2011.

[SSSS09] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization. In COLT 2009 – The 22nd Conference on Learning Theory, 2009.

[STU17] A. Smith, A. Thakurta, and J. Upadhyay. Is interaction necessary for distributed private learning? In 2017 IEEE Symposium on Security and Privacy (SP), pages 58–77, 2017.

[TAM19] Om Thakkar, Galen Andrew, and H. Brendan McMahan. Differentially private learning with adaptive clipping. CoRR, abs/1905.03871, 2019.

[TS13] Abhradeep Guha Thakurta and Adam Smith. Differentially private feature selection via stability arguments, and the robustness of the lasso. In Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 819–850, June 2013.

[TTZ15] Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Nearly-optimal private LASSO. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS '15, pages 3025–3033, 2015.

[Ull15] Jonathan Ullman. Private multiplicative weights beyond linear queries. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '15, pages 303–312, 2015.

[Wan18] Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 93–103, 2018.

[WLK+17] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 1307–1322, 2017.

[WX19] Di Wang and Jinhui Xu. On sparse linear regression in the local differential privacy model.
Volume 97 of Proceedings of Machine Learning Research, pages 6628–6637, June 2019. PMLR.

[WYX17] Di Wang, Minwei Ye, and Jinhui Xu. Differentially private empirical risk minimization revisited: Faster and more general. In Advances in Neural Information Processing Systems 30, pages 2722–2731, 2017.

[ZZMW17] Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient private ERM for smooth objectives. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, pages 3922–3928, 2017.