On User-Level Private Convex Optimization

Badih Ghazi*1  Pritish Kamath*1  Ravi Kumar*1  Pasin Manurangsi*2  Raghu Meka*1,3  Chiyuan Zhang*1

1Google Research, Mountain View, CA  2Google Research, Thailand  3UCLA, Los Angeles, CA. Correspondence to: Pasin Manurangsi, Pritish Kamath.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We introduce a new mechanism for stochastic convex optimization (SCO) with user-level differential privacy guarantees. The convergence rates of this mechanism are similar to those in the prior work of Levy et al. (2021); Narayanan et al. (2022), but with two important improvements. Our mechanism does not require any smoothness assumptions on the loss. Furthermore, our bounds are also the first where the minimum number of users needed for user-level privacy has no dependence on the dimension and only a logarithmic dependence on the desired excess error. The main idea underlying the new mechanism is to show that the optimizers of strongly convex losses have low local deletion sensitivity, along with an output perturbation method for functions with low local deletion sensitivity, which could be of independent interest.

1. Introduction

Differential Privacy (DP) (Dwork et al., 2006b) is a formal notion that protects the privacy of each user contributing to a dataset when releasing statistics about the dataset. The settings considered in the literature have typically involved each user contributing a single item to the dataset. Thus the most commonly used notion of DP protects the privacy of each item, and we refer to it as item-level DP. However, when a dataset contains multiple items contributed by each user, it is essential to simultaneously protect the privacy of all items contributed by any individual user; this notion has come to be known as user-level DP.

Convex optimization is one of the most basic and powerful computational tools in statistics and machine learning. In the most abstract setting, each item corresponds to a loss function. The goal is to return a value that achieves as small a loss as possible, either averaged over the data (empirical risk minimization) or over the population distribution underlying the data (stochastic convex optimization). Given its importance, a large body of work has tackled the convex optimization problem under item-level DP (e.g., Chaudhuri & Monteleoni (2008); Chaudhuri et al. (2011); Kifer et al. (2012); Bassily et al. (2014; 2019); Wang et al. (2017); Feldman et al. (2020); Asi et al. (2021); Gopi et al. (2022)), with the optimal risk bounds established in many standard settings, such as when the loss is Lipschitz or strongly convex.

User-level DP has also been studied recently in various learning tasks (Liu et al., 2020; Ghazi et al., 2021); see also the survey by Kairouz et al. (2021, Section 4.3.2) for its relevance in federated learning, where the question of determining trade-offs between item-level and user-level DP is highlighted. Levy et al. (2021); Narayanan et al. (2022) studied convex optimization with user-level DP; these results have two main drawbacks: they require the loss function to be smooth, and they do not achieve good risk bounds in some regimes of parameters. An open question posed in Levy et al. (2021) was whether the smoothness requirement can be removed.
In this paper, we resolve this question in the affirmative by introducing novel mechanisms for convex optimization under user-level DP. En route, we also improve existing excess risk bounds for a large regime of parameters.

1.1. Background

We introduce some notation to state our results concretely. For $n, m \in \mathbb{N}$, suppose there are $n$ users, and let the input to the $i$th user be $x_i := (x_{i,1}, \dots, x_{i,m})$. Two datasets $x = (x_1, \dots, x_n)$ and $x' = (x'_1, \dots, x'_n)$ are said to be user-level neighbors, denoted $x \sim x'$, if there is an index $i_0 \in [n]$ such that $x_i = x'_i$ for all $i \in [n] \setminus \{i_0\}$. (We use item-level to refer to the case where $m = 1$.) We recall the definition of DP, extended from Dwork et al. (2006a;b); see also Dwork & Roth (2014); Vadhan (2017):

Definition 1.1 ((User-Level) Differential Privacy (DP)). Let $\varepsilon > 0$ and $\delta \in [0, 1]$. A randomized algorithm $M : X^{n \times m} \to O$ is $(\varepsilon, \delta)$-differentially private ($(\varepsilon, \delta)$-DP) if, for all $x \sim x'$ and all (measurable) outcomes $E \subseteq O$, it holds that $\Pr[M(x) \in E] \le e^{\varepsilon}\Pr[M(x') \in E] + \delta$.

Throughout the paper, we assume that $\varepsilon \in (0, 1]$ and $\delta \in (0, 1/2]$, and we will not state this explicitly.

Convex Optimization. A convex optimization (CO) problem over a parameter space $K \subseteq \mathbb{R}^d$ and domain $X$ is specified by a loss function $\ell : K \times X \to \mathbb{R}$ that is convex in the first argument. Here, $\ell$ is said to be $G$-Lipschitz if all sub-gradients have norm at most $G$, i.e., $\|\nabla_\theta \ell(\theta; x)\| \le G$ for all $\theta, x$, where $\|\cdot\|$ denotes the Euclidean (i.e., $\ell_2$-) norm. Moreover, $\ell$ is said to be $\mu$-strongly convex if, for all $x \in X$, $\ell(\theta; x) - \frac{\mu}{2}\|\theta\|^2$ is convex in $\theta$. We consider the case where $K \subseteq \mathbb{R}^d$ has $\ell_2$-diameter at most $R$; we use $B_d(\theta, r)$ to denote the $\ell_2$-ball of radius $r$ centered at $\theta$.

The empirical loss on dataset $x = (x_1, \dots, x_n)$ is $L(\theta; x) := \frac{1}{nm}\sum_{i \in [n]}\sum_{j \in [m]} \ell(\theta; x_{i,j})$, whereas the population loss over a distribution $D$ on $X$ is $L(\theta; D) := \mathbb{E}_{x \sim D}[\ell(\theta; x)]$. For a loss function $\ell$ and dataset $x$, let $\theta^*_{\ell,K}(x)$ denote an element of $\operatorname{argmin}_{\theta \in K} L(\theta; x)$ (ties broken arbitrarily), and let $\theta^*_{\ell,K}(D)$ be defined similarly. When $\ell, K$ are clear from context, we may drop them and simply write $\theta^*(x)$ or $\theta^*(D)$. When there is no ambiguity in $x$ and $D$, we may drop them and simply write $\theta^*$.

Empirical risk minimization (ERM) corresponds to the goal of minimizing $L(\theta; x)$, and stochastic convex optimization (SCO) to the goal of minimizing $L(\theta; D)$. If $\hat\theta$ denotes the output of our algorithm, its excess risk is defined as $\mathbb{E}[L(\hat\theta; x) - L(\theta^*; x)]$ for ERM and $\mathbb{E}[L(\hat\theta; D) - L(\theta^*; D)]$ for SCO.
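To fix the notation in code, here is a minimal sketch of the setup just described, using the squared-distance loss $\frac{1}{2}\|\theta - x\|^2$ as a running example. The function names, the data layout, and the choice of loss are our own illustration and are not objects defined in the paper.

```python
import numpy as np

def empirical_loss(theta, x, loss):
    """L(theta; x) = (1/(n*m)) * sum_{i in [n], j in [m]} loss(theta, x[i, j])."""
    n, m, _ = x.shape
    return sum(loss(theta, x[i, j]) for i in range(n) for j in range(m)) / (n * m)

def user_level_neighbor(x, i0, new_user_data):
    """Return x' that is a user-level neighbor of x: only user i0's data changes."""
    xp = x.copy()
    xp[i0] = new_user_data
    return xp

sq_loss = lambda theta, xi: 0.5 * np.sum((theta - xi) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 20, 5))       # n=100 users, m=20 items each, d=5
xp = user_level_neighbor(x, 0, rng.normal(size=(20, 5)))
theta_star = x.mean(axis=(0, 1))        # minimizer of the squared loss over R^d
print(empirical_loss(theta_star, x, sq_loss))
```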
1.2. Our Results

We provide user-level DP algorithms for both the ERM and SCO problems. For both problems, we consider the basic case of Lipschitz (including non-smooth) losses and the case of Lipschitz strongly convex losses.

DP-ERM. We give an algorithm for any Lipschitz and convex loss function with excess risk $\tilde{O}_\varepsilon\big(\frac{\sqrt d}{n\sqrt m}\big)$ that works for any $n \ge \tilde\Omega_\varepsilon(1)$. Previously, no user-level DP algorithm was known without a smoothness assumption on the loss function. Even with smoothness, the known algorithm of Narayanan et al. (2022) incurs an additional additive error of $\tilde{O}_\varepsilon(1/\sqrt n)$. In particular, the previous excess risk does not converge to zero if we fix the number of users $n$ and let $m \to \infty$. Concretely, to achieve excess risk $\alpha$, Narayanan et al. (2022) need $n \ge \tilde\Omega_\varepsilon(1/\alpha^2)$. In contrast, we only need a logarithmic dependence of $n \ge \tilde\Omega_\varepsilon(\log(1/\alpha))$. Additionally, for loss functions that are also strongly convex, we improve the excess risk bound to $\tilde{O}_\varepsilon\big(\frac{d}{n^2 m}\big)$. Again, no previous user-level DP algorithm without the smoothness assumption was known in this setting.

DP-SCO. Here, we give algorithms with similar excess risk bounds, except with additive terms of $\tilde{O}\big(\frac{1}{\sqrt{nm}}\big)$ and $\tilde{O}\big(\frac{1}{nm}\big)$ for the convex and strongly convex cases, respectively. These additive terms are known to be tight, even without privacy. Again, previous results (Levy et al., 2021; Narayanan et al., 2022) are only known under the smoothness assumption, and their excess risk bounds do not converge to zero when $n$ is fixed.

A summary of the previous and new bounds is in Table 1.

| | Additional assumptions on $\ell$ | Item-Level DP (previous work) | User-Level DP (previous work) | User-Level DP (our results) |
|---|---|---|---|---|
| ERM | (no additional assumption) | $\tilde{O}_\varepsilon\big(\frac{\sqrt d}{n}\big)$ (Bassily et al., 2014) | $\tilde{O}_\varepsilon\big(\frac{1}{\sqrt n} + \frac{\sqrt d}{n\sqrt m}\big)$ for $n \ge \tilde\Omega_\varepsilon(1)$, requires Smooth (Narayanan et al., 2022) | $\tilde{O}_\varepsilon\big(\frac{\sqrt d}{n\sqrt m}\big)$ for $n \ge \tilde\Omega_\varepsilon(1)$ (Theorem 4.1) |
| ERM | Strongly Convex | $\tilde{O}_\varepsilon\big(\frac{d}{n^2}\big)$ (Bassily et al., 2014) | $\tilde{O}_\varepsilon\big(\frac{d}{n^2 m}\big)$ for $n \ge \tilde\Omega_\varepsilon(1)$, requires Strongly Convex & Smooth (Narayanan et al., 2022) | $\tilde{O}_\varepsilon\big(\frac{d}{n^2 m}\big)$ for $n \ge \tilde\Omega_\varepsilon(1)$ (Theorem 4.3) |
| SCO | (no additional assumption) | $\tilde{O}_\varepsilon\big(\frac{1}{\sqrt n} + \frac{\sqrt d}{n}\big)$ (Bassily et al., 2019) | $\tilde{O}_\varepsilon\big(\frac{1}{\sqrt{nm}} + \frac{\sqrt d}{n\sqrt m}\big)$ for $n \ge \tilde\Omega_\varepsilon(\min\{\sqrt[3]{m}, \sqrt{m/d}\})$, requires Smooth (Narayanan et al., 2022) | $\tilde{O}_\varepsilon\big(\frac{1}{\sqrt{nm}} + \frac{\sqrt d}{n\sqrt m}\big)$ for $n \ge \tilde\Omega_\varepsilon(1)$ (Theorem 5.1) |
| SCO | Strongly Convex | $\tilde{O}_\varepsilon\big(\frac{1}{n} + \frac{d}{n^2}\big)$ (Feldman et al., 2020) | (not considered; see caption) | $\tilde{O}_\varepsilon\big(\frac{1}{nm} + \frac{d}{n^2 m}\big)$ for $n \ge \tilde\Omega_\varepsilon(1)$ (Theorem 5.3) |

Table 1. Summary of our results and previous results. In all rows, the loss function is assumed to be convex and Lipschitz. The $\tilde{O}_\varepsilon$ hides polynomial dependency on the diameter, Lipschitzness, strong convexity, and smoothness parameters as well as on $\varepsilon$, and polylogarithmic dependency on $1/\delta$, $n$, $m$. We remark that, while it seems plausible to derive bounds using their techniques, Levy et al. (2021); Narayanan et al. (2022) did not explicitly consider the strongly convex (and smooth) case for DP-SCO.

Tightness of our Risk Bounds. Our excess risk bounds are nearly tight for a large regime of parameters. In particular, Levy et al. (2021) proved a lower bound of $\Omega\big(\frac{1}{\sqrt{nm}} + \frac{\sqrt d}{\varepsilon n\sqrt m}\big)$ for DP-SCO. It is not hard to extend this to prove a lower bound of $\Omega\big(\frac{1}{nm} + \frac{d}{\varepsilon^2 n^2 m}\big)$ for the strongly convex DP-SCO case. These two lower bounds hold for any $n \ge \Theta(\sqrt d/\varepsilon)$. For DP-ERM, it is possible to extend these lower bounds to get $\Omega\big(\frac{\sqrt d}{\varepsilon n\sqrt m}\big)$ and $\Omega\big(\frac{d}{\varepsilon^2 n^2 m}\big)$ lower bounds for the convex and strongly convex cases, respectively; however, these DP-ERM lower bounds require an additional assumption that $n = O(d/\varepsilon^2)$. We discuss these lower bounds in more detail in Appendix D.

2. Technical Overview

Our main technical contribution is an improved output perturbation algorithm for user-level DP compared to item-level DP. Recall that in item-level DP, the output perturbation algorithm (Chaudhuri et al., 2011) computes the empirical risk minimizer $\theta^*$ and outputs $\theta^* + Z$ where $Z \sim N(0, \sigma^2 I)$ for a suitable $\sigma$; naturally, the smaller the $\sigma$ for which the DP guarantees hold, the better the accuracy. It is known that for strongly convex loss functions, this algorithm is DP for $\sigma = \tilde{O}_\varepsilon\big(\frac{1}{n}\big)$. As discussed more below, we give a similar algorithm that only requires $\sigma = \tilde{O}_\varepsilon\big(\frac{1}{n\sqrt m}\big)$. This improvement is critical in our results.
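For contrast with the user-level mechanism developed in Section 3, the following is a minimal sketch of the item-level output perturbation baseline just described. It is a schematic, not the calibrated mechanism: the sensitivity constant is only correct up to constants, the exact calibration is in Chaudhuri et al. (2011) and Shalev-Shwartz et al. (2009), and `solve_erm` is an assumed exact ERM solver.

```python
import numpy as np

def item_level_output_perturbation(x, solve_erm, G, mu, eps, delta, rng):
    """Solve the ERM exactly, then add spherical Gaussian noise calibrated to
    the O(1/n) sensitivity of the minimizer of a mu-strongly convex loss."""
    n = len(x)
    sensitivity = 2 * G / (mu * n)   # the O(1/n) bound of (1), up to constants
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps  # Gaussian mechanism
    theta_star = solve_erm(x)
    return theta_star + rng.normal(0, sigma, size=theta_star.shape)
```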
Deletion Sensitivity. We exploit a refined notion of sensitivity to facilitate our improved output perturbation algorithm. Bounding the sensitivity of the quantity to be computed is one of the most widely used methods for achieving DP guarantees. Indeed, the DP guarantee of the output perturbation algorithm in (Chaudhuri et al., 2011) for item-level privacy follows from the fact that the ($\ell_2$-)sensitivity of the empirical risk minimizer is $O\big(\frac{1}{n}\big)$ (Shalev-Shwartz et al., 2009). Formally,

$$\|\theta^*(x) - \theta^*(x')\| \le O(1/n), \tag{1}$$

for any two neighboring datasets $x, x'$.

Ideally, we would like the sensitivity of $\theta^*$ to become $\tilde{O}\big(\frac{1}{n\sqrt m}\big)$ for some notion of sensitivity. However, the standard notion of sensitivity as above (or even local sensitivity (Nissim et al., 2007)) does not work: even for mean estimation (which corresponds to $\ell(\theta; x) = \|\theta - x\|^2$ for $x \in \mathbb{R}^d$, whose empirical risk minimizer $\theta^*(x)$ is the average of all the input points), we can change a user to have all their input vectors far from the mean, resulting in the same $O(1/n)$ sensitivity as before. Instead, we use the notion of deletion sensitivity. Here, instead of considering $x'$ that results from changing a user's data in $x$, we only consider $x'$ that results from removing a user's data completely.

Bounding Deletion Sensitivity of the Empirical Risk Minimizer. We show that the (local) deletion sensitivity of $\theta^*(x)$ is at most $\tilde{O}\big(\frac{1}{n\sqrt m}\big)$. To describe our approach, let us briefly recall the proof of (1) (item-level setting, i.e., $m = 1$) from Shalev-Shwartz et al. (2009). The proof proceeds by bounding the norm of the gradient at $\theta^* := \theta^*(x)$ with respect to $x'$ (i.e., $\|\nabla L(\theta^*(x); x')\|$); strong convexity then implies that $\theta^*(x')$ is close to $\theta^*(x)$. The gradient norm bound is based on the observation that $\nabla L(\theta^*; x) = 0$ due to optimality, and that $\nabla L(\theta^*; x) - \nabla L(\theta^*; x')$ is only $1/n$ times a difference of the gradients at the two input points that got changed from $x$ to $x'$. These two claims yield the desired $O(1/n)$ bound.

For the user-level setting, the situation is similar, except that $\nabla L(\theta^*; x) - \nabla L(\theta^*; x')$ now becomes $O\big(\frac{1}{nm}\big)$ times the sum of the gradients at all input points of a single user (that got removed from $x$ to $x'$). An observation we use here is that in SCO, where all $nm$ input points are drawn i.i.d., we may view the input generation as a two-step process: (i) draw the $nm$ input points, and (ii) randomly allocate these $nm$ input points to the $n$ users. With this view in mind, $\nabla L(\theta^*; x) = 0$ means that the sum of the $nm$ gradients is zero. The randomness in (ii) means that $\nabla L(\theta^*; x) - \nabla L(\theta^*; x')$ is now $O\big(\frac{1}{nm}\big)$ times the sum of $m$ vectors randomly chosen from these $nm$ vectors that sum to zero. Applying concentration inequalities (and a union bound), we can show that w.h.p. $\|\nabla L(\theta^*; x) - \nabla L(\theta^*; x')\| \le \tilde{O}\big(\frac{1}{n\sqrt m}\big)$, as desired; see the numerical sketch at the end of this overview.

Noise Addition Algorithm for Deletion Sensitivity. Adding noise is still not trivial, even after bounding the (local) deletion sensitivity. As stated earlier, since we do not have a bound on the (standard) sensitivity, adding Gaussian noise directly to $\theta^*(x)$ will not yield the desired DP guarantee. To overcome this, we give an algorithm that adds noise w.r.t. the (local) deletion sensitivity. At a high level, our algorithm has to perform a test to ensure that $x$ is sufficiently stable (akin to propose-test-release (Dwork & Lei, 2009)) before adding Gaussian noise. Our algorithm is an adaptation of that of Kohli & Laskowski (2021), which focuses on the real-valued case and adds Laplace noise.

From Output Perturbation to DP-SCO/DP-ERM. Finally, once we have the improved output perturbation algorithm, we use the localization-based algorithms (called Phased-SCO/Phased-ERM) of Feldman et al. (2020) with the enhanced output perturbation algorithm as a subroutine to arrive at our results for DP-SCO/DP-ERM in the convex case. The strongly convex case follows from a known black-box reduction from Bassily et al. (2014).
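The following quick numerical check of the two-step view above is ours (not from the paper): take $N = nm$ bounded vectors summing to zero, randomly allocate them to users, and measure the norm of one user's sum. Lemma 3.8 below predicts concentration around $G\sqrt m$, far below the worst case $mG$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d, G = 200, 50, 10, 1.0
v = rng.normal(size=(n * m, d))
v /= np.maximum(np.linalg.norm(v, axis=1, keepdims=True) / G, 1.0)  # ||v_i|| <= G
v -= v.mean(axis=0)   # enforce sum_i v_i = 0 (recentering changes norms only slightly)

norms = []
for _ in range(1000):
    perm = rng.permutation(n * m)
    norms.append(np.linalg.norm(v[perm[:m]].sum(axis=0)))  # one user's m gradients
print(np.quantile(norms, 0.99))   # concentrates near G*sqrt(m) ~ 7, not m*G = 50
```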
Remark. Our algorithm for ERM guarantees an $\tilde{O}_\varepsilon\big(\frac{\sqrt d}{n\sqrt m}\big)$ excess risk w.h.p. over the input being a random permutation of any given dataset $x$. We emphasize that this is a mild assumption on the distribution of the dataset, and the same guarantees immediately follow under stronger assumptions, such as the dataset $x$ being drawn from any exchangeable distribution, e.g., drawn i.i.d. from $D$. Furthermore, we stress that it is impossible to have an excess risk bound for ERM that is better than $\tilde{O}_\varepsilon(\sqrt d/n)$ for worst-case datasets, since the $x_{i,j}$ could be all the same for each $i$, which is essentially identical to the item-level setting with $m = 1$.

Comparison to Previous Work. Previous work (Levy et al., 2021; Narayanan et al., 2022) on user-level DP-SCO and DP-ERM tackles the problem using privatized first-order methods (i.e., variants of gradient descent), sometimes with regularization. The main tool in these works is a user-level DP algorithm for mean estimation of vectors, which is used to aggregate the gradients with errors smaller than in the item-level setting. Such a result needs to rely on the fact that the average of the gradients of each user is well-concentrated; this can be interpreted as the average gradient having low deletion sensitivity. As discussed earlier, our result significantly generalizes this by showing that this also holds for the minimizer of any strongly convex function. Our algorithms also provide a novel paradigm of output perturbation for user-level DP learning, deviating from the first-order methods explored in previous works.

In addition to the aforementioned work of Kohli & Laskowski (2021), a notion similar to local deletion sensitivity has been studied in the context of DP graph analysis under the names of empirical sensitivity (Chen & Zhou, 2013) and down sensitivity (Raskhodnikova & Smith, 2016). Several mechanisms were developed using this notion, including an algorithm for monotonic real-valued functions (Chen & Zhou, 2013) and for many graph parameters. However, we are not aware of a generic algorithm for the high-dimensional case similar to our Algorithm 1.

3. Output Perturbation for Strongly Convex Losses

At the heart of our results is a new DP output perturbation algorithm (Algorithm 2) for strongly convex losses. The guarantee of this algorithm does not hold for every worst-case dataset, but instead holds for a random permutation of any given dataset. In particular, for any permutation $\pi$ over $[n] \times [m]$, let $x^\pi$ be the permutation of $x$ by $\pi$, i.e., $x^\pi_{i,j} := x_{\pi(i,j)}$. As discussed in the previous section, this is a mild assumption, but it is required for our results.

Theorem 3.1. Fix a $G$-Lipschitz and $\mu$-strongly convex loss $\ell$ and a sufficiently large constant $C$. For all $\varepsilon, \delta, \beta > 0$ and $n \ge C\log(1/\delta)/\varepsilon$, there exists an $(\varepsilon, \delta)$-DP algorithm SCOutputPert such that, for all $x$, with probability $1 - \beta$ over a random permutation $\pi$ over $[n] \times [m]$, SCOutputPert$(x^\pi)$ outputs $\theta^*(x^\pi) + N(0, \sigma^2 I)$ where

$$\sigma = O\left(\frac{G\sqrt{\log n \cdot \log(1/\delta)/\varepsilon + \log(1/\beta)}}{\mu n\sqrt m} \cdot \frac{(\log(1/\delta))^{1.5}}{\varepsilon^2}\right).$$

The expected $\ell_2$-distance between the output estimate and the true minimizer thus scales as $\tilde{O}_\varepsilon\big(\frac{\sqrt d}{n\sqrt m}\big)$. This should be compared with the item-level (i.e., $m = 1$) setting, where the bound is $\tilde{O}_\varepsilon(\sqrt d/n)$ (Chaudhuri & Monteleoni, 2008; Chaudhuri et al., 2011).
3.1. Deletion Sensitivity & A Generic Output Perturbation Algorithm

Before we can prove Theorem 3.1, we need to introduce the notion of local deletion sensitivity and present a generic output perturbation algorithm for functions and datasets with low local deletion sensitivity. We stress that the algorithm in this section works for any such function and can be applied beyond the context of convex optimization.

Local Deletion Sensitivity. For any $x = (x_1, \dots, x_n)$, let $x_{-i}$ denote the dataset obtained by deleting the $i$th user's data $x_i$ from $x$. For any subset $S \subseteq [n]$, let $x_{-S}$ denote the dataset obtained by deleting $x_i$ from $x$ for all $i \in S$.

Definition 3.2. The local ($\ell_2$-)deletion sensitivity of a function $f$ at a dataset $x$ with $n$ users is defined as $\Delta_f(x) := \max_{i \in [n]} \|f(x) - f(x_{-i})\|$. Moreover, for $r \in \mathbb{N}$, let $\Delta^r_f(x) := \max_{S \subseteq [n], |S| \le r} \Delta_f(x_{-S})$.

The difference between the usual definition of local sensitivity (Nissim et al., 2007) and that of local deletion sensitivity is that the latter definition only applies to removals of a user's data. This means that standard frameworks such as propose-test-release (Dwork & Lei, 2009) cannot be directly used here. We however show that this sensitivity notion still allows us to design an algorithm with small error on any dataset for which $\Delta^r_f(x)$ is small for a sufficiently large $r = \Theta(\log(1/\delta)/\varepsilon)$. The guarantee is given below.

Theorem 3.3. Let $f$ be any function mapping datasets to $\mathbb{R}^d$, and let $\Delta > 0$ be a predefined parameter. There exists an $(\varepsilon, \delta)$-DP algorithm that either outputs $\bot$ or a vector in $\mathbb{R}^d$. Furthermore, there exists $r = O(\log(1/\delta)/\varepsilon)$ such that, on any input dataset $x$ that satisfies $\Delta^r_f(x) \le \Delta$, it never returns $\bot$ and simply returns $f(x) + N(0, \sigma^2 I)$ where $\sigma = O\big(\Delta \cdot \frac{(\log(1/\delta))^{1.5}}{\varepsilon^2}\big)$.

The general idea is to find a sufficiently stable dataset $\hat x$ and add noise to $f(\hat x)$. Although we may wish to just set $\hat x = x$ directly and check that the local deletion sensitivity is small, we cannot do this, as changing a single user's data can change whether we pass the test. Therefore, similar to the propose-test-release framework, we check $x_{-S}$ for all subsets $S$ with $|S| \le R_1$, where $R_1$ is a shifted truncated discrete Laplace random variable, as defined below. This allows us to maintain the closeness of the acceptance probability across neighboring input datasets. The full description is given in Algorithm 1. As stated earlier, our algorithm is a modification of that of Kohli & Laskowski (2021), which uses Laplace noise and a different distribution of $R_1$.

Definition 3.4 (Shifted Truncated Discrete Laplace Distribution). For any $\varepsilon, \delta > 0$, let $\kappa = \kappa(\varepsilon, \delta) := 1 + \lceil\ln(1/\delta)/\varepsilon\rceil$, and let $\mathrm{TDLap}(\varepsilon, \delta)$ be the distribution supported on $\{0, \dots, 2\kappa\}$ with probability mass function at $x$ proportional to $\exp(-\varepsilon|x - \kappa|)$.

Algorithm 1 DelOutputPert$_{\varepsilon,\delta,\Delta}(f; x)$
1: Input: Dataset $x$, function $f$ mapping datasets to $\mathbb{R}^d$
2: Parameters: Privacy parameters $\varepsilon, \delta$; target deletion sensitivity parameter $\Delta$
3: $\hat\varepsilon \leftarrow \varepsilon/2$, $\hat\delta \leftarrow \delta/(e^\varepsilon + 2)$, $\kappa \leftarrow \kappa(\hat\varepsilon, \hat\delta)$, $\sigma \leftarrow \sqrt{2\ln(2/\hat\delta)} \cdot (8\kappa\Delta)/\hat\varepsilon$
4: Sample $R_1 \sim \mathrm{TDLap}(\hat\varepsilon, \hat\delta)$  ▷ See Definition 3.4
5: $X^{R_1}_{\mathrm{stable}} \leftarrow \{x_{-S} : |S| \le R_1,\ \Delta^{4\kappa-|S|}_f(x_{-S}) \le \Delta\}$
6: if $X^{R_1}_{\mathrm{stable}} = \emptyset$ then
7:   return $\bot$
8: end if
9: Choose $x_{-S} \in X^{R_1}_{\mathrm{stable}}$ with smallest $|S|$  ▷ Ties broken arbitrarily
10: return $f(x_{-S}) + N(0, \sigma^2 I)$

To prove Theorem 3.3, the following observation is useful; below we write $X^{r_1}_{\mathrm{stable}}(x)$ for the set computed on Line 5 on input $x$ with $R_1 = r_1$.

Observation 3.5. For neighboring datasets $x \sim x'$ and all $r_1 \in \mathbb{Z}_{\ge 0}$, if $X^{r_1}_{\mathrm{stable}}(x') \ne \emptyset$, then $X^{r_1+1}_{\mathrm{stable}}(x) \ne \emptyset$.

Proof. Suppose $x' = (x'_1, \dots, x'_n)$. Let $x'_{-S'}$ be an element of $X^{r_1}_{\mathrm{stable}}(x')$. That is, we have $|S'| \le r_1$ and $\Delta^{r_2}_f(x'_{-S'}) \le \Delta$ for $r_2 = 4\kappa - |S'|$. Let $i^* \in [n]$ denote the user on which $x$ and $x'$ differ. We consider two cases. If $i^* \in S'$, then we simply have $x_{-S'} = x'_{-S'}$, and therefore $x_{-S'}$ also belongs to $X^{r_1+1}_{\mathrm{stable}}(x)$. If $i^* \notin S'$, let $S = S' \cup \{i^*\}$. We have $|S| \le r_1 + 1$ and, since $x_{-S}$ results from deleting one more user from $x'_{-S'}$, $\Delta^{r_2-1}_f(x_{-S}) \le \Delta^{r_2}_f(x'_{-S'}) \le \Delta$. This means that $x_{-S} \in X^{r_1+1}_{\mathrm{stable}}(x)$ as well. $\square$
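Before turning to the proof of Theorem 3.3, here is a compact Python sketch of Algorithm 1. The names and data layout are ours, `delta_r` is assumed to compute the local deletion sensitivity $\Delta^r_f$ of Definition 3.2 exactly, and the subset enumeration below is exactly the running-time bottleneck discussed in Section 6.

```python
import itertools
import math
import numpy as np

def sample_tdlap(eps, delta, rng):
    """Shifted truncated discrete Laplace on {0, ..., 2*kappa} (Definition 3.4)."""
    kappa = 1 + math.ceil(math.log(1 / delta) / eps)
    support = np.arange(2 * kappa + 1)
    weights = np.exp(-eps * np.abs(support - kappa))
    return rng.choice(support, p=weights / weights.sum()), kappa

def del_output_pert(f, x, delta_r, eps, delta, Delta, rng):
    eps_h, delta_h = eps / 2, delta / (math.e ** eps + 2)   # line 3 ("hat" params)
    r1, kappa = sample_tdlap(eps_h, delta_h, rng)
    sigma = math.sqrt(2 * math.log(2 / delta_h)) * 8 * kappa * Delta / eps_h
    stable = []                                             # X^{R1}_stable (line 5)
    for s in range(int(r1) + 1):
        for S in itertools.combinations(range(len(x)), s):
            x_minus_S = [xi for i, xi in enumerate(x) if i not in S]
            if delta_r(f, x_minus_S, 4 * kappa - s) <= Delta:
                stable.append((s, x_minus_S))
    if not stable:
        return None                                         # "bottom" (line 7)
    _, x_hat = min(stable, key=lambda t: t[0])              # smallest |S| (line 9)
    y = f(x_hat)
    return y + rng.normal(0, sigma, size=np.shape(y))       # line 10
```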
Proof of Theorem 3.3. Let $A$ be the DelOutputPert algorithm (Algorithm 1).

Privacy Analysis. Let $x \sim x'$ be neighboring datasets. First,

$$\begin{aligned} \Pr[A(x) = \bot] &= \textstyle\sum_{r_1=0}^{2\kappa} \mathbf{1}[A(x) = \bot \mid R_1 = r_1] \cdot \Pr[R_1 = r_1] \\ &= \textstyle\sum_{r_1=0}^{2\kappa} \mathbf{1}[X^{r_1}_{\mathrm{stable}}(x) = \emptyset] \cdot \Pr[R_1 = r_1] \\ &\le \Pr[R_1 = 0] + \textstyle\sum_{r_1=1}^{2\kappa} \mathbf{1}[X^{r_1}_{\mathrm{stable}}(x) = \emptyset] \cdot \Pr[R_1 = r_1] \\ &\le \hat\delta + \textstyle\sum_{r_1=1}^{2\kappa} \mathbf{1}[X^{r_1}_{\mathrm{stable}}(x) = \emptyset] \cdot e^{\hat\varepsilon}\Pr[R_1 = r_1 - 1] \\ &\le \hat\delta + \textstyle\sum_{r_1=1}^{2\kappa} \mathbf{1}[X^{r_1-1}_{\mathrm{stable}}(x') = \emptyset] \cdot e^{\hat\varepsilon}\Pr[R_1 = r_1 - 1] \\ &\le \hat\delta + e^{\hat\varepsilon}\textstyle\sum_{r_1=0}^{2\kappa} \mathbf{1}[A(x') = \bot \mid R_1 = r_1] \cdot \Pr[R_1 = r_1] \\ &= \hat\delta + e^{\hat\varepsilon}\Pr[A(x') = \bot], \end{aligned} \tag{2}$$

where the third-to-last inequality uses the contrapositive of Observation 3.5. Next, consider any set $S_0 \subseteq \mathbb{R}^d$. We have

$$\begin{aligned} \Pr[A(x) \in S_0] &= \textstyle\sum_{r_1=0}^{2\kappa} \Pr[A(x) \in S_0 \mid R_1 = r_1] \cdot \Pr[R_1 = r_1] \\ &\le \textstyle\sum_{r_1=0}^{2\kappa-1} \Pr[A(x) \in S_0 \mid R_1 = r_1] \cdot \Pr[R_1 = r_1] + \Pr[R_1 = 2\kappa] \\ &\le \hat\delta + e^{\hat\varepsilon}\textstyle\sum_{r_1=0}^{2\kappa-1} \Pr[A(x) \in S_0 \mid R_1 = r_1] \cdot \Pr[R_1 = r_1 + 1], \end{aligned} \tag{3}$$

where the last inequality follows since $R_1 \sim \mathrm{TDLap}(\hat\varepsilon, \hat\delta)$. To bound the term $\Pr[A(x) \in S_0 \mid R_1 = r_1]$ for $r_1 < 2\kappa$, observe that if it is non-zero, then it must be that $A(x) \ne \bot$, or equivalently that $X^{r_1}_{\mathrm{stable}}(x) \ne \emptyset$; Observation 3.5 then implies that $X^{r_1+1}_{\mathrm{stable}}(x') \ne \emptyset$, or equivalently, $A(x') \ne \bot$. Let $\hat x_{r_1}$ and $\hat x'_{r_1+1}$ be the datasets chosen on Line 9 when we run the algorithm on input $x$ with $R_1 = r_1$ and on input $x'$ with $R_1 = r_1 + 1$, respectively. Let $\bar x := \hat x_{r_1} \cap \hat x'_{r_1+1}$, i.e., the dataset obtained by further deleting every user deleted in either one. Each of $\hat x_{r_1}, \hat x'_{r_1+1}$ can be turned into $\bar x$ by deleting at most $r_1 + (r_1 + 1) + 1 \le 4\kappa$ users. Therefore, since $\hat x_{r_1} \in X^{r_1}_{\mathrm{stable}}(x)$ and $\hat x'_{r_1+1} \in X^{r_1+1}_{\mathrm{stable}}(x')$, the triangle inequality over these deletions gives $\|f(\hat x_{r_1}) - f(\bar x)\| \le 4\kappa\Delta$ and, similarly, $\|f(\hat x'_{r_1+1}) - f(\bar x)\| \le 4\kappa\Delta$. Combining the two, we can conclude that $\|f(\hat x_{r_1}) - f(\hat x'_{r_1+1})\| \le 8\kappa\Delta$.

From the privacy guarantee of the Gaussian mechanism (e.g., Dwork & Roth (2014, Appendix A)), we have $\Pr[f(\hat x_{r_1}) + N(0, \sigma^2 I) \in S_0] \le e^{\hat\varepsilon}\Pr[f(\hat x'_{r_1+1}) + N(0, \sigma^2 I) \in S_0] + \hat\delta$. Note that $\Pr[f(\hat x_{r_1}) + N(0, \sigma^2 I) \in S_0] = \Pr[A(x) \in S_0 \mid R_1 = r_1]$, while $\Pr[f(\hat x'_{r_1+1}) + N(0, \sigma^2 I) \in S_0] = \Pr[A(x') \in S_0 \mid R_1 = r_1 + 1]$. Plugging this back into (3), we get

$$\begin{aligned} \Pr[A(x) \in S_0] &\le \hat\delta + e^{\hat\varepsilon}\textstyle\sum_{r_1=0}^{2\kappa-1}\big(e^{\hat\varepsilon}\Pr[A(x') \in S_0 \mid R_1 = r_1 + 1] + \hat\delta\big)\cdot\Pr[R_1 = r_1 + 1] \\ &\le (e^{\hat\varepsilon} + 1)\hat\delta + e^{2\hat\varepsilon}\textstyle\sum_{r_1=0}^{2\kappa}\Pr[A(x') \in S_0 \mid R_1 = r_1]\cdot\Pr[R_1 = r_1] \\ &= (e^{\hat\varepsilon} + 1)\hat\delta + e^{2\hat\varepsilon}\Pr[A(x') \in S_0]. \end{aligned} \tag{4}$$

Now, consider any set $S$ of outcomes. Let $S_0 = S \cap \mathbb{R}^d$ and $S_\bot = S \cap \{\bot\}$. Then, we have

$$\begin{aligned} \Pr[A(x) \in S] &= \Pr[A(x) \in S_0] + \Pr[A(x) \in S_\bot] \\ &\overset{(4),(2)}{\le} (e^{\hat\varepsilon} + 1)\hat\delta + e^{2\hat\varepsilon}\Pr[A(x') \in S_0] + \hat\delta + e^{\hat\varepsilon}\Pr[A(x') \in S_\bot] \\ &\le (e^{\hat\varepsilon} + 2)\hat\delta + e^{2\hat\varepsilon}\Pr[A(x') \in S] \le \delta + e^{\varepsilon}\Pr[A(x') \in S], \end{aligned}$$

where the last step uses $\hat\varepsilon = \varepsilon/2$ and $\hat\delta = \delta/(e^\varepsilon + 2)$. Therefore, the algorithm is $(\varepsilon, \delta)$-DP as desired.

Accuracy Analysis. Let $x$ be any dataset such that $\Delta^{4\kappa}_f(x) \le \Delta$. This means that, for any $0 \le R_1 \le 2\kappa$, the choice $S = \emptyset$ satisfies $\Delta^{4\kappa-|S|}_f(x_{-S}) = \Delta^{4\kappa}_f(x) \le \Delta$; in other words, $x$ itself always belongs to $X^{R_1}_{\mathrm{stable}}$. Thus, we always have $\hat x = x$, and the output is simply drawn from $f(x) + N(0, \sigma^2 I)$, as claimed. $\square$

3.2. Deletion Sensitivity of Optimizers of Strongly Convex Losses

Having provided a generic noising algorithm for functions with low local deletion sensitivity, the next step is to show that the function we care about for convex optimization, namely the empirical risk minimizer, has low deletion sensitivity (with high probability), as formalized below.

Theorem 3.6. Let $\ell$ be any $\mu$-strongly convex loss function such that $\|\nabla\ell(\theta; x)\| \le G$ for all $\theta \in K$, $x \in X$. For all $x \in X^{n\times m}$ and $\beta < 1/e$, with probability $1 - \beta$ over the choice of a random permutation $\pi$ over $[n] \times [m]$, we have

$$\|\theta^*(x^\pi) - \theta^*(x^\pi_{-n})\| \le \frac{5G\sqrt{\log(1/\beta)}}{\mu(n-1)\sqrt m}.$$

Before proving this, we note that by applying a union bound over all the $n$ users and all subsets $S$ of size at most $r$, we arrive at Corollary 3.7. Theorem 3.1 now follows by defining SCOutputPert (Algorithm 2), which invokes DelOutputPert on the function $f$ being the empirical risk minimizer, and combining Corollary 3.7 with Theorem 3.3 (setting $r = 4\kappa$).

Corollary 3.7. Let $\ell$ be any $G$-Lipschitz, $\mu$-strongly convex loss. For all $x \in X^{n\times m}$ and $r \le n/2$, with probability $1 - \beta$ over the choice of a random permutation $\pi$ over $[n] \times [m]$, we have

$$\Delta^r_{\theta^*}(x^\pi) \le O\left(\frac{G\sqrt{r\log n + \log(1/\beta)}}{\mu n\sqrt m}\right).$$
Algorithm 2 SCOutputPert$_{\varepsilon,\delta,\beta,G,\mu,K}(\ell; x)$
1: Input: Dataset $x$, loss function $\ell : K \times X \to \mathbb{R}$
2: Parameters: Privacy parameters $\varepsilon, \delta$; target failure probability $\beta$; Lipschitz parameter $G$; strong convexity parameter $\mu$
3: $\Delta \leftarrow \Theta\big(\frac{G\sqrt{\kappa\log n + \log(1/\beta)}}{\mu n\sqrt m}\big)$  ▷ As in Corollary 3.7 with $r = 4\kappa$
4: return DelOutputPert$_{\varepsilon,\delta,\Delta}(f; x)$, where $f(\cdot) := \operatorname{argmin}_{\theta \in K} L(\theta; \cdot)$

In order to prove Theorem 3.6, we use the following lemma, proved in Appendix A.

Lemma 3.8. Let $v_1, \dots, v_N \in \mathbb{R}^d$ be any set of vectors satisfying $\sum_i v_i = 0$ and $\|v_i\| \le G$ for all $i$. For all $\beta < 1/e$, over the choice of a random permutation $\pi$ over $[N]$, it holds that

$$\Pr_\pi\left[\left\|\textstyle\sum_{j\in[m]} v_{\pi(j)}\right\| > 5G\sqrt{m\log(1/\beta)}\right] \le \beta.$$

Proof of Theorem 3.6. Let $\theta^* := \theta^*(x)$; note that, due to the symmetric nature of $L(\theta; \cdot)$, it holds that $\theta^*(x) = \theta^*(x^\pi)$ for all permutations $\pi$. Let $\theta^{*,\pi}_{-n} := \theta^*(x^\pi_{-n})$. Since $L(\cdot; x^\pi_{-n})$ is $\mu$-strongly convex (recall that $f$ is $\mu$-strongly convex iff $\|\nabla f(\theta) - \nabla f(\theta')\| \ge \mu\|\theta - \theta'\|$), and $\nabla L(\theta^{*,\pi}_{-n}; x^\pi_{-n}) = 0$ since $\theta^{*,\pi}_{-n}$ is the minimizer of $L(\cdot; x^\pi_{-n})$, we have that

$$\|\nabla L(\theta^*; x^\pi_{-n})\| = \|\nabla L(\theta^*; x^\pi_{-n}) - \nabla L(\theta^{*,\pi}_{-n}; x^\pi_{-n})\| \ge \mu\,\|\theta^* - \theta^{*,\pi}_{-n}\|. \tag{5}$$

Next, we upper bound $\|\nabla L(\theta^*; x^\pi_{-n})\|$. Writing $x^\pi_n$ for the data of the $n$th user of $x^\pi$, we have $0 = \nabla L(\theta^*; x^\pi) = \frac{n-1}{n}\nabla L(\theta^*; x^\pi_{-n}) + \frac{1}{n}\nabla L(\theta^*; x^\pi_n)$, and hence

$$\nabla L(\theta^*; x^\pi_{-n}) = -\frac{1}{n-1}\,\nabla L(\theta^*; x^\pi_n) = -\frac{1}{(n-1)m}\textstyle\sum_{j\in[m]}\nabla\ell(\theta^*; x_{\pi(n,j)}). \tag{6}$$

Since $\sum_{i\in[n],j\in[m]}\nabla\ell(\theta^*; x_{i,j}) = 0$ and $\|\nabla\ell(\theta^*; x_{i,j})\| \le G$, we have from Lemma 3.8 that

$$\Pr_\pi\left[\left\|\textstyle\sum_{j\in[m]}\nabla\ell(\theta^*; x_{\pi(n,j)})\right\| > 5G\sqrt{m\log(1/\beta)}\right] \le \beta. \tag{7}$$

Putting (6) and (7) together, we have that, with probability $1 - \beta$,

$$\|\nabla L(\theta^*; x^\pi_{-n})\| \le \frac{5G\sqrt{\log(1/\beta)}}{(n-1)\sqrt m}.$$

Combining this with (5) completes the proof. $\square$

4. User-Level DP-ERM

In this section, we describe our algorithms for DP-ERM and prove their excess risk bounds. As in Section 3, our guarantee holds for a random permutation of any dataset, a mild but necessary assumption.

4.1. Convex Losses

The formal guarantee when the loss is only assumed to be convex (and Lipschitz) is given below.

Theorem 4.1. For any $G$-Lipschitz convex loss $\ell$, there exists an $(\varepsilon, \delta)$-DP mechanism $M$ that, for all $n \ge \tilde\Omega\big(\frac{\log(1/\delta)\log(m)}{\varepsilon}\big)$, outputs $\hat\theta$ such that

$$\mathbb{E}_{\pi,\hat\theta\sim M(x^\pi)}[L(\hat\theta; x^\pi)] - L(\theta^*; x^\pi) \le \tilde{O}_\varepsilon\left(RG\,\frac{\sqrt d}{n\sqrt m}\right),$$

where $\tilde{O}_\varepsilon$ hides a multiplicative factor of $\mathrm{poly}(\log(1/\delta), \log(nm), 1/\varepsilon)$.

We use a Phased-ERM algorithm similar to that of Feldman et al. (2020), which requires solving a regularized ERM in each step. Our proof below closely follows their proof, although we change the algorithm slightly because their proof is for SCO, whereas the analysis below is for ERM. In particular, for ERM, we need to use the full dataset in each step. We also change some parameters accordingly. The full description is in Algorithm 3; note that on line 12, we only optimize over the set $K_i$, and use Lipschitz constant $2G$ and strong convexity parameter $\lambda_i$.

Algorithm 3 Phased-ERM
1: Input: Dataset $x$, loss function $\ell : K \times X \to \mathbb{R}$ that is convex and $G$-Lipschitz
2: Parameters: Privacy parameters $\varepsilon, \delta$; regularizer coefficient $\lambda$; target failure probability $\beta$
3: $T \leftarrow \lceil\log(nm)\rceil$  ▷ Number of iterations
4: $\varepsilon' \leftarrow \varepsilon/T$, $\delta' \leftarrow \delta/T$  ▷ Per-iteration privacy budgets
5: $\beta' \leftarrow \beta/T$  ▷ Per-iteration failure probability
6: $\hat\theta_0 \leftarrow$ arbitrary element of $K$  ▷ Initial parameter
7: for $i = 1, \dots, T$ do
8:   $\lambda_i \leftarrow \lambda \cdot 4^i$
9:   $R_i \leftarrow G/\lambda_i$
10:  Let $\ell_i(\theta; x) := \ell(\theta; x) + \frac{\lambda_i}{2}\|\theta - \hat\theta_{i-1}\|^2$
11:  $K_i \leftarrow K \cap B_d(\hat\theta_{i-1}, R_i)$
12:  $\hat\theta_i \leftarrow$ SCOutputPert$_{\varepsilon',\delta',\beta',2G,\lambda_i,K_i}(\ell_i; x)$
13: end for
14: return $\hat\theta_T$
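For concreteness, here is a Python skeleton of Algorithm 3. It is a sketch under assumptions: `sc_output_pert` stands in for SCOutputPert (Algorithm 2) and its keyword signature is our invention, `loss` is the user-supplied per-item loss, and the constrained regularized subproblem is left to that subroutine.

```python
import math
import numpy as np

def phased_erm(x, loss, G, R, eps, delta, beta, lam, theta0, sc_output_pert):
    n, m = len(x), len(x[0])
    T = math.ceil(math.log(n * m))                          # number of phases
    eps_p, delta_p, beta_p = eps / T, delta / T, beta / T   # per-iteration budgets
    theta = np.asarray(theta0, dtype=float)
    for i in range(1, T + 1):
        lam_i = lam * 4 ** i
        R_i = G / lam_i
        # l_i(theta'; x) := l(theta'; x) + (lam_i/2) * ||theta' - theta_{i-1}||^2,
        # minimized over K_i = K intersected with the ball B(theta_{i-1}, R_i)
        def loss_i(tp, xi, center=theta.copy(), li=lam_i):
            return loss(tp, xi) + 0.5 * li * np.sum((tp - center) ** 2)
        theta = sc_output_pert(loss_i, x, eps_p, delta_p, beta_p,
                               lipschitz=2 * G, strong_convexity=lam_i,
                               center=theta, radius=R_i)
    return theta
```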
To analyze the accuracy, let $\theta^*_i := \theta^*_{\ell_i,K_i}(x)$ for all $i \in [T]$. It should be noted that $\theta^*_i$ is also equal to $\theta^*_{\ell_i,K}(x)$ (where the optimization is over $K$ instead of $K_i$). Furthermore, within $K_i$, the loss $\ell_i$ is $2G$-Lipschitz. We start with the following lemma, which is an analogue of Feldman et al. (2020, Lemma 4.7).

Lemma 4.2. For any $\theta \in K$ and $i \in [T]$, we have $L(\theta^*_i; x) - L(\theta; x) \le \frac{\lambda_i}{2}\|\hat\theta_{i-1} - \theta\|^2$.

Proof. This is simply because

$$L(\theta^*_i; x) - L(\theta; x) \le L_i(\theta^*_i; x) - L(\theta; x) \le L_i(\theta; x) - L(\theta; x) = \frac{\lambda_i}{2}\|\theta - \hat\theta_{i-1}\|^2. \qquad\square$$

We are now ready to prove Theorem 4.1. The usual analysis of the standard Phased-ERM algorithm of Feldman et al. (2020), where SCOutputPert is replaced by an algorithm that just adds Gaussian noise to the true optimizer, shows that it has small excess risk. We then relate Algorithm 3 back to this standard version by using Theorem 3.1 to show that the output distribution of our algorithm (over a random $\pi$) is very close in total variation distance to this standard version. This idea is formalized below.

Proof of Theorem 4.1. We run the Phased-ERM algorithm (Algorithm 3) where we set $\lambda = \frac{G\sqrt d}{Rn\sqrt m}$ and $\beta = \frac{1}{nm}$; throughout the proof, we will use $M$ as a shorthand for this algorithm. The privacy guarantee follows immediately from the fact that each call to SCOutputPert is $(\varepsilon', \delta')$-DP and the basic composition of DP.

To understand its accuracy guarantee, let us start by considering another algorithm $M'$ that is the same as $M$, except that on line 12 we do not call SCOutputPert but instead directly let $\hat\theta_i \leftarrow \theta^*_i(x) + N(0, \sigma_i^2 I)$, where $\sigma_i := \sigma(\varepsilon', \delta', \beta', 2G, \lambda_i)$ is as in Theorem 3.1. For convenience, we let $\theta^*_0 = \theta^*(x)$. We have

$$\begin{aligned} L(\hat\theta_T; x) - L(\theta^*_0; x) &= \big(L(\hat\theta_T; x) - L(\theta^*_T; x)\big) + \textstyle\sum_{i=1}^{T}\big(L(\theta^*_i; x) - L(\theta^*_{i-1}; x)\big) \\ &\le G\,\|\hat\theta_T - \theta^*_T\| + \textstyle\sum_{i=1}^{T}\frac{\lambda_i}{2}\|\hat\theta_{i-1} - \theta^*_{i-1}\|^2 \\ &\le O(\lambda R^2) + G\,\|\hat\theta_T - \theta^*_T\| + \textstyle\sum_{i=1}^{T-1}\frac{\lambda_{i+1}}{2}\|\hat\theta_i - \theta^*_i\|^2, \end{aligned}$$

where we used Lemma 4.2 for the first inequality. Thus, we have (where the expectation is over the randomness of $M'$, i.e., the noise drawn from $N(0, \sigma_i^2 I)$ for each $i \in [T]$)

$$\begin{aligned} \mathbb{E}_{\hat\theta\sim M'(x)}[L(\hat\theta; x)] - L(\theta^*; x) &\le O(\lambda R^2) + G\sqrt d\,\sigma_T + \textstyle\sum_{i=1}^{T-1}\lambda_{i+1}\,d\,\sigma_i^2 \\ &\le \tilde{O}_\varepsilon\left(\lambda R^2 + \frac{G^2\sqrt d}{\lambda_T\,n\sqrt m} + \textstyle\sum_{i=1}^{T-1}\lambda_{i+1}\,\frac{d\,G^2}{\lambda_i^2 n^2 m}\right) \le \tilde{O}_\varepsilon\left(\lambda R^2 + \frac{d\,G^2}{\lambda\,n^2 m}\right), \end{aligned}$$

where the last inequality comes from our setting of $\lambda_i = \lambda \cdot 4^i$. Finally, from our setting of $\lambda = \frac{G\sqrt d}{Rn\sqrt m}$, we can conclude

$$\mathbb{E}_{\hat\theta\sim M'(x)}[L(\hat\theta; x)] - L(\theta^*; x) \le \tilde{O}_\varepsilon\left(RG\,\frac{\sqrt d}{n\sqrt m}\right). \tag{8}$$

Let $P'$ denote the distribution of the output of running $M'$ on $x$, and let $P$ denote the distribution of the output of running $M$ on $x^\pi$, where $\pi$ is a uniformly random permutation. Next, we will show that

$$d_{\mathrm{tv}}(P', P) \le \beta. \tag{9}$$

Before proving this, note that (8) and (9) together imply the bound in the theorem, because we have

$$\mathbb{E}_{\pi,\hat\theta\sim M(x^\pi)}[L(\hat\theta; x)] - L(\theta^*; x) \overset{(9)}{\le} \mathbb{E}_{\hat\theta\sim M'(x)}[L(\hat\theta; x)] - L(\theta^*; x) + \beta\cdot RG \overset{(8)}{\le} \tilde{O}_\varepsilon\left(RG\,\frac{\sqrt d}{n\sqrt m}\right).$$

We are left with proving (9). To do this, for every $i \in \{0, \dots, T\}$, consider a hybrid algorithm $M_i$ where, in the first $i$ iterations, we follow $M'$ and, in the remaining iterations, we follow $M$. Let $P_i$ denote the probability distribution of the output of $M_i$ on input $x^\pi$, where $\pi$ is a uniformly random permutation. Notice that $P_0 = P$ and $P_T = P'$.

For every $i \in [T]$, consider $P_i$ and $P_{i-1}$. They differ only in the $i$th iteration. Thus, $d_{\mathrm{tv}}(P_i, P_{i-1})$ is at most the probability that SCOutputPert does not output a sample from $\theta^*_i + N(0, \sigma_i^2 I)$. By Theorem 3.1, the probability (over $\pi$) that this happens is at most $\beta'$. (Note that the distribution of $\theta^*_i$ is independent of $\pi$, since we are running $M'$ for the first $i - 1$ steps; thereby, we can still apply Theorem 3.1, which only relies on the randomness in $\pi$.) Therefore, we have that $d_{\mathrm{tv}}(P_i, P_{i-1}) \le \beta'$. Thus, $d_{\mathrm{tv}}(P, P') \le \sum_{i\in[T]} d_{\mathrm{tv}}(P_{i-1}, P_i) \le T\beta' = \beta$, concluding our proof. $\square$

4.2. Strongly Convex Losses

For strongly convex losses, we can get an improved bound:

Theorem 4.3. For any $G$-Lipschitz, $\mu$-strongly convex loss $\ell$, there exists an $(\varepsilon, \delta)$-DP mechanism $M$ that, for all $n \ge \tilde\Omega\big(\frac{\log(1/\delta)\log(m)}{\varepsilon}\big)$, outputs $\hat\theta$ such that

$$\mathbb{E}_{\pi,\hat\theta\sim M(x^\pi)}[L(\hat\theta; x^\pi)] - L(\theta^*; x^\pi) \le \tilde{O}_\varepsilon\left(\frac{G^2 d}{\mu n^2 m}\right),$$

where $\tilde{O}_\varepsilon$ hides a multiplicative factor of $\mathrm{poly}(\log(1/\delta), \log(nm), 1/\varepsilon)$.
We obtain the above result by reducing back to the convex case. This reduction essentially dates back to Bassily et al. (2014) and works as follows: first, we apply the output perturbation algorithm (Theorem 3.1). With high probability, the output is within a ball of radius $R = \tilde{O}_\varepsilon\big(\frac{G\sqrt d}{\mu n\sqrt m}\big)$ around the true minimizer. We then run the algorithm from Theorem 4.1 using this $R$, which yields a final excess risk of $\tilde{O}_\varepsilon\big(RG\frac{\sqrt d}{n\sqrt m}\big) = \tilde{O}_\varepsilon\big(\frac{G^2 d}{\mu n^2 m}\big)$, as desired. The full proof is presented in Appendix B.1.

5. User-Level DP-SCO

We next describe our algorithms for DP-SCO and their excess risk guarantees.

5.1. Convex Losses

For the convex (and Lipschitz) loss case, the risk bound is similar to that of Theorem 4.1, except with an additional additive term $\tilde O\big(\frac{1}{\sqrt{nm}}\big)$:

Theorem 5.1. For any $G$-Lipschitz convex loss $\ell$, there exists an $(\varepsilon, \delta)$-DP mechanism $M$ that, for all $n \ge \tilde\Omega\big(\frac{\log(1/\delta)\log(m)}{\varepsilon}\big)$, outputs $\hat\theta$ such that

$$\mathbb{E}_{x\sim D^{n\times m},\hat\theta\sim M(x)}[L(\hat\theta; D)] - L(\theta^*; D) \le \tilde{O}_\varepsilon\left(RG\left(\frac{\sqrt d}{n\sqrt m} + \frac{1}{\sqrt{nm}}\right)\right),$$

where $\tilde{O}_\varepsilon$ hides a multiplicative factor of $\mathrm{poly}(\log(1/\delta), \log(nm), 1/\varepsilon)$.

The arguments in the previous section also extend to SCO. The idea, as before, is to replace the output perturbation step in the Phased-SCO algorithm of Feldman et al. (2020) with SCOutputPert. The full algorithm is presented in Algorithm 4; note that in the $i$th iteration, we only use the input from users $2^{-i}n, \dots, 2^{-i+1}n$ to perform SCOutputPert (a short sketch of this batch schedule follows Lemma 5.2).

Algorithm 4 Phased-SCO
1: Input: Dataset $x$, loss function $\ell : K \times X \to \mathbb{R}$ that is convex and $G$-Lipschitz
2: Parameters: Privacy parameters $\varepsilon, \delta$; regularizer coefficient $\lambda$; target failure probability $\beta$
3: $N_0 \leftarrow C\log(1/\delta)/\varepsilon$ for some sufficiently large constant $C$
4: $T \leftarrow \lfloor\log(n/N_0)\rfloor$  ▷ Number of iterations
5: $\beta' \leftarrow \beta/T$  ▷ Per-iteration failure probability
6: $\hat\theta_0 \leftarrow$ arbitrary element of $K$  ▷ Initial parameter
7: for $i = 1, \dots, T$ do
8:   $\lambda_i \leftarrow \lambda \cdot 4^i$
9:   $R_i \leftarrow G/\lambda_i$
10:  Let $\ell_i(\theta; x) := \ell(\theta; x) + \frac{\lambda_i}{2}\|\theta - \hat\theta_{i-1}\|^2$
11:  $K_i \leftarrow K \cap B_d(\hat\theta_{i-1}, R_i)$
12:  $x^{(i)} \leftarrow (x_j : j \in (2^{-i}n,\ 2^{-i+1}n])$
13:  $\hat\theta_i \leftarrow$ SCOutputPert$_{\varepsilon,\delta,\beta',2G,\lambda_i,K_i}(\ell_i; x^{(i)})$
14: end for
15: return $\hat\theta_T$

Similar to before, let $\theta^*_i := \theta^*_{\ell_i,K_i}(D)$ for all $i \in [T]$. Again, note that $\theta^*_i = \theta^*_{\ell_i,K}(D)$ (where the optimization is over $K$ instead of $K_i$). Furthermore, within $K_i$, the loss $\ell_i$ is $2G$-Lipschitz. We use the following lemma, analogous to Lemma 4.2.

Lemma 5.2. For any $\theta \in K$ and $i \in [T]$, we have

$$L(\theta^*_i; D) - L(\theta; D) \le \frac{\lambda_i}{2}\|\hat\theta_{i-1} - \theta\|^2 + \frac{16G^2}{\lambda_i\,2^{-i}nm}.$$

Proof. The objective $\ell_i(\theta; x^{(i)})$ is $2G$-Lipschitz and $\lambda_i$-strongly convex, and the batch $x^{(i)}$ consists of $2^{-i}nm$ samples. Therefore, by Shalev-Shwartz et al. (2009, Theorem 7), we get that

$$L(\theta^*_i; D) - L(\theta; D) \le \frac{\lambda_i}{2}\|\theta - \hat\theta_{i-1}\|^2 + \frac{4(2G)^2}{\lambda_i\,2^{-i}nm}. \qquad\square$$
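The only structural change relative to Phased-ERM is the batch schedule on line 12 of Algorithm 4: phase $i$ consumes a fresh dyadic block of users, so no sample is reused across phases. A sketch of this indexing (our own helper, not part of the paper):

```python
import math

def phase_batches(n, N0):
    """Yield (phase index, user index range) for the T = log2(n/N0) phases of
    Algorithm 4; phase i uses the users in (2^{-i}n, 2^{-i+1}n]."""
    T = math.floor(math.log2(n / N0))
    for i in range(1, T + 1):
        yield i, range(n // 2 ** i, n // 2 ** (i - 1))

# e.g. [len(r) for _, r in phase_batches(1024, 64)] == [512, 256, 128, 64],
# so every batch has at least ~N0 users, as required by SCOutputPert.
```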
Proof of Theorem 5.1. We run the Phased-SCO algorithm (Algorithm 4) where we set $\lambda = \frac{G\sqrt d}{Rn\sqrt m}$ and $\beta = \frac{1}{nm}$; we will use $M$ as a shorthand for this algorithm. The main difference from Algorithm 3 is that we use different batch sizes (and do not reuse sample points across phases). The analysis is similar to the proof of Theorem 4.1, with corresponding changes to Lemma 4.2.

The privacy guarantee follows immediately from the fact that each call to SCOutputPert is $(\varepsilon, \delta)$-DP and the parallel composition of DP (McSherry, 2010); note that we maintain a minimum batch size of $N_0$ users, as required for SCOutputPert, so that we maintain DP. Further, as in the analysis of Theorem 4.1, we can consider another algorithm $M'$ which is the same as $M$, except that on line 13 it does not call SCOutputPert but instead directly lets $\hat\theta_i \leftarrow \theta^*_i + N(0, \sigma_i^2 I)$, where $\sigma_i := \sigma(\varepsilon, \delta, \beta', 2G, \lambda_i)$ is as in Theorem 3.1. The proof of Theorem 5.1 is completed by following the same analysis as in the proof of Theorem 4.1, with Lemma 4.2 replaced by Lemma 5.2. $\square$

5.2. Strongly Convex Losses

We obtain better excess risk bounds for the case of strongly convex losses, as stated below. The proof is similar to that of DP-ERM for strongly convex losses, i.e., we use output perturbation and then run DP-SCO for convex losses (Theorem 5.1) using a smaller radius. An additional step here is to show that an empirical minimizer is $\tilde O\big(\frac{G}{\mu\sqrt{nm}}\big)$-close to the population minimizer w.h.p. (which might be of independent interest; see Proposition B.1). The full proof is presented in Appendix B.2.

Theorem 5.3. For any $G$-Lipschitz, $\mu$-strongly convex loss $\ell$, there exists an $(\varepsilon, \delta)$-DP mechanism $M$ that, for all $n \ge \tilde\Omega\big(\frac{\log(1/\delta)\log(m)}{\varepsilon}\big)$, outputs $\hat\theta$ such that

$$\mathbb{E}_{x\sim D^{n\times m},\hat\theta\sim M(x)}[L(\hat\theta; D)] - L(\theta^*; D) \le \tilde{O}_\varepsilon\left(\frac{G^2}{\mu}\left(\frac{d}{n^2 m} + \frac{1}{nm}\right)\right),$$

where $\tilde{O}_\varepsilon$ hides a multiplicative factor of $\mathrm{poly}(\log(1/\delta), \log(nm), 1/\varepsilon)$.

6. Discussion & Open Problems

Although we do not discuss the running time of our algorithms, it can be seen that they run in $n^{O(\log(1/\delta)/\varepsilon)}(md)^{O(1)}$ time; the bottleneck comes from the step that computes $X^{R_1}_{\mathrm{stable}}$ in DelOutputPert, which requires enumerating all subsets $S$ of size at most $R_1 = O\big(\frac{\log(1/\delta)}{\varepsilon}\big)$. In Appendix C, we describe a speed-up for all our DP-SCO/ERM results that makes the algorithm run in polynomial (in $n, m, d$) time with high probability. However, with the remaining $o(1)$ probability, the algorithm may still take $n^{O(\log(1/\delta)/\varepsilon)}(md)^{O(1)}$ time. It remains open whether we can get an algorithm whose running time is polynomial in the worst case. As discussed in the introduction, it was not known before whether the excess risk bounds that we achieve were obtainable at all (even with inefficient algorithms).

Another question is whether we can get a tight dependency on $\delta, \varepsilon$. Specifically, our ERM excess risk bound in the convex case has a factor of $O\big(\frac{(\log(1/\delta))^2}{\varepsilon^{2.5}}\big)$, while previous bounds only had $\tilde O\big(\frac{\sqrt{\log(1/\delta)}}{\varepsilon}\big)$. Note that our larger dependency is indeed due to the generic output perturbation algorithm (DelOutputPert), which requires the noise scale $\sigma$ to be inflated by a factor of $\kappa = O\big(\frac{\log(1/\delta)}{\varepsilon}\big)$, and the union bound performed for Corollary 3.7, which includes another $\kappa$ factor. Therefore, this question may be related to the previous question.

Acknowledgments

Pritish Kamath would like to thank Gene Li, Ohad Shamir, and Nathan Srebro for helpful discussions about stochastic convex optimization. Pasin Manurangsi would also like to thank Adam Sealfon for useful discussions and for pointers to the DP graph analysis literature.

References

Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory, 58(5):3235-3249, 2012.

Asi, H., Feldman, V., Koren, T., and Talwar, K. Private stochastic convex optimization: Optimal rates in ℓ1 geometry. In ICML, pp. 393-403, 2021.

Bassily, R., Smith, A. D., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pp. 464-473, 2014.

Bassily, R., Feldman, V., Talwar, K., and Thakurta, A. G. Private stochastic convex optimization with optimal rates. In NeurIPS, pp. 11279-11288, 2019.
Chaudhuri, K. and Monteleoni, C. Privacy-preserving logistic regression. In NIPS, pp. 289-296, 2008.

Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069-1109, 2011.

Chen, S. and Zhou, S. Recursive mechanism: towards node differential privacy and unrestricted joins. In SIGMOD, pp. 653-664, 2013.

Dwork, C. and Lei, J. Differential privacy and robust statistics. In STOC, pp. 371-380, 2009.

Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211-407, 2014.

Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pp. 486-503, 2006a.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. D. Calibrating noise to sensitivity in private data analysis. In TCC, pp. 265-284, 2006b.

Feldman, V., Koren, T., and Talwar, K. Private stochastic convex optimization: optimal rates in linear time. In STOC, pp. 439-449, 2020.

Ghazi, B., Kumar, R., and Manurangsi, P. User-level differentially private learning via correlated sampling. In NeurIPS, pp. 20172-20184, 2021.

Gopi, S., Lee, Y. T., and Liu, D. Private convex optimization via exponential mechanism. In COLT, pp. 1948-1989, 2022.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Found. Trends Mach. Learn., 14(1-2), 2021.

Kamath, G., Li, J., Singhal, V., and Ullman, J. Privately learning high-dimensional distributions. In COLT, pp. 1853-1902, 2019.

Kifer, D., Smith, A. D., and Thakurta, A. Private convex optimization for empirical risk minimization with applications to high-dimensional regression. In COLT, pp. 25.1-25.40, 2012.

Kohli, N. and Laskowski, P. Differential privacy for black-box statistical analyses. In TPDP, 2021.

Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. Ann. Stat., 28(5):1302-1338, 2000.

Levy, D., Sun, Z., Amin, K., Kale, S., Kulesza, A., Mohri, M., and Suresh, A. T. Learning with user-level privacy. In NeurIPS, pp. 12466-12479, 2021.

Liu, Y., Suresh, A. T., Yu, F. X., Kumar, S., and Riley, M. Learning discrete distributions: user vs item-level privacy. In NeurIPS, 2020.

McSherry, F. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Commun. ACM, 53(9):89-97, 2010.

Narayanan, S., Mirrokni, V. S., and Esfandiari, H. Tight and robust private mean estimation with few users. In ICML, pp. 16383-16412, 2022.

Nissim, K., Raskhodnikova, S., and Smith, A. D. Smooth sensitivity and sampling in private data analysis. In STOC, pp. 75-84, 2007.

Raskhodnikova, S. and Smith, A. D. Differentially private analysis of graphs. In Encyclopedia of Algorithms, pp. 543-547. Springer, 2016.

Sambale, H. and Sinulis, A. Concentration inequalities on the multislice and for sampling without replacement. Journal of Theoretical Probability, 35:2712-2737, 2022.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Stochastic convex optimization. In COLT, 2009.

Vadhan, S. P. The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pp. 347-450. Springer International Publishing, 2017.

Wang, D., Ye, M., and Xu, J. Differentially private empirical risk minimization revisited: Faster and more general. In NIPS, pp. 2722-2731, 2017.
A. Proof of Lemma 3.8

Lemma 3.8 follows quite immediately as an application of a special case of Proposition 1 in Sambale & Sinulis (2022), as stated below. Let $S_N$ denote the set of all permutations over $[N]$, and for any $\pi \in S_N$, let $\pi_{i\leftrightarrow j}$ denote the permutation with the $i$th and $j$th entries swapped.

Proposition A.1 (Proposition 1 in Sambale & Sinulis (2022)). Let $f : S_N \to \mathbb{R}$ be a real-valued function over $S_N$ such that $|f(\pi) - f(\pi_{i\leftrightarrow j})| \le c_{i,j}$ for all $\pi \in S_N$ and all $1 \le i < j \le N$, for some $c_{i,j} \ge 0$. For any $t \ge 0$, it holds that

$$\Pr_{\pi\sim S_N}\big[f(\pi) \ge \mathbb{E}[f(\pi)] + t\big] \le \exp\left(-\frac{Nt^2}{4\sum_{1\le i<j\le N} c_{i,j}^2}\right).$$

Proof of Lemma 3.8. Let $f(\pi) := \|\sum_{j\in[m]} v_{\pi(j)}\|$. First, a direct computation of the second moment of a sum of $m$ vectors sampled without replacement, using $\sum_i v_i = 0$ and $\|v_i\| \le G$, together with Jensen's inequality, gives $\mathbb{E}[f(\pi)] \le \sqrt{\mathbb{E}[f(\pi)^2]} \le \sqrt m\,G$.

Note that $f(\pi) = f(\pi_{i\leftrightarrow j})$ whenever $i < j \le m$ or $m < i < j$. Moreover, when $i \le m$ and $j > m$, it holds that

$$|f(\pi) - f(\pi_{i\leftrightarrow j})| \le \left\|\textstyle\sum_{k=1}^m v_{\pi(k)} - \sum_{k=1}^m v_{\pi_{i\leftrightarrow j}(k)}\right\| = \|v_{\pi(i)} - v_{\pi(j)}\| \le 2G.$$

Thus, using Proposition A.1 with $c_{i,j} = 2G$ when $i \le m < j$ and $c_{i,j} = 0$ otherwise, we have that

$$\Pr_\pi\left[\left\|\textstyle\sum_{j\in[m]} v_{\pi(j)}\right\| \ge t + \sqrt m\,G\right] \le \exp\left(-\frac{Nt^2}{16\,m(N-m)G^2}\right) \le \exp\left(-\frac{t^2}{16\,mG^2}\right).$$

Choosing $t = 4G\sqrt{m\log(1/\beta)}$ completes the proof, since $t + \sqrt m\,G \le 5G\sqrt{m\log(1/\beta)}$ for $\beta < 1/e$. $\square$

B. Proofs of Improved Bounds for Strongly Convex Losses

B.1. Empirical Risk Minimization

Algorithm 5 Strongly-Convex-ERM
1: Input: Dataset $x$, loss function $\ell : K \times X \to \mathbb{R}$ that is $\mu$-strongly convex and $G$-Lipschitz
2: Parameters: Privacy parameters $\varepsilon, \delta$; target failure probability $\beta$
3: $\theta_0 \leftarrow$ SCOutputPert$_{\varepsilon/2,\delta/2,\beta,G,\mu,K}(\ell; x)$
4: $R' \leftarrow C\sigma(\varepsilon/2, \delta/2, \beta, G, \mu)\sqrt{d\log(1/\beta)}$ for a sufficiently large constant $C$
5: $K' \leftarrow K \cap B_d(\theta_0, R')$
6: $\lambda \leftarrow \frac{G\sqrt d}{R'n\sqrt m}$
7: $\hat\theta \leftarrow$ Phased-ERM$_{\varepsilon/2,\delta/2,\beta,G,\lambda,K'}(\ell; x)$
8: return $\hat\theta$

Proof of Theorem 4.3. The mechanism is Algorithm 5, which uses a two-step approach to get stronger rates for strongly convex losses, following a similar reduction in Bassily et al. (2014). It first uses the SCOutputPert algorithm with $(\varepsilon/2, \delta/2)$-DP, which with probability $1 - \beta$ returns $\theta_0 := \theta^*(x) + e$ where $e \sim N(0, \sigma^2 I)$ for the $\sigma$ specified in Theorem 3.1. From standard concentration, we have that $\Pr[\|e\| \ge C\sigma\sqrt{d\log(1/\beta)}] \le \beta$ for a suitable constant $C$. Thus, with probability $1 - 2\beta$, we have that $\theta^*(x)$ is indeed contained in $B_d(\theta_0, R')$ for $R' = C\sigma\sqrt{d\log(1/\beta)} = \tilde{O}_\varepsilon\big(\frac{G\sqrt d}{\mu n\sqrt m}\big)$; note that this can be much smaller than the diameter of $K$, which is at most $2G/\mu$. Finally, we use the Phased-ERM algorithm with $(\varepsilon/2, \delta/2)$-DP over the region $K' = K \cap B_d(\theta_0, R')$. Following the proof of Theorem 4.1, setting $\beta = 1/(2n^2 m)$, we have that

$$\mathbb{E}[L(\hat\theta; x)] - L(\theta^*; x) \le \tilde{O}_\varepsilon\left(\frac{G^2 d}{\mu n^2 m}\right).$$

The value of $\beta$ was chosen such that $\beta\bar RG \le O(G^2/(\mu n^2 m))$, where $\bar R$ is the diameter of $K$, which is at most $2G/\mu$. This accounts for the probability of at most $2\beta$ that either SCOutputPert fails or $\|e\| > C\sigma\sqrt{d\log(1/\beta)}$, in which case the excess risk is at most $\bar RG$. $\square$

B.2. Stochastic Convex Optimization

We rely on the following proposition, which, to the best of our knowledge, is not known in the literature.

Proposition B.1. For any $G$-Lipschitz, $\mu$-strongly convex loss $\ell$ and any distribution $D$, it holds for all $\beta < 1/e$ that, with probability $1 - \beta$ over $x \sim D^{n\times m}$,

$$\|\theta^*(x) - \theta^*(D)\| \le \frac{30G\sqrt{\log(2/\beta)}}{\mu\sqrt{nm}}.$$

Before we prove Proposition B.1, let us see how to use it to prove Theorem 5.3.

Algorithm 6 Strongly-Convex-SCO
1: Input: Dataset $x$, loss function $\ell : K \times X \to \mathbb{R}$ that is $\mu$-strongly convex and $G$-Lipschitz
2: Parameters: Privacy parameters $\varepsilon, \delta$; target failure probability $\beta$
3: $\theta_0 \leftarrow$ SCOutputPert$_{\varepsilon/2,\delta/2,\beta,G,\mu,K}(\ell; x)$
4: $R' \leftarrow C\big(\sigma(\varepsilon/2, \delta/2, \beta, G, \mu)\sqrt{d\log(1/\beta)} + \frac{G\sqrt{\log(1/\beta)}}{\mu\sqrt{nm}}\big)$ for a sufficiently large constant $C$
5: $K' \leftarrow K \cap B_d(\theta_0, R')$
6: $\lambda \leftarrow \frac{G\sqrt d}{R'n\sqrt m}$
7: $\hat\theta \leftarrow$ Phased-SCO$_{\varepsilon/2,\delta/2,\beta,G,\lambda,K'}(\ell; x)$
8: return $\hat\theta$
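Algorithms 5 and 6 share the same two-step pattern: localize with SCOutputPert, then re-solve with the phased algorithm on a much smaller ball. The following schematic is our own sketch of that pattern; the subroutine signatures (and the convention that `sc_output_pert` also reports its noise scale) are assumptions, not the paper's interfaces.

```python
import numpy as np

def strongly_convex_two_step(x, loss, G, mu, eps, delta, beta,
                             sc_output_pert, phased_solver, extra_radius=0.0):
    # Step 1: (eps/2, delta/2)-DP localization, theta0 ~ theta*(x) + N(0, sigma^2 I).
    theta0, sigma = sc_output_pert(loss, x, eps / 2, delta / 2, beta, G, mu)
    d = theta0.shape[0]
    # Step 2: w.h.p. the target lies in B(theta0, R'); re-solve on that ball.
    # Algorithm 6 adds the empirical-to-population term via extra_radius.
    R_prime = sigma * np.sqrt(d * np.log(1 / beta)) + extra_radius
    return phased_solver(loss, x, eps / 2, delta / 2, beta,
                         center=theta0, radius=R_prime)
```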
Proof of Theorem 5.3. Algorithm 6 is similar to Algorithm 5: namely, it first uses the SCOutputPert algorithm with $(\varepsilon/2, \delta/2)$-DP, which with probability $1 - \beta$ returns $\theta_0 := \theta^*(x) + e$ where $e \sim N(0, \sigma^2 I)$. Using Proposition B.1, we have that, with probability at least $1 - \beta$, it holds that $\|\theta^*(x) - \theta^*(D)\| \le O\big(\frac{G\sqrt{\log(1/\beta)}}{\mu\sqrt{nm}}\big)$. Thus, we have that $\theta^*(D)$ is contained in $B_d(\theta_0, R')$ for $R' = \tilde O_\varepsilon\big(\frac{G}{\mu}\big(\frac{\sqrt d}{n\sqrt m} + \frac{1}{\sqrt{nm}}\big)\big)$ with probability at least $1 - 2\beta$. Finally, we use the Phased-SCO algorithm with $(\varepsilon/2, \delta/2)$-DP over the region $K' = K \cap B_d(\theta_0, R')$. We get our desired conclusion by plugging the bound for $R'$ into Theorem 5.1, again setting $\beta = 1/(2n^2 m)$. $\square$

We suspect that Proposition B.1 might already be known in the literature. Since we are unaware of a reference, we include a proof for completeness, which incidentally uses our new result about deletion stability (Theorem 3.6).

Proof of Proposition B.1. First, it is well known from Shalev-Shwartz et al. (2009) that

$$\mathbb{E}_{x\sim D^{n\times m}}[L(\theta^*(x); D)] - L(\theta^*(D); D) \le \frac{4G^2}{\mu nm}.$$

On the other hand, from strong convexity we have, for all $x$, that $L(\theta^*(x); D) - L(\theta^*(D); D) \ge \frac{\mu}{2}\|\theta^*(x) - \theta^*(D)\|^2$. Combining the above (via Jensen's inequality), we have

$$\mathbb{E}_{x\sim D^{n\times m}}\big[\|\theta^*(x) - \theta^*(D)\|\big] \le \frac{3G}{\mu\sqrt{nm}}. \tag{10}$$

Additionally, from Theorem 3.6 (invoked with $m \leftarrow nm$ and $n \leftarrow 2$, followed by the triangle inequality and a union bound), it follows that

$$\Pr_{x,x'\sim D^{n\times m}}\left[\|\theta^*(x) - \theta^*(x')\| \ge \frac{10G\sqrt{\log(2/\beta)}}{\mu\sqrt{nm}}\right] \le \beta.$$

By an averaging argument, there exists $\theta_0 = \theta^*(x^{(0)})$ for some $x^{(0)}$, such that

$$\Pr_{x\sim D^{n\times m}}\left[\|\theta^*(x) - \theta_0\| \ge \frac{10G\sqrt{\log(2/\beta)}}{\mu\sqrt{nm}}\right] \le \beta.$$

Thus, combining with (10), we have

$$\frac{3G}{\mu\sqrt{nm}} \ge \mathbb{E}_{x\sim D^{n\times m}}\big[\|\theta^*(x) - \theta^*(D)\|\big] \ge (1-\beta)\left(\|\theta_0 - \theta^*(D)\| - \frac{10G\sqrt{\log(2/\beta)}}{\mu\sqrt{nm}}\right),$$

which implies $\|\theta_0 - \theta^*(D)\| \le \frac{20G\sqrt{\log(2/\beta)}}{\mu\sqrt{nm}}$ (for $\beta < 1/2$). Finally, by the triangle inequality, we get that, with probability $1 - \beta$,

$$\|\theta^*(x) - \theta^*(D)\| \le \|\theta^*(x) - \theta_0\| + \|\theta_0 - \theta^*(D)\| \le \frac{30G\sqrt{\log(2/\beta)}}{\mu\sqrt{nm}}. \qquad\square$$

C. On Speeding up our Algorithms

As stated in Section 6, the time bottleneck of our algorithm is DelOutputPert, which requires computing $X^{R_1}_{\mathrm{stable}}$. Doing this in a straightforward manner requires enumerating all sets $S$ of size at most $R_1$, resulting in a running time of $n^{R_1}(md)^{O(1)} = n^{O(\log(1/\delta)/\varepsilon)}(md)^{O(1)}$. In this section, we sketch an argument that brings the time down to $(nmd)^{O(1)}$ with high probability, while maintaining all the error bounds to within an $\tilde O_\varepsilon(1)$ factor.

Note that all our algorithms invoke DelOutputPert through Theorem 3.1 (i.e., the SCOutputPert algorithm). Therefore, it suffices to argue how to achieve the speed-up for SCOutputPert. The first observation here is that if $x$ itself belongs to $X^{R_1}_{\mathrm{stable}}$, then we can just output $\theta^*(x) + N(0, \sigma^2 I)$. Furthermore, we have already shown (Theorem 3.1) that $x \in X^{R_1}_{\mathrm{stable}}$ with high probability. Thus, if we can give a certificate that $x \in X^{R_1}_{\mathrm{stable}}$, then we can completely skip the check and just output $\theta^*(x) + N(0, \sigma^2 I)$; this means that, whenever we have such a certificate, our algorithm runs in polynomial (in $n, m, d$) time.

Our certificate is simple: the gradients at $\theta^*$ w.r.t. each user. The following lemma (whose proof is similar to part of the proof of Theorem 3.6) relates this certificate to $\Delta^r$ (which in turn implies membership in $X^{R_1}_{\mathrm{stable}}$ for an appropriate $\Delta$). Below, $\nabla L(\theta; x_i) := \frac{1}{m}\sum_{j\in[m]}\nabla\ell(\theta; x_{i,j})$ denotes the gradient w.r.t. user $i$'s data.

Lemma C.1. Let $x$ be any dataset and let $\theta^* := \theta^*(x)$. Suppose that, for all $i \in [n]$, we have $\|\nabla L(\theta^*; x_i)\| \le \gamma$. Then, we have $\Delta^r_{\theta^*}(x) \le \Delta$ for $\Delta = O\big(\frac{r\gamma}{\mu n}\big)$, for all $r \le n/2$.

Proof. Consider any set $S \subseteq [n]$ such that $|S| \le r$. Let $s := |S|$ and $\theta^*_{-S} := \theta^*(x_{-S})$. Since $\nabla L(\theta^*; x) = 0$, we have

$$\|\nabla L(\theta^*; x_{-S})\| = \left\|\frac{1}{n-s}\textstyle\sum_{i\notin S}\nabla L(\theta^*; x_i)\right\| = \left\|\frac{1}{n-s}\textstyle\sum_{i\in S}\nabla L(\theta^*; x_i)\right\| \le \frac{s\gamma}{n-s} \le \frac{r\gamma}{n/2} = O(r\gamma/n).$$

Therefore, by convexity, we have

$$L(\theta^*; x_{-S}) - L(\theta^*_{-S}; x_{-S}) \le \langle\nabla L(\theta^*; x_{-S}),\,\theta^* - \theta^*_{-S}\rangle \le O(r\gamma/n)\,\|\theta^* - \theta^*_{-S}\|.$$

On the other hand, since $\ell$ is $\mu$-strongly convex and $\theta^*_{-S}$ is the minimizer of $L(\cdot; x_{-S})$, we can conclude that

$$L(\theta^*; x_{-S}) - L(\theta^*_{-S}; x_{-S}) \ge \frac{\mu}{2}\|\theta^* - \theta^*_{-S}\|^2.$$

Comparing the two bounds above, we get $\|\theta^* - \theta^*_{-S}\| \le O\big(\frac{r\gamma}{\mu n}\big)$. Finally, for any $|S| \le r$ and $i \notin S$, the triangle inequality gives $\|\theta^*(x_{-S}) - \theta^*(x_{-S\cup\{i\}})\| \le \|\theta^*(x_{-S}) - \theta^*\| + \|\theta^* - \theta^*(x_{-S\cup\{i\}})\| \le O\big(\frac{r\gamma}{\mu n}\big)$, which is exactly the claimed bound on $\Delta^r_{\theta^*}(x)$. $\square$

Recall also from the proof of Theorem 3.6 that w.h.p. we have $\max_{i\in[n]}\|\nabla L(\theta^*; x_i)\| \le \tilde O(G/\sqrt m)$. When this holds, computing $\|\frac{1}{m}\sum_{j\in[m]}\nabla\ell(\theta^*; x_{i,j})\|$ for all $i \in [n]$ yields, via the above lemma, a certificate that $x \in X^{R_1}_{\mathrm{stable}}$ when we set $\Delta = O\big(\frac{\kappa\gamma}{\mu n}\big) = \tilde O\big(\frac{G\log(1/\delta)}{\varepsilon\mu n\sqrt m}\big)$. Plugging this into Theorem 3.3, we arrive at a statement similar to Theorem 3.1 but with

$$\sigma = O\left(\frac{G\sqrt{\log n\cdot\log(1/\delta)/\varepsilon + \log(1/\beta)}}{\mu n\sqrt m}\cdot\frac{(\log(1/\delta))^{2.5}}{\varepsilon^3}\right),$$

i.e., with an extra factor of $O(\log(1/\delta)/\varepsilon)$. On the other hand, from the discussion about the certificate, we have that this algorithm runs in polynomial time with high probability (whenever $\max_i\|\nabla L(\theta^*; x_i)\| \le \tilde O(G/\sqrt m)$).
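The certificate of Lemma C.1, in code form (our own sketch; `grad_user` is assumed to return $\frac{1}{m}\sum_j \nabla\ell(\theta; x_{i,j})$ for one user): if every per-user gradient at $\theta^*$ is small, the subset enumeration in DelOutputPert can be skipped entirely.

```python
import numpy as np

def certified_fast_path(theta_star, x, grad_user, gamma, sigma, rng):
    """If all per-user gradient norms at theta* are <= gamma, Lemma C.1
    certifies membership in X^{R1}_stable (for the inflated Delta), so
    output perturbation is safe without enumerating any subsets."""
    if all(np.linalg.norm(grad_user(theta_star, xi)) <= gamma for xi in x):
        return theta_star + rng.normal(0, sigma, size=theta_star.shape)
    return None  # fall back to the full (exponential-time) DelOutputPert check
```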
D. On Lower Bounds for User-Level DP-ERM and DP-SCO

This section discusses lower bounds for user-level DP-SCO and DP-ERM. We start by noting that Levy et al. (2021) already proved a lower bound of $\Omega\big(RG\big(\frac{1}{\sqrt{nm}} + \frac{\sqrt d}{\varepsilon n\sqrt m}\big)\big)$ for DP-SCO in the convex case, assuming $n \ge \Omega(\sqrt d/\varepsilon)$. It can be easily seen that this also implies a lower bound of $\Omega\big(RG\frac{\sqrt d}{\varepsilon n\sqrt m}\big)$ for DP-ERM for $\Omega(\sqrt d/\varepsilon) \le n \le O(d/\varepsilon^2)$ (see, e.g., the proof of Theorem D.8 below). In the remainder of this section, we extend their techniques to show the lower bounds for strongly convex losses.

D.1. Preliminaries

Throughout, we will consider the loss $\ell^\zeta_{\mathrm{sq}}(\theta; x) := \zeta\|\theta - x\|^2$, where $\zeta > 0$ is a parameter. We list here a few results that will be useful throughout. We start by defining the ($\ell_2$-)truncated version of the Gaussian distribution as follows.

Definition D.1. Let $N^{\mathrm{tr}}(\chi, \Sigma; B)$ denote the distribution of a r.v. $Z$ drawn as follows. First, draw $Z' \sim N(\chi, \Sigma)$. Then, let $Z = Z'\cdot\mathbf{1}[\|Z'\| \le B]$. We use $\chi^{\mathrm{tr}}(\chi, \Sigma; B)$ to denote the mean of the distribution $N^{\mathrm{tr}}(\chi, \Sigma; B)$.

As shown in Levy et al. (2021), the means of the truncated Gaussian distribution and the standard (non-truncated) version are very close:

Lemma D.2 (Levy et al. 2021). For any $\chi \in \mathbb{R}^d$, $d \in \mathbb{N}$, $\sigma > 0$, if $\|\chi\|_2 + 100\sqrt d\,\sigma < B$, then $\|\chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B) - \chi\|_2 \le O((\sigma + \|\chi\|_2)\,e^{-10d})$.

Since the version of the lemma in Levy et al. (2021) is slightly different from the one we use here (more precisely, Levy et al. (2021) use truncation in a coordinate-by-coordinate manner, i.e., by the $\ell_\infty$-norm, which results in an extra polylogarithmic factor), we give a proof sketch of it below.

Proof Sketch of Lemma D.2. Due to spherical symmetry, we may assume w.l.o.g. that $\chi_2 = \dots = \chi_d = 0$ and $\chi_1 \ge 0$. Again, due to symmetry, we have $\chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)_2 = \dots = \chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)_d = 0$, and thus $\|\chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B) - \chi\|_2 = |\chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)_1 - \chi_1|$. To bound this term, observe further that we may view $Z_1$ as being generated as follows:
Plugging the previous three bounds together, we have |χtr(χ, σ2Id; B)1 χ1| O((σ + χ1) e 10d). More importantly, Levy et al. (2021) make the following crucial observation, which allows us to reduce any user-level DP algorithm for Gaussian distribution back to an item-level DP algorithm, albeit with the variance that is m times smaller. Lemma D.3 (User-to-Item Level Reduction, Levy et al. 2021). Let Auser be any user-level (ε, δ)-DP algorithm. Then, there exists an item-level (ε, δ)-DP algorithm Aitem such that, for any Gaussian distribution D = N(χ, σ2Id), Auser(Dn m) has exactly the same distribution as Aitem( Dn) where D = N χ, σ2 Finally, we will use the following fingerprinting lemma for Gaussians result due to Kamath et al. (2019), which gives a lower bound for any DP algorithm for estimating the mean of a spherical Gaussian. Theorem D.4 (Kamath et al. 2019). For any ψ (0, 1), σ > 0, n, d N and ε (0, 1], δ (0, 1/2] such that δ d), if there exists an item-level (ε, δ)-DP mechanism M such that, for any Gaussian distribution D = N(χ, σ2Id) where χ is unknown with ψσ χ ψσ it satisfies E ˆχ M(Dn)[ ˆχ χ 2] α2 dσ2ψ2 then we must have n dσ 24αε. On User-Level Private Convex Optimization Combining Lemma D.3 and Theorem D.4, we immediately arrive at the following lower bound for the user-level DP setting. Lemma D.5. For any ψ (0, 1), σ > 0, m, n, d N and ε (0, 1], δ (0, 1/2] such that δ there exists a user-level (ε, δ)-DP mechanism M such that, for any Gaussian distribution D = N(χ, σ2Id) where χ is unknown with ψσ m χ ψσ m it satisfies E ˆχ M(Dn m)[ ˆχ χ 2] α2 dσ2ψ2 then we must have n dσ 24αε m. Furthermore, combining the above with Lemma D.2, we arrive at the following lower bound where the only change is from Gaussian distributions to truncated Gaussian distributions. Lemma D.6. For any ψ (Ω(e d), 1), B, σ > 0, m, n, d N and ε (0, 1], δ (0, 1/2] such that δ d) and B > ψσ dσ, if there exists a user-level (ε, δ)-DP mechanism M such that, for any truncated Gaussian distribution D = N tr(χ, σ2Id; B) where χ is unknown with ψσ m χ ψσ m it satisfies E ˆχ M(Dn m)[ ˆχ χtr(χ, σ2Id; B) 2] α2 dσ2ψ2 then we must have n dσ 50αε m. D.2. Lower Bounds for Strongly Convex Losses D.2.1. DP-SCO We can now prove the lower bound for DP-SCO in the strongly convex case in a relatively straightforward manner, as optimizing for the loss ℓsq is equivalent to mean estimation with ℓ2 2-error. Theorem D.7. For any ε (0, 1], δ (0, 1/2] and any sufficiently large d, n, m N such that n n log n, there exists a µ-strongly convex G-Lipschitz loss function ℓsuch that for any (ε, δ)-DP algorithm, we h L(bθ; D) i L(θ ; D) nm + d ε2n2m We note that the condition n d/ε may be unnecessary. However, a slightly weaker condition n m Ω( d/ε) is necessary because outputting, e.g., the origin already achieves an error of G2/µ. Therefore, the second term G2 µ d ε2n2m cannot be present in this case. Proof of Theorem D.7. The first term of Ω G2 µ 1 nm is simply the statistical excess risk bound that holds even without any privacy considerations (Agarwal et al., 2012). We will only focus on the second term here. Consider ℓ= ℓζ sq for ζ = µ/2 and the parameter space K = Bd(0, G/µ). The loss is µ-strongly convex and is G-Lipschitz in K. Set the parameters as follows: B = G µ , σ = B 1000 d, ψ = 1. Let α = dσ 100εn m; note that when n d/ε, we also have α2 dσ2ψ2 12m as desired. Thus, we may apply Lemma D.6 with these parameters. 
Finally, we will use the following fingerprinting result for Gaussians due to Kamath et al. (2019), which gives a lower bound for any item-level DP algorithm estimating the mean of a spherical Gaussian.

Theorem D.4 (Kamath et al. 2019). For any $\psi \in (0, 1)$, $\sigma > 0$, $n, d \in \mathbb{N}$ and $\varepsilon \in (0, 1]$, $\delta \in (0, 1/2]$ such that $\delta \le O(\sqrt d/(n\log d))$: if there exists an item-level $(\varepsilon, \delta)$-DP mechanism $M$ such that, for any Gaussian distribution $D = N(\chi, \sigma^2 I_d)$ where $\chi$ is unknown with $-\psi\sigma \le \chi_i \le \psi\sigma$ for all $i \in [d]$, it satisfies

$$\mathbb{E}_{\hat\chi\sim M(D^n)}[\|\hat\chi - \chi\|^2] \le \alpha^2 \le \frac{d\sigma^2\psi^2}{12},$$

then we must have $n \ge \frac{d\sigma}{24\alpha\varepsilon}$.

Combining Lemma D.3 and Theorem D.4, we immediately arrive at the following lower bound for the user-level DP setting.

Lemma D.5. For any $\psi \in (0, 1)$, $\sigma > 0$, $m, n, d \in \mathbb{N}$ and $\varepsilon \in (0, 1]$, $\delta \in (0, 1/2]$ such that $\delta \le O(\sqrt d/(n\log d))$: if there exists a user-level $(\varepsilon, \delta)$-DP mechanism $M$ such that, for any Gaussian distribution $D = N(\chi, \sigma^2 I_d)$ where $\chi$ is unknown with $-\frac{\psi\sigma}{\sqrt m} \le \chi_i \le \frac{\psi\sigma}{\sqrt m}$ for all $i \in [d]$, it satisfies

$$\mathbb{E}_{\hat\chi\sim M(D^{n\times m})}[\|\hat\chi - \chi\|^2] \le \alpha^2 \le \frac{d\sigma^2\psi^2}{12m},$$

then we must have $n \ge \frac{d\sigma}{24\alpha\varepsilon\sqrt m}$.

Furthermore, combining the above with Lemma D.2, we arrive at the following lower bound, where the only change is from Gaussian distributions to truncated Gaussian distributions.

Lemma D.6. For any $\psi \in (\Omega(e^{-d}), 1)$, $B, \sigma > 0$, $m, n, d \in \mathbb{N}$ and $\varepsilon \in (0, 1]$, $\delta \in (0, 1/2]$ such that $\delta \le O(\sqrt d/(n\log d))$ and $B > \psi\sigma\sqrt{d/m} + 100\sqrt d\,\sigma$: if there exists a user-level $(\varepsilon, \delta)$-DP mechanism $M$ such that, for any truncated Gaussian distribution $D = N^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)$ where $\chi$ is unknown with $-\frac{\psi\sigma}{\sqrt m} \le \chi_i \le \frac{\psi\sigma}{\sqrt m}$ for all $i \in [d]$, it satisfies

$$\mathbb{E}_{\hat\chi\sim M(D^{n\times m})}[\|\hat\chi - \chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)\|^2] \le \alpha^2 \le \frac{d\sigma^2\psi^2}{24m},$$

then we must have $n \ge \frac{d\sigma}{50\alpha\varepsilon\sqrt m}$.

D.2. Lower Bounds for Strongly Convex Losses

D.2.1. DP-SCO

We can now prove the lower bound for DP-SCO in the strongly convex case in a relatively straightforward manner, as optimizing the loss $\ell_{\mathrm{sq}}$ is equivalent to mean estimation with $\ell_2^2$-error.

Theorem D.7. For any $\varepsilon \in (0, 1]$, $\delta \in (0, 1/2]$ and any sufficiently large $d, n, m \in \mathbb{N}$ such that $n \ge \sqrt d/\varepsilon$ and $\delta \le O(\sqrt d/(n\log n))$, there exists a $\mu$-strongly convex, $G$-Lipschitz loss function $\ell$ and a distribution $D$ such that, for any $(\varepsilon, \delta)$-DP algorithm $M$, we have

$$\mathbb{E}_{x\sim D^{n\times m},\,\hat\theta\sim M(x)}\big[L(\hat\theta; D)\big] - L(\theta^*; D) \ge \Omega\left(\frac{G^2}{\mu}\left(\frac{1}{nm} + \frac{d}{\varepsilon^2 n^2 m}\right)\right).$$

We note that the condition $n \ge \sqrt d/\varepsilon$ may be unnecessary. However, a slightly weaker condition $n\sqrt m \ge \Omega(\sqrt d/\varepsilon)$ is necessary, because outputting, e.g., the origin already achieves an error of $G^2/\mu$; therefore, the second term $\frac{G^2}{\mu}\cdot\frac{d}{\varepsilon^2 n^2 m}$ cannot be present when $n\sqrt m$ is smaller than this.

Proof of Theorem D.7. The first term $\Omega\big(\frac{G^2}{\mu}\cdot\frac{1}{nm}\big)$ is simply the statistical excess risk bound that holds even without any privacy considerations (Agarwal et al., 2012). We will only focus on the second term here.

Consider $\ell = \ell^\zeta_{\mathrm{sq}}$ for $\zeta = \mu/2$ and the parameter space $K = B_d(0, G/(2\mu))$; the loss is $\mu$-strongly convex and $G$-Lipschitz in $K$. Set the parameters as follows: $B = G/(2\mu)$, $\sigma = \frac{B}{1000\sqrt d}$, $\psi = 1$, and $\alpha = \frac{d\sigma}{100\varepsilon n\sqrt m}$; note that when $n \ge \sqrt d/\varepsilon$, we also have $\alpha^2 \le \frac{d\sigma^2\psi^2}{24m}$, as desired. Thus, we may apply Lemma D.6 with these parameters. This implies that, for any user-level $(\varepsilon, \delta)$-DP mechanism $M$, there must be some truncated Gaussian distribution $D = N^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)$ such that

$$\mathbb{E}_{\hat\chi\sim M(D^{n\times m})}[\|\hat\chi - \chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)\|^2] \ge \Omega(\alpha^2) = \Omega\left(\frac{G^2}{\mu^2}\cdot\frac{d}{\varepsilon^2 n^2 m}\right).$$

Moreover, since $\theta^*(D) = \chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)$ and $L(\theta; D) - L(\theta^*(D); D) = \frac{\mu}{2}\|\theta - \chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)\|^2$, the excess (population) risk can be expanded as

$$\mathbb{E}_{\hat\theta\sim M(D^{n\times m})}\big[L(\hat\theta; D)\big] - L(\theta^*; D) = \frac{\mu}{2}\,\mathbb{E}_{\hat\theta\sim M(D^{n\times m})}\big[\|\hat\theta - \chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)\|^2\big] \ge \Omega\left(\frac{G^2}{\mu}\cdot\frac{d}{\varepsilon^2 n^2 m}\right). \qquad\square$$

D.2.2. DP-ERM

The proof for DP-ERM is similar to the above, except that we now have to account for the error between the population mean and the empirical mean. We enforce the parameters in such a way that this error is dominated by the lower bound given by Theorem D.7.

Theorem D.8. There exists a sufficiently small constant $c > 0$ such that the following holds. For any $\varepsilon \in (0, 1]$, $\delta \in (0, 1/2]$ and any sufficiently large $d, n, m \in \mathbb{N}$ such that $\sqrt d/\varepsilon \le n \le cd/\varepsilon^2$ and $\delta \le O(\sqrt d/(n\log n))$, there exists a $\mu$-strongly convex, $G$-Lipschitz loss function $\ell$ and a distribution $D$ such that, for any $(\varepsilon, \delta)$-DP algorithm $M$, we have

$$\mathbb{E}_{x\sim D^{n\times m},\,\hat\theta\sim M(x)}\big[L(\hat\theta; x) - L(\theta^*; x)\big] \ge \Omega\left(\frac{G^2}{\mu}\cdot\frac{d}{\varepsilon^2 n^2 m}\right).$$

In addition to the assumption $n \ge \sqrt d/\varepsilon$ as in Theorem D.7, this theorem also requires the assumption $n \le O(d/\varepsilon^2)$. This assumption is required for the error between the empirical mean and the population mean to be small enough to be dominated by the error term $\Omega\big(\frac{G^2}{\mu}\cdot\frac{d}{\varepsilon^2 n^2 m}\big)$.

Proof of Theorem D.8. Let $\ell, B, \sigma, \psi, \alpha$ be exactly as in the setting of Theorem D.7. Similarly, there must exist some truncated Gaussian distribution $D = N^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)$ such that, for any $(\varepsilon, \delta)$-DP algorithm $M$, we have

$$\mathbb{E}_{\hat\chi\sim M(x),\,x\sim D^{n\times m}}[\|\hat\chi - \chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)\|^2] \ge \Omega(\alpha^2) = \Omega\left(\frac{G^2}{\mu^2}\cdot\frac{d}{\varepsilon^2 n^2 m}\right).$$

Let $\hat\chi(x)$ denote the empirical mean of the dataset $x$. The left-hand side can be further expanded as

$$\begin{aligned} \mathbb{E}_{\hat\chi\sim M(x),\,x\sim D^{n\times m}}[\|\hat\chi - \chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)\|^2] &\le 2\,\mathbb{E}_{\hat\chi\sim M(x),\,x\sim D^{n\times m}}[\|\hat\chi - \hat\chi(x)\|^2] + 2\,\mathbb{E}_{x\sim D^{n\times m}}[\|\hat\chi(x) - \chi^{\mathrm{tr}}(\chi, \sigma^2 I_d; B)\|^2] \\ &= 2\,\mathbb{E}_{\hat\chi\sim M(x),\,x\sim D^{n\times m}}[\|\hat\chi - \hat\chi(x)\|^2] + O\left(\frac{B^2}{nm}\right) \\ &= 2\,\mathbb{E}_{\hat\chi\sim M(x),\,x\sim D^{n\times m}}[\|\hat\chi - \hat\chi(x)\|^2] + O\left(\frac{G^2}{\mu^2 nm}\right). \end{aligned}$$

Since we assume that $n \le cd/\varepsilon^2$, we have $\frac{1}{nm} \le c\cdot\frac{d}{\varepsilon^2 n^2 m}$. Therefore, when $c$ is sufficiently small, we can combine the previous two inequalities to conclude that

$$\mathbb{E}_{\hat\chi\sim M(x),\,x\sim D^{n\times m}}[\|\hat\chi - \hat\chi(x)\|^2] \ge \Omega\left(\frac{G^2}{\mu^2}\cdot\frac{d}{\varepsilon^2 n^2 m}\right).$$

Finally, since $\theta^*(x) = \hat\chi(x)$, the excess (empirical) risk can be expanded as

$$\mathbb{E}_{\hat\theta\sim M(x),\,x\sim D^{n\times m}}\big[L(\hat\theta; x) - L(\theta^*; x)\big] = \frac{\mu}{2}\,\mathbb{E}_{\hat\theta\sim M(x),\,x\sim D^{n\times m}}\big[\|\hat\theta - \hat\chi(x)\|^2\big] \ge \Omega\left(\frac{G^2}{\mu}\cdot\frac{d}{\varepsilon^2 n^2 m}\right), \tag{11}$$

completing the proof. $\square$