# Stochastic Bias-Reduced Gradient Methods

Hilal Asi · Yair Carmon · Arun Jambulapati · Yujia Jin · Aaron Sidford

We develop a new primitive for stochastic optimization: a low-bias, low-cost estimator of the minimizer $x_\star$ of any Lipschitz strongly-convex function. In particular, we use a multilevel Monte-Carlo approach due to Blanchet and Glynn [8] to turn any optimal stochastic gradient method into an estimator of $x_\star$ with bias $\delta$, variance $O(\log(1/\delta))$, and an expected sampling cost of $O(\log(1/\delta))$ stochastic gradient evaluations. As an immediate consequence, we obtain cheap and nearly unbiased gradient estimators for the Moreau-Yoshida envelope of any Lipschitz convex function, allowing us to perform dimension-free randomized smoothing. We demonstrate the potential of our estimator through four applications. First, we develop a method for minimizing the maximum of $N$ functions, improving on recent results and matching a lower bound up to logarithmic factors. Second and third, we recover state-of-the-art rates for projection-efficient and gradient-efficient optimization using simple algorithms with a transparent analysis. Finally, we show that an improved version of our estimator would yield a nearly linear-time, optimal-utility, differentially-private non-smooth stochastic optimization method.

1 Introduction

Consider the fundamental problem of minimizing a $\mu$-strongly convex function $F : \mathcal{X} \to \mathbb{R}$ given access to a stochastic (sub-)gradient estimator $\hat\nabla F$ satisfying $\mathbb{E}\,\hat\nabla F(x) \in \partial F(x)$ and $\mathbb{E}\|\hat\nabla F(x)\|^2 \le G^2$ for every $x \in \mathcal{X}$. Is it possible to transform the unbiased estimator $\hat\nabla F$ into a (nearly) unbiased estimator of the minimizer $x_\star := \operatorname{argmin}_{x\in\mathcal{X}} F(x)$? In particular, can we improve upon the $O(G/(\mu\sqrt{T}))$ bias achieved by $T$ iterations of stochastic gradient descent (SGD)?

In this paper, we answer this question in the affirmative, proposing an optimum estimator $\hat x_\star$ which (for any fixed $\delta > 0$) has bias $\|\mathbb{E}\,\hat x_\star - x_\star\| = O(\delta)$, variance $\mathbb{E}\|\hat x_\star - \mathbb{E}\,\hat x_\star\|^2 = O\big(\tfrac{G^2}{\mu^2}\log\tfrac{G}{\mu\delta}\big)$ and, in expectation, costs $O(\log\tfrac{G}{\mu\delta})$ evaluations of $\hat\nabla F$.³ Setting $\delta = G/(\mu\sqrt{T})$, we obtain the same bias bound as $T$ iterations of SGD, but with an expected cost of only $O(\log T)$ stochastic gradient evaluations (the worst-case cost is $T$). Further, the bias can be made arbitrarily small with only a logarithmic increase in the variance and the stochastic gradient evaluations of our estimator; paralleling the term "nearly linear-time" [27], we therefore call $\hat x_\star$ nearly unbiased.

Our estimator is an instance of the multilevel Monte Carlo technique for de-biasing estimator sequences [25], and more specifically the method of Blanchet and Glynn [8]. Our key observation is that this method is readily applicable to strongly-convex variants of SGD, or indeed any stochastic optimization method with the same (optimal) rate of convergence.

Stanford University, {asi,jmblpati,yujiajin,sidford}@stanford.edu; Tel Aviv University, ycarmon@tauex.tau.ac.il

³When $\mathcal{X} = B_R(x_0) \subseteq \mathbb{R}^d$, $F(x) = \frac{1}{n}\sum_{i\in[n]} \hat F(x; i)$, and $\hat\nabla F$ is the subgradient of a uniformly random $\hat F(x; i)$, we can also get an estimator with bias 0 and expected cost $O(\log(nd))$. See Appendix A.1 for details.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

| Objective | Expensive operation $\mathcal{O}$ | $N_{\mathcal{O}}$ | $\mathbb{E}N_{\hat\nabla f}$ |
| --- | --- | --- | --- |
| $\max_{i\in[N]} f^{(i)}(x)$ (Sec. 4) | $f^{(1)}(x), \ldots, f^{(N)}(x)$ | $\widetilde{O}(\epsilon^{-2/3})$ | $\widetilde{O}(\epsilon^{-2})$ |
| $f(x)$ in domain $\mathcal{X}$ (Sec. 3) | $\mathrm{Proj}_{\mathcal{X}}(x)$ | $O(\epsilon^{-1})$ | $\widetilde{O}(\epsilon^{-2})$ |
| $\phi(x) + f(x)$ for $L$-smooth $\phi$ (Sec. 5) | $\nabla\phi(x)$ | $O(\sqrt{L/\epsilon})$ | $\widetilde{O}(\epsilon^{-2})$ |

Table 1. Summary of our applications of accelerated bias-reduced stochastic gradient methods.
We use $N_{\mathcal{O}}$ and $N_{\hat\nabla f}$ to denote the number of expensive operations and subgradient estimations, respectively. The $\widetilde{O}$ notation hides polylogarithmic factors. See Section 1.2 for additional description.

1.1 Estimating proximal points and Moreau-Yoshida envelope gradients

Given a convex function $f$ and regularization level $\lambda$, the proximal point of $y$ is
$$P_{f,\lambda}(y) := \operatorname{argmin}_{x\in\mathbb{R}^d}\Big\{ f(x) + \frac{\lambda}{2}\|x - y\|^2 \Big\}.$$
Since computing $P_{f,\lambda}$ amounts to solving a $\lambda$-strongly-convex problem, our technique provides low-bias and cheap proximal point estimators. Proximal points are ubiquitous in optimization [43, 19, 52, 38], and estimating them efficiently with low bias opens up new algorithmic possibilities.

One of these possibilities is estimating the gradient of the Moreau-Yoshida envelope
$$f_\lambda(y) := \min_{x\in\mathbb{R}^d}\Big\{ f(x) + \frac{\lambda}{2}\|x - y\|^2 \Big\},$$
which is a $\lambda$-smooth, $G^2/(2\lambda)$-accurate approximation of any $G$-Lipschitz $f$ (see, e.g., [43, 29] and Appendix B.3). Since $\nabla f_\lambda(y) = \lambda(y - P_{f,\lambda}(y))$, our optimum estimator provides a low-bias estimator for $\nabla f_\lambda(y)$ with second moment and expected cost greater than those of $\hat\nabla f$ by only a logarithmic factor. Thus, for any non-smooth $f$ we can turn $\hat\nabla f$ into a gradient estimator for the smooth surrogate $f_\lambda$, whose smoothness is independent of the problem dimension, allowing us to perform dimension-free randomized smoothing [20].
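As a concrete one-dimensional illustration (ours, not from the paper): for $f(x) = |x|$ the proximal point is soft-thresholding and the Moreau envelope is the Huber function, so the identity $\nabla f_\lambda(y) = \lambda(y - P_{f,\lambda}(y))$ can be checked directly:

```python
import numpy as np

def prox_abs(y, lam):
    # Proximal point of f(x) = |x|: soft-thresholding at level 1/lam.
    return np.sign(y) * np.maximum(np.abs(y) - 1.0 / lam, 0.0)

def moreau_abs(y, lam):
    # Moreau envelope f_lam(y) = min_x |x| + (lam/2)(x - y)^2 (the Huber function).
    x = prox_abs(y, lam)
    return np.abs(x) + 0.5 * lam * (x - y) ** 2

lam = 2.0
for y in [-1.5, -0.1, 0.3, 2.0]:
    grad = lam * (y - prox_abs(y, lam))  # nabla f_lam(y) = lam * (y - P_{f,lam}(y))
    fd = (moreau_abs(y + 1e-6, lam) - moreau_abs(y - 1e-6, lam)) / 2e-6
    assert abs(grad - fd) < 1e-4         # matches a finite-difference check
```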
1.2 Applications via accelerated bias-reduced methods

Our optimum estimator is a new primitive in stochastic convex optimization, and we expect it to find multiple applications. We now describe three such applications: the first improves on previously known complexity bounds, while the latter two recover existing bounds straightforwardly. For simplicity of presentation, we assume (in the introduction only) $\mathbb{E}\|\hat\nabla f\|^2 \le 1$ and unit domain size. In each application, we wish to minimize an objective function given access to a cheap subgradient estimator $\hat\nabla f$ as well as an expensive application-specific operation $\mathcal{O}$ (e.g., a projection to a complicated set). Direct use of the standard stochastic gradient method finds an $\epsilon$-accurate solution using $O(\epsilon^{-2})$ computations of both $\hat\nabla f$ and $\mathcal{O}$, and our goal is to improve the $\mathcal{O}$ complexity without hurting the $\hat\nabla f$ complexity. To that end, we design stochastic accelerated methods consisting of $T$ iterations, each one involving only a constant number of $\mathcal{O}$ and proximal point computations, which we approximate by averaging copies of our optimum estimator.⁴ Its low bias allows us to bound $T$ as though our proximal points were exact, while maintaining an $\widetilde{O}(\epsilon^{-2})$ bound on the total expected number of $\hat\nabla f$ calls.⁵ Thus, we save expensive operations without substantially increasing the gradient estimation cost. Table 1 summarizes each application, and we briefly describe them below.

⁴While averaging is parallelizable, our optimum estimator itself is sequential. Consequently, our approach does not yield improved parallelism; see Appendix A.2 for further discussion.
⁵It is easy to turn expected complexity bounds into deterministic ones; see Appendix A.3.

Minimizing the maximal loss (Section 4). Given $N$ convex, 1-Lipschitz functions $f^{(1)}, \ldots, f^{(N)}$ we would like to find an $\epsilon$-approximate minimizer of their maximum $f_{\max}(x) = \max_{i\in[N]} f^{(i)}(x)$. This problem naturally arises when optimizing worst-case behavior, as in maximum margin classification and robust optimization [53, 15, 45, 6]. We measure complexity by the number of individual function and subgradient evaluations, so that the expensive operation of evaluating $f^{(1)}, \ldots, f^{(N)}$ at a single point has complexity $O(N)$ and the subgradient method solves this problem with complexity $O(N\epsilon^{-2})$. Carmon et al. [13] develop an algorithm for minimizing $f_{\max}$ with complexity $\widetilde{O}(N\epsilon^{-2/3} + \epsilon^{-8/3})$, improving on the subgradient method for sufficiently large $N$. Using our bias-reduced Moreau envelope gradient estimator in a Monteiro-Svaiter-type accelerated proximal point method [12, 11, 38], we obtain the improved complexity $\widetilde{O}(N\epsilon^{-2/3} + \epsilon^{-2})$. This matches (up to logarithmic factors) a lower bound shown in [13], settling the complexity of minimizing the maximum of $N$ non-smooth functions. Our result reveals a surprising fact: for $N \lesssim (GR/\epsilon)^{4/3}$, minimizing the maximum of $N$ functions is no harder than minimizing their average.

Projection-efficient optimization via dimension-free randomized smoothing (Section 3). Consider the problem of minimizing a convex function $f$, using an unbiased gradient estimator $\hat\nabla f$, over a convex set $\mathcal{X}$ for which Euclidean projections are expensive to compute (for example, the cone of PSD matrices). When $f$ is $L$-smooth, a stochastic version of Nesterov's accelerated gradient descent (AGD) [16] performs only $O(\sqrt{L/\epsilon})$ projections. For non-smooth $f$ we instead apply AGD to the Moreau envelope smoothing of $f$ (with appropriate $\lambda = O(\epsilon^{-1})$) using our nearly-unbiased stochastic estimator for $\nabla f_\lambda$. This yields a solution in $O(\epsilon^{-1})$ projections and $\widetilde{O}(\epsilon^{-2})$ evaluations of $\hat\nabla f$. Our algorithm provides a simple alternative to the recent work of Thekumparampil et al. [51], whose performance guarantees are identical up to a logarithmic factor.

Gradient-efficient composite optimization (Section 5). We would like to minimize $\Phi(x) = \phi(x) + f(x)$, where $\phi$ is convex and $L$-smooth but we can access it only via computing (expensive) exact gradients, while $f$ is a non-smooth convex function for which we have a (cheap) unbiased subgradient estimator $\hat\nabla f$. Problems of this type include inverse problems with sparsity constraints and regularized loss minimization in machine learning [34]. To save $\nabla\phi$ computations, it is possible to use composite AGD [41], which solves $O(\sqrt{L/\epsilon})$ subproblems of the form $\operatorname{minimize}_x\ \langle\nabla\phi(y), x\rangle + f(x) + \frac{\beta}{2}\|x - y\|^2$. Lan [34] designed a specialized method, gradient sliding, for which the total subproblem solution cost is $O(\epsilon^{-2})$ evaluations of $\hat\nabla f$. We show that a simple alternative, estimating the subproblem solutions via our low-bias optimum estimator, recovers these guarantees up to logarithmic factors.

1.3 Non-smooth differentially private stochastic convex optimization

We now discuss a potential application of our technique that is conditional on the existence of an improved optimum estimator. In it, we minimize the population objective function $f(x) = \mathbb{E}_{S\sim P}\hat f(x; S)$ under the well-known constraint of differential privacy [22]. Given $n$ i.i.d. samples $S_i \sim P$ and assuming that each $\hat f$ is 1-Lipschitz, convex and sufficiently smooth, Feldman et al. [23] develop algorithms that obtain the optimal error and compute $O(n)$ subgradients of $\hat f$. The non-smooth case is more challenging, and the best existing bound is $O(n^{11/8})$ for the high-dimensional setting $d = n$ [32, 3]. In Section 6 we show that our optimum estimator, combined with recent localization techniques [23], reduces the problem to private mean estimation. Unfortunately, our estimator is heavy-tailed, leading to insufficient utility. Nevertheless, assuming a version of our estimator that has bounded outputs, we give an algorithm that queries $\widetilde{O}(n)$ subgradients for non-smooth functions, solving a longstanding open problem in private optimization [14, 4].
This motivates the study of improved versions of our estimator that have constant sensitivity.

1.4 Related work

Multilevel Monte-Carlo (MLMC) techniques originate from the literature on parametric integration for solving integral and differential equations [25]. Our approach is based on an MLMC variant put forth by Blanchet and Glynn [8] for estimating functionals of expectations. Among several applications, they propose [8, Section 5.2] an estimator for $\operatorname{argmin}_x \mathbb{E}_{S\sim P}\hat f(x; S)$, where $\hat f(\cdot; s)$ is convex for all $s$, assuming access to minimizers of empirical objectives of the form $\sum_{i\in[N]}\hat f(x; s_i)$. The authors provide a preliminary analysis of the estimator's variance (later elaborated in [9]) using an asymptotic Taylor expansion around the population minimizer. In comparison, we study the more general setting of stochastic gradient estimators and provide a complete algorithm based on SGD, along with a non-asymptotic analysis and concrete settings where our estimator is beneficial.

A number of works have used the Blanchet-Glynn estimator in the context of optimization and machine learning. These applications include estimating the ratio of expectations for semi-supervised learning [7], estimating gradients of distributionally robust optimization objectives [35], and estimating gradients in deep latent variable models [47]. Our estimator is similar to that of Levy et al. [35] in that we also have to pick a critical doubling probability for the (random) computational budget, which makes the expected cost and variance of our estimators depend logarithmically on the bias.

1.5 Limitations

Our paper demonstrates that our proposed optimum estimator is a useful proof device: it allows us to easily prove upper bounds on the complexity of structured optimization problems, and at least in one case (minimizing the maximum loss) to improve over previously known bounds. However, our work does not investigate the practicality of our optimum estimator, as implementation and experiments are outside its scope. Nevertheless, let us briefly discuss the practical prospects of the algorithms we propose. On the one hand, our optimum estimator itself is fairly easy to implement, adding only a few parameters on top of a basic gradient method. On the other hand, in the settings of Sections 3 and 5, gradient-sliding based methods [34, 51] are roughly as simple to implement and enjoy slightly stronger convergence bounds (better by logarithmic factors) than ours. Consequently, in these settings we have no reason to assume that our algorithms are better in practice.

In the setting of Section 4 (minimizing the maximum loss) our algorithm does enjoy a stronger guarantee than the previous best bound [13]. However, both our algorithm and [13] are based on an accelerated proximal point method that, in its current form, is not practical [13, Sec. 6.2]. Thus, evaluating the benefit of stochastic bias reduction in the context of minimizing the maximum loss would require us to first develop a practical accelerated proximal point algorithm, which is an open question under active research [see, e.g., 50].

Another limitation of our optimum estimator is that, while it has a bounded second moment, its higher moments are unbounded. While this does not matter for most of our results, the lack of higher moment bounds prevents us from settling the complexity of non-smooth private stochastic convex optimization in Section 6.
Finding an optimum estimator that is bounded with high probability, or proving that one does not exist, remains an open question for future work.

Finally, our analyses are limited to convex objective functions. However, while outside the scope of the paper, we believe our results are possibly relevant for non-convex settings as well. In particular, for smooth non-convex functions (and weakly-convex functions [17] more broadly) the problem of computing proximal points with sufficiently high regularization is strongly convex, and our estimator applies. Such non-convex proximal points play an important role in non-convex optimization [17] with applications in deep learning [see, e.g., 49]. Applying the optimum-estimator technique in non-convex optimization is therefore a viable direction for future work.

1.6 Notation

We let $B_R(x) = \{y \in \mathbb{R}^d : \|y - x\| \le R\}$ denote the ball of radius $R$ around $x$, where $\|\cdot\|$ is the Euclidean norm throughout. We write $\mathrm{Proj}_S$ for the Euclidean projection to $S$. We write $\mathbb{1}\{A\}$ for the indicator of event $A$, i.e., $\mathbb{1}\{A\} = 1$ when $A$ holds and 0 otherwise. Throughout the paper, $\hat\nabla f$ denotes a (stochastic) subgradient estimator for the function $f$, and $\mathcal{X} \subseteq \mathbb{R}^d$ denotes the optimization domain, which we always assume is closed and convex. We use $P_{f,\lambda}$ to denote the proximal operator (2) and $f_\lambda$ to denote the Moreau envelope (3) associated with function $f$ and regularization parameter $\lambda$. Finally, we use $N_f$ and $N_{\hat\nabla f}$ to denote function and subgradient estimator evaluation complexity, respectively.

2 A multilevel Monte-Carlo optimum estimator

In this section, we construct a low-bias estimator for the minimizer of any strongly convex function $F : \mathcal{X} \to \mathbb{R}$. This estimator is the key component of our algorithms in the subsequent sections, which use it to approximate proximal points and Moreau envelope gradients. We assume that $F$ is of the form $F = f + \psi$, where the function $\psi$ is simple and $f$ satisfies the following.

Assumption 1. The function $f : \mathcal{X} \to \mathbb{R}$ is convex (with closed and convex domain $\mathcal{X}$) and is accessible via an unbiased subgradient estimator $\hat\nabla f$ which satisfies $\mathbb{E}\|\hat\nabla f(x)\|^2 \le G^2$ for all $x$.

Our applications only use $\psi$ of the form $\psi(x) = \frac{\lambda}{2}\|x - x_0\|^2$, but our estimator applies more broadly to cases where $\operatorname{argmin}_x\{\langle v, x\rangle + \psi(x) + \frac{1}{2}\|x - y\|^2\}$ is easy to compute for all $v$ and $y$.

2.1 ODC algorithms

Our estimator can use, in a black-box fashion, any method for minimizing $F$ with sufficiently fast convergence to $x_\star = \operatorname{argmin}_{x\in\mathcal{X}} F(x)$. We highlight the required convergence property as follows.

Definition 1. An optimal-distance-convergence (ODC) algorithm takes as input $\hat\nabla f$ satisfying Assumption 1, a simple function $\psi$ and a budget $T \ge 1$. If $F = f + \psi$ is $\mu$-strongly convex with minimizer $x_\star$, the algorithm's output $x = \mathrm{ODC}(\hat\nabla f, \psi, T)$ requires at most $T$ evaluations of $\hat\nabla f$ to compute and satisfies
$$\mathbb{E}\|x - x_\star\|^2 \le \frac{cG^2}{\mu^2 T} \quad \text{for some constant } c > 0.$$

Standard lower bound constructions imply that the $O\big(\frac{G^2}{\mu^2 T}\big)$ squared distance convergence rate is indeed optimal; see Appendix A.4 for additional discussion. Conversely, ODC algorithms are readily available in the literature [44, 28], since any point $x$ satisfying $\mathbb{E}F(x) - F(x_\star) = O\big(\frac{G^2}{\mu T}\big)$ (the optimal rate of convergence in strongly convex, Lipschitz optimization) also satisfies $\mathbb{E}\|x - x_\star\|^2 = O\big(\frac{G^2}{\mu^2 T}\big)$ by the strong convexity of $F$. We provide a concrete ODC algorithm consisting of a generalization of epoch SGD [28], which allows us to optimize over the composite objective $F = f + \psi$ instead of only $f$ as in the prior study of epoch SGD.

Lemma 1. EPOCHSGD (Algorithm 8 in Appendix B.1) is an ODC algorithm with constant $c = 32$.
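EPOCHSGD itself is deferred to Appendix B.1; the following is our own rough sketch of the epoch SGD pattern (doubling epoch lengths, halving step sizes, restarting from the epoch average), written for the composite objective $F = f + \psi$ with a generic prox step. The parameter schedule is illustrative, not Algorithm 8's exact one; `prox_step(x, g, eta)` should return $\operatorname{argmin}_z\{\langle g, z\rangle + \psi(z) + \frac{1}{2\eta}\|z - x\|^2\}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def epoch_sgd(sgrad_f, prox_step, x0, mu, T):
    """Rough sketch of epoch SGD for the mu-strongly convex F = f + psi.

    Runs epochs of doubling length with halved step sizes; each epoch
    restarts from the previous epoch's iterate average (cf. [28])."""
    x_bar, budget, epoch_len, eta = x0, T, 1, 1.0 / mu
    while budget > 0:
        steps = min(epoch_len, budget)
        x, x_sum = x_bar, np.zeros_like(x_bar)
        for _ in range(steps):
            x = prox_step(x, sgrad_f(x), eta)  # composite proximal-SGD step
            x_sum = x_sum + x
        x_bar = x_sum / steps
        budget -= steps
        epoch_len, eta = 2 * epoch_len, eta / 2
    return x_bar

# Example: f(x) = ||x||_1 with noisy subgradients and psi(x) = (lam/2)||x - y||^2,
# so x_star is the soft-thresholding of y (cf. the Huber example in Section 1.1).
lam, y = 2.0, np.array([1.5, -0.2, 0.7])
sgrad_f = lambda x: np.sign(x) + 0.1 * rng.standard_normal(x.shape)
def prox_step(x, g, eta):  # argmin_z <g,z> + (lam/2)||z-y||^2 + ||z-x||^2/(2*eta)
    return (x / eta + lam * y - g) / (1.0 / eta + lam)
x_hat = epoch_sgd(sgrad_f, prox_step, np.zeros(3), mu=lam, T=4096)
```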
2.2 Constructing an optimum estimator

To turn any ODC algorithm into a low-bias, low-cost and near-constant-variance optimum estimator, we use the multilevel Monte Carlo (MLMC) technique of Blanchet and Glynn [8]. Given a problem instance $(\hat\nabla f, \psi)$, an algorithm ODC and a cutoff parameter $T_{\max} \in \mathbb{N}$, our estimator $\hat x_\star$ is: draw $J \sim \mathrm{Geom}(\tfrac{1}{2})$ (so that $P(J = j) = 2^{-j}$ for $j \in \mathbb{N}$) and, writing $x_j := \mathrm{ODC}(\hat\nabla f, \psi, 2^j)$, compute
$$\hat x_\star = x_0 + \begin{cases} 2^J(x_J - x_{J-1}) & \text{if } 2^J \le T_{\max} \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
We note that for certain ODC algorithms it is possible to extract $x_0, x_{J-1}$ from the intermediate steps of computing $x_J$, so that we only need to invoke ODC once. This is particularly simple to do for EPOCHSGD, as we explain in Appendix B.1. The key properties of our estimator are as follows.

Proposition 1. Let $f$ and $\hat\nabla f$ satisfy Assumption 1, let $F = f + \psi$ be $\mu$-strongly convex with minimizer $x_\star$, and let $T_{\max} \in \mathbb{N}$. For any ODC algorithm with constant $c$, the estimator (1) has bias
$$\|\mathbb{E}\hat x_\star - x_\star\| \le \frac{\sqrt{2c}\,G}{\mu\sqrt{T_{\max}}} \quad\text{and variance}\quad \mathbb{E}\|\hat x_\star - \mathbb{E}\hat x_\star\|^2 \le \frac{16cG^2}{\mu^2}\log_2(T_{\max}).$$
Moreover, the expected number of $\hat\nabla f$ evaluations required to compute $\hat x_\star$ is $O(\log T_{\max})$.

Proof. Let $j_{\max} = \max\{j \in \mathbb{N} \mid 2^j \le T_{\max}\} = \lfloor\log_2 T_{\max}\rfloor$. The expectation of $\hat x_\star$ is
$$\mathbb{E}\hat x_\star = \mathbb{E}x_0 + \sum_{j \le j_{\max}} P(J = j)\,2^j(\mathbb{E}x_j - \mathbb{E}x_{j-1}) = \mathbb{E}x_{j_{\max}},$$
where the second equality follows from $P(J = j) = 2^{-j}$ and the sum telescoping. Noting that $x_{j_{\max}} = \mathrm{ODC}(\hat\nabla f, \psi, T)$ for $T = 2^{j_{\max}} \ge T_{\max}/2$, we have that $\|\mathbb{E}x_{j_{\max}} - x_\star\| \le \sqrt{\mathbb{E}\|x_{j_{\max}} - x_\star\|^2} \le \frac{\sqrt{c}\,G}{\mu\sqrt{T}}$ by Definition 1. To bound the variance we use $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ and note that
$$\mathbb{E}\|\hat x_\star - \mathbb{E}\hat x_\star\|^2 \le \mathbb{E}\|\hat x_\star - x_\star\|^2 \le 2\mathbb{E}\|\hat x_\star - x_0\|^2 + 2\mathbb{E}\|x_0 - x_\star\|^2.$$
The ODC property implies that $\mathbb{E}\|x_0 - x_\star\|^2 \le cG^2/\mu^2$. For the term $\mathbb{E}\|\hat x_\star - x_0\|^2$ we have
$$\mathbb{E}\|\hat x_\star - x_0\|^2 = \sum_{j \le j_{\max}} P(J = j)\,2^{2j}\,\mathbb{E}\|x_j - x_{j-1}\|^2 = \sum_{j \le j_{\max}} 2^j\,\mathbb{E}\|x_j - x_{j-1}\|^2,$$
and
$$\mathbb{E}\|x_j - x_{j-1}\|^2 \le 2\mathbb{E}\|x_j - x_\star\|^2 + 2\mathbb{E}\|x_{j-1} - x_\star\|^2 \le \frac{6cG^2}{\mu^2 2^j}.$$
Substituting, we get $\mathbb{E}\|\hat x_\star - x_0\|^2 \le \frac{6cG^2}{\mu^2}j_{\max}$ and $\mathbb{E}\|\hat x_\star - \mathbb{E}\hat x_\star\|^2 \le \frac{16cG^2}{\mu^2}\log_2(T_{\max})$. Finally, the expected number of $\hat\nabla f$ evaluations is $1 + \sum_{j=1}^{j_{\max}} P(J = j)(2^j + 2^{j-1}) = O(j_{\max})$. ∎

Algorithm 1: OPTEST($\hat\nabla f, \psi, \mu, \delta, \sigma^2, \mathcal{X}$)   ($\delta, \sigma^2$ are the required bias and squared error; $c$ is the ODC algorithm constant)
1 $T_{\max} = \frac{2cG^2}{\mu^2\min\{\delta^2, \sigma^2\}}$ and $N = \big\lceil\frac{32cG^2\log_2(T_{\max})}{\mu^2\sigma^2}\big\rceil$
2 for $i = 1, \ldots, N$ do
3   $\hat x_\star^{(i)}$ = a draw of the estimator (1)
4 return $\frac{1}{N}\sum_{i\in[N]}\hat x_\star^{(i)}$

The function OPTEST in Algorithm 1 computes an estimate of $x_\star$ with desired bias $\delta$ and squared error $\sigma^2$ by averaging independent draws of the MLMC estimator (1). The following guarantees are immediate from Proposition 1; see Appendix B.2 for a short proof.

Theorem 1. Let $f$ and $\hat\nabla f$ satisfy Assumption 1, let $F = f + \psi$ be $\mu$-strongly convex with minimizer $x_\star \in \mathcal{X}$, and let $\delta, \sigma > 0$. The function OPTEST($\hat\nabla f, \psi, \mu, \delta, \sigma^2, \mathcal{X}$) outputs $\hat x_\star$ satisfying $\|\mathbb{E}\hat x_\star - x_\star\| \le \delta$ and $\mathbb{E}\|\hat x_\star - x_\star\|^2 \le \sigma^2$ using $N_{\hat\nabla f}$ stochastic gradient computations, where
$$\mathbb{E}N_{\hat\nabla f} = O\left(\left(\frac{G}{\mu\min\{\delta,\sigma\}}\right)^2\log\frac{G}{\mu\min\{\delta,\sigma\}}\right).$$
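In code, the construction is compact. Below is our own minimal sketch of the MLMC draw (1) and the averaging step of OPTEST; the $T_{\max}$ and $N$ settings mirror Algorithm 1 only up to constants, and the ODC algorithm is passed in as a black box (e.g., the `epoch_sgd` sketch above, wrapped so that `odc(T)` runs it with budget $T$).

```python
import numpy as np

def mlmc_draw(odc, t_max, rng):
    """One draw of the estimator (1); returns (estimate, gradient cost)."""
    j = rng.geometric(0.5)                      # P(J = j) = 2^{-j}, j = 1, 2, ...
    x0, cost = odc(1), 1                        # x_0 = ODC with budget 2^0
    if 2 ** j > t_max:
        return x0, cost
    # Independent runs for clarity; for EPOCHSGD, x_0 and x_{J-1} can be read
    # off the intermediate steps of the run computing x_J (Appendix B.1).
    x_prev, x_curr = odc(2 ** (j - 1)), odc(2 ** j)
    cost += 2 ** (j - 1) + 2 ** j
    return x0 + 2 ** j * (x_curr - x_prev), cost

def opt_est(odc, mu, G, delta, sigma2, c=32, rng=None):
    """Sketch of OPTEST: average MLMC draws for bias <= delta, MSE <= sigma2."""
    rng = rng or np.random.default_rng()
    t_max = int(2 * c * G ** 2 / (mu ** 2 * min(delta ** 2, sigma2))) + 2
    n = int(32 * c * G ** 2 * np.log2(t_max) / (mu ** 2 * sigma2)) + 1
    return np.mean([mlmc_draw(odc, t_max, rng)[0] for _ in range(n)], axis=0)
```

Per Proposition 1, each draw costs $O(\log T_{\max})$ gradient evaluations in expectation, so the total expected cost matches Theorem 1 up to constants.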
2.3 Estimating proximal points and Moreau envelope gradients

The proximal point of function $f : \mathcal{X} \to \mathbb{R}$ with regularization level $\lambda$ at point $y$ is
$$P_{f,\lambda}(y) := \operatorname{argmin}_{x\in\mathcal{X}}\Big\{ f(x) + \frac{\lambda}{2}\|x - y\|^2 \Big\}. \tag{2}$$
When $f$ satisfies Assumption 1, we may use OPTEST (with $\psi(x) = \frac{\lambda}{2}\|x - y\|^2$ and $\mu = \lambda$) to obtain a reduced-bias proximal point estimator. The proximal point $P_{f,\lambda}(y)$ is closely related to the Moreau envelope
$$f_\lambda(y) := \min_{x\in\mathcal{X}}\Big\{ f(x) + \frac{\lambda}{2}\|x - y\|^2 \Big\} \tag{3}$$
via the relationship $\nabla f_\lambda(y) = \lambda(y - P_{f,\lambda}(y))$ (see Appendix B.3). Therefore, we can use our optimum estimator to turn $\widetilde{O}(1)$ calls to $\hat\nabla f$ into a nearly unbiased estimator for $\nabla f_\lambda$. We formulate this as follows.

Algorithm 2: MORGRADEST($\hat\nabla f, y, \lambda, \delta, \sigma^2, \mathcal{X}$)   ($\delta, \sigma^2$ are the required bias and squared error; $\lambda$ is the regularization level; $y$ is the point at which to estimate $\nabla f_\lambda(y)$)
1 $\psi_\lambda(x) = \frac{\lambda}{2}\|x - y\|^2$
2 $\hat x_\star$ = OPTEST($\hat\nabla f, \psi_\lambda, \lambda, \delta/\lambda, \sigma^2/\lambda^2, \mathcal{X}$)
3 return $\lambda(y - \hat x_\star)$

Corollary 2. Let $f$ and $\hat\nabla f$ satisfy Assumption 1, let $y \in \mathcal{X}$ and let $\lambda, \sigma, \delta > 0$. The function MORGRADEST($\hat\nabla f, y, \lambda, \delta, \sigma^2, \mathcal{X}$) outputs $\hat\nabla f_\lambda(y)$ satisfying $\|\mathbb{E}\hat\nabla f_\lambda(y) - \nabla f_\lambda(y)\| \le \delta$ and $\mathbb{E}\|\hat\nabla f_\lambda(y) - \nabla f_\lambda(y)\|^2 \le \sigma^2$, and has complexity
$$\mathbb{E}N_{\hat\nabla f} = O\left(\left(\frac{G}{\min\{\delta,\sigma\}}\right)^2\log\frac{G}{\min\{\delta,\sigma\}}\right).$$

3 Projection-efficient convex optimization

In this section, we combine the bias-reduced Moreau envelope gradient estimator with a standard accelerated gradient method to recover the result of Thekumparampil et al. [51]. We consider the problem of minimizing a function $f$ satisfying Assumption 1 on $B_R(0)$, subject to the constraint $x \in \mathcal{X}$, where $\mathcal{X} \subseteq B_R(0)$ is a complicated convex set that we can only access via (expensive) projections of the form $\mathrm{Proj}_{\mathcal{X}}(x) := \operatorname{argmin}_{y\in\mathcal{X}}\|y - x\|$. We further assume that an initial point $x_0 \in \mathcal{X}$ satisfies $\|x_0 - x_\star\| \le D$.

Algorithm 3 applies a variant of Nesterov's accelerated gradient descent method (related to [2, 1]) to the ($\lambda$-smooth) Moreau envelope $f_\lambda$ defined in eq. (3). Since computing the Moreau envelope gradient does not involve projection to $\mathcal{X}$, for a sufficiently accurate approximation of $\nabla f_\lambda$ we require only $T = O(\sqrt{\lambda D^2/\epsilon})$ projections to $\mathcal{X}$ for finding an $O(\epsilon)$-suboptimal point of $f_\lambda$ constrained to $\mathcal{X}$. For that point to be also $\epsilon$-suboptimal for $f$ itself, we must choose $\lambda$ of the order of $G^2/\epsilon$, so that the number of projections is $O(GD/\epsilon)$.

Algorithm 3: Stochastic accelerated gradient descent on the Moreau envelope
Input: A gradient estimator $\hat\nabla f$ satisfying Assumption 1 in $B_R(0)$, projection oracle $\mathrm{Proj}_{\mathcal{X}}$, and initial point $x_0 = v_0$ with $\|x_0 - x_\star\| \le D$.
Parameters: Iteration budget $T$, Moreau regularization $\lambda$, approximation parameters $\delta_k, \sigma_k^2$.
1 for $k = 1, \ldots, T$ do
2   $y_{k-1} = \frac{k-1}{k+1}x_{k-1} + \frac{2}{k+1}v_{k-1}$
3   $g_k$ = MORGRADEST($\hat\nabla f, y_{k-1}, \lambda, \delta_k, \sigma_k^2, B_R(0)$)
4   $x_k = \mathrm{Proj}_{\mathcal{X}}\big(y_{k-1} - \frac{1}{3\lambda}g_k\big)$
5   $v_k = \mathrm{Proj}_{B_R(0)}\big(v_{k-1} - \frac{k+1}{6\lambda}g_k\big)$
6 return $x_T$

As noted in [51], computing $\nabla f_\lambda$ to accuracy $O(\epsilon/R)$ is sufficient for the above guarantee to hold, but doing so using a stochastic gradient method requires $O((GD/\epsilon)^2)$ evaluations of $\hat\nabla f$ per iteration, and $O((GD/\epsilon)^3)$ evaluations in total. To improve this, we employ Algorithm 2 to compute nearly-unbiased estimates for $\nabla f_\lambda$ and bound the error incurred by their variance. Our result matches the gradient-sliding-based technique of Thekumparampil et al. [51] up to polylogarithmic factors while retaining the conceptual simplicity of directly applying AGD on the Moreau envelope. We formally state the guarantees of our method below, and provide a self-contained proof in Appendix C.

Theorem 3. Let $f : B_R(0) \to \mathbb{R}$ and $\hat\nabla f$ satisfy Assumption 1. Let $\mathcal{X} \subseteq B_R(0)$ be a convex set admitting a projection oracle $\mathrm{Proj}_{\mathcal{X}}$. Let $x_0 \in \mathcal{X}$ be an initial point with $\|x_0 - x_\star\| \le D$ for some $x_\star \in \mathcal{X}$. With $\lambda = \frac{2G^2}{\epsilon}$, $\delta_k = \frac{\epsilon}{8R}$, $\sigma_k^2 = \frac{2\epsilon\lambda}{k+1}$, and $T = \frac{7GD}{\epsilon}$, Algorithm 3 computes $x \in \mathcal{X}$ with $\mathbb{E}[f(x)] \le f(x_\star) + \epsilon$ with complexity $\mathbb{E}N_{\hat\nabla f} = O\big(\frac{G^2D^2}{\epsilon^2}\log^2\frac{GR}{\epsilon}\big)$ and $O\big(\frac{GD}{\epsilon}\big)$ calls to $\mathrm{Proj}_{\mathcal{X}}$.
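To show how the pieces compose, here is our own sketch of MORGRADEST (Algorithm 2) on top of the `epoch_sgd` and `opt_est` sketches above, followed by the Algorithm 3 loop. The step sizes follow lines 2-5 of Algorithm 3, but the budget constants are schematic; `proj_X` and `proj_ball` are assumed projections onto $\mathcal{X}$ and $B_R(0)$.

```python
import numpy as np

def mor_grad_est(sgrad_f, y, lam, delta, sigma2, G):
    """Sketch of MORGRADEST: grad f_lam(y) = lam * (y - P_{f,lam}(y))."""
    # psi(x) = (lam/2)||x - y||^2; a gradient bias of delta corresponds to a
    # proximal-point bias of delta / lam (and variance sigma2 / lam^2).
    def prox_step(x, g, eta):
        return (x / eta + lam * y - g) / (1.0 / eta + lam)
    odc = lambda T: epoch_sgd(sgrad_f, prox_step, y.copy(), lam, T)
    x_hat = opt_est(odc, mu=lam, G=G, delta=delta / lam, sigma2=sigma2 / lam ** 2)
    return lam * (y - x_hat)

def agd_moreau(sgrad_f, proj_X, proj_ball, x0, lam, T, G, eps, R):
    """Sketch of Algorithm 3: stochastic AGD on the Moreau envelope."""
    x, v = x0.copy(), x0.copy()
    for k in range(1, T + 1):
        y = (k - 1) / (k + 1) * x + 2 / (k + 1) * v
        g = mor_grad_est(sgrad_f, y, lam, delta=eps / (8 * R),
                         sigma2=2 * eps * lam / (k + 1), G=G)
        x = proj_X(y - g / (3 * lam))                # gradient (primal) step
        v = proj_ball(v - (k + 1) / (6 * lam) * g)   # mirror (dual) step
    return x
```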
4 Accelerated proximal methods and minimizing the maximal loss

In this section we apply our estimator in an accelerated proximal point method and use it to obtain an optimal rate for minimizing the maximum of $N$ convex functions (up to logarithmic factors).

4.1 Accelerated proximal point method via Moreau gradient estimation

Algorithm 4 is a Monteiro-Svaiter-type [38, 12] accelerated proximal point method [36, 24] that leverages our reduced-bias Moreau envelope gradient estimator. To explain the method, we contrast it with stochastic AGD on the Moreau envelope (Algorithm 3). First and foremost, Algorithm 3 provides a suboptimality bound on the Moreau envelope $f_\lambda$ (which for small $\lambda$ is far from $f$), while Algorithm 4 minimizes $f$ itself. Second, while Algorithm 3 uses a fixed regularization parameter $\lambda$, Algorithm 4 handles an arbitrary sequence $\{\lambda_k\}$ given by a black-box function NEXTLAMBDA. To facilitate our application of the method to minimizing the maximal loss, where gradient estimation is only tractable in small Euclidean balls around a reference point, we include an optional parameter $r$ such that the proximal point movement bound $\|P_{f,\lambda_{k+1}}(y_k) - y_k\| \le r$ holds for all $k$. However, most of our analysis of Algorithm 4 does not require this parameter (i.e., it holds for $r = \infty$), making it potentially applicable to other settings that use accelerated proximal point methods [11, 38, 50]. The third and final notable difference between Algorithms 3 and 4 is the method of updating the $x_k$ iteration sequence. While a projected stochastic gradient descent step suffices for Algorithm 3, here we require a more direct approximation of the function value decrease attained by the exact proximal mapping $P_{f,\lambda}$ (see eq. (2)). For a given accuracy $\varphi$, we define the $\varphi$-approximate proximal mapping
$$\widetilde{P}^{\varphi}_{f,\lambda}(y) := \text{any } x \in \mathcal{X} \text{ such that } \mathbb{E}F(x) \le F(P_{f,\lambda}(y)) + \varphi, \quad\text{for}\quad F(z) := f(z) + \frac{\lambda}{2}\|z - y\|^2. \tag{4}$$

Algorithm 4: Stochastic accelerated proximal point method
Input: Gradient estimator $\hat\nabla f$, function NEXTLAMBDA, initialization $x_0 = v_0$ and $A_0 \ge 0$.
Parameters: Approximation parameters $\{\varphi_k, \delta_k, \sigma_k\}$, stopping parameters $A_{\max}$ and $K_{\max}$, optional movement bound $r > 0$.
1 for $k = 0, 1, \ldots$ do
2   $\lambda_{k+1}$ = NEXTLAMBDA($x_k, v_k, A_k$)   (guaranteeing that $\|P_{f,\lambda_{k+1}}(y_k) - y_k\| \le r$)
3   $a_{k+1} = \frac{1}{2\lambda_{k+1}}\big(1 + \sqrt{1 + 4\lambda_{k+1}A_k}\big)$ and $A_{k+1} = A_k + a_{k+1}$ and $\mathcal{X}_k = \mathcal{X} \cap B_r(y_k)$
4   $y_k = \frac{A_k}{A_{k+1}}x_k + \frac{a_{k+1}}{A_{k+1}}v_k$ and $x_{k+1} = \widetilde{P}^{\varphi_{k+1}}_{f,\lambda_{k+1}}(y_k)$   (defined in eq. (4))
5   $g_{k+1}$ = MORGRADEST($\hat\nabla f, y_k, \lambda_{k+1}, \delta_k, \sigma_k^2, \mathcal{X}_k$) and $v_{k+1} = \mathrm{Proj}_{\mathcal{X}}\big(v_k - a_{k+1}g_{k+1}\big)$
6   if $A_{k+1} \ge A_{\max}$ or $k + 1 = K_{\max}$ then return $x_{k+1}$

Note that $\widetilde{P}^{0}_{f,\lambda} = P_{f,\lambda}$ and that for $\varphi > 0$ we can compute $\widetilde{P}^{\varphi}_{f,\lambda}$ with an appropriate SGD variant (such as EPOCHSGD) using $O(G^2/(\lambda\varphi))$ evaluations of $\hat\nabla f$. With the differences between the algorithms explained, we emphasize their key similarity: both algorithms update the $v_k$ sequence using our bias reduction method MORGRADEST (Algorithm 2), which holds the key to their efficiency. The following proposition shows that Algorithm 4 admits the same bound on $K_{\max}$ as an exact accelerated proximal point method [12], while requiring at most $\widetilde{O}(G^2R^2\epsilon^{-2})$ stochastic gradient evaluations; see the proof in Appendix D.1.

Proposition 2. Let $f : \mathcal{X} \to \mathbb{R}$ and $\hat\nabla f$ satisfy Assumption 1, and let $\mathcal{X} \subseteq B_R(x_0)$. For a target accuracy $\epsilon \le GR$, let $\varphi_{k+1} = \frac{\epsilon}{60\lambda_{k+1}a_{k+1}}$, $\delta_{k+1} = \frac{\epsilon}{120R}$, $\sigma^2_{k+1} = \frac{\epsilon}{60a_{k+1}}$, and $A_0 = \frac{R}{G}$. If $\lambda_k \ge \lambda_{\min} \ge \frac{1}{A_{\max}}$ with $A_{\max} = \Theta\big(\frac{R^2}{\epsilon}\big)$ for all $k \le K_{\max}$, then lines 4 and 5 of Algorithm 4 have total complexity $\mathbb{E}N_{\hat\nabla f} = O\big(\big(K_{\max} + \frac{G^2R^2}{\epsilon^2}\big)\log\frac{GR}{\epsilon}\big)$. If in addition $\|P_{f,\lambda_k}(y_{k-1}) - y_{k-1}\| \ge 3r/4$ whenever $\lambda_k \ge 2\lambda_{\min}$, then for $K_{\max} = O\big(\big(\frac{R}{r}\big)^{2/3}\log\frac{GR}{\epsilon}\big)$ the algorithm's output $x_K$ satisfies $f(x_K) \le f(x_\star) + \epsilon$ with probability at least $\frac{2}{3}$.

4.2 Minimizing the maximal loss

We now consider objectives of the form $f_{\max}(x) := \max_{i\in[N]} f^{(i)}(x)$, where each function $f^{(i)} : \mathcal{X} \to \mathbb{R}$ is convex and $G$-Lipschitz. Our approach to minimizing $f_{\max}$ largely follows Carmon et al. [13]; the main difference is that we approximate proximal steps via Algorithm 4 and our reduced-bias estimator. The first step of the approach is to replace $f_{\max}$ with the softmax function, defined for a given target accuracy $\epsilon$ as
$$f_{\mathrm{smax}}(x) := \epsilon'\log\Big(\sum_{i\in[N]} e^{f^{(i)}(x)/\epsilon'}\Big), \quad\text{where}\quad \epsilon' := \frac{\epsilon}{2\log N}.$$
Since $f_{\mathrm{smax}}(x) - f_{\max}(x) \in [0, \epsilon/2]$, an $\epsilon/2$-accurate solution of $f_{\mathrm{smax}}$ is $\epsilon$-accurate for $f_{\max}$. The second step is to develop an efficient gradient estimator for $f_{\mathrm{smax}}$; this is non-trivial because $f_{\mathrm{smax}}$ is not a finite sum or expectation. In [13] this is addressed via an "exponentiated softmax" trick; we develop an alternative, rejection-sampling-based approach that fits Algorithm 4 more directly (see Algorithm 9). To produce an unbiased estimate of $\nabla f_{\mathrm{smax}}(x)$ for $x$ in a ball of radius $r_\epsilon = \epsilon'/G$ around a reference point $y$, we require a single $\nabla f^{(i)}(x)$ evaluation (for some $i$), $O(1)$ evaluations of $f^{(i)}(x)$ in expectation, and evaluation of the $N$ functions $f^{(1)}(y), \ldots, f^{(N)}(y)$ for pre-processing.
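Our sketch of the rejection sampling idea (Algorithm 9 in the paper may differ in details): pre-compute the softmax weights at the reference point $y$, propose $i$ from them, and accept with probability $e^{(f^{(i)}(x) - f^{(i)}(y))/\epsilon' - 1}$. Since $\|x - y\| \le \epsilon'/G$ and each $f^{(i)}$ is $G$-Lipschitz, the exponent lies in $[-2, 0]$, so acceptance takes $O(1)$ trials in expectation, and accepted indices follow the softmax weights at $x$, making the returned subgradient unbiased for $\nabla f_{\mathrm{smax}}(x)$.

```python
import numpy as np

def softmax_grad_estimator(fs, grads, y, eps_prime, rng):
    """Unbiased estimator of grad f_smax(x) for x near the reference point y.

    fs[i](x) evaluates f^(i)(x) and grads[i](x) returns a subgradient of it;
    the pre-processing below evaluates all N functions once at y (the
    expensive operation), after which each query costs O(1) evaluations
    in expectation."""
    fy = np.array([f(y) for f in fs])
    logits = (fy - fy.max()) / eps_prime
    p_y = np.exp(logits) / np.exp(logits).sum()    # softmax weights at y

    def estimate(x):                               # requires ||x - y|| <= eps'/G
        while True:
            i = rng.choice(len(fs), p=p_y)         # propose i ~ p_y
            if rng.random() < np.exp((fs[i](x) - fy[i]) / eps_prime - 1.0):
                return grads[i](x)                 # accepted i ~ softmax at x
    return estimate
```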
Plugging this estimator into Algorithm 4 with $r = r_\epsilon$, the total pre-processing overhead of lines 4 and 5 is $O(K_{\max}N)$. The final step is to find a function NEXTLAMBDA such that $\|P_{f_{\mathrm{smax}},\lambda_{k+1}}(y_k) - y_k\| \le r_\epsilon$ for all $k$ (enabling gradient estimation), and $\|P_{f_{\mathrm{smax}},\lambda_{k+1}}(y_k) - y_k\| \ge \frac{3}{4}r_\epsilon$ when $\lambda_{k+1} > 2\lambda_{\min}$ (allowing us to bound $K_{\max}$ with Proposition 2). Here we use the bisection subroutine from [13] as is (see Algorithm 10). By judiciously choosing $\lambda_{\min}$ (an improvement over the analysis in [13]), we obtain the following complexity guarantee on $N_{f^{(i)}}$ and $N_{\partial f^{(i)}}$, the total numbers of individual function and subgradient evaluations, respectively. (See proof in Appendix D.2.)

Theorem 4. Let $f^{(1)}, \ldots, f^{(N)} : \mathcal{X} \to \mathbb{R}$ be convex and $G$-Lipschitz, and let $\mathcal{X} \subseteq B_R(x_0)$. For any $\epsilon < \frac{1}{2}GR/\log N$, Algorithm 4 (with $\widetilde{P}^{\varphi}_{f_{\mathrm{smax}},\lambda}$ implemented via Algorithm 8, $\hat\nabla f_{\mathrm{smax}}$ given by Algorithm 9, and NEXTLAMBDA given by Algorithm 10 with $\lambda_{\min} = \widetilde{O}(\epsilon/(r_\epsilon^{4/3}R^{2/3}))$) outputs $x \in \mathcal{X}$ that with probability at least $\frac{1}{2}$ is $\epsilon$-suboptimal for $f_{\max}(x) = \max_{i\in[N]}f^{(i)}(x)$, and has complexity
$$\mathbb{E}N_{f^{(i)}} = \widetilde{O}\left(N\left(\frac{GR}{\epsilon}\right)^{2/3} + \left(\frac{GR}{\epsilon}\right)^{2}\right) \quad\text{and}\quad \mathbb{E}N_{\partial f^{(i)}} = \widetilde{O}\left(\left(\frac{GR}{\epsilon}\right)^{2}\right).$$

The rate given by Theorem 4 matches (up to logarithmic factors) the lower bound $\Omega\big(N(GR/\epsilon)^{2/3} + (GR/\epsilon)^2\big)$ shown in [13] and is therefore near-optimal.

5 Gradient-efficient composite optimization

Consider the problem of finding a minimizer of the following convex composite optimization problem
$$\min_{x\in\mathcal{X}}\ \Phi(x) := \phi(x) + f(x), \quad\text{where } \phi \text{ is } L\text{-smooth and } f \text{ satisfies Assumption 1}, \tag{5}$$
given $x_0$ such that $\|x_0 - x_\star\| \le R$ for some $x_\star \in \operatorname{argmin}_{x\in\mathcal{X}}\Phi(x)$. Lan [34] developed a method called gradient sliding that finds an $\epsilon$-accurate solution to (5) with complexity $N_{\nabla\phi} = O(\sqrt{LR^2/\epsilon})$ evaluations of $\nabla\phi(x)$ and $N_{\hat\nabla f} = O((GR/\epsilon)^2)$ evaluations of $\hat\nabla f(x)$, which are optimal even for each component separately.⁶ In this section, we provide an alternative algorithm that matches the complexity of gradient sliding up to logarithmic factors and is conceptually simple.

⁶The gradient sliding result holds under a relaxed Lipschitz assumption [see 34, eq. (1.2)]. It is straightforward to extend EPOCHSGD, and hence all of our results, to that assumption as well.

Our approach, Algorithm 5, is essentially composite AGD [41], where at the $k$th iteration we compute a proximal point (2) with respect to a partial linearization of $\Phi$ around $y_k$. In particular, letting $\ell_k(v) := \phi(y_k) + \langle\nabla\phi(y_k), v - y_k\rangle$ and $\beta_k = \frac{2L}{k}$, we approximate $P_{\ell_k+f,\beta_k}(v_{k-1})$. Similar to Algorithm 4, Algorithm 5 computes two types of approximations: one is an $\epsilon_k$-approximate proximal point $\widetilde{P}^{\epsilon_k}_{\ell_k+f,\beta_k}(v_{k-1})$ as per its definition (4), while the other is our bias-reduced optimum estimator from Algorithm 1. We note, however, that unlike Algorithm 4, which approximates the $x_k$ update, here we approximate $v_k$, the mirror descent update. Below we state the formal guarantees for Algorithm 5; we defer its proof to Appendix E.

Algorithm 5: Stochastic composite accelerated gradient descent
Input: A problem of the form (5) with $\phi, f, \nabla\phi, \hat\nabla f$.
Parameters: Step size parameters $\beta_k = \frac{2L}{k}$ and $\gamma_k = \frac{2}{k+1}$, iteration number $N$, approximation parameters $\{\epsilon_k, \delta_k, \sigma^2_k\}$, and $x_0 = v_0$ satisfying $\|x_0 - x_\star\| \le R$.
1 for $k = 1, 2, \ldots, N$ do
2   $y_k = (1 - \gamma_k)x_{k-1} + \gamma_k\,\mathrm{Proj}_{\mathcal{X}}(v_{k-1})$
3   $\tilde v_k = \widetilde{P}^{\epsilon_k}_{\ell_k+f,\beta_k}(v_{k-1})$ for $\ell_k(v) := \phi(y_k) + \langle\nabla\phi(y_k), v - y_k\rangle$
4   $v_k$ = OPTEST($\hat\nabla f, \psi_k, \beta_k, \delta_k, \sigma^2_k, B_R(v_0) \cap \mathcal{X}$) for $\psi_k(z) = \frac{\beta_k}{2}\|z - v_{k-1}\|^2 + \ell_k(z)$
5   $x_k = (1 - \gamma_k)x_{k-1} + \gamma_k\tilde v_k$
6 return $x_N$
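As an illustration of the loop structure (ours, reusing the `epoch_sgd` and `opt_est` sketches from Section 2; the inner budget and parameter constants are schematic), each iteration solves the $\beta_k$-strongly convex subproblem twice, once for function-value accuracy and once with low bias:

```python
import numpy as np

def composite_agd(grad_phi, sgrad_f, proj_X, x0, L, G, R, N):
    """Sketch of Algorithm 5: composite AGD with estimated subproblem solutions."""
    x, v = x0.copy(), x0.copy()
    for k in range(1, N + 1):
        beta, gamma = 2 * L / k, 2 / (k + 1)
        y = (1 - gamma) * x + gamma * proj_X(v)
        g_phi, v_prev = grad_phi(y), v.copy()    # one expensive exact gradient
        # Subproblem: minimize <g_phi, z> + f(z) + (beta/2)||z - v_prev||^2.
        def prox_step(z, g, eta):
            return (z / eta + beta * v_prev - g - g_phi) / (1.0 / eta + beta)
        odc = lambda T: epoch_sgd(sgrad_f, prox_step, v_prev.copy(), beta, T)
        v_tilde = odc(4096)                      # function-value-accurate prox
                                                 # (budget schematic)
        v = opt_est(odc, mu=beta, G=G,           # low-bias mirror-descent update
                    delta=R / (16 * N), sigma2=R ** 2 / (4 * N))
        x = (1 - gamma) * x + gamma * v_tilde
    return x
```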
Theorem 5. Given problem (5) with solution $x_\star$, a point $x_0$ such that $\|x_0 - x_\star\| \le R$, and target accuracy $\epsilon > 0$, Algorithm 5 with $\epsilon_k = \frac{LR^2}{2kN}$, $\delta_k = \frac{R}{16N}$, $\sigma^2_k = \frac{R^2}{4N}$, and $N = \Theta(\sqrt{LR^2/\epsilon})$ finds an approximate solution $x$ satisfying $\mathbb{E}\,\Phi(x) \le \Phi(x_\star) + \epsilon$, and has complexity
$$N_{\nabla\phi} = O\left(\sqrt{\frac{LR^2}{\epsilon}}\right) \quad\text{and}\quad \mathbb{E}N_{\hat\nabla f} = O\left(\left(\frac{GR}{\epsilon}\right)^2\log\frac{GR}{\epsilon}\right).$$

6 Efficient non-smooth private convex optimization

We conclude the paper with a potential application of our optimum estimator to differentially private stochastic convex optimization (DP-SCO). In this problem we are given $n$ i.i.d. samples $s_i \sim P$ taking values in a set $\mathcal{S}$, and our objective is to privately minimize the population average $f(x) = \mathbb{E}_{S\sim P}[\hat f(x; S)]$, where $\hat f : \mathcal{X}\times\mathcal{S} \to \mathbb{R}$ is convex in its first argument and $\mathcal{X} \subseteq \mathbb{R}^d$ is a convex set. That is, we wish to find $\hat x \in \mathcal{X}$ with small excess loss $f(\hat x) - \min_{x\in\mathcal{X}} f(x)$ while preserving differential privacy.

Definition 2 ([22]). A randomized algorithm $\mathcal{A}$ is $(\varepsilon, \beta)$-differentially private ($(\varepsilon, \beta)$-DP) if, for all datasets $S, S' \in \mathcal{S}^n$ that differ in a single data element and for every event $O$ in the output space of $\mathcal{A}$, we have $P[\mathcal{A}(S) \in O] \le e^{\varepsilon}P[\mathcal{A}(S') \in O] + \beta$.

DP-SCO has received increased attention over the past few years. Bassily et al. [5] developed (inefficient) algorithms that attain the optimal excess loss $GR\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log(1/\beta)}}{\varepsilon n}\big)$. When each function is $O(\sqrt{n})$-smooth, Feldman et al. [23] gave algorithms with optimal excess loss and $O(n)$ gradient query complexity. In the non-smooth setting, however, their algorithms require $O(n^2)$ subgradients. Subsequently, Asi et al. [3] and Kulkarni et al. [32] developed more efficient algorithms for non-smooth functions which need $O(\min(n^2/\sqrt{d}, n^{5/4}d^{1/8}, n^{3/2}/d^{1/8}))$ subgradients, which is $O(n^{11/8})$ for the high-dimensional setting $d = n$. Whether a linear gradient complexity is achievable for DP-SCO in the non-smooth setting is still open.

In this section, we develop an efficient algorithm for non-smooth DP-SCO that queries $\widetilde{O}(n)$ subgradients, conditional on the existence of an optimum estimator with the following properties.

Definition 3. Let $F = f + \psi$ be $\mu$-strongly convex with minimizer $x_\star$, where $f$ is $G$-Lipschitz. For $\delta > 0$, we say that $\mathcal{O}_\delta$ is an efficient bounded low-bias estimator (EBBOE) if it returns $\hat x_\star = \mathcal{O}_\delta(F)$ such that $\|\mathbb{E}[\hat x_\star - x_\star]\|^2 \le \delta^2$, $\|\hat x_\star - x_\star\|^2 \le C_1G^2\log(G/(\mu\delta))/\mu^2$, and the expected number of gradient queries is $C_2\log(G/(\mu\delta))$.

Comparing to our MLMC estimator (1) and Proposition 1, we note that the only place our current estimator falls short of satisfying Definition 3 is the probability-1 bound on $\|\hat x_\star - x_\star\|^2$, which for (1) holds only in expectation. Indeed, for our estimator, $\|\hat x_\star - x_\star\|$ can be as large as $O(G/(\mu\delta))$, meaning that it is heavy-tailed. It is not clear whether an EBBOE as defined above exists. Nevertheless, assuming access to such an estimator, Algorithm 6 solves the DP-SCO problem with a near-linear number of gradient computations. The algorithm builds on the recent localization-based optimization methods in [23], which iteratively solve regularized minimization problems.
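Before stating the algorithm, here is a schematic (ours) of its core private-aggregation step (lines 4 and 5 below): average $B$ draws of the hypothetical EBBOE and add Gaussian-mechanism noise scaled to the draws' sensitivity, which is exactly the quantity that Definition 3 bounds and that our MLMC estimator lacks. The noise calibration follows the standard Gaussian mechanism (see [21]), not Algorithm 6's exact constants.

```python
import numpy as np

def private_phase(ebboe_draw, B, sens, eps, beta, rng):
    """Schematic private mean of B optimum-estimator draws (Gaussian mechanism).

    ebboe_draw() is a draw of the hypothetical EBBOE O_delta on the current
    regularized objective F_i; `sens` is an assumed bound on how much the
    average can move when a single sample is replaced."""
    x_bar = np.mean([ebboe_draw() for _ in range(B)], axis=0)
    sigma = sens * np.sqrt(2 * np.log(2 / beta)) / eps   # Gaussian mechanism
    return x_bar + sigma * rng.standard_normal(x_bar.shape)
```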
Algorithm 6: Differentially-private stochastic convex optimization via optimum estimation
Input: $(s_1, \ldots, s_n) \in \mathcal{S}^n$, domain $\mathcal{X} \subseteq B_R(x_0)$, EBBOE $\mathcal{O}$ (satisfying Definition 3).
1 Set $k = \lceil\log n\rceil$, $B = 20\big(\log(\tfrac{1}{\beta}) + C_2\log^2 n\big)$, $\bar n = \lfloor\tfrac{n}{k}\rfloor$, and $\eta = R\min\big\{\tfrac{1}{\sqrt{n}}, \tfrac{\varepsilon}{B\log n}\big\}$
2 for $i = 1, 2, \ldots, k$ do
3   Let $\eta_i = \frac{\eta}{2\cdot 4^i}$, $f_i(x) = \frac{1}{\bar n}\sum_{j=1+(i-1)\bar n}^{i\bar n}\hat f(x; s_j)$, and $\psi_i(x) = \frac{\|x - x_{i-1}\|^2}{\eta_i\bar n}$
4   Let $\bar x_i = \frac{1}{B}\sum_{j=1}^{B}\mathcal{O}_{\delta_i}(F_i)$ with $F_i = f_i + \psi_i$ and $\delta_i = \frac{R}{n^2}$
5   Set $x_i = \bar x_i + \zeta_i$ where $\zeta_i \sim \mathcal{N}(0, \sigma_i^2 I_d)$ with $\sigma_i = 8B\big(\sqrt{C_1\log n} + 2\big)\eta_i\sqrt{\log(2/\beta)}/\varepsilon$
6 return $x_k$

We average multiple draws of the (hypothetical) bounded optimum estimator to solve the regularized problems, and apply private mean estimation procedures to preserve privacy. We defer the proof of the following result to Appendix F.

Theorem 6 (conditional). Given an efficient bounded low-bias estimator $\mathcal{O}_\delta$ satisfying Definition 3 for any $\delta > 0$, then for $\varepsilon \le \log(1/\beta)$, $\mathcal{X} \subseteq B_R(x_0)$, and convex, $G$-Lipschitz $\hat f(x; s)$, Algorithm 6 is $(\varepsilon, \beta)$-DP, queries $\widetilde{O}(n)$ subgradients, and has (hiding logarithmic factors in $n$)
$$\mathbb{E}[f(x_k)] - \min_{x\in\mathcal{X}} f(x) \le GR\,\widetilde{O}\left(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\log^3(1/\beta)}}{\varepsilon n}\right).$$

Theorem 6 provides strong motivation for constructing bounded optimum estimators that satisfy Definition 3. In Appendix F.3, we discuss the challenges in making our MLMC estimator bounded, as well as some directions for overcoming them.

Acknowledgments

HA was supported by ONR YIP N00014-19-2288 and the Stanford DAWN Consortium. YC was supported in part by Len Blavatnik and the Blavatnik Family Foundation, and the Yandex Initiative for Machine Learning. YJ was supported by a Stanford Graduate Fellowship. AS was supported in part by a Microsoft Research Faculty Fellowship, NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, a PayPal research award, and a Sloan Research Fellowship.

References

[1] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the Forty-Ninth Annual ACM Symposium on the Theory of Computing, 2017.
[2] Z. Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. In Proceedings of the 8th Innovations in Theoretical Computer Science (ITCS), 2017.
[3] H. Asi, V. Feldman, T. Koren, and K. Talwar. Private stochastic convex optimization: Optimal rates in ℓ1 geometry. In Proceedings of the 38th International Conference on Machine Learning, 2021.
[4] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 55th Annual Symposium on Foundations of Computer Science, pages 464-473, 2014.
[5] R. Bassily, V. Feldman, K. Talwar, and A. Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, volume 32, pages 11282-11291, 2019.
[6] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
[7] J. Blanchet and Y. Kang. Semi-supervised learning based on distributionally robust optimization. In Stochastic Modeling Techniques and Data Analysis International Conference, 2018.
[8] J. H. Blanchet and P. W. Glynn. Unbiased Monte Carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pages 3656-3667, 2015.
[9] J. H. Blanchet, P. W. Glynn, and Y. Pei. Unbiased multilevel Monte Carlo: Stochastic optimization, steady-state simulation, quantiles, and other applications. arXiv:1904.09929 [math.ST], 2019.
[10] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.
[11] S. Bubeck, Q. Jiang, Y. T. Lee, Y. Li, and A. Sidford. Complexity of highly parallel non-smooth convex optimization. arXiv:1906.10655 [math.OC], 2019.
[12] Y. Carmon, A. Jambulapati, Q. Jiang, Y. Jin, Y.
T. Lee, A. Sidford, and K. Tian. Acceleration with a ball optimization oracle. In Advances in Neural Information Processing Systems, 2020.
[13] Y. Carmon, A. Jambulapati, Y. Jin, and A. Sidford. Thinking inside the ball: Near-optimal minimization of the maximal loss. In Conference on Learning Theory, 2021.
[14] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069-1109, 2011.
[15] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. Journal of the ACM, 59(5):1-49, 2012.
[16] M. Cohen, J. Diakonikolas, and L. Orecchia. On acceleration with noise-corrupted gradients. In International Conference on Machine Learning, pages 1019-1028. PMLR, 2018.
[17] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207-239, 2019.
[18] J. Diakonikolas and C. Guzmán. Lower bounds for parallel and randomized convex optimization. In Proceedings of the Thirty Second Annual Conference on Computational Learning Theory, pages 1132-1157, 2019.
[19] D. Drusvyatskiy. The proximal point method revisited. SIAG/OPT Views and News, 26(1).
[20] J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674-701, 2012.
[21] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.
[22] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Theory of Cryptography Conference, pages 265-284, 2006.
[23] V. Feldman, T. Koren, and K. Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM Symposium on the Theory of Computing, pages 439-449, 2020.
[24] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, 2015.
[25] M. B. Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259-328, 2015.
[26] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press.
[27] Y. Gurevich and S. Shelah. Nearly linear time. In International Symposium on Logical Foundations of Computer Science, pages 108-118, 1989.
[28] E. Hazan and S. Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489-2512, 2014.
[29] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Springer, 1993.
[30] H. Jiang, Y. T. Lee, Z. Song, and S. C. Wong. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the Fifty-Second Annual ACM Symposium on the Theory of Computing, pages 944-953. ACM, 2020.
[31] L. G. Khachiyan. A polynomial algorithm in linear programming. Doklady Akademii Nauk, 244(5):1093-1096, 1979.
[32] J. Kulkarni, Y. T. Lee, and D. Liu. Private non-smooth empirical risk minimization and stochastic convex optimization in subquadratic steps. arXiv:2103.15352 [cs.LG], 2021.
[33] G. Lan. Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Mathematical Programming, 149(1):1-45, 2015.
[34] G. Lan.
Gradient sliding for composite optimization. Mathematical Programming, 159(1):201-235, 2016.
[35] D. Levy, Y. Carmon, J. C. Duchi, and A. Sidford. Large-scale methods for distributionally robust optimization. In Advances in Neural Information Processing Systems, 2020.
[36] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.
[37] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[38] R. D. C. Monteiro and B. F. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092-1125, 2013.
[39] A. Nemirovski. On parallel complexity of nonsmooth convex optimization. Journal of Complexity, 10(4):451-463, 1994.
[40] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.
[41] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.
[42] Y. Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
[43] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123-231, 2013.
[44] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 1571-1578, 2012.
[45] S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why? In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[46] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proceedings of the Twenty Second Annual Conference on Computational Learning Theory, 2009.
[47] Y. Shi and R. Cornish. On multilevel Monte Carlo unbiased gradient estimation for deep latent variable models. In International Conference on Artificial Intelligence and Statistics, pages 3925-3933. PMLR, 2021.
[48] N. Z. Shor. Cut-off method with space extension in convex programming problems. Cybernetics, 13(1):94-96, 1977.
[49] A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
[50] C. Song, Y. Jiang, and Y. Ma. Unified acceleration of high-order algorithms under Hölder continuity and uniform convexity. SIAM Journal on Optimization, 2021.
[51] K. K. Thekumparampil, P. Jain, P. Netrapalli, and S. Oh. Projection efficient subgradient method and optimal nonsmooth Frank-Wolfe method. In Advances in Neural Information Processing Systems, 2020.
[52] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. 2008.
[53] V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988-999, 1999.
[54] B. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems 29, 2016.
[55] D. B. Yudin and A. S. Nemirovski. Evaluation of the information complexity of mathematical programming problems. Ekonomika i Matematicheskie Metody, 12:128-142, 1976.