# Local Minimax Complexity of Stochastic Convex Optimization

Yuancheng Zhu (Wharton Statistics Department, University of Pennsylvania), Sabyasachi Chatterjee (Department of Statistics, University of Chicago), John Duchi (Departments of Statistics and Electrical Engineering, Stanford University), John Lafferty (Departments of Statistics and Computer Science, University of Chicago)

**Abstract.** We extend the traditional worst-case, minimax analysis of stochastic convex optimization by introducing a localized form of minimax complexity for individual functions. Our main result gives function-specific lower and upper bounds on the number of stochastic subgradient evaluations needed to optimize either the function or its hardest local alternative to a given numerical precision. The bounds are expressed in terms of a localized and computational analogue of the modulus of continuity that is central to statistical minimax analysis. We show how the computational modulus of continuity can be explicitly calculated in concrete cases, and how it relates to the curvature of the function at the optimum. We also prove a superefficiency result demonstrating that it is a meaningful benchmark, acting as a computational analogue of the Fisher information in statistical estimation. The nature and practical implications of the results are demonstrated in simulations.

## 1 Introduction

The traditional analysis of algorithms is based on a worst-case, minimax formulation. One studies the running time, measured in terms of the smallest number of arithmetic operations required by any algorithm to solve any instance in the family of problems under consideration. Classical worst-case complexity theory focuses on discrete problems. In the setting of convex optimization, where the problem instances require numerical rather than combinatorial optimization, Nemirovsky and Yudin [12] developed an approach to minimax analysis based on a first-order oracle model of computation. In this model, an algorithm to minimize a convex function can make queries to a first-order oracle, and the complexity is defined as the smallest error achievable with a specified number of queries. Specifically, the oracle is queried with an input point $x \in C$ from a convex domain $C$, and returns an unbiased estimate of a subgradient vector of the function $f$ at $x$. After $T$ calls to the oracle, an algorithm $A$ returns a value $\hat{x}_A \in C$, which is a random variable due to the stochastic nature of the oracle, and possibly also due to randomness in the algorithm. The Nemirovsky-Yudin analysis reveals that, in the worst case, the number of calls to the oracle required to drive the expected error $\mathbb{E}\big(f(\hat{x}_A) - \inf_{x \in C} f(x)\big)$ below $\epsilon$ scales as $T = O(1/\epsilon)$ for the class of strongly convex functions, and as $T = O(1/\epsilon^2)$ for the class of Lipschitz convex functions.

In practice, one naturally finds that some functions are easier to optimize than others. Intuitively, if the function is steep near the optimum, then the subgradient may carry a great deal of information, and a stochastic gradient descent algorithm may converge relatively quickly. A minimax approach to analyzing the running time cannot take this into account for a particular function, as it treats the worst-case behavior of the algorithm over all functions.
It would be of considerable interest to be able to assess the complexity of solving an individual convex optimization problem. Doing so requires a break from traditional worst-case thinking. In this paper we revisit the complexity of convex optimization from the point of view of a localized form of minimax analysis. In local minimax, our objective is to quantify the intrinsic difficulty of optimizing a specific convex function $f$. With the target $f$ fixed, we take an alternative function $g$ within the same function class $\mathcal{F}$, and evaluate how the maximum expected error decays with the number of calls to the oracle, for an optimal algorithm designed to optimize either $f$ or $g$. The local minimax complexity $R_T(f; \mathcal{F})$ is defined in terms of the least favorable alternative $g$:
$$R_T(f; \mathcal{F}) = \sup_{g \in \mathcal{F}} \inf_{A \in \mathcal{A}_T} \max_{h \in \{f, g\}} \mathrm{error}(A, h), \tag{1}$$
where $\mathrm{error}(A, h)$ is some measure of error for the algorithm applied to function $h$. Note that here the algorithm $A$ is allowed to depend on the function $f$ and the selected worst-case $g$. In contrast, the traditional global worst-case performance of the best algorithm, as defined by the minimax complexity $R_T(\mathcal{F})$ of Nemirovsky and Yudin, is
$$R_T(\mathcal{F}) = \inf_{A \in \mathcal{A}_T} \sup_{g \in \mathcal{F}} \mathrm{error}(A, g). \tag{2}$$
The local minimax complexity can be thought of as the difficulty of optimizing the hardest alternative to the target function. Intuitively, a difficult alternative is a function $g$ for which querying the oracle with $g$ gives results similar to querying with $f$, but for which the value of $x \in C$ that minimizes $g$ is far from the value that minimizes $f$.

Our analysis ties this function-specific notion of complexity to a localized and computational analogue of the modulus of continuity that is central to statistical minimax analysis [5, 6]. We show that the local minimax complexity gives a meaningful benchmark for quantifying the difficulty of optimizing a specific function by proving a superefficiency result; in particular, outperforming this benchmark at some function must lead to a larger error at some other function. Furthermore, we propose an adaptive algorithm in the one-dimensional case that is based on binary search, and show that this algorithm automatically achieves the local minimax complexity, up to a logarithmic factor.

Our study of the algorithmic complexity of convex optimization is motivated by the work of Cai and Low [2], who propose an analogous definition in the setting of statistical estimation of a one-dimensional convex function. The present work can thus be seen as exposing a close connection between statistical estimation and numerical optimization of convex functions. In particular, our results imply that the local minimax complexity can be viewed as a computational analogue of the Fisher information in classical statistical estimation.

In the following section we establish our notation and give a technical overview of our main results, which characterize the local minimax complexity in terms of the computational modulus of continuity. In Section 2.2 we demonstrate the phenomenon of superefficiency of the local minimax complexity. In Section 3 we present the algorithm that adapts to the benchmark, together with an analysis of its theoretical properties. We also present simulations of the algorithm and comparisons to traditional stochastic gradient descent. Finally, we conclude with a brief review of related work and a discussion of future research directions suggested by our results.
## 2 Local minimax complexity

In this section, we first establish notation and define a modulus of continuity for a convex function $f$. We then state our main result, which links the local minimax complexity to this modulus of continuity.

Let $\mathcal{F}$ be the collection of Lipschitz convex functions defined on a compact convex set $C \subseteq \mathbb{R}^d$. Given a function $f \in \mathcal{F}$, our goal is to find a minimum point $x^*_f \in \arg\min_{x \in C} f(x)$. However, our knowledge about $f$ can only be gained through a first-order oracle. The oracle, upon being queried with $x \in C$, returns $f'(x) + \xi$, where $f'(x)$ is a subgradient of $f$ at $x$ and $\xi \sim N(0, \sigma^2 I_d)$. When the oracle is queried at a point $x$ where $f$ is not differentiable, instead of allowing the oracle to return an arbitrary subgradient at $x$, we assume that it has a deterministic mechanism for producing $f'(x)$. That is, when we query the oracle with $x$ twice, it returns two random vectors with the same mean $f'(x)$. Such an oracle can be realized, for example, by taking $f'(x) = \arg\min_{z \in \partial f(x)} \|z\|$. Here and throughout the paper, $\|\cdot\|$ denotes the Euclidean norm.

Figure 1: Illustration of the flat set and the modulus of continuity. Both the function $f$ (left) and its derivative $f'$ (right) are shown (black curves), along with one of the many possible alternatives $g$ and its derivative $g'$ (solid gray curves) that achieve the sup in the definition of $\omega_f(\epsilon)$. The flat set contains all the points for which $|f'(x)| < \epsilon$, and $\omega_f(\epsilon)$ is the larger half-width of the flat set.

Consider optimization algorithms that make a total of $T$ queries to this first-order oracle, and let $\mathcal{A}_T$ be the collection of all such algorithms. For $A \in \mathcal{A}_T$, denote by $\hat{x}_A$ the output of the algorithm. We write $\mathrm{err}(x, f)$ for a measure of error for using $x$ as the estimate of the minimum point of $f \in \mathcal{F}$. In this notation, the usual minimax complexity is defined as
$$R_T(\mathcal{F}) = \inf_{A \in \mathcal{A}_T} \sup_{f \in \mathcal{F}} \mathbb{E}_f\, \mathrm{err}(\hat{x}_A, f). \tag{3}$$
Note that the algorithm $A$ queries the oracle at up to $T$ points $x_t \in C$ selected sequentially, and the output $\hat{x}_A$ is thus a function of the entire sequence of random vectors $v_t \sim N(f'(x_t), \sigma^2 I_d)$ returned by the oracle. The expectation $\mathbb{E}_f$ denotes the average with respect to this randomness (and any additional randomness injected by the algorithm itself). The minimax risk $R_T(\mathcal{F})$ characterizes the hardness of the entire class $\mathcal{F}$. To quantify the difficulty of optimizing an individual function $f$, we consider the following local minimax complexity, comparing $f$ to its hardest local alternative:
$$R_T(f; \mathcal{F}) = \sup_{g \in \mathcal{F}} \inf_{A \in \mathcal{A}_T} \max_{h \in \{f, g\}} \mathbb{E}_h\, \mathrm{err}(\hat{x}_A, h). \tag{4}$$

We now proceed to define a computational modulus of continuity that characterizes the local minimax complexity. Let $X^*_f = \arg\min_{x \in C} f(x)$ be the set of minimum points of the function $f$. We consider $\mathrm{err}(x, f) = \inf_{y \in X^*_f} \|x - y\|$ as our measure of error. Define $d(f, g) = \inf_{x \in X^*_f,\, y \in X^*_g} \|x - y\|$ for $f, g \in \mathcal{F}$. It is easy to see that $\mathrm{err}(x, f)$ and $d(f, g)$ satisfy the exclusion inequality
$$\mathrm{err}(x, f) < \tfrac{1}{2} d(f, g) \quad \text{implies} \quad \mathrm{err}(x, g) \ge \tfrac{1}{2} d(f, g). \tag{5}$$
Next we define
$$\rho(f, g) = \sup_{x \in C} \|f'(x) - g'(x)\|, \tag{6}$$
where $f'(x)$ is the unique subgradient of $f$ that is returned as the mean by the oracle when queried with $x$. For example, if we take $f'(x) = \arg\min_{z \in \partial f(x)} \|z\|$, we have
$$\rho(f, g) = \sup_{x \in C} \|\mathrm{Proj}_{\partial f(x)}(0) - \mathrm{Proj}_{\partial g(x)}(0)\|, \tag{7}$$
where $\mathrm{Proj}_B(z)$ is the projection of $z$ onto the set $B$. Thus, $d(f, g)$ measures the dissimilarity between two functions in terms of the distance between their minimizers, whereas $\rho(f, g)$ measures the dissimilarity by the largest separation between their subgradients at any given point.
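To make the oracle model and the two discrepancy measures concrete, here is a minimal Python sketch; the shifted-quadratic pair and all names are our own illustrative choices, not from the paper. For $f(x) = \frac{1}{2}x^2$ and $g(x) = \frac{1}{2}(x-\theta)^2$ on $C = [-1, 1]$, both $d(f,g)$ and $\rho(f,g)$ equal $\theta$: the minimizers sit $\theta$ apart, and the derivatives differ by exactly $\theta$ at every point.

```python
import numpy as np

class NoisyFirstOrderOracle:
    """First-order oracle: returns f'(x) + N(0, sigma^2) noise, where f'(x)
    is a deterministic choice of subgradient (same mean on repeat queries)."""
    def __init__(self, subgradient, sigma=0.1, rng=None):
        self.subgradient = subgradient
        self.sigma = sigma
        self.rng = rng or np.random.default_rng(0)

    def query(self, x):
        return self.subgradient(x) + self.sigma * self.rng.standard_normal()

theta = 0.2
fprime = lambda x: x            # f(x) = x^2 / 2, minimized at 0
gprime = lambda x: x - theta    # g(x) = (x - theta)^2 / 2, minimized at theta

grid = np.linspace(-1.0, 1.0, 10001)
rho = np.max(np.abs(fprime(grid) - gprime(grid)))  # sup_x |f'(x) - g'(x)|
d = abs(0.0 - theta)                               # distance of minimizers

print(f"rho(f, g) = {rho:.3f}, d(f, g) = {d:.3f}")   # both ~ theta
oracle = NoisyFirstOrderOracle(fprime)
print(oracle.query(0.3), oracle.query(0.3))  # same mean f'(0.3), fresh noise
```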
Given $d$ and $\rho$, we define the modulus of continuity of $d$ with respect to $\rho$ at the function $f$ by
$$\omega_f(\epsilon) = \sup\{d(f, g) : g \in \mathcal{F},\ \rho(f, g) \le \epsilon\}. \tag{8}$$
We now show how to calculate the modulus for some specific functions.

Example 1. Suppose that $f$ is a convex function on a one-dimensional interval $C \subseteq \mathbb{R}$. If we take $f'(x) = \arg\min_{z \in \partial f(x)} |z|$, then
$$\omega_f(\epsilon) = \sup\{|x^*_f - y| : y \in C,\ |f'(y)| < \epsilon\}. \tag{9}$$
The proof of this claim is given in the appendix. This result essentially says that the modulus of continuity measures the size (in fact, the larger half-width) of the flat set where the magnitude of the subderivative is smaller than $\epsilon$. See Figure 1 for an illustration. Thus, for the class of symmetric functions $f(x) = \frac{1}{k}|x|^k$ over $C = [-1, 1]$, with $k > 1$,
$$\omega_f(\epsilon) = \epsilon^{\frac{1}{k-1}}. \tag{10}$$
For the asymmetric case $f(x) = \frac{1}{k_l}|x|^{k_l} I(-1 \le x \le 0) + \frac{1}{k_r}|x|^{k_r} I(0 < x \le 1)$ with $k_l, k_r > 1$,
$$\omega_f(\epsilon) = \epsilon^{\frac{1}{k_l \vee k_r - 1}}. \tag{11}$$
That is, the size of the flat set depends on the flatter side of the function.

### 2.1 Local minimax is characterized by the modulus

We now state our main result linking the local minimax complexity to the modulus of continuity. We say that the modulus of continuity has polynomial growth if there exist $\alpha > 0$ and $\epsilon_0 > 0$ such that for any $c \ge 1$ and $\epsilon \le \epsilon_0/c$,
$$\omega_f(c\epsilon) \le c^{1/\alpha}\, \omega_f(\epsilon). \tag{12}$$
Our main result below shows that the modulus of continuity characterizes the local minimax complexity of optimization of a particular convex function, in a manner similar to how the modulus of continuity quantifies the (local) minimax risk in a statistical estimation setting [2, 5, 6], relating the objective to a geometric property of the function.

Theorem 1. Suppose that $f \in \mathcal{F}$ and that $\omega_f(\epsilon)$ has polynomial growth. Then there exist constants $C_1$ and $C_2$ independent of $T$, and $T_0 > 0$, such that for all $T > T_0$,
$$C_1\, \omega_f\!\left(\frac{\sigma}{\sqrt{T}}\right) \le R_T(f; \mathcal{F}) \le C_2\, \omega_f\!\left(\frac{\sigma}{\sqrt{T}}\right). \tag{13}$$

Remark 1. We use the error metric $\mathrm{err}(x, f) = \inf_{y \in X^*_f} \|x - y\|$ here. For any given pair $(\mathrm{err}, d)$ that satisfies the exclusion inequality (5), our proof technique applies to yield the corresponding lower bound. For example, we could use $\mathrm{err}(x, f) = \inf_{y \in X^*_f} |v^\top(x - y)|$ for some vector $v$. This error metric would be suitable when we wish to estimate $v^\top x^*_f$, for example, the first coordinate of $x^*_f$. Another natural choice of error metric is $\mathrm{err}(x, f) = f(x) - \inf_{x \in C} f(x)$, with a corresponding distance $d(f, g) = \inf_{x \in C} |f(x) - \inf_x f(x) + g(x) - \inf_x g(x)|$. For this case, while the proof of the lower bound stays exactly the same, further work is required for the upper bound, which is beyond the scope of this paper.

Remark 2. The results can be extended to oracles with more general noise models. In particular, the lower bounds will still hold for more general noise distributions, as long as Gaussian noise is included as a subclass; indeed, in proving lower bounds, assuming Gaussianity only makes solving the optimization problem easier. Our algorithm and upper bound analysis go through for all sub-Gaussian noise oracles. For ease of presentation, we focus on the Gaussian noise model in the current paper.

Remark 3. Although the theorem gives an upper bound for the local minimax complexity, this does not guarantee the existence of a single algorithm that achieves the local complexity for every function. Therefore, it is important to design an algorithm that adapts to this benchmark for each individual function. We solve this problem in the one-dimensional case in Section 3.

The proof of this theorem is given in the appendix.
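The shape of the lower bound in Theorem 1 can be understood through a standard two-point testing heuristic; the following sketch is our own summary of that style of reasoning under the Gaussian oracle model, not a verbatim step from the paper's proof.

```latex
\begin{align*}
\mathrm{KL}\big(N(f'(x), \sigma^2 I_d)\,\big\|\,N(g'(x), \sigma^2 I_d)\big)
  &= \frac{\|f'(x) - g'(x)\|^2}{2\sigma^2}
  \;\le\; \frac{\rho(f, g)^2}{2\sigma^2},
\end{align*}
% so the $T$ oracle replies carry total divergence at most
% $T \rho(f,g)^2 / (2\sigma^2)$, which remains $O(1)$ whenever
% $\rho(f, g) \le \sigma/\sqrt{T}$. Such a pair $(f, g)$ cannot be reliably
% distinguished from $T$ queries, and the exclusion inequality (5) then
% forces any algorithm to incur error about $d(f,g)/2$ on $f$ or on $g$.
% Taking the least favorable alternative:
\begin{align*}
R_T(f; \mathcal{F}) \;\gtrsim\;
\sup\Big\{ d(f, g) : \rho(f, g) \le \tfrac{\sigma}{\sqrt{T}} \Big\}
\;=\; \omega_f\!\Big(\tfrac{\sigma}{\sqrt{T}}\Big).
\end{align*}
```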
We now illustrate the result with examples that verify the intuition that different functions should have different degrees of difficulty for stochastic convex optimization.

Example 2. For the function $f(x) = \frac{1}{k}|x|^k$ with $x \in [-1, 1]$ and $k > 1$, we have
$$R_T(f; \mathcal{F}) \asymp T^{-\frac{1}{2(k-1)}}. \tag{14}$$
This agrees with the minimax complexity for the class of Lipschitz convex functions that satisfy $f(x) - f(x^*_f) \ge \frac{\lambda}{k}\|x - x^*_f\|^k$ [14]. In particular, when $k = 2$, we recover the strongly convex case, where the (global) minimax complexity is $O(T^{-1/2})$ with respect to the error $\mathrm{err}(x, f) = \inf_{y \in X^*_f}\|x - y\|$. We see a faster rate of convergence for $k < 2$. As $k \to \infty$, we also see that the error fails to decrease as $T$ gets large. This corresponds to the worst case for any Lipschitz convex function. In the asymmetric setting with $f(x) = \frac{1}{k_l}|x|^{k_l} I(-1 \le x \le 0) + \frac{1}{k_r}|x|^{k_r} I(0 < x \le 1)$ with $k_l, k_r > 1$, we have $R_T(f; \mathcal{F}) = O(T^{-\frac{1}{2(k_l \vee k_r - 1)}})$.

The following example illustrates that the local minimax complexity and modulus of continuity are consistent with known behavior of stochastic gradient descent for strongly convex functions.

Example 3. In this example we consider the error $\mathrm{err}(x, f) = \inf_{y \in X^*_f} |v^\top(x - y)|$ for some vector $v$, and let $f$ be an arbitrary convex function satisfying $\nabla^2 f(x^*_f) \succ 0$ with Hessian continuous around $x^*_f$; thus the optimizer $x^*_f$ is unique. If we define $g_w(x) = f(x) - w^\top \nabla^2 f(x^*_f)\, x$, then $g_w$ is a convex function with a unique minimizer, and
$$\rho(f, g_w) = \sup_x \|\nabla f(x) - (\nabla f(x) - \nabla^2 f(x^*_f)\, w)\| = \|\nabla^2 f(x^*_f)\, w\|. \tag{15}$$
Thus, defining $\delta(w) = x^*_{g_w} - x^*_f$,
$$\omega_f(\epsilon) \ge \sup\{|v^\top \delta(w)| : \|\nabla^2 f(x^*_f)\, w\| \le \epsilon\}.$$
By the convexity of $g_w$, we know that $x^*_{g_w}$ satisfies $\nabla f(x^*_{g_w}) - \nabla^2 f(x^*_f)\, w = 0$, and therefore, by the implicit function theorem, $x^*_{g_w} = x^*_f + w + o(\|w\|)$ as $w \to 0$. Thus,
$$\omega_f\!\left(\frac{\sigma}{\sqrt{T}}\right) \ge (1 + o(1))\, \frac{\sigma}{\sqrt{T}}\, \|\nabla^2 f(x^*_f)^{-1} v\| \quad \text{as } T \to \infty. \tag{16}$$
In particular, we have the local minimax lower bound
$$\liminf_{T \to \infty} \sqrt{T}\, R_T(f; \mathcal{F}) \ge C_1\, \sigma\, \|\nabla^2 f(x^*_f)^{-1} v\|, \tag{17}$$
where $C_1$ is the same constant appearing in Theorem 1. This shows that the local minimax complexity captures the function-specific dependence on the constant in the strongly convex case. Stochastic gradient descent with averaging is known to adapt to this strong convexity constant [16, 13, 10]. Note that lower bounds of similar forms on the minimax complexity have been obtained in [11].

### 2.2 Superefficiency

Having characterized the local minimax complexity in terms of a computational modulus of continuity, we would now like to show that there are consequences to outperforming it at some function. This will strengthen the case that the local minimax complexity serves as a meaningful benchmark for quantifying the difficulty of optimizing any particular convex function.

Suppose that $f$ is any one-dimensional convex function such that $X^*_f = [x_l, x_r]$, having an asymptotic expansion around $\{x_l, x_r\}$ of the form
$$f(x_l - \delta) = f(x_l) + \lambda_l \delta^{k_l} + o(\delta^{k_l}) \quad \text{and} \quad f(x_r + \delta) = f(x_r) + \lambda_r \delta^{k_r} + o(\delta^{k_r}) \tag{18}$$
for $\delta > 0$, some powers $k_l, k_r > 1$, and constants $\lambda_l, \lambda_r > 0$. The following result shows that if any algorithm significantly outperforms the local modulus of continuity on such a function, then it underperforms the modulus on a nearby function.

Proposition 1. Let $f$ be any convex function satisfying the asymptotic expansion (18) around its optimum. Suppose that $A \in \mathcal{A}_T$ is any algorithm that satisfies
$$\max\left\{\mathbb{E}_f\, \mathrm{err}(\hat{x}_A, f),\ \sqrt{\mathbb{E}_f\, \mathrm{err}(\hat{x}_A, f)^2}\right\} \le \delta_T\, \omega_f\!\left(\frac{\sigma}{\sqrt{T}}\right), \tag{19}$$
where $\delta_T < C_1$. Define $g_{-1}(x) = f(x) - \epsilon_T x$ and $g_1(x) = f(x) + \epsilon_T x$, where $\epsilon_T = \sigma\sqrt{\log(1/\delta_T)/T}$. Then for some $g \in \{g_{-1}, g_1\}$, there exists $T_0$ such that $T \ge T_0$ implies
$$\mathbb{E}_g\, \mathrm{err}(\hat{x}_A, g) \ge C\, \omega_g(\epsilon_T) \tag{20}$$
for some constant $C$ that only depends on $k = k_l \vee k_r$.

A proof of this result is given in the appendix, where it is derived as a consequence of a more general statement. We remark that while condition (19) involves the squared error $\mathbb{E}_f\, \mathrm{err}(\hat{x}_A, f)^2$, we expect that the result holds with only the weaker inequality on the absolute error $\mathbb{E}_f\, \mathrm{err}(\hat{x}_A, f)$.

It follows from this proposition that if an algorithm $A$ significantly outperforms the local minimax complexity, in the sense that (19) holds for some sequence $\delta_T \to 0$ with $\liminf_{T \to \infty} e^T \delta_T = \infty$, then there exists a sequence of convex functions $g_T$ with $\rho(f, g_T) \to 0$ such that
$$\frac{\mathbb{E}_{g_T}\, \mathrm{err}(\hat{x}_A, g_T)}{\omega_{g_T}(\sigma/\sqrt{T})} \to \infty.$$
This is analogous to the phenomenon of superefficiency in classical parametric estimation problems, where outperforming the asymptotically optimal rate given by the Fisher information implies worse performance at some other point in the parameter space. In this sense, $\omega_f$ can be viewed as a computational analogue of the Fisher information in the setting of convex optimization. We note that superefficiency has also been studied in nonparametric settings [1], and a similar result was shown by Cai and Low [2] for local minimax estimation of convex functions.
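Before describing the adaptive algorithm, a quick numerical sketch (our own illustration, not from the paper) evaluates the benchmark $\omega_f(\sigma/\sqrt{T})$ for the functions of Example 2, once from the closed form $\epsilon^{1/(k-1)}$ and once by measuring the flat set of Example 1 on a grid; these are the "theoretic" rates that the dashed lines in Figure 2 below display.

```python
import numpy as np

def modulus_from_flat_set(fprime, eps, grid, x_star=0.0):
    """Example 1: omega_f(eps) is the larger half-width of the flat set
    {y : |f'(y)| < eps}, measured from the minimizer x_star."""
    flat = grid[np.abs(fprime(grid)) < eps]
    return np.max(np.abs(flat - x_star)) if flat.size else 0.0

sigma = 0.1
grid = np.linspace(-1.0, 1.0, 2_000_001)
for k in (1.5, 2.0, 3.0):
    fprime = lambda x, k=k: np.sign(x) * np.abs(x) ** (k - 1)
    for T in (100, 1000, 10000):
        eps = sigma / np.sqrt(T)
        closed = eps ** (1.0 / (k - 1))        # omega_f for f = |x|^k / k
        numeric = modulus_from_flat_set(fprime, eps, grid)
        print(f"k={k}, T={T}: benchmark omega_f(sigma/sqrt(T)) "
              f"= {closed:.5f} (grid check: {numeric:.5f})")
```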
## 3 An adaptive optimization algorithm

In this section, we show that a simple stochastic binary search algorithm achieves the local minimax complexity in the one-dimensional case. The general idea of the algorithm is as follows. Suppose that we are given a budget of $T$ queries to the oracle. We divide this budget into $T_0 = \lfloor T/E \rfloor$ queries over each of $E = \lfloor r \log T \rfloor$ rounds, where $r > 0$ is a constant to be specified later. In each round, we query the oracle $T_0$ times for the derivative at the midpoint of the current interval. Estimating the derivative by averaging over the queries, we proceed to the left half of the interval if the estimated sign is positive, and to the right half of the interval if the estimated sign is negative. The details are given in Algorithm 1.

Algorithm 1: Sign-testing binary search
  Input: $T$, $r$.
  Initialize: $(a_1, b_1) = (a_0, b_0)$, $E = \lfloor r \log T \rfloor$, $T_0 = \lfloor T/E \rfloor$.
  for $e = 1, \ldots, E$ do
    Query $x_e = (a_e + b_e)/2$ a total of $T_0$ times to get $Z^{(e)}_t$ for $t = 1, \ldots, T_0$.
    Calculate the average $\bar{Z}^{(e)} = \frac{1}{T_0}\sum_{t=1}^{T_0} Z^{(e)}_t$.
    If $\bar{Z}^{(e)} > 0$, set $(a_{e+1}, b_{e+1}) = (a_e, x_e)$.
    If $\bar{Z}^{(e)} \le 0$, set $(a_{e+1}, b_{e+1}) = (x_e, b_e)$.
  end for
  Output: $x_E$.

We will show that this algorithm adapts to the local minimax complexity up to a logarithmic factor. First, the following result shows that the algorithm gets us close to the flat set of the function.

Proposition 2. For $\delta \in (0, 1)$, let $C_\delta = \sigma\sqrt{2\log(E/\delta)}$. Define
$$I_\delta = \left\{y \in \mathrm{dom}(f) : |f'(y)| < \frac{C_\delta}{\sqrt{T_0}}\right\}.$$
Suppose that $(a_0, b_0) \cap I_\delta \neq \emptyset$. Then
$$\mathrm{dist}(x_E, I_\delta) \le 2^{-E}(b_0 - a_0) \tag{23}$$
with probability at least $1 - \delta$.

This proposition tells us that after $E$ rounds of bisection, we are at most a distance $2^{-E}(b_0 - a_0)$ from the flat set $I_\delta$. In terms of the distance to the minimum point, we have
$$|x_E - x^*| \le 2^{-E}(b_0 - a_0) + \sup\{|x^* - y| : y \in I_\delta\}.$$
If the modulus of continuity satisfies the polynomial growth condition, we then obtain the following.

Corollary 1. Let $\alpha_0 > 0$. Suppose $\omega_f$ satisfies the polynomial growth condition (12) with constant $\alpha_0$. Let $r = \frac{1}{2\alpha_0}$. Then with probability at least $1 - \delta$, and for large enough $T$,
$$|x_E - x^*| \le \tilde{C}\, \omega_f\!\left(\frac{\sigma}{\sqrt{T}}\right),$$
where the term $\tilde{C}$ hides a dependence on $\log T$ and $\log(1/\delta)$.

The proofs of these results are given in the appendix.

Figure 2: Simulation results: averaged risk versus number of queries $T$, on a log-log scale. The black curves correspond to the risk of the stochastic binary search algorithm. The red and blue curves are for the stochastic gradient descent methods, red for stepsize $1/t$ and blue for $1/\sqrt{t}$. The dashed gray lines indicate the optimal convergence rate. The plots in the top panels are for the symmetric cases $f(x) = \frac{1}{k}|x - x^*|^k$; the lower panels are for the asymmetric cases, with $(k_l, k_r) = (1.5, 2)$, $(1.5, 3)$, and $(2, 3)$.
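To make Algorithm 1 concrete, here is a minimal, self-contained Python sketch; the function names and the test function $f(x) = \frac{1}{3}|x - 0.4|^3$ are our own illustrative choices, while the round and budget bookkeeping follows the description above.

```python
import math
import numpy as np

def sign_binary_search(oracle, a0, b0, T, r=1.0):
    """Algorithm 1 (sign-testing binary search): split a budget of T noisy
    derivative queries into E = floor(r * log T) bisection rounds of
    T0 = floor(T / E) queries each, and bisect on the sign of the average."""
    E = max(1, math.floor(r * math.log(T)))
    T0 = max(1, T // E)
    a, b = a0, b0
    x = (a + b) / 2.0
    for _ in range(E):
        x = (a + b) / 2.0
        zbar = np.mean([oracle(x) for _ in range(T0)])  # average derivative
        if zbar > 0:       # positive slope: the minimum lies to the left
            a, b = a, x
        else:              # non-positive slope: the minimum lies to the right
            a, b = x, b
    return x

# Noisy oracle for f(x) = |x - 0.4|^3 / 3, so f'(x) = sign(x - 0.4)|x - 0.4|^2.
rng = np.random.default_rng(1)
sigma = 0.1
oracle = lambda x: (np.sign(x - 0.4) * abs(x - 0.4) ** 2
                    + sigma * rng.standard_normal())

print(sign_binary_search(oracle, -2.0, 2.0, T=10_000))  # close to 0.4
```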
### 3.1 Simulations showing adaptation to the benchmark

We now demonstrate the performance of the stochastic binary search algorithm, making a comparison to stochastic gradient descent. For the stochastic gradient descent algorithm, we perform $T$ update steps
$$x_{t+1} = x_t - \eta(t)\, \hat{g}(x_t), \tag{26}$$
where $\eta(t)$ is a stepsize function, chosen as either $\eta(t) = 1/t$ or $\eta(t) = 1/\sqrt{t}$. We first consider the following setup with symmetric functions $f$:

1. The function to optimize is $f_k(x) = \frac{1}{k}|x - x^*|^k$ for $k = \frac{3}{2}$, $2$, or $3$.
2. The minimum point $x^* \sim \mathrm{Unif}(-1, 1)$ is selected uniformly at random over the interval.
3. The oracle returns the derivative at the query point with additive $N(0, \sigma^2)$ noise, $\sigma = 0.1$.
4. The optimization algorithms know a priori that the minimum point is inside the interval $(-2, 2)$. Therefore, the binary search starts with the interval $(-2, 2)$, and the stochastic gradient descent starts at $x_0 \sim \mathrm{Unif}(-2, 2)$ and projects its query points onto the interval $(-2, 2)$.
5. We carry out the simulation for values of $T$ on a logarithmic grid between 100 and 10,000. For each setup, we average the error $|\hat{x} - x^*|$ over 1,000 runs.

The simulation results are shown in the top three panels of Figure 2. Several properties predicted by our theory are apparent from the simulations. First, the risk curves for the stochastic binary search algorithm parallel the gray curves. This indicates that the optimal rate of convergence is achieved: the stochastic binary search adapts to the curvature of different functions and attains the optimal local minimax complexity, as given by our benchmark. Second, the stochastic gradient descent algorithm with stepsize $1/t$ achieves the optimal rate when $k = 2$ but not when $k = 3$; with stepsize $1/\sqrt{t}$, SGD gets close to the optimal rate when $k = 3$ but not when $k = 2$. Neither leads to the faster rate when $k = \frac{3}{2}$. This is as expected, since the stepsize needs to be adapted to the curvature at the optimum in order to achieve the optimal rate.

Next, we consider a set of asymmetric functions. Using the same setup as in the symmetric case, we consider functions of the form $f(x) = \frac{1}{k_l}|x - x^*|^{k_l} I(x - x^* \le 0) + \frac{1}{k_r}|x - x^*|^{k_r} I(x - x^* > 0)$, for exponent pairs $(k_l, k_r)$ chosen to be $(\frac{3}{2}, 2)$, $(\frac{3}{2}, 3)$, and $(2, 3)$. The simulation results are shown in the bottom three panels of Figure 2. We observe that the stochastic binary search once again achieves the optimal rate, which is determined by the flatter side of the function, that is, the larger of $k_l$ and $k_r$.
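The following self-contained sketch reproduces a scaled-down version of this experiment (50 repetitions instead of 1,000, purely to keep the runtime modest); all helper names and parameter choices beyond those stated above are our own assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
SIGMA = 0.1

def make_oracle(k, x_star):
    """Noisy derivative oracle for f(x) = |x - x*|^k / k."""
    def oracle(x):
        return (np.sign(x - x_star) * abs(x - x_star) ** (k - 1)
                + SIGMA * rng.standard_normal())
    return oracle

def binary_search(oracle, a, b, T, r=1.0):
    """Algorithm 1: E = floor(r log T) rounds of T0 = floor(T/E) queries."""
    E = max(1, math.floor(r * math.log(T)))
    T0 = max(1, T // E)
    x = (a + b) / 2.0
    for _ in range(E):
        x = (a + b) / 2.0
        if np.mean([oracle(x) for _ in range(T0)]) > 0:
            b = x
        else:
            a = x
    return x

def sgd(oracle, x0, T, power):
    """Projected SGD on (-2, 2) with stepsize eta(t) = t**(-power)."""
    x = x0
    for t in range(1, T + 1):
        x = np.clip(x - t ** (-power) * oracle(x), -2.0, 2.0)
    return x

for k in (1.5, 2.0, 3.0):
    for T in (100, 1000, 10000):
        errs = {"bisect": [], "sgd 1/t": [], "sgd 1/sqrt(t)": []}
        for _ in range(50):  # 1,000 runs in the paper; 50 here for speed
            x_star = rng.uniform(-1, 1)
            oracle = make_oracle(k, x_star)
            x0 = rng.uniform(-2, 2)
            errs["bisect"].append(abs(binary_search(oracle, -2, 2, T) - x_star))
            errs["sgd 1/t"].append(abs(sgd(oracle, x0, T, 1.0) - x_star))
            errs["sgd 1/sqrt(t)"].append(abs(sgd(oracle, x0, T, 0.5) - x_star))
        rate = T ** (-1 / (2 * (k - 1)))  # theoretical local minimax rate
        print(f"k={k}, T={T}: "
              + ", ".join(f"{m}={np.mean(v):.4f}" for m, v in errs.items())
              + f", theory~{rate:.4f}")
```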
## 4 Related work and future directions

In related recent work, Ramdas and Singh [14] study minimax complexity for the class of Lipschitz convex functions that satisfy $f(x) - f(x^*_f) \ge \frac{\lambda}{k}\|x - x^*_f\|^k$. They show that the minimax complexity under the function value error is of the order $T^{-\frac{k}{2(k-1)}}$. Juditski and Nesterov [8] also consider minimax complexity for the class of $k$-uniformly convex functions for $k > 2$. They give an adaptive algorithm based on stochastic gradient descent that achieves the minimax complexity up to a logarithmic factor. Connections with active learning are developed in [15], with related ideas appearing in [3]. Adaptivity in this line of work corresponds to the standard notion in statistical estimation, which seeks to adapt to a large subclass of a parameter space. In contrast, the results in the current paper quantify the difficulty of stochastic convex optimization at a much finer scale, as the benchmark is determined by the specific function to be optimized.

The stochastic binary search algorithm presented in Section 3, despite being adaptive, has a few drawbacks. It requires the modulus of continuity of the function to satisfy polynomial growth, with a parameter bounded away from zero. This rules out cases such as $f(x) = |x|$, which should have an error that decays exponentially in $T$; it is of interest to handle this case as well. It would also be of interest to construct adaptive optimization procedures tuned to a fixed numerical precision. Such procedures should have different running times depending on the hardness of the problem. Progress on both problems has been made, and will be reported elsewhere.

Another challenge is to remove the logarithmic factors appearing in the analysis of the binary search algorithm developed in Section 3. In one dimension, stochastic convex optimization is intimately related to noisy root finding for a monotone function taking values in $[-a, a]$ for some $a > 0$. Karp and Kleinberg [9] study optimal algorithms for such root finding problems in a discrete setting. They propose a binary search algorithm that allows backtracking, which saves log factors in the running time. It would be interesting to study the use of such techniques in our setting.

Other areas that warrant study involve the dependence on dimension. The scaling with dimension of the local minimax complexity and modulus of continuity is not fully revealed by the current analysis. Moreover, the superefficiency result and the adaptive algorithm presented here are only for the one-dimensional case. We note that a form of adaptive stochastic gradient algorithm for the class of uniformly convex functions in general, fixed dimension is developed in [8].

Finally, a more open-ended direction is to consider larger classes of stochastic optimization problems. For instance, minimax results are known for functions of the form $f(x) := \mathbb{E}\, F(x; \xi)$, where $\xi$ is a random variable and $x \mapsto F(x; \xi)$ is convex for every $\xi$, when $f$ is twice continuously differentiable around the minimum point with positive definite Hessian. However, the role of the local geometry is not well understood. It would be interesting to further develop the local complexity techniques introduced in the current paper, to gain insight into the geometric structure of more general stochastic optimization problems.

Acknowledgments. Research supported in part by ONR grant 11896509 and NSF grant DMS-1513594. The authors thank Tony Cai, Praneeth Netrapalli, Rob Nowak, Aaron Sidford, and Steve Wright for insightful discussions and valuable comments on this work.

References

[1] Lawrence Brown and Mark Low. A constrained risk inequality with applications to nonparametric functional estimation. Annals of Statistics, 24(6):2524–2535, 1996.
[2] Tony Cai and Mark Low. A framework for estimation of convex functions. Statistica Sinica, pages 423–456, 2015.
[3] Rui Castro and Robert Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.
[4] David Donoho. Statistical estimation and optimal recovery. Annals of Statistics, pages 238–270, 1994.
[5] David Donoho and Richard Liu. Geometrizing rates of convergence, I. Technical Report 137, Department of Statistics, University of California, Berkeley, 1987.
[6] David Donoho and Richard Liu. Geometrizing rates of convergence, II. Annals of Statistics, 19:633–667, 1991.
[7] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993.
[8] Anatoli Juditski and Yuri Nesterov. Deterministic and stochastic primal-dual subgradient methods for minimizing uniformly convex functions. Stochastic Systems, 4(1):44–80, 2014.
[9] Richard M. Karp and Robert Kleinberg. Noisy binary search and its applications. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 881–890. Society for Industrial and Applied Mathematics, 2007.
[10] Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459, 2011.
[11] Aleksandr Nazin. Informational inequalities in gradient stochastic optimization and optimal feasible algorithms. Automation and Remote Control, 50(4):531–540, 1989.
[12] Arkadi Nemirovsky and David Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.
[13] Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[14] Aaditya Ramdas and Aarti Singh. Optimal rates for stochastic convex optimization under Tsybakov noise condition. In Proceedings of the 30th International Conference on Machine Learning, pages 365–373, 2013.
[15] Aaditya Ramdas and Aarti Singh. Algorithmic connections between active learning and stochastic convex optimization. arXiv:1505.04214, 2015.
[16] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.
[17] Alexandre Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.