# variancebased_regularization_with_convex_objectives__3da6184d.pdf Journal of Machine Learning Research 19 (2018) 1-55 Submitted 12/17; Revised 11/18; Published 5/19 Variance-based Regularization with Convex Objectives John Duchi jduchi@stanford.edu Department of Statistics and Electrical Engineering Stanford University Stanford, CA 94305, USA Hongseok Namkoong hnamk@stanford.edu Department of Management Science and Engineering Stanford University Stanford, CA 94305, USA Editor: Alexander Rakhlin We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds offof techniques for distributionally robust optimization and Owen s empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems. Keywords: variance regularization, robust optimization, empirical likelihood 1. Introduction We propose and study a new approach to risk minimization that automatically trades between bias or approximation error and variance or estimation error. Let X be a sample space, P0 a distribution on X, and Θ a parameter space. For a loss function ℓ: Θ X R, consider the problem of finding θ Θ minimizing the risk R(θ) := E[ℓ(θ, X)] = Z ℓ(θ, x)d P(x) (1) given a sample {X1, . . . , Xn} drawn i.i.d. according to the distribution P. Under appropriate conditions on the loss ℓ, parameter space Θ, and random variables X, a number of researchers (Bartlett et al., 2005, 2006; Boucheron et al., 2005; Koltchinskii, 2006) have shown results of the form that with high probability, i=1 ℓ(θ, Xi) + C1 Var(ℓ(θ, X)) n for all θ Θ (2) c 2018 John Duchi and Hongseok Namkoong. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/17-750.html. Duchi and Namkoong where C1 and C2 depend on the parameters of problem (1) and the desired confidence guarantee. Such bounds justify empirical risk minimization (ERM), which chooses bθn to minimize 1 n Pn i=1 ℓ(θ, Xi) over θ Θ. Further, these bounds showcase a tradeoffbetween bias and variance, where we identify the bias (or approximation error) with the empirical risk 1 n Pn i=1 ℓ(θ, Xi), while the variance arises from the second term in the bound. Given bounds of the form above and heuristically considering the classical bias-variance tradeoffin estimation and statistical learning, it is natural to instead choose θ to directly minimize a quantity trading between approximation and estimation error, say of the form i=1 ℓ(θ, Xi) + C Var b Pn(ℓ(θ, X)) where Var b Pn denotes the empirical variance of its argument. Maurer and Pontil (2009) considered precisely this idea, giving a number of guarantees on the convergence and good performance of such a procedure. Unfortunately, even when the loss ℓis convex in θ, the formulation (3) is in general non-convex, yielding computationally intractable problems, which has limited the applicability of procedures that minimize the variance-corrected empirical risk (3). In this paper, we develop an approach that provides a tractable convex formulation whenever the loss ℓis convex and very closely approximates the penalized risk (3). Our approach is based on Owen s empirical likelihood (Owen, 2001) and ideas from distributionally robust optimization (Ben-Tal et al., 2009; Bertsimas et al., 2014; Ben-Tal et al., 2015). Below, we give a number of theoretical guarantees and empirical evidence for its performance. Before summarizing our contributions, we first describe our approach. Let φ : R+ R be a convex function with φ(1) = 0. Then the φ-divergence between distributions P and Q defined on a space X is Dφ (P||Q) = Z φ d P where µ is any measure for which P, Q µ, and p = d P dµ , q = d Q dµ . Throughout this paper, we use φ(t) = 1 2(t 1)2, which gives the χ2-divergence (Tsybakov, 2009). Given φ and a sample X1, . . . , Xn, we define the local neighborhood of the empirical distribution with radius ρ by Pn := n distributions P such that Dφ P|| b Pn ρ where b Pn denotes the empirical distribution of the sample, and our choice of φ(t) = 1 means that Pn consists of discrete distributions supported on the sample {Xi}n i=1. We then define the robustly regularized risk Rn(θ, Pn) := sup P Pn EP [ℓ(θ, X)] = sup P n EP [ℓ(θ, X)] : Dφ(P|| b Pn) ρ As it is the supremum of a family of convex functions, the robust risk θ 7 Rn(θ, Pn) is convex in θ whenever ℓis convex, no matter the value of ρ 0. Given the robust empirical risk (4), our proposed estimation procedure is to choose a parameter bθ rob n by minimizing Rn(θ, Pn). Variance-based Regularization with Convex Objectives Let us now discuss a few of the properties of procedures minimizing the robust empirical risk (4). Our first main technical result, which we show in Section 2, is that for bounded loss functions, the robust risk Rn(θ, Pn) is a good approximation to the variance-regularized quantity (3). That is, Rn(θ, Pn) = E b Pn[ℓ(θ, X)] + 2ρVar b Pn(ℓ(θ, X)) n + εn(θ), (5) where εn(θ) 0 and is OP (1/n) uniformly in θ. We show specifically that whenever ℓ(θ, X) has suitably large variance, with high probability we have εn = 0. From variance expansions of the form (5) and empirical Bernstein inequality (2), we see that Rn(θ, Pn) is a O(1/n)-approximation to the population risk R(θ), in contrast to the cruder O(1/ n)- approximation that the empirical risk E b Pn[ℓ(θ; X)] provides. Based on this intuition that the robustly regularized risk Rn(θ; Pn) is a tighter approximation to the population risk R(θ), we show a number of finite-sample convergence guarantees for the estimator bθ rob n argmin θ Θ n EP [ℓ(θ, X)] : Dφ P|| b Pn ρ that are often tighter than those available for ERM (see Section 3). The above problem is a convex optimization problem when the original loss ℓ( ; X) is convex and Θ is a convex set. Based on the expansion (5), solutions bθ rob n of problem (6) enjoy automatic finite sample optimality certificates: for ρ 0, with probability at least 1 C1 exp( ρ) we have R(bθ rob n ) = E[ℓ(bθ rob n ; X)] Rn(bθ rob n ; Pn) + C2ρ n = inf θ Θ Rn(θ, Pn) + C2ρ where C1, C2 are constants (which we specify) that depend on the loss ℓand domain Θ. That is, with high probability the robust solution has risk no worse than the optimal finite sample robust objective up to an O(ρ/n) error term. To guarantee a desired level of risk performance with probability 1 δ, we may specify the robustness penalty ρ = O(log 1 δ). Secondly, we show that the procedure (6) allows us to automatically and near-optimally trade between approximation and estimation error (bias and variance), so that R(bθ rob n ) = E[ℓ(bθ rob n ; X)] inf θ Θ E[ℓ(θ; X)] + 2 n Var(ℓ(θ; X)) with high probability. When there are parameters θ with small risk R(θ) and small variance Var(ℓ(θ, X)), this guarantees that the excess risk R(bθ rob n ) infθ Θ R(θ) is essentially of order O(ρ/n), where ρ governs our desired confidence level. Our bounds do not require the Bernstein-type condition Var(ℓ(θ; X)) MR(θ) often required for ERM. Since it is often the case that M depends on global information (e.g. size of parameter space Θ), we have Var(ℓ(θ; X)) MR(θ), in which case the bound (7) offers a tighter guarantee than that available for the ERM solution bθ erm n . In particular, we give an explicit example in Section 3.3 where our robustly regularized procedure (6) converges at rate O(log n/n) compared to O(1/ n) of empirical risk minimization. Duchi and Namkoong Bounds that trade between risk and variance are known in a number of cases in the empirical risk minimization literature (Mammen and Tsybakov, 1999; Tsybakov, 2004; Bartlett et al., 2005; Boucheron et al., 2005; Bartlett et al., 2006; Boucheron et al., 2013; Koltchinskii, 2006), which is relevant when one wishes to achieve fast rates of convergence for statistical learning algorithms (that is, faster than the O(1/ n) guaranteed by a number of uniform convergence results (Bartlett and Mendelson, 2002; Boucheron et al., 2005, 2013)). In many cases, however, such tradeoffs require either conditions such as the Mammen and Tsybakov s noise condition (Mammen and Tsybakov, 1999; Boucheron et al., 2005) or localization results made possible by curvature conditions that relate the loss/risk and variance (Bartlett et al., 2006, 2005; Mendelson, 2014). The robust solutions (6) enjoy a different tradeoff between variance and risk than that in this literature, but essentially without conditions except compactness of Θ. In proposing any new estimator, it is essential to understand the limits of the proposed procedure and identify situations in which its performance may be worse than existing estimators. There are indeed situations in which minimizing the robust-regularized risk (4) yields some inefficiency (for example, in classical statistical estimation problems with correctly specified model). To understand limits of the inefficiency induced by using the distributionally-robustified estimator (6), in Section 4 we study explicit finite sample properties of the robust estimator for general stochastic optimization problems, and we also provide asymptotic normality results in classical problems. There are a number of situations, based on growth conditions on the population risk R, when convergence rates faster than 1/ n (or even 1/n) are attainable (see Shapiro et al. (2009, Chapter 5)). We show that under these conditions, the robust procedure (6) still enjoys (near-optimal) fast rates of convergence, similar to empirical risk minimization (also known as sample average approximation in the stochastic programming literature). Our study of asymptotics makes precise the asymptotic efficiency loss of the robust procedure over minimizing the standard (asymptotically optimal) empirical expectation: there is a bias term that scales as p ρ/n in the limiting distribution of bθ rob n , though its variance is optimal. o We complement our theoretical results in Section 5, where we conclude by providing three experiments comparing empirical risk minimization strategies to robustly-regularized risk minimization (6). These results validate our theoretical predictions, showing that the robust solutions are a practical alternative to empirical risk minimization. In particular, we observe that the robust solutions outperform their ERM counterparts on harder instances with higher variance. In classification problems, for example, the robustly regularized estimators exhibit an interesting tradeoff, where they improve performance on rare classes (where ERM usually sacrifices performance to improve the common cases increasing variance slightly) at minor cost in performance on common classes. Related Work The theoretical foundations of empirical risk minimization are solid (Vapnik, 1998; Bartlett and Mendelson, 2002; Boucheron et al., 2005, 2013). When the expectation of the excess loss bounds its variance, it is possible to achieve faster rates than the O(1/ n) offered by standard uniform convergence arguments (Vapnik and Chervonenkis, 1971, 1974; Bartlett et al., 2006; Koltchinskii, 2006; Boucheron et al., 2013) (see Boucheron et al. (2005, Section Variance-based Regularization with Convex Objectives 5) for an overview in the case of classification, and Shapiro et al. (2009, Chapter 5.3) for more general stochastic optimization problems). Vapnik and Chervonenkis (1971, 1974) first provided such results in the context of {0, 1}-valued losses for classification (see also (Anthony and Shawe-Taylor, 1993)), where the expectation of the loss always upper bounds its variance, so that if there exists a perfect classifier the convergence rates of empirical risk minimization procedures are O(1/n). Mammen and Tsybakov (Mammen and Tsybakov, 1999; Tsybakov, 2004) give low noise conditions for binary classification substantially generalizing these results, which yield a spectrum of fast rates. Under related conditions, Bartlett, Jordan, and Mc Auliffe (2006) show similar fast rates of convergence for convex risk minimization under appropriate curvature conditions on the loss. The robust procedure (6), on the other hand, is guaranteed to provide an at most O(1/n) over-estimate of the population risk and a small increase of its variance regularized population counterpart. It may be the case that the variance-regularized risk infθ{R(θ) + p Var(ℓ(θ, X))/n} decreases to R(θ ) more slowly than 1/n. As we note above and detail in Section 4, however, in stochastic optimization problems the variance-regularized approach (6) suffers limited degradation with respect to empirical risk minimization strategies, even under convexity and curvature properties that allow faster rates of convergence than those achievable in classical regimes, as detailed by (Shapiro et al., 2009, Chapter 5.3). Most related to our work is that of Maurer and Pontil (2009), who propose directly regularizing empirical risk minimization by variance, providing guarantees similar to ours and giving a natural foundation offof which many of our results build. In their setting, however as they carefully note it is unclear how to actually solve the variance-regularized problem, as it is generally non-convex. Shivaswamy and Jebara (2010, 2011) build on this and develop an elegant approach for boosting binary classifiers based on a variance penalty applied to the exponential loss; as it is a boosting approach, their approach provides a coordinate-wise strategy for decreasing the loss, but it is not guaranteed to converge to a global minimizer and applies to classification-like problems. Our approach, handling general stochastic optimization problems, removes these obstructions. The robust procedure (6) is based on distributionally robust optimization ideas that many researchers have developed (Ben-Tal et al., 2013; Bertsimas et al., 2014; Lam and Zhou, 2015), where the goal (as in robust optimization more broadly (Ben-Tal et al., 2009)) is to protect against all deviations from a nominal data model. In the optimization literature, there is substantial work on tractability of the problem (6), including that of Ben-Tal et al. (2013), who show that the dual of (4) often admits a standard form (such as a second-order cone problem) to which standard polynomial-time interior point methods can be applied. Namkoong and Duchi (2016) develop stochastic-gradient-like procedures for solving the problem (6), which efficiently provide low accuracy solutions (which are still sufficient for statistical tasks). Work on the statistical analysis of such procedures is nascent; Bertsimas, Gupta, and Kallus (2014) and Lam and Zhou (2015) provide confidence intervals for solution quality under various conditions, and Duchi et al. (2016) give asymptotics showing that the optimal robust risk Rn(bθ rob n ; Pn) is a calibrated upper confidence bound for infθ Θ E[ℓ(θ; X)]. They and Gotoh et al. (2015) also provide a number of asymptotic results showing relationships between the robust risk Rn(θ; Pn) and variance regularization, but they do not leverage these results for guarantees on the solutions bθ rob n . Duchi and Namkoong 4 3 2 1 0 1 2 3 4 Figure 1. Plot of θ 7 p Var(ℓ(θ, X)) for ℓ(θ; X) = |θ X| where X Uni({ 2, 1, 0, 1, 2}). The function is non-convex, with multiple local minima, inflection points, and does not grow as θ . Notation We collect our notation here. We let B denote a unit norm ball in Rd, B = {θ Rd : θ 1}, where d and are generally clear from context. Given sets A Rd and B Rd, we let A + B = {a + b : a A, b B} denote Minkowski addition. For a convex function f, the subgradient set f(x) of f at x is f(x) = {g : f(y) f(x) + g (y x) for all y}. For a function h : Rd R, we let h denote its Fenchel (convex) conjugate, h (y) = supx{y x h(x)}. For sequences an, bn, we let an bn denote that there is a numerical constant C < such that an Cbn for all n. For a sequence of random vectors X1, X2, . . ., we let Xn d X denote that Xn converges in distribution to X . For a nonegative sequence a1, a2, . . ., we say Xn = OP (an) if limc supn P( Xn can) = 0, and we say Xn = o P (an) if limc 0 lim supn P( Xn can) = 0. 2. Variance Expansion We begin our study of the robust regularized empirical risk Rn(θ, Pn) by showing that it is a good approximation to the empirical risk plus a variance term, that is, studying the variance expansion (5). Although the variance of the loss is in general non-convex (see Figure 1 for a simple example), the robust formulation (6) is a convex optimization problem for variance regularization whenever the loss function is convex (the supremum of convex functions is convex (Hiriart-Urruty and Lemar echal, 1993, Prop. 2.1.2.)). 2.1. Variance expansion for a single variable To gain intuition for the variance expansion that follows, we begin with a slightly simpler problem, which is to study the quadratically constrained linear maximization problem i=1 pizi subject to p Pn = p Rn + : 1 2 np 1 2 2 ρ, 1, p = 1 , (8) Variance-based Regularization with Convex Objectives where z Rn is a vector. For simplicity, let s2 n = 1 n z 2 2 (z)2 = 1 n z z 2 2 denote the empirical variance of the vector z, where z = 1 n 1, z is the mean value of z. Then by introducing the variable u = p 1 n1, the objective in problem (8) satisfies p, z = z+ u, z = z + u, z z because u, 1 = 0. Thus problem (8) is equivalent to solving maximize u Rn z + u, z z subject to u 2 2 2ρ n2 , 1, u = 0, u 1 Notably, by the Cauchy-Schwarz inequality, we have u, z z 2ρ z z 2 /n = p 2ρs2n/n, and equality is attained if and only if ui = 2ρ(zi z) n z z 2 = 2ρ(zi z) It is possible to choose such ui while satisfying the constraint ui 1/n if and only if ns2n 1. (9) Thus, if inequality (9) holds for the vector z that is, there is enough variance in z we have sup p Pn p, z = z + For losses ℓ(θ, X) with enough variance relative to ℓ(θ, Xi) E b Pn[ℓ(θ, Xi)], that is, those satisfying inequality (9), then, we have Rn(θ, Pn) = E b Pn[ℓ(θ, X)] + 2ρVar b Pn(ℓ(θ, X)) A slight elaboration of this argument, coupled with the application of a few concentration inequalities, yields the next theorem. The theorem as stated applies only to bounded random variables, but in subsequent sections we relax this assumption by applying the characterization (9) of the exact expansion. As usual, we assume that φ(t) = 1 2(t 1)2 in our definition of the φ-divergence. Theorem 1 Let Z be a random variable taking values in [0, M]. Let σ2 = Var(Z) and s2 n = E b Pn[Z2] E b Pn[Z]2 denote the population and sample variance of Z, respectively. Fix ρ 0. Then r n EP [Z] : Dφ(P|| b Pn) ρ o E b Pn[Z] n s2n. (10) Moreover, for n max n 2, M2 σ2 max {8σ, 44} o , with probability at least 1 exp 3nσ2 sup P:Dφ(P|| b Pn) ρ n EP [Z] = E b Pn[Z] + n s2n. (11) Duchi and Namkoong See Section A for the proof of Theorem 1. Inequality (10) and the exact expansion (11) show that, at least for bounded loss functions ℓ, the robustly regularized risk (4) is a natural (and convex) surrogate for empirical risk plus standard deviation of the loss, and the robust formulation approximates exact variance regularization with a convex penalty. In the sequel, we leverage this result to provide sharp guarantees for a number of stochastic risk minimization problems. 2.2. Uniform variance expansions We now turn to a more uniform variant of Theorem 1, which depends on familiar notions of function complexity based on Rademacher averages. For a sample x1, . . . , xn and i.i.d. random signs εi { 1, 1}, independent of the xi, the empirical Rademacher complexity of the class F is i=1 εif(xi) The worst-case Rademacher complexity (Srebro et al., 2010) is Rsup n (F) := sup x1,...,xn X E i=1 εif(xi) For example, when F is a class of functions bounded by M with VC-subgraph dimension d, we have the inequalities E[Rn(F)] Rsup n (F) M q d n. See van der Vaart and Wellner (1996, Chapter 2) and Bartlett and Mendelson (2002) for other bounds. With this definition, we provide a result showing that the variance expansion (5) holds uniformly for all functions with enough variance. Theorem 2 Let F be a collection of bounded functions f : X [0, M], and M n. There exists a universal constant C such that if τ 2 > 0 satisfies n + C Rsup n (F)2 log3 n + M2 n (t + log log n) . Then with probability at least 1 3e t sup P:Dφ(P|| b Pn) ρ n EP [f(X)] = E b Pn[f(X)] + n Var b Pn(f(X)) (12) for all f F such that Var(f) τ 2. We prove the theorem in Section B. Theorem 2 shows that the variance expansion of Theorem 1 holds uniformly for all functions f with sufficient variance. An asymptotic analogue of the equality (12) for heavier tailed random variables is also possible (Duchi et al., 2016). In the remainder of the section, we provide examples and applications of the theorem. Variance-based Regularization with Convex Objectives 2.2.1. Linear and margin-based losses Consider a standard margin-based classification problem (Bartlett and Mendelson, 2002), where we have data pairs (x, y) X { 1, 1}, and X Rd. Let Θ Rd be a norm ball of radius r(Θ), Θ = {θ Rd | θ r}, and let be the associated dual norm, assuming also that X {x Rd | x r(X)}. We may then consider the standard loss minimization setting, where for some non-increasing and 1-Lipschitz loss ℓ: R R+, we have the risk R(θ) := E [ℓ(Y θ, X )] , so that ℓ(y x, θ ) is the loss suffered by making prediction θ, x when the label is y. By taking the function class F = {(x, y) 7 ℓ(y x, θ ) ℓ(0) | θ Θ}, in this case, an application of the Ledoux-Talagrand contraction inequality (Ledoux and Talagrand, 1991) implies for any y1, x1, . . . , yn, xn that i=1 εi [ℓ(yi θ, xi ) ℓ(0)] i=1 εi θ, xi Example 1 (Euclidean norms) In the above context, suppose that norm is the standard ℓ2 Euclidean norm so that Θ is contained in an ℓ2-ball of radius r(Θ), and X Rd in an ℓ2 ball of radius r(X). Then Jensen s inequality and independence of εi s give the bound Then, inequality (13) and Theorem 1 imply that sup P:Dφ(P|| b Pn) ρ n EP [ℓ(Y θ, X )] = E b Pn[ℓ(Y θ, X )] + n Var b Pn(ℓ(Y θ, X )) for all θ satisfying Var(ℓ(Y θ, X )) r(X)2r(Θ)2 n 4ρ + C log3 n + Ct , with probability at least 1 e t. Example 2 (High-dimensional problems) In high dimensional problems, the Euclidean scaling of Example 1 may be problematic, so that using ℓ1-constraints is preferred (B uhlmann and van de Geer, 2011). Thus, taking the norm in the preceding to be the ℓ1 norm, so that Θ {θ Rd | θ 1 r1(Θ)} and = , then E[ Pn i=1 εixi ] r(X) p n log(2d), where r (X) denotes the ℓ -radius of X Rd. Thus, if we take the loss class F = {ℓ( θ, ) ℓ(0) | θ Θ}, we obtain Rsup n (F) sup x1,...,xn X Then the exact variance expansion (12) holds with probability at least 1 e t uniformly over θ satisfying Var(ℓ(Y θ, X )) r1(Θ)2r (X)2 n [4ρ + C log d log3 n + Ct]. Duchi and Namkoong 2.2.2. Covering number guarantees It is also possible to provide guarantees on the exact variance expansion using standard covering numbers, though careful arguments based on Rademacher complexity can be tighter. We begin by recalling the appropriate notions from approximation theory. Let V be a vector space and V V be any collection of vectors in V. Let be a (semi)norm on V. We say a collection v1, . . . , v N V is an ϵ-cover of V if for each v V, there exists vi such that v vi ϵ. The covering number of V with respect to is then N(V, ϵ, ) := inf {N N : there is an ϵ-cover of V with respect to } . Now, let F be a collection of functions f : X R, and define the L (X) norm on f by f g L (X) := sup x X |f(x) g(x)|. We also relax our covering number requirements to empirical ℓ -covering numbers as follows. Define F(x) = {(f(x1), . . . , f(xn)) : f F} for x X n, and define the empirical ℓ -covering numbers N (F, ϵ, n) = sup x X n N (F(x), ϵ, ) , which bound the number of ℓ -balls of radius ϵ required to cover F(x). Note that we always have N (F, ϵ, n) N(F, ϵ, L (X)) by definition. The classical Dudley entropy integral (Dudley, 1999; van der Vaart and Wellner, 1996) shows that, if Pn denotes the point masses on x1, . . . , xn and L2(Pn) the empirical L2-norm on functions f : X [ M, M], then " 1 n sup f F i=1 εif(xi) log N(F, ϵ, L2(Pn))dϵ log N (F, ϵ, n)dϵ . (14) Our main (essentially standard (van der Vaart and Wellner, 1996)) motivating example is that of Lipschitz loss functions for a parametric set Θ, as follows. Example 3 Let Θ Rd and assume that ℓ: Θ X [0, M] is L-Lipschitz in θ with respect to the ℓ2-norm for all x X, meaning that |ℓ(θ, x) ℓ(θ , x)| L θ θ 2. Then taking F = {ℓ(θ, ) : θ Θ}, any ϵ-covering {θ1, . . . , θN} of Θ in ℓ2-norm guarantees that mini |ℓ(θ, x) ℓ(θi, x)| Lϵ for all θ, x. That is, N(F, ϵ, L (X)) N(Θ, ϵ/L, 2) 1 + diam(Θ)L where diam(Θ) = supθ,θ Θ θ θ 2. Thus ℓ2-covering numbers of Θ control L -covering numbers of the family F, and we have by the entropy integral (14) that log diam(Θ)L ϵ dϵ diam(Θ)L That is, with high probability, for all θ such that Var(ℓ(θ, X)) 4M2ρ n + Cd diam(Θ)2L2 log3 n n , we have the exact variance expansion (12). Variance-based Regularization with Convex Objectives 3. Optimization by Minimizing the Robust Loss Based on the precise variance expansions in the preceding section, it is natural to expect that the robust solution (6) automatically trades between approximation and estimation error. This intuition is accurate, and we show that the robustly regularized objective Rn(θ; Pn) overestimates the population risk R(θ) by at most O(1/n). By virtue of optimizing this tighter approximation as opposed to the usual O(1/ n)-approximation given by the empirical risk E b Pn[ℓ(θ; X)] the robustly regularized solution (6) enjoys a number of favorable finite-sample properties, which are not always comparable to those for empirical risk minimization (ERM). In Section 3.1, we present two versions of our main result that depend on covering numbers and discuss their consequences, and we provide an example where the robustly regularized solution bθ rob n achieves a tighter excess risk bound compared to those that a straightforward application of localized Rademacher complexities (Bartlett et al., 2005) show that the ERM solution bθ erm n achieves. As evidenced by the substantial work on Rademacherand Gaussian-complexity and symmetrization, in some instances coveringnumber-based arguments do not provide the sharpest scaling (Bartlett and Mendelson, 2002; Bartlett et al., 2005; Srebro et al., 2010); thus, in Section 3.2 we present a version of our main result that depends on localized Rademacher complexities, which can allow more refined uniform concentration bounds than covering numbers. We also provide a concrete (but admittedly somewhat contrived) example where our robustly regularized procedure (6) achieves R(bθ rob n ) infθ Θ R(θ) log n n , while empirical risk minimization suffers R(bθ erm n ) infθ Θ R(θ) 1 n, in Section 3.3. The robust regularizer has invariance properties other regularization procedures do not, and we mention these briefly in Section 3.4. 3.1. Covering arguments Our first guarantee depends on the covering numbers of the function class F as we describe in Section 2.2.2. While we state our results abstractly, in the loss minimization setting we typically consider the function class F := {ℓ(θ, ) : θ Θ} parameterized by θ. We have the following theorem, where as usual, we let F be a collection of functions f : X [M0, M1] with M = M1 M0. Theorem 3 Let n 8M2/t, t log 12, ϵ > 0, and ρ 9t. Then with probability at least 1 2(3N (F, ϵ, 2n) + 1)e t, E[f(X)] sup P:Dφ(P|| b Pn) ρ n EP [f(X)] + 11 for all f F. Defining the empirical minimizer bf argmin f F n EP [f(X)] : Dφ(P|| b Pn) ρ we have with the same probability that E[ bf(X)] inf f F Duchi and Namkoong See Section C for a proof of the theorem. Because uniform L -covering numbers upper bound empirical L -covering numbers, it is immediate that covering F in L (X) provides an identical result. 3.1.1. Covering bounds: corollaries We turn to a number of corollaries that expand on Theorem 3 to investigate its consequences. Our first corollary shows that Theorem 3 applies to standard Vapnik-Chervonenkis (VC) classes. As VC dimension is preserved through composition, this result also extends to the procedure (6) in typical empirical risk minimization scenarios. Corollary 4 In addition to the conditions of Theorem 3, let F have finite VC-dimension VC(F). Then for a numerical constant c < , the bounds (15) and (16) hold with probability at least c VC(F) 16Mne VC(F) 1 + 2 Proof Let f L1(Q) := R |f(x)|d Q(x) denote the L1-norm on F for the probability distribution Q. Then by Theorem 2.6.7 of van der Vaart and Wellner (1996), we have sup Q N(F, ϵ, L1(Q)) c VC(F) 8Me for a numerical constant c. Because x x 1, taking Q to be uniform on x X 2n yields N(F(x), ϵ, ) N(F, ϵ 2n, L1(Q)). The result is immediate. Next, we focus more explicitly on the estimator bθ rob n defined by minimizing the robust regularized risk (6). Let us assume that Θ Rd, and that we have a typical linear modeling situation, where a loss h is applied to an inner product, that is, ℓ(θ, x) = h(θ x). In this case, by making the substitution that the class F = {ℓ(θ, ) : θ Θ} in Corollary 4, we have VC(F) d, and we obtain the following corollary. In the corollary, recall the definition (1) of the population risk R(θ) = E[ℓ(θ, X)], and the uncertainty set Pn = {P : Dφ(P|| b Pn) ρ n}, and that Rn(θ, Pn) = sup P Pn EP [ℓ(θ, X)]. By setting ϵ = M/n in Corollary 4, we obtain the following result. Corollary 5 Let the conditions of the previous paragraph hold and let bθ rob n argminθ Θ Rn(θ, Pn). Assume also that ℓ(θ, x) [0, M] for all θ Θ, x X. Then if n ρ 9 log 12, R(bθ rob n ) Rn(bθ rob n , Pn) + 11Mρ n Var(ℓ(θ; X)) with probability at least 1 2 exp(c1d log n c2ρ), where ci are universal constants with c2 1/9. To give an alternate concrete variant of Corollary 5 and Theorem 3, let Θ Rd and recall Example 3. We assume that for each x X, infθ Θ ℓ(θ, x) = 0 and that ℓis LLipschitz in θ. If D := diam(Θ) = supθ,θ Θ θ θ 2 < , then ℓ(θ, x) L diam(Θ), and Variance-based Regularization with Convex Objectives for δ > 0, we define δ + d log(2n DL). (17) Setting t = ρ and ϵ = 1 n in Theorem 3 and assuming that δ 1/n, D nk and L nk for a numerical constant k, choosing δ = 1 n we obtain that with probability at least 1 δ = 1 1/n, E[ℓ(bθ rob n ; X)] = R(bθ rob n ) inf θ Θ d Var(ℓ(θ, X)) + C d LD log n where C is a numerical constant. 3.1.2. Examples and heuristic discussion Unpacking Theorem 3, the first result (15) (and its Corollary 5) provides a high-probability guarantee that the true expectation E[ bf] cannot be more than O(1/n) worse than its robustly-regularized empirical counterpart. The second result (16) (and inequality (18)) guarantees convergence of the empirical minimizer to a parameter with risk at most O(log n/n) larger than the best possible variance-corrected risk. To illustriate how variance regularization can yield tighter guarantees than empirical risk minimization by optimizing a O(1/n) upper bound on the risk, we now compare the second bound (16) with an analogous result for empirical risk minimization (ERM). We first give a heuristic version, making it more precise in a coming example. For the ERM solution bθ erm n argminθ Θ E b Pn[ℓ(θ; X)], one common assumption is an upper bound of the variance by the risk; for example, when the losses take values in [0, M], one has Var(ℓ(θ, X)) MR(θ). In such cases, there is typically some complexity measure Compn associated with the class of functions being learned, and it is possible to achieve bounds of the form R(bθ erm n ) R(θ ) + C Compn MR(θ ) n + C Compn M where θ argminθ Θ R(θ), a type of result common for bounded nonnegative losses (Boucheron et al., 2005; Vapnik and Chervonenkis, 1971; Vapnik, 1998). For example, for classes of functions of VC-dimension d, we typically have Compn d log n d. In this caricature, when Var(ℓ(θ , X)) MR(θ ) and ρ Compn, the optimality guarantee (16) for variance regularization can be tighter than its ERM counterpart (19). This bound is certainly not always sharp, but yields minimax optimal rates in some cases. Example 4 (Well-specified least-absolute-deviation regression) For the least-absolutedeviation (LAD) regression, we compare rates of convergence for the ERM solution given by the localized Rademacher complexity against those for the robust solution. Let Z = (X, Y ) Rd R, where X {x Rd | x 2 L}, and let D := diam(Θ) be the ℓ2-diameter of Θ. The LAD loss is ℓ(θ; (x, y)) := |y θ, x |, where we assume that Y = θ , X + ϵ for some θ Θ, and random noise ϵ [ B, B] independent of X. We then have the global bound ℓ(θ; (X, Y )) DL + B =: M. Suppose for simplicity that ϵ is uniform on [ B, B]; then θ = argminθ Θ R(θ) and R(θ ) = E[ℓ(θ ; Z)] = 1 2B. In this case, Var (ℓ(θ ; Z)) = B2 2(DL + B)B = ME[ℓ(θ ; Z)] = MR(θ ). Duchi and Namkoong Using that the loss is 1-Lipschitz, the L covering numbers for the set of functions F := {fθ(x, y) = | θ, x y| | θ Θ} satisfy log N(F, ϵ, L (X)) d log DL applying the bound (18) for the robustly regularized solution bθ rob n with ϵ = DL/n, we obtain R(bθ rob n ) R(θ ) + C n B2 + C d(LD + B) log n with probability at least 1 1/n. On the other hand, even an optimistic (but naive) ERM bound, achieved by taking Compn 1 in the bound (19), yields R(bθ erm n ) R(θ ) + C n (BDL + B2) + C (LD + B) log n with probability at least 1 1/n. We see that leading term for the robustly regularized solution bθ rob n only depends on the noise-level B2 while the corresponding term for the ERM solution bθ erm n depends on global information like the size of the parameter space D, and a uniform bound over covariates L. For typical VC and other d-dimensional classes, the bound Compn scales linearly in d (cf. (Bartlett et al., 2005, Corollary 3.7), in which case the bound (19) scales as R(θ ) + C p d(BDL + B2) log n/n + O(log n/n), which is worse. Example 5 (A hard median estimation problem) To give a bit more insight into the behavior of the robust estimator, consider the simple 1-dimensional median problem, where ℓ(θ; x) = |θ x|, and assume that x { B, B} with P(X = B) = 1+δ 2 for some δ > 0, so that θ = argmin R(θ) = B and R(θ ) = (1 δ)B. In this case, taking θ0 = 0 yields Var(ℓ(θ; X)) = 0 and R(θ0) R(θ ) = δB. For δ small (on the order of 1/ n), with constant probability the empirical risk minimizer is bθ erm n = B, yielding risk R(bθ erm n ) R(θ ) = 2δB. On the other hand, with high probability bθ rob n 0 (because Var(ℓ(θ0; X)) = 0 as ℓ(0; X) B), and so R(bθ rob n ) R(θ ) δB. This gap is of course small, but it shows that the robust solution is more conservative: it chooses bθ rob n so that large losses (of scale 2B) are less frequent. When the population problem is easy , it is often possible to achieve faster rates of convergence than the usual O (1/ n) rate. The simplest scenario where this occurs is if the problem is realizable R(θ ) = 0, in which case bθ erm n has excess risk of the order O(log n/n); see the bound (19). The robustly regularized solution bθ rob n enjoys the same faster rates of convergence under the more general condition that Var(ℓ(θ ; X)) is small. As a concrete instance of this, let ℓ(θ; X) [0, M] and assume that ℓ(θ; X) satisfies the conditions of the first part of Example 3, and let the problem be realizable R(θ ) = 0. Since Var(ℓ(θ; X)) MR(θ), we have from the bounds (18) and (19) that R(bθ erm n ) Cd DL log n n and R(bθ rob n ) Cd DL log n For example, Var(ℓ(θ; X)) = 0 allows for the existence of some θ0 Θ such that ℓ(θ0; X) < ℓ(θ ; X) with positive probability. Variance-based Regularization with Convex Objectives 3.2. Localized Rademacher Complexity A somewhat more sophisticated approach to concentration inequalities and generalization bounds is based on localization ideas, motivated by the fact that near the optimum of an empirical risk, the complexity of the function class may be smaller than over the entire (global) class (van der Vaart and Wellner, 1996; Bartlett et al., 2005). With this in mind, we now present a refined version of Theorem 3 that depends on localized Rademacher averages. The starting point for this approach is a notion of localized Rademacher complexity (we give a slightly less general notion than Bartlett et al. (2005), as it is sufficient for our derivations). For a function class F of functions f : X R, the localized Rademacher complexity at level r is E Rn cf | f F, c [0, 1], E[c2f2 r] . In addition, we require a few analytic notions, beginning with sub-root functions, where we recall (Bartlett et al., 2005) that a function ψ : R+ R+ is sub-root if it is nonnegative, nondecreasing, and r 7 ψ(r)/ r is nonincreasing for all r > 0. Any non-constant sub-root function ψ is continuous and has a unique positive fixed point r = ψ(r ), where r ψ(r) for all r r . Lastly, we consider upper bounds ψn : R+ R+ on the localized Rademacher complexity satisfying ψn(r) E[Rn({cf : f F, c [0, 1], E[c2f2] r})], (20) where ψn is sub-root. (The localized Rademacher complexity itself is sub-root.) Roots of ψn play a fundamental role in providing uniform convergence guarantees, and Bartlett et al. (2005) and Koltchinskii (2006) provide careful analyses of localized Rademacher complexities, with typical results as follows. For a class of functions f with range bounded by 1, for any root r n of ψn, with probability at least 1 e t we have E[f] E b Pn[f] + 1 ηE b Pn[f] + C(1 + η) r n + 1 n for all f F and η 0. As an example, when F is a bounded VC-class, we have r n VC(F) log(n/VC(F)) n (Bartlett et al., 2005, Corollary 3.7). With this motivation, we have the following theorem. Theorem 6 For M 1, let F be a collection of functions f : X [0, M], let ψn be a sub-root function bounding the localized complexity (20), and let r n ψn(r n). Let t > 0 be arbitrary and assume that ρ satisfies t + log l log n Then with probability at least 1 e t, sup P:Dφ(P|| b Pn) ρ n for all f F. (22) Duchi and Namkoong Additionally, if bf minimizes sup P:Dφ(P|| b Pn) ρ/n EP [f], then with probability at least 1 3e t, 91ρ 45n Var(f) ! M(3ρ + t) We provide the proof of Theorem 6 in Appendix D. It builds offof and parallels many of the techniques developed by Bartlett, Bousquet, and Mendelson (2005), but we require a bit of care to develop the precise variance bounds we provide. Let us consider the additional q ρ n factors in Theorem 6 (as compared to Theorem 3). In general, these terms are negligible to the extent that the variance of f dominates the first moment of the function f heuristically, in situations in which we expect penalizing the variance to improve performance. Let us make this more precise in a regime where n is large. Letting f F, we see that we have the inequality ρ/n) E[f] + r ρ n Var(f) E[f] + C r ρ (for a constant C > 1+ p ρ/n) if and only if (C 1 p ρ/n)2Var(f) E[f]2. Equivalently, as n gets large, this occurs roughly when E[f2] C2 2C+2 C2 2C+1E[f]2, which holds for large enough C whenever Var(f) > 0. In some scenarios, we can obtain substantially tighter bounds by using localized Rademacher averages instead of the covering number arguments considered in Section 3.1. (Recall also the discussion following Theorem 2.) To illustriate this point, we consider the case where F is a bounded subset of a reproducing kernel Hilbert space generated by some sufficiently nice kernel K; even for the Gaussian kernel K(x, z) = exp( 1 2 x z 2), log covering numbers for such function spaces grow at least exponentially in the dimension (Zhou, 2003; K uhn, 2011). Example 6 (Reproducing kernels and least-absolute-deviation regression) We now give an example using a non-parametric class of functionals in which covering number arguments do not apply, as the covering numbers of the associated classes are too large. Let H be a reproducing kernel Hilbert space (RKHS) with norm H and associated kernel (representer of evaluation) K : X X R. Letting P be a distribution on X, Mercer s theorem (e.g. Cristianini and Shawe-Taylor, 2004) implies that the integral operator TK : L2(X, P) L2(X, P) defined by TK(f)(x) = R K(x, z)d P(z) is compact, and K(x, x ) = P j=1 λjφj(x)φj(z) where λj are the eigenvalues of T in decreasing order and φj form an orthonormal decomposition of L2(X, P). Consider now the least absolute deviation (LAD) loss function ℓ(h; x, y) = |h(x) y|, defined for h H, and let BH be the unit H-ball of H. Assume additionally that the model is well-specified, and that y = h (x)+ξ for some random variable ξ with E[ξ | X] = 0, E[ξ2] σ2, and h BH. Let the function class {ℓ H} r := (x, y) 7 cℓ(h(x), y) | c [0, 1], c2E[ℓ(h(X), Y )2] r . Based on inequality (20), we consider the localized complexity Rn({ℓ H} r) = E " 1 n sup h BH,c [0,1] X εicℓ(h(xi), yi) | E[ℓ(h(X), Y )2] r/c2 # Variance-based Regularization with Convex Objectives We claim that Rn({ℓ H} r) p j=1 min{λj, r} As this claim is not central to our development but does show a slightly different localization result based on Gaussian comparison inequalities than available, for example, in Mendelson (2003) we provide its proof in Appendix G.1. Let us use inequality (24). To apply Theorem 3, we must find a bound on the fixed point of the localized complexity. To give this bound, we require some knowledge on the eigenvalues λj, for which there exists a body of work. For example (Mendelson, 2003), the Gaussian kernel K(x, x ) = exp( 1 2 x x 2 2) generates a class of smooth functions for which the eigenvalues λj decay exponentially, as λj e j2. Kernel operators underlying Sobolev spaces with different smoothness orders (Birman and Solomjak, 1967; Gu, 2002) typically have eigenvalues scaling as λj j 2α for some α > 1 2. As a concrete example, the first-order Sobolev (min) kernel K(x, x ) = 1 + min{x, x } generates an RKHS of Lipschitz functions with α = 1. In the former case of λj e j2, r n = log n j=1 min e j2, log n log n e t2dt In the latter case of polynomially decaying eigenvalues λj j 2α, we have j 2α = r when r 1 2α = j, so X j=1 min{j 2α, r} r 2α 1 r 1/2α t 2αdt r 2α 1 Solving for nr = r 2α 1 2α , we find the fixed point (r n) 2α 1 4α = r n n yields r n = n 2α 2α+1 . Ignoring constants, the above analysis shows that in the case that the kernel eigenvalues scale as λj e j2, as soon as ρ log n we have E[ℓ(h(X), Y )] (1+2 p E b Pn[ℓ(h(X), Y )] + n Var b Pn(ℓ(h(X), Y )) n for all h BH with high probability. In the case of polynomial eigenvalues, if bh minimizes the robust empirical loss sup P:Dφ(P|| b Pn) ρ/n EP [ℓ(h(X), Y )] and ρ n1 2α 2α+1 , then E h ℓ(bh(X), Y ) i 1 + Cn α 2α+1 inf h BH E[ℓ(h(X), Y )] + Cn α 2α+1 p Var(ℓ(h(X), Y )) +Cn 2α 2α+1 . This rate of convergence holds without any assumptions on the smoothness of the distribution of the noise ξ. 3.3. Beating empirical risk minimization We now provide a concrete example where the robustly regularized estimator bθ rob n exhibits a substantial performance gap over empirical risk minimization. In the sequel, we bound the Duchi and Namkoong performance degradation to show that the formulation (6) in general loses little over empirical risk minimization. For intuition in this section, consider the (admittedly contrived) setting in which we replace the loss ℓ(θ, X) with ℓ(θ, X) ℓ(θ , X), where θ argminθ Θ R(θ). Then in this case, by taking θ = θ in Corollary 5, we have R(bθ rob n ) R(θ ) + O(1/n) with high probability. More broadly, we expect the robustly regularized approach to offer performance benefits in situations in which the empirical risk minimizer is highly sensitive to noise, say, because the losses are piecewise linear, and slight underor over-estimates of slope may significantly degrade solution quality. With this in mind, we construct a concrete 1-dimensional example estimating the median of a discrete distribution supported on X = { 1, 0, 1} in which the robustly regularized estimator has convergence rate log n/n, while empirical risk minimization is at best 1/ n. Define the loss ℓ(θ; x) = |θ x| |x|, and for δ (0, 1) let the distribution P be defined by P(X = 1) = 1 δ 2 , P(X = 1) = 1 δ 2 , P(X = 0) = δ. (25) Then for θ R, the risk of the loss is R(θ) = δ|θ| + 1 δ 2 |θ 1| + 1 δ 2 |θ + 1| (1 δ). By symmetry, it is clear that θ := argminθ R(θ) = 0, which satisfies R(θ ) = 0. (Note also that ℓ(θ, x) = ℓ(θ, x) ℓ(θ , x).) Without loss of generality, we assume that Θ = [ 1, 1] in this problem. Now, consider a sample X1, . . . , Xn drawn i.i.d. from the distribution P, let b Pn denote its empirical distribution, and define the empirical risk minimizer bθ erm n := argmin θ R E b Pn[ℓ(θ, X)] = argmin θ [ 1,1] E b Pn[|θ X|]. If too many of the observations satisfy Xi = 1 or too many satisfy Xi = 1, then bθ erm n will be either 1 or 1; for small δ, such events become reasonably probable, as the following lemma makes precise. In the lemma, Φ(x) = 1 2 t2dt denotes the standard Gaussian CDF. (See Section G.2 for a proof.) Lemma 7 Let the loss ℓ(θ; x) = |θ x| |x|, δ [0, 1], and X follow the distribution (25). Then R(bθ erm n ) R(θ ) δ with probability at least (1 δ2) n 2 r On the other hand, we certainly have ℓ(θ ; x) = 0 for all x X, so that Var(ℓ(θ ; X)) = 0. Now, consider the bound in Theorem 3. We see that log N({ℓ(θ, ) : θ Θ}, ϵ, L (X)) ϵ, and taking ϵ = 1 n, we have that if bθ rob n argminθ Θ Rn(θ, Pn), then R(bθ rob n ) R(θ ) + 15ρ n with probability 1 4 exp (2 log n ρ) . Variance-based Regularization with Convex Objectives In particular, taking ρ = 3 log n, we see that R(bθ rob n ) R(θ ) + 45 log n n with probability at least 1 4 The risk for the empirical risk minimizer, as Lemma 7 shows, may be substantially higher; taking δ = 1/ n we see that with probability at least 2Φ( q R(bθ erm n ) R(θ ) + n 1 (For n 20, the probability of this event is .088.) For this (specially constructed) example, there is a gap of nearly n 1 2 in order of convergence. 3.4. Invariance properties The robust regularization (4) technique enjoys a number of invariance properties. Standard regularization techniques (such as ℓ1and ℓ2-regularization), which generally regularize a parameter toward a particular point in the parameter space, do not. While we leave deeper discussion of these issues to future work, we make two observations, which apply when Θ = Rd is unconstrained. Throughout, we let bθ rob n argminθ Rn(θ, Pn) denote the robustly regularized empirical solution. First, consider a location estimation problem in which we wish to estimate the minimizer of the expectation of a loss of the form ℓ(θ, X) = h(θ X), where h : Rd R is convex and symmetric about zero. Then the robust solution is by inspection shift invariant, as ℓ(θ + c, X + c) = ℓ(θ, X) for any vector c Rd. Concretely, in the example of the previous section, ℓ1or ℓ2-regularization achieve better convergence guarantees than ERM does, but if we shift all data x 7 x+c, then non-invariant regularization techniques lose efficiency (while the robust regularization technique does not). Second, we may consider a generalized linear modeling problem, in which data comes in pairs (x, y) X Y and ℓ(θ, (x, y)) = h(y, θ x) for a function h : Y R R that is convex in its second argument. Then bθ rob n is invariant to invertible linear transformations, in the sense that for any invertible A Rd d, n sup P:Dφ(P|| b Pn) ρ n EP [ℓ(θ, (X, Y ))] o = argmin θ n sup P:Dφ(P|| b Pn) ρ n EP [ℓ(A 1θ, (AX, Y ))] o = bθ rob n . Our results in this section do not precisely apply as we require unbounded θ, however, the next section shows that localization approaches can address this. 4. Robust regularization cannot be too bad The previous two sections provide guarantees on the performance of the robust regularized estimator (6), it does not cannot dominate classical approaches based on empirical risk minimization (also known as sample average approximation in the stochastic optimization literature), though it can improve on them in some cases. For example, with a correctly specified linear regression model with gaussian noise, least-squares empirical risk minimization with the loss ℓ(θ, (x, y)) = 1 2(θ x y)2 is essentially optimal. Our goal in this section is thus to provide more understanding of potential poor behavior of the procedure (6) with Duchi and Namkoong respect to ERM, considering two scenarios. The first is in stochastic (convex) optimization problems, where we investigate the finite-sample convergence rates of the robust solution to the population optimal risk. We show that the robust solution bθ rob n enjoys fast rates of convergence in cases in which the risk has substantial curvature precisely as with empirical risk minimization. The second is to consider the asymptotics of the robust solution bθ rob n , where we show that in classical statistical scenarios the robust solution is nearly efficient, though there is an asymptotic bias of order 1/ n that scales with the confidence ρ. 4.1. Fast Rates In cases in which the risk R has curvature, empirical risk minimization often enjoys faster rates of convergence (Boucheron et al., 2005; Shapiro et al., 2009). The robust solution bθ rob n similarly attains faster rates of convergence in such cases, even with approximate minimizers of Rn(θ, Pn). For the risk R and ϵ 0, let Sϵ := θ Θ : R(θ) inf θ Θ R(θ ) + ϵ denote the ϵ-sub-optimal (solution) set, and similarly let b Sϵ := θ Θ : Rn(θ, Pn) inf θ Θ Rn(θ , Pn) + ϵ . For a vector θ Θ, let πS (θ) = argminθ S θ θ 2 denote the Euclidean projection of θ onto the set S ; this projection operator is very useful for showing faster rates of convergence in stochastic optimization (see Shapiro et al. (2009), whose techniques we closely follow). In the statement of the result, for A Θ, we let Rn(A) denote the Rademacher complexity of the localized process {x 7 ℓ(θ; x) ℓ(πS (θ); x) : θ A}. We then have the following result, whose proof we provide in Section E. Theorem 8 Let Θ be convex and let ℓ( ; x) be convex and L-Lipshitz in its first argument for all x X. For constants λ > 0, γ > 1, and r > 0, assume the risk R satisfies R(θ) inf θ Θ R(θ) λ dist(θ, S )γ for all θ such that dist(θ, S ) r. (26) Let t > 0. If 0 ϵ 1 2λrγ satisfies γ 2(γ 1) and ϵ 2 2E[Rn(S2ϵ )] + L 2ϵ then P(b Sϵ S2ϵ ) 1 e t, We provide a brief discussion of this result as well as a corollary that gives more explicit rates of convergence. First, we note that (by an inspection of the proof) the L-Lipschitz assumption need only hold in the neighborhood S2ϵ for the result to hold. We also have the following Variance-based Regularization with Convex Objectives Corollary 9 In addition to the conditions of Theorem 8, assume that S = {θ } is a single point and Θ Rd. Then for any ϵ 1 2λrγ, we have P(b Sϵ S2ϵ ) 1 e t for So long as ρ d log n d, this rate of convergence is as good as that enjoyed by standard empirical risk minimization approaches (Shapiro et al., 2009, Ch. 5) under these types of growth conditions. The case that γ = 2 corresponds (roughly) to strong convexity, and in this case we get the approximate rate of convergence of L2 d n , the familiar rate of convergence under these conditions. Of course, if there is too much variance penalization (i.e. ρ is too large), then the rates of convergence may be slower. Proof That S is a singleton implies that S2ϵ {θ | θ θ (2ϵ/λ) 1 γ }. Moreover, in this case we also have that E b Pn[ℓ(θ; X) ℓ(θ ; X)] E b Pn[ℓ(θ ; X) ℓ(θ ; X)] L θ θ , so that an ϵ/L-cover of {θ | θ θ (2ϵ/λ) 1 γ } is an ϵ-cover of the function class F = {f(x) = ℓ(θ; x) ℓ(θ ; x) | θ S2ϵ } in L2(Pn) norm. Thus, the standard Dudley entropy integral (Dudley, 1999; van der Vaart and Wellner, 1996) yields E[Rn(S2ϵ )] 1 n log N(F, δ, L2(Pn))dδ Z L(2ϵ/λ) 1 γ γ log λ 2Lγϵ where we have used that R ε 0 ε . Solving for ϵ in the localization inequality (27) then yields the corollary, showing that the specified choice of ϵ is sufficient for all the conditions (27) to hold. 4.2. Asymptotics It is important to understand the precise limiting behavior of the robust estimator in addition to its finite sample properties this allows us to more precisely characterize when there may be degradation relative to classical risk minimization strategies. With that in mind, in this section we provide asymptotic results for the robust solution (6) to better understand the consequences of penalizing the variance of the loss itself. In particular, we would like to understand efficiency losses relative to (say) maximum likelihood in situations in which maximum likelihood is efficient. Before stating the results, we make a few standard assumptions on the risk R(θ), the loss ℓ, and the moments of ℓand its derivatives. Concretely, we assume that θ := argmin θ R(θ) and 2R(θ ) 0, Duchi and Namkoong that is, the risk functional has strictly positive definite Hessian at θ , which is thus unique. Additionally, we have the following smoothness assumptions on the loss function, which are satisfied by common loss functions, including the negative log-likelihood for any exponential family or generalized linear model (Lehmann and Casella, 1998). In the assumption, we let B denote the ℓ2-ball of radius 1 in Rd. Assumption A For some ϵ > 0, there exists a function L : X R+ satisfying |ℓ(θ, x) ℓ(θ , x)| L(x) θ θ 2 for θ, θ θ + ϵB and E[L(X)2] L(P) < . Additionally, there is a function H such that the function θ 7 ℓ(θ, x) has H(x)-Lipschitz continuous Hessian (with respect to the Frobenius norm) on θ + ϵB, where E[H(X)2] < . Then, recalling the robust estimator (6) as the minimizer of Rn(θ, Pn), we have the following theorem, which we prove in Section F. Theorem 10 Let Assumption A hold, and let the sequence bθ rob n be defined by bθ rob n argminθ Rn(θ, Pn). Define b(θ ) := Cov( θℓ(θ , X), ℓ(θ , X)) p Var(ℓ(θ , X)) and Σ(θ ) = 2R(θ ) 1 Cov( ℓ(θ , X)) 2R(θ ) 1 . Then bθ rob n a.s. θ and n(bθ rob n θ ) d N p 2ρ b(θ ), Σ(θ ) The asymptotic variance Σ(θ ) in Theorem 10 is generally unimprovable, as made apparent by Le Cam s local asymptotic normality theory and the H ajek-Le Cam local minimax theorems (van der Vaart and Wellner, 1996). Thus, Theorem 10 shows that the robust regularized estimator (6) has some efficiency loss, but it is only in the bias term. We explore this a bit more in the context of the risk of bθ rob n . Letting W N(0, Σ(θ )), as an immediate corollary to this theorem, the delta-method implies that n h R(bθ rob n ) R(θ ) i d 1 2ρ b(θ ) + W 2 2R(θ ) , (28) where we recall that x 2 A = x Ax. This follows from a Taylor expansion, because R(θ ) = 0 and so R(θ) R(θ ) = 1 2(θ θ ) 2R(θ )(θ θ ) + o( θ θ 2), or n(R(bθ rob n ) R(θ )) = n 1 2(bθ rob n θ ) 2R(θ )(bθ rob n θ ) + o( bθ rob n θ 2) n(bθ rob n θ ) 2R(θ ) n(bθ rob n θ ) + o P (1) 2ρ b(θ ) + W) 2R(θ )( p 2ρ b(θ ) + W) by Theorem 10. Variance-based Regularization with Convex Objectives The limiting random variable in expression (28) has expectation 2ρb(θ ) + W 2 2R(θ )] = ρb(θ ) 2R(θ )b(θ ) + 1 2 tr( 2R(θ ) 1 Cov(ℓ(θ , X)), while the classical empirical risk minimization procedure (standard M-estimation) (Lehmann and Casella, 1998; van der Vaart and Wellner, 1996) has limiting mean-squared error 1 2 tr( 2R(θ ) 1 Cov(ℓ(θ , X))). Thus there is an additional ρ b(θ ) 2 2R(θ ) penalty in the asymptotic risk (at a rate of 1/n) for the robustly-regularized estimator. An inspection of the proof of Theorem 10 reveals that b(θ ) = θ p Var(ℓ(θ , X)); if the variance of the loss is stable near θ , so that moving to a parameter θ = θ + for some small has little effect on the variance, then the standard loss terms dominate, and robust regularization has asymptotically little effect. On the other hand, highly unstable loss functions for which θ p Var(ℓ(θ , X)) is large yield substantial bias. We conclude our study of the asymptotics with a (to us) somewhat surprising example. Consider the classical linear regression setting in which y = x θ + ε, where ε N(0, σ2). Using the standard squared error loss ℓ(θ, (x, y)) = 1 2(θ x y)2, we obtain that ℓ(θ , (x, y)) = (x θ y)x = (x θ x θ ε)x = εx, while ℓ(θ , (x, y)) = 1 2ε2. The covariance Cov(εX, ε2) = E[εX(ε2 σ2)] = 0 by symmetry of the error distribution, and so in the special classical case of correctly specified linear regression the bias term b(θ ) = 0 for linear regression in Theorem 10. That is, the robustly regularized estimator (6) is asymptotically efficient. 5. Experiments We present three experiments in this section. The first is a small simulation example, which serves as a proof of concept allowing careful comparison of standard empirical risk minimization (ERM) strategies to our variance-regularized approach. The latter two are classification problems on real datasets; for both of these we compare performance of robust solution (6) to its ERM counterpart. 5.1. Minimizing the robust objective As a first step, we give a brief description of our (essentially standard) method for solving the robust risk problem. Our work in this paper focuses mainly on the properties of the robust objective (4) and its minimizers (6), so we only briefly describe the algorithm we use; we leave developing faster and more accurate specialized methods to further work. To solve the robust problem, we use a gradient descent-based procedure, and we focus on the case in which the empirical sampled losses {ℓ(θ, Xi)}n i=1 have non-zero variance for all parameters θ Θ, which is the case for all of our experiments. Recall the definition of the subdifferential f(θ) = {g Rd : f(θ ) f(θ)+ g, θ θ for all θ }, which is simply the gradient for differentiable functions f. A standard result in convex analysis (Hiriart-Urruty and Lemar echal, 1993, Theorem VI.4.4.2) is that if the vector p Rn + . Code is available at https://github.com/hsnamkoong/robustopt. Duchi and Namkoong achieving the supremum in the definition (4) of the robust risk is unique, then θRn(θ, Pn) = θ sup P Pn EP [ℓ(θ; X)] = i=1 p i θℓ(θ; Xi), where the final summation is the standard Minkowski sum of sets. As this maximizing vector p is indeed unique whenever Var b Pn(ℓ(θ; X)) = 0, we see that for all our problems, so long as ℓis differentiable, so too is Rn(θ, Pn) and θRn(θ, Pn) = i=1 p i θℓ(θ; Xi) where p = argmax p Pn i=1 piℓ(θ; Xi) . (29) In order to perform gradient descent on the risk Rn(θ, Pn), then, by equation (29) we require only the computation of the worst-case distribution p . By taking the dual of the maximization (29), this is an efficiently solvable convex problem; for completeness, we provide a procedure for this computation in Section H that requires time O(n log n + log 1 ϵ log n) to compute an ϵ-accurate solution to the maximization (29). As all our examples have smooth objectives, we perform gradient descent on the robust risk Rn( , Pn), with stepsizes chosen by a backtracking (Armijo) line search (Boyd and Vandenberghe, 2004, Chapter 9.2). 5.2. Simulation experiment For our simulation experiment, we use a quadratic loss with linear perturbation. For v, x Rd, define the loss ℓ(θ; x) = 1 2 θ v 2 2 + x (θ v). We set d = 50 and take X Uni({ B, B}d), varying B in the experiment. For concreteness, we let the domain Θ = {θ Rd : θ 2 r} and set v = r 2 d1, so that v int Θ; we take r = 10. Notably, standard regularization strategies, such as ℓ1 or ℓ2-regularization, pull θ toward 0, while the variance of ℓ(θ; X) is minimized by θ = v (thus naturally advantaging the variance-based regularization we consider, as R(v) = infθ R(θ) = 0). Moreover, as X is pure noise, this is an example where we expect variance regularization to be particularly useful. We choose δ = .05 and set ρ as in Eq. (17) (using that ℓis (3r + d B)-Lipschitz) to obtain robust coverage with probability at least 1 δ. In our experiments, we obtained 100% coverage in the sense of (15), as the high probability bound is conservative. Figure 2 summarizes the results. The robust solution bθ rob n = argminθ Θ Rn(θ, Pn) always outperforms the empirical risk minimizer bθ erm n = argminθ Θ E b Pn[ℓ(θ, X)] in terms of the true risk E[ℓ(θ, X)] = 1 2 θ v 2 2. Each experiment consists of 1,200 independent replications for each sample size n and value B. In Tables 1 and 2, we display the risks of bθ erm n and bθ rob n and variances, respectively, computed for the 1,200 independent trials. The gap between the risk of bθ erm n and bθ rob n is siginificant at level p < .01 for all sample sizes and values of B we considered according to a one-sided T-test. Notice also in Table 2 that the variance of the robust solutions is substantially smaller than that of the empirical risk minimizer often several orders of magnitude smaller for large sample sizes n. This simulation shows that in a simple setting favorable to it our procedure outperforms standard alternatives. Variance-based Regularization with Convex Objectives 102 5 102 103 5 103 104 Sample Size B=.01 B=.1 B=1 B=10 Figure 2. Simulation experiment. log E[R(bθ erm n )] is the solid lines, in decreasing order from B = 10 (top) to B = .01 (bottom). log E[R(bθ rob n )] is the dashed line, in the same vertical ordering at sample size n = 102. Table 1: Simulation experiment: Mean risks over 1,200 simulations B = .01 B = .1 B = 1 B = 10 n R(bθ erm n ) R(bθ rob n ) R(bθ erm n ) R(bθ rob n ) R(bθ erm n ) R(bθ rob n ) R(bθ erm n ) R(bθ rob n ) 100 4.06E-06 7.42E-07 4.17E-04 7.65E-05 4.20E-02 7.64E-03 4.15E+00 7.12E-01 500 8.91E-07 5.01E-15 8.22E-05 1.63E-14 8.36E-03 2.19E-13 8.41E-01 8.21E-12 1000 4.47E-07 1.52E-15 4.02E-05 1.64E-17 4.20E-03 6.32E-18 4.19E-01 2.45E-17 5000 1.44E-07 2.68E-16 8.00E-06 2.74E-18 8.27E-04 5.09E-20 8.38E-02 6.55E-20 10000 7.64E-08 1.32E-16 4.02E-06 1.32E-18 4.13E-04 2.57E-20 4.18E-02 3.34E-20 5.3. Protease cleavage experiments For our second experiment, we compare our robust regularization procedure to other regularizers using the HIV-1 protease cleavage dataset from the UCI ML-repository (Lichman, 2013). In this binary classification task, one is given a string of amino acids (a protein) and a featurized representation of the string of dimension d = 50960, and the goal is to predict whether the HIV-1 virus will cleave the amino acid sequence in its central position. We have a sample of n = 6590 observations of this process, where the class labels are somewhat skewed: there are 1360 examples with label Y = +1 (HIV-1 cleaves) and 5230 examples with Y = 1 (does not cleave). We use the logistic loss ℓ(θ; (x, y)) = log(1+exp( yθ x)). We compare the performance of different constraint sets Θ by taking Θ = n θ Rd : a1 θ 1 + a2 θ 2 r o , Duchi and Namkoong Table 2: Simulation experiment: Variances of R(bθ) over 1,200 simulations B = .01 B = .1 B = 1 B = 10 n ERM Robust ERM Robust ERM Robust ERM Robust 100 7.06E-13 9.76E-14 6.58E-09 1.03E-09 7.09E-05 1.08E-05 7.37E-01 9.20E-02 500 5.98E-14 7.15E-28 3.04E-10 3.52E-26 2.80E-06 2.26E-24 2.92E-02 3.26E-21 1000 2.63E-14 1.07E-31 7.53E-11 1.99E-35 7.14E-07 3.44E-33 7.03E-03 4.78E-32 5000 7.34E-15 2.94E-33 2.70E-12 3.28E-37 2.95E-08 2.50E-39 2.74E-04 5.24E-38 10000 1.60E-15 6.54E-34 6.74E-13 7.59E-38 7.04E-09 3.34E-39 6.52E-05 2.25E-38 which is equivalent to elastic net regularization (Zou and Hastie, 2005), while varying a1, a2, and r. We experiment with ℓ1-constraints (a1 = 1, a2 = 0) with r {50, 100, 500, 1000, 5000}, ℓ2-constraints (a1 = 0, a2 = 1) with r {5, 10, 50, 100, 500}, elastic net (a1 = 1, a2 = 10) with r {100, 200, 1000, 2000, 10000}, our robust regularizer with ρ {100, 1000, 10000, 50000, 100000} and our robust regularizer coupled with the ℓ1-constraint (a1 = 1, a2 = 0) with r = 100. Though we use a convex surrogate (logistic loss), we measure performance of the classifiers using the 0-1 (misclassification) loss 1{sign(θT x)y 0}. For validation, we perform 50 experiments, where in each experiment we randomly select 9/10 of the data to train the model, evaluating its performance on the held out fraction (test). We plot results summarizing these experiments in Figure 3. The horizontal axis in each figure indexes our choice of regularization value (so Regularizer = 1 for the ℓ1constrained problem corresponds to r = 50). The figures show that the robustly regularized risk provides a different type of protection against overfitting than standard regularization or constraint techniques do: while other regularizers underperform in heavily constrained settings, the robustly regularized estimator bθ rob n achieves low classification error for all values of ρ (Figure 3(b)). Notably, even when coupled with a fairly stringent ℓ1-constraint (r = 100), robust regularization has perofrmance better than ℓ1 except for large values r, especially on the rare label Y = +1 (Figure 3 (d) and (f)). We investigate the effects of the robust regularizer with a slightly different perspective in Figure 4, where we use Θ = {θ : θ 1 r} with r = 100 for the constraint set for each experiment. The horizontal axis indicates the tolerance ρ we use in construction of the robust estimator bθ rob n , where ERM means ρ = 0. In Fig. 4(a), we plot the logistic risk R(bθ) = E[ℓ(bθ, (X, Y ))] for the train and test distribution. We also plot the upper confidence bound Rn(θ, Pn) in this plot, which certainly over-estimates the test risk we hope to tighten this overestimate in future work. In Figure 4(b), we plot the misclassification error on train and test for different values of ρ, along with 2-standard-error intervals for the 50 runs. Figures 4(c) and (d) show the error rates restricted to examples from the uncommon (c) and common (d) classes. In Table 3 we give explicit error rates and logistic risk values for the different procedures. Due to the small size of the test dataset (ntest = 659), the deviation across folds is somewhat large. Variance-based Regularization with Convex Objectives 1 2 3 4 5 Regularizer L1 L2 Elastic Net Robust L1 + Robust 1 2 3 4 5 Regularizer L1 L2 Elastic Net Robust L1 + Robust (a) Train error (b) Test error 1 2 3 4 5 Regularizer Error (y = + 1) L1 L2 Elastic Net Robust L1 + Robust 1 2 3 4 5 Regularizer Error (y = + 1) L1 L2 Elastic Net Robust L1 + Robust (c) Train error on rare class (Yi = +1) (d) Test error on rare class (Yi = +1) 1 2 3 4 5 Regularizer Error (y = 1) L1 L2 Elastic Net Robust L1 + Robust 1 2 3 4 5 Regularizer Error (y = 1) L1 L2 Elastic Net Robust L1 + Robust (e) Train error on common class (Yi = 1) (f) Test error on common class (Yi = 1) Figure 3. HIV-1 Protease Cleavage plots (2-standard error confidence bars). Comparison of misclassification error rates among different regularizers. Duchi and Namkoong ERM 102 103 104 risk train risk test robust train ERM 102 103 104 (a) Logistic risk and confidence bound (b) Misclassification error rate ERM 102 103 104 Error (y = + 1) ERM 102 103 104 Error (y = 1) (c) Error on rare class (Yi = +1) (d) Error on common class (Yi = 1) Figure 4. HIV-1 Protease Cleavage plots (2-standard error confidence bars). Plot (a) shows the logistic risk R(θ) = E[log(1+e Y θ X)] and confidence bounds computed from the robust risk (4). Plots (b) (d) show misclassification error rates plotted against robustness parameter ρ. In this experiment, we see (roughly) that the ERM solutions achieve good performance on the common class (Y = 1) but sacrifice performance on the uncommon class. As we increase ρ, performance of the robust solution bθ rob n on the rarer label Y = +1 improves (Fig. 4(c)), while the misclassification rate on the common class degrades a small (insignificant) amount (Fig. 4(d)); see also Table 3. This behavior is roughly what we might expect for the robust estimator: the poor performance of the ERM estimator bθ erm n on the rare class induces (relatively) more variance, which the robust solution reduces by via improved classification performance on the rare (Y = +1) class. This occurs at little expense over the more common label Y = 1 so that overall performance improves by a small amount. We remark but are unable to explain that this improvement on classification error for the rare labels comes despite increases in logistic risk; while the average logistic loss increases, misclassification errors decrease. Variance-based Regularization with Convex Objectives Table 3: HIV-1 Cleavage Error risk error (%) error (Y = +1) error (Y = 1) ρ train test train test train test train test erm 0.1587 0.1706 5.52 6.39 17.32 18.79 2.45 3.17 100 0.1623 0.1763 4.99 5.92 15.01 17.04 2.38 3.02 1000 0.1777 0.1944 4.5 5.92 13.35 16.33 2.19 3.2 10000 0.283 0.3031 2.39 5.67 7.18 14.65 1.15 3.32 5.4. Document classification in the Reuters corpus For our final experiment, we consider a multi-label classification problem with a reasonably large dataset. The Reuters RCV1 Corpus (Lewis et al., 2004) has 804,414 examples with d = 47,236 features, where feature j is an indicator variable for whether word j appears in a given document. The goal is to classify documents as a subset of the 4 categories Corporate, Economics, Goverment, and Markets, and each document in the data is labeled with a subset of those. As each document can belong to multiple categories, we fit binary classifiers on each of the four categories. There are different numbers of documents labeled as each category, with the Economics category having the fewest number of positive examples. Table 4 gives the number of times a document is labeled as each of the four categories (so each document has about 1.18 associated classes). In this experiment, we expect the robust solution to outperform ERM on the rarer category (Economics), as the robustification (6) naturally upweights rarer (harder) instances, which disproportionally affect variance as in the experiment on HIV-1 cleavage. Table 4: Reuters Number of Examples Corporate Economics Government Markets 381,327 119,920 239,267 204,820 For each category k {1, 2, 3, 4}, we use the logistic loss ℓ(θk; (x, y)) = log(1+exp( yθ k x)). For each binary classifier, we use the ℓ1 constraint set Θ = θ Rd : θ 1 1000 . To evaluate performance on this multi-label problem, we use precision (ratio of the number of correct positive labels to the number classified as positive) and recall (ratio of the number of correct positive labels to the number of actual positive labels): precision = 1 P4 k=1 1{θ k xi 0, yi = 1} P4 k=1 1{θ k xi > 0} , P4 k=1 1{θ k xi 0, yi = 1} P4 k=1 1 {yi = 1} . We partition the data into ten equally-sized sub-samples and perform ten validation experiments, where in each experiment we use one of the ten subsets for fitting the logistic models and the remaining nine partitions as a test set to evaluate performance. In Figure 5, we summarize the results of our experiment averaged over the 10 runs, with 2-standard error bars (computed across the folds). To facilitate comparison across the document categories, we give exact values of these averages in Tables 5 and 6. Both Duchi and Namkoong ERM 103 104 105 106 risk train risk test robust train (a) Logistic risk and confidence bound ERM 103 104 105 106 (b) Precision ERM 103 104 105 106 ERM 103 104 105 106 Recall (Economics) (d) Recall on rare category (Economics) Figure 5: Reuters Corpus (2-standard error deviations) bθ rob n and bθ erm n have reasonably high precision across all categories, with increasing ρ giving a mild improvement in precision (from .93 .005 to .94 .005); see also Figure 5(a). On the other hand, we observe in Figure 5(d) that ERM has low recall (.69 on test) for the Economics category, which contains about 15% of documents. As we increase ρ from 0 (ERM) to 105, we see a smooth and substantial improvement in recall for this rarer category (without significant degradation in precision). This improvement in recall amounts to reducing variance in predictions on the rare class. We also note that while the robust solutions outperform ERM in classification performance for ρ 105, for very large ρ = 106 10n, the regularizing effects of robustness degrade the solution bθ rob n . This precision and recall improvement comes in spite of the increase in the average binary logistic risk for each of the 4 classes, which we show in Figure 5a, which plots the average binary logistic loss (on train and test sets) averaged over the 4 categories as well as the upper confidence bound Rn(θ, Pn) as we vary ρ. The robust regularization effects reducing variance appear to improve the performance of the binary logistic loss as a surrogate for true misclassification error. Variance-based Regularization with Convex Objectives Table 5: Reuters Corpus Precision (%) Precision Corporate Economics Government Markets ρ train test train test train test train test train test erm 92.72 92.7 93.55 93.55 89.02 89 94.1 94.12 92.88 92.94 1E3 92.97 92.95 93.31 93.33 87.84 87.81 93.73 93.76 92.56 92.62 1E4 93.45 93.45 93.58 93.61 87.6 87.58 93.77 93.8 92.71 92.75 1E5 94.17 94.16 94.18 94.19 86.55 86.56 94.07 94.09 93.16 93.24 1E6 91.2 91.19 92 92.02 74.81 74.8 91.19 91.25 89.98 90.18 Table 6: Reuters Corpus Recall (%) Recall Corporate Economics Government Markets ρ train test train test train test train test train test erm 90.97 90.96 90.20 90.25 67.53 67.56 90.49 90.49 88.77 88.78 1E3 91.72 91.69 90.83 90.86 70.42 70.39 91.26 91.23 89.62 89.58 1E4 92.40 92.39 91.47 91.54 72.38 72.36 91.76 91.76 90.48 90.45 1E5 93.46 93.44 92.65 92.71 76.79 76.78 92.26 92.21 91.46 91.47 1E6 93.10 93.08 92.00 92.04 79.84 79.71 91.89 91.90 92.00 91.97 We have seen through multiple examples that robustification our convex surrogate for variance regularization is an effective tool in a number of applications. As we heuristically expect, variance-based regularization (robust regularization) yields predictors with better performance on hard instances, or subsets of the problem that induce higher variance, such as classes with relatively few training examples in classification problems. The robust regularization ρ gives a principled knob for tuning performance to trade between variance (uniform or across-the-board performance) and sometimes absolute performance. 6. Discussion In this paper, we have developed theoretical results for robust regularization (6) that apply to general stochastic optimization and learning problems problems. The examples we describe in Section 3 illustrate our expectation that the robust solution bθ rob n should have good performance in cases in which Var(ℓ(θ ; X)) is small (recall also Theorems 3 and 6). Identifying the separation between the performance empirical risk minimization and related estimators and that of the robustly-regularized estimators as well as variance-regularized estimates we consider more generally remains a challenge. We hope that this paper inspires work in this direction in machine learning and statistics, and more broadly, torward considering distributionally robust problems. Part of this is likely to come from making rigorous our empirical observations (Section 5) that robust regularization improves performance on hard instances without sacrificing performance on easier cases. Our understanding of so-called fast rates for stochastic optimization problems, while considering robustness, is also limited. For empirical risk minimization, fast rates of convergence hold under conditions in which the the gap R(θ) R(θ ) controls the variance of the excess loss ℓ(θ, X) ℓ(θ , X) (cf. Mammen and Tsybakov, 1999; Bartlett et al., 2005; Duchi and Namkoong Boucheron et al., 2005; Bartlett et al., 2006), which usually requires some type of uniform convexity assumption. These bounds typically follow from localization guarantees (Bartlett et al., 2005, Section 5) on the function class {x 7 ℓ(θ, x) ℓ(θ , x) | θ Θ} . While in Section 4.1, we show that the robust estimate bθ rob n enjoys faster rates of convergence under growth conditions analogous to uniform convexity of the risk, as Var(ℓ(θ; X) ℓ(θ ; X)) = Var(ℓ(θ; X)), it is not clear how to directly connect these guarantees to results of the form in Theorems 3 and 6. We leave investigation of these topics to future work. The last point of our discussion is to revisit Theorem 6, which provides a guarantee for robustly regularized estimators based on localized Rademacher complexities. An investigation of our proof shows that our derivation proceeds by considering the complexity of self-normalized classes of functions of the form Gr = r r E[f2] rf | f F . In contrast, the analogous result of Bartlett et al. (2005, Thereom 3.3) for empirical risk minimization considers the complexity of classes of functions of the form Gr = r E[f2] rf | f F . The latter class normalizes functions f by p E[f2] a type of self-normalization that arises in the computation of pivotal (asymptotically independent of the underlying distribution) statistics. While this choice prima facie is just a step in our proof, the robust objective Rn(θ, Pn) defined in Eq. (4) is an empirical likelihood upper confidence bound on the optimal population risk (see also Duchi et al., 2016). One of the important characteristics of empirical likelihood confidence bounds is that they are self-normalizing and yield pivotal statistics (Owen, 2001). Investigating such self-normalization in complexity guarantees seems likely to yield fruitful insights. Acknowledgments We thank Feng Ruan for pointing out a much simpler proof of Theorem 1 than in our original paper. JCD and HN were partially supported by the SAIL-Toyota Center for AI Research and HN was partially supported by the Samsung Fellowship. JCD was also partially supported by the National Science Foundation award NSF-CAREER-1553086 and the Sloan Fellowship. Variance-based Regularization with Convex Objectives Appendix A. Proof of Theorem 1 The theorem is immediate if sn = 0 or σ2 = 0, as in this case sup P:Dφ(P|| b Pn) ρ/n EP [Z] = E b Pn[Z] = E[Z]. In what follows, we will thus assume that σ2, s2 n > 0. We recall the maximization problem (8), which is i=1 pizi subject to p Pn = p Rn + : 1 2 np 1 2 2 ρ, 1, p = 1 , and the solution criterion (9), which guarantees that the maximizing value of problem (8) is z + p 2ρs2n/n whenever p Letting z = Z, then under the conditions of the theorem, we have |zi z| M, and to satisfy inequality (9) it is certainly sufficient that ns2n 1, or n 2ρM2 s2n , or s2 n 2ρM2 Conversely, suppose that s2 n < 2ρM2 n . Then we have 2ρs2 n n < 4ρ2M2 n2 , which in turn implies that sup p Pn p, z 1 Combining this inequality with the condition (30) for the exact expansion to hold yields the two-sided variance bounds (10). We now turn to showing the high-probability exact expansion (11), which occurs whenever the sample variance is large enough by expression (30). To that end, we show that s2 n is bounded from below with high probability. Define the event En := s2 n 3 and let n 4M2 σ2 max {2σ, 11}. Then, on event En we have n 44ρM2 s2n , so that the sufficient condition (30) holds and expression (11) follows. We now argue that the event En has high probability via the following lemma due to Maurer and Pontil. Lemma 11 (Maurer and Pontil (2009, Theorem 10)) Let Zi be i.i.d. random variables taking values in [0, M], and let s2 n = 1 n Pn i=1 Z2 i 1 n Pn i=1 Zi 2. Then, for n 2 P (sn σ t) P (sn σ + t) exp nt2 Setting t = 1 3 8 σ, the final result follows from noting that P(En) 1 exp nt2 from the lemma. Duchi and Namkoong Appendix B. Proof of Theorem 2 Our starting point is to recall from inequality (30) in the proof of Theorem 1 that for each f F, the empirical variance equality (12) holds if n 4ρM2 Var b Pn(f). As a consequence, Theorem 2 will follow if we can provide a uniform lower bound on the sample variances Var b Pn(f) that holds with high enough probability. We use C to denote a universal constant whose value may change from line to line. Noting that Var b Pn(f) = E b Pn(f E[f])2 (E b Pn(f E[f]))2, we proceed in two parts. First, we give a lower bound for E b Pn(f E[f])2. Lemma 12 Let F be a collection of bounded functions f : X [M0, M1] with M := M1 M0. Then, with probability at least 1 e t, for every f F Var(f) 2E b Pn(f E[f])2 + C Rsup n (F)2 log3(n M) + M2 n (t + log log n) . Proof We follow the arguments of Srebro et al. (2010) and Bousquet (2002b, Thm. 6.1). For x1, . . . , xn X, let Fn,r := n f E[f] | f F, E b Pn[(f E[f])2] r o , where b Pn is the empirical measure on x1, . . . , xn. Let ψsup n be a sub-root upper bound on the worst-case Rademacher complexity ψsup n (r) Rsup n (Fn,r), where implicitly in the right hand side we take the supremum over x1, . . . , xn definining Fn,r as well. Lemma 13 (Srebro et al. (2010, Lemma 2.2)) Let H be a class of bounded functions X [ M, M], and let κ : [ M, M] R+ be a bounded function with L-Lipschitz derivatives. Then, Rsup n n κ h : h H, E b Pn[κ h] r o C r Rsup n (H) log 3 2 (Mn). Since κ(t) = t2 has Lipschitz derivatives, above lemma with H = {f E[f] : f F} yields Rsup n (F2 n,r) C r Rsup n (F) log 3 2 (n M) (31) where we recall the notation that G2 = {g2 | g G} for any function class G. Thus we may take ψsup n (r) = C r Rsup n (F) log 3 2 n, which has fixed point rsup n = C2Rsup n (F)2 log3 n. The following classical result then shows that the fixed point rsup n controls generalization of class of functions F2 n,r. Since f2 0, Theorem 6.1 of Bousquet (2002b) yields that for all f F, E(f E[f])2 2E b Pn(f E[f])2 + C Rsup n (F)2 log3 n M + M2 n (t + log log n) with probability at least 1 e t. Next, we give an upper bound for (E b Pn(f E[f]))2. We use the following version of Talagrand s inequality due to Bousquet (2002a, 2003). (See also Bartlett et al. (2005, Thm 2.1).) Variance-based Regularization with Convex Objectives Lemma 14 Let r > 0 and F be a class of functions that map X into [a, b] such that for every f F, Var(f(X)) r. Then, with probability at least 1 e t sup f F {E[f] E b Pn[f]} inf α>0 2(1 + α)E[Rn(F)] + The same statement holds with supf F(E b Pn[f] E[f]) replacing the left-hand side of the inequalities. Applying Lemma 14 and letting α = 1 2, with probability at least 1 2e t |E b Pn[f] E[f]| 3E[Rn(F)] + 2M holds for all f F. Combining the above display with Lemma 12, we obtain the desired result. Appendix C. Proof of Theorem 3 Before proving the theorem proper, we state a technical lemma that provides uniform Bernstein-like bounds for the class F using empirical ℓ -covering numbers. Lemma 15 (Maurer and Pontil (2009, Theorem 6)) Let n 8M2 t and t log 12. Then with probability at least 1 6N (F, ϵ, 2n)e t, we have E[f] E b Pn[f] + 3 2Var b Pn(f)t for all f F. We return to the proof of Theorem 3. Let E1 denote that the event that the inequalities (32) hold. Then on E1 hold, uniformly over f F we have E[f] E b Pn[f] + 18Var b Pn(f(X))t (i) sup P:Dφ(P|| b Pn) ρ n EP [f(X)] + 2ρVar b Pn(f(X)) 2ρVar b Pn(f(X)) sup P:Dφ(P|| b Pn) ρ n EP [f(X)] + 11 ϵ for all f F, (33) where inequality (i) follows from the bounds (10) in Theorem 1 and the fact that ρ 9t by assumption. This gives the first result (15). Duchi and Namkoong For the second result (16), we recall that bf argminf F sup P {EP [f(X)] : Dφ(P|| b Pn) ρ n}, and we bound the supremum term in expression (33). First, we note that because bf minimizes the supremum term in expression (33), we have E[ bf] sup P:Dφ(P|| b Pn) ρ n EP [f(X)] + 11Mρ ϵ for all f F. Now fix f F. As the function f is fixed, by Bernstein s inequality, we have E b Pn[f] E[f] + with probability at least 1 e t. Similarly, we have by Lemma 11 that Var b Pn(f) p with probability at least 1 e t. That is, for any fixed f F, we have with probability at least 1 2e t that sup P:Dφ(P|| b Pn) ρ n EP [f(X)] (i) E b Pn[f] + 2ρVar b Pn(f) (ii) E[f] + 2 where inequality (i) follows from the uniform upper bound (10) of Theorem 1 and inequality (ii) from our assumption that ρ t. Substituting this expression into our earlier bound (33) yields that for any f F, with probability at least 1 2(3N (F, ϵ, 2n) + 1)e t, E[ bf(X)] E[f(X)] + 2 2ρVar(f(X)) This gives the theorem. Appendix D. Proof of Theorem 6 We first show the following version of uniform Bernstein s inequality with Rademacher complexities. The proof uses a peeling technique (Bartlett et al., 2005; van de Geer, 2000), in conjuction with Talagrand s concentration inequality (Lemma 14). Variance-based Regularization with Convex Objectives Lemma 16 Let r > 0 and F be a collection of bounded functions f : X [0, M] with Var(f(X)) r. Then, with probability at least 1 e t, for every f F E[f] E b Pn[f] + t + log l log nr m + 6E[Rn(F)] + 7M t + log l log nr The same statements hold with the roles of E[f] and E b Pn[f] reversed. We defer the proof to section D at the end of this section. Because Var(f) M2 for all f F, Lemma 16 also holds if we replace the terms log nr M2t with log n t 1 + log n t . Next, we show an important extension of Lemma 16 that replaces the Rademacher complexity term E[Rn(F)] by a local quantity r n, the fixed point of ψn(r). To this end, we use another peeling argument and apply Lemma 16 to the self-normalized class Gr := r r E[f2] rf : f F cf : f F, E[c2f2] r, c [0, 1] . This idea follows the techniques of Bartlett et al. (2005, Thm. 3.3), though we use a type of self-normalizing scale, that is, f/ p E[f2], whereas they use a variance-normalizing scaling by studying classes of functions of the form f/E[f2]. Our use of this alternative normalization is important in the next lemma, which allows us to obtain bounds that apply to the robustly regularized risk. Lemma 17 Let F be a collection of bounded functions f : X [0, M] satisfying the localization inequality (20) for some sub-root function ψn( ) with root r n. Let Bn = 1 n t + log log n t . Then with probability at least 1 e t, for every f F E[f] E b Pn[f] + p 2e Bn + 6 p r n + 7MBn/3 p E[f2] + 6r n + 14MBn. The same statement holds with the roles of E[f] and E b Pn[f] reversed. See Section D.1 for the proof. Next, we give an analogous result for f2. Lemma 18 Let F be a collection of bounded functions f : X [0, M] satisfying the localization inequality (20) for some sub-root function ψn( ) with root r n. Let η > 0. Then, with probability at least 1 e t, for every f F E[f2] E b Pn[f2] + 1 ηE b Pn[f2] + 72M2(1 + η)r n + Mt Also, with probability at least 1 e t, for every f F E b Pn[f2] E[f2] + η 1 + ηE[f2] + 72M2(1 + η)r n + Mt See Section D.2 for the proof. Now, we make two additional pieces of shorthand notation. Let Vn = 4((2e + 84M)Bn + 36r n). Duchi and Namkoong Then, Lemma 17 implies that E[f] E b Pn[f] + p Vn E[f2] + 6r n + 14MBn with probability at least 1 e t. Applying Lemma 18 to this bound with the choice η = 1 immediately yields that E[f] E b Pn[f] + q 2Vn E b Pn[f2] + 144M2Vnr n + 7Vn M max{M, 1}t/n + 6r n + 14MBn E b Pn[f] + q 2Vn E b Pn[f2] + 12M r n + 7 max{M, 1} + 6r n + 14MBn for all f F with probability at least 1 2e t. Subtracting and adding (E b Pn[f])2 to the second term, we have q 2Vn E b Pn[f2] = q 2Vn Var b Pn(f) + 2Vn E b Pn[f]2 q 2Vn Var b Pn(f) + p 2Vn E b Pn[f], where we have used that f 0. We thus obtain 2Vn E b Pn[f] + q 2Vn Var b Pn(f) + 12M r n + 7 max{M, 1} + 6r n + 14MBn 2Vn E b Pn[f] + q 2Vn Var b Pn(f) + 6MVn + 6M r n + 7 max{M, 1}t + 6r n + 14MBn, where the second inequality follows because 2b for a, b 0. Recalling the bound (21), which implies ρ n Vn, ρ n(r n + 7 max{M,1}t Mn ), and ρ/n 6r n + 14MBn, we obtain that E b Pn[f] + n Var b Pn(f) + 13Mρ Theorem 1 implies E b Pn[f] + q n Var b Pn(f) sup P:Dφ(P|| b Pn) ρ n EP [f(X)] + 2Mρ n , so we immediately we arrive at sup P:Dφ(P|| b Pn) ρ n EP [f(X)] + for all f F with probability at least 1 2e t. This is the first result (22). To show the second result, we simply apply Bernstein s inequality and the concentration inequalities for the standard deviation in Lemma 11. For any fixed f F, by Bernstein s inequality, we have E b Pn[f] E[f] + with probability at least 1 e t. From Lemma 11, we have Var b Pn(f) p Variance-based Regularization with Convex Objectives with probability at least 1 e t. We thus obtain that for any fixed f, sup P:Dφ(P|| b Pn) ρ n EP [f] E b Pn[f]+ n Var b Pn(f) E[f]+ n Var(f)+2M ρt with probability at least 1 2e t. Noting that ρ 45Mt by assumption (21), so ρ+ 46ρ/45 + 45t p 91ρ/45 and that always 2 ρt 3ρ + 1 3t, we have that with probability at least 1 2e t that sup P:Dφ(P|| b Pn) ρ n EP [f] E[f] + 91ρ 45n Var(f) + 3Mρ Noting that we could take f to minimize the right hand side of the preceding expression and that bf minimizes sup P:Dφ(P|| b Pn) ρ/n EP [f], we have the result (23). We first show the claim for g Fcentered = {f E[f] : f F}. To see the claim for g Fcentered, let us fix L N to be chosen later, and for l = 1, . . . , L 1 define the classes Fl := n g Fcentered : e lr < E[g2] e (l 1)r o , FL := g Fcentered : E[g2] e Lr so that Fcentered = L l=1Fl. Let z > 0 be such that t z. Applying Lemma 14 (with the choice α = 1 2) to Fl for each l = 1, . . . , L 1, we have with probability at least 1 e t, for every g Fl E[g] E b Pn[g] + n + 3E[Rn(Fl)] + 5M t E b Pn[g] + n E[g2] + 3E[Rn(Fl)] + 5M t where in the last line we have used e lr E[g2] for g Fl. Similarly, applying Lemma 14 to FL, then with probability at least 1 e t, for every g FL E[g] E b Pn[g] + n + 3E[Rn(FL)] + 5M t E b Pn[g] + n + 3E[Rn(FL)] + 5M t Taking a union bound, we have with probability at least 1 Le t, for every g Fcentered E[g] E b Pn[g] + n E[g2] + 3E[Rn(Fcentered)] + 5M t Noting that E[Rn(Fcentered)] 2E[Rn(F)] by Jensen s inequality, we take L = log rn M2t and map t to t + log L to obtain the lemma. The case when the roles of E[f] and E b Pn[f] are reversed follows similarly. Duchi and Namkoong D.1. Proof of Lemma 17 Let r r n be an arbitrary but fixed value to be choosen later. Using this r, define the self-normalized class of functions Gr := r r E[f2] rf : f F cf : f F, E[c2f2] r, c [0, 1] . From the truncation by r, we have E[g2] r for all g Gr. Lemma 16 implies that with probability at least 1 e t, uniformly over g Gr E[g] E b Pn[g] + n E[g2] t + log l log n m + 6E[Rn(Gr)] + 7M t + log l log n Using the sub-root property of ψn and that ψn(r n) = r n, we have the inequality ψn(r) = rψn(r)/ r rψn(r n)/ p for any r r n, so E[Rn Gr] E[Rn cf : f F, E[c2f2] r, c [0, 1] ] ψn(r) p Using this upper bound in Eq. (34) and recalling the notation Bn = 1 n t + log log n E[g] E b Pn[g] + p 2e Bn E[g2] + 6 p r nr + 7MBn. (35) Now, we return to choose the value r to optimize the bound (35). let r be the largest solution to 6 r nr + 7MBn = 6r. The following elementary lemma provides a bound on r. Lemma 19 Let x be the largest solution to ax + b = x2 d where a, b, d > 0. Then a2d2 x2 a2d2 + 2bd. Proof From the quadratic formula, we have x = 1 a2d2 + 4b from which the lower bound follows. From convexity of z 7 z2 and z1 + z2 z1 + z2 for z1, z2 > 0, we obtain the upper bound. Lemma 19 immediately yields r n r r n + 7MBn For each g Gr, there exists f F such that g = q r E[f2] rf. If E[f2] r, we have g = f and the bound (35) yields E[f] E b Pn[f] + p 2e Bn E[f2] + 6r n + 14MBn. If E[f2] > r, rescaling g in the bound (35) and using the choice 6r = 6 r nr + 7MBn yields E[f] E b Pn[f] + p 2e Bn E[f2] + 6 p E b Pn[f] + p 2e Bn E[f2] + 6 p (r n + 7MBn/3)E[f2] Variance-based Regularization with Convex Objectives instead. Combining the cases E[f2] r, we conclude that for all f F, E[f] E b Pn[f] + p 2e Bn + 6 p r n + 7MBn/3 p E[f2] + 6r n + 14MBn with probability at least 1 e t. Similarly, we can reverse the roles of E[f] and E b Pn[f] to get the second result. D.2. Proof of Lemma 18 We frequently use the Rademacher contraction principle (Ledoux and Talagrand, 1991, Thm. 4.12) in what follows. Lemma 20 Let φ : R R be L-Lipschitz. Then, for every class G Eϵ[Rn(φ G)] LEϵ[Rn(G)] where φ G = {φ f : f G}. As in Section D.1, define the self-normalized functions in F Gr := r r E[f2] rf : f F cf : f F, E[c2f2] r, c [0, 1] where r r n will be choosen later. Let G2 r = {g2 : g Gr}. From the truncation by r, we have that for all g2 G2 r, Var(g2) E[g4] M2E[g2] M2r. Let c1 = 3 and c2 = 7 3. Then by Lemma 14 applied to G2 r, with probability at least 1 e t, for every g Gr E[g2] E b Pn[g2] + c1E[Rn(G2 r)] + M (a) E b Pn[g2] + 2c1ME[Rn(Gr)] + M (b) E b Pn[g2] + 2c1M p where in step (a) we used the contraction principle (Lemma 20) and that x 7 x2 is 2MLipschitz on [ M, M], and in step (b), we used that ψn(r) rr n as in the proof of Lemma 17 in Section D.1. Let A = 2c1M r n + M q n and D = c2M2t n . For any fixed K > 1, choose r to be the largest solution to A r + D = r K so that the bound (36) becomes E[g2] E b Pn[g2] + r From Lemma 19, we have K2A2 r K2A2 + 2KD and in particular, r K2A2 r n. For each g Gr, there exists f F such that g = q r E[f2] rf. If E[f2] r, rescaling the inequality (36) and using the upper bound on r, we obtain E[f2] E b Pn[f2] + r K E b Pn[f2] + KA2 + 2D. Duchi and Namkoong If E[f2] > r, rescaling instead yields E[f2] E b Pn[f2] + E[f2] Combining the two cases, we obtain E[f2] K K 1E b Pn[f2] + KA2 + 2D. Noting that A 2 4c2 1M2r n + 2M2t n by convexity, we have the first result once we replace K with η = K 1 > 0. The second result similarly follows by reversing the roles of E[f] and E b Pn[f] in the above argument. Appendix E. Proof of Theorem 8 Recall our shorthand notation that π(θ) = argminθ S { θ θ 2} denotes the Euclidean projection of θ onto S , which is a closed convex set. Define also the localized empirical deviation function n(θ) := E [ℓ(θ; X) ℓ(π(θ); X)] E b Pn [ℓ(θ; X) ℓ(π(θ); X)] . (37) We begin with the following Claim 21 If b Sϵ S2ϵ , then n Var b Pn(ℓ(θ; X) ℓ(π(θ); X)) Deferring the proof of the claim, let us prove the theorem. First, the growth condition (26) shows that θ Θ : θ π(θ) 2 2ϵ θ Θ : dist(θ, S ) 2ϵ Therefore, we have for all θ S2ϵ that Var b Pn(ℓ(θ; X) ℓ(π(θ); X)) L2 dist(θ, S )2 L2 2ϵ and so by the assumption (27) that ϵ (8L2ρ γ 2(γ 1) ( 2 λ) 1 γ 1 , we have n Var b Pn(ℓ(θ; X) ℓ(π(θ); X)) L In particular, if the event (38) holds then sup θ S2ϵ n(θ) ϵ Variance-based Regularization with Convex Objectives and recalling the definition (37) of n, it then follows that P b Sϵ S2ϵ P sup θ S2ϵ n(θ) ϵ To bound the probability (39), we use standard bounded difference and symmetrization arguments (e.g. Boucheron et al., 2013, Theorem 6.5). Letting f(X1, . . . , Xn) := supθ S2ϵ n(θ), the function f satisfies bounded differences: sup x,x X |f(X1, , Xj 1, x, Xj+1, , Xn) f(X1, , Xj 1, x , Xj+1, , Xn)| sup x,x X sup θ S2ϵ 1 n(ℓ(θ; x) ℓ(π(θ); x)) 1 n(ℓ(θ; x ) ℓ(π(θ); x )) n sup θ S2ϵ dist(θ, S ) 2L for j = 1, . . . , n. Using the standard symmetrization inequality E[supθ S2ϵ n(θ)] 2E[Rn(S2ϵ )] and the bounded differences inequality (Boucheron et al., 2013, Theorem 6.5), we have sup θ S2ϵ n(θ) 2E[Rn(S2ϵ )] + t for all t 0. Letting u = nt2 γ above and recalling the assumption (27) upper bounding E[Rn(S2ϵ )], we have P(supθ S2ϵ n(θ) ϵ 2) e u. The theorem follows from the bound (39). Proof of Claim 21 If b Sϵ S2ϵ , then certainly it is the case that there is some θ Θ\S2ϵ such that Rn(θ, Pn) inf θ Θ Rn(θ, Pn) + ϵ Rn(π(θ), Pn) + ϵ. Using the convexity of Rn, we have for all t [0, 1] that Rn(tθ + (1 t)π(θ), Pn) t Rn(θ, Pn) + (1 t)Rn(π(θ), Pn) Rn(π(θ), Pn) + tϵ. For all t [0, 1], we have by definition of orthogonal projection (because the vector θ π(θ) belongs to the normal cone to S at π(θ); cf. (Hiriart-Urruty and Lemar echal, 1993, Prop. III.5.3.3)) that π(tθ+(1 t)π(θ)) = π(θ). Thus, choosing t appropriately, there exists θ bd S2ϵ with θ = tθ + (1 t)π(θ), π(θ ) = π(θ), and Rn(θ , Pn) Rn(π(θ ), Pn) + ϵ. Adding and subtracting the risk R(θ) and R(π(θ)), we have that for some θ bd S2ϵ that Rn(θ, Pn) R(θ) + R(π(θ)) Rn(π(θ), Pn) R(π(θ)) R(θ) + ϵ ϵ, where we have used that R(θ) = R(π(θ)) + 2ϵ by construction. Multiplying by 1 on each side of the preceding display and taking suprema, we find that ϵ sup θ S2ϵ {R(θ) Rn(θ, Pn) (R(π(θ)) Rn(π(θ), Pn))} sup θ S2ϵ sup P:Dφ(P|| b Pn) ρ/n {R(θ) R(π) + EP [ℓ(π(θ); X) ℓ(θ; X)]} . Applying the upper bound in inequality (10) of Theorem 1 gives the claim. Duchi and Namkoong Appendix F. Proof of Theorem 10 We begin by establishing a few technical lemmas, after which the proof of the theorem follows essentially standard arguments in asymptotics. To prove Theorem 10, we first show that (eventually) we have the exact expansion Rn(θ, Pn) = E b Pn[ℓ(θ, X)] + 2ρVar b Pn(ℓ(θ, X)) for all θ in a neighborhood of θ . As in the proof of Theorem 1, this exact equality holds once there is suitable variability in the values ℓ(θ, Xi) over i = 1, . . . , n, however, we require a bit more care as the values ℓ(θ, Xi) may be unbounded below and above. Heuristically, however, assuming that we have this exact expansion and that bθ rob n θ = OP (n 1 2 ), then we can write the expansions 0 = θRn(bθ rob n , Pn) i=1 ℓ(θ , Xi) + 2 1 i=1 ℓ(θ , Xi) (bθ rob n θ ) + 2ρVar b Pn(ℓ(bθ rob n , X)) n + o P (n 1 i=1 ℓ(θ , Xi) + 2R(θ )(bθ rob n θ ) + 2ρVar(ℓ(θ , X)) n + o P (n 1 Multiplying by n and solving for bθ rob n in the preceding expression, computing p Var(ℓ(θ , X)) then yields the theorem. The remainder of the proof makes this heuristic rigorous, and the outline is as follows: 1. We show that there is a uniform expansion of the form (12) in a neighborhood of θ . (See Section F.1.) 2. Using the uniform expansion, we can then leverage standard techniques for asymptotic analysis of finite-dimensional estimators (see, e.g. van der Vaart and Wellner (1996) or Lehmann and Casella (1998)), which proceed by performing a Taylor expansion of the objective in a neighborhood of the optimum and using local asymptotic normality arguments. (See Section F.2.) F.1. The uniform variance expansion To lighten notation, we define a few quantities similar to those used in the proof of Theorem 1. Let Z(θ) := ℓ(θ, X) E[ℓ(θ, X)] be the deviation of ℓ(θ, X) around its mean (the risk), and similarly let Zi(θ) be the version of this quantity for observation Xi. In addition, let s2 n(θ) = Var b Pn(Z(θ)) be the empirical variance of Z(θ), which is identical to the empirical variance of ℓ(θ, X). Now, recall the problem maximize P EP [Z(θ)] subject to Dφ(P|| b Pn) ρ Variance-based Regularization with Convex Objectives and for each θ Θ, let p(θ) = argmaxp Pn Pn i=1 pi Zi(θ) be the solution (probability) vectors. Following expression (9) we see for any ϵ 0 that 2ρ(Zi(θ) Z(θ)) nsn(θ) 1 for all θ θ + ϵB is sufficient for the exact variance expansion to hold. We now show that this is indeed likely. Let ϵ > 0 be small enough that Assumption A holds, that is, the random Lipschitz function L(X) satisfies |ℓ(θ, x) ℓ(θ , x)| L(x) θ θ for θ, θ θ + ϵB. Then because nsn(θ) nsn(θ ) sup u: u 2 1 i=1 ui ℓ(θ, Xi) ℓ(θ , Xi) sup u: u 2 1 i=1 ui L(Xi) θ θ i=1 L2(Xi) θ θ so θ 7 sn(θ) is q 1 n Pn i=1 L(Xi)2-Lipschitz for θ θ + ϵB, we have inf θ θ +ϵB min i [n] 2ρ(Zi(θ) Z(θ)) nsn(θ) 2ρ(Zi(θ ) Z(θ ) 2ϵL(Xi)) r n sn(θ ) ϵ q 1 n Pn j=1 L(Xj)2 . Summarizing our development thus far, we have the following lemma. Lemma 22 Let the conditions of the previous paragraph hold. Then 2ρ(Zi(θ ) Z(θ ) 2ϵL(Xi)) o n v u u tsn(θ ) ϵ 1 i=1 L(Xi)2 1 implies that Rn(θ, Pn) = E b Pn[ℓ(θ, X)] + n Var b Pn(ℓ(θ, X)) for all θ θ + ϵB. Now, we use the following standard result to show that the conditions of Lemma 22 eventually hold with probability one. Lemma 23 (Owen (Owen, 1990), Lemma 3) Let Yi be independent random variables with supi E[Y 2 i ] < . Then n 1 2 max1 i n |Yi| a.s. 0. Based on Lemma 23 and the strong law of large numbers, we see immediately that 1 n max 1 i n |Zi(θ )| a.s. 0, and 1 n max 1 i n L(Xi) a.s. 0, because E[Z(θ )2] < and E[L(Xi)2] < . Applying the strong law of large numbers to obtain sn(θ ) a.s. p Var(ℓ(θ , X)) and i=1 L(Xi)2 a.s. p we see immediately that for small enough ϵ > 0, the condition of Lemma 22 holds eventually with probability 1. That is, the following uniform expansion holds. Duchi and Namkoong Lemma 24 There exists ϵ > 0 such that, with probability 1, there exists an N (which may be random) such that n N implies Rn(θ, Pn) = E b Pn[ℓ(θ, X)] + 2ρVar b Pn(ℓ(θ, X)) n for all θ θ + ϵB. F.2. Asymptotics and Taylor expansions Let En,exact be the event that the exact variance expansion of Lemma 24 occurs for θ θ + ϵB. Now that we know that P(En,exact eventually) = 1, we may perform a few asymptotic expansions of the variance-regularized objective to provide the convergence guarantees specified by the theorem. We use the following lemma. Lemma 25 Let the conditions of the theorem hold. If bθ rob n argmin θ Rn(θ, Pn) then bθ rob n a.s. θ . (40) The proof is standard, but for completeness we include it in Section G.3. By combining Lemmas 24 and 25, we see that with probability 1, for any ϵ > 0, we eventually have both bθ rob n θ 2 < ϵ and Rn(bθ rob n , Pn) = E b Pn[ℓ(bθ rob n , X)] + n Var b Pn(ℓ(bθ rob n , X)). Assume for the remainder of the argument that both of these conditions hold. Standard results on subdifferentiability of maxima of collections of convex functions (Hiriart-Urruty and Lemar echal, 1993, Chapter X) give that Rn(θ, Pn) is differentiable near θ , and thus 0 = Rn(bθ rob n , Pn) = E b Pn[ ℓ(bθ rob n , X)] + n Var b Pn(ℓ(bθ rob n , X)) i=1 ℓ(bθ rob n , Xi) + h ( ℓ(bθ rob n , X) E b Pn[ ℓ(bθ rob n , X)])(ℓ(bθ rob n , X) E b Pn[ℓ(bθ rob n , X)]) i Var b Pn(ℓ(bθ rob n , X)) . Because bθ rob n a.s. θ , by the continuous mapping theorem and local uniform convergence of the empirical expectations E b Pn[ ] to E[ ], the second term of expression (41) satisfies h ( ℓ(bθ rob n , X) E b Pn[ ℓ(bθ rob n , X)])(ℓ(bθ rob n , X) E b Pn[ℓ(bθ rob n , X)]) i Var b Pn(ℓ(bθ rob n , X)) = Cov( ℓ(θ , X), ℓ(θ , X)) p Var(ℓ(θ , X)) | {z } =:b(θ ) For simplicity, we let b(θ ) denote the final term, which we shall see becomes an asymptotic bias. Thus, performing a Taylor expansion of the terms ℓ(bθ rob n , Xi) around θ in equality (41), there exist (random) error matrices En(Xi), where En(Xi) H(Xi) bθ rob n θ Variance-based Regularization with Convex Objectives by Assumption A, such that 0 = E b Pn[ ℓ(θ , X)] + 1 2ℓ(θ , Xi) + En(Xi) (bθ rob n θ ) + n (b(θ ) + o P (1)) = E b Pn[ ℓ(θ , X)] + 2R(θ ) + o P (1) (bθ rob n θ ) + n (b(θ ) + o P (1)). Multiplying both sides by n, using that 2R(θ ) + o P (1) is eventually invertible, and applying the continuous mapping theorem, we have n(bθ rob n θ ) = ( 2R(θ ) + o P (1)) 1 1 n i=1 ℓ(θ , Xi) p 2ρb(θ ) + o P (1). The first term on the right side of the above display converges in distribution to a N(0, Σ) distribution, where Σ = ( 2R(θ )) 1 Cov( ℓ(θ , X))( 2R(θ )) 1, so that n(bθ rob n θ ) d N p 2ρ b(θ ), Σ as claimed in the theorem statement. Appendix G. Proofs of Technical Lemmas G.1. Proof of Inequality (24) Define the Gaussian complexity Gn({ℓ H} r) := E sup h BH,c [0,1] X gicℓ(h(xi), yi) | E[ℓ(h(X), Y )2] r/c2 # where gi iid N(0, 1) (here we recall the standard result (Bartlett and Mendelson, 2002) that Gaussian complexity upper bounds Rademacher complexities up to a constant). Now, the set h h such that h BH is contained in 2BH, which is convex. Moreover, we have E[ℓ(h(X), Y )2] = E[(h(X) h (X))2] + σ2, and so we have for any c that {h BH | c2E[ℓ(h(X), Y )2] r} {h BH | E[(h(X) h (X))2] r/c2}, and E[ℓ(h(X), Y )2] r/c2 also implies σ2 r/c2. Returning to expression (42) and enlarging the sets over which we take suprema, we thus obtain sup h BH,c1,c2 [0,1] i=1 gi|c1(h(xi) h (xi)) c2ξi| | E[(h(X) h (X))2] r c2 1 , σ2 r sup f 2BH,c [0,1] i=1 gi|f(xi) cξi| | E[f(X)2] r, σ2 r/c2 # Duchi and Namkoong where we have used that h h 2BH and that the set BH is convex to obtain the second inequality. We now upper bound the final display using the classical Sudakov Fernique comparison theorem (e.g. Chatterjee, 2005). Indeed, define the two Gaussian processes indexed by f H and c [0, 1] by Yf,c = Pn i=1 gi|f(xi) cξi| and Zf,c = Pn i=1 gif(xi) + c Pn i=1 wiξi, where gi iid N(0, 1) and wi iid N(0, 1). Then we have for any f1, f2 H and c1, c2 [0, 1] that E[(Yf1,c1 Yf2,c2)2] = i=1 (|f1(xi) c1ξi| |f2(xi) c2ξi|)2 i=1 (f1(xi) f2(xi) + (c2 c1)ξi)2 i=1 (f1(xi) f2(xi))2 + 2(c2 c1)2 n X Moreover, E[(Zf1,c1 Zf2,c2)2] = Pn i=1(f1(xi) f2(xi))2 + (c1 c2)2 Pn i=1 ξ2 i . Thus, the Sudakov-Fernique inequality guarantees that E[supf,c Yf,c] 2E[supf,c Zf,c], and i=1 gif(xi) | E[f(X)2] r sup c [0,1] c i=1 wiξi | c2σ2 r The last term in the expression has bound nr by Jensen s inequality and the relaxation that c [ 1, 1]. For the first term, Mendelson (2003, Thm. 2.1) shows that for RKHS with kernel eigenvalues λ1, λ2, . . ., we have i=1 gif(Xi) | E[f(X)2] r j=1 min{λj, r} which yields our desired claim (24). G.2. Proof of Lemma 7 Defining Ny := card{i [n] : Xi = y} for y { 1, 0, 1}, we immediately obtain E b Pn[ℓ(θ; X)] = 1 n [N 1|θ + 1| + N1|θ 1| + N0|θ| (n N0)] , because N1 + N 1 + N0 = n. In particular, we find that the empirical risk minimizer θ satisfies bθ erm n := argmin θ R E b Pn[ℓ(θ; X)] = 1 if N1 > N0 + N 1 1 if N 1 > N0 + N1 [ 1, 1] otherwise. On the events N1 > N 1 + N0 or N 1 > N0 + N1, which are disjoint, then, we have R(bθ erm n ) = δ = R(θ ) + δ. Variance-based Regularization with Convex Objectives Let us give a lower bound on the probability of this event. Noting that marginally N1 Bin(n, 1 δ 2 ) and using N0 + N 1 = n N1, we have N1 > N0 + N 1 if and only if N1 > n 2 , and we would like to lower bound = P Bin n, 1 δ = P Bin n, 1 + δ Letting Φ(t) = 1 2π R t e u2/2du denote the standard Gaussian CDF, then Zubkov and Serov (2013) show that where Dkl (p||q) = p log p q + (1 p) log 1 p 1 q denotes the binary KL-divergence. We have by standard bounds on the KL-divergence (Tsybakov, 2009, Lemma 2.7) that Dkl(1 2 ) δ2 2(1 δ2), so that 2 or N 1 > n For n odd, the final probability is 0, while for n even, we have = 2 n n n/2 (1 δ2)n/2 (1 δ2)n/2 r where the inequality uses that 2n n 4n πn by Stirling s approximation. Summarizing, we find that 2 or N 1 > n (1 δ2)n/2 r G.3. Proof of Lemma 25 Under the conditions of the theorem, the compactness of θ + ϵB guarantees that sup θ θ +ϵB |E b Pn[ℓ(θ, X)] R(θ)| a.s. 0, as the functions θ 7 ℓ(θ, x) are Lipschitz in a neighborhood of θ by Assumption A. Similarly, sup θ θ +ϵB Var b Pn(ℓ(θ, X)) Var(ℓ(θ, X)) a.s. 0, using the local Lipschitzness of 2ℓ. (See, for example, the Glivenko-Cantelli results in Chapters 2.4 2.5 of van der Vaart and Wellner (1996).) Thus, using the two-sided bounds (10) of Theorem 1, we have that sup θ θ +ϵB |Rn(θ, Pn) R(θ)| sup θ θ +ϵB E b Pn[ℓ(θ, Pn)] R(θ) + n sup θ θ +ϵB Var b Pn(ℓ(θ, X)) a.s. 0. Duchi and Namkoong Now, we use the fact that 2R(θ ) 0, and that θ 7 2R(θ) is continuous in a neighborhood of θ . Fix ϵ > 0 small enough that the preceding uniform convergence guarantees hold over θ + 2ϵB and 2R(θ) λI for some λ > 0 and all θ θ + 2ϵB. Let θ θ + ϵB, but θ θ + 2ϵB. Then for sufficently large n, we have that Rn(θ, Pn) E b Pn[ℓ(θ, X)] (i) R(θ) λ (ii) R(θ ) + λ 2 θ θ 2 2 λ 4 ϵ2 (iii) R(θ ) + λ (iv) E b Pn[ℓ(θ , X)] + λ 8 ϵ2 = E b Pn[ℓ(θ , X)] + λ where inequalities (i) and (iv) follow from the uniform convergence guarantee, inequality (ii) from the strong convexity of R near θ , and (iii) because θ θ 2 ϵ. Finally, we have that E b Pn[ℓ(θ , X)] Rn(θ , Pn) n Var b Pn(ℓ(θ , X)) | {z } a.s. 0 so that eventually Rn(θ, Pn) > Rn(θ , Pn) for all θ θ + 2ϵB \ ϵB. By convexity, then this inequality holds for all θ θ + ϵB. Thus if bθ rob n argminθ Rn(θ, Pn), then for any ϵ > 0 we must eventually have bθ rob n θ 2 < ϵ. Appendix H. Efficient solutions to computing the robust expectation In this appendix, we give a detailed description of the procedure we use to compute the supremum problem (8). In particular, our procedure requires time O(n log n + log 1 ϵ log n), where ϵ is the desired solution accuracy. Let us reformulate this as a minimization problem in a variable p Rn for simplicity. Then we wish to solve minimize p z subject to 1 2n np 1 2 2 ρ, p 0, p 1 = 1. We take a partial dual of this minimization problem, then maximize this dual to find the optimizing p. Introducing the dual variable λ 0 for the constraint that 1 n and performing the standard min-max swap (Boyd and Vandenberghe, 2004) (strong duality obtains for this problem because the Slater condition is satisfied by p = 1 n1) yields the maximization problem maximize λ 0 f(λ) := inf p n + p z | p 0, 1 p = 1 . (43) If we can efficiently compute the infimum (43), then it is possible to binary search over λ. Recall the standard fact (Hiriart-Urruty and Lemar echal, 1993, Chapter VI.4.4) that for a collection {fp}p P of concave functions, if the infimum f(x) = infp P fp(x) is attained at some p0 then any vector fp0(x) is a supergradient of f(x). Thus, letting p(λ) be the (unique) minimizing value of p for any λ > 0, the objective (43) becomes f(λ) = λ 2 p(λ) 1 n +p(λ) z, whose derivative with respect to λ (holding p fixed) is f (λ) = 1 2 p(λ) 1 Variance-based Regularization with Convex Objectives Now we use well-known results on the Euclidean projection of a vector to the probability simplex (Duchi et al., 2008) to provide an efficient computation of the infimum (43). First, we assume with no loss of generality that z1 z2 zn and that 1 z = 0, because neither of these changes the original optimization problem (as 1 p = 0 and the objective is symmetric). Then we define the two vectors s, σ2 Rn, which we use for book-keeping in the algorithm, by j i zj, σ2 i = X and we let z2 be the vector whose entries are z2 i . The infimum problem (43) is equivalent to projecting the vector v(λ) Rn defined by onto the probability simplex. Notably (Duchi et al., 2008), the projection p(λ) has the form pi(λ) = (vi η)+ for some η R, where η is chosen such that Pn i=1 pi(λ) = 1. Finding such a value η is equivalent (Duchi et al., 2008, Figure 1) to finding the unique index i such that i X j=1 (vj vi) < 1 and j=1 (vj vi+1) 1, taking i = n if no such index exists (the sum Pi j=1(vj vi) is increasing in i and v1 v1 = 0). Given the index i, algebraic manipulations show that η = 1 i Pi j=1 zj/λ = 1 i si/λ satisfies the equality Pn i=1 (vi η)+ = 1 and that vj η 0 for all j i while vj η 0 for j > i. Of course, given the index i and η, we may calculate the derivative λf(λ) efficiently as well: p(λ) n 11 2 2 λρ p(λ) n 11 2 2 ρ j=1 (vj η n 1)2 + 1 λzj + η 2 + n i n = σ2 i 2λ2 + iη2 Finding the index optimal i can be done by a binary search, which requires O(log n) time, and f (λ) is then computable in O(1) time using the vectors s and σ2. It is then possible to perform a binary search over λ using f (λ), which which requires log 1 ϵ iterations to find λ within accuracy ϵ, from which it is easy to compute p(λ) via pi(λ) = (vi η)+ = n 1 λ 1zi η +. We summarize this discussion with pseudo-code in Figures 6 and 7, which provide a main routine and sub-routine for finding the optimal vector p. These routines show that, once provided the sorted vector z with z1 z2 zn (which requires n log n time to compute), we require only O(log 1 ϵ log n) computations. Duchi and Namkoong Inputs: Sorted vector z Rn with 1 z = 0, parameter ρ > 0, solution accuracy ϵ Set λmin = 0 and λmax = λ = max{n z , p n/2ρ z 2} Set si = P j i zj and σ2 i = P j i z2 j While |λmax λmin| > ϵλ Set λ = λmax+λmin 2 Set (η, i) = Find Shift(z, λ, s) // (Figure 7) Set f (λ) = 1 2λ2 σ2 i + η2 n If f (λ) > 0 Set λmin = λ Else Set λmax = λ Set λ = 1 2(λmax + λmin), (η, i) = Find Shift(z, λ, s) Set pi = 1 + and return p Figure 6. Procedure Find P to find the vector p minimizing Pn i=1 pizi subject to the constraint 1 2n np 1 2 2 ρ. Method takes log 1 ϵ iterations of the loop. Inputs: Sorted vector z with 1 z = 0, λ > 0, vector s with si = P j i zj Set ilow = 1, ihigh = n If 1 λ 0 Return (η = 0, i = n) While ilow = ihigh i = 1 2(ilow + ihigh) sleft = 1 λ(izi si) // (this is sleft = Pi j=1(vj vi)) sright = 1 λ((i + 1)zi+1 si+1) // (this is sright = Pi+1 j=1(vj vi+1)) If sright 1 and sleft < 1 Set η = 1 λisi and return (η, i) Else if sleft 1 Set ihigh = i 1 Else Set ilow = i + 1 Set i = ilow and η = 1 λisi and return (η, i) Figure 7. Procedure Find Shift to find index i and parameter η such that, for the definition vi = 1 λzi, we have vj η 0 for j i, vj η 0 for j > i, and Pn j=1 (vj η)+ = 1. Method requires time O(log n). M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Applied Mathematics, 47:207 217, 1993. Variance-based Regularization with Convex Objectives P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463 482, 2002. P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497 1537, 2005. P. L. Bartlett, M. I. Jordan, and J. Mc Auliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138 156, 2006. A. Ben-Tal, L. E. Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009. A. Ben-Tal, D. den Hertog, A. D. Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341 357, 2013. A. Ben-Tal, E. Hazan, T. Koren, and S. Mannor. Oracle-based robust optimization via online learning. Operations Research, 63(3):628 638, 2015. D. Bertsimas, V. Gupta, and N. Kallus. Robust SAA. ar Xiv:1408.4445 [math.OC], 2014. URL http://arxiv.org/abs/1408.4445. M. Birman and M. Solomjak. Piecewise-polynomial approximations of functions of the classes W α p . Sbornik: Mathematics, 2(3):295 317, 1967. S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323 375, 2005. S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: a Nonasymptotic Theory of Independence. Oxford University Press, 2013. O. Bousquet. A bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 334(6):495 500, 2002a. O. Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. Ph D thesis, L Ecole Polytechnique, 2002b. O. Bousquet. Concentration inequalities for sub-additive functions using the entropy method. In Stochastic inequalities and applications, pages 213 247. Springer, 2003. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. P. B uhlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011. S. Chatterjee. An error bound in the Sudakov-Fernique inequality. ar Xiv:0510424 [math.PR], 2005. N. Cristianini and J. Shawe-Taylor. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. Duchi and Namkoong J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008. J. C. Duchi, P. W. Glynn, and H. Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. ar Xiv:1610.03425 [stat.ML], 2016. R. M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999. J.-y. Gotoh, M. J. Kim, and A. Lim. Robust empirical optimization is almost the same as mean-variance optimization. Available at SSRN 2827400, 2015. C. Gu. Smoothing spline ANOVA models. Springer, 2002. J. Hiriart-Urruty and C. Lemar echal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993. V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593 2656, 2006. T. K uhn. Covering numbers of Gaussian reproducing kernel Hilbert spaces. Journal of Complexity, 27(5):489 499, 2011. H. Lam and E. Zhou. Quantifying input uncertainty in stochastic optimization. In Proceedings of the 2015 Winter Simulation Conference. IEEE, 2015. M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, 1991. E. L. Lehmann and G. Casella. Theory of Point Estimation, Second Edition. Springer, 1998. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361 397, 2004. M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ ml. E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. Annals of Statistics, 27: 1808 1829, 1999. A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the Twenty Second Annual Conference on Computational Learning Theory, 2009. S. Mendelson. On the performance of kernel classes. Journal of Machine Learning Research, 4(Oct):759 771, 2003. S. Mendelson. Learning without concentration. In Proceedings of the Twenty Seventh Annual Conference on Computational Learning Theory, 2014. Variance-based Regularization with Convex Objectives H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 30, 2016. A. Owen. Empirical likelihood ratio confidence regions. The Annals of Statistics, 18(1): 90 120, 1990. A. B. Owen. Empirical likelihood. CRC press, 2001. A. Shapiro, D. Dentcheva, and A. Ruszczy nski. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009. P. K. Shivaswamy and T. Jebara. Empirical Bernstein boosting. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010. P. K. Shivaswamy and T. Jebara. Variance penalizing Ada Boost. In Advances in Neural Information Processing Systems 25, 2011. N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In nips2010, pages 2199 2207, 2010. A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135 166, 2004. A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009. S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996. V. N. Vapnik. Statistical Learning Theory. Wiley, 1998. V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2):264 280, 1971. V. N. Vapnik and A. Y. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974. (In Russian). D.-X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49(7):1743 1752, 2003. H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301 320, 2005. A. Zubkov and A. Serov. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539 544, 2013.