# Variance-based Regularization with Convex Objectives

Hongseok Namkoong, Stanford University, hnamk@stanford.edu
John C. Duchi, Stanford University, jduchi@stanford.edu

**Abstract.** We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.

## 1 Introduction

Let $\mathcal{X}$ be a sample space, $P_0$ a distribution on $\mathcal{X}$, and $\Theta$ a parameter space. For a loss function $\ell : \Theta \times \mathcal{X} \to \mathbb{R}$, consider the problem of finding $\theta \in \Theta$ minimizing the risk

$$R(\theta) := \mathbb{E}_{P_0}[\ell(\theta; X)] = \int_{\mathcal{X}} \ell(\theta; x)\, dP_0(x) \qquad (1)$$

given a sample $\{X_1, \ldots, X_n\}$ drawn i.i.d. according to the distribution $P_0$. Under appropriate conditions on the loss $\ell$, parameter space $\Theta$, and random variables $X$, a number of researchers [2, 6, 12, 7, 3] have shown results of the form that, with high probability,

$$R(\theta) \le \frac{1}{n}\sum_{i=1}^n \ell(\theta; X_i) + C_1 \sqrt{\frac{\operatorname{Var}(\ell(\theta; X))}{n}} + \frac{C_2}{n} \quad \text{for all } \theta \in \Theta, \qquad (2)$$

where $C_1$ and $C_2$ depend on the parameters of problem (1) and the desired confidence guarantee. Such bounds justify empirical risk minimization, which chooses $\hat\theta_n$ to minimize $\frac{1}{n}\sum_{i=1}^n \ell(\theta; X_i)$ over $\theta \in \Theta$. Further, these bounds showcase a tradeoff between bias and variance, where we identify the bias (or approximation error) with the empirical risk $\frac{1}{n}\sum_{i=1}^n \ell(\theta; X_i)$, while the variance arises from the second term in the bound.

Considering the bias-variance tradeoff (2) in statistical learning, it is natural to instead choose $\theta$ to directly minimize a quantity trading between approximation and estimation error:

$$\underset{\theta\in\Theta}{\text{minimize}} \quad \frac{1}{n}\sum_{i=1}^n \ell(\theta; X_i) + C \sqrt{\frac{\operatorname{Var}_{\hat P_n}(\ell(\theta; X))}{n}}, \qquad (3)$$

where $\operatorname{Var}_{\hat P_n}$ denotes the empirical variance. Maurer and Pontil [16] consider this idea, giving guarantees on the convergence and good performance of such a procedure. Unfortunately, even when the loss $\ell$ is convex in $\theta$, the formulation (3) is generally non-convex, which limits the applicability of procedures that minimize the variance-corrected empirical risk (3).

In this paper, we develop an approach based on Owen's empirical likelihood [19] and ideas from distributionally robust optimization [4, 5, 10] that, whenever the loss is convex, provides a tractable convex formulation closely approximating the penalized risk (3). We give a number of theoretical guarantees and empirical evidence for its performance.

To describe our approach, we require a few definitions. For a convex function $\phi : \mathbb{R}_+ \to \mathbb{R}$ with $\phi(1) = 0$,

$$D_\phi(P \,\|\, Q) = \int \phi\Big(\frac{dP}{dQ}\Big)\, dQ$$

is the $\phi$-divergence between distributions $P$ and $Q$ defined on $\mathcal{X}$. Throughout this paper we use $\phi(t) = \frac{1}{2}(t-1)^2$, which gives the $\chi^2$-divergence. Given $\phi$ and an i.i.d. sample $X_1, \ldots, X_n$, we define the $\rho/n$-neighborhood of the empirical distribution

$$\mathcal{P}_n := \Big\{ \text{distributions } P \text{ such that } D_\phi(P \,\|\, \hat P_n) \le \frac{\rho}{n} \Big\},$$

where $\hat P_n$ denotes the empirical distribution of the sample $\{X_i\}_{i=1}^n$; our choice $\phi(t) = \frac{1}{2}(t-1)^2$ means that every $P \in \mathcal{P}_n$ is supported on $\{X_i\}_{i=1}^n$.
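To make the divergence ball concrete, here is a small Python sketch (our illustration, not part of the paper or its code release) that computes $D_\phi(P \,\|\, \hat P_n)$ for $\phi(t) = \frac{1}{2}(t-1)^2$ when $P$ is represented by a weight vector $p$ on the sample points, and checks membership in $\mathcal{P}_n$.

```python
# Illustrative sketch: chi-squared divergence to the empirical distribution and
# membership in the uncertainty set P_n = {P : D_phi(P || P_hat_n) <= rho / n}.
import numpy as np

def chi2_divergence(p):
    """D_phi(P || P_hat_n) = (1/n) sum_i phi(n p_i) = (n/2) ||p - 1/n||_2^2."""
    n = len(p)
    return 0.5 * n * np.sum((p - 1.0 / n) ** 2)

def in_uncertainty_set(p, rho, tol=1e-9):
    n = len(p)
    return bool(np.all(p >= -tol)
                and abs(p.sum() - 1.0) <= tol
                and chi2_divergence(p) <= rho / n + tol)

if __name__ == "__main__":
    n, rho = 100, 5.0
    uniform = np.full(n, 1.0 / n)
    print(in_uncertainty_set(uniform, rho))          # True: P_hat_n itself is in P_n
    tilted = uniform + 5e-4 * (np.arange(n) - (n - 1) / 2) / n
    print(chi2_divergence(tilted), rho / n)          # small perturbations stay inside
```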
We then define the robustly regularized risk

$$R_n(\theta, \mathcal{P}_n) := \sup_{P \in \mathcal{P}_n} \mathbb{E}_P[\ell(\theta; X)] = \sup_P \Big\{ \mathbb{E}_P[\ell(\theta; X)] : D_\phi(P \,\|\, \hat P_n) \le \frac{\rho}{n} \Big\}. \qquad (4)$$

As the supremum of a family of convex functions, the robust risk $\theta \mapsto R_n(\theta, \mathcal{P}_n)$ is convex in $\theta$, regardless of the value of $\rho \ge 0$, whenever the original loss $\ell(\cdot\,; X)$ is convex and $\Theta$ is a convex set. Namkoong and Duchi [18] propose a stochastic procedure for minimizing (4) that is almost as fast as stochastic gradient descent; see Appendix C for a detailed account of an alternative method.

We show that the robust risk (4) provides an excellent surrogate for the variance-regularized quantity (3) in a number of ways. Our first result (Theorem 1 in Section 2) is that for bounded loss functions,

$$R_n(\theta, \mathcal{P}_n) = \mathbb{E}_{\hat P_n}[\ell(\theta; X)] + \sqrt{\frac{2\rho}{n}\operatorname{Var}_{\hat P_n}(\ell(\theta; X))} + \varepsilon_n(\theta), \qquad (5)$$

where $\varepsilon_n(\theta) \le 0$ and is $O(1/n)$ uniformly in $\theta$. We show that when $\ell(\theta; X)$ has suitably large variance, we have $\varepsilon_n(\theta) = 0$ with high probability. With the expansion (5) in hand, we can show a number of finite-sample convergence guarantees for the robustly regularized estimator

$$\hat\theta^{\mathrm{rob}}_n \in \underset{\theta\in\Theta}{\operatorname{argmin}}\ \sup_P\Big\{ \mathbb{E}_P[\ell(\theta; X)] : D_\phi(P\,\|\,\hat P_n) \le \frac{\rho}{n}\Big\}. \qquad (6)$$

Based on the expansion (5), solutions $\hat\theta^{\mathrm{rob}}_n$ of problem (6) enjoy automatic finite-sample optimality certificates: for $\rho \ge 0$, with probability at least $1 - C_1\exp(-\rho)$ we have

$$\mathbb{E}[\ell(\hat\theta^{\mathrm{rob}}_n; X)] \le R_n(\hat\theta^{\mathrm{rob}}_n; \mathcal{P}_n) + \frac{C_2\,\rho}{n} \le \inf_{\theta\in\Theta} R_n(\theta, \mathcal{P}_n) + \frac{C_2\,\rho}{n},$$

where $C_1, C_2$ are constants (which we specify) that depend on the loss $\ell$ and domain $\Theta$. That is, with high probability the robust solution has risk no worse than the optimal finite-sample robust objective, up to an $O(\rho/n)$ error term. To guarantee a desired level of risk performance with probability $1 - \delta$, we may specify the robustness penalty $\rho = O(\log\frac{1}{\delta})$.

Secondly, we show that the procedure (6) allows us to automatically and near-optimally trade between approximation and estimation error (bias and variance), so that

$$\mathbb{E}[\ell(\hat\theta^{\mathrm{rob}}_n; X)] \le \inf_{\theta\in\Theta}\Big\{ \mathbb{E}[\ell(\theta; X)] + 2\sqrt{\frac{2\rho}{n}\operatorname{Var}(\ell(\theta; X))}\Big\}$$

with high probability. When there are parameters $\theta$ with small risk $R(\theta)$ (relative to the optimal parameter $\theta^\star$) and small variance $\operatorname{Var}(\ell(\theta; X))$, this guarantees that the excess risk $R(\hat\theta^{\mathrm{rob}}_n) - R(\theta^\star)$ is essentially of order $O(\rho/n)$, where $\rho$ governs our desired confidence level. We give an explicit example in Section 3.2 where our robustly regularized procedure (6) converges at rate $O(\log n/n)$, compared to the $O(1/\sqrt{n})$ rate of empirical risk minimization.

Bounds that trade between risk and variance are known in a number of cases in the empirical risk minimization literature [15, 22, 2, 1, 6, 3, 7, 12], and they are relevant when one wishes to achieve fast rates of convergence for statistical learning algorithms. In many cases, such tradeoffs require either conditions such as the Mammen-Tsybakov noise condition [15, 6] or localization results [3, 2, 17] made possible by curvature conditions that relate the risk and variance. The robust solutions (6) enjoy a variance-risk tradeoff that is different, but one that holds essentially without conditions except compactness of $\Theta$. We show in Section 3.3 that the robust solutions also enjoy fast rates of convergence under typical curvature conditions on the risk $R$.

We complement our theoretical results in Section 4, where we conclude by providing two experiments comparing empirical risk minimization (ERM) strategies to robustly regularized risk minimization (6). These results validate our theoretical predictions, showing that the robust solutions are a practical alternative to empirical risk minimization. In particular, we observe that the robust solutions outperform their ERM counterparts on harder instances with higher variance. In classification problems, for example, the robustly regularized estimators exhibit an interesting tradeoff: they improve performance on rare classes (where ERM usually sacrifices performance to improve the common cases, slightly increasing variance) at minor cost in performance on the common classes.
## 2 Variance Expansion

We begin our study of the robustly regularized empirical risk $R_n(\theta, \mathcal{P}_n)$ by showing that it is a good approximation to the empirical risk plus a variance term (5). Although the variance of the loss is in general non-convex in $\theta$, the robust formulation (6) is a convex optimization problem for variance regularization whenever the loss function is convex [cf. 11, Prop. 2.1.2].

To gain intuition for the variance expansion that follows, we consider the following equivalent formulation of the robust objective $\sup_{P\in\mathcal{P}_n} \mathbb{E}_P[Z]$:

$$\underset{p}{\text{maximize}}\ \sum_{i=1}^n p_i z_i \quad \text{subject to} \quad p \in \mathcal{P}_n := \Big\{ p \in \mathbb{R}^n_+ : \tfrac{1}{2}\|np - \mathbf{1}\|_2^2 \le \rho,\ \langle \mathbf{1}, p\rangle = 1 \Big\}, \qquad (7)$$

where $z \in \mathbb{R}^n$ is a vector. For simplicity, let $s_n^2 := \frac{1}{n}\sum_{i=1}^n (z_i - \bar z)^2$ denote the empirical variance of the vector $z$, where $\bar z = \frac{1}{n}\langle \mathbf{1}, z\rangle$ is its mean value. Then, introducing the variable $u = p - \frac{1}{n}\mathbf{1}$, the objective in problem (7) satisfies $\langle p, z\rangle = \bar z + \langle u, z\rangle = \bar z + \langle u, z - \bar z\mathbf{1}\rangle$ because $\langle u, \mathbf{1}\rangle = 0$. Thus problem (7) is equivalent to solving

$$\underset{u\in\mathbb{R}^n}{\text{maximize}}\ \bar z + \langle u, z - \bar z\mathbf{1}\rangle \quad \text{subject to} \quad \|u\|_2^2 \le \frac{2\rho}{n^2},\ \langle \mathbf{1}, u\rangle = 0,\ u \ge -\frac{1}{n}\mathbf{1}.$$

Notably, by the Cauchy-Schwarz inequality we have $\langle u, z - \bar z\mathbf{1}\rangle \le \sqrt{2\rho/n^2}\,\|z - \bar z\mathbf{1}\|_2 = \sqrt{2\rho s_n^2/n}$, and equality is attained if and only if $u = \sqrt{2\rho/n^2}\,(z - \bar z\mathbf{1})/\|z - \bar z\mathbf{1}\|_2$. Of course, it is possible to choose such $u_i$ while satisfying the constraint $u_i \ge -1/n$ if and only if

$$\sqrt{2\rho}\,(\bar z - z_i) \le \sqrt{n}\, s_n \quad \text{for all } i = 1, \ldots, n. \qquad (8)$$

Thus, if inequality (8) holds for the vector $z$, that is, if there is enough variance in $z$, we have

$$\sup_{p\in\mathcal{P}_n} \langle p, z\rangle = \bar z + \sqrt{\frac{2\rho\, s_n^2}{n}}.$$

For losses $\ell(\theta, \cdot)$ with enough variance relative to the deviations $\ell(\theta, X_i) - \mathbb{E}_{\hat P_n}[\ell(\theta, X)]$, that is, those satisfying inequality (8), we therefore have

$$R_n(\theta, \mathcal{P}_n) = \mathbb{E}_{\hat P_n}[\ell(\theta, X)] + \sqrt{\frac{2\rho}{n}\operatorname{Var}_{\hat P_n}(\ell(\theta, X))}.$$

A slight elaboration of this argument, coupled with the application of a few concentration inequalities, yields the next theorem. Recall that $\phi(t) = \frac{1}{2}(t-1)^2$ in our definition of the $\phi$-divergence.

**Theorem 1.** Let $Z$ be a random variable taking values in $[M_0, M_1]$, where $M = M_1 - M_0$, and fix $\rho \ge 0$. Then

$$\sqrt{\frac{2\rho}{n}\operatorname{Var}_{\hat P_n}(Z)} - \frac{2M\rho}{n} \;\le\; \sup_{P : D_\phi(P\|\hat P_n) \le \rho/n} \mathbb{E}_P[Z] - \mathbb{E}_{\hat P_n}[Z] \;\le\; \sqrt{\frac{2\rho}{n}\operatorname{Var}_{\hat P_n}(Z)}. \qquad (9)$$

If additionally $n \ge \max\{24\rho/\operatorname{Var}(Z),\, 16\rho/\operatorname{Var}(Z),\, 1\}\, M^2$ and we set $t_n$ as in the proof (Appendix A.1), then

$$\sup_{P : D_\phi(P\|\hat P_n) \le \rho/n} \mathbb{E}_P[Z] = \mathbb{E}_{\hat P_n}[Z] + \sqrt{\frac{2\rho}{n}\operatorname{Var}_{\hat P_n}(Z)} \qquad (10)$$

with probability at least $1 - \exp\big(-\frac{n t_n^2}{2M^2}\big) - \exp\big(-\frac{n\operatorname{Var}(Z)}{2M^2}\big)$.

See Appendix A.1 for the proof. Inequality (9) and the exact expansion (10) show that, at least for bounded loss functions $\ell$, the robustly regularized risk (4) is a natural (and convex) surrogate for the empirical risk plus the standard deviation of the loss, and the robust formulation approximates exact variance regularization with a convex penalty.
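The derivation above is straightforward to check numerically. The sketch below (our own illustration, not code from the paper) solves the inner maximization in (7) through its Lagrangian dual, using a standard Euclidean projection onto the simplex (cf. [9]), and compares the result to the closed form $\bar z + \sqrt{2\rho s_n^2/n}$ whenever condition (8) holds.

```python
# Illustrative check of the variance expansion: exact value of
#   sup { <p, z> : p in simplex, (n/2) ||p - 1/n||^2 <= rho / n }
# versus the closed form mean + sqrt(2 rho s_n^2 / n) under condition (8).
import numpy as np
from scipy.optimize import minimize_scalar

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    tau = (css[idx] - 1.0) / (idx + 1.0)
    return np.maximum(v - tau, 0.0)

def robust_value_exact(z, rho):
    """Solve the inner sup via its (convex, one-dimensional) Lagrangian dual."""
    n = len(z)
    radius_sq = 2.0 * rho / n ** 2            # ||p - 1/n||_2^2 <= 2 rho / n^2

    def dual(lam):
        p = project_simplex(1.0 / n + z / (2.0 * lam))
        return p @ z - lam * (np.sum((p - 1.0 / n) ** 2) - radius_sq)

    hi = 10.0 * n ** 1.5 * (np.ptp(z) + 1.0) / np.sqrt(2.0 * rho)
    lam = minimize_scalar(dual, bounds=(1e-10, hi), method="bounded").x
    p_star = project_simplex(1.0 / n + z / (2.0 * lam))
    return p_star @ z

def robust_value_closed_form(z, rho):
    n, zbar, s = len(z), z.mean(), z.std()
    assert np.sqrt(2 * rho) * (zbar - z.min()) <= np.sqrt(n) * s, "condition (8) fails"
    return zbar + np.sqrt(2.0 * rho * s ** 2 / n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z, rho = rng.uniform(0.0, 2.0, size=200), 1.0
    print(robust_value_exact(z, rho), robust_value_closed_form(z, rho))  # should agree
```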
We also provide a uniform variant of Theorem 1 based on the standard notion of covering numbers, which we now define. Let $\mathcal{V}$ be a vector space with (semi)norm $\|\cdot\|$ on $\mathcal{V}$, and let $V \subset \mathcal{V}$. We say a collection $v_1, \ldots, v_N \subset \mathcal{V}$ is an $\epsilon$-cover of $V$ if for each $v \in V$ there exists $v_i$ such that $\|v - v_i\| \le \epsilon$. The covering number of $V$ with respect to $\|\cdot\|$ is then

$$N(V, \epsilon, \|\cdot\|) := \inf\{ N \in \mathbb{N} : \text{there is an } \epsilon\text{-cover of } V \text{ with respect to } \|\cdot\| \}.$$

Now, let $\mathcal{F}$ be a collection of functions $f : \mathcal{X} \to \mathbb{R}$, and define the $L^\infty(\mathcal{X})$-norm by $\|f - g\|_{L^\infty(\mathcal{X})} := \sup_{x\in\mathcal{X}} |f(x) - g(x)|$. Although we state our results abstractly, we typically take $\mathcal{F} := \{\ell(\theta, \cdot) : \theta \in \Theta\}$. As a motivating example, we give the following standard bound on the covering number of Lipschitz losses [24].

**Example 1.** Let $\Theta \subset \mathbb{R}^d$ and assume that $\ell : \Theta \times \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz in $\theta$ with respect to the $\ell_2$-norm for all $x \in \mathcal{X}$, meaning that $|\ell(\theta, x) - \ell(\theta', x)| \le L\|\theta - \theta'\|_2$. Then, taking $\mathcal{F} = \{\ell(\theta, \cdot) : \theta \in \Theta\}$, any $\epsilon/L$-covering $\{\theta_1, \ldots, \theta_N\}$ of $\Theta$ in $\ell_2$-norm guarantees that $\min_i |\ell(\theta, x) - \ell(\theta_i, x)| \le \epsilon$ for all $\theta, x$. That is,

$$N(\mathcal{F}, \epsilon, \|\cdot\|_{L^\infty(\mathcal{X})}) \le N(\Theta, \epsilon/L, \|\cdot\|_2) \le \Big(1 + \frac{\operatorname{diam}(\Theta) L}{\epsilon}\Big)^d,$$

where $\operatorname{diam}(\Theta) = \sup_{\theta, \theta' \in \Theta} \|\theta - \theta'\|_2$. Thus $\ell_2$-covering numbers of $\Theta$ control $L^\infty$-covering numbers of the family $\mathcal{F}$.

With this definition in place, we provide a result showing that the variance expansion (5) holds uniformly for all functions with enough variance.

**Theorem 2.** Let $\mathcal{F}$ be a collection of bounded functions $f : \mathcal{X} \to [M_0, M_1]$, where $M = M_1 - M_0$, and let $\tau \ge 0$ be a constant. Define $\mathcal{F}_\tau := \{ f \in \mathcal{F} : \operatorname{Var}(f) \ge \tau^2\rho/n \}$. If $\tau^2 \ge 32 M^2 \rho / n$, then with probability at least $1 - N(\mathcal{F}, \tau/32, \|\cdot\|_{L^\infty(\mathcal{X})})\, e^{-\rho}$ we have, for all $f \in \mathcal{F}_\tau$,

$$\sup_{P : D_\phi(P\|\hat P_n) \le \rho/n} \mathbb{E}_P[f(X)] = \mathbb{E}_{\hat P_n}[f(X)] + \sqrt{\frac{2\rho}{n}\operatorname{Var}_{\hat P_n}(f(X))}. \qquad (11)$$

We prove the theorem in Section A.2. Theorem 2 shows that the variance expansion of Theorem 1 holds uniformly for all functions $f$ with sufficient variance. See Duchi, Glynn, and Namkoong [10] for an asymptotic analogue of the equality (11) for heavier-tailed random variables.

## 3 Optimization by Minimizing the Robust Loss

Based on the variance expansions of the preceding section, we show that the robust solution (6) automatically trades between approximation and estimation error. In addition to the $\|\cdot\|_{L^\infty(\mathcal{X})}$-covering numbers defined in the previous section, we use the tighter notion of empirical $\ell_\infty$-covering numbers. For $x \in \mathcal{X}^n$, define $\mathcal{F}(x) := \{(f(x_1), \ldots, f(x_n)) : f \in \mathcal{F}\}$ and the empirical $\ell_\infty$-covering numbers

$$N_\infty(\mathcal{F}, \epsilon, n) := \sup_{x \in \mathcal{X}^n} N(\mathcal{F}(x), \epsilon, \|\cdot\|_\infty),$$

which bound the number of $\ell_\infty$-balls of radius $\epsilon$ required to cover $\mathcal{F}(x)$. Note that we always have $N_\infty(\mathcal{F}, \epsilon, n) \le N(\mathcal{F}, \epsilon, \|\cdot\|_{L^\infty(\mathcal{X})})$. Typically, we consider the function class $\mathcal{F} := \{\ell(\theta, \cdot) : \theta \in \Theta\}$, though we state our minimization results abstractly. Although the result below is stated in terms of covering numbers for ease of exposition, a variant holds depending on localized Rademacher averages [2] of the class $\mathcal{F}$, which can yield tighter guarantees (we omit such results for lack of space). We prove the following theorem in Section A.3.

**Theorem 3.** Let $\mathcal{F}$ be a collection of functions $f : \mathcal{X} \to [M_0, M_1]$ with $M = M_1 - M_0$. Define the empirical minimizer

$$\hat f \in \underset{f\in\mathcal{F}}{\operatorname{argmin}}\ \sup_P\Big\{ \mathbb{E}_P[f(X)] : D_\phi(P\,\|\,\hat P_n) \le \frac{\rho}{n}\Big\}.$$

Then for $t \ge 0$, with probability at least $1 - 2(N(\mathcal{F}, \epsilon, \|\cdot\|_{L^\infty(\mathcal{X})}) + 1)e^{-t}$,

$$\mathbb{E}[\hat f(X)] \le \sup_{P : D_\phi(P\|\hat P_n)\le\rho/n} \mathbb{E}_P[\hat f(X)] + \frac{7Mt}{n} + 2\epsilon. \qquad \text{(13a)}$$

Further, for $n \ge 8M^2 t$, $t \ge \log 12$, and $\rho \ge 9t$, with probability at least $1 - 2(3N_\infty(\mathcal{F}, \epsilon, 2n) + 1)e^{-t}$,

$$\mathbb{E}[\hat f(X)] \le \sup_{P : D_\phi(P\|\hat P_n)\le\rho/n} \mathbb{E}_P[\hat f(X)] + \frac{11M\rho}{n} \le \inf_{f\in\mathcal{F}}\Big\{ \mathbb{E}[f(X)] + 2\sqrt{\frac{2\rho}{n}\operatorname{Var}(f(X))}\Big\} + \frac{11M\rho}{n} + 2\epsilon. \qquad \text{(13b)}$$

Unlike analogous results for empirical risk minimization [6], Theorem 3 does not require the self-bounding-type assumption $\operatorname{Var}(f) \le B\,\mathbb{E}[f]$. A consequence of this is that when $v = \operatorname{Var}(f^\star)$ is small, where $f^\star \in \operatorname{argmin}_{f\in\mathcal{F}} \mathbb{E}[f]$, we achieve fast rates of convergence of order $O(1/n + \sqrt{v/n})$. This condition is different from the typical conditions required for empirical risk minimization to attain fast rates of convergence, highlighting the possibilities of variance-based regularization. It will be interesting to understand appropriate low-noise conditions (e.g., the Mammen-Tsybakov noise condition [15, 6]) guaranteeing good performance. Additionally, the robust objective $R_n(\theta, \mathcal{P}_n)$ is an empirical likelihood confidence bound on the population risk [10], and as empirical likelihood confidence bounds are self-normalizing [19], other fast-rate generalizations may exist.
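As a concrete illustration of minimizing the robust loss, the following sketch (ours; it is neither the stochastic procedure of Namkoong and Duchi [18] nor the method of Appendix C) runs a full-batch gradient method for robustly regularized logistic regression. At each iteration it forms worst-case weights $p^\star$ attaining the inner supremum; by Danskin's theorem, $\sum_i p^\star_i \nabla_\theta \ell(\theta; x_i)$ is then a (sub)gradient of the robust objective. The closed-form weights used below are exact whenever condition (8) holds.

```python
# Illustrative full-batch gradient method for min_theta R_n(theta, P_n) with the
# binary logistic loss. Names and hyperparameters are ours, for illustration only.
import numpy as np
from scipy.special import expit

def logistic_losses_grads(theta, X, y):
    margins = y * (X @ theta)
    losses = np.logaddexp(0.0, -margins)                   # log(1 + exp(-y x^T theta))
    grads = -expit(-margins)[:, None] * (y[:, None] * X)   # per-example gradients
    return losses, grads

def worst_case_weights(losses, rho):
    """Maximizer of the inner sup in closed form; exact when condition (8) holds."""
    n = len(losses)
    centered = losses - losses.mean()
    s = centered.std()
    if s < 1e-12:                       # zero empirical variance: any feasible p is optimal
        return np.full(n, 1.0 / n)
    assert np.sqrt(2 * rho) * (-centered.min()) <= np.sqrt(n) * s + 1e-12, \
        "condition (8) fails; use a smaller rho or an exact solver for the inner sup"
    return 1.0 / n + np.sqrt(2.0 * rho / n) * centered / (n * s)

def robust_logistic_fit(X, y, rho=5.0, lr=0.2, iters=300):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        losses, grads = logistic_losses_grads(theta, X, y)
        p = worst_case_weights(losses, rho)
        theta -= lr * (p @ grads)       # (sub)gradient step on the robust objective
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 500, 5
    X = rng.normal(size=(n, d))
    y = np.sign(X @ rng.normal(size=d) + 0.5 * rng.normal(size=n))
    theta = robust_logistic_fit(X, y)
    losses, _ = logistic_losses_grads(theta, X, y)
    print("empirical risk:", losses.mean(),
          "robust surrogate:", losses.mean() + np.sqrt(2 * 5.0 / n) * losses.std())
```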
### 3.1 Consequences of Theorem 3

We now turn to a number of corollaries that expand on Theorem 3 and investigate its consequences. Our first corollary shows that Theorem 3 applies to standard Vapnik-Chervonenkis (VC) classes. As VC dimension is preserved under composition, this result also extends to the procedure (6) in typical empirical risk minimization scenarios. See Section A.4 for its proof.

**Corollary 3.1.** In addition to the conditions of Theorem 3, let $\mathcal{F}$ have finite VC-dimension $\mathrm{VC}(\mathcal{F})$. Then for a numerical constant $c < \infty$, the bounds (13) hold with probability at least $1 - c\,(n+1)^{2\mathrm{VC}(\mathcal{F})} e^{-t}$.

Next, we focus more explicitly on the estimator $\hat\theta^{\mathrm{rob}}_n$ defined by minimizing the robustly regularized risk (6). Let us assume that $\Theta \subset \mathbb{R}^d$ and that we have a typical linear modeling situation, in which a loss $h$ is applied to an inner product, that is, $\ell(\theta, x) = h(\theta^\top x)$. In this case, taking the class $\mathcal{F} = \{\ell(\theta, \cdot) : \theta \in \Theta\}$ in Corollary 3.1, we have $\mathrm{VC}(\mathcal{F}) \le d$, and we obtain the following corollary. Recall the definition (1) of the population risk $R(\theta) = \mathbb{E}[\ell(\theta, X)]$, the uncertainty set $\mathcal{P}_n = \{P : D_\phi(P\,\|\,\hat P_n) \le \rho/n\}$, and that $R_n(\theta, \mathcal{P}_n) = \sup_{P\in\mathcal{P}_n} \mathbb{E}_P[\ell(\theta, X)]$. Setting $\epsilon = M/n$ in Corollary 3.1 yields the following result.

**Corollary 3.2.** Let the conditions of the previous paragraph hold and assume that $\ell(\theta, x) \in [0, M]$ for all $\theta \in \Theta$, $x \in \mathcal{X}$. Then if $\rho \ge 9\log 12$,

$$R(\hat\theta^{\mathrm{rob}}_n) \le R_n(\hat\theta^{\mathrm{rob}}_n, \mathcal{P}_n) + \frac{11M\rho}{n} \le \inf_{\theta\in\Theta}\Big\{ R(\theta) + 2\sqrt{\frac{2\rho}{n}\operatorname{Var}(\ell(\theta; X))}\Big\} + \frac{11M\rho}{n}$$

with probability at least $1 - 2\exp(c_1 d\log n - c_2\rho)$, where the $c_i$ are universal constants with $c_2 \ge 1/9$.

Unpacking Theorem 3 and Corollary 3.2 a bit, the first result (13a) provides a high-probability guarantee that the true expectation $\mathbb{E}[\hat f]$ cannot be more than $O(1/n)$ worse than its robustly regularized empirical counterpart, that is, $R(\hat\theta^{\mathrm{rob}}_n) \le R_n(\hat\theta^{\mathrm{rob}}_n, \mathcal{P}_n) + O(\rho/n)$, which is (roughly) a consequence of uniform variants of Bernstein's inequality. The second result (13b) guarantees the convergence of the empirical minimizer to a parameter with risk at most $O(\rho/n)$ larger than the best possible variance-corrected risk. In the case that the losses take values in $[0, M]$, we have $\operatorname{Var}(\ell(\theta, X)) \le M R(\theta)$, and thus for $\epsilon = 1/n$ in Theorem 3 we obtain

$$R(\hat\theta^{\mathrm{rob}}_n) \le R(\theta^\star) + C\Big(\sqrt{\frac{\rho M R(\theta^\star)}{n}} + \frac{\rho M}{n}\Big),$$

a type of result well known and achieved by empirical risk minimization for bounded nonnegative losses [6, 26, 25]. In some scenarios, however, the variance may satisfy $\operatorname{Var}(\ell(\theta, X)) \ll M R(\theta)$, yielding improvements.

To give an alternative variant of Corollary 3.2, let $\Theta \subset \mathbb{R}^d$ and assume that for each $x \in \mathcal{X}$, $\inf_{\theta\in\Theta} \ell(\theta, x) = 0$ and that $\ell$ is $L$-Lipschitz in $\theta$. If $D := \operatorname{diam}(\Theta) = \sup_{\theta,\theta'\in\Theta}\|\theta - \theta'\| < \infty$, then $0 \le \ell(\theta, x) \le L\operatorname{diam}(\Theta) =: M$.

**Corollary 3.3.** Let the conditions of the preceding paragraph hold. Set $t = \rho = \log(2n) + d\log(2nDL)$ and $\epsilon = 1/n$ in Theorem 3, and assume that $D \lesssim n^k$ and $L \lesssim n^k$ for a numerical constant $k$. With probability at least $1 - 1/n$,

$$\mathbb{E}[\ell(\hat\theta^{\mathrm{rob}}_n; X)] = R(\hat\theta^{\mathrm{rob}}_n) \le \inf_{\theta\in\Theta}\Big\{ R(\theta) + 2\sqrt{\frac{2\rho}{n}\operatorname{Var}(\ell(\theta, X))}\Big\} + C\,\frac{d\,LD\log n}{n},$$

where $C$ is a numerical constant.

### 3.2 Beating empirical risk minimization

We now provide an example in which the robustly regularized estimator (6) exhibits a substantial improvement over empirical risk minimization. We expect the robust approach to offer performance benefits in situations in which the empirical risk minimizer is highly sensitive to noise, say because the losses are piecewise linear and slight under- or over-estimates of slope may significantly degrade solution quality. With this in mind, we construct a toy one-dimensional example of estimating the median of a distribution supported on $\mathcal{X} = \{-1, 0, 1\}$, in which the robustly regularized estimator has convergence rate $\log n/n$, while empirical risk minimization converges at a rate no faster than $1/\sqrt{n}$.
Define the loss $\ell(\theta; x) = |\theta - x| - |x|$, and for $\delta \in (0, 1)$ let the distribution $P$ be defined by

$$P(X = 1) = \frac{1-\delta}{2}, \quad P(X = -1) = \frac{1-\delta}{2}, \quad P(X = 0) = \delta.$$

Then for $\theta \in \mathbb{R}$, the risk of the loss is

$$R(\theta) = \delta|\theta| + \frac{1-\delta}{2}|\theta - 1| + \frac{1-\delta}{2}|\theta + 1| - (1 - \delta).$$

By symmetry, it is clear that $\theta^\star := \operatorname{argmin}_\theta R(\theta) = 0$, which satisfies $R(\theta^\star) = 0$. (Note that $\ell(\theta, x) = |\theta - x| - |\theta^\star - x|$.) Without loss of generality, we assume that $\Theta = [-1, 1]$. Define the empirical risk minimizer and the robust solution

$$\hat\theta^{\mathrm{erm}}_n := \underset{\theta\in\Theta}{\operatorname{argmin}}\ \mathbb{E}_{\hat P_n}[\ell(\theta, X)] = \underset{\theta\in\Theta}{\operatorname{argmin}}\ \mathbb{E}_{\hat P_n}[|\theta - X|], \qquad \hat\theta^{\mathrm{rob}}_n := \underset{\theta\in\Theta}{\operatorname{argmin}}\ \sup_{P : D_\phi(P\|\hat P_n)\le\rho_n/n} \mathbb{E}_P[\ell(\theta, X)].$$

Intuitively, if too many of the observations satisfy $X_i = 1$, or too many satisfy $X_i = -1$, then $\hat\theta^{\mathrm{erm}}_n$ will be either $1$ or $-1$; for small $\delta$, such events become reasonably probable. On the other hand, we have $\ell(\theta^\star; x) = 0$ for all $x \in \mathcal{X}$, so that $\operatorname{Var}(\ell(\theta^\star; X)) = 0$, and variance regularization achieves the rate $O(\log n/n)$ as opposed to the empirical risk minimizer's $O(1/\sqrt{n})$. See Section A.6 for the proof of the following result.

**Proposition 1.** Under the conditions of the previous paragraph, take $\delta = \sqrt{2/n}$ and $\rho_n = 3\log n$. Then with probability at least $1 - 4/n$,

$$R(\hat\theta^{\mathrm{rob}}_n) - R(\theta^\star) \le \frac{45\log n}{n}.$$

However, with probability at least $2\Phi(-\sqrt{2}) - O(n^{-1/2})$, we have

$$R(\hat\theta^{\mathrm{erm}}_n) \ge R(\theta^\star) + n^{-1/2}.$$

For $n \ge 20$, the probability of the latter event is at least $.088$.

Hence, for this (specially constructed) example, we see that there is a gap of nearly $\sqrt{n}$ in the order of convergence.
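A small simulation of this construction illustrates the gap. The sketch below is our own illustration: it takes $\delta = \sqrt{2/n}$ and $\rho_n = 3\log n$ as in Proposition 1, and uses the variance-regularized upper bound from inequality (9) as the robust objective, which coincides with the exact robust risk whenever condition (8) holds.

```python
# Illustrative simulation of the median example: ERM versus the variance-regularized
# surrogate for the robust objective, on Theta = [-1, 1]. Parameter choices are ours.
import numpy as np

def simulate(n=200, trials=500, seed=0):
    rng = np.random.default_rng(seed)
    delta, rho = np.sqrt(2.0 / n), 3.0 * np.log(n)
    grid = np.linspace(-1.0, 1.0, 401)                  # candidate theta values
    erm_excess, rob_excess = [], []
    for _ in range(trials):
        X = rng.choice([-1.0, 0.0, 1.0], size=n,
                       p=[(1 - delta) / 2, delta, (1 - delta) / 2])
        losses = np.abs(grid[:, None] - X[None, :]) - np.abs(X)[None, :]
        means, stds = losses.mean(axis=1), losses.std(axis=1)
        theta_erm = grid[np.argmin(means)]
        theta_rob = grid[np.argmin(means + np.sqrt(2.0 * rho / n) * stds)]
        # On [-1, 1] the risk simplifies to R(theta) = delta * |theta|, so the
        # excess risk over theta_star = 0 is delta * |theta|.
        erm_excess.append(delta * abs(theta_erm))
        rob_excess.append(delta * abs(theta_rob))
    return np.mean(erm_excess), np.mean(rob_excess)

if __name__ == "__main__":
    erm, rob = simulate()
    print(f"average excess risk, ERM: {erm:.4f}, robust: {rob:.4f}")
```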
### 3.3 Fast Rates

In cases in which the risk $R$ has curvature, empirical risk minimization often enjoys faster rates of convergence [6, 21]. The robust solution $\hat\theta^{\mathrm{rob}}_n$ similarly attains faster rates of convergence in such cases, even with approximate minimizers of $R_n(\theta, \mathcal{P}_n)$. For the risk $R$ and $\epsilon \ge 0$, let $S_\epsilon := \{\theta \in \Theta : R(\theta) \le \inf_{\theta^\star\in\Theta} R(\theta^\star) + \epsilon\}$ denote the $\epsilon$-sub-optimal (solution) set, and similarly let $\hat S_\epsilon := \{\theta \in \Theta : R_n(\theta, \mathcal{P}_n) \le \inf_{\theta'\in\Theta} R_n(\theta', \mathcal{P}_n) + \epsilon\}$. We write $S := S_0$, and for a vector $\theta \in \Theta$ we let $\pi_S(\theta) := \operatorname{argmin}_{\theta^\star\in S} \|\theta^\star - \theta\|_2$ denote the Euclidean projection of $\theta$ onto the set $S$.

The result below depends on a local notion of Rademacher complexity. For i.i.d. random signs $\varepsilon_i \in \{\pm 1\}$, the empirical Rademacher complexity of a function class $\mathcal{F} \subset \{f : \mathcal{X} \to \mathbb{R}\}$ is

$$\mathcal{R}_n\mathcal{F} := \mathbb{E}_\varepsilon\Big[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \,\Big|\, X_1, \ldots, X_n\Big].$$

Although we state our results abstractly, we typically take $\mathcal{F} := \{\ell(\theta, \cdot) : \theta \in \Theta\}$. For example, when $\mathcal{F}$ is a VC class, we typically have $\mathbb{E}[\mathcal{R}_n\mathcal{F}] \lesssim \sqrt{\mathrm{VC}(\mathcal{F})/n}$. Many other bounds on $\mathbb{E}[\mathcal{R}_n\mathcal{F}]$ are possible [1, 24, Ch. 2]. For $A \subset \Theta$, let $\mathcal{R}_n(A)$ denote the Rademacher complexity of the localized process $\{x \mapsto \ell(\theta; x) - \ell(\pi_S(\theta); x) : \theta \in A\}$. We then have the following result, whose proof we provide in Section A.7.

**Theorem 4.** Let $\Theta \subset \mathbb{R}^d$ be convex and let $\ell(\cdot\,; x)$ be convex and $L$-Lipschitz for all $x \in \mathcal{X}$. For constants $\lambda > 0$, $\gamma > 1$, and $r > 0$, assume that $R$ satisfies the growth condition

$$R(\theta) - \inf_{\theta^\star\in\Theta} R(\theta^\star) \ge \frac{\lambda}{2}\operatorname{dist}(\theta, S)^\gamma \quad \text{for all } \theta \text{ such that } \operatorname{dist}(\theta, S) \le r. \qquad (14)$$

Let $t > 0$. If $0 \le \epsilon \le \frac{1}{2}\lambda r^\gamma$ satisfies

$$\epsilon \ge 2\,\mathbb{E}[\mathcal{R}_n(S_{2\epsilon})] + L\sqrt{\frac{t + \rho}{n}}, \qquad (15)$$

then $\mathbb{P}(\hat S_\epsilon \subset S_{2\epsilon}) \ge 1 - e^{-t}$, and inequality (15) holds for all $\epsilon \gtrsim \big(\frac{L^2(t + \rho + d)}{\lambda n}\big)^{\frac{\gamma}{2(\gamma-1)}}$.

## 4 Experiments

We present two real-data classification experiments to carefully compare standard empirical risk minimization (ERM) with the variance-regularized approach we present. Empirically, we show that the ERM estimator $\hat\theta^{\mathrm{erm}}$ performs poorly on rare classes, which have (relatively) higher variance, while the robust solution achieves improved classification performance on rare instances. In all our experiments, this comes at little expense on the more common instances.

### 4.1 Protease cleavage experiments

For our first experiment, we compare our robust regularization procedure to other regularizers using the HIV-1 protease cleavage dataset from the UCI ML repository [14]. In this binary classification task, one is given a string of amino acids (a protein) and a featurized representation of the string of dimension $d = 50960$; the goal is to predict whether the HIV-1 virus will cleave the amino acid sequence in its central position. We have a sample of $n = 6590$ observations of this process, and the class labels are somewhat skewed: there are 1360 examples with label $Y = +1$ (HIV-1 cleaves) and 5230 examples with $Y = -1$ (does not cleave).

Figure 1: HIV-1 protease cleavage results (2-standard-error confidence bars), comparing misclassification test error rates among the different regularizers. (a) Overall test error. (b) Rare class ($Y_i = +1$). (c) Common class ($Y_i = -1$).

We use the logistic loss $\ell(\theta; (x, y)) = \log(1 + \exp(-y\,\theta^\top x))$. We compare the performance of different constraint sets by taking

$$\Theta = \{\theta \in \mathbb{R}^d : a_1\|\theta\|_1 + a_2\|\theta\|_2 \le r\},$$

which is equivalent to elastic net regularization [27], while varying $a_1$, $a_2$, and $r$. We experiment with $\ell_1$-constraints ($a_1 = 1$, $a_2 = 0$) with $r \in \{50, 100, 500, 1000, 5000\}$, $\ell_2$-constraints ($a_1 = 0$, $a_2 = 1$) with $r \in \{5, 10, 50, 100, 500\}$, elastic net ($a_1 = 1$, $a_2 = 10$) with $r \in \{10^2, 2\cdot 10^2, 10^3, 2\cdot 10^3, 10^4\}$, our robust regularizer with $\rho \in \{10^2, 10^3, 10^4, 5\cdot 10^4, 10^5\}$, and our robust regularizer coupled with the $\ell_1$-constraint ($a_1 = 1$, $a_2 = 0$) with $r = 100$. Though we train with a convex surrogate (the logistic loss), we measure the performance of the classifiers using the zero-one (misclassification) loss $\mathbf{1}\{\operatorname{sign}(\theta^\top x)\,y \le 0\}$.

For validation, we perform 50 experiments; in each experiment we randomly select 9/10 of the data to train the model and evaluate its performance on the held-out 1/10 (test). We plot results summarizing these experiments in Figure 1. The horizontal axis in each figure indexes our choice of regularization value (so Regularizer = 1 for the $\ell_1$-constrained problem corresponds to $r = 50$). The figures show that the robustly regularized risk provides a different type of protection against overfitting than standard regularization or constraint techniques do: while the other regularizers underperform in heavily constrained settings, the robustly regularized estimator $\hat\theta^{\mathrm{rob}}_n$ achieves low classification error for all values of $\rho$. Notably, even when coupled with a fairly stringent $\ell_1$-constraint ($r = 100$), robust regularization performs better than the $\ell_1$-constraint alone except for large values of $r$, especially on the rare label $Y = +1$.

We investigate the effects of the robust regularizer from a slightly different perspective in Table 1, where we use $\Theta = \{\theta : \|\theta\|_1 \le 100\}$ as the constraint set for each experiment. We give error rates and logistic risk values for the different procedures, averaged over 50 independent runs; all gaps are significant at the 3-standard-error level. We see that the ERM solutions achieve good performance on the common class ($Y = -1$) but sacrifice performance on the uncommon class. As we increase $\rho$, the performance of the robust solution $\hat\theta^{\mathrm{rob}}_n$ on the rarer label $Y = +1$ improves, while the error rate on the common class degrades by a small (insignificant) amount.

Table 1: HIV-1 cleavage error

|            | risk (train) | risk (test) | error % (train) | error % (test) | error % $Y{=}+1$ (train) | error % $Y{=}+1$ (test) | error % $Y{=}-1$ (train) | error % $Y{=}-1$ (test) |
|------------|--------------|-------------|-----------------|----------------|--------------------------|-------------------------|--------------------------|-------------------------|
| ERM        | 0.1587       | 0.1706      | 5.52            | 6.39           | 17.32                    | 18.79                   | 2.45                     | 3.17                    |
| ρ = 100    | 0.1623       | 0.1763      | 4.99            | 5.92           | 15.01                    | 17.04                   | 2.38                     | 3.02                    |
| ρ = 1000   | 0.1777       | 0.1944      | 4.5             | 5.92           | 13.35                    | 16.33                   | 2.19                     | 3.2                     |
| ρ = 10000  | 0.283        | 0.3031      | 2.39            | 5.67           | 7.18                     | 14.65                   | 1.15                     | 3.32                    |

### 4.2 Document classification in the Reuters corpus

For our second experiment, we consider a multi-label classification problem on a reasonably large dataset.
The Reuters RCV1 corpus [13] contains 804,414 examples with $d = 47{,}236$ features, where feature $j$ is an indicator for whether word $j$ appears in a given document. The goal is to classify documents into four categories (Corporate, Economics, Government, Markets), where each document may be labeled with any subset of the categories. As documents can belong to multiple categories, we fit a binary classifier for each of the four categories. The categories contain different numbers of documents (Corporate: 381,327; Economics: 119,920; Government: 239,267; Markets: 204,820).

Figure 2: Reuters corpus experiment. (a) Logistic risks. (b) Recall. (c) Recall on Economics (rare).

In this experiment, we expect the robust solution to outperform ERM on the rarer category (Economics), as the robustification (6) naturally upweights rarer (harder) instances, which disproportionately affect the variance, as in the previous experiment. For each category $k \in \{1, 2, 3, 4\}$, we use the logistic loss $\ell(\theta_k; (x, y)) = \log(1 + \exp(-y\,\theta_k^\top x))$. For each binary classifier, we use the $\ell_1$-constraint set $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le 1000\}$. To evaluate performance on this multi-label problem, we use precision (the ratio of the number of correct positive labels to the number of examples classified as positive) and recall (the ratio of the number of correct positive labels to the number of actual positive labels). We partition the data into ten equally sized sub-samples and perform ten validation experiments; in each experiment we use one of the ten subsets to fit the logistic models and the remaining nine partitions as a test set to evaluate performance.

In Figure 2 we summarize the results of our experiment averaged over the 10 runs, with 2-standard-error bars (computed across the folds). To facilitate comparison across the document categories, we give the exact values of these averages in Tables 2 and 3. Both $\hat\theta^{\mathrm{rob}}_n$ and $\hat\theta^{\mathrm{erm}}$ have reasonably high precision across all categories, with increasing $\rho$ giving a mild improvement in precision (from $.93 \pm .005$ to $.94 \pm .005$). On the other hand, we observe in Figure 2(c) that ERM has low recall (.69 on test) for the Economics category, which contains about 15% of the documents. As we increase $\rho$ from 0 (ERM) to $10^5$, we see a smooth and substantial improvement in recall for this rarer category, without significant degradation in precision. This improvement in recall amounts to reducing the variance of the predictions on the rare class. The improvement in precision and recall comes in spite of an increase in the average binary logistic risk for each of the four classes: in Figure 2(a) we plot the average binary logistic loss (on training and test sets), averaged over the four categories, as well as the upper confidence bound $R_n(\theta, \mathcal{P}_n)$, as we vary $\rho$. The variance-reducing effects of robust regularization appear to improve the performance of the binary logistic loss as a surrogate for the true error rate.
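For completeness, here is a small helper (ours, not from the paper's code release) computing per-category precision and recall exactly as defined above, given $\pm 1$ label and prediction matrices for the four categories.

```python
# Illustrative per-category precision/recall for the multi-label setup: labels and
# predictions are arrays of shape (n_examples, n_categories) with entries in {+1, -1}.
import numpy as np

def precision_recall(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    true_pos = np.sum((y_pred == 1) & (y_true == 1), axis=0)
    predicted_pos = np.sum(y_pred == 1, axis=0)
    actual_pos = np.sum(y_true == 1, axis=0)
    precision = true_pos / np.maximum(predicted_pos, 1)   # correct positives / predicted positives
    recall = true_pos / np.maximum(actual_pos, 1)         # correct positives / actual positives
    return precision, recall

if __name__ == "__main__":
    # Toy check with 4 categories (Corporate, Economics, Government, Markets).
    y_true = np.array([[1, -1, 1, -1], [1, 1, -1, -1], [-1, 1, 1, 1]])
    y_pred = np.array([[1, -1, -1, -1], [1, 1, -1, 1], [-1, -1, 1, 1]])
    p, r = precision_recall(y_true, y_pred)
    print("precision:", p, "recall:", r)
```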
Table 2: Reuters corpus precision (%)

| ρ | Precision (train) | Precision (test) | Corporate (train) | Corporate (test) | Economics (train) | Economics (test) | Government (train) | Government (test) | Markets (train) | Markets (test) |
|---|---|---|---|---|---|---|---|---|---|---|
| ERM | 92.72 | 92.7 | 93.55 | 93.55 | 89.02 | 89 | 94.1 | 94.12 | 92.88 | 92.94 |
| 10³ | 92.97 | 92.95 | 93.31 | 93.33 | 87.84 | 87.81 | 93.73 | 93.76 | 92.56 | 92.62 |
| 10⁴ | 93.45 | 93.45 | 93.58 | 93.61 | 87.6 | 87.58 | 93.77 | 93.8 | 92.71 | 92.75 |
| 10⁵ | 94.17 | 94.16 | 94.18 | 94.19 | 86.55 | 86.56 | 94.07 | 94.09 | 93.16 | 93.24 |
| 10⁶ | 91.2 | 91.19 | 92 | 92.02 | 74.81 | 74.8 | 91.19 | 91.25 | 89.98 | 90.18 |

Table 3: Reuters corpus recall (%)

| ρ | Recall (train) | Recall (test) | Corporate (train) | Corporate (test) | Economics (train) | Economics (test) | Government (train) | Government (test) | Markets (train) | Markets (test) |
|---|---|---|---|---|---|---|---|---|---|---|
| ERM | 90.97 | 90.96 | 90.20 | 90.25 | 67.53 | 67.56 | 90.49 | 90.49 | 88.77 | 88.78 |
| 10³ | 91.72 | 91.69 | 90.83 | 90.86 | 70.42 | 70.39 | 91.26 | 91.23 | 89.62 | 89.58 |
| 10⁴ | 92.40 | 92.39 | 91.47 | 91.54 | 72.38 | 72.36 | 91.76 | 91.76 | 90.48 | 90.45 |
| 10⁵ | 93.46 | 93.44 | 92.65 | 92.71 | 76.79 | 76.78 | 92.26 | 92.21 | 91.46 | 91.47 |
| 10⁶ | 93.10 | 93.08 | 92.00 | 92.04 | 79.84 | 79.71 | 91.89 | 91.90 | 92.00 | 91.97 |

Code is available at https://github.com/hsnamkoong/robustopt.

## Acknowledgments

We thank Feng Ruan for pointing out a much simpler proof of Theorem 1 than in our original paper. JCD and HN were partially supported by the SAIL-Toyota Center for AI Research, and HN was partially supported by a Samsung Fellowship. JCD was also partially supported by National Science Foundation award NSF-CAREER-1553086 and the Sloan Foundation.

## References

[1] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
[2] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005.
[3] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.
[4] A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341-357, 2013.
[5] D. Bertsimas, V. Gupta, and N. Kallus. Robust SAA. arXiv:1408.4445 [math.OC], 2014. URL http://arxiv.org/abs/1408.4445.
[6] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.
[7] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[10] J. C. Duchi, P. W. Glynn, and H. Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv:1610.03425 [stat.ML], 2016. URL https://arxiv.org/abs/1610.03425.
[11] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993.
[12] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593-2656, 2006.
[13] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
[14] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[15] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. Annals of Statistics, 27:1808-1829, 1999.
[16] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the Twenty Second Annual Conference on Computational Learning Theory, 2009.
[17] S. Mendelson. Learning without concentration. In Proceedings of the Twenty Seventh Annual Conference on Computational Learning Theory, 2014.
[18] H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 29, 2016.
[19] A. B. Owen. Empirical Likelihood. CRC Press, 2001.
[20] P. Samson. Concentration of measure inequalities for Markov chains and φ-mixing processes. Annals of Probability, 28(1):416-461, 2000.
[21] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009.
[22] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, pages 135-166, 2004.
[23] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
[24] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.
[25] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[26] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2):264-280, 1971.
[27] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301-320, 2005.
[28] A. Zubkov and A. Serov. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539-544, 2013.