Large-Scale Methods for Distributionally Robust Optimization

Daniel Levy*, Yair Carmon*, John C. Duchi, and Aaron Sidford
Stanford University
{danilevy,jduchi,sidford}@stanford.edu, ycarmon@cs.tau.ac.il

*Equal contribution. Code is available on GitHub at https://github.com/daniellevy/fast-dro/.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and χ² divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale applications. For χ² uncertainty sets these are the first such guarantees in the literature, and for CVaR our guarantees scale linearly in the uncertainty level rather than quadratically as in previous work. We also provide lower bounds proving the worst-case optimality of our algorithms for CVaR and a penalized version of the χ² problem. Our primary technical contributions are novel bounds on the bias of batch robust risk estimation and the variance of a multilevel Monte Carlo gradient estimator due to Blanchet and Glynn [8]. Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9–36 times more efficient than full-batch methods.

1 Introduction

The growing role of machine learning in high-stakes decision-making raises the need to train reliable models that perform robustly across subpopulations and environments [11, 29, 70, 58, 36, 53, 39]. Distributionally robust optimization (DRO) [3, 66] shows promise as a way to address this challenge, with recent interest in both the machine learning community [68, 74, 22, 69, 34, 55] and in operations research [20, 3, 5, 27]. Yet while DRO has had substantial impact in operations research, a lack of scalable optimization methods has hindered its adoption in common machine learning practice.

In contrast to empirical risk minimization (ERM), which minimizes an expected loss $\mathbb{E}_{S\sim P_0}\,\ell(x;S)$ over $x\in X\subset\mathbb{R}^d$ with respect to a training distribution $P_0$, DRO minimizes the expected loss with respect to the worst distribution in an uncertainty set $U(P_0)$; that is, its goal is to solve

$$\mathop{\mathrm{minimize}}_{x\in X}\ \ L(x;P_0) := \sup_{Q\in U(P_0)} \mathbb{E}_{S\sim Q}\,\ell(x;S). \qquad (1)$$

The literature considers several uncertainty sets [3, 5, 7, 27], and we focus on two particular choices: (a) the set of distributions with bounded likelihood ratio to $P_0$, so that $L$ becomes the conditional value at risk (CVaR) [59, 67], and (b) the set of distributions with bounded χ² divergence to $P_0$ [3, 16]. Some of our results extend to more general φ-divergence (or Rényi divergence) balls [72].

Minimizers of these objectives enjoy favorable statistical properties [22, 34], but finding them is more challenging than standard ERM. More specifically, stochastic gradient methods solve ERM with a number of ℓ computations independent of both $N$, the support size of $P_0$ (i.e., the number of data points), and $d$, the dimension of $x$ (i.e., the number of parameters). These guarantees do not directly apply to DRO because the supremum over $Q$ in (1) makes cheap sampling-based gradient estimates biased. As a consequence, existing techniques for minimizing the χ² objective [2, 20, 3, 5, 47, 22] have ℓ-evaluation complexity scaling linearly (or worse) in either $N$ or $d$, which is prohibitive in large-scale applications.
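In the notation of Section 2, choices (a) and (b) above correspond to the uncertainty sets

$$U_{\mathrm{CVaR}}(P_0) = \Big\{Q \ll P_0 : \frac{dQ}{dP_0} \le \frac{1}{\alpha}\Big\} \quad\text{and}\quad U_{\chi^2}(P_0) = \big\{Q \ll P_0 : D_{\chi^2}(Q,P_0) \le \rho\big\},$$

where $\alpha\in(0,1]$ is the CVaR level and $\rho$ the χ²-ball radius; their empirical counterparts appear in the first row of Table 1 below.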
| | CVaR at level α | χ² constraint ρ | χ² penalty λ |
|---|---|---|---|
| Objective $L(x;P_0)$ | $\sup_{q \preceq \frac{1}{\alpha N}\mathbf{1}} q^\top\ell(x)$ | $\sup_{D_{\chi^2}(q)\le\rho} q^\top\ell(x)$ | $\sup_{q\in\Delta^N}\big\{q^\top\ell(x) - \lambda D_{\chi^2}(q)\big\}$ |
| Subgradient method | $N\epsilon^{-2}$ | $N\epsilon^{-2}$ | $N\epsilon^{-2}$ |
| Dual SGM [Appendix A.3] | $\alpha^{-2}\epsilon^{-2}$ | – | $\lambda^{-2}\epsilon^{-2}$ |
| Subsampling [22] | – | $\rho^2 d\,\epsilon^{-4}$ | – |
| Stoch. primal-dual [17, 47] | $N\epsilon^{-2}$ | $N\rho\,\epsilon^{-2}$ | – |
| Ours | $\alpha^{-1}\epsilon^{-2}$ (Thm. 2) | $\rho\,\epsilon^{-3}$ (Thm. 4) | $\lambda^{-1}\epsilon^{-2}$ (Thm. 2) |
| Lower bound | $\alpha^{-1}\epsilon^{-2}$ (Thm. 3) | $\rho\,\epsilon^{-2}$ [22] | $\lambda^{-1}\epsilon^{-2}$ (Thm. 3) |

Table 1. Number of ℓ evaluations to obtain $\mathbb{E}[L(x;P_0)] - \inf_{x'\in X} L(x';P_0) \le \epsilon$ when $P_0$ is uniform on $N$ training points. For simplicity we omit the Lipschitz constant of ℓ, the size of the domain $X$, and logarithmic factors. We define $\ell_i(x) := \ell(x;S_i)$ and $D_{\chi^2}(q) := \frac{N}{2}\|q - N^{-1}\mathbf{1}\|_2^2$. The suprema are over $q$ in the simplex $\Delta^N$.

In this paper, we consider the setting in which ℓ is a Lipschitz convex loss, a prototypical case for stochastic optimization and machine learning [75, 49], and we propose methods for solving the problem (1) with ℓ complexity independent of sample size $N$ and dimension $d$, and with optimal (linear) dependence on the uncertainty set size. In Table 1 we summarize their complexities and compare them to previous work. Each entry of the table shows the number of (sub)gradient evaluations required to obtain a point with optimality gap ε; for reference, recall that for ERM the stochastic subgradient method requires order $\epsilon^{-2}$ evaluations, independent of $d$ and $N$. We discuss related work further in Section 1.1 after outlining our approach.

We begin our development in Section 3 by considering the surrogate objective $\bar L(x;n) = \mathbb{E}\,L(x;\hat P_n)$, corresponding to the average empirical robust objective over random batches of size $n$ sampled from $P_0$. In contrast to (1), it is straightforward to obtain unbiased gradient estimates for $\bar L$ using the mini-batch estimator $\nabla L(x;\hat P_n)$ and to optimize it efficiently with stochastic gradient methods. To obtain guarantees for the true objective $L$, we establish uniform bounds on the error $|L(x;P_0) - \bar L(x;n)|$. For CVaR we prove a bound scaling as $1/\sqrt{n}$ and extend it to other uncertainty sets, including χ² balls, via the Kusuoka representation [42]. Notably, for the penalty version of the χ² objective (Table 1, right column) we prove a stronger bound scaling as $1/n$. This analysis implies that, for large enough batch size $n$, an ε/2-minimizer of $\bar L$ is also an ε-minimizer of $L$. Furthermore, for CVaR and the χ² penalty we show that the variance of the gradient estimator decreases as $1/n$, and we use Nesterov acceleration to decrease the required number of gradient steps.

To obtain stronger guarantees, in Section 4 we present a theoretically more efficient multi-level Monte Carlo (MLMC) [31, 32] gradient estimator, a slight modification of the general technique of Blanchet and Glynn [8]. The resulting estimator is unbiased for $\bar L(x;n)$ but requires only a logarithmic number of samples in $n$. For CVaR and the χ² penalty we control the second moment of the gradient estimator, resulting in complexity bounds scaling with $\epsilon^{-2}$. We further prove that these rates are worst-case optimal up to logarithmic factors. Unfortunately, direct application of the MLMC estimator to the χ² uncertainty set (Table 1, center column) demonstrably fails for certain inputs. Instead, in Appendix E we optimize its Lagrange dual, the χ² penalty, with respect to $x$ and the Lagrange multiplier $\lambda$. Using a doubling scheme on the $\lambda$ domain, we obtain a complexity guarantee scaling as $\epsilon^{-3}$.
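As a quick illustration of the bias $L - \bar L$ that drives this development, the following self-contained sketch (ours, not the paper's code) uses Monte Carlo to estimate the CVaR surrogate $\bar L(x;n)$ when the losses $\ell(x;S)$ are Bernoulli, a distribution for which the population CVaR is easy to compute, and shows the gap shrinking as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, p = 0.1, 0.05                      # CVaR level; Bernoulli loss parameter
pop_cvar = min(p / alpha, 1.0)            # L(x; P0) for Bernoulli(p) losses: here 0.5

for n in [10, 100, 1000, 10000]:
    k = max(1, int(np.ceil(alpha * n)))   # CVaR of a batch = mean of its k largest losses
    batches = rng.binomial(1, p, size=(20000, n))
    batch_cvar = np.sort(batches, axis=1)[:, -k:].mean(axis=1)  # L(x; S_1^n) per batch
    print(n, pop_cvar - batch_cvar.mean())  # bias shrinks roughly like 1/sqrt(alpha*n)
```

The printed gap is always nonnegative (the mini-batch plug-in underestimates the robust risk) and decays with $n$, in line with the bias bounds of Section 3.1.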
Section 5 presents experiments where we use DRO to train linear models for digit classification (on a mixture of MNIST [44] and typed digits [19]) and ImageNet [60]. To the best of our knowledge, the latter is the largest DRO problem solved to date. In both experiments DRO provides generalization improvements over ERM, and we show that our stochastic gradient estimators require far fewer ℓ computations (between 9 and 36 times fewer) than full-batch methods. Our experiments also reveal two facts that our theory only hints at. First, when using the mini-batch gradient estimator, the error floor due to the difference between $\bar L(x;n)$ and $L(x;P_0)$ becomes negligible even for batch sizes as small as 10. Second, while the MLMC estimator avoids these error floors altogether, its increased variance makes it practically inferior to the mini-batch estimator with properly tuned batch size and learning rate. Our code implements our gradient estimators in PyTorch [56] and combines them seamlessly with the framework's optimizers; we show an example code snippet in Appendix F.3.

1.1 Related work

Distributionally robust optimization grows from the robust optimization literature in operations research [3, 2, 4, 5], and the fundamental uncertainty about the data distribution at test time makes its application to machine learning natural. Experiments in the papers [47, 28, 22, 34, 17, 40] show promising results for CVaR and χ²-constrained DRO, while other works highlight the importance of incorporating additional constraints into the uncertainty set definition [38, 24, 55, 61]. Below, we review the prior art on solving these DRO problems at scale.

Full-batch subgradient method. When $P_0$ has support of size $N$ it is possible to compute a subgradient of the objective $L(x;P_0)$ by evaluating $\ell(x;s_i)$ and $\nabla\ell(x;s_i)$ for $i=1,\ldots,N$ and computing the $q^\star\in\Delta^N$ attaining the supremum (1), whence $g = \sum_{i=1}^N q^\star_i \nabla\ell(x;s_i)$ is a subgradient of $L$ at $x$. As the Lipschitz constant of $L$ is at most that of ℓ, we may use these subgradients in the subgradient method [51] and find an ε-approximate solution in order $\epsilon^{-2}$ steps. This requires order $N\epsilon^{-2}$ evaluations of ℓ, regardless of the uncertainty set.

CVaR. Robust objectives of the form (1) often admit a tractable expression in terms of joint minimization over $x$ and the Lagrange multipliers associated with the constrained maximization over $Q$ [e.g., 59, 66]. For CVaR, this dual formulation is an ERM problem in $x$ and $\eta\in\mathbb{R}$ (see the display below), which we can solve in time independent of $N$ using stochastic gradient methods. We refer to this as dual SGM, providing the associated complexity bounds in Appendix A.3. Fan et al. [28] apply dual SGM for learning linear classifiers, and Curi et al. [17] compare it to their proposed stochastic primal-dual method based on determinantal point processes. While the latter performs better in practice, its worst-case guarantees scale roughly as $N\epsilon^{-2}$, similarly to the full-batch method. Kawaguchi and Lu [40] propose to use only gradients from the highest $k$ losses in every batch, which is essentially identical to our mini-batch estimator for CVaR; they do not, however, relate their algorithm to CVaR optimization. We contribute to this line of work by obtaining tight characterizations of the mini-batch and MLMC gradient estimators, resulting in optimal complexity bounds scaling as $\alpha^{-1}\epsilon^{-2}$.

DRO with χ² divergence. Similar dual formulations exist for both the constrained and penalized χ² objectives, and dual SGM provides guarantees for the penalized χ² objective similar to those for CVaR.
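For concreteness, the CVaR dual formulation referenced above is the classical identity of Rockafellar and Uryasev [59], an ERM problem in the pair $(x,\eta)$:

$$L_{\mathrm{CVaR}}(x;P_0) = \inf_{\eta\in\mathbb{R}}\Big\{\eta + \frac{1}{\alpha}\,\mathbb{E}_{S\sim P_0}\big(\ell(x;S)-\eta\big)_+\Big\},$$

so that running SGM jointly on $(x,\eta)$ with samples $S\sim P_0$ gives dual SGM; the χ² analogues take the same form with a different convex conjugate in place of the positive part.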
For the constrained-χ² problem, the additional Lagrange multiplier associated with the constraint induces a perspective transform [3, 22], making the method unstable. Indeed, Namkoong and Duchi [47] report that it fails to converge in practice and instead propose a stochastic primal-dual method with convergence rate $(1+\rho N)\epsilon^{-2}$. Their guarantee is optimal in the weak regularization regime where $\rho \lesssim 1/N$, but is worse than the full-batch method in the setting where $\rho \gtrsim 1$. Hashimoto et al. [34] propose a different scheme alternating between ERM on $x$ and line search over a Lagrange multiplier, but do not provide complexity bounds. Duchi and Namkoong [22] prove that for a sample of size $N \gtrsim \rho^2 d\,\epsilon^{-2}$ the empirical objective converges to $L(x;P_0)$ uniformly in $x\in X$; substituting this $N$ into the full-batch complexity bound implies a rate of $\rho^2 d\,\epsilon^{-4}$. This guarantee is independent of $N$, but features an undesirable dependence on $d$. Ghosh et al. [30] use the mini-batch gradient estimator and gradually increase the batch size to $N$ as optimization progresses; they do not provide convergence rate bounds. We establish concrete rates for fixed batch sizes independent of $N$.

MLMC gradient estimators. Multi-level Monte Carlo techniques [31, 32] facilitate the estimation of expectations of the form $\mathbb{E}\,F(S_1,\ldots,S_n)$, where the $S_i$ are i.i.d. In this work we leverage a variant of a particular MLMC estimator proposed by Blanchet and Glynn [8]. Prior work [6] uses the estimator of [8] in a DRO formulation of semi-supervised learning with Wasserstein uncertainty sets and $F(\cdot)$ a ratio of expectations, as opposed to a supremum of expectations in our setting.

2 Preliminaries

In this section we collect notation, establish a few assumptions, and provide the most important definitions for the remainder of the paper.

Notation. We denote the optimization variable by $x\in\mathbb{R}^d$, and use $s$ (or $S$ when it is random) for a data sample in $\mathcal{S}$. We use $z_l^m$ as shorthand for the sequence $z_l,\ldots,z_m$. For fixed $x$ we denote the cdf of $\ell(x;S)$ by $F(t) := \mathbb{P}(\ell(x;S)\le t)$ and its inverse by $F^{-1}(u) := \inf\{t : F(t) > u\}$, leaving the dependence on $x$ and $P_0$ implicit. We use $\|\cdot\|$ to denote the Euclidean norm, but remark that many of our results carry over to general norms. We let $\Delta^m$ denote the simplex in $m$ dimensions. We write $\mathbf{1}\{A\}$ for the indicator of event $A$, i.e., 1 if $A$ holds and 0 otherwise, and write $I_C$ for the infinite indicator of the set $C$: $I_C(x) = 0$ if $x\in C$ and $I_C(x) = \infty$ otherwise. The Euclidean projection onto a set $C$ is $\Pi_C$. We use $\nabla$ to denote the gradient with respect to $x$, or, for non-differentiable convex functions, an arbitrary subgradient. We denote the positive part of $t\in\mathbb{R}$ by $(t)_+ := \max\{t,0\}$. Finally, $f \lesssim g$ means that there exists $C\in\mathbb{R}_+$, independent of any problem parameters, such that $f \le Cg$ holds; we also write $f \asymp g$ if $f \lesssim g \lesssim f$.

Assumptions. Throughout, we assume that the domain $X$ is closed convex and satisfies $\|x-y\| \le R$ for all $x,y\in X$. Moreover, we assume the loss function $\ell : X\times\mathcal{S} \to [0,B]$ is convex and $G$-Lipschitz in $x$, i.e., $0\le\ell(x;s)\le B$ and $|\ell(x;s)-\ell(y;s)| \le G\|x-y\|$ for all $x,y\in X$ and $s\in\mathcal{S}$.² In some cases, we entertain two additional assumptions:

Assumption A1. The gradient $\nabla\ell(x;s)$ is $H$-Lipschitz in $x$.

Assumption A2. The inverse cdf $F^{-1}$ of $\ell(x;S)$ is $G_{\mathrm{icdf}}$-Lipschitz for each $x\in X$.

Most of our bounds do not require Assumptions A1 and A2. Moreover, in Appendix B.2 we argue that these assumptions are frequently not restrictive.

The distributionally robust objective. We consider a slight generalization of φ-divergence distributionally robust optimization (DRO).
For a convex $\phi : \mathbb{R}_+ \to \mathbb{R}\cup\{+\infty\}$ satisfying $\phi(1)=0$, the φ-divergence between distributions $Q$ and $P$, with $Q$ absolutely continuous w.r.t. $P$, is $D_\phi(Q,P) := \int \phi\big(\frac{dQ}{dP}(s)\big)\,dP(s)$. Then, for convex φ, ψ with $\phi(1)=\psi(1)=0$, a constraint radius $\rho\ge 0$, and penalty $\lambda\ge 0$, the general form of the objectives we consider is

$$L(x;P) := \sup_{Q : D_\phi(Q,P)\le\rho}\Big\{\mathbb{E}_Q[\ell(x;S)] - \lambda D_\psi(Q,P)\Big\}. \qquad (2)$$

As previewed, we consider the following objectives for general $P_0$ (possibly nonuniform and with infinite support):

χ² constraint. $L_{\chi^2}$ corresponds to $\phi(t) = \chi^2(t) := \frac12(t-1)^2$ and $\psi = 0$.

χ² penalty. $L_{\chi^2\text{-pen}}$ corresponds to $\phi = 0$ and $\psi(t) = \chi^2(t) = \frac12(t-1)^2$.

Conditional value at risk at level $\alpha\in(0,1]$ (CVaR). $L_{\mathrm{CVaR}}$ corresponds to $\phi = 0$ and $\psi = I_{[0,1/\alpha]}$.

Additionally, we define the following smoothed version of the CVaR objective, which we use in Section 3.

KL-regularized CVaR. $L_{\text{kl-CVaR}}$ corresponds to $\phi = 0$ and $\psi(t) = I_{[0,1/\alpha]}(t) + t\log t - t + 1$.

In Appendix A we present additional standard formulations and useful properties of these objectives. With mild abuse of notation, for a sample $s_1^n\in\mathcal{S}^n$, we let

$$L(x;s_1^n) := L(x;\hat P[s_1^n]) = \sup_{q\in\Delta^n :\ \frac1n\sum_{i\le n}\phi(nq_i)\le\rho}\ \sum_{i\le n}\Big(q_i\,\ell(x;s_i) - \frac{\lambda}{n}\,\psi(nq_i)\Big) \qquad (3)$$

denote the loss with respect to the empirical distribution on $s_1^n$. Averaging the robust objective over random batches of size $n$, we define the surrogate objective

$$\bar L(x;n) := \mathbb{E}_{S_1^n\sim P_0^n}\, L(x;S_1^n). \qquad (4)$$

Complexity metrics. We measure the complexity of our methods by the number of computations of $\nabla\ell(x;s)$ they require to reach a solution with accuracy ε. We can bound (up to a constant factor) the runtime of every method we consider by our complexity measure multiplied by $d + T_{\mathrm{eval}}$, where $T_{\mathrm{eval}}$ denotes the time to evaluate $\ell(x;s)$ and $\nabla\ell(x;s)$ at a single point $x$ and sample $s$, and is typically $O(d)$. (In the problems we study, solving the problem (3) given the values $\ell(x;S_1^n)$ takes $O(n\log n)$ time; see Appendix A.2.)

²Our results hold also when $B$ denotes $\sup_{x\in X,\,s,s'\in\mathcal{S}}\{\ell(x;s)-\ell(x;s')\}$. The Lipschitz loss and bounded domain assumptions imply $B \le B_0 + GR$ if $\inf_{x\in X}\ell(x;s) - \inf_{x'\in X}\ell(x';s') \le B_0$ for all $s,s'\in\mathcal{S}$, which typically holds with $B_0\approx 0$ in regression and classification problems.

3 Mini-batch gradient estimators

In this section, we develop and analyze stochastic subgradient methods using the subgradients of the mini-batch loss (3). That is, we estimate $\nabla L(x;P_0)$ by sampling a mini-batch $S_1,\ldots,S_n \overset{\text{iid}}{\sim} P_0$ and computing

$$\nabla L(x;S_1^n) = \sum_{i=1}^n q_i^\star\,\nabla\ell(x;S_i),$$

where $q^\star\in\Delta^n$ attains the supremum in Eq. (3); a minimal implementation for CVaR appears in the sketch below. By the definition (4) of the surrogate objective $\bar L$, we have that $\mathbb{E}\,\nabla L(x;S_1^n) = \nabla\bar L(x;n)$. Therefore, we expect stochastic subgradient methods using $\nabla L(x;S_1^n)$ to minimize $\bar L$. However, in general, $\bar L(x;n) \ne L(x;P_0)$ and $\mathbb{E}\,\nabla L(x;S_1^n) \ne \nabla L(x;P_0)$. To show that the mini-batch gradient estimator is nevertheless effective for minimizing $L$, we proceed in three steps. First, in Section 3.1 we prove uniform bounds on the bias $L - \bar L$ that tend to zero with $n$. Second, in Section 3.2 we complement them with $1/n$ variance bounds on $\nabla L(x;S_1^n)$. Finally, Section 3.3 puts the pieces together: we apply SGM guarantees to bound the complexity of minimizing $\bar L$ to accuracy ε/2, using Nesterov acceleration to exploit our variance bounds, and choose the mini-batch size $n$ large enough to guarantee (via our bias bounds) that the resulting solution is also an ε-minimizer of the original objective $L$.
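The following is a minimal PyTorch-style sketch of this estimator for CVaR (ours, not the paper's released code; `per_example_losses` is a hypothetical tensor of the batch losses $\ell(x;S_1),\ldots,\ell(x;S_n)$). For CVaR, $q^\star$ places mass $1/(\alpha n)$ on the largest losses, so $L(x;S_1^n)$ is the average of the top $\lceil\alpha n\rceil$ losses (exactly so when $\alpha n$ is an integer):

```python
import math
import torch

def cvar_minibatch_loss(per_example_losses: torch.Tensor, alpha: float) -> torch.Tensor:
    """Mini-batch CVaR objective L(x; S_1^n): the average of the k = ceil(alpha*n)
    largest losses, i.e. sum_i q*_i ell_i with q* uniform on the top-k set.
    Calling .backward() on the result yields the estimator nabla L(x; S_1^n)."""
    n = per_example_losses.numel()
    k = max(1, math.ceil(alpha * n))
    top_k = torch.topk(per_example_losses, k).values
    return top_k.mean()

# Usage with any PyTorch model/optimizer: compute per-example losses on a batch,
# then optimize cvar_minibatch_loss(losses, alpha) in place of losses.mean().
```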
3.1 Bias analysis

Proposition 1 (Bias of the batch estimator). For all $x\in X$ and $n\in\mathbb{N}$ we have

$$0 \;\le\; L(x;P_0) - \bar L(x;n) \;\le\; \begin{cases} B\min\big\{1,\,(\alpha n)^{-1/2}\big\} & \text{for } L = L_{\mathrm{CVaR}} & (5)\\ B\sqrt{(1+\rho)(\log n)/n} & \text{for } L = L_{\chi^2} & (6)\\ B^2(\lambda n)^{-1} & \text{for } L = L_{\chi^2\text{-pen}} & (7)\\ G_{\mathrm{icdf}}\,n^{-1} & \text{for any loss of the form (2)}, & (8) \end{cases}$$

where the bound (8) holds under Assumption A2.

We present the proof in Appendix B.1.1 and make a few remarks before proceeding to discuss the main proof ideas. First, the bounds (5), (6) and (7) are all tight up to constant or logarithmic factors when $\ell(x;S)$ has a Bernoulli distribution, and so are unimprovable without further assumptions (see Proposition 4 in Appendix B.1.2). One such assumption is that $\ell(x;S)$ has a $G_{\mathrm{icdf}}$-Lipschitz inverse cdf, which allows us to obtain the general $1/n$ bias bound (8), independent of the uncertainty set size. As we discuss in Appendix B.2.2, this assumption has natural relaxations for uniform distributions with finite supports and, for CVaR at level α, we only need the inverse cdf $F^{-1}(\beta)$ to be Lipschitz around $\beta=\alpha$, a common assumption in the risk estimation literature [71].

Proof sketch. To show that $L(x;P_0) \ge \bar L(x;n)$ for every loss of the form (2), we use Lagrange duality to write

$$L(x;P_0) = \inf_{\eta,\nu}\,\mathbb{E}_{S_1^n\sim P_0^n}\,\frac1n\sum_{i=1}^n \Upsilon(x;\eta,\nu;S_i) \quad\text{and}\quad \bar L(x;n) = \mathbb{E}_{S_1^n\sim P_0^n}\,\inf_{\eta,\nu}\,\frac1n\sum_{i=1}^n \Upsilon(x;\eta,\nu;S_i)$$

for some $\Upsilon : X\times\mathbb{R}\times\mathbb{R}_+\times\mathcal{S}\to\mathbb{R}$. This exposes the fundamental source of the mini-batch estimator bias: when infimum and expectation do not commute (as is the case in general), exchanging them strictly decreases the result.

Our upper bound analysis begins with CVaR, where $L_{\mathrm{CVaR}} = \frac1\alpha\int \mathbf{1}\{\beta\ge 1-\alpha\}\,F^{-1}(\beta)\,d\beta$ and $\bar L_{\mathrm{CVaR}} = \frac1\alpha\int \bar I_\alpha(\beta)\,F^{-1}(\beta)\,d\beta$, with $F^{-1}$ the inverse cdf of $\ell(x;S)$ and $\bar I_\alpha$ a soft step function that we write in closed form as a sum of Beta densities. To obtain the bound (5) we express $\int(\mathbf{1}\{\beta\ge 1-\alpha\} - \bar I_\alpha(\beta))_+\,d\beta$ as a sum of binomial tail probabilities and apply Chernoff bounds. For CVaR only, the improved bound (8) follows from arguing that replacing $F^{-1}(\beta)$ with $G_{\mathrm{icdf}}\,\beta$ overestimates the bias, and showing that $\int(\mathbf{1}\{\beta\ge 1-\alpha\} - \bar I_\alpha(\beta))\,\beta\,d\beta \lesssim (n+1)^{-1}$ for any α.

To transfer the CVaR bounds to other objectives we express the objective (2) as a weighted average of CVaR over different α values, essentially using the Kusuoka representation of coherent risk measures [42]. Given any bias bound $b(\alpha)$ for CVaR at level α, this expression implies the bound $L - \bar L \le \sup_{w\in W(L)}\int b(\alpha)\,dw(\alpha)$, where $W(L)$ is a set of probability measures. Substituting $b(\alpha) = 1/\sqrt{n\alpha}$ and using the Cauchy–Schwarz inequality gives the bound (6), while substituting $b(\alpha) = G_{\mathrm{icdf}}/n$ shows that this bound in fact holds for any $L$, as we claim in (8).

Showing the bound (7) requires a fairly different argument. Our proof uses the dual representation of $L_{\chi^2\text{-pen}}$ as a minimum of an expected risk over a Lagrange multiplier η imposing the constraint that $q$ in (3) sums to 1 (or that $Q$ in (2) integrates to 1); the sketch below shows the form this multiplier takes in the empirical problem. Using convexity with respect to η we relate the value of the risk at $\eta_n^\star$ (the minimizer for the sample $S_1^n$) to $\eta^\star$ (the population minimizer), which in expectation are $\bar L_{\chi^2\text{-pen}}$ and $L_{\chi^2\text{-pen}}$, respectively. We then apply Cauchy–Schwarz and bound the variance of $\eta_n^\star$ with the Efron–Stein inequality [26] to obtain a $1/n$ bias bound.
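As an illustration (ours, not from the paper), for the χ² penalty the maximizing weights in (3) have the closed form $q_i^\star = \frac1n\big(1 + (\ell_i - \eta)/\lambda\big)_+$, with η the multiplier enforcing $\sum_i q_i^\star = 1$; a short bisection recovers them:

```python
import numpy as np

def chi2_penalty_weights(losses: np.ndarray, lam: float, tol: float = 1e-10) -> np.ndarray:
    """Maximizing weights q* in (3) for the chi^2-penalty objective (lam > 0).
    Stationarity of the Lagrangian gives q_i = (1/n)(1 + (l_i - eta)/lam)_+ ;
    we find the multiplier eta making the weights sum to one by bisection."""
    l = np.asarray(losses, dtype=float)
    n = len(l)

    def mass(eta: float) -> float:  # total weight, decreasing in eta
        return np.maximum(1.0 + (l - eta) / lam, 0.0).sum() / n

    lo, hi = l.min() - lam, l.max()  # mass(lo) >= 1 >= mass(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mass(mid) > 1.0 else (lo, mid)
    return np.maximum(1.0 + (l - 0.5 * (lo + hi)) / lam, 0.0) / n

# The gradient estimator is then sum_i q*_i * grad(ell_i); when all losses are
# equal, the weights reduce to the uniform 1/n of ERM.
```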
3.2 Variance analysis

With the bias bounds of Proposition 1 established, we analyze the variance of the stochastic gradient estimators $\nabla L(x;S_1^n)$. More specifically, we prove that the variance of the mini-batch gradient estimator decreases as $1/n$ for penalty-type robust objectives (with $\phi=0$) for which the maximizing $Q$ has bounded χ² divergence from $P_0$, which we call χ²-bounded objectives (see Appendix A.4). Noting that $L_{\text{kl-CVaR}}$ (with $L_{\mathrm{CVaR}}$ as a special case) and $L_{\chi^2\text{-pen}}$ are χ²-bounded yields the following.

Proposition 2 (Variance of the batch estimator). For all $n\in\mathbb{N}$, $x\in X$, and $S_1^n\sim P_0^n$,

$$\mathrm{Var}\big[\nabla L_{\text{kl-CVaR}}(x;S_1^n)\big] \lesssim \frac{G^2}{\alpha n} \quad\text{and}\quad \mathrm{Var}\big[\nabla L_{\chi^2\text{-pen}}(x;S_1^n)\big] \lesssim \frac{G^2(1+B/\lambda)}{n}.$$

(Note that the variance bound on $\nabla L_{\text{kl-CVaR}}$ is independent of λ and therefore holds also for $\nabla L_{\mathrm{CVaR}}$, where $\lambda = 0$.)

We prove Proposition 2 in Appendix B.3 and provide a proof sketch below.³ Unfortunately, the bounds do not extend to the χ²-constrained formulation: in Appendix B.3 (Proposition 5) we prove that for any $n$ there exist ℓ, $P_0$, and $x$ such that $\mathrm{Var}[\nabla L_{\chi^2}(x;S_1^n)] \gtrsim \rho$. Whether Proposition 2 holds when adding a χ² penalty to the χ² constraint remains an open question.

Proof sketch. The Efron–Stein inequality [26] is $\mathrm{Var}[\nabla L(x;S_1^n)] \le \frac{n}{2}\,\mathbb{E}\,\|\nabla L(x;S_1^n) - \nabla L(x;\tilde S_1^n)\|^2$, where $S_1^n$ and $\tilde S_1^n$ are identical except in a random entry $I\in[n]$, for which $\tilde S_I$ is an i.i.d. copy of $S_I$. We bound $\|\nabla L(x;S_1^n) - \nabla L(x;\tilde S_1^n)\| \le G\,q_I^\star + G\,\|q^\star - \tilde q^\star\|_1$ with the triangle inequality, where $q^\star$ and $\tilde q^\star$ attain the maximum in (3) for $S$ and $\tilde S$, respectively. The crux of our proof is the equality $\|q^\star - \tilde q^\star\|_1 = 2|q_I^\star - \tilde q_I^\star|$, which holds since increasing one coordinate of $\ell(x;S_1),\ldots,\ell(x;S_n)$ must decrease all other coordinates of $q^\star$. Noting that $\mathbb{E}(q_I^\star - \tilde q_I^\star)^2 \le 4\,\mathbb{E}(q_I^\star - 1/n)^2 = \frac{8}{n^2}\,\mathbb{E}\,D_{\chi^2}\big(q^\star, \tfrac1n\mathbf{1}\big)$, the results follow by observing that $D_{\chi^2}(q^\star,\tfrac1n\mathbf{1})$ is bounded by $1/\alpha$ and $B/\lambda$ for $L_{\text{kl-CVaR}}$ and $L_{\chi^2\text{-pen}}$, respectively.

³In the appendix we provide bounds on the variance of $L(x;S_1^n)$ in addition to $\nabla L(x;S_1^n)$.

3.3 Complexity guarantees

With the bias and variance guarantees established, we now provide bounds on the complexity of minimizing $L(x;P_0)$ to arbitrary accuracy ε using standard gradient methods with the gradient estimator $\hat g(x) = \nabla L(x;S_1^n)$. (Recall from Section 2 that we measure complexity by the number of individual first-order evaluations $(\ell(x;s), \nabla\ell(x;s))$.) Writing $\Pi_X$ for the Euclidean projection onto $X$, the stochastic gradient method (SGM) with fixed step size η and $x_0\in X$ iterates

$$x_{t+1} = \Pi_X\big(x_t - \eta\,\hat g(x_t)\big), \quad\text{and}\quad \bar x_t = \frac1t\sum_{\tau\le t} x_\tau. \qquad (9)$$

We also consider Nesterov's accelerated gradient method [50, 43]. For $x_0 = y_0 = z_0\in X$, a fixed step size $\eta > 0$, and a sequence $\{\theta_t\}$, we iterate

$$z_{t+1} = \Pi_X\Big(z_t - \frac{\eta}{\theta_t}\,\hat g(x_t)\Big), \quad y_{t+1} = \theta_t z_{t+1} + (1-\theta_t)y_t, \quad\text{and}\quad x_{t+1} = \theta_{t+1} z_{t+1} + (1-\theta_{t+1})y_{t+1}. \qquad (10)$$

In Appendix B.4 we state the rates of convergence of the iterations (9) and (10) following the analysis in [43], with a small variation in which the stochastic gradient estimates are unbiased for a uniform approximation of the true objective with additive error δ. Since our gradient estimator has norm bounded by $G$, SGM allows us to find an ε-minimizer of $\bar L$ in $T \lesssim (GR)^2/\epsilon^2$ steps. Therefore, choosing $n$ large enough in accordance with Proposition 1 guarantees that we find an ε-minimizer of $L$. The accelerated scheme (10) admits convergence guarantees that scale with the gradient estimator's variance instead of its second moment, allowing us to leverage Proposition 2 to reduce $T$ to the order of $1/\epsilon$. The accelerated guarantees require the loss $\bar L$ to have order-$1/\epsilon$-Lipschitz gradients; fortunately, this holds for $L_{\chi^2\text{-pen}}$ and $L_{\text{kl-CVaR}}$.

Claim 1. Let Assumption A1 hold. For all $P$, $\nabla L_{\text{kl-CVaR}}(x;P)$ and $\nabla L_{\chi^2\text{-pen}}(x;P)$ are $(\frac{G^2}{\lambda}+H)$-Lipschitz in $x$, and $0 \le L_{\mathrm{CVaR}}(x;P) - L_{\text{kl-CVaR}}(x;P) \le \lambda\log(1/\alpha)$ for all $x$.

See the proof in Appendix A.1.6. Thus, to minimize $L_{\mathrm{CVaR}}$ we instead minimize $L_{\text{kl-CVaR}}$ and choose $\lambda \asymp \epsilon/\log(1/\alpha)$ to satisfy the smoothness requirement while incurring order-ε approximation error; the sketch below spells out the basic (unaccelerated) loop (9).
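A minimal sketch of iteration (9) with the mini-batch CVaR estimator (ours; `per_example_losses(x)` is a hypothetical function returning the $n$ batch losses at the current point, and $X$ is taken to be a Euclidean ball of radius $R$):

```python
import math
import torch

def sgm_cvar(x: torch.Tensor, per_example_losses, alpha: float,
             radius: float, step_size: float, T: int) -> torch.Tensor:
    """Projected SGM (9) on the mini-batch CVaR objective; returns the averaged
    iterate bar(x)_T. `x` is a leaf tensor with requires_grad=True."""
    x_avg = torch.zeros_like(x)
    for _ in range(T):
        losses = per_example_losses(x)                # ell(x; S_i), i = 1..n
        k = max(1, math.ceil(alpha * losses.numel()))
        loss = torch.topk(losses, k).values.mean()    # L(x; S_1^n)
        grad, = torch.autograd.grad(loss, x)          # hat g(x) = sum_i q*_i grad ell_i
        with torch.no_grad():
            x -= step_size * grad                     # gradient step
            norm = x.norm()
            if norm > radius:                         # Euclidean projection onto X
                x.mul_(radius / norm)
            x_avg += x
    return x_avg / T
```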
For $L_{\chi^2\text{-pen}}$ with $\lambda \gtrsim \epsilon$ we get sufficient smoothness for free.⁴ As computing every gradient estimator requires $n$ evaluations of $\nabla\ell$, the total gradient complexity is $n\,T$, and we have the following suite of guarantees (see Appendix B.5 for the proof).

⁴We can also handle the case $\lambda < \epsilon$ by adding a KL-divergence term to ψ for $L_{\chi^2\text{-pen}}$.

Theorem 1. Let Assumptions A1 and A2 hold, possibly trivially (with $H=\infty$ or $G_{\mathrm{icdf}}=\infty$). Let $\epsilon\in(0,B)$ and write $\nu = \frac{H\epsilon}{G^2}$. With suitable choices of the batch size $n$ and iteration count $T$, the gradient methods (9) and (10) find $x$ satisfying $\mathbb{E}\,L(x;P_0) - \inf_{x'\in X} L(x';P_0) \le \epsilon$ with complexity $n\,T$ admitting the following bounds.

For $L = L_{\mathrm{CVaR}}$, we have $\displaystyle n\,T \lesssim \frac{(GR)^2}{\alpha\epsilon^2}\Bigg(1 + \min\Bigg\{\frac{\alpha\,G_{\mathrm{icdf}}\sqrt{\log\frac1\alpha+\nu}}{GR},\ \frac{B^2\sqrt{\log\frac1\alpha+\nu}}{GR\,\epsilon},\ \frac{B^2}{\epsilon^2}\Bigg\}\Bigg)$.

For $L = L_{\chi^2\text{-pen}}$ with $\lambda \le B$, we have $\displaystyle n\,T \lesssim \frac{(GR)^2 B}{\lambda\epsilon^2}\Bigg(1 + \min\Bigg\{\frac{B}{GR}\sqrt{\frac{\epsilon}{\lambda}+\nu},\ \frac{B}{\epsilon}\Bigg\}\Bigg)$.

For $L = L_{\chi^2}$, we have $\displaystyle n\,T \lesssim \frac{(1+\rho)(GR)^2 B^2}{\epsilon^4}\log\frac{(1+\rho)B^2}{\epsilon^2}$.

For any loss of the form (2), we have $\displaystyle n\,T \lesssim \frac{(GR)^2 G_{\mathrm{icdf}}}{\epsilon^3}$.

The smoothness parameter $H$ only appears in rates resulting from Nesterov acceleration. Even there, $H$ appears in lower-order terms in ε since $\nu = \frac{H\epsilon}{G^2}$. We also note that the final $G_{\mathrm{icdf}}\,\epsilon^{-3}$ rate holds even when the uncertainty set is the entire simplex; therefore, when $G_{\mathrm{icdf}} < \infty$ it is possible to approximately minimize the maximum loss [64] in sublinear time.

Theorem 1 achieves the claimed rates of convergence in Table 1 in certain settings. In particular, it recovers the rates for $L_{\mathrm{CVaR}}$ and $L_{\chi^2\text{-pen}}$ (the first and last columns of the table) when $\nu\lesssim 1$, $\lambda \gtrsim (B/(GR))^2\epsilon$, and $\alpha \lesssim GR/G_{\mathrm{icdf}}$. In the next section, we show how to attain the claimed optimal rates for $L_{\mathrm{CVaR}}$ and $L_{\chi^2\text{-pen}}$ without these conditions, returning to address the rates for the constrained χ² objective $L_{\chi^2}$ in Appendix E.

4 Multi-level Monte Carlo (MLMC) gradient estimators

In the previous section, we optimized the mini-batch surrogate $\bar L(x;n)$ of the risk $L(x;P_0)$, using Proposition 1 to guarantee the surrogate's fidelity for sufficiently large $n$. The (linearly) increasing complexity of computing the estimator $\nabla L(x;S_1^n)$ as $n$ grows limits the (theoretical) efficiency of the method. To that end, in this section we revisit a multi-level Monte Carlo (MLMC) gradient estimator of Blanchet and Glynn [8] that forms an unbiased approximation to $\nabla\bar L(x;n)$ whose sample complexity is logarithmic in $n$. We provide new bounds on the variance of this MLMC estimator, leading immediately to improved (and, as we shall see, optimal) efficiency estimates for stochastic gradient methods using it.

To define the estimator, let $J \sim \min\{\mathrm{Geo}(1/2),\,j_{\max}\}$ be a truncated geometric random variable supported on $\{1,\ldots,j_{\max}\}$, so that $q(j) = \mathbb{P}(J=j) = 2^{-j+\mathbf{1}\{j=j_{\max}\}}$. For a realization of $J$ we draw a sample of size $2^J n_0$ and compute the multi-level Monte Carlo estimator as follows:

$$\hat M[\nabla L] := \nabla L(x;S_1^{n_0}) + \frac{1}{q(J)}\,\hat D_{2^J n_0}, \quad\text{where}\quad \hat D_k := \nabla L(x;S_1^k) - \frac12\Big(\nabla L(x;S_1^{k/2}) + \nabla L(x;S_{k/2+1}^k)\Big).$$

Our estimator differs from the proposal [8] in two respects: the distribution of $J$ and the option to set $n_0 > 1$. As we further discuss in Appendix C.3, the former difference is crucial for our setting, while the latter is practically and theoretically helpful yet not crucial; a sketch implementing the estimator appears below. The following properties of the MLMC estimator are key to our analysis (see Appendix C.1 for proofs).

Claim 2. The estimator $\hat M[\nabla L]$ with parameters $n = 2^{j_{\max}}n_0$ satisfies $\mathbb{E}\,\hat M[\nabla L] = \mathbb{E}\,\nabla L(x;S_1^n) = \nabla\bar L(x;n)$, and requires expected sample size $\mathbb{E}\,2^J n_0 = n_0(1+\log_2(n/n_0))$.
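A minimal sketch of this estimator for the CVaR objective value (ours; `sample_losses(m)` is a hypothetical helper drawing $m$ i.i.d. losses $\ell(x;S_i)$); the gradient version applies the same combination to $\nabla L$:

```python
import math
import numpy as np

def mlmc_cvar_estimate(sample_losses, alpha: float, n0: int = 10,
                       jmax: int = 10, rng=None) -> float:
    """MLMC estimator M[L]: unbiased for the surrogate E L(x; S_1^n) with
    n = 2**jmax * n0, at expected sample size n0 * (1 + log2(n/n0))."""
    rng = rng or np.random.default_rng()

    def batch_cvar(v: np.ndarray) -> float:  # L(x; S_1^k) for CVaR
        k = max(1, math.ceil(alpha * len(v)))
        return float(np.sort(v)[-k:].mean())

    # J ~ min{Geo(1/2), jmax}, so q(j) = 2^(-j + 1{j = jmax})
    J = min(rng.geometric(0.5), jmax)
    q_J = 2.0 ** (-J + (J == jmax))
    base = batch_cvar(sample_losses(n0))
    v = sample_losses(2**J * n0)
    half = len(v) // 2
    D = batch_cvar(v) - 0.5 * (batch_cvar(v[:half]) + batch_cvar(v[half:]))
    return base + D / q_J
```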
Proposition 3 (Second moment of MLMC gradient estimator). For all $x\in X$, the multi-level Monte Carlo estimator with parameters $n$ and $n_0$ satisfies

$$\mathbb{E}\big\|\hat M[\nabla L_{\mathrm{CVaR}}]\big\|^2 \lesssim \Big(1+\frac{\log n}{\alpha n_0}\Big)G^2 \quad\text{and}\quad \mathbb{E}\big\|\hat M[\nabla L_{\chi^2\text{-pen}}]\big\|^2 \lesssim \Big(1+\frac{(1+B/\lambda)\log n}{n_0}\Big)G^2.$$

Claim 2 follows from a simple calculation, while the core of Proposition 3 is a sign-consistency argument for simplifying a 1-norm, similar to the proof of Proposition 2. Further paralleling Proposition 2, we obtain similar bounds on the MLMC estimates of $L_{\mathrm{CVaR}}$ and $L_{\chi^2\text{-pen}}$ (in addition to their gradients), and demonstrate that similar bounds fail to hold for $L_{\chi^2}$ (Proposition 7 in Appendix C.1). Therefore, directly using the MLMC estimator on $L_{\chi^2}$ cannot provide guarantees for minimizing $L_{\chi^2}$; instead, in Appendix E we develop a doubling scheme that minimizes the dual objective $L_{\chi^2\text{-pen}}(x;P_0) + \lambda\rho$ jointly over $x$ and λ (see the display at the end of this section). This scheme relies on MLMC estimators for both the gradient $\nabla L_{\chi^2\text{-pen}}$ and the derivative of $L_{\chi^2\text{-pen}}$ with respect to λ.

Proposition 3 guarantees that the second moment of our gradient estimators remains bounded by a quantity that depends only logarithmically on $n$. For these estimators, Proposition 6 thus directly provides complexity guarantees for minimizing $L_{\mathrm{CVaR}}$ and $L_{\chi^2\text{-pen}}$. We also provide a high-probability bound on the total complexity of the algorithm using a one-sided Bernstein concentration bound. We state the guarantee below and present a short proof in Appendix C.2.

Theorem 2 (MLMC complexity guarantees). For $\epsilon\in(0,B)$, set $n \asymp \frac{B^2}{\alpha\epsilon^2}$, $1 \le n_0 \le \frac{\log_2 n}{\alpha}$, and $T \asymp \frac{(GR)^2}{n_0\alpha\epsilon^2}\log_2 n$. The stochastic gradient iterates (9) with $\hat g(x) = \hat M[\nabla L_{\mathrm{CVaR}}(x;\cdot)]$ satisfy $\mathbb{E}[L_{\mathrm{CVaR}}(\bar x_T;P_0)] - \inf_{x\in X} L_{\mathrm{CVaR}}(x;P_0) \le \epsilon$ with complexity at most

$$n_0\,T\log_2 n \;\lesssim\; \frac{(GR+B)^2}{\alpha\epsilon^2}\,\log^2\frac{B^2}{\alpha\epsilon^2}$$

with probability at least $1-1/n$. The same conclusion holds when replacing $L_{\mathrm{CVaR}}$ with $L_{\chi^2\text{-pen}}$ and $\alpha^{-1}$ with $1+B/\lambda$.

Lower bounds. We match the guarantees of Theorem 2 with lower bounds that hold in a standard stochastic oracle model [48, 43, 10], where algorithms interact with a problem instance by iteratively querying $x_t\in X$ (for $t\in\mathbb{N}$) and observing $\ell(x_t;S)$ and $\nabla\ell(x_t;S)$ with $S\sim P_0$ (drawn independently of $x_t$). All the algorithms we consider fit into this model, with each gradient evaluation corresponding to an oracle query. Therefore, to demonstrate that our MLMC guarantees are unimprovable in the worst case (ignoring logarithmic factors), we formulate a lower bound on the number of queries any oracle-based algorithm requires.

Theorem 3 (Minimax lower bounds). Let $G, R, \alpha, \lambda > 0$, $\epsilon\in(0, GR/64)$, and sample space $\mathcal{S} = [-1,1]$. There exists a numerical constant $c > 0$ such that the following holds. For each $d\ge 1$, domain $X = \{x\in\mathbb{R}^d : \|x\|\le R\}$, and any algorithm, there exists a distribution $P_0$ on $\mathcal{S}$ and a convex, $G$-Lipschitz loss $\ell : X\times\mathcal{S}\to[0,GR]$ such that

$$T \le c\,\frac{(GR)^2}{\alpha\epsilon^2} \quad\text{implies}\quad \mathbb{E}[L_{\mathrm{CVaR}}(x_T;P_0)] - \inf_{x\in X} L_{\mathrm{CVaR}}(x;P_0) > \epsilon.$$

Moreover, there exists $d_\epsilon \lesssim (GR)^2\epsilon^{-2}\log\frac{GR}{\epsilon}$ such that, for $X = \{x\in\mathbb{R}^{d_\epsilon} : \|x\|\le R\}$, the same conclusion holds when replacing $L_{\mathrm{CVaR}}$ with $L_{\chi^2\text{-pen}}$ and α with $\lambda/(GR)$.

We present the proof in Appendix D. Our proof of the penalized-χ² lower bound leverages a classical high-dimensional hard instance construction for oracle-based optimization, while our proof for CVaR is information-theoretic. Consequently, the CVaR lower bound is stronger: it holds for $d=1$ and extends to a global model where at every round the oracle provides the entire function $\ell(\cdot;S)$ rather than $\ell(x;S)$ and $\nabla\ell(x;S)$ at the query point $x$.
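For reference, the duality underlying the doubling scheme of Appendix E is the standard Lagrangian identity relating the constrained and penalized χ² objectives,

$$L_{\chi^2}(x;P_0) = \inf_{\lambda\ge 0}\Big\{\lambda\rho + L_{\chi^2\text{-pen},\lambda}(x;P_0)\Big\},$$

so that jointly minimizing the right-hand side over $(x,\lambda)$, with MLMC estimates of the $x$-gradient and of the λ-derivative, recovers the constrained problem; the doubling scheme carries out this minimization over a sequence of λ ranges.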
[Figure 1 appears here: six panels showing the robust (training) objective versus gradient evaluations (in passes over the data), for Digits with CVaR (α = 0.02), constrained χ² (ρ = 1.0), and penalized χ² (λ = 0.05), and for ImageNet with CVaR (α = 0.1), constrained χ² (ρ = 1.0), and penalized χ² (λ = 0.4). Traces compare mini-batch estimators with n = 10, 50, 500, 5K (and 50K, 150K for ImageNet), MLMC with n₀ = 10, and the full-batch method.]

Figure 1. Convergence of the DRO objective in our digits and ImageNet classification experiments. Shaded areas indicate the range of variability across 5 repetitions (minimum to maximum), and the zoomed-in regions highlight the (often very low) bias floor of small batch sizes.

5 Experiments

We test our theoretical predictions with experiments on two datasets. Our main focus is measuring how the total work required to solve the DRO problems depends on the choice of gradient estimator. In particular, we quantify the tradeoffs in choosing the mini-batch size $n$ in the estimator $\nabla L(x;S_1^n)$ of Section 3 and the effect of using the MLMC technique of Section 4. To ensure that we operate in practically meaningful settings, our experiments involve heterogeneous data, and we tune the DRO objective to improve the generalization performance of ERM on the hardest subpopulation. We provide a full account of the experiments in Appendix F and summarize them below.

Our digit recognition experiment reproduces [22, Section 3.2], where the training data comprises the 60K MNIST training images mixed with 600 images of typed digits from [19], while our ImageNet experiment uses the ILSVRC-2012 1000-way classification task. In each experiment we use DRO to learn linear classifiers on top of pretrained neural network features (i.e., training the head of the network), taking ℓ to be the logarithmic loss with squared-norm regularization; see Appendix F.1. Each experiment compares different gradient estimators for minimizing the $L_{\mathrm{CVaR}}$, $L_{\chi^2}$, and $L_{\chi^2\text{-pen}}$ objectives. Appendix F.2 details our hyperparameter settings and their tuning procedures.

Figure 1 plots the training objective as optimization progresses. In Appendix F.4 we provide expanded figures that also report robust generalization performance. We find that the benefits of DRO manifest mainly when the metric of interest is continuous (e.g., the log loss) as opposed to the 0-1 loss.

Our analysis in Section 3.1 bounds the suboptimality of solutions resulting from using a mini-batch estimator with batch size $n$, showing that it must vanish as $n$ increases. Figure 1 shows that smaller batch sizes indeed converge to suboptimal solutions, and that their suboptimality becomes negligible very quickly: essentially every batch size larger than 10 exhibits fairly small bias (with the exception of $L_{\chi^2}$ in the digits experiment). The effect of the bias is particularly weak for $L_{\chi^2\text{-pen}}$, consistent with its superior theoretical guarantees. We note, however, that the suboptimality we see in practice is far smaller than the worst-case bounds in Proposition 1; we investigate this in detail in Appendix F.5. While the MLMC estimator does not suffer from a bias floor (by design), it is also much slower to converge.
This may appear confusing, since the MLMC convergence guarantees are optimal (for $L_{\mathrm{CVaR}}$ and $L_{\chi^2\text{-pen}}$) while the mini-batch estimator achieves the optimal rate only under certain assumptions. Recall, however, that these assumptions are smoothness of the loss (which holds in our experiments) and, for CVaR, sufficiently rapid decay of the bias floor, which we verify empirically.

For batch sizes in the range 50–5K, the traces in Figure 1 look remarkably similar. This is consistent with our theoretical analysis for $L_{\mathrm{CVaR}}$ and $L_{\chi^2\text{-pen}}$, which shows that the variance decreases linearly with the batch size: we may therefore (with Nesterov acceleration) increase the step size proportionally and expect the total work to remain constant. As the theory predicts, this learning rate increase is only possible up to a certain batch size (roughly 5K in our experiments), after which larger batches become less efficient. Indeed, to reach within 2% of the optimal value, the full-batch method requires 27–36 times more work than batch sizes 50–5K for ImageNet, and 9–16 times more work for the digits experiment (see Tables 4 and 5 for a precise breakdown of these numbers). We also repeat our experiments with the dual SGM and primal-dual methods mentioned in Table 1 and compare them with our proposed methods; see Appendix F.6 for details.

Broader Impact

The robustness of machine learning (ML) models, or lack thereof, has far-reaching present and future societal consequences: in autonomous vehicles [39, 18], medical diagnosis [53], facial recognition [11], credit scoring [29], and recidivism prediction [12, 1], failure of ML to perform robustly across subpopulations or under distribution shift can have disastrous real-life consequences, particularly for members of underserved and/or under-represented groups. Distributionally robust optimization (DRO) is emerging as a methodology for imposing the constraint that models perform uniformly well across subgroups, and several works conduct experiments demonstrating its benefit in promoting fairness [34, 24, 74] and robustness [68, 61] in ML.

However, the computational experiments in these works are relatively small in scale, and there exist serious computational impediments to scaling up DRO. Consequently, the potential benefits of several DRO formulations remain unexplored. The main contribution of our work is in strengthening the theoretical and algorithmic foundations of two fundamental DRO formulations. In particular, for χ²-divergence uncertainty sets we give the first proof that stochastic gradient methods can scale to large data similarly to the way they scale for standard empirical risk minimization. We believe that our algorithms will serve as a basis for future experimentation with CVaR and χ²-divergence DRO, and we hope that the resulting findings will lead to more robust and fair machine learning algorithms with positive societal impact. Toward that end, we will release an implementation of our DRO gradient estimators that integrates seamlessly into the PyTorch optimization framework and is therefore suitable for application in a wide range of ML tasks.

In addition, we believe that our work is a step towards a suite of algorithms capable of solving a broader class of DRO problems at scale, including, e.g., uncertainty sets with explicit group structure as proposed in [38, 61]. We believe that such an algorithm suite will empower machine learning researchers and engineers to create more reliable and ethical systems.
However, greater applicability and simplicity always come with the risk of irresponsible and superficial use. In particular, we are concerned with the possibility that DRO might become a marketing scheme for selling off ML systems as robust without proper verification. Therefore, the development of robust training procedures must go hand-in-hand with the development of rigorous and independent evaluation methodologies for the auditing of algorithms [36, 54, 41, 14, 45].

Acknowledgments

The authors would like to thank Hongseok Namkoong for discussions and insights, as well as Nimit Sohoni for comments on an earlier draft.

Sources of funding. DL, YC and JCD were supported by the NSF under CAREER Award CCF-1553086 and HDR 1934578 (the Stanford Data Science Collaboratory) and Office of Naval Research YIP Award N00014-19-2288. YC was supported by the Stanford Graduate Fellowship. AS is supported by a Microsoft Research Faculty Fellowship, NSF CAREER Award CCF-1844855, NSF Grant CCF-1955039, a PayPal research gift, and a Sloan Research Fellowship.

Competing interests. The authors declare no competing interests.

References

[1] S. Barocas and A. D. Selbst. Big data's disparate impact. 104 California Law Review, 3:671–732, 2016.
[2] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
[3] A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
[4] D. Bertsimas, D. Brown, and C. Caramanis. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011.
[5] D. Bertsimas, V. Gupta, and N. Kallus. Data-driven robust optimization. Mathematical Programming, Series A, 167(2):235–292, 2018.
[6] J. Blanchet and Y. Kang. Semi-supervised Learning Based on Distributionally Robust Optimization, chapter 1, pages 1–33. John Wiley & Sons, Ltd, 2020. ISBN 9781119721871.
[7] J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019.
[8] J. H. Blanchet and P. W. Glynn. Unbiased Monte Carlo for optimization and functions of expectations via multi-level randomization. In 2015 Winter Simulation Conference (WSC), pages 3656–3667. IEEE, 2015.
[9] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[10] G. Braun, C. Guzmán, and S. Pokutta. Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Transactions on Information Theory, 63(7), 2017.
[11] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.
[12] A. Chouldechova. A study of bias in recidivism prediction instruments. Big Data, pages 153–163, 2017.
[13] K. Clarkson, E. Hazan, and D. Woodruff. Sublinear optimization for machine learning. Journal of the Association for Computing Machinery, 59(5), 2012.
[14] S. Corbett-Davies and S. Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv:1808.00023, 2018.
[15] N. Cressie and T. R. Read. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B, pages 440–464, 1984.
[16] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation.
Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967.
[17] S. Curi, K. Levy, S. Jegelka, A. Krause, et al. Adaptive sampling for stochastic risk-averse learning. arXiv:1910.12511 [cs.LG], 2019.
[18] D. Dai and L. Van Gool. Dark model adaptation: Semantic image segmentation from daytime to nighttime. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3819–3824. IEEE, 2018.
[19] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, February 2009.
[20] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.
[21] J. C. Duchi. Introductory lectures on stochastic convex optimization. In The Mathematics of Data, IAS/Park City Mathematics Series. American Mathematical Society, 2018.
[22] J. C. Duchi and H. Namkoong. Learning models with uniform performance via distributionally robust optimization. Annals of Statistics, to appear, 2020.
[23] J. C. Duchi, P. L. Bartlett, and M. J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.
[24] J. C. Duchi, T. Hashimoto, and H. Namkoong. Distributionally robust losses against mixture covariate shifts. 2019.
[25] R. Durrett. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.
[26] B. Efron and C. Stein. The jackknife estimate of variance. The Annals of Statistics, 9(3):586–596, 1981.
[27] P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, Series A, 171(1–2):115–166, 2018.
[28] Y. Fan, S. Lyu, Y. Ying, and B. Hu. Learning with average top-k loss. In Advances in Neural Information Processing Systems 30, pages 497–505, 2017.
[29] A. Fuster, P. Goldsmith-Pinkham, T. Ramadorai, and A. Walther. Predictably unequal? The effects of machine learning on credit markets. Social Science Research Network: 3072038, 2018.
[30] S. Ghosh, M. Squillante, and E. Wollega. Efficient stochastic gradient descent for distributionally robust learning. arXiv:1805.08728 [stat.ML], 2018.
[31] M. B. Giles. Multilevel Monte Carlo path simulation. Operations Research, 56(3):607–617, 2008.
[32] M. B. Giles. Multilevel Monte Carlo methods. Acta Numerica, 24:259–328, 2015.
[33] C. Guzmán and A. Nemirovski. On lower complexity bounds for large-scale smooth convex optimization. Journal of Complexity, 31(1):1–14, 2015.
[34] T. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[35] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[36] D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the Seventh International Conference on Learning Representations, 2019.
[37] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Springer, New York, 1993.
[38] W. Hu, G. Niu, I. Sato, and M. Sugiyama. Does distributionally robust supervised learning give robust classifiers?
In Proceedings of the 35th International Conference on Machine Learning, 2018.
[39] N. Kalra and S. M. Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability? Transportation Research Part A: Policy and Practice, 94:182–193, 2016.
[40] K. Kawaguchi and H. Lu. Ordered SGD: A new stochastic optimization framework for empirical risk minimization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.
[41] M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. arXiv:1711.05144 [cs.LG], 2018.
[42] S. Kusuoka. On law invariant coherent risk measures. In Advances in Mathematical Economics, pages 83–95. Springer, 2001.
[43] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, Series A, 133(1–2):365–397, 2012.
[44] Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks, pages 53–60, 1995.
[45] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru. Model cards for model reporting. In Conference on Fairness, Accountability and Transparency, pages 220–229, 2019.
[46] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[47] H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 29, 2016.
[48] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
[49] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[50] Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372–376, 1983.
[51] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.
[52] Y. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, Series A, 103:127–152, 2005.
[53] L. Oakden-Rayner, J. Dunnmon, G. Carneiro, and C. Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM Conference on Health, Inference, and Learning, pages 151–159, 2020.
[54] M. O'Kelly, A. Sinha, H. Namkoong, J. Duchi, and R. Tedrake. Scalable end-to-end autonomous vehicle testing via rare-event simulation. In Advances in Neural Information Processing Systems 30, 2018.
[55] Y. Oren, S. Sagawa, T. Hashimoto, and P. Liang. Distributionally robust language modeling. In Empirical Methods in Natural Language Processing (EMNLP), 2019.
[56] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In Neural Information Processing Systems (NIPS) Workshop on Automatic Differentiation, 2017.
[57] J. Pitman. Probability. Springer-Verlag, 1993.
[58] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do ImageNet classifiers generalize to ImageNet? In Proceedings of the 36th International Conference on Machine Learning, 2019.
[59] R. T. Rockafellar and S. Uryasev.
Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
[60] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[61] S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In Proceedings of the Eighth International Conference on Learning Representations, 2020.
[62] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[63] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 19, 2006.
[64] S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why? In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[65] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71–79, 2013.
[66] A. Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275, 2017.
[67] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009.
[68] A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
[69] M. Staib and S. Jegelka. Distributionally robust optimization and generalization in kernel methods. In Advances in Neural Information Processing Systems 32, pages 9134–9144, 2019.
[70] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1521–1528. IEEE, 2011.
[71] A. A. Trindade, S. Uryasev, A. Shapiro, and G. Zrazhevsky. Financial prediction with constrained tail risk. Journal of Banking & Finance, 31(11):3524–3538, 2007.
[72] T. van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
[73] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
[74] S. Wang, W. Guo, H. Narasimhan, A. Cotter, M. Gupta, and M. I. Jordan. Robust optimization for fairness with noisy protected groups. arXiv:2002.09343 [cs.LG], 2020.
[75] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.