# Adaptive Sampling for Stochastic Risk-Averse Learning

Sebastian Curi (Dept. of Computer Science, ETH Zurich, scuri@inf.ethz.ch) · Kfir Y. Levy (Faculty of Electrical Engineering, Technion, kfirylevy@technion.ac.il) · Stefanie Jegelka (CSAIL, MIT, stefje@mit.edu) · Andreas Krause (Dept. of Computer Science, ETH Zurich, krausea@inf.ethz.ch)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

**Abstract.** In high-stakes machine learning applications, it is crucial to not only perform well on average, but also when restricted to difficult examples. To address this, we consider the problem of training models in a risk-averse manner. We propose an adaptive sampling algorithm for stochastically optimizing the Conditional Value-at-Risk (CVaR) of a loss distribution, which measures its performance on the α fraction of most difficult examples. We use a distributionally robust formulation of the CVaR to phrase the problem as a zero-sum game between two players, and solve it efficiently using regret minimization. Our approach relies on sampling from structured Determinantal Point Processes (DPPs), which enables scaling it to large data sets. Finally, we empirically demonstrate its effectiveness on large-scale convex and non-convex learning tasks.

## 1 Introduction

Machine learning systems are increasingly deployed in high-stakes applications. This imposes reliability requirements that are in stark discrepancy with how we currently train and evaluate these systems. Usually, we optimize expected performance both in training and evaluation via empirical risk minimization (Vapnik, 1992). Thus, we sacrifice occasional large losses on difficult examples in order to perform well on average. In this work, we instead consider a risk-averse optimization criterion, namely the Conditional Value-at-Risk (CVaR), also known as the Expected Shortfall. In short, the α-CVaR of a loss distribution is the average of the losses in the α-tail of the distribution.

Optimizing the CVaR is well understood in the convex setting, where duality enables a reduction to standard empirical risk minimization using a modified, truncated loss function from Rockafellar et al. (2000). Unfortunately, this approach fails when stochastically optimizing the CVaR, especially on non-convex problems such as training deep neural network models. A likely reason for this failure is that mini-batch estimates of gradients of the CVaR suffer from high variance.

To address this issue, we propose a novel adaptive sampling algorithm, ADA-CVAR (Section 4). Our algorithm initially optimizes the mean of the losses but gradually adjusts its sampling distribution to increasingly sample tail events (difficult examples), until it eventually minimizes the CVaR (Section 4.1). Our approach naturally enables the use of standard stochastic optimizers (Section 4.2). We provide convergence guarantees for the algorithm (Section 4.3) and an efficient implementation (Appendix A). Finally, we demonstrate the performance of our algorithm in a suite of experiments (Section 5).

## 2 Related Work

**Risk Measures.** Risk aversion is a well-studied human behavior, in which agents assign more weight to adverse events than to positive ones (Pratt, 1978).
Approaches for modeling risk include using utility functions that emphasize larger losses (Rabin, 2013); prospect theory, which re-scales the probability of events (Kahneman and Tversky, 2013); or direct optimization of coherent risk measures (Artzner et al., 1999). Rockafellar et al. (2000) introduce the CVaR as a particular instance of the latter class. The CVaR has found many applications, such as portfolio optimization (Krokhmal et al., 2002) and supply chain management (Carneiro et al., 2010), as it does not rely on specific utility or weighing functions, which offers great flexibility.

**CVaR in Machine Learning.** The ν-SVM algorithm by Schölkopf et al. (2000) can be interpreted as optimizing the CVaR of the loss, as shown by Gotoh and Takeda (2016). Also related, Shalev-Shwartz and Wexler (2016) propose to minimize the maximal loss among all samples; the maximal loss is the limiting case of the CVaR as α → 0. Fan et al. (2017) generalize this work to the top-k average loss. Although they do not mention the relationship to the CVaR, their learning criterion equals the CVaR for empirical measures. For optimization, they use an algorithm proposed by Ogryczak and Tamir (2003) to minimize the maximum of the sum of k functions; this algorithm coincides with the truncated algorithm of Rockafellar et al. (2000) for optimizing the CVaR. Recent applications of the CVaR in ML include risk-averse bandits (Sani et al., 2012), risk-averse reinforcement learning (Chow et al., 2017), and fairness (Williamson and Menon, 2019). All of these use the original truncated formulation of Rockafellar et al. (2000) to optimize the CVaR. One of the major shortcomings of this formulation is that mini-batch gradient estimates have high variance. In this work, we address this via a method based on adaptive sampling, inspired by Shalev-Shwartz and Wexler (2016), that allows us to handle large datasets and complex (deep neural network) models.

**Distributionally Robust Optimization.** The CVaR also has a natural distributionally robust optimization (DRO) interpretation (Shapiro et al., 2009, Section 6.3), which we exploit in this paper. For example, Ahmadi-Javid (2012) introduces the entropic value-at-risk by considering a different DRO set. Duchi et al. (2016); Namkoong and Duchi (2017); Esfahani and Kuhn (2018); Kirschner et al. (2020) address related DRO problems, but with different uncertainty sets. We use the DRO formulation of the CVaR to phrase its optimization as a game, and we propose an adaptive algorithm to solve this game for the learning problem. Our algorithm is most closely related to Namkoong and Duchi (2016), who develop an algorithm for DRO sets induced by Cressie-Read f-divergences. Instead, we use a different DRO set that arises in common data sets (Mehrabi et al., 2019), and we provide an efficient algorithm to solve the DRO problem on large-scale datasets.

## 3 Problem Statement

We consider supervised learning with a risk-averse learner. The learner has a data set comprised of i.i.d. samples from an unknown distribution, i.e., $D = \{(x_1, y_1), \dots, (x_N, y_N)\} \in (\mathcal{X} \times \mathcal{Y})^N \sim \mathcal{D}^N$, and her goal is to learn a function $h_\theta : \mathcal{X} \to \mathbb{R}$ that is parametrized by $\theta \in \Theta \subseteq \mathbb{R}^d$. The performance of $h_\theta$ at a data point is measured by a loss function $\ell : \Theta \times \mathcal{X} \times \mathcal{Y} \to [0, 1]$. We write the random variable $L_i(\theta) = \ell(\theta; x_i, y_i)$. The learner's goal is to minimize the CVaR of the losses on the (unknown) distribution $\mathcal{D}$ with respect to the parameters $\theta$.
*(Figure 1: CVaR of a loss distribution; the mean, VaR, and CVaR are marked, and the α-tail is shaded.)*

**CVaR properties.** The CVaR of a random variable $L \sim P$ is defined as $C_\alpha[L] = \mathbb{E}_P[L \mid L \geq \ell_\alpha]$, where $\ell_\alpha$ is the $(1-\alpha)$-quantile of the distribution, also called the Value-at-Risk (VaR). We illustrate the mean, VaR, and CVaR of a typical loss distribution in Figure 1. The CVaR of a random variable has a natural distributionally robust optimization (DRO) formulation, namely as the expected value of the same random variable under a different law. This law arises from the following optimization problem (Shapiro et al., 2009, Sec. 6.3):

$$C_\alpha[L] = \max_{Q \in \mathcal{Q}_\alpha} \mathbb{E}_Q[L], \quad \text{where} \quad \mathcal{Q}_\alpha = \left\{ Q \;\middle|\; Q \ll P,\ \frac{dQ}{dP} \leq \frac{1}{\alpha} \right\}. \quad (1)$$

Here, $Q \ll P$ means that $Q$ is absolutely continuous w.r.t. $P$. The distribution $Q^\star$ that solves Problem (1) places all the mass uniformly in the tail, i.e., the shaded region of Figure 1. Thus, optimizing the CVaR can be viewed as guarding against a particular kind of distribution shift, which reweighs arbitrary parts of the data up to a factor of $1/\alpha$. Rockafellar et al. (2000) prove strong duality for Problem (1). The dual objective is:

$$C_\alpha[L] = \min_{\ell \in \mathbb{R}} \; \ell + \frac{1}{\alpha} \mathbb{E}_P\left[\max\{0, L - \ell\}\right]. \quad (2)$$

**Learning with the CVaR.** Problem (2) can be used to estimate the CVaR of a random variable by replacing the expectation $\mathbb{E}_P$ with the empirical mean $\hat{\mathbb{E}}$, yielding

$$\min_{\ell \in \mathbb{R},\, \theta \in \Theta} \; \ell + \frac{1}{\alpha N} \sum_{i=1}^{N} \max\{0, L_i(\theta) - \ell\}. \quad (3)$$

For convex $L_i$, Problem (3) has computable subgradients, and hence lends itself to subgradient-based optimization. Furthermore, in this case Problem (3) is jointly convex in $(\ell, \theta)$. We refer to this standard approach as TRUNC-CVAR, as it effectively optimizes a modified loss, truncated at $\ell$. Problem (3) is indeed a sensible learning objective in the sense that the empirical CVaR concentrates around the population CVaR uniformly for all functions $h \in \mathcal{H}$.

**Proposition 1.** Let $\mathcal{H}$ be a finite class of functions $h : \mathcal{X} \to \mathcal{Y}$, and let $L(h) : \mathcal{H} \to [0, 1]$ be a random variable. Then, for any $0 < \alpha \leq 1$, with probability at least $1 - \delta$,

$$\left|\hat{C}_\alpha[L(h)] - C_\alpha[L(h)]\right| \leq \frac{1}{\alpha}\sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2N}}.$$

*Proof.* See Appendix B.1.

The result above is easily extended to classes $\mathcal{H}$ with finite VC (pseudo-)dimension. Concurrent to this work, Lee et al. (2020), Soma and Yoshida (2020), and Mhammedi et al. (2020) present similar statistical rates based on different assumptions.

**Challenges for Stochastic Optimization.** In the common case that a variant of SGD is used to optimize the learning problem (3), the expectation is approximated with a mini-batch of data. But when this batch is sampled uniformly at random from the data, only a fraction α of points will contain gradient information: the gradient of the remaining points gets truncated to zero by the $\max\{0, \cdot\}$ non-linearity. Furthermore, the gradient of the examples that do contain information is scaled by $1/\alpha$, leading to exploding gradients. These facts make stochastic optimization of Problem (3) extremely challenging, as we demonstrate in Section 5. Our key observation is that the root of the problem lies in the mismatch between the sampling distribution $P$ and the unknown distribution $Q^\star$ from which we would ideally like to sample. In fact, Problem (3) can be interpreted as a form of rejection sampling: samples with losses smaller than $\ell$ are rejected. It is well known that Monte Carlo estimation of rare events suffers from high variance (Rubino and Tuffin, 2009). To address this issue, we propose a novel sampling algorithm that adaptively learns to sample events from the distribution $Q^\star$ while optimizing the model parameters $\theta$.
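To make the truncated formulation concrete, the following minimal NumPy sketch (an illustration only, not the authors' implementation) evaluates the empirical CVaR and the dual objective (3) for a fixed parameter vector; the Beta-distributed toy losses and the grid search over $\ell$ are our own choices for the example.

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Empirical alpha-CVaR: average of the ceil(alpha * N) largest losses."""
    k = max(1, int(np.ceil(alpha * len(losses))))
    return np.sort(losses)[-k:].mean()

def trunc_cvar_objective(losses, ell, alpha):
    """Rockafellar-Uryasev dual objective (3) for fixed theta: ell + (1/(alpha*N)) * sum of hinges."""
    return ell + np.maximum(losses - ell, 0.0).mean() / alpha

rng = np.random.default_rng(0)
losses = rng.beta(2, 5, size=1000)      # toy per-example losses in [0, 1]
alpha = 0.05

# Minimizing the dual objective over ell recovers the empirical CVaR (up to grid resolution).
grid = np.linspace(0.0, 1.0, 2001)
dual_value = min(trunc_cvar_objective(losses, ell, alpha) for ell in grid)
print(empirical_cvar(losses, alpha), dual_value)   # the two values approximately match
```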
## 4 ADA-CVAR: Adaptive Sampling for CVaR Optimization

We directly address the DRO problem (1) on the empirical measure $\hat{P}$ for learning. The DRO set is $\mathcal{Q}_\alpha = \{ q \in \mathbb{R}^N \mid 0 \leq q_i \leq \tfrac{1}{k},\ \sum_i q_i = 1 \}$ with $k = \lceil \alpha N \rceil$. The learning problem becomes:

$$\min_{\theta \in \Theta} \max_{q \in \mathcal{Q}_\alpha} \mathbb{E}_{i \sim q}[L_i(\theta)] = \min_{\theta \in \Theta} \max_{q \in \mathcal{Q}_\alpha} q^\top L(\theta), \quad (4)$$

where $L(\theta) \in \mathbb{R}^N$ has $i$-th entry $L_i(\theta)$. The learning problem (4) is a minimax game between a θ-player (the learner), whose goal is to minimize the objective function by selecting $\theta \in \Theta$, and a q-player (the sampler), whose goal is to maximize the objective function by selecting $q \in \mathcal{Q}_\alpha$.

To solve the game (4), we use techniques from regret minimization with partial (bandit) feedback. In particular, we exploit the fact that one can solve minimax games by viewing both players as competing online learners and equipping each of them with a no-regret algorithm (Freund and Schapire, 1999). With the partial (bandit) feedback model, we only need to consider a small subset of the data in each iteration of the optimization algorithm. In contrast, full-information feedback would require a full pass over the data per iteration, invalidating all benefits of stochastic optimization.

**Algorithm 1: ADA-CVAR**
Input: learning rates $\eta_s$, $\eta_l$.
1. Sampler: initialize the k-DPP weights $w_1 = \mathbf{1}_N$.
2. Learner: initialize the parameters $\theta_0 \in \Theta$.
3. for $t = 1, \dots, T$ do
4. Sampler: sample a data point $i_t \sim q_t = \tfrac{1}{k} P_{w_t}(\cdot)$.
5. Learner: $\theta_t = \theta_{t-1} - \eta_l \nabla L_{i_t}(\theta_{t-1})$.
6. Sampler: build the estimate $\hat{L}_{t,i} = \frac{L_{i_t}(\theta_t)}{q_{t,i_t}}\,[[i = i_t]]$.
7. Sampler: update the k-DPP weights $w_{t+1,i} = w_{t,i}\, e^{\eta_s \hat{L}_{t,i}}$.
8. end for
Output: $(\bar\theta, \bar q)$ drawn uniformly at random from $\{(\theta_t, q_t)\}_{t=1}^T$.

Next, we describe and analyze an online learning algorithm for each of the two players and prove guarantees with respect to the DRO problem (4). We outline the final algorithm, which we call ADA-CVAR, in Algorithm 1; it uses an adaptive sampling scheme for the q-player and SGD for the θ-player. Initially, the q-player (sampler) plays the uniform distribution and the θ-player (learner) selects any parameter in the set Θ. In iteration t, the sampler samples a data point (or a mini-batch) with respect to the distribution $q_t$. Then, the learner performs an SGD step on the sample(s) selected by the sampler (note that we do not use any importance sampling correction for this gradient step). Finally, the q-player adapts the distribution to favor examples with higher loss and thus maximize the objective in (4).

### 4.1 Sampler (q-Player) Algorithm

In every iteration $t$, the learner player sets a vector of losses through $\theta_t$. We denote by $L(\theta_t; x_i, y_i) = L_{t,i}$ the loss at time $t$ for example $i$, and by $L(\theta_t) = L_t$ the vector of losses. The sampler player chooses an index $i_t$ (or a mini-batch) and a vector $q_t$. Then, only $L_{t,i_t}$ is revealed and she suffers a cost $q_t^\top L_t$. In such a setting, the best the player can aim to do is to minimize her regret:

$$\mathrm{SR}_T := \max_{q \in \mathcal{Q}_\alpha} \sum_{t=1}^{T} q^\top L_t - \sum_{t=1}^{T} q_t^\top L_t. \quad (5)$$

The regret measures how good the sequence of actions of the sampler is, compared to the best single action (i.e., distribution over the data) in hindsight, after seeing the sequence of iterates $L_t$. The sampler player's problem is a linear adversarial bandit. Exploration and sampling in this setting are hard (Bubeck et al., 2012). Our efficient implementation exploits the specific combinatorial structure. In particular, the DRO set $\mathcal{Q}_\alpha$ is a polytope with $\binom{N}{k}$ vertices, each corresponding to a different subset $I \subseteq [N]$ of size $k$. As the inner optimization problem over $q$ in (4) is a linear program, the optimal solution $q^\star$ is a vertex.
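To see this concretely, here is a tiny numerical check (an illustration using SciPy's linear-programming solver, not part of the paper) that the inner maximum of (4) over $\mathcal{Q}_\alpha$ is attained at the vertex that puts mass $1/k$ on the $k$ largest losses, i.e., that $\max_{q \in \mathcal{Q}_\alpha} q^\top L$ equals the top-$k$ average of the losses.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, k = 8, 3
L = rng.uniform(size=N)                     # a fixed loss vector L(theta)

# Solve max_{q in Q_alpha} q^T L as an LP: minimize -L^T q  s.t.  sum(q) = 1,  0 <= q_i <= 1/k.
res = linprog(c=-L, A_eq=np.ones((1, N)), b_eq=[1.0], bounds=[(0.0, 1.0 / k)] * N)
lp_value = -res.fun

top_k_average = np.sort(L)[-k:].mean()      # the vertex with mass 1/k on the k largest losses
print(lp_value, top_k_average)              # the two values coincide up to solver tolerance
```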
Thus, the sampler problem can be reduced to a best subset selection problem: find the best set among all size-$k$ subsets $\mathcal{I}_k = \{ I \subseteq [N] \mid |I| = k \}$. Here, the value of a set $I$ at time $t$ is simply the average of its losses, $\frac{1}{k}\sum_{i \in I} L_i(\theta_t)$. The problem of maximizing this value over time can be viewed as a combinatorial bandit problem, as we have a combinatorial set of "arms", one per $I \in \mathcal{I}_k$ (Lattimore and Szepesvári, 2018, Chapter 30). Building on Alatur et al. (2020), we develop an efficient algorithm for the sampler.

**Starting Point: EXP3.** A well-known no-regret bandit algorithm is the celebrated EXP3 algorithm (Auer et al., 2002). Let $\Delta_{\mathcal{I}_k} := \{ W \in \mathbb{R}^{\binom{N}{k}} \mid \sum_I W_I = 1,\ W_I \geq 0 \}$ be the simplex of distributions over the $\binom{N}{k}$ subsets. Finding the best distribution $W^\star \in \Delta_{\mathcal{I}_k}$ is equivalent to finding the best subset $I^\star \in \mathcal{I}_k$. By transitivity, this is equivalent to finding the best $q \in \mathcal{Q}_\alpha$. To do this, EXP3 maintains a vector $W_t \in \Delta_{\mathcal{I}_k}$, samples an element $I_t \sim W_t$, and observes a loss associated with $I_t$. Finally, it updates the distribution using multiplicative weights. Unfortunately, EXP3 is intractable in two ways. Sampling a k-subset $I_t$ would require evaluating the losses of $k = \lceil \alpha N \rceil$ data points, which is impractical. Furthermore, the naive EXP3 algorithm is intractable because the dimension of $W_t$ is exponential in $k$; in turn, the regret of this algorithm also depends on the dimension of $W_t$.

**Efficiency through Structure.** The crucial insight is that we can exploit the combinatorial structure of the problem and the additivity of the loss to exponentially improve efficiency. First, we exploit that the weights of individual elements and of sets are related by $W_{t,I} = \sum_{i \in I} w_{t,i}$. Thus, instead of observing the loss $L_{I_t}$, we let the q-player sample only a single element $i_t$ uniformly at random from the set $I_t \sim W_t$, observe its loss $L_{i_t}$, and use it to update a weight vector $w_t$. The single element $i_t$ sampled by the algorithm provides information about the loss of all $\binom{N-1}{k-1}$ sets that contain it. This allows us to obtain regret guarantees that are sub-linear in $N$ (rather than in $\binom{N}{k}$). Second, we exponentially improve the computational cost by developing an algorithm that maintains a vector $w \in \mathbb{R}^N$ and uses k-Determinantal Point Processes to map it to distributions over subsets of size $k$.

**Definition 4.1** (k-DPP, Kulesza et al. (2012)). A k-Determinantal Point Process over a ground set $[N]$ is a distribution over all subsets of size $k$ such that the probability of a set is $P(I) \propto \det(K_I)$, where $K$ is a positive definite kernel matrix and $K_I$ is the submatrix of $K$ indexed by $I$.

In particular, we consider k-DPPs with diagonal kernel matrices $K = \operatorname{diag}(w)$, with $w \in \mathbb{R}^N_{\geq 0}$ having at least $k$ strictly positive elements. This family of distributions is rich enough to contain, for example, the uniform distribution over the $\binom{N}{k}$ subsets and all the vertices of $\mathcal{Q}_\alpha$. We use such k-DPPs to efficiently map a vector of size $N$ to a distribution over $\binom{N}{k}$ subsets. We also denote the marginal probability of element $i$ by $P_w(i)$. It is easy to verify that the vector of marginals $\frac{1}{k} P_w(i)$ lies in $\mathcal{Q}_\alpha$. Hence, we directly use the k-DPP marginals as the sampler's decision variables.

We can finally describe the sampler algorithm. We initialize the k-DPP kernel with the uniform distribution, $w_1 = \mathbf{1}_N$. In iteration $t$, the sampler plays the distribution $q_t = \frac{1}{k} P_{w_t}(\cdot) \in \mathcal{Q}_\alpha$ and samples an element $i_t \sim q_t$.
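For a diagonal kernel, $P(I) \propto \prod_{i \in I} w_i$, so the exact marginals can be computed from elementary symmetric polynomials as $P_w(i) = w_i\, e_{k-1}(w_{-i}) / e_k(w)$. The sketch below (our own helper names and a deliberately simple $O(N^2 k)$ implementation, not the paper's Appendix A routine) computes these marginals and checks that $\frac{1}{k}P_w(\cdot)$ indeed lies in $\mathcal{Q}_\alpha$.

```python
import numpy as np

def esp(w, k):
    """Elementary symmetric polynomials e_0, ..., e_k of the entries of w (standard DP)."""
    e = np.zeros(k + 1)
    e[0] = 1.0
    for wi in w:
        for j in range(k, 0, -1):   # reverse order so each entry is used at most once
            e[j] += wi * e[j - 1]
    return e

def kdpp_marginals_diag(w, k):
    """Exact marginals P_w(i) of a k-DPP with kernel diag(w): P_w(i) = w_i * e_{k-1}(w_{-i}) / e_k(w)."""
    w = np.asarray(w, dtype=float)
    ek = esp(w, k)[k]
    marg = np.empty(len(w))
    for i in range(len(w)):
        marg[i] = w[i] * esp(np.delete(w, i), k - 1)[k - 1] / ek
    return marg

w = np.array([3.0, 1.0, 1.0, 0.5, 0.25, 0.25])
k = 3
P = kdpp_marginals_diag(w, k)
q = P / k                                  # q = (1/k) P_w(.)
print(P.sum())                             # marginals of a k-DPP sum to k
print(q.sum(), q.max() <= 1 / k + 1e-12)   # q lies in Q_alpha: sums to 1, coordinates at most 1/k
```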
The loss at the sampled index, $L_{t,i_t}$, is revealed to the sampler, and only the entry $i_t$ of $w_t$ is updated, according to the multiplicative update $w_{t+1,i_t} = w_{t,i_t}\, e^{\eta_s L_{t,i_t}/q_{t,i_t}}$ (lines 6 and 7 of Algorithm 1); all other entries remain unchanged. This approach addresses the disadvantages of the EXP3 algorithm. Computationally, it only requires $O(N)$ memory. After sampling an element $i_t$, the probabilities of all $\binom{N-1}{k-1}$ sets that contain it are updated. This yields regret rates that depend sub-linearly on the data set size, which we prove next.

**Lemma 1.** Let the sampler play the ADA-CVAR algorithm with $\eta_s = \sqrt{\log N / (NT)}$. Then, for any sequence of losses, she suffers a regret (5) of at most $O(\sqrt{TN \log N})$.

*Proof sketch.* For a detailed proof please refer to Appendix B.2. First, we prove in Proposition 3 that the iterates of the algorithm are effectively in $\mathcal{Q}_\alpha$. Next, we prove in Proposition 4 that the comparator in the regret of Alatur et al. (2020) and in the sampler regret (5) have the same value (scaled by $k$). Finally, the result follows as a corollary of these propositions and Alatur et al. (2020, Lemma 1).

### 4.2 Learner (θ-Player) Algorithm

Analogous to the sampler, the learner player seeks to minimize its regret

$$\mathrm{LR}_T := \sum_{t=1}^{T} q_t^\top L(\theta_t) - \min_{\theta \in \Theta} \sum_{t=1}^{T} q_t^\top L(\theta). \quad (6)$$

Crucially, the learner can choose $\theta_t$ after the sampler selects $q_t$. Thus, the learner can play the Be-The-Leader (BTL) algorithm:

$$\theta_t = \arg\min_{\theta \in \Theta} \sum_{\tau=1}^{t} q_\tau^\top L(\theta) = \arg\min_{\theta \in \Theta} \bar{q}_t^\top L(\theta), \quad (7)$$

where $\bar{q}_t = \frac{1}{t} \sum_{\tau=1}^{t} q_\tau$ is the average distribution (up to time $t$) that the sampler proposes. Instead of assuming access to an exact optimizer, we assume access to an ERM oracle.

**Assumption 1** ($\epsilon_{\mathrm{oracle}}$-correct ERM Oracle). The learner has access to an ERM oracle that takes a distribution $q$ over the dataset as input and outputs $\hat{\theta}$ such that $q^\top L(\hat{\theta}) \leq \min_{\theta \in \Theta} q^\top L(\theta) + \epsilon_{\mathrm{oracle}}$.

To implement the ERM oracle, the learner must solve a weighted empirical loss minimization, Problem (7), in every round. For non-convex problems this is in general NP-hard (Murty and Kabadi, 1987), so obtaining efficient and provably no-regret guarantees in the non-convex setting seems unrealistic in general. Despite this hardness, the empirical success of deep learning demonstrates that stochastic optimization algorithms such as SGD are able to find very good (even if not necessarily optimal) solutions for ERM-like non-convex problems. Furthermore, SGD on the sequence of samples $\{i_\tau \sim q_\tau\}_{\tau=1}^{t}$ approximately solves the BTL problem. To see why, note that this sequence of samples is an unbiased estimator of $\bar{q}_t$ from the BTL algorithm (7). Then, for the freshly sampled $i_t \sim q_t$, a learner that chooses $\theta_t := \theta_{t-1} - \eta_l \nabla L_{i_t}(\theta_{t-1})$ is (approximately) solving the BTL problem with SGD.

**Lemma 2.** A learner that plays the BTL algorithm with access to an ERM oracle as in Assumption 1 achieves at most $\epsilon_{\mathrm{oracle}} T$ regret.

*Proof.* See Appendix B.3.

For convex problems, it is not necessary to solve the BTL problem (7): algorithms such as online projected gradient descent (Zinkevich, 2003) achieve no-regret guarantees. As shown in Appendix C, the learner suffers $\mathrm{LR}_T = O(\sqrt{T})$ regret by playing SGD on convex problems.
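For convex losses, the ERM oracle in Assumption 1 can sometimes be implemented exactly; for instance, for weighted squared losses it has a closed form, so $\epsilon_{\mathrm{oracle}}$ is only numerical error. A small sketch (our own illustration, not the paper's implementation) of such an oracle, applied to a distribution $q$ such as the BTL average $\bar{q}_t$:

```python
import numpy as np

def weighted_erm_oracle(X, y, q):
    """Exact ERM oracle for squared losses: argmin_theta sum_i q_i * (x_i @ theta - y_i)^2.
    Solves the weighted normal equations, so epsilon_oracle is numerical error only."""
    W = np.diag(q)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.05 * rng.normal(size=50)
q = rng.dirichlet(np.ones(50))          # some distribution over the data (e.g., the average q-bar_t)
print(weighted_erm_oracle(X, y, q))
```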
### 4.3 Guarantees for CVaR Optimization

Next, we show that if both players play the no-regret algorithms discussed above, they solve the game (4). Using $J(\theta, q) := q^\top L(\theta)$, the minimax equilibrium of the game is the point $(\theta^\star, q^\star)$ such that, for all $\theta \in \Theta$ and $q \in \mathcal{Q}_\alpha$, $J(\theta^\star, q) \leq J(\theta^\star, q^\star) \leq J(\theta, q^\star)$. We assume that this point exists (which is guaranteed, e.g., when the sets $\mathcal{Q}_\alpha$ and $\Theta$ are compact). The game regret is

$$\mathrm{GameRegret}_T := \sum_{t=1}^{T} J(\theta_t, q^\star) - J(\theta^\star, q_t). \quad (8)$$

**Theorem 1** (Game Regret). Let $L_i(\cdot) : \Theta \to [0, 1]$, $i \in \{1, \dots, N\}$, be a fixed set of loss functions. If the sampler plays ADA-CVAR and the learner plays the oracle-BTL algorithm, then the game regret is $O(\sqrt{TN \log N} + \epsilon_{\mathrm{oracle}} T)$.

*Proof sketch.* We bound the game regret by the sum of the learner and sampler regrets and use the results of Lemma 2 and Lemma 1. For a detailed proof please refer to Appendix B.4.

This immediately implies our main theoretical result, namely a performance guarantee for the solution obtained by ADA-CVAR for the central problem of minimizing the empirical CVaR (4).

**Corollary 1** (Online-to-Batch Conversion). Let $L_i(\cdot) : \Theta \to [0, 1]$, $i \in \{1, \dots, N\}$, be a set of loss functions sampled from a distribution $\mathcal{D}$. Let $\theta^\star$ be the minimizer of the CVaR of the empirical distribution, $\hat{C}_\alpha$. Let $\bar\theta$ be the output of ADA-CVAR, selected uniformly at random from the sequence $\{\theta_t\}_{t=1}^T$. Its expected excess CVaR is bounded as

$$\mathbb{E}\,\hat{C}_\alpha[L(\bar\theta)] \leq \hat{C}_\alpha[L(\theta^\star)] + O\left(\sqrt{N \log N / T}\right) + \epsilon_{\mathrm{oracle}},$$

where the expectation is taken w.r.t. the randomization in the algorithm, both for the sampling steps and the randomization in choosing $\bar\theta$.

*Proof sketch.* For a detailed proof please refer to Appendix B.5. The excess CVaR is bounded by the duality gap, which in turn is upper-bounded by the average game regret.

**Corollary 2** (Population Guarantee). Let $L(\cdot) : \Theta \to [0, 1]$ be a random variable induced by the data distribution $\mathcal{D}$. Let $\theta^\star$ be the minimizer of the CVaR at level $\alpha$ of this random variable. Let $\bar\theta$ be the output of ADA-CVAR, selected uniformly at random from the sequence $\{\theta_t\}_{t=1}^T$. Then, with probability at least $1 - \delta$, the expected excess CVaR of $\bar\theta$ is bounded as

$$\mathbb{E}\, C_\alpha[L(\bar\theta)] \leq C_\alpha[L(\theta^\star)] + O\left(\sqrt{N \log N / T}\right) + \epsilon_{\mathrm{oracle}} + \epsilon_{\mathrm{stat}},$$

where $\epsilon_{\mathrm{stat}} = O\big(\tfrac{1}{\alpha}\sqrt{\tfrac{1}{N}}\big)$ comes from the statistical error, and the expectation is taken w.r.t. the randomization in the algorithm, both for the sampling steps and the randomization in choosing $\bar\theta$.

*Proof sketch.* For a detailed proof please refer to Appendix B.6. We bound the statistical error using Proposition 1 and the optimization error using Corollary 1.

It is instructive to consider the special cases $k = 1$ and $k = N$. For $k = N$, $q_t$ remains uniform and ADA-CVAR reduces to SGD. For $k = 1$, the sampler simply plays standard EXP3 over data points, and ADA-CVAR reduces to the algorithm of Shalev-Shwartz and Wexler (2016) for the max loss.

*(Figure 2: Scores are normalized between 0 and 1 to compare different data sets. Left: linear regression tasks; ADA-CVAR has lower CVaR than the benchmarks. Middle: binary classification (logistic regression) tasks; ADA-CVAR obtains the same accuracy as MEAN and SOFT-CVAR with lower CVaR, while TRUNC-CVAR outputs an approximately uniform distribution, yielding low CVaR but poor predictive accuracy. Right: binary classification (logistic regression) tasks with a 90% train/test distribution shift; ADA-CVAR has the highest test accuracy and low average surrogate loss.)*

## 5 Experiments

In our experimental evaluation, we compare ADA-CVAR on both convex (linear regression and classification) and non-convex (deep learning) tasks. In addition to studying how it performs in terms of the CVaR and empirical risk on the training and test set, we also investigate to what extent it can help guard against distribution shifts.
In Appendix D, we detail the experimental setup. We provide an open-source implementation of our method, available at http://github.com/sebascuri/adacvar.

**Baseline Algorithms.** We compare our adaptive sampling algorithm (ADA-CVAR) to three baselines: first, an i.i.d. sampling scheme that optimizes Problem (3) using the truncated loss (TRUNC-CVAR); second, an i.i.d. sampling scheme that uses a smoothing technique to relax the $\sum_i [x_i]_+$ non-linearity (SOFT-CVAR). Tarnopolskaya and Zhu (2010) compare different smoothing techniques for the $\sum_i [x_i]_+$ non-linearity; of these, we use the relaxation $T \log\left(\sum_i e^{x_i / T}\right)$ proposed by Nemirovski and Shapiro (2006), and in each iteration we heuristically approximate the population sum with a mini-batch. Third, we compare against a standard i.i.d. sampling ERM scheme that stochastically minimizes the average of the losses (MEAN).
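For concreteness, here is a minimal sketch of how such a smoothed mini-batch surrogate for (3) might look, with the hinge sum replaced by the log-sum-exp relaxation above; the per-batch normalization and the temperature values are our own assumptions, not the authors' exact implementation.

```python
import numpy as np

def trunc_cvar_batch(losses, ell, alpha):
    """Mini-batch estimate of the TRUNC-CVAR objective (3): ell + mean(max(L - ell, 0)) / alpha."""
    return ell + np.maximum(losses - ell, 0.0).mean() / alpha

def soft_cvar_batch(losses, ell, alpha, temperature=0.1):
    """Smoothed surrogate: the hinge sum is replaced by T * log(sum_i exp(x_i / T)) with x_i = L_i - ell."""
    x = (losses - ell) / temperature
    m = x.max()
    lse = m + np.log(np.exp(x - m).sum())       # numerically stable log-sum-exp
    return ell + temperature * lse / (alpha * len(losses))

rng = np.random.default_rng(0)
batch = rng.beta(2, 5, size=128)                # toy mini-batch of per-example losses
for T in (1.0, 0.1, 0.01):
    print(T, soft_cvar_batch(batch, ell=0.5, alpha=0.05, temperature=T))
print(trunc_cvar_batch(batch, ell=0.5, alpha=0.05))
```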
### 5.1 Convex CVaR Optimization

We first compare the different algorithms in a controlled convex setting, where the classical TRUNC-CVAR algorithm is expected to perform well. We consider three UCI regression data sets, three synthetic regression data sets, and eight different UCI classification data sets (Dua and Graff, 2017). The left and middle plots in Figure 2 present a summary of the results (see Table 1 in Appendix E for a detailed version). On the test set, we evaluate the CVaR (α = 0.01) and average loss for linear regression, and the accuracy, CVaR, and average surrogate loss for classification (logistic regression). In linear regression, ADA-CVAR performs comparably to or better than the benchmarks in terms of the CVaR of the loss and is second best in terms of the average loss. In classification, TRUNC-CVAR performs better in terms of the CVaR of the surrogate loss but performs poorly in terms of accuracy; this is because it learns a predictive distribution that is close to uniform. ADA-CVAR has accuracy comparable to ERM (the MEAN algorithm) but a much better CVaR. Hence, it finds a good predictive model while successfully controlling the prediction risk.

*(Figure 3: Non-convex optimization tasks. Left: normalized scores in image classification tasks; ADA-CVAR attains state-of-the-art accuracy and the lowest CVaR. Middle and right: test accuracy under train/test distribution shift on CIFAR-10 for VGG16-BN (middle) and ResNet-18 (right); lower β indicates larger shift, and ADA-CVAR always has better test accuracy than the benchmarks.)*

### 5.2 Convex CVaR Distributional Robustness

We use the same classification data sets and classifiers as in Section 5.1. To produce the distribution shift, we randomly sub-sample the majority class in the training set, so that the new training set has a 10%/90% class imbalance and the majority/minority classes are inverted. The test set is kept unchanged. Such shifts in class frequencies are common (Mehrabi et al., 2019, Section 3.2). We consider α = 0.1, which is compatible with the data imbalance. The right plot in Figure 2 shows a summary of the results (see Table 2 in Appendix E for detailed results). ADA-CVAR has higher test accuracy than the benchmarks and is comparable to TRUNC-CVAR on average log-likelihood.

We note that the CVaR provides robustness with respect to worst-case distribution shifts. Such a worst-case distribution might, however, be too pessimistic to be encountered in practice. Instead, ADA-CVAR appears to benefit from the varying distributions during training and protects better against non-adversarial distribution shifts. Other techniques for dealing with imbalanced data might also address this distribution shift empirically, but they are only applicable with a priori knowledge of the class ratio in the test set; in contrast, CVaR optimization guards against any distribution shift. Furthermore, with such a priori knowledge, these techniques can also be used together with ADA-CVAR. We provide extended experiments analyzing distribution shift in Appendix E.1.

### 5.3 Non-Convex (Deep Learning) CVaR Optimization

We test our algorithm on common non-convex optimization benchmarks in deep learning (MNIST, Fashion-MNIST, CIFAR-10). As is common in this setting, we perform data augmentation on the training set, so the effective training set size is infinite. To address this, we consider a mixture of distributions in a similar spirit as Borsos et al. (2019): each data point serves as a representative of a distribution over all its possible augmentations. We optimize the CVaR of this mixture of distributions as a surrogate for the CVaR of the infinite data set. The left plot in Figure 3 summarizes the results (see Table 3 in Appendix E). ADA-CVAR reaches the same accuracy as ERM in all cases and has lower CVaR. Only on CIFAR-10 does it not outperform TRUNC-CVAR in terms of the CVaR of the surrogate loss; this is because TRUNC-CVAR yields a predictive model that is close to uniform. Instead, ADA-CVAR still yields useful predictions while controlling the CVaR.

**Gradient Magnitude and Training Time.** The gradients of TRUNC-CVAR are either 0 or 1/α times larger than the gradients of the same point using MEAN. A similar but smoothed phenomenon arises with SOFT-CVAR. This makes training with these losses considerably harder due to exploding gradients and noisier gradient estimates. With the same learning rates, these algorithms usually produce numerical overflows and, to stabilize learning, we used considerably smaller learning rates; in turn, this increased the number of iterations required for convergence. ADA-CVAR does not suffer from this, as its gradients have the same magnitude as in MEAN. For example, to reach 85% train accuracy, ADA-CVAR requires 7 epochs, MEAN 9, SOFT-CVAR 21, and TRUNC-CVAR never surpassed 70% train accuracy. There was no significant difference in time per epoch between the algorithms.

### 5.4 Distributional Robustness in Deep Learning through Optimizing the CVaR

Lastly, we demonstrate that optimizing the CVaR yields improved robustness to distribution shifts in deep learning. We simulate distribution shift through mismatched training and test class frequencies. Since we consider multi-class problems, we simulate power-law class frequencies, which are commonly encountered in various applications (Clauset et al., 2009). More specifically, we sub-sample each class of the CIFAR-10 training set so that the class sizes follow a power-law distribution $p(|c|) \propto c^{\log \beta}$, where $|c|$ is the size of the $c$-th class, and keep the test set unchanged. In the middle and right plots of Figure 3, we show the test accuracy for different values of β for the VGG16-BN and ResNet-18 networks. The algorithms do not know a priori the amount of distribution shift to protect against and consider a fixed α = 0.1.
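A sketch of how such a power-law class sub-sampling could be generated is given below (our own illustration; the exact normalization used in the paper may differ). Smaller β gives a steeper decay of class sizes, i.e., a larger train/test shift, consistent with Figure 3.

```python
import numpy as np

def powerlaw_class_sizes(n_classes, n_per_class, beta):
    """Target class sizes proportional to c^{log(beta)} for classes c = 1..n_classes (largest class kept full)."""
    c = np.arange(1, n_classes + 1, dtype=float)
    weights = c ** np.log(beta)
    weights /= weights.max()                       # keep class 1 at full size
    return np.maximum(1, (weights * n_per_class).astype(int))

def subsample_train_set(labels, sizes, seed=0):
    """Return indices of a sub-sampled training set matching the target per-class sizes."""
    rng = np.random.default_rng(seed)
    keep = []
    for cls, size in enumerate(sizes):
        idx = np.flatnonzero(labels == cls)
        keep.append(rng.choice(idx, size=min(size, len(idx)), replace=False))
    return np.concatenate(keep)

# CIFAR-10-like setup: 10 classes with 5000 training images each.
labels = np.repeat(np.arange(10), 5000)
for beta in (1.0, 0.1, 0.01):
    sizes = powerlaw_class_sizes(10, 5000, beta)
    print(beta, sizes, len(subsample_train_set(labels, sizes)))
```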
For all distribution shifts, ADA-CVAR is superior to the benchmarks. When a high-capacity network learns a perfectly accurate model, the average and the CVaR of the loss distribution are both zero; this might explain the similarity between MEAN and ADA-CVAR for ResNet-18. In contrast, there is a stark discrepancy between ADA-CVAR and MEAN for VGG16. This shows the advantage of training in a risk-averse manner, particularly when the model makes incorrect predictions due to a strong inductive bias.

## 6 Conclusions

The CVaR is a natural criterion for training ML models in a risk-aware fashion. As we saw, the traditional way of optimizing it via truncated losses fails for modern machine learning tasks due to the high variance of the gradient estimates. Our novel adaptive sampling algorithm ADA-CVAR exploits the distributionally robust optimization formulation of the CVaR and tackles it via regret minimization. It naturally enables the use of standard stochastic optimization approaches (e.g., SGD), applied to the marginal distribution of a certain k-DPP. Finally, we demonstrate in a range of experiments that ADA-CVAR is superior to the TRUNC-CVAR algorithm for regression and classification tasks, both in convex and non-convex learning settings. Furthermore, ADA-CVAR provides higher robustness to (non-adversarial) distribution shifts than the TRUNC-CVAR, SOFT-CVAR, or MEAN algorithms.

## Broader Impact

Increasing reliability is one of the central challenges when deploying machine learning in high-stakes applications. We believe our paper makes important contributions to this endeavor by going beyond simply optimizing the average performance and considering risk in deep learning. The CVaR is also known to be an avenue towards enforcing fairness constraints in data sets (Williamson and Menon, 2019). Hence, our algorithm also contributes to optimizing fair deep models, by counteracting inherent biases in the data (e.g., undersampling of certain parts of the population).

## Acknowledgments and Disclosure of Funding

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 815943), a DARPA YFA award, and NSF CAREER award 1553284.

## References

N. Agarwal, A. Gonen, and E. Hazan. Learning in non-convex games with an optimization oracle. arXiv:1810.07362, 2018.
A. Ahmadi-Javid. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105–1123, 2012.
P. Alatur, K. Y. Levy, and A. Krause. Multi-player bandits: The adversarial case. Journal of Machine Learning Research, 21(77):1–23, 2020.
N. Anari, S. O. Gharan, and A. Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In Conference on Learning Theory, pages 103–115, 2016.
P. Artzner et al. Coherent measures of risk. Mathematical Finance, pages 203–228, 1999.
J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013.
P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
S. Barthelmé, P.-O. Amblard, N. Tremblay, et al. Asymptotic equivalence of fixed-size and varying-size determinantal point processes. Bernoulli, pages 3555–3589, 2019.
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
Z. Borsos, S. Curi, K. Y. Levy, and A. Krause. Online variance reduction with mixtures. In International Conference on Machine Learning, pages 705–714, 2019.
D. B. Brown. Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters, 35(6):722–730, 2007.
C. Brownlees, E. Joly, G. Lugosi, et al. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.
S. Bubeck, N. Cesa-Bianchi, and S. M. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, 2012.
M. C. Carneiro et al. Risk management in the oil supply chain: a CVaR approach. Industrial & Engineering Chemistry Research, pages 3286–3294, 2010.
N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.
A. Clauset, C. R. Shalizi, and M. E. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
M. Dereziński, D. Calandriello, and M. Valko. Exact sampling of determinantal point processes with sublinear time preprocessing. arXiv:1905.13476, 2019.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
J. Duchi, P. Glynn, and H. Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv:1610.03425, 2016.
J. P. Eaton and C. A. Haas. Titanic, Triumph and Tragedy. WW Norton & Company, 1995.
P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166, 2018.
Y. Fan, S. Lyu, Y. Ying, and B. Hu. Learning with average top-k loss. In Advances in Neural Information Processing Systems, pages 497–505, 2017.
Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
J.-y. Gotoh and A. Takeda. CVaR minimizations in support vector machines. Financial Signal Processing and Machine Learning, pages 233–265, 2016.
E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, pages 157–325, 2016.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
D. Kahneman and A. Tversky. Prospect theory: An analysis of decision under risk. In Handbook of the Fundamentals of Financial Decision Making: Part I, pages 99–127. World Scientific, 2013.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
J. Kirschner, I. Bogunovic, S. Jegelka, and A. Krause. Distributionally robust Bayesian optimization. In The 23rd International Conference on Artificial Intelligence and Statistics, 2020.
W. M. Koolen, M. K. Warmuth, and J. Kivinen. Hedging structured concepts. In COLT, pages 93–105, 2010.
A. Krizhevsky, V. Nair, and G. Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.
P. Krokhmal, J. Palmquist, and S. Uryasev. Portfolio optimization with conditional value-at-risk objective and constraints. Journal of Risk, 4:43–68, 2002.
A. Kulesza, B. Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012.
T. Lattimore and C. Szepesvári. Bandit Algorithms. Preprint, 2018.
Y. LeCun, Y. Bengio, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.
Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
J. Lee, S. Park, and J. Shin. Learning bounds for risk-sensitive learning. arXiv:2006.08138, 2020.
N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and fairness in machine learning. arXiv:1908.09635, 2019.
Z. Mhammedi, B. Guedj, and R. C. Williamson. PAC-Bayesian bound for the conditional value at risk. arXiv:2006.14763, 2020.
K. G. Murty. Linear Programming. Springer, 1983.
K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.
H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pages 2208–2216, 2016.
H. Namkoong and J. C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
A. Nemirovski and A. Shapiro. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969–996, 2006.
W. Ogryczak and A. Tamir. Minimizing the sum of the k largest functions in linear time. Information Processing Letters, 85(3):117–122, 2003.
A. Paszke et al. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
J. W. Pratt. Risk aversion in the small and in the large. In Uncertainty in Economics, pages 59–79. Elsevier, 1978.
M. Rabin. Risk aversion and expected-utility theory. In Handbook of the Fundamentals of Financial Decision Making, pages 241–252. World Scientific, 2013.
R. T. Rockafellar, S. Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
G. Rubino and B. Tuffin. Rare Event Simulation using Monte Carlo Methods. John Wiley & Sons, 2009.
A. Sani, A. Lazaric, and R. Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.
S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why. In ICML, pages 793–801, 2016.
A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2009.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
T. Soma and Y. Yoshida. Statistical learning with conditional value at risk. arXiv:2002.05826, 2020.
T. Tarnopolskaya and Z. Zhu. CVaR-minimising hedging by a smoothing method. ANZIAM Journal, 52:237–256, 2010.
T. Uchiya, A. Nakamura, and M. Kudo. Algorithms for adversarial bandit problems with multiple plays. In International Conference on Algorithmic Learning Theory, pages 375–389. Springer, 2010.
V. Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, pages 831–838, 1992.
R. Williamson and A. Menon. Fairness risk measures. In International Conference on Machine Learning, pages 6786–6797, 2019.
H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
D. W. Zimmerman. Teacher's corner: A note on interpretation of the paired-samples t test. Journal of Educational and Behavioral Statistics, 22(3):349–360, 1997.
M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.