# DORO: Distributional and Outlier Robust Optimization

Runtian Zhai*¹, Chen Dan*¹, J. Zico Kolter¹, Pradeep Ravikumar¹

*Equal contribution. ¹School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. Correspondence to: Runtian Zhai. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

**Abstract**  Many machine learning tasks involve subpopulation shift, where the testing data distribution is a subpopulation of the training distribution. For such settings, a line of recent work has proposed the use of a variant of empirical risk minimization (ERM) known as distributionally robust optimization (DRO). In this work, we apply DRO to real, large-scale tasks with subpopulation shift, and observe that DRO performs relatively poorly, and moreover has severe instability. We identify one direct cause of this phenomenon: the sensitivity of DRO to outliers in the datasets. To resolve this issue, we propose the framework of DORO, for Distributional and Outlier Robust Optimization. At the core of this approach is a refined risk function which prevents DRO from overfitting to potential outliers. We instantiate DORO for the Cressie-Read family of Rényi divergence, and delve into two specific instances of this family: CVaR and χ²-DRO. We theoretically prove the effectiveness of the proposed method, and empirically show that DORO improves the performance and stability of DRO with experiments on large modern datasets, thereby positively addressing the open question raised by (Hashimoto et al., 2018). Code is available at https://github.com/RuntianZ/doro.

## 1. Introduction

Many machine learning tasks require models to perform well under distributional shift, where the training and the testing data distributions are different. One type of distributional shift that arouses great research interest is subpopulation shift, where the testing distribution is a specific or the worst-case subpopulation of the training distribution. A wide range of tasks can be modeled as subpopulation shift problems, such as learning for algorithmic fairness (Dwork et al., 2012; Barocas & Selbst, 2016), where we want to test a model's performance on key demographic subpopulations, and learning with class imbalance (Japkowicz, 2000; Galar et al., 2011), where we train a classifier on an imbalanced dataset with some minority classes having much fewer samples than the others, and we want to maximize the classifier's accuracy on the minority classes instead of its overall average accuracy.

Distributionally robust optimization (DRO) (Namkoong & Duchi, 2016; Duchi & Namkoong, 2018) refers to a family of learning algorithms that minimize the model's loss over the worst-case distribution in a neighborhood of the observed training distribution. Generally speaking, DRO trains the model on the worst-off subpopulation, and when the subpopulation membership is unknown, it focuses on the worst-off training instances, that is, the tail performance of the model. Previous work has shown the effectiveness of DRO in subpopulation shift settings, such as algorithmic fairness (Hashimoto et al., 2018) and class imbalance (Xu et al., 2020). However, in our empirical investigations, when we apply DRO to real tasks on modern datasets, we observe that DRO suffers from poor performance and severe instability during training.
The issue that DRO is sensitive to outliers has been raised by several previous papers (Hashimoto et al., 2018; Hu et al., 2018; Zhu et al., 2020). In this paper, we study the cause of these problems with DRO, and develop approaches to address them. In particular, we identify and study one key factor that we find directly leads to DRO's sub-optimal behavior: DRO's sensitivity to outliers that widely exist in modern datasets. In general, DRO maximizes a model's tail performance by putting more weight on the harder instances, i.e. those which incur higher losses during training. On the one hand, this allows DRO to focus its attention on worst-off subpopulations. On the other hand, since outliers are intuitively hard instances that incur higher losses than inliers, DRO is prone to assign large weights to outliers, resulting in both a drop in performance and training instability.

To provide empirical insights into how outliers affect DRO, in Section 3 we conduct experiments examining how the performance of DRO changes as we remove or add outliers to the dataset. The results of these experiments indicate that outliers bring about the observed bad performance of DRO. Thus, it is crucial to first enhance the robustness of DRO to outliers before applying it to real-world applications.

[Figure 1. DORO avoids overfitting to outliers.]

To this end, we propose DORO, an outlier robust refinement of DRO which takes inspiration from robust statistics. At the core of this approach is a refined risk function which prevents DRO from overfitting to potential outliers. Intuitively speaking, the new risk function adaptively filters out a small fraction of data with high risk during training, which is potentially caused by outliers. Figure 1 illustrates the difference between DRO and DORO. In Section 4 we implement DORO for the Cressie-Read family of Rényi divergence, and for our theoretical and empirical study we primarily focus on CVaR-DORO and χ²-DORO. In Section 5 we provide theoretical results guaranteeing that DORO can effectively handle subpopulation shift in the presence of outliers. Then, in Section 6 we empirically demonstrate that DORO improves the performance and stability of DRO. We conduct large-scale experiments on three datasets: the tabular dataset COMPAS, the vision dataset CelebA, and the language dataset CivilComments-Wilds.

**Contributions**  Our contributions are summarized below:

- We demonstrate that the sensitivity of DRO to outliers is a direct cause of the irregular behavior of DRO, with some intriguing experimental results in Section 3.
- We propose and implement DORO as an outlier robust refinement of DRO in Section 4. Then, in Section 5 we provide theoretical guarantees for DORO.
- We conduct large-scale experiments in Section 6 and empirically show that DORO improves the performance and stability of DRO. We also analyze the effect of hyperparameters on DRO and DORO.

**Related Work**  Distributional shift naturally arises in many machine learning applications and has been widely studied in statistics, applied probability and optimization (Shimodaira, 2000; Huang et al., 2006; Bickel et al., 2007; Quiñonero-Candela et al., 2009). One common type of distributional shift is domain generalization, where the training and testing distributions consist of distinct domains; relevant topics include domain adaptation (Patel et al., 2015; Wang & Deng, 2018) and transfer learning (Pan & Yang, 2009; Tan et al., 2018).
Another common type of distributional shift, studied in this paper, is subpopulation shift, where the two distributions consist of the same group of domains. Subpopulation shift is closely related to algorithmic fairness and class imbalance. For algorithmic fairness, a number of fairness notions have been proposed, such as individual fairness (Dwork et al., 2012; Zemel et al., 2013), group fairness (Hardt et al., 2016; Zafar et al., 2017), counterfactual fairness (Kusner et al., 2017) and Rawlsian Max-Min fairness (Rawls, 2001; Hashimoto et al., 2018). The setting of subpopulation shift is most closely related to the Rawlsian Max-Min fairness notion. Several recent papers (Hashimoto et al., 2018; Oren et al., 2019; Xu et al., 2020) proposed using DRO to deal with subpopulation shift, but it was also observed that DRO was prone to overfit in practice (Sagawa et al., 2020a;b). Hashimoto et al. (2018) raised the open question of whether it is possible to design algorithms that are both fair to unknown latent subpopulations and robust to outliers, and this work answers this question positively.

Outlier robust estimation is a classic problem in statistics, starting with the pioneering works of Tukey (1960) and Huber (1992). Recent works in statistics and machine learning (Lai et al., 2016; Diakonikolas et al., 2017; Prasad et al., 2018; Diakonikolas et al., 2019) provided efficiently computable outlier-robust estimators for high-dimensional mean estimation with corresponding error guarantees. Outliers have a greater effect on the performance of DRO than on ERM (Hu et al., 2018) due to DRO's focus on tail performance, so removing this negative impact of outliers is crucial for the success of DRO in real-world applications. One closely related recent work is (Lee et al., 2020), and DORO can be viewed as a combination of the risk-averse and risk-seeking methods discussed in that paper.

## 2. Background

This section provides the necessary background on subpopulation shift and DRO.

### 2.1. Subpopulation Shift

A machine learning task with subpopulation shift requires a model that performs well on the data distribution of each subpopulation. Let the input space be X and the label space be Y. We are given a training set containing m samples drawn i.i.d. from some data distribution P over X × Y. There are K predefined domains (subpopulations) D_1, ..., D_K, each of which is a subset of X × Y. For example, in an algorithmic fairness task, domains are demographic groups defined by a number of protected features such as race and sex. Let P_k(z) = P(z | z ∈ D_k) be the conditional training distribution over D_k, where z = (x, y). The goal is to train a model f_θ : X → Y, parameterized by θ ∈ Θ, that performs well over every P_k. Denote the expected risk over P by R(θ; P) = E_{Z∼P}[ℓ(θ; Z)], where ℓ(θ; z) is a measurable loss function. Then the expected risk over P_k is R_k(θ; P) = E_{Z∼P_k}[ℓ(θ; Z)]. The objective is to minimize the worst-case risk, defined as

$$\mathcal{R}_{\max}(\theta;P)=\max_{k=1,\dots,K}\mathcal{R}_k(\theta;P)\tag{1}$$

Several different settings were studied by previous work:

**Overlapping vs Non-overlapping**  The overlapping setting allows the domains to overlap with each other, while the non-overlapping setting does not. For example, suppose we have two protected features: race (White and Others) and sex (Male and Female). Under either setting we will have four domains. Under the overlapping setting we will have White, Others, Male and Female, while under the non-overlapping setting we will have White Male, White Female, Others Male and Others Female. All the experiments in this work are conducted under the overlapping setting. Each instance may belong to zero, one or more domains.
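To make the worst-case risk (1) and the overlapping setting concrete, the following is a minimal numerical sketch; the group indicators and losses are synthetic and purely illustrative, not taken from any dataset in this paper.

```python
import numpy as np

def worst_case_risk(losses, domain_masks):
    """Worst-case risk (Eq. 1): the largest average loss over the
    (possibly overlapping) domains D_1, ..., D_K."""
    return max(losses[mask].mean() for mask in domain_masks)

rng = np.random.default_rng(0)
losses = rng.exponential(size=1000)      # stand-ins for per-sample losses l(theta; z)
white = rng.random(1000) < 0.6           # synthetic protected-feature indicators
male = rng.random(1000) < 0.5
domains = [white, ~white, male, ~male]   # the four overlapping domains described above

print("average risk   :", losses.mean())
print("worst-case risk:", worst_case_risk(losses, domains))
```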
**Domain-Aware vs Domain-Oblivious**  Some previous work has assumed that domain memberships of instances are known, at least during training; this is called the domain-aware setting. However, Hashimoto et al. (2018) argue that in many real applications domain memberships are unknown during training, either because it is hard to extract the domain information from the input, or because it is hard to identify all protected features. Thus, a line of recent work (Hashimoto et al., 2018; Lahoti et al., 2020) studies the domain-oblivious setting, in which the training algorithm does not know the domain membership of any instance (even the number of domains K is unknown). In this work, we focus on the domain-oblivious setting.

### 2.2. Distributionally Robust Optimization (DRO)

Under the domain-oblivious setting, we cannot compute the worst-case risk since we have no access to D_1, ..., D_K. In this case, the framework of DRO instead maximizes the performance over the worst-off subpopulation in general. Specifically, given some divergence D between distributions, DRO aims to minimize the expected risk over the worst-case distribution Q (that is absolutely continuous with respect to the training distribution P, written Q ≪ P) in a ball w.r.t. the divergence D around the training distribution P. Thus, while the empirical risk minimization (ERM) algorithm minimizes the expected risk R(θ; P), DRO minimizes the expected DRO risk defined as

$$\mathcal{R}_{D,\rho}(\theta;P)=\sup_{Q\ll P}\left\{\mathbb{E}_Q[\ell(\theta;Z)]:D(Q\|P)\le\rho\right\}\tag{2}$$

for some ρ > 0. Different divergence functions D derive different DRO risks. In this work, we focus on the Cressie-Read family of Rényi divergence (Cressie & Read, 1984), formulated as

$$D_\beta(Q\|P)=\int f_\beta\!\left(\frac{dQ}{dP}\right)dP\tag{3}$$

where β > 1, and f_β(t) is defined as

$$f_\beta(t)=\frac{1}{\beta(\beta-1)}\left(t^\beta-\beta t+\beta-1\right)\tag{4}$$

An advantage of the Cressie-Read family is that it has the following convenient dual characterization (see Lemma 1 of (Duchi & Namkoong, 2018) for the proof):

$$\mathcal{R}_{D_\beta,\rho}(\theta;P)=\inf_{\eta\in\mathbb{R}}\left\{c_\beta(\rho)\,\mathbb{E}_P\!\left[(\ell(\theta;Z)-\eta)_+^{\beta_*}\right]^{1/\beta_*}+\eta\right\}\tag{5}$$

where β_* = β/(β − 1), and c_β(ρ) = (1 + β(β − 1)ρ)^{1/β}.

The following proposition shows that DRO can handle subpopulation shift under the domain-oblivious setting. The only information DRO needs during training is α, the ratio between the size of the smallest domain and the size of the population. See the proof in Appendix A.1.

**Proposition 1.** Let α = min_{k=1,...,K} P(D_k) be the minimal group size, and define ρ = f_β(1/α). Then

$$\mathcal{R}_{\max}(\theta;P)\le\mathcal{R}_{D_\beta,\rho}(\theta;P)\tag{6}$$

While the Cressie-Read formulation only defines the f-divergence for finite β ∈ (1, +∞), it can be shown that the dual characterization is valid for β = ∞ as well, for which the DRO risk becomes the well-known conditional value-at-risk (CVaR) (see e.g. (Duchi & Namkoong, 2018), Example 3). In our theoretical analysis and experiments, we delve into the two most widely used special cases of the Cressie-Read family: (i) β = ∞, which corresponds to CVaR; (ii) β = 2, which corresponds to the χ²-DRO risk used in (Hashimoto et al., 2018). Table 1 summarizes the relevant quantities in these two special cases.

Table 1. CVaR and χ²-DRO. α is the ratio between the size of the smallest domain and the size of the population.

|            | CVaR               | χ²-DRO                    |
|------------|--------------------|---------------------------|
| β          | ∞                  | 2                         |
| β_*        | 1                  | 2                         |
| ρ          | −log α             | ½(1/α − 1)²               |
| c_β(ρ)     | α⁻¹                | √(1 + 2ρ)                 |
| D_β(Q‖P)   | sup log(dQ/dP)     | ½ ∫ (dQ/dP − 1)² dP       |
| DRO risk   | CVaR_α(θ; P)       | R_{D_χ²,ρ}(θ; P)          |
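As a quick illustration of the dual characterization (5) and the χ² column of Table 1, the sketch below evaluates the χ²-DRO risk of a sample of losses by a one-dimensional search over η. It is a minimal, illustrative snippet under the definitions above (the variable names and the use of SciPy are our own choices, not part of the paper); the CVaR column works analogously with β_* = 1 and c_β(ρ) = α⁻¹, which yields the dual form (7) given next.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chi2_dro_risk(losses, alpha):
    """Evaluate the dual form (5) with beta = 2 on a sample of losses:
    inf_eta { c_2(rho) * E[(loss - eta)_+^2]^(1/2) + eta },
    where rho = (1/alpha - 1)^2 / 2 and c_2(rho) = sqrt(1 + 2*rho) (Table 1)."""
    rho = 0.5 * (1.0 / alpha - 1.0) ** 2
    c = np.sqrt(1.0 + 2.0 * rho)

    def dual(eta):
        return c * np.sqrt(np.mean(np.clip(losses - eta, 0.0, None) ** 2)) + eta

    # The dual objective is convex in eta, so a scalar minimizer (Brent) suffices.
    return minimize_scalar(dual).fun

losses = np.random.default_rng(0).exponential(size=5000)  # toy per-sample losses
print("ERM risk :", losses.mean())
print("chi2-DRO :", chi2_dro_risk(losses, alpha=0.2))     # alpha = minimal group size
```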
For example, the dual form of CVaR is

$$\mathrm{CVaR}_\alpha(\theta;P)=\inf_{\eta\in\mathbb{R}}\left\{\alpha^{-1}\,\mathbb{E}_P[(\ell(\theta;Z)-\eta)_+]+\eta\right\}\tag{7}$$

It is easy to see that the optimal η of (7) is the α-quantile of ℓ(θ; Z), defined as

$$q_\theta(\alpha)=\inf_q\left\{q:\mathbb{P}_{Z\sim P}(\ell(\theta;Z)>q)\le\alpha\right\}\tag{8}$$

The dual form (7) shows that CVaR in effect minimizes the expected risk on the worst α-portion of the training data. The following corollary of Proposition 1 shows that both CVaR_α(θ; P) and R_{D_χ²,ρ}(θ; P) are upper bounds of R_max(θ; P), so that minimizing either of them guarantees a small worst-case risk (see the proof in Appendix A.2):

**Corollary 2.** Let α = min_{k=1,...,K} P(D_k) be the minimal group size, and ρ = ½(1/α − 1)². Then

$$\mathcal{R}_{\max}(\theta;P)\le\mathrm{CVaR}_\alpha(\theta;P)\le\mathcal{R}_{D_{\chi^2},\rho}(\theta;P)\tag{9}$$

## 3. DRO is Sensitive to Outliers

Although the construction of DRO aims to be effective against subpopulation shift as detailed in the previous section, when applied to real tasks DRO is found to have poor and unstable performance. After some examination, we pinpoint one direct cause of this phenomenon: the vulnerability of DRO to outliers that widely exist in modern datasets. In this section, we provide some intriguing experimental results to show that:

1. DRO methods have poor and unstable performance.
2. Sensitivity to outliers is a direct cause of DRO's poor performance. To support this argument, we show that DRO becomes good and stable on a clean dataset constructed by removing the outliers from the original dataset, and that new outliers added to this clean dataset compromise DRO's performance and stability.

We conduct experiments on COMPAS (Larson et al., 2016), a recidivism prediction dataset with 5049 training instances (after preprocessing and train-test splitting). We select two features as protected features: race and sex. The two protected features define four overlapping demographic groups: White, Others, Male and Female. A two-layer feed-forward neural network with ReLU activations is used as the classification model. We train three models on this dataset with ERM, CVaR and χ²-DRO.

[Figure 2. Average/worst-case test accuracies on the COMPAS dataset (original, clean with the outliers removed, and clean with 20% of the labels flipped). Panels: (a)/(b) average/worst-case accuracy on the original dataset; (c)/(d) train/test loss of ERM and DRO on the original dataset (average over all samples); (e)/(f) average/worst-case accuracy with outliers removed; (g)/(h) average/worst-case accuracy with labels flipped; (i)/(j) average/worst-case accuracy of DORO on the original dataset.]

Then we remove the outliers from the training set using the following procedure: we first train a model with ERM, and then remove the 200 training instances that incur the highest loss on this model, as outliers are likely to have poorer fit. Then we reinitialize the model, train it on the new training set with ERM, and remove 200 more instances with the highest loss from the new training set. This process is repeated 5 times, so that 1000 training instances are removed and we get a new training set with 4049 instances. Note that this procedure is not guaranteed to remove all outliers and retain all inliers, but it is sufficient for the purposes of our demonstration. We then run the three algorithms again on this same clean training set.
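The removal procedure just described is a simple form of iterative trimming; a minimal sketch is given below. `fit_erm` and `per_sample_loss` are hypothetical hooks standing in for the actual model training and loss evaluation, so this is an outline of the bookkeeping rather than a full training script.

```python
import numpy as np

def trim_outliers(n_train, fit_erm, per_sample_loss, rounds=5, drop_per_round=200):
    """Iterative trimming used in Section 3: retrain with ERM, drop the
    highest-loss instances, and repeat.  Returns the indices of the retained
    (cleaner) subset of the training set."""
    idx = np.arange(n_train)
    for _ in range(rounds):
        model = fit_erm(idx)                   # retrain from scratch on the kept subset
        losses = per_sample_loss(model, idx)   # one loss value per kept instance
        order = np.argsort(losses)             # ascending; the largest losses come last
        idx = idx[order[: len(idx) - drop_per_round]]
    return idx
```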
We plot the test accuracies (average and worst across the four demographic groups) of the models achieved by the three methods in Figure 2. The first row shows the results on the original dataset, and the third row shows the results on the clean dataset with the outliers removed. We can see that in the first row, for both average and worst-case test accuracies, the DRO curves are below the ERM curves and jump up and down, which implies that DRO has lower performance than ERM and is very unstable on the original dataset. However, the third row shows that DRO becomes good and stable after the outliers are removed. For comparison, in the second row we plot the train/test loss on the original dataset of the three methods (for ERM we plot the ERM loss, and for DRO we plot the corresponding DRO loss). The train and test losses of DRO descend steadily while the average and worst-case accuracies jump up and down, which indicates that the instability is not an optimization issue, but rather stems from the existence of outliers. It should also be emphasized that these outliers naturally exist in the original dataset, since no outliers have been manually added yet.

To further substantiate our conclusion, we consider another common source of outliers: incorrect labels. We randomly flip 20% of the labels of the clean COMPAS dataset with the outliers removed, and run the three training methods again. The results are plotted in the fourth row of Figure 2, which shows that while the label noise only slightly influences ERM, it significantly downgrades the performance and stability of the two DRO methods. Likewise, (Hu et al., 2018) also found in their experiments that DRO had even lower performance than ERM (see their Table 1). Essentially, DRO methods minimize the expected risk on the worst portion of the training data, which contains a higher density of outliers than the whole population. Training on these instances naturally results in the observed bad performance of DRO.

In the next section we will propose DORO as a solution to the problem revealed by the experiments in this section. We plot the performances of the two DORO algorithms we implement in the last row of Figure 2, which, compared to the first row, shows that DORO improves the performance and stability of DRO on the original dataset.

## 4. DORO: Distributional and Outlier Robust Optimization

**Problem Setting**  The goal is to train a model on a dataset with outliers to achieve high tail performance on the clean underlying data distribution P. Denote the observed contaminated training distribution by P_train. We formulate P_train with Huber's ϵ-contamination model (Huber, 1992), in which the training instances are i.i.d. sampled from

$$P_{\text{train}}=(1-\epsilon)P+\epsilon\tilde{P}\tag{10}$$

where P̃ is an arbitrary outlier distribution, and 0 < ϵ < 1/2 is the noise level. The objective is to minimize R_max(θ; P), the worst-case risk over the clean distribution P.

Algorithm 1  DORO with D_β divergence
Input: batch size n, outlier fraction ϵ, minimal group size α
for each iteration do
    Sample a batch z_1, ..., z_n ∼ P_train
    Compute the losses: ℓ_i = ℓ(θ; z_i) for i = 1, ..., n
    Sort the losses: ℓ_{i_1} ≥ ℓ_{i_2} ≥ ... ≥ ℓ_{i_n}
    Find η* = argmin_η F(θ, η), where
        F(θ, η) = c_β(ρ) [ (1/(n − ⌊ϵn⌋)) Σ_{j=⌊ϵn⌋+1}^{n} (ℓ(θ; z_{i_j}) − η)_+^{β_*} ]^{1/β_*} + η
    Update θ by one step to minimize ℓ(θ) = F(θ, η*) with some gradient method
end for
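As a concrete illustration of Algorithm 1, the following is a minimal PyTorch-style sketch of the batch objective for the CVaR case (β = ∞, so β_* = 1 and c_β(ρ) = α⁻¹ by Table 1). This is an illustrative reimplementation under the formulas above, not the authors' released code; with ϵ = 0 it reduces to plain batch CVaR, and the χ² case would instead keep η as an explicit variable found by a scalar solver.

```python
import torch

def cvar_doro_loss(losses, alpha, eps):
    """Batch objective of Algorithm 1 for the CVaR case (beta = infinity).
    `losses` holds the per-sample losses of one batch (reduction='none').

    Step 1: discard the eps-fraction of samples with the largest loss
            (the potential outliers).
    Step 2: average the largest alpha-fraction of the remaining losses, which
            (up to the discreteness of the batch) minimizes the CVaR dual
            F(theta, eta) over eta.  With eps = 0 this is plain batch CVaR."""
    n = losses.numel()
    sorted_losses, _ = torch.sort(losses, descending=True)
    retained = sorted_losses[int(eps * n):]       # drop the floor(eps*n) hardest samples
    k = max(1, int(alpha * retained.numel()))     # size of the worst alpha-fraction kept
    return retained[:k].mean()

# Usage inside a standard training loop (model, optimizer, criterion are assumed):
#   per_sample = criterion(model(x), y)           # per-sample losses, reduction='none'
#   loss = cvar_doro_loss(per_sample, alpha=0.2, eps=0.01)
#   loss.backward(); optimizer.step()
```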
**DORO Risk**  We propose to minimize the following expected ϵ-DORO risk:

$$\mathcal{R}_{D,\rho,\epsilon}(\theta;P_{\text{train}})=\inf_{P'}\left\{\mathcal{R}_{D,\rho}(\theta;P'):\exists\,\tilde{P}'\text{ s.t. }P_{\text{train}}=(1-\epsilon)P'+\epsilon\tilde{P}'\right\}\tag{11}$$

The DORO risk is motivated by the following intuition: we would like the algorithm to avoid the hardest instances that are likely to be outliers, and the optimal P' of (11) consists of the easiest (1 − ϵ)-portion of the training set given the current model parameters θ. The ϵ in DORO is a hyperparameter selected by the user, since the real noise level of the dataset is unknown. Let the real noise level of P_train be ϵ_0. For any ϵ ≥ ϵ_0, there exist P̃_0 and P̃ such that P_train = (1 − ϵ_0)P + ϵ_0·P̃_0 = (1 − ϵ)P + ϵ·P̃, so we only need to make sure that ϵ is not less than the real noise level.

The following proposition provides the formula for computing the DORO risk for the Cressie-Read family (see the proof in Appendix A.3.1):

**Proposition 3.** Let ℓ be a continuous non-negative loss function, and suppose P_train is a continuous distribution. Then the formula for computing the DORO risk with D_β is

$$\mathcal{R}_{D_\beta,\rho,\epsilon}(\theta;P_{\text{train}})=\inf_{\eta\in\mathbb{R}}\left\{c_\beta(\rho)\,\mathbb{E}_{Z\sim P_{\text{train}}}\!\left[(\ell(\theta;Z)-\eta)_+^{\beta_*}\,\middle|\,\mathbb{P}_{Z'\sim P_{\text{train}}}\big(\ell(\theta;Z')>\ell(\theta;Z)\big)\ge\epsilon\right]^{1/\beta_*}+\eta\right\}\tag{12}$$

**Remark**  In Proposition 3, we assume the continuity of P_train to keep the formula simple. For an arbitrary distribution P_train we can obtain a similar formula, but the formula is much more complex than (12). The general formula can be found in Appendix A.3.2.

With this formula, we develop Algorithm 1. In the algorithm, we first order the batch samples according to their training losses, then find the optimal η using some numerical method (we use Brent's method (Brent, 1971) in our implementation), and finally update θ with some gradient method. Note that generally it is difficult to find the minimizer of the DORO risk for neural networks; our algorithm is inspired by the ITLM algorithm (Shen & Sanghavi, 2019), in which the optimization was proved to converge to the ground truth for a few simple problems. In particular, using the quantities listed in Table 1, we can implement CVaR-DORO and χ²-DORO. In the sections that follow, we will focus on the performances of CVaR-DORO and χ²-DORO in particular. We denote the CVaR-DORO risk by CVaR_{α,ϵ}(θ; P_train), and the χ²-DORO risk by R_{D_χ²,ρ,ϵ}(θ; P_train).

## 5. Theoretical Analysis

Having the DORO algorithms implemented, in this section we prove that DORO can effectively handle subpopulation shift in the presence of outliers. The proofs of the results in this section can be found in Appendix A.4. We summarize our theoretical results as follows:

1. The minimizer of DORO over the contaminated distribution P_train achieves a DRO risk close to the minimum over the clean distribution P (Theorem 5). We complement our analysis with information-theoretic lower bounds (Theorem 6) implying that the optimality gaps given by Theorem 5 are optimal.
2. The worst-case risk R_max over P is upper bounded by the DORO risk over P_train times a constant factor (Theorem 7). This result parallels Corollary 2 in the uncontaminated setting and guarantees that minimizing the DORO risk over P_train effectively minimizes R_max over P.

Our results are based on the following lemma, which lower bounds the DORO risk over P_train by the infimum of the original DRO risk in a TV-ball centered at P:

**Lemma 4.** Let TV(P, Q) = ½ ∫_{X×Y} |P(z) − Q(z)| dz be the total variation distance, and let P_train be defined by (10). Then the DORO risk can be lower bounded by

$$\mathcal{R}_{D,\rho,\epsilon}(\theta;P_{\text{train}})\ge\inf_{P'}\left\{\mathcal{R}_{D,\rho}(\theta;P'):\mathrm{TV}(P,P')\le\frac{\epsilon}{1-\epsilon}\right\}\tag{13}$$

The main results we are about to present only require very mild assumptions.
For the first result, we assume that ℓ has a bounded (2k)-th moment on P, a standard assumption in the robust statistics literature:

**Theorem 5.** Let P_train be defined by (10), and denote the minimizer of the DORO risk by θ̂. If ℓ is non-negative and ℓ(θ̂; Z) has a bounded (2k)-th moment, E_{Z∼P}[ℓ(θ̂; Z)^{2k}] = σ_{2k}^{2k} < +∞, then we have

$$\mathrm{CVaR}_\alpha(\hat\theta;P)-\inf_\theta\mathrm{CVaR}_\alpha(\theta;P)\le O_{\alpha,k}(1)\,\sigma_{2k}\,\epsilon^{1-\frac{1}{2k}}$$

and if k > 1, then we have

$$\mathcal{R}_{D_{\chi^2},\rho}(\hat\theta;P)-\inf_\theta\mathcal{R}_{D_{\chi^2},\rho}(\theta;P)\le O_{\rho,k}(1)\,\sigma_{2k}\,\epsilon^{\frac{1}{2}-\frac{1}{2k}}$$

Furthermore, the above optimality gaps are optimal:

**Theorem 6.** There exists a pair (P, P_train) with P_train = (1 − ϵ)P + ϵP̃, where P has uniformly bounded (2k)-th moment, ∀θ ∈ Θ, E_P[ℓ(θ; Z)^{2k}] ≤ σ_{2k}^{2k}, such that for any learner θ̂ with access only to P_train, the best achievable error in DRO over P is lower bounded by

$$\mathrm{CVaR}_\alpha(\hat\theta;P)-\inf_{\theta\in\Theta}\mathrm{CVaR}_\alpha(\theta;P)\ge\Omega_{\alpha,k}(1)\,\sigma_{2k}\,\epsilon^{1-\frac{1}{2k}}$$

$$\mathcal{R}_{D_{\chi^2},\rho}(\hat\theta;P)-\inf_{\theta\in\Theta}\mathcal{R}_{D_{\chi^2},\rho}(\theta;P)\ge\Omega_{\rho,k}(1)\,\sigma_{2k}\,\epsilon^{\frac{1}{2}-\frac{1}{2k}}$$

We make a few remarks on these theoretical results. The O(ϵ^{1−1/(2k)}) and O(ϵ^{1/2−1/(2k)}) rates resemble existing works on robust mean/moment estimation, see e.g. (Kothari et al., 2018; Prasad et al., 2020). The robust mean estimation problem can be seen as a special case of CVaR with α = 1, where the CVaR of any θ is just the mean of ℓ(θ, Z). On the other hand, the connection between CVaR and robust moment estimation can be built with the dual characterization (5): for any fixed dual variable η, evaluating the dual is nothing but a robust (β_*-th) moment estimation of the random variable (ℓ(θ, Z) − η)_+. However, the problem we are trying to tackle in the above theorems is more challenging, in the sense that (1) the DRO risk involves taking the infimum over all η ∈ ℝ, but the moments of (ℓ(θ, Z) − η)_+ are not uniformly bounded for all possible η's; and (2) the optimal dual variable η can be very different even for distributions extremely close in total-variation distance. In Appendix A.4 we discuss how to overcome these difficulties in detail.

Our second result is a robust analogue of Corollary 2: we show that the worst-case risk R_max can be upper bounded by a constant factor times the DORO risk CVaR_{α,ϵ}, under the very mild assumption that ℓ has a uniformly bounded second moment on P and R_max is not exceedingly small:

**Theorem 7.** Let P_train be defined by (10). Let α = min_{k=1,...,K} P(D_k), and ρ = ½(1/α − 1)². If ℓ(θ; Z) is a non-negative loss function with a uniformly bounded second moment, E_{Z∼P}[ℓ(θ; Z)²] ≤ σ² for all θ, then we have

$$\mathcal{R}_{\max}(\theta;P)\le\max\left\{3\,\mathrm{CVaR}_{\alpha,\epsilon}(\theta;P_{\text{train}}),\;3\alpha^{-1}\sigma\sqrt{\tfrac{\epsilon}{1-\epsilon}}\right\}\le\max\left\{3\,\mathcal{R}_{D_{\chi^2},\rho,\epsilon}(\theta;P_{\text{train}}),\;3\alpha^{-1}\sigma\sqrt{\tfrac{\epsilon}{1-\epsilon}}\right\}$$

Note that a similar result can be derived under the bounded (2k)-th moment condition with different constants.

## 6. Experiments

In this section, we conduct large-scale experiments on modern datasets. Our results show that DORO improves the performance and stability of DRO. We also analyze the effect of hyperparameters on DRO and DORO.

**Datasets**  Our goal is to apply DRO to real tasks with subpopulation shift on modern datasets. While much previous work used small tabular datasets such as COMPAS, these datasets are insufficient for our purpose. Therefore, apart from COMPAS, we use two large datasets: CelebA (Liu et al., 2015) and CivilComments-Wilds (Borkan et al., 2019; Koh et al., 2020). CelebA is a widely used vision dataset with 162,770 training instances, and CivilComments-Wilds is a recently released language dataset with 269,038 training instances. Both datasets are captured in the wild and labeled by potentially biased humans, so they can reveal many challenges we need to face in practice.
We summarize the datasets we use as follows: (i) COMPAS: recidivism prediction, where the target is whether the person will reoffend within two years; (ii) CelebA: human face recognition, where the target is whether the person has blond hair; (iii) CivilComments-Wilds: toxicity identification, where the target is whether the user comment contains toxic content. All targets are binary. For COMPAS, we randomly sample 70% of the instances to be the training data (with a fixed random seed), and the rest is the validation/testing data. Both CelebA and CivilComments-Wilds have official train-validation-test splits, so we use them directly.

**Domain Definition**  On COMPAS we define 4 domains (subpopulations), and on CelebA and CivilComments-Wilds we define 16 domains each. Our domain definitions cover several types of subpopulation shift, such as different demographic groups, class imbalance, labeling biases, confounding variables, etc. See Appendix B.1 for details.

**Training**  We use a two-layer feed-forward neural network activated by ReLU on COMPAS, a ResNet18 (He et al., 2016) on CelebA, and a BERT-base-uncased model (Devlin et al., 2019) on CivilComments-Wilds. On each dataset, we run ERM, CVaR, χ²-DRO, CVaR-DORO and χ²-DORO. Each algorithm is run for 300 epochs on COMPAS, 30 epochs on CelebA and 5 epochs on CivilComments-Wilds. For each method we collect the model achieved at the end of every epoch, and select the best model through validation. (On CivilComments-Wilds we collect 5 models each epoch, one for every 20% of the training instances.)

**Model Selection**  To select the best model, we assume that the domain membership of each instance is available in the validation set, and select the model with the highest worst-case validation accuracy. This is an oracle strategy since it requires a domain-aware validation set. Over the course of our experiments, we have realized that model selection with no group labels during validation is a very hard problem. On the other hand, model selection has a huge impact on the performance of the final model. We include some preliminary discussion of this issue in Appendix B.2. Since model selection is not the main focus of this paper, we pose it as an open question.

[Figure 3. Test accuracies of CVaR and CVaR-DORO on CelebA (α = 0.1, ϵ = 0.01); panels (a) average accuracy and (b) worst-case accuracy over 30 training epochs, also showing ERM.]

[Figure 4. Test accuracies of χ²-DRO and χ²-DORO on CelebA (α = 0.3, ϵ = 0.01); panels (a) average accuracy and (b) worst-case accuracy over 30 training epochs.]

### 6.2. Results

The 95% confidence intervals of the mean test accuracies on each dataset are reported in Table 2. For every DRO and DORO method, we do a grid search to pick the best α and ϵ that achieve the best worst-case accuracy (see the optimal hyperparameters in Appendix B.3). Each experiment is repeated 10 times on COMPAS and CelebA, and 5 times on CivilComments-Wilds, with different random seeds. Table 2 clearly shows that on all datasets, DORO consistently improves the average and worst-case accuracies of DRO.

Next, we analyze the stability of the algorithms on the CelebA dataset. We use the α that achieves the optimal DRO performance for each of CVaR and χ²-DRO, and compare them to DORO with the same value of α and ϵ = 0.01. χ²-DRO achieves its optimal performance with a bigger α than CVaR because it is less stable.
To quantitatively compare the stability, we compute the standard deviations of the test accuracies across epochs and report the results in Table 3. To further visualize the training dynamics, we run all algorithms with one fixed random seed, and plot the test accuracies during training in Figures 3 and 4. Table 3 shows that the standard deviation of the test accuracy of DORO is smaller, and in Figures 3a and 4a the DORO curves are flatter than the DRO curves, which implies that DORO improves the stability of DRO. Although it is hard to tell from the figures whether DORO has a more stable worst-case accuracy, our quantitative results in Table 3 confirm that DORO has more stable worst-case test accuracies.

Table 2. The average and worst-case test accuracies (%) of the best models achieved by different methods.

| Dataset | Method | Average Accuracy | Worst-case Accuracy |
|---|---|---|---|
| COMPAS | ERM | 69.31 ± 0.19 | 68.83 ± 0.18 |
| COMPAS | CVaR | 68.52 ± 0.31 | 68.22 ± 0.30 |
| COMPAS | CVaR-DORO | 69.38 ± 0.10 | 69.11 ± 0.05 |
| COMPAS | χ²-DRO | 67.93 ± 0.40 | 67.32 ± 0.60 |
| COMPAS | χ²-DORO | 69.62 ± 0.16 | 69.22 ± 0.11 |
| CelebA | ERM | 95.01 ± 0.38 | 53.94 ± 2.02 |
| CelebA | CVaR | 82.83 ± 1.33 | 66.44 ± 2.34 |
| CelebA | CVaR-DORO | 92.91 ± 0.48 | 72.17 ± 3.14 |
| CelebA | χ²-DRO | 83.85 ± 1.42 | 67.76 ± 3.22 |
| CelebA | χ²-DORO | 82.18 ± 1.17 | 68.33 ± 1.79 |
| CivilComments-Wilds | ERM | 92.04 ± 0.24 | 64.62 ± 2.48 |
| CivilComments-Wilds | CVaR | 89.11 ± 0.76 | 63.90 ± 4.42 |
| CivilComments-Wilds | CVaR-DORO | 90.45 ± 0.70 | 68.00 ± 2.10 |
| CivilComments-Wilds | χ²-DRO | 90.08 ± 0.92 | 65.55 ± 1.51 |
| CivilComments-Wilds | χ²-DORO | 90.11 ± 1.09 | 67.19 ± 2.51 |

Table 3. Standard deviations (%) of average/worst-case test accuracies during training on CelebA (α = 0.1 for CVaR/CVaR-DORO; α = 0.3 for χ²-DRO/χ²-DORO; ϵ = 0.01).

| Method | Average | Worst-case |
|---|---|---|
| ERM | 0.73 ± 0.06 | 8.59 ± 0.90 |
| CVaR | 11.53 ± 1.72 | 21.47 ± 0.71 |
| CVaR-DORO | 4.03 ± 1.57 | 16.84 ± 0.91 |
| χ²-DRO | 8.88 ± 2.98 | 19.06 ± 1.18 |
| χ²-DORO | 1.60 ± 0.34 | 13.01 ± 1.40 |

### 6.3. Effect of Hyperparameters

In this part, we study how α and ϵ affect the test accuracies of DORO with two experiments on CelebA, providing insight into how to select the optimal hyperparameters.

In the first experiment, we fix α = 0.2 and run the two DORO algorithms with different values of ϵ. The results are plotted in Figure 5. We can see that for both methods, as ϵ increases, the average accuracy slightly decreases, while the worst-case accuracy first rises and then drops. Both average and worst-case accuracies will drop if ϵ is too big. Moreover, both methods achieve the optimal worst-case accuracy at ϵ = 0.005. We conjecture that the real noise level of the CelebA dataset is around 0.005, and that the optimal ϵ should be close to the real noise level.

[Figure 5. Effect of ϵ (from 0 to 0.025) on the average and worst-case test accuracies of (a) CVaR-DORO and (b) χ²-DORO on CelebA (α = 0.2). DORO with ϵ = 0 is equivalent to DRO.]

In the second experiment, we run DRO and DORO (ϵ = 0.01) with different values of α. The results are plotted in Figure 6. First, we observe that for all methods, the optimal α is much bigger than the real α of the dataset. The real α of the CelebA dataset is around 0.008 (see Appendix B.1, Table 1), much smaller than the values achieving the highest worst-case accuracies in the figures. Second, in all four figures the overall trend of the average accuracy is that it grows with α.

[Figure 6. Effect of α (from 0.05 to 0.3) on the average and worst-case test accuracies of DRO and DORO on CelebA (ϵ = 0.01); panels (c) and (d) show CVaR-DORO and χ²-DORO.]
Third, both CVaR-DORO and χ²-DORO achieve the optimal worst-case accuracy at α = 0.25, but the worst-case accuracy drops as α goes to 0.3.

## 7. Discussion

In this work we pinpointed one direct cause of the performance drop and instability of DRO: the sensitivity of DRO to outliers in the dataset. We proposed DORO as an outlier robust refinement of DRO, and implemented DORO for the Cressie-Read family of Rényi divergence. We made a positive response to the open question raised by (Hashimoto et al., 2018) by demonstrating the effectiveness of DORO both theoretically and empirically.

One alternative approach to making DRO robust to outliers is removing the outliers from the dataset via preprocessing. In Section 3 we used a simple version of iterative trimming (Shen & Sanghavi, 2019) to remove outliers from the training set. Compared to iterative trimming, DORO does not require retraining the model and does not throw away any data. In addition, preprocessing methods such as iterative trimming cannot cope with online data (where new instances are received sequentially), but DORO is still feasible.

The high-level idea of DORO can be extended to other algorithms that deal with subpopulation shift, such as static reweighting (Shimodaira, 2000), adversarial reweighting (Hu et al., 2018; Lahoti et al., 2020) and group DRO (Sagawa et al., 2020a). The implementations might be different, but the basic idea is the same: to prevent the algorithm from overfitting to potential outliers. We leave the design of such algorithms to future work.

There is one large open question from this work. In our experiments, we found that model selection without domain information in the validation set is very hard. In Appendix B.2 we study several strategies, such as selecting the model with the lowest CVaR risk or the lowest CVaR-DORO risk, but none of them is satisfactory. A recent paper (Michel et al., 2021) proposed two selection methods, Minmax and Greedy Minmax, but their performance is still much lower than the oracle's (see their Table 2a). (Gulrajani & Lopez-Paz, 2021) also pointed out the difficulty of model selection in domain-oblivious distributional shift tasks. Thus, we believe this question to be fairly non-trivial.

## Acknowledgements

We acknowledge the support of DARPA via HR00112020006, and NSF via IIS-1909816, OAC-1934584.

## References

Barocas, S. and Selbst, A. D. Big data's disparate impact. Calif. L. Rev., 104:671, 2016.

Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, pp. 81–88, 2007.

Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference, pp. 491–500, 2019.

Brent, R. P. An algorithm with guaranteed convergence for finding a zero of a function. The Computer Journal, 14(4):422–425, 1971.

Cressie, N. and Read, T. R. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B (Methodological), 46(3):440–464, 1984.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423.
Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Being robust (in high dimensions) can be practical. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 999–1008, International Convention Centre, Sydney, Australia, 06–11 Aug 2017.

Diakonikolas, I., Kamath, G., Kane, D., Li, J., Moitra, A., and Stewart, A. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.

Duchi, J. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):463–484, 2011.

Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. In International Conference on Learning Representations, 2021.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29, pp. 3315–3323. Curran Associates, Inc., 2016.

Hashimoto, T., Srivastava, M., Namkoong, H., and Liang, P. Fairness without demographics in repeated loss minimization. In Dy, J. and Krause, A. (eds.), International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1929–1938, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hu, W., Niu, G., Sato, I., and Sugiyama, M. Does distributionally robust supervised learning give robust classifiers? In International Conference on Machine Learning, pp. 2029–2037. PMLR, 2018.

Huang, J., Gretton, A., Borgwardt, K., Schölkopf, B., and Smola, A. Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems, 19:601–608, 2006.

Huber, P. J. Robust estimation of a location parameter. In Breakthroughs in Statistics, pp. 492–518. Springer, 1992.

Japkowicz, N. The class imbalance problem: Significance and strategies. In Proc. of the Int'l Conf. on Artificial Intelligence, volume 56. Citeseer, 2000.

Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Beery, S., et al. WILDS: A benchmark of in-the-wild distribution shifts. arXiv preprint arXiv:2012.07421, 2020.

Kothari, P. K., Steinhardt, J., and Steurer, D. Robust moment estimation and improved clustering via sum of squares. In Diakonikolas, I., Kempe, D., and Henzinger, M. (eds.), Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25–29, 2018, pp. 1035–1046. ACM, 2018.

Kusner, M. J., Loftus, J., Russell, C., and Silva, R. Counterfactual fairness. In Advances in Neural Information Processing Systems, pp. 4066–4076, 2017.
Lahoti, P., Beutel, A., Chen, J., Lee, K., Prost, F., Thain, N., Wang, X., and Chi, E. Fairness without demographics through adversarially reweighted learning. Advances in Neural Information Processing Systems, 33, 2020.

Lai, K. A., Rao, A. B., and Vempala, S. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 665–674. IEEE, 2016.

Larson, J., Mattu, S., Kirchner, L., and Angwin, J. How we analyzed the COMPAS recidivism algorithm. ProPublica (5 2016), 9(1), 2016.

Lee, J., Park, S., and Shin, J. Learning bounds for risk-sensitive learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 13867–13879. Curran Associates, Inc., 2020.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.

Michel, P., Hashimoto, T., and Neubig, G. Modeling the second player in distributionally robust optimization. In International Conference on Learning Representations, 2021.

Namkoong, H. and Duchi, J. C. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pp. 2208–2216, 2016.

Oren, Y., Sagawa, S., Hashimoto, T., and Liang, P. Distributionally robust language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4227–4237, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1432.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.

Patel, V. M., Gopalan, R., Li, R., and Chellappa, R. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, 2015.

Prasad, A., Suggala, A. S., Balakrishnan, S., and Ravikumar, P. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

Prasad, A., Balakrishnan, S., and Ravikumar, P. A robust univariate mean estimator is all you need. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 4034–4044. PMLR, 26–28 Aug 2020.

Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. The MIT Press, 2009.

Rawls, J. Justice as Fairness: A Restatement. Harvard University Press, 2001.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations, 2020a.

Sagawa, S., Raghunathan, A., Koh, P. W., and Liang, P. An investigation of why overparameterization exacerbates spurious correlations. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 8346–8356. PMLR, 13–18 Jul 2020b.

Shen, Y. and Sanghavi, S. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pp. 5739–5748. PMLR, 2019.
Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pp. 270–279. Springer, 2018.

Tukey, J. W. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, pp. 448–485, 1960.

Wang, M. and Deng, W. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.

Xu, Z., Dan, C., Khim, J., and Ravikumar, P. Class-weighted classification: Trade-offs and robust approaches. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 10544–10554. PMLR, 13–18 Jul 2020.

Zafar, M. B., Valera, I., Gomez Rodriguez, M., and Gummadi, K. P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pp. 1171–1180, 2017.

Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In International Conference on Machine Learning, pp. 325–333, 2013.

Zhu, B., Jiao, J., and Steinhardt, J. Generalized resilience and robust statistics, 2020.