# Improving Out-of-Distribution Robustness via Selective Augmentation

Huaxiu Yao * 1, Yu Wang * 2, Sai Li 3, Linjun Zhang 4, Weixin Liang 1, James Zou 1, Chelsea Finn 1

Machine learning algorithms typically assume that training and test examples are drawn from the same distribution. However, distribution shift is a common problem in real-world applications and can cause models to perform dramatically worse at test time. In this paper, we specifically consider the problems of subpopulation shifts (e.g., imbalanced data) and domain shifts. While prior works often seek to explicitly regularize internal representations or predictors of the model to be domain invariant, we instead aim to learn invariant predictors without restricting the model's internal representations or predictors. This leads to a simple mixup-based technique, LISA, which learns invariant predictors via selective augmentation. LISA selectively interpolates samples either with the same labels but different domains or with the same domain but different labels. Empirically, we study the effectiveness of LISA on nine benchmarks ranging from subpopulation shifts to domain shifts, and we find that LISA consistently outperforms other state-of-the-art methods and leads to more invariant predictors. We further analyze a linear setting and theoretically show how LISA leads to a smaller worst-group error. Code is released at https://github.com/huaxiuyao/LISA.

*Equal contribution. This work was done when Yu Wang was mentored by Huaxiu Yao remotely. 1Stanford University, CA, USA; 2University of California San Diego, CA, USA; 3Renmin University of China, Beijing, China; 4Rutgers University, NJ, USA. Correspondence to: Huaxiu Yao, Sai Li. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

To deploy machine learning algorithms in real-world applications, we must pay attention to distribution shift, i.e., the setting where the test distribution differs from the training distribution, which substantially degrades model performance. In this paper, we refer to this problem as out-of-distribution (OOD) generalization and specifically consider performance gaps caused by two kinds of distribution shifts: subpopulation shifts and domain shifts. In subpopulation shifts, the test domains (or subpopulations) are seen but underrepresented in the training data. When subpopulation shift occurs, models may perform poorly if they falsely rely on spurious correlations between the particular subpopulation and the label. For example, in health risk prediction, a machine learning model trained on the entire population may associate the labels with demographic features (e.g., gender and age), making the model fail on the test set when such an association does not hold in reality. In domain shifts, the test data is from new domains, which requires the trained model to generalize well to test domains without seeing data from those domains at training time. In the health risk example, we may want to train a model on patients from a few sampled hospitals and then deploy the model to a broader set of hospitals (Koh et al., 2021).
To improve model robustness under these two kinds of distribution shifts, prior works have proposed various regularizers to learn representations or predictors that are invariant to different domains while still containing sufficient information to fulfill the task (Li et al., 2018; Sun & Saenko, 2016; Arjovsky et al., 2019; Krueger et al., 2021; Rosenfeld et al., 2021). However, designing regularizers that are widely suitable to datasets from diverse domains is challenging, and unsuitable regularizers may adversely limit the model's expressive power or yield a difficult optimization problem, leading to inconsistent performance across various real-world datasets. For example, on the WILDS datasets, invariant risk minimization (IRM) (Arjovsky et al., 2019) with reweighting, a representative method for learning invariant predictors, outperforms empirical risk minimization (ERM) on CivilComments, but fails to improve robustness on a variety of other datasets like Camelyon17 and RxRx1 (Koh et al., 2021).

Instead of explicitly imposing regularization, we propose to learn invariant predictors through data interpolation, leading to a simple algorithm called LISA (Learning Invariant Predictors with Selective Augmentation). Concretely, inspired by mixup (Zhang et al., 2018), LISA linearly interpolates the features for a pair of samples and applies the same interpolation strategy on the corresponding labels. Critically, the pairs are selectively chosen according to two selective augmentation strategies: intra-label LISA (LISA-L) and intra-domain LISA (LISA-D), which are described below and illustrated on the Colored MNIST dataset in Figure 1. Intra-label LISA (Figure 1(b)) interpolates samples with the same label but from different domains, aiming to eliminate domain-related spurious correlations. Intra-domain LISA (Figure 1(c)) interpolates samples with the same domain but different labels, such that the model should learn to ignore the domain information and generate different predicted values as the interpolation ratio changes. In this way, LISA encourages the model to learn domain-invariant predictors without any explicit constraints or regularizers.

Figure 1. Illustration of the variants of LISA (intra-label LISA and intra-domain LISA) on the Colored MNIST dataset. λ represents the interpolation ratio, which is sampled from a Beta distribution. (a) Colored MNIST (CMNIST): we classify MNIST digits into two classes, where original digits (0,1,2,3,4) and (5,6,7,8,9) are labeled as class 0 and class 1, respectively. Digit color is used as domain information, which is spuriously correlated with the labels in the training data. (b) Intra-label LISA (LISA-L) cancels out spurious correlations by interpolating samples with the same label but different domains. (c) Intra-domain LISA (LISA-D) interpolates samples with the same domain but different labels, encouraging the model to learn specific features within a domain.
The primary contributions of this paper are as follows: (1) We propose a simple yet widely-applicable method for learning domain-invariant predictors that is shown to be robust to subpopulation shifts and domain shifts. (2) We conduct broad experiments to evaluate LISA on nine benchmark datasets from diverse domains. In these experiments, we make the following key observations. First, we observe that LISA consistently outperforms seven prior methods that address subpopulation and domain shifts. Second, we find that LISA produces predictors that are consistently more domain invariant than prior approaches. Third, we identify that the performance gains of LISA come from canceling out domain-specific information or spurious correlations and learning invariant predictors, rather than from simply involving more data via interpolation. Finally, when the degree of distribution shift increases, LISA achieves more significant performance gains. (3) We provide a theoretical analysis of the phenomena distilled from the empirical studies, where we provably demonstrate that LISA can mitigate spurious correlations and therefore lead to a smaller worst-domain error compared with ERM and vanilla mixup. We also note that, to the best of our knowledge, our work provides the first theoretical framework for studying how mixup (with or without the selective augmentation strategies) affects misclassification error.

2. Preliminaries

In this paper, we consider the setting where one predicts the label $y \in \mathcal{Y}$ based on the input feature $x \in \mathcal{X}$. Given a parameter space $\Theta$ and a loss function $\ell$, we need to train a model $f_\theta$ with $\theta \in \Theta$ under the training distribution $P^{tr}$. In empirical risk minimization (ERM), the empirical distribution over the training data is $\hat{P}^{tr}$; ERM optimizes the following objective:

$$\theta^* := \arg\min_{\theta \in \Theta} \mathbb{E}_{(x,y) \sim \hat{P}^{tr}}\,[\ell(f_\theta(x), y)]. \tag{1}$$

In a traditional machine learning setting, a test set, sampled from a test distribution $P^{ts}$, is used to evaluate the generalization of the trained model $\theta^*$, where the test distribution is assumed to be the same as the training distribution, i.e., $P^{tr} = P^{ts}$. In this paper, we are interested in the setting where distribution shift occurs, i.e., $P^{tr} \neq P^{ts}$. Specifically, following Muandet et al. (2013); Albuquerque et al. (2019); Koh et al. (2021), we regard the overall data distribution as containing $\mathcal{D} = \{1, \ldots, D\}$ domains, where each domain $d \in \mathcal{D}$ is associated with a data distribution $P_d$ over a set $\{(x_i, y_i, d)\}_{i=1}^{N_d}$ and $N_d$ is the number of samples in domain $d$. Then, we formulate the training distribution as a mixture of the $D$ domains, i.e., $P^{tr} = \sum_{d \in \mathcal{D}} r^{tr}_d P_d$, where $\{r^{tr}_d\}$ denotes the mixture probabilities in the training set. The training domains are defined as $\mathcal{D}^{tr} = \{d \in \mathcal{D} \mid r^{tr}_d > 0\}$. Similarly, the test distribution can be represented as $P^{ts} = \sum_{d \in \mathcal{D}} r^{ts}_d P_d$, where $\{r^{ts}_d\}$ are the mixture probabilities in the test set, and the test domains are defined as $\mathcal{D}^{ts} = \{d \in \mathcal{D} \mid r^{ts}_d > 0\}$.

In subpopulation shifts, the test set has domains that have been seen in the training set, but with a different proportion of subpopulations, i.e., $\mathcal{D}^{ts} \subseteq \mathcal{D}^{tr}$ but $\{r^{ts}_d\} \neq \{r^{tr}_d\}$. Under this setting, following Sagawa et al. (2020a), we consider group-based spurious correlations, where each group $g \in \mathcal{G}$ is defined to be associated with a domain $d$ and a label $y$, i.e., $g = (d, y)$. We assume that the domain is spuriously correlated with the label. For example, in the CMNIST dataset in Figure 1, the digit color $d$ (green or red) is spuriously correlated with the label $y$ ([1, 0] or [0, 1]). Based on the group definition, we evaluate the model via the worst test group error, i.e., $\max_{g \in \mathcal{G}} \mathbb{E}_{(x,y) \sim g}[\ell_{0\text{-}1}(f_\theta(x), y)]$, where $\ell_{0\text{-}1}$ represents the 0-1 loss.
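To make this evaluation concrete, the following is a minimal sketch (our illustration, not code from the paper) of how worst-group accuracy over groups $g = (d, y)$ can be computed:

```python
# Minimal sketch of worst-group evaluation over groups g = (d, y); the helper
# name and data layout are illustrative assumptions, not the paper's code.
from collections import defaultdict

def worst_group_accuracy(preds, labels, domains):
    correct, total = defaultdict(int), defaultdict(int)
    for p, y, d in zip(preds, labels, domains):
        g = (d, y)                      # group = (domain, label)
        total[g] += 1
        correct[g] += int(p == y)
    return min(correct[g] / total[g] for g in total)

# Example: group ('R', 0) has one of two predictions correct -> 0.5 is reported.
print(worst_group_accuracy([0, 1, 1, 0], [0, 1, 0, 0], ['R', 'G', 'R', 'G']))
```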
In domain shifts, we investigate the problem where the test domains are disjoint from the training domains, i.e., $\mathcal{D}^{tr} \cap \mathcal{D}^{ts} = \emptyset$. In general, we assume the test domains share some common properties with the training domains. For example, in Camelyon17 (Koh et al., 2021), we train the model on some hospitals and test it on a new hospital. We evaluate the worst-domain and/or average performance of the classifier across all test domains.

3. Learning Invariant Predictors with Selective Augmentation

This section presents LISA, a simple way to improve robustness to subpopulation shifts and domain shifts. The key idea behind LISA is to encourage the model to learn invariant predictors by selective data interpolation, which can also alleviate the effects of domain-related spurious correlations. Before detailing how to select interpolated samples, we first provide a general formulation for data interpolation. In LISA, we perform linear interpolation between training samples. Specifically, given samples $(x_i, y_i, d_i)$ and $(x_j, y_j, d_j)$ drawn from domains $d_i$ and $d_j$, we apply mixup (Zhang et al., 2018), a simple data interpolation strategy, separately on the input features and the corresponding labels:

$$x_{mix} = \lambda x_i + (1-\lambda)\,x_j, \qquad y_{mix} = \lambda y_i + (1-\lambda)\,y_j, \tag{2}$$

where the interpolation ratio $\lambda \in [0, 1]$ is sampled from a Beta distribution $\mathrm{Beta}(\alpha, \beta)$, and $y_i$ and $y_j$ are one-hot vectors for classification problems. Notice that the mixup approach in (2) can be replaced by CutMix (Yun et al., 2019), which shows stronger empirical performance in vision-based applications. In text-based applications, we can use Manifold Mixup (Verma et al., 2019), interpolating the representations of a pre-trained model, e.g., the output of BERT (Devlin et al., 2019). After obtaining the interpolated features and labels, we replace the original features and labels in ERM with the interpolated ones. Then, the optimization process in (1) is reformulated as:

$$\theta^* := \arg\min_{\theta \in \Theta} \mathbb{E}_{\{(x_i,y_i,d_i),(x_j,y_j,d_j)\} \sim \hat{P}^{tr}}\,[\ell(f_\theta(x_{mix}), y_{mix})]. \tag{3}$$
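As a minimal sketch of the interpolation in Eqs. (2)-(3), the snippet below mixes features and one-hot labels with the same ratio; the Beta(2, 2) shape parameters follow the choice reported in Appendix A.1.2, while the function name and layout are our illustrative assumptions:

```python
# Sketch of the mixup interpolation in Eq. (2); training then minimizes the
# loss on (x_mix, y_mix) instead of the raw pairs, as in Eq. (3).
import numpy as np

def interpolate(x_i, y_i, x_j, y_j, alpha=2.0, beta=2.0):
    lam = np.random.beta(alpha, beta)        # interpolation ratio in [0, 1]
    x_mix = lam * x_i + (1 - lam) * x_j      # feature interpolation
    y_mix = lam * y_i + (1 - lam) * y_j      # same ratio on the one-hot labels
    return x_mix, y_mix

# e.g. loss = soft_target_cross_entropy(model(x_mix), y_mix)  # hypothetical loss
```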
Without additional selective augmentation strategies, vanilla mixup regularizes the model and reduces overfitting (Zhang et al., 2021b), allowing it to attain good in-distribution generalization. However, vanilla mixup may not be able to cancel out spurious correlations, causing the model to still fail at attaining good OOD generalization (see empirical comparisons in Section 4.3 and theoretical discussion in Section 5). In LISA, we instead adopt a new strategy where mixup is only applied across specific domains or groups, which leans towards learning invariant predictors and thus better OOD performance. Specifically, the two kinds of selective augmentation strategies are as follows.

Intra-label LISA (LISA-L): interpolating samples with the same label. Intra-label LISA interpolates samples with the same label but different domains (i.e., $d_i \neq d_j$, $y_i = y_j$). As shown in Figure 1(b), this produces datapoints that have both domains partially present, effectively eliminating spurious correlations between domain and label in cases where the pair of domains correlate differently with the label. As a result, intra-label LISA should learn domain-invariant predictors for each class and thus achieve better OOD robustness.

Intra-domain LISA (LISA-D): interpolating samples with the same domain. Supposing domain information is highly spuriously correlated with the label information, intra-domain LISA (Figure 1(c)) applies the interpolation strategy to samples with the same domain but different labels, i.e., $d_i = d_j$, $y_i \neq y_j$. Intuitively, even within the same domain, the model is supposed to generate different predicted labels since the interpolation ratio $\lambda$ is randomly sampled, corresponding to different labels $y_{mix}$. This causes the model to make predictions that are less dependent on the domain, again improving OOD robustness.

In this paper, we randomly perform intra-label or intra-domain LISA during the training process with probability $p_{sel}$ and $1 - p_{sel}$, where $p_{sel}$ is treated as a hyperparameter and determined via cross-validation. Intuitively, the choice of $p_{sel}$ depends on the number of domains and the strength of the spurious correlations. Empirically, using intra-label LISA brings more benefits when there are more domains or when the spurious correlations are not very strong. Intra-domain LISA benefits performance when domain information is highly spuriously correlated with the label. The pseudocode of LISA is in Algorithm 1, and a code sketch of one training step follows below.

Algorithm 1 Training Procedure of LISA
Require: Training data D, step size η, learning rate γ, shape parameters α, β of the Beta distribution
1: while not converged do
2:   Sample λ ∼ Beta(α, β)
3:   Sample minibatch B1 ⊆ D
4:   Initialize B2 ← {}
5:   Select strategy s ∼ Bernoulli(p_sel)
6:   if s is True then
7:     for (x_i, y_i, d_i) ∈ B1 do
8:       Randomly sample (x_j, y_j, d_j) ∈ D which satisfies (y_i = y_j) and (d_i ≠ d_j)
9:       Put (x_j, y_j, d_j) into B2
10:  else
11:    for (x_i, y_i, d_i) ∈ B1 do
12:      Randomly sample (x_j, y_j, d_j) ∈ D which satisfies (y_i ≠ y_j) and (d_i = d_j)
13:      Put (x_j, y_j, d_j) into B2
14:  Update θ with data λB1 + (1 − λ)B2 with learning rate γ
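Below is a sketch of the pair-selection step of Algorithm 1, assuming the dataset fits in memory as a list of (x, y, d) triples; the function names and the linear scan over candidates are illustrative assumptions, not the released implementation:

```python
# Sketch of Algorithm 1 (lines 5-13): select a partner for each sample in B1
# according to LISA's two strategies. Names and data layout are illustrative.
import random

def lisa_pair_batch(dataset, batch1, p_sel):
    """Return B2: one partner per sample in B1."""
    intra_label = random.random() < p_sel          # s ~ Bernoulli(p_sel)
    batch2 = []
    for x_i, y_i, d_i in batch1:
        if intra_label:                            # LISA-L: same y, different d
            candidates = [s for s in dataset if s[1] == y_i and s[2] != d_i]
        else:                                      # LISA-D: same d, different y
            candidates = [s for s in dataset if s[1] != y_i and s[2] == d_i]
        batch2.append(random.choice(candidates))
    return batch2

# One update (line 14): mix B1 and B2 with lam ~ Beta(alpha, beta), e.g. via the
# interpolate() sketch above, and take a gradient step on the mixed batch.
```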
4. Experiments

In this section, we conduct comprehensive experiments to evaluate the effectiveness of LISA. Specifically, we aim to answer the following questions:

Q1: Compared to prior methods, can LISA improve robustness to subpopulation shifts and domain shifts (Section 4.1 and Section 4.2)?
Q2: Which aspects of LISA are the most important for improving robustness (Section 4.3)?
Q3: Does LISA successfully produce more invariant predictors (Section 4.4)?
Q4: How does LISA perform with varying degrees of distribution shifts (Section 4.5)?

To answer Q1, we compare to ERM, IRM (Arjovsky et al., 2019), IB-IRM (Ahuja et al., 2021), V-REx (Krueger et al., 2021), CORAL (Li et al., 2018), DANN (Ganin & Lempitsky, 2015), GroupDRO (Sagawa et al., 2020a), DomainMix (Xu et al., 2020), and Fish (Shi et al., 2021). Upweighting (UW) is particularly suitable for subpopulation shifts, so we also use it for comparison. We adopt the same model architectures for all approaches. The strategy selection probability p_sel is selected via cross-validation.

4.1. Evaluating Robustness to Subpopulation Shifts

Evaluation Protocol. In subpopulation shifts, we evaluate the performance on four binary classification datasets: Colored MNIST (CMNIST), Waterbirds (Sagawa et al., 2020a), CelebA (Liu et al., 2015), and CivilComments (Borkan et al., 2019). We detail the data descriptions for subpopulation shifts in Appendix A.1.1 and report the detailed data statistics in Table 1, covering domain information, model architecture, and class information. Following Sagawa et al. (2020a), in subpopulation shifts, we use the worst-group accuracy to evaluate the performance of all approaches.

Table 1. Dataset statistics for subpopulation shifts. All datasets are binary classification tasks and we use the worst-group accuracy as the evaluation metric.

| Datasets | Domains | Model Architecture | Class Information |
| --- | --- | --- | --- |
| CMNIST | 2 digit colors | ResNet-50 | digits (0,1,2,3,4) vs. (5,6,7,8,9) |
| Waterbirds | 2 backgrounds | ResNet-50 | waterbirds vs. landbirds |
| CelebA | 2 hair colors | ResNet-50 | man vs. woman |
| CivilComments | 8 demographic identities | DistilBERT-uncased | toxic vs. non-toxic |

In these datasets, the domain information is highly spuriously correlated with the label information. For example, as suggested in Figure 1, 80% of the images in each class of the CMNIST training set have the same color, i.e., green for label [1, 0] and red for label [0, 1]. In CMNIST, Waterbirds, and CelebA, we find that p_sel = 0.5 works well for choosing selective augmentation strategies, while in CivilComments, we set p_sel to 1.0. This is not surprising because it might be more beneficial to use intra-label LISA more often to eliminate domain effects when there are more domains, i.e., eight domains in CivilComments vs. two domains in the others. The remaining hyperparameter settings and training details are listed in Appendix A.1.2.

Results. In Table 2, we report the overall performance of LISA and other methods. According to Table 2, we observe that the performance of approaches that learn invariant predictors with explicit regularizers (e.g., IRM, IB-IRM, V-REx) is not consistent across datasets. For example, IRM and V-REx outperform UW on CMNIST, but they fail to achieve better performance than UW on Waterbirds. The results corroborate our hypothesis that designing widely effective regularizers is challenging, and that inappropriate regularizers may even hurt performance. LISA instead consistently outperforms the other invariant learning methods (e.g., IRM, IB-IRM, V-REx, CORAL, DomainMix, Fish) on all datasets. LISA further shows the best performance on CMNIST, CelebA, and CivilComments. On Waterbirds, it is slightly worse than GroupDRO, but the performance is comparable. These results demonstrate the effectiveness of LISA in improving robustness to subpopulation shifts.
Table 2. Results of subpopulation shifts. Here, we show the average and worst-group accuracy. We repeat the experiments three times and put full results with standard deviations in Table 10.

| Method | CMNIST Avg. | CMNIST Worst | Waterbirds Avg. | Waterbirds Worst | CelebA Avg. | CelebA Worst | CivilComments Avg. | CivilComments Worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ERM | 27.8% | 0.0% | 97.0% | 63.7% | 94.9% | 47.8% | 92.2% | 56.0% |
| UW | 72.2% | 66.0% | 95.1% | 88.0% | 92.9% | 83.3% | 89.8% | 69.2% |
| IRM | 72.1% | 70.3% | 87.5% | 75.6% | 94.0% | 77.8% | 88.8% | 66.3% |
| IB-IRM | 72.2% | 70.7% | 88.5% | 76.5% | 93.6% | 85.0% | 89.1% | 65.3% |
| V-REx | 71.7% | 70.2% | 88.0% | 73.6% | 92.2% | 86.7% | 90.2% | 64.9% |
| CORAL | 71.8% | 69.5% | 90.3% | 79.8% | 93.8% | 76.9% | 88.7% | 65.6% |
| GroupDRO | 72.3% | 68.6% | 91.8% | 90.6% | 92.1% | 87.2% | 89.9% | 70.0% |
| DomainMix | 51.4% | 48.0% | 76.4% | 53.0% | 93.4% | 65.6% | 90.9% | 63.6% |
| Fish | 46.9% | 35.6% | 85.6% | 64.0% | 93.1% | 61.2% | 89.8% | 71.1% |
| LISA (ours) | 74.0% | 73.3% | 91.8% | 89.2% | 92.4% | 89.3% | 89.2% | 72.6% |

Effects of Intra-label and Intra-domain LISA. For CMNIST, Waterbirds, and CelebA, both intra-label and intra-domain LISA are used (i.e., p_sel = 0.5). We illustrate the separate results in Figure 2 and observe that both variants contribute to the final performance. In addition, intra-domain LISA performs slightly better than intra-label LISA, corroborating our assumption that intra-domain LISA brings more benefit when domain information is highly spuriously correlated with the label (see the discussion of the strength of spurious correlation in Appendix A.3).

Figure 2. Effects of intra-label and intra-domain LISA (worst-group accuracy) on CMNIST, Waterbirds, and CelebA. The experiments are repeated three times with different seeds.

4.2. Evaluating Robustness to Domain Shifts

Experimental Setup. In domain shifts, we study five datasets. Four of them (Camelyon17, FMoW, RxRx1, and Amazon) are selected from WILDS (Koh et al., 2021), covering natural distribution shifts across diverse domains (e.g., health, language, and vision). Besides the WILDS data, we also apply LISA on the MetaShift datasets (Liang & Zou, 2021), constructed using the real-world images and natural heterogeneity of Visual Genome (Krishna et al., 2016). We summarize these datasets in Table 4, including domain information, evaluation metric, model architecture, and the number of classes. Detailed dataset descriptions and other training details are discussed in Appendix A.2.1 and A.2.2, respectively.

The strategy selection probability p_sel is set to 1.0 for these domain shift datasets, i.e., only intra-label LISA is used. Additionally, we only interpolate samples with the same labels without considering the domain information in Camelyon17, FMoW, and RxRx1, which empirically leads to the best performance. One potential reason is that the spurious correlations between labels and domains are not very strong in datasets with natural domain shifts under the existing domain partitions. Here, to evaluate the strength of spurious correlation, we adopt Cramér's V (Cramér, 2016) (see the detailed definition in Appendix A.3) to measure the association between the domain set D and the label set Y; the results are reported in Table 13 of Appendix A.3. The Cramér's V values of Camelyon17, FMoW, and RxRx1 are significantly smaller than those of the other datasets, indicating relatively weak spurious correlations. Under this setting, enlarging the interpolation scope by directly interpolating samples within the same class, regardless of the existing domain information, may bring more benefits.
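For reference, the sketch below computes a Cramér's V statistic of the kind described above from the domain-label contingency table, using the standard chi-squared formulation; the helper itself is our illustration, and the paper's exact definition is in Appendix A.3:

```python
# Sketch of Cramér's V between domains and labels: values near 1 mean the
# domain almost determines the label (strong spurious correlation).
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(domains, labels):          # expects NumPy arrays
    doms, labs = np.unique(domains), np.unique(labels)
    table = np.zeros((len(doms), len(labs)))
    for i, d in enumerate(doms):
        for j, y in enumerate(labs):
            table[i, j] = np.sum((domains == d) & (labels == y))
    chi2 = chi2_contingency(table)[0]    # chi-squared test statistic
    n = table.sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))
```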
Results. We report the results of domain shifts in Table 3; full results that include validation performance and other metrics are listed in Appendix A.6. Aligning with the observations in subpopulation shifts, the performance of prior regularization-based invariant predictor learning methods (e.g., IRM, IB-IRM, V-REx) is still unstable across different datasets. For example, V-REx outperforms ERM on Camelyon17, while it fails on RxRx1. However, LISA consistently outperforms all these methods on the five datasets regardless of the model architecture and data type (i.e., image or text), indicating its effectiveness in improving robustness to domain shifts with selective augmentation.

Table 3. Main domain shift results. LISA outperforms prior methods on all five datasets. Following the instructions of Koh et al. (2021), we report the performance on Camelyon17 over 10 different seeds; the results on the other datasets are obtained over 3 different seeds.

| Method | Camelyon17 Avg. Acc. | FMoW Worst Acc. | RxRx1 Avg. Acc. | Amazon 10th Per. Acc. | MetaShift Worst Acc. |
| --- | --- | --- | --- | --- | --- |
| ERM | 70.3 ± 6.4% | 32.3 ± 1.25% | 29.9 ± 0.4% | 53.8 ± 0.8% | 52.1 ± 0.4% |
| IRM | 64.2 ± 8.1% | 30.0 ± 1.37% | 8.2 ± 1.1% | 52.4 ± 0.8% | 51.8 ± 0.8% |
| IB-IRM | 68.9 ± 6.1% | 28.4 ± 0.90% | 6.4 ± 0.6% | 53.8 ± 0.7% | 52.3 ± 1.0% |
| V-REx | 71.5 ± 8.3% | 27.2 ± 0.78% | 7.5 ± 0.8% | 53.3 ± 0.0% | 51.6 ± 1.8% |
| CORAL | 59.5 ± 7.7% | 31.7 ± 1.24% | 28.4 ± 0.3% | 52.9 ± 0.8% | 47.6 ± 1.9% |
| GroupDRO | 68.4 ± 7.3% | 30.8 ± 0.81% | 23.0 ± 0.3% | 53.3 ± 0.0% | 51.9 ± 0.7% |
| DomainMix | 69.7 ± 5.5% | 34.2 ± 0.76% | 30.8 ± 0.4% | 53.3 ± 0.0% | 51.3 ± 0.5% |
| Fish | 74.7 ± 7.1% | 34.6 ± 0.18% | 10.1 ± 1.5% | 53.3 ± 0.0% | 49.2 ± 2.1% |
| LISA (ours) | 77.1 ± 6.5% | 35.5 ± 0.65% | 31.9 ± 0.8% | 54.7 ± 0.0% | 54.2 ± 0.7% |

Table 4. Dataset statistics for domain shifts.

| Datasets | Domains | Metric | Base Model | Num. of classes |
| --- | --- | --- | --- | --- |
| Camelyon17 | 5 hospitals | Avg. Acc. | DenseNet-121 | 2 |
| FMoW | 16 years × 5 regions | Worst-group Acc. | DenseNet-121 | 62 |
| RxRx1 | 51 experimental batches | Avg. Acc. | ResNet-50 | 1,139 |
| Amazon | 7,676 reviewers | 10th Percentile Acc. | DistilBERT-uncased | 5 |
| MetaShift | 4 backgrounds | Worst-group Acc. | ResNet-50 | 2 |

4.3. Are the Performance Gains of LISA from Data Augmentation?

In LISA, we apply selective augmentation strategies to samples either with the same label but different domains or with the same domain but different labels. Here, we explore two substitute interpolation strategies.

Vanilla mixup: we do not add any constraints on the sample selection, i.e., mixup is performed on arbitrary pairs of samples.

In-group mixup: this strategy applies data interpolation to samples with the same labels and from the same domains.

Notice that all substitute interpolation strategies use the same variant of mixup as LISA (e.g., mixup/CutMix). Finally, as upweighting (UW) small groups significantly improves performance in subpopulation shifts, we also evaluate UW combined with Vanilla/In-group mixup.

The results of the substitute interpolation strategies on domain shifts and subpopulation shifts are in Table 5 and Table 6, respectively. Furthermore, we also conduct experiments on datasets without spurious correlations in Table 14 of Appendix A.4.

Table 5. Comparison of LISA with substitute mixup strategies in domain shifts.

| Method | Camelyon17 Avg. Acc. | FMoW Worst Acc. | RxRx1 Avg. Acc. | Amazon 10th Per. Acc. | MetaShift Worst Acc. |
| --- | --- | --- | --- | --- | --- |
| ERM | 70.3 ± 6.4% | 32.8 ± 0.45% | 29.9 ± 0.4% | 53.8 ± 0.8% | 52.1 ± 0.4% |
| Vanilla mixup | 71.2 ± 5.3% | 34.2 ± 0.45% | 26.5 ± 0.5% | 53.3 ± 0.0% | 51.3 ± 0.7% |
| In-group mixup | 75.5 ± 6.7% | 32.2 ± 1.18% | 24.4 ± 0.2% | 53.8 ± 0.6% | 52.7 ± 0.5% |
| LISA (ours) | 77.1 ± 6.5% | 35.5 ± 0.65% | 31.9 ± 0.8% | 54.7 ± 0.0% | 54.2 ± 0.7% |

Table 6. Comparison of LISA with substitute mixup strategies in subpopulation shifts. UW represents upweighting. Full results with standard deviations are listed in Table 11.

| Method | CMNIST Avg. | CMNIST Worst | Waterbirds Avg. | Waterbirds Worst | CelebA Avg. | CelebA Worst | CivilComments Avg. | CivilComments Worst |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ERM | 27.8% | 0.0% | 97.0% | 63.7% | 94.9% | 47.8% | 92.2% | 56.0% |
| Vanilla mixup | 32.6% | 3.1% | 81.0% | 56.2% | 95.8% | 46.4% | 90.8% | 67.2% |
| Vanilla mixup + UW | 72.2% | 71.8% | 92.1% | 85.6% | 91.5% | 88.0% | 87.8% | 66.1% |
| In-group mixup | 33.6% | 24.0% | 88.7% | 68.0% | 95.2% | 58.3% | 90.8% | 69.2% |
| In-group mixup + UW | 72.6% | 71.6% | 91.4% | 87.1% | 92.4% | 87.8% | 84.8% | 69.3% |
| LISA (ours) | 74.0% | 73.3% | 91.8% | 89.2% | 92.4% | 89.3% | 89.2% | 72.6% |

From the results, we make the following three key observations. First, compared with Vanilla mixup, the performance of LISA verifies that selective data interpolation indeed improves out-of-distribution robustness by canceling out the spurious correlations and encouraging the learning of invariant predictors, rather than by simple data augmentation. These findings are further strengthened by the results in Table 14 of Appendix A.4, where Vanilla mixup outperforms LISA and ERM without spurious correlations but LISA achieves the best performance with spurious correlations. Second, the superiority of LISA over In-group mixup verifies that only interpolating samples within each group is incapable of eliminating the spurious information; In-group mixup still only plays the role of data augmentation. Third, though incorporating UW significantly improves the performance of Vanilla mixup and In-group mixup in subpopulation shifts, LISA still achieves larger benefits than these enhanced substitute strategies, demonstrating its stronger power in improving OOD robustness.

4.4. Does LISA Lead to More Invariant Predictors?

We further analyze the model invariance learned by LISA. Specifically, for each sample $(x_i, y_i, d)$ in domain $d$, we denote the unscaled output (i.e., logits) of the model as $g_{i,d}$. We use two metrics to measure the invariance (see Appendix A.5.1 for additional metrics and the corresponding results):
Accuracy of domain prediction (IPadp). In the first metric, we use the unscaled output to predict the domain. Concretely, the entire dataset is re-split into training, validation, and test sets, where the logits are used as features and the labels represent the corresponding domain ID. A logistic regression model is trained to predict the domain.

Pairwise divergence of prediction (IPkl). We calculate the KL divergence of the distribution of logits among all domains, where kernel density estimation is used to estimate the probability density function $P(g^y_d)$ of the logits from domain $d$ with label $y$. The pairwise divergence of the predictions is defined as

$$\frac{1}{|\mathcal{Y}||\mathcal{D}|^2} \sum_{y \in \mathcal{Y}} \sum_{d, d' \in \mathcal{D}} \mathrm{KL}\big(P(g^y \mid D = d) \,\|\, P(g^y \mid D = d')\big).$$

Small values of IPadp and IPkl represent strong function-level invariance. In Table 7, we report the results of LISA and other approaches on CMNIST, Waterbirds, Camelyon17, and MetaShift. The results verify that LISA learns predictors with greater domain invariance. Besides having more invariant predictors, we observe that LISA also leads to more invariant representations, as detailed in Appendix A.5.2.

Table 7. Results of the analysis of learned invariant predictors. Accuracy of domain prediction (IPadp, left four columns) and pairwise divergence of prediction among all domains (IPkl, right four columns) are used to measure the invariance. Smaller values denote stronger invariance.

| Method | CMNIST (IPadp) | Waterbirds (IPadp) | Camelyon17 (IPadp) | MetaShift (IPadp) | CMNIST (IPkl) | Waterbirds (IPkl) | Camelyon17 (IPkl) | MetaShift (IPkl) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ERM | 82.85% | 94.99% | 49.43% | 67.98% | 6.286 | 1.888 | 1.536 | 1.205 |
| Vanilla mixup | 92.34% | 94.49% | 52.79% | 69.36% | 4.737 | 2.912 | 0.790 | 1.171 |
| IRM | 69.42% | 95.12% | 47.96% | 67.59% | 7.755 | 1.122 | 0.875 | 1.148 |
| IB-IRM | 74.72% | 94.78% | 48.37% | 67.39% | 1.004 | 3.563 | 0.756 | 1.115 |
| V-REx | 63.58% | 93.32% | 61.38% | 68.38% | 3.190 | 3.791 | 1.281 | 1.094 |
| LISA (ours) | 58.42% | 90.28% | 45.15% | 66.01% | 0.567 | 0.134 | 0.723 | 1.001 |
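As an illustration of the IPadp metric above, the sketch below trains a logistic-regression probe on the logits to predict the domain ID; the split proportions and probe settings are assumptions rather than the paper's exact protocol:

```python
# Sketch of IPadp: probe accuracy for predicting the domain from the logits.
# Lower probe accuracy indicates a more domain-invariant predictor.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def ip_adp(logits, domain_ids, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        logits, domain_ids, test_size=0.2, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)      # accuracy of domain prediction
```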
4.5. Effect of the Degree of Distribution Shifts

We investigate the performance of LISA with respect to the degree of distribution shift. Here, we use MetaShift to evaluate performance, where the distance between training and test domains is measured as the node similarity on a meta-graph (Liang & Zou, 2021). To vary the distance between training and test domains, we change the backgrounds of the training objects (see full experimental details in Appendix A.2.1). The performance under varied distances is illustrated in Table 8, where the four best-performing prior methods (i.e., ERM, IRM, IB-IRM, GroupDRO) are reported for comparison. We observe that LISA consistently outperforms the other methods under all scenarios. Another interesting finding is that LISA achieves more substantial improvements as the distance increases. A potential reason is that the effect of eliminating domain information is more pronounced when there is a larger distance between training and test domains.

Table 8. Effect of the degree of distribution shift on performance on the MetaShift benchmark. Distance represents the distribution distance between training and test domains. Best B/L represents the best baseline.

| Method | Distance 0.44 | Distance 0.71 | Distance 1.12 | Distance 1.43 |
| --- | --- | --- | --- | --- |
| ERM | 80.1% | 68.4% | 52.1% | 33.2% |
| IRM | 79.5% | 67.4% | 51.8% | 32.0% |
| IB-IRM | 79.7% | 66.9% | 52.3% | 33.6% |
| GroupDRO | 77.0% | 68.9% | 51.9% | 34.2% |
| LISA (ours) | 81.3% | 69.7% | 54.2% | 37.5% |
| LISA vs. Best B/L | +1.5% | +1.2% | +3.6% | +9.6% |

5. Theoretical Analysis

In this section, we provide some theoretical understanding that explains several of the empirical phenomena from the previous experiments, and we theoretically compare the worst-group errors of three methods: the proposed LISA, ERM, and vanilla mixup. Specifically, we consider a Gaussian mixture model with subpopulation and domain shifts, which has been widely adopted in theory to shed light upon complex machine learning phenomena, such as in (Montanari et al., 2019; Zhang et al., 2021c; Liu et al., 2021b). We note that despite the popularity of mixup in practice, the theoretical analysis of how mixup (with or without the selective augmentation strategies) affects the misclassification error is still largely unexplored in the literature, even in simple models.

As discussed in Section 2, here we define $y \in \{0, 1\}$ as the label and $d \in \{R, G\}$ as the domain information. For $y \in \{0, 1\}$ and $d \in \{R, G\}$, we consider the following model:

$$x_i \mid y_i = y, d_i = d \,\sim\, N(\mu^{(y,d)}, \Sigma^{(d)}), \quad i = 1, \ldots, n^{(y,d)}, \tag{4}$$

where $\mu^{(y,d)} \in \mathbb{R}^p$ is the conditional mean vector and $\Sigma^{(d)} \in \mathbb{R}^{p \times p}$ is the covariance matrix. Let $n = \sum_{y \in \{0,1\}, d \in \{R,G\}} n^{(y,d)}$. Let $\pi^{(y,d)} = P(y_i = y, d_i = d)$, $\pi^{(y)} = P(y_i = y)$, and $\pi^{(d)} = P(d_i = d)$. To account for the spurious correlation brought by domains, we consider $\mu^{(y,R)} \neq \mu^{(y,G)}$ in general for $y \in \{0, 1\}$ and the imbalanced case where $\pi^{(0,R)}, \pi^{(1,G)} < 1/4$. Moreover, we assume there exists some invariance across the domains. Specifically, we assume $\mu^{(1,R)} - \mu^{(0,R)} = \mu^{(1,G)} - \mu^{(0,G)} := \Delta$ and $\Sigma^{(G)} = \Sigma^{(R)} := \Sigma$. According to Fisher's linear discriminant analysis (Anderson, 1962; Tony Cai & Zhang, 2019; Cai & Zhang, 2021), the optimal classification rule is linear with slope $\Sigma^{-1}\Delta$. The assumption above implies that $(\Sigma^{-1}\Delta)^\top x$ is the (unknown) invariant prediction rule for model (4).

Suppose we use some method $A$ to obtain a linear classifier $\mathbf{1}\{x^\top b + b_0 > \frac{1}{2}\}$ from the training data; we then apply it to test data and compute the worst-group misclassification error, where the misclassification error for domain $d$ and class $y$ is

$$\mathcal{E}^{(y,d)}(b, b_0) := P\big(\mathbf{1}\{x_i^\top b + b_0 > \tfrac{1}{2}\} \neq y \mid d_i = d, y_i = y\big),$$

and we denote the worst-group error of method $A$ as

$$\mathcal{E}^{(wst)}_A = \max_{d \in \{R,G\},\, y \in \{0,1\}} \mathcal{E}^{(y,d)}(b_A, b_{0,A}),$$

where $b_A$ and $b_{0,A}$ are the slope and intercept obtained with method $A$. Specifically, $A = \mathrm{ERM}$ denotes the ERM method (minimizing the sum-of-squares loss on the training data altogether), $A = \mathrm{mix}$ denotes the vanilla mixup method (without any selective augmentation strategy), and $A = \mathrm{LISA}$ denotes the mixup strategy of LISA. We denote the finite-sample version by $\hat{\mathcal{E}}^{(wst)}_A$.
Let $\Delta_e = \mathbb{E}[x_i \mid y_i = 1] - \mathbb{E}[x_i \mid y_i = 0]$ denote the marginal difference, and let

$$\xi = \frac{\Delta^\top \Sigma^{-1} \Delta_e}{\|\Delta\|_\Sigma \, \|\Delta_e\|_\Sigma}, \qquad \text{where } \|a\|_\Sigma = \sqrt{a^\top \Sigma^{-1} a},$$

denote the correlation between the domain-specific difference $\Delta$ and the marginal difference $\Delta_e$ with respect to $\Sigma$. A smaller $\xi$ indicates a larger discrepancy between the marginal difference and the domain-specific difference, and therefore implies a stronger spurious correlation between the domains and labels. We present the following theorem showing that our proposed LISA algorithm outperforms ERM and vanilla mixup in the subpopulation shift setting.

Theorem 1 (Error comparison with subpopulation shifts). Consider $n$ independent samples generated from model (4), $\pi^{(R)} = \pi^{(1)} = 1/2$, $\pi^{(0,R)} = \pi^{(1,G)} = \alpha < 1/4$, $\max_{y,d} \|\mu^{(y,d)}\|_2 \leq C$, and $\Sigma$ positive definite. Suppose $(\xi, \alpha)$ satisfies $\xi < \min\{\|\Delta_e\|_\Sigma / \|\Delta\|_\Sigma,\; \|\Delta\|_\Sigma / \|\Delta_e\|_\Sigma\} - C\alpha$ for some large enough constant $C$ and $\mathbb{E}[\lambda_i^2] / \max\{\mathrm{var}(\lambda_i), 1/4\} \leq \|\Delta_e\|_\Sigma^2 + \|\Delta_e\|_\Sigma \|\Delta\|_\Sigma$. Then for any $p_{sel} \in [0, 1]$,

$$\hat{\mathcal{E}}^{(wst)}_{LISA} < \min\{\hat{\mathcal{E}}^{(wst)}_{ERM}, \hat{\mathcal{E}}^{(wst)}_{mix}\} + O_P\big(\sqrt{p/(\alpha n)}\big).$$

In Theorem 1, $\lambda_i$ is the random mixup coefficient for the $i$-th sample. If $\lambda_i = \lambda$ is the same for all samples in a minibatch, the results still hold. Theorem 1 implies that when $\xi$ is small (indicating that the domain has a strong spurious correlation with the label) and $p = o(\alpha n)$, the worst-group classification error of LISA is asymptotically smaller than that of ERM and vanilla mixup. In fact, our analysis shows that LISA yields a classification rule closer to the invariant classification rule by leveraging the domain information.

In the next theorem, we present the misclassification error comparison under domain shifts. That is, consider samples from a new unseen domain:

$$x_i^{(0,\star)} \sim N(\mu^{(0,\star)}, \Sigma), \qquad x_i^{(1,\star)} \sim N(\mu^{(1,\star)}, \Sigma). \tag{5}$$

Let $\Delta_{e'} = 2(\mu^{(0,\star)} - \mathbb{E}[x_i])$, where $\mathbb{E}[x_i]$ is the mean of the training distribution, and assume $\mu^{(1,\star)} - \mu^{(0,\star)} = \Delta$. Let

$$\xi' = \frac{\Delta_{e'}^\top \Sigma^{-1} \Delta_e}{\|\Delta_{e'}\|_\Sigma \, \|\Delta_e\|_\Sigma} \qquad \text{and} \qquad \gamma' = \frac{\Delta^\top \Sigma^{-1} \Delta_{e'}}{\|\Delta\|_\Sigma \, \|\Delta_{e'}\|_\Sigma}$$

denote the correlations for $(\Delta_{e'}, \Delta_e)$ and for $(\Delta_{e'}, \Delta)$, respectively, with respect to $\Sigma^{-1}$. Let $\mathcal{E}^{(wst,\star)}_A = \max_{y \in \{0,1\}} \mathcal{E}^{(y,\star)}(b_A, b_{0,A})$ and let its sample version be $\hat{\mathcal{E}}^{(wst,\star)}_A$.

Theorem 2 (Error comparison with domain shifts). Suppose $n$ samples are independently generated from model (4), $\pi^{(R)} = \pi^{(1)} = 1/2$, $\pi^{(0,R)} = \pi^{(1,G)} = \alpha < 1/4$, $\max_{y,d} \|\mu^{(y,d)}\|_2 \leq C$, and $\Sigma$ is positive definite. Suppose that $(\xi, \xi', \gamma')$ satisfy $0 \leq \xi' \leq \gamma' \xi$ and $\xi < \min\{\frac{\gamma'}{2} \|\Delta_e\|_\Sigma / \|\Delta\|_\Sigma,\; \|\Delta\|_\Sigma / \|\Delta_e\|_\Sigma\} - C\alpha$ for some large enough constant $C$ and $\mathbb{E}[\lambda_i^2] / \max\{\mathrm{var}(\lambda_i), 1/4\} \leq \|\Delta_e\|_\Sigma^2 + \|\Delta_e\|_\Sigma \|\Delta\|_\Sigma$. Then for any $p_{sel} \in [0, 1]$,

$$\hat{\mathcal{E}}^{(wst,\star)}_{LISA} < \min\{\hat{\mathcal{E}}^{(wst,\star)}_{ERM}, \hat{\mathcal{E}}^{(wst,\star)}_{mix}\} + O_P\big(\sqrt{p/(\alpha n)}\big).$$

Similar to Theorem 1, this result shows that when the domain has a strong spurious correlation with the label (corresponding to a small $\xi$), such a spurious correlation leads to downgraded performance of ERM and vanilla mixup, while our proposed LISA method is able to mitigate this issue by selective data interpolation. Proofs of Theorem 1 and Theorem 2 are provided in Appendix B.
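To make the mechanism behind these theorems tangible, the following self-contained simulation of model (4) (with illustrative constants of our own choosing, not the paper's experiments) fits least-squares classifiers under ERM, vanilla mixup, and intra-label LISA, and compares their worst-group errors; same-label cross-domain pairing averages the spurious coordinate toward zero:

```python
# Simulation of the Gaussian mixture in model (4): feature 0 carries the
# invariant signal Delta, feature 1 carries the spurious domain offset.
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.05, 4000    # minority proportion pi_(0,R) = pi_(1,G) = alpha

def sample_group(y, d, m):
    mu = np.array([2.0 * y, 1.0 if d == 'G' else -1.0])
    return mu + rng.normal(size=(m, 2))

def make_train():
    sizes = {(0, 'G'): int(n * (0.5 - alpha)), (1, 'R'): int(n * (0.5 - alpha)),
             (0, 'R'): int(n * alpha),         (1, 'G'): int(n * alpha)}
    X, Y, D = [], [], []
    for (y, d), m in sizes.items():
        X.append(sample_group(y, d, m)); Y += [y] * m; D += [d] * m
    return np.vstack(X), np.array(Y, float), np.array(D)

def fit_ls(X, y):  # least-squares linear classifier, thresholded at 1/2
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def worst_group_err(w, m=5000):
    errs = []
    for y in (0, 1):
        for d in ('R', 'G'):
            Xg = sample_group(y, d, m)
            pred = np.hstack([Xg, np.ones((m, 1))]) @ w > 0.5
            errs.append(np.mean(pred != y))
    return max(errs)

X, Y, D = make_train()
lam = rng.beta(2, 2, size=len(X))

w_erm = fit_ls(X, Y)                                 # ERM on the raw data

j = rng.permutation(len(X))                          # vanilla mixup: any pairs
w_mix = fit_ls(lam[:, None] * X + (1 - lam)[:, None] * X[j],
               lam * Y + (1 - lam) * Y[j])

# Intra-label LISA: partner has the same label but the other domain, so the
# spurious coordinate is averaged toward zero while the label stays fixed.
j = np.array([rng.choice(np.where((Y == Y[i]) & (D != D[i]))[0])
              for i in range(len(X))])
w_lisa = fit_ls(lam[:, None] * X + (1 - lam)[:, None] * X[j], Y)

for name, w in [('ERM', w_erm), ('vanilla mixup', w_mix), ('LISA-L', w_lisa)]:
    print(f'{name:14s} worst-group error: {worst_group_err(w):.3f}')
```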
6. Related Work and Discussion

In this paper, we focus on improving the robustness of machine learning models to subpopulation shifts and domain shifts. Here, we discuss related approaches from the following three categories.

Learning Invariant Representations. Motivated by unsupervised domain adaptation (Ben-David et al., 2010; Ganin et al., 2016), the first category of works learns invariant representations by aligning representations across domains. The major research line of this category aims to eliminate the domain dependency by minimizing the divergence of feature distributions under different distance metrics, e.g., maximum mean discrepancy (Tzeng et al., 2014; Long et al., 2015), an adversarial loss (Ganin et al., 2016; Li et al., 2018), and Wasserstein distance (Zhou et al., 2020a). Follow-up works applied data augmentation to (1) generate more domains and enhance the consistency of representations during training (Yue et al., 2019; Zhou et al., 2020b; Xu et al., 2020; Yan et al., 2020; Shu et al., 2021; Wang et al., 2020; Yao et al., 2021) or (2) generate new domains in an adversarial way to imitate challenging domains without using training domain information (Zhao et al., 2020; Qiao et al., 2020; Volpi et al., 2018). Unlike these methods, LISA instead focuses on learning invariant predictors without restricting the internal representations, leading to stronger empirical performance.

Learning Invariant Predictors. Beyond using domain alignment to learn invariant representations, recent work aims to further enhance the correlations between the invariant representations and the labels (Koyama & Yamaguchi, 2020), leading to invariant predictors. Representatively, motivated by causal inference, invariant risk minimization (IRM) (Arjovsky et al., 2019) and its variants (Guo et al., 2021; Khezeli et al., 2021; Ahuja et al., 2021) aim to find a predictor that performs well across all domains through regularization. Other follow-up works leverage regularizers to penalize the variance of risks across all domains (Krueger et al., 2021), to align the gradients across domains (Koyama & Yamaguchi, 2020), to smooth the cross-domain interpolation paths (Chuang & Mroueh, 2021), or to incorporate a game-theoretic invariant rationalization criterion (Chang et al., 2020). Instead of using regularizers, LISA learns domain-invariant predictors via data interpolation.

Group Robustness. The last category of methods combats spurious correlations and is particularly suitable for subpopulation shifts. These approaches include directly optimizing the worst-group performance with distributionally robust optimization (Sagawa et al., 2020a; Zhang et al., 2021a; Zhou et al., 2021), generating samples around the minority groups (Goel et al., 2021), and balancing the majority and minority groups via reweighting (Sagawa et al., 2020b) or regularization (Cao et al., 2019; 2020). A few recent approaches in this category target subpopulation shifts without annotated group labels (Nam et al., 2020; Liu et al., 2021a; Zhang et al., 2021d; Creager et al., 2021; Lee et al., 2022). LISA proposes a more general strategy that is suitable for both domain shifts and subpopulation shifts.

7. Conclusion

To tackle distribution shifts, we propose LISA, a simple and efficient algorithm, to improve out-of-distribution robustness. LISA aims to eliminate domain-related spurious correlations in the training set with selective interpolation. We evaluate the effectiveness of LISA on nine datasets under subpopulation shift and domain shift settings, demonstrating its promise. In addition, detailed analyses verify that the performance gains of LISA result from encouraging the learning of invariant predictors and representations. Theoretical results further strengthen the superiority of LISA by showing a smaller worst-group misclassification error compared with ERM and vanilla data interpolation.
While we have made progress in learning invariant predictors with selective augmentation, a limitation of LISA is its compatibility with problems in which it is difficult to obtain examples with the same label (e.g., object detection, generative modeling). It would be interesting to explore more general selective augmentation strategies in the future. Additionally, we empirically find that intra-label LISA works without domain information in some domain shift situations. Systematically exploring domain-free intra-label LISA with a theoretical guarantee would be another interesting future direction.

Acknowledgement

We thank Pang Wei Koh for the many insightful discussions. This research was funded in part by JPMorgan Chase & Co. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by JPMorgan Chase & Co. or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material should not be construed as an individual recommendation for any particular client and is not intended as a recommendation of particular securities, financial instruments or strategies for a particular client. This material does not constitute a solicitation or offer in any jurisdiction. The research was also supported by Apple and Juniper Networks. The research of Linjun Zhang is partially supported by NSF DMS-2015378.

References

Ahuja, K., Caballero, E., Zhang, D., Bengio, Y., Mitliagkas, I., and Rish, I. Invariance principle meets information bottleneck for out-of-distribution generalization. 2021.

Albuquerque, I., Monteiro, J., Darvishi, M., Falk, T. H., and Mitliagkas, I. Generalizing to unseen domains via distribution matching. arXiv preprint arXiv:1911.00804, 2019.

Anderson, T. W. An introduction to multivariate statistical analysis. Technical report, Wiley New York, 1962.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Bandi, P., Geessink, O., Manson, Q., Van Dijk, M., Balkenhol, M., Hermsen, M., Bejnordi, B. E., Lee, B., Paeng, K., Zhong, A., et al. From detection of individual metastases to classification of lymph node status at the patient level: the Camelyon17 challenge. IEEE Transactions on Medical Imaging, 2018.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.

Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of the 2019 World Wide Web Conference, pp. 491–500, 2019.

Cai, T. T. and Zhang, L. A convex optimization approach to high-dimensional sparse quadratic discriminant analysis. The Annals of Statistics, 49(3):1537–1568, 2021.

Cao, K., Wei, C., Gaidon, A., Arechiga, N., and Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. NeurIPS, 2019.

Cao, K., Chen, Y., Lu, J., Arechiga, N., Gaidon, A., and Ma, T. Heteroskedastic and imbalanced deep learning with adaptive regularization. arXiv preprint arXiv:2006.15766, 2020.

Chang, S., Zhang, Y., Yu, M., and Jaakkola, T. Invariant rationalization. In International Conference on Machine Learning, pp. 1448–1458. PMLR, 2020.

Christie, G., Fendley, N., Wilson, J., and Mukherjee, R. Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
Chuang, C.-Y. and Mroueh, Y. Fair mixup: Fairness via interpolation. ICLR, 2021.

Cramér, H. Mathematical Methods of Statistics (PMS-9), Volume 9. Princeton University Press, 2016.

Creager, E., Jacobsen, J.-H., and Zemel, R. Environment inference for invariant learning. In International Conference on Machine Learning, pp. 2189–2200. PMLR, 2021.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. 2019.

Ganin, Y. and Lempitsky, V. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pp. 1180–1189. PMLR, 2015.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

Goel, K., Gu, A., Li, Y., and Ré, C. Model patching: Closing the subgroup performance gap with data augmentation. In ICLR, 2021.

Guo, R., Zhang, P., Liu, H., and Kiciman, E. Out-of-distribution prediction with invariant risk minimization: The limitation and an effective fix. arXiv preprint arXiv:2101.07732, 2021.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.

Khezeli, K., Blaas, A., Soboczenski, F., Chia, N., and Kalantari, J. On invariance penalties for risk minimization. arXiv preprint arXiv:2106.09777, 2021.

Koh, P. W., Sagawa, S., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., Lee, T., et al. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp. 5637–5664. PMLR, 2021.

Koyama, M. and Yamaguchi, S. Out-of-distribution generalization with maximal invariant predictor. arXiv preprint arXiv:2008.01883, 2020.

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M., and Fei-Fei, L. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. 2016. URL https://arxiv.org/abs/1602.07332.

Krueger, D., Caballero, E., Jacobsen, J.-H., Zhang, A., Binas, J., Zhang, D., Le Priol, R., and Courville, A. Out-of-distribution generalization via risk extrapolation (REx). In International Conference on Machine Learning, pp. 5815–5826. PMLR, 2021.

Lee, H. B., Nam, T., Yang, E., and Hwang, S. J. Meta dropout: Learning to perturb latent features for generalization. In International Conference on Learning Representations, 2019.

Lee, Y., Yao, H., and Finn, C. Diversify and disambiguate: Learning from underspecified data. arXiv preprint arXiv:2202.03418, 2022.

Li, H., Pan, S. J., Wang, S., and Kot, A. C. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400–5409, 2018.

Liang, W. and Zou, J. MetaDataset: A dataset of datasets for evaluating distribution shifts and training conflicts. In ICML 2021 ML4data Workshop, 2021.
Liu, E. Z., Haghgoo, B., Chen, A. S., Raghunathan, A., Koh, P. W., Sagawa, S., Liang, P., and Finn, C. Just train twice: Improving group robustness without training group information. In ICML, pp. 6781–6792. PMLR, 2021a.

Liu, H., HaoChen, J. Z., Gaidon, A., and Ma, T. Self-supervised learning is more robust to dataset imbalance. arXiv preprint arXiv:2110.05025, 2021b.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In ICCV, 2015.

Long, M., Cao, Y., Wang, J., and Jordan, M. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pp. 97–105. PMLR, 2015.

Montanari, A., Ruan, F., Sohn, Y., and Yan, J. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.

Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10–18. PMLR, 2013.

Nam, J., Cha, H., Ahn, S.-S., Lee, J., and Shin, J. Learning from failure: De-biasing classifier from biased classifier. Advances in Neural Information Processing Systems, 33, 2020.

Ni, J., Li, J., and McAuley, J. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

Qiao, F., Zhao, L., and Peng, X. Learning to learn single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12556–12565, 2020.

Rosenfeld, E., Ravikumar, P., and Risteski, A. The risks of invariant risk minimization. In ICLR, 2021.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR, 2020a.

Sagawa, S., Raghunathan, A., Koh, P. W., and Liang, P. An investigation of why overparameterization exacerbates spurious correlations. In ICML, pp. 8346–8356. PMLR, 2020b.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

Shi, Y., Seely, J., Torr, P. H., Siddharth, N., Hannun, A., Usunier, N., and Synnaeve, G. Gradient matching for domain generalization. arXiv preprint arXiv:2104.09937, 2021.

Shu, Y., Cao, Z., Wang, C., Wang, J., and Long, M. Open domain generalization with domain-augmented meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9624–9633, 2021.

Sun, B. and Saenko, K. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pp. 443–450. Springer, 2016.

Taylor, J., Earnshaw, B., Mabey, B., Victors, M., and Yosinski, J. RxRx1: An image set for cellular morphological variation across many experimental batches. In International Conference on Learning Representations (ICLR), 2019.

Tony Cai, T. and Zhang, L. High dimensional linear discriminant analysis: optimality, adaptive algorithm and missing data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(4):675–705, 2019.

Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Lopez-Paz, D., and Bengio, Y. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pp. 6438–6447. PMLR, 2019.

Volpi, R., Namkoong, H., Sener, O., Duchi, J., Murino, V., and Savarese, S. Generalizing to unseen domains via adversarial data augmentation. arXiv preprint arXiv:1805.12018, 2018.

Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

Wang, Y., Li, H., and Kot, A. C. Heterogeneous domain generalization via domain mixup. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3622–3626. IEEE, 2020.

Xu, M., Zhang, J., Ni, B., Li, T., Wang, C., Tian, Q., and Zhang, W. Adversarial domain adaptation with domain mixup. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 6502–6509, 2020.

Yan, S., Song, H., Li, N., Zou, L., and Ren, L. Improve unsupervised domain adaptation with mixup training. arXiv preprint arXiv:2001.00677, 2020.

Yao, H., Zhang, L., and Finn, C. Meta-learning with fewer tasks through task interpolation. In International Conference on Learning Representations, 2021.

Yue, X., Zhang, Y., Zhao, S., Sangiovanni-Vincentelli, A., Keutzer, K., and Gong, B. Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2100–2110, 2019.

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032, 2019.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. 2018.

Zhang, J., Menon, A., Veit, A., Bhojanapalli, S., Kumar, S., and Sra, S. Coping with label shift via distributionally robust optimisation. In ICLR, 2021a.

Zhang, L., Deng, Z., Kawaguchi, K., Ghorbani, A., and Zou, J. How does mixup help with robustness and generalization? In ICLR, 2021b.

Zhang, L., Deng, Z., Kawaguchi, K., and Zou, J. When and how mixup improves calibration. arXiv preprint arXiv:2102.06289, 2021c.

Zhang, M., Sohoni, N. S., Zhang, H. R., Finn, C., and Ré, C. Correct-N-Contrast: A contrastive approach for improving robustness to spurious correlations. In NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications, 2021d.

Zhao, L., Liu, T., Peng, X., and Metaxas, D. Maximum-entropy adversarial data augmentation for improved generalization and robustness. arXiv preprint arXiv:2010.08001, 2020.

Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Torralba, A. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.

Zhou, C., Ma, X., Michel, P., and Neubig, G. Examining and combating spurious features under distribution shift. In ICML, 2021.

Zhou, F., Jiang, Z., Shui, C., Wang, B., and Chaib-draa, B. Domain generalization with optimal transport and metric learning. arXiv preprint arXiv:2007.10573, 2020a.

Zhou, K., Yang, Y., Hospedales, T., and Xiang, T. Deep domain-adversarial image generation for domain generalisation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 13025–13032, 2020b.
A. Additional Experiments

A.1. Additional Experiments on Subpopulation Shifts

A.1.1. DATASET DETAILS

Colored MNIST (CMNIST): We classify MNIST digits into 2 classes, where classes 0 and 1 indicate original digits (0,1,2,3,4) and (5,6,7,8,9), respectively. The color is treated as a spurious attribute. Concretely, in the training set, the proportion between red and green samples is 8:2 in class 0, while it is 2:8 in class 1. In the validation set, the proportion between green and red samples is 1:1 for all classes. In the test set, the proportion between green and red samples is 1:9 in class 0 and 9:1 in class 1. The sizes of the train, validation, and test sets are 30,000, 10,000, and 20,000, respectively. Following Arjovsky et al. (2019), we flip labels with probability 0.25.

Waterbirds (Sagawa et al., 2020a): The Waterbirds dataset aims to classify birds as "waterbird" or "landbird", where each bird image is spuriously associated with the background "water" or "land". Waterbirds is a synthetic dataset where each image is composed by pasting a bird image sampled from the CUB dataset (Wah et al., 2011) onto a background drawn from the Places dataset (Zhou et al., 2017). The bird categories in CUB are stratified as land birds or water birds. Specifically, the following bird species are selected to construct the waterbird class: albatross, auklet, cormorant, frigatebird, fulmar, gull, jaeger, kittiwake, pelican, puffin, tern, gadwall, grebe, mallard, merganser, guillemot, or Pacific loon. All other bird species are combined into the landbird class. We define (land background, waterbird) and (water background, landbird) as the minority groups. There are 4,795 training samples in total, of which only 56 samples are "waterbirds on land" and 184 samples are "landbirds on water". The remaining training data include 3,498 samples from "landbirds on land" and 1,057 samples from "waterbirds on water".

CelebA (Liu et al., 2015; Sagawa et al., 2020a): For the CelebA data (Liu et al., 2015), we follow the data preprocessing procedure from Sagawa et al. (2020a). CelebA defines an image classification task where the input is a face image of a celebrity and the classification label is the corresponding hair color: blond or not blond. The label is spuriously correlated with gender, i.e., male or female. In CelebA, the minority groups are (blond, male) and (not blond, female). The numbers of samples per group are 71,629 "dark hair, female", 66,874 "dark hair, male", 22,880 "blond hair, female", and 1,387 "blond hair, male".

CivilComments (Borkan et al., 2019; Koh et al., 2021): We use CivilComments from the WILDS benchmark (Koh et al., 2021). CivilComments is a text classification task, aiming to predict whether an online comment is toxic or non-toxic. The spurious domain identities are defined by demographic features, including male, female, LGBTQ, Christian, Muslim, other religion, Black, and White. CivilComments contains 450,000 comments collected from online articles. The numbers of samples for training, validation, and test are 269,038, 45,180, and 133,782, respectively. Readers may refer to Table 17 in Koh et al. (2021) for detailed group information.
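As a concrete illustration of the CMNIST construction described above (label binarization, color spuriously tied to the class, 25% label flips), here is a sketch; the two-channel color encoding and the function signature are our assumptions, not the paper's code:

```python
# Sketch of the CMNIST protocol in A.1.1: binarize digit labels, flip them
# with probability 0.25, and color digits so the color correlates with the class.
import numpy as np

def make_cmnist(images, digits, p_red_given_class, flip=0.25, seed=0):
    rng = np.random.default_rng(seed)
    y = (digits >= 5).astype(int)                        # (0-4) -> 0, (5-9) -> 1
    y = np.where(rng.random(len(y)) < flip, 1 - y, y)    # 25% label noise
    is_red = rng.random(len(y)) < p_red_given_class[y]   # color given class
    out = np.zeros((len(y), 2) + images.shape[1:], dtype=images.dtype)
    out[np.arange(len(y)), is_red.astype(int)] = images  # ch. 0 green, ch. 1 red
    domain = np.where(is_red, 'red', 'green')
    return out, y, domain

# Training split: p_red_given_class = np.array([0.8, 0.2]) reproduces the
# 8:2 (class 0) and 2:8 (class 1) red:green ratios described above.
```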
A.1.2. TRAINING DETAILS

We adopt a pre-trained ResNet-50 (He et al., 2016) for the image datasets (CMNIST, Waterbirds, CelebA) and DistilBERT (Sanh et al., 2019) for the text dataset (CivilComments). In each training iteration, we sample a batch of data per group. For intra-label LISA, we randomly apply mixup on sample batches with the same labels but different domains. For intra-domain LISA, we instead apply mixup on sample batches with the same domain but different labels. The interpolation ratio λ is sampled from the distribution Beta(2, 2). All hyperparameters are selected via cross-validation and are listed in Table 9.

Table 9. Hyperparameter settings for the subpopulation shifts.

Dataset                     CMNIST      Waterbirds   CelebA      CivilComments
Learning rate               1e-3        1e-3         1e-4        1e-5
Weight decay                1e-4        1e-4         1e-4        0
Scheduler                   n/a         n/a          n/a         n/a
Batch size                  16          16           16          8
Type of mixup               mixup       mixup        CutMix      ManifoldMix
Architecture                ResNet-50   ResNet-50    ResNet-50   DistilBERT
Optimizer                   SGD         SGD          SGD         Adam
Maximum epochs              300         300          50          3
Strategy sel. prob. p_sel   0.5         0.5          0.5         1.0
A.1.3. ADDITIONAL RESULTS

In this section, we provide the full results on subpopulation shifts in Table 10 and Table 11.

Table 10. Full results of subpopulation shifts with standard deviations. All results are averaged over three random seeds.

                    CMNIST                       Waterbirds
                    Avg.          Worst          Avg.          Worst
ERM                 27.8 ± 1.9%   0.0 ± 0.0%     97.0 ± 0.2%   63.7 ± 1.9%
UW                  72.2 ± 1.1%   66.0 ± 0.7%    95.1 ± 0.3%   88.0 ± 1.3%
IRM                 72.1 ± 1.2%   70.3 ± 0.8%    87.5 ± 0.7%   75.6 ± 3.1%
IB-IRM              72.2 ± 1.3%   70.7 ± 1.2%    88.5 ± 0.6%   76.5 ± 1.2%
V-REx               71.7 ± 1.2%   70.2 ± 0.9%    88.0 ± 1.0%   73.6 ± 0.2%
Coral               71.8 ± 1.7%   69.5 ± 0.9%    90.3 ± 1.1%   79.8 ± 1.8%
GroupDRO            72.3 ± 1.2%   68.6 ± 0.8%    91.8 ± 0.3%   90.6 ± 1.1%
DomainMix           51.4 ± 1.3%   48.0 ± 1.3%    76.4 ± 0.3%   53.0 ± 1.3%
Fish                46.9 ± 1.4%   35.6 ± 1.7%    85.6 ± 0.4%   64.0 ± 0.3%
LISA (ours)         74.0 ± 0.1%   73.3 ± 0.2%    91.8 ± 0.3%   89.2 ± 0.6%

                    CelebA                       CivilComments
                    Avg.          Worst          Avg.          Worst
ERM                 94.9 ± 0.2%   47.8 ± 3.7%    92.2 ± 0.1%   56.0 ± 3.6%
UW                  92.9 ± 0.2%   83.3 ± 2.8%    89.8 ± 0.5%   69.2 ± 0.9%
IRM                 94.0 ± 0.4%   77.8 ± 3.9%    88.8 ± 0.7%   66.3 ± 2.1%
IB-IRM              93.6 ± 0.3%   85.0 ± 1.8%    89.1 ± 0.3%   65.3 ± 1.5%
V-REx               92.2 ± 0.1%   86.7 ± 1.0%    90.2 ± 0.3%   64.9 ± 1.2%
Coral               93.8 ± 0.3%   76.9 ± 3.6%    88.7 ± 0.5%   65.6 ± 1.3%
GroupDRO            92.1 ± 0.4%   87.2 ± 1.6%    89.9 ± 0.5%   70.0 ± 2.0%
DomainMix           93.4 ± 0.1%   65.6 ± 1.7%    90.9 ± 0.4%   63.6 ± 2.5%
Fish                93.1 ± 0.3%   61.2 ± 2.5%    89.8 ± 0.4%   71.1 ± 0.4%
LISA (ours)         92.4 ± 0.4%   89.3 ± 1.1%    89.2 ± 0.9%   72.6 ± 0.1%

Table 11. Full comparison between LISA and substitute mixup strategies in subpopulation shifts. UW represents upweighting.

                    CMNIST                       Waterbirds
                    Avg.          Worst          Avg.          Worst
ERM                 27.8 ± 1.9%   0.0 ± 0.0%     97.0 ± 0.2%   63.7 ± 1.9%
Vanilla mixup       32.6 ± 3.1%   3.1 ± 2.4%     81.0 ± 0.2%   56.2 ± 0.2%
Vanilla mixup + UW  72.2 ± 0.7%   71.8 ± 0.1%    92.1 ± 0.1%   85.6 ± 1.0%
Within Group        33.6 ± 1.9%   24.0 ± 1.1%    88.7 ± 0.3%   68.0 ± 0.4%
Within Group + UW   72.6 ± 0.1%   71.6 ± 0.2%    91.4 ± 0.6%   87.1 ± 0.6%
LISA (ours)         74.0 ± 0.1%   73.3 ± 0.2%    91.8 ± 0.3%   89.2 ± 0.6%

                    CelebA                       CivilComments
                    Avg.          Worst          Avg.          Worst
ERM                 94.9 ± 0.2%   47.8 ± 3.7%    92.2 ± 0.1%   56.0 ± 3.6%
Vanilla mixup       95.8 ± 0.0%   46.4 ± 0.5%    90.8 ± 0.8%   67.2 ± 1.2%
Vanilla mixup + UW  91.5 ± 0.2%   88.0 ± 0.3%    87.8 ± 1.2%   66.1 ± 1.4%
Within Group        95.2 ± 0.3%   58.3 ± 0.9%    90.8 ± 0.6%   69.2 ± 0.8%
Within Group + UW   92.4 ± 0.4%   87.8 ± 0.6%    84.8 ± 0.7%   69.3 ± 1.1%
LISA (ours)         92.4 ± 0.4%   89.3 ± 1.1%    89.2 ± 0.9%   72.6 ± 0.1%

A.2. Additional Experimental Settings on Domain Shifts

A.2.1. DATASET DETAILS

In this section, we provide detailed descriptions of the datasets used in the domain-shift experiments and report the data statistics in Table 4.

Camelyon17: We use Camelyon17 from the WILDS benchmark (Koh et al., 2021; Bandi et al., 2018), which provides 450,000 lymph-node scans sampled from 5 hospitals. Camelyon17 is a medical image classification task where the input x is a 96 x 96 image and the label y indicates whether there is tumor tissue in the image. The domain d denotes the hospital that the patch was taken from. The training dataset is drawn from the first 3 hospitals, while the out-of-distribution validation and out-of-distribution test datasets are sampled from the 4th and 5th hospitals, respectively.

FMoW: The FMoW dataset from the WILDS benchmark (Koh et al., 2021; Christie et al., 2018) is a satellite image classification task with 62 classes and 80 domains (16 years x 5 regions). Concretely, the input x is a 224 x 224 RGB satellite image, the label y is one of the 62 building or land use categories, and the domain d represents both the year the image was taken and its geographical region (Africa, the Americas, Oceania, Asia, or Europe). The train/validation/test splits are based on the time when the images were taken. Specifically, images taken before 2013 are used as the training set, images taken between 2013 and 2015 are used as the validation set, and images taken after 2015 are used for testing.

RxRx1: RxRx1 (Koh et al., 2021; Taylor et al., 2019) from the WILDS benchmark is a cell image classification task. In this dataset, some cells have been genetically perturbed by siRNA, and the goal of RxRx1 is to predict which siRNA the cells have been treated with. Concretely, the input x is an image of cells obtained by fluorescent microscopy, the label y indicates which of the 1,139 genetic treatments the cells received, and the domain d denotes the experimental batch. Here, 33 different batches of images are used for training, where each batch contains one sample for each class. The out-of-distribution validation set has images from 4 experimental batches, and the out-of-distribution test set has images from 14 experimental batches. The average accuracy on the out-of-distribution test set is reported.

Amazon: Each task in the Amazon benchmark (Koh et al., 2021; Ni et al., 2019) is a multi-class sentiment classification task. The input x is the text of a review, the label y is the corresponding star rating ranging from 1 to 5, and the domain d is the corresponding reviewer. The training set has 245,502 reviews from 1,252 reviewers, the out-of-distribution validation set has 100,050 reviews from another 1,334 reviewers, and the out-of-distribution test set has 100,050 reviews from a further 1,252 reviewers. We evaluate the models by the 10th percentile of per-user accuracies on the test set.
MetaShift: We use MetaShift (Liang & Zou, 2021), which is derived from Visual Genome (Krishna et al., 2016). MetaShift leverages the natural heterogeneity of Visual Genome to provide many distinct data distributions for a given class (e.g., cats with cars or cats in bathrooms for the cat class). A key feature of MetaShift is that it provides explicit explanations of dataset correlations and a distance score that measures the degree of distribution shift between any pair of sets. We adopt the Cat vs. Dog task in MetaShift, where we evaluate the model on the dog(shelf) domain with 306 images and the cat(shelf) domain with 235 images. The training data for the Cat class is cat(sofa + bed), comprising the cat(sofa) and cat(bed) domains. MetaShift provides 4 different sets of training data for the Dog class in an increasingly challenging order, i.e., with increasing amounts of distribution shift. Specifically, dog(cabinet + bed), dog(bag + box), dog(bench + bike), and dog(boat + surfboard) are selected for training, where their corresponding distances to dog(shelf) are 0.44, 0.71, 1.12, and 1.43.

A.2.2. TRAINING DETAILS

Following WILDS (Koh et al., 2021), we adopt a pre-trained DenseNet-121 (Huang et al., 2017) for the Camelyon17 and FMoW datasets, ResNet-50 (He et al., 2016) for the RxRx1 and MetaShift datasets, and DistilBERT (Sanh et al., 2019) for the Amazon dataset. In each training iteration, we first draw a batch of samples B1 from the training set. Given B1, we then select another batch B2 with the same labels as B1 for data interpolation. The interpolation ratio λ is drawn from the distribution Beta(2, 2). We use the same image transformations as Koh et al. (2021), and all other hyperparameters are selected via cross-validation and are listed in Table 12.

Table 12. Hyperparameter settings for the domain shifts.

Dataset                     Camelyon17    FMoW          RxRx1           Amazon       MetaShift
Learning rate               1e-4          1e-4          1e-3            2e-6         1e-3
Weight decay                0             0             1e-5            0            1e-4
Scheduler                   n/a           n/a           Cosine Warmup   n/a          n/a
Batch size                  32            32            72              8            16
Type of mixup               CutMix        CutMix        CutMix          ManifoldMix  CutMix
Architecture                DenseNet-121  DenseNet-121  ResNet-50       DistilBERT   ResNet-50
Optimizer                   SGD           Adam          Adam            Adam         SGD
Maximum epochs              2             5             90              3            100
Strategy sel. prob. p_sel   1.0           1.0           1.0             1.0          1.0

A.3. Strength of Spurious Correlation

In Section 2, the spurious correlation is defined as the association between the domain d and the label y, measured by Cramér's V (Cramér, 2016). Specifically, let $k_{y,d}$ be the number of samples from domain d with label y. Cramér's V is formulated as

$$V = \sqrt{\frac{\sum_{y \in \mathcal{Y}, d \in \mathcal{D}} \big(k_{y,d} - \bar{k}_{y,d}\big)^2 / \bar{k}_{y,d}}{N \min\big(|\mathcal{Y}| - 1,\ |\mathcal{D}| - 1\big)}}, \tag{6}$$

where N represents the number of samples in the entire dataset and $\bar{k}_{y,d} = \big(\sum_{d' \in \mathcal{D}} k_{y,d'}\big)\big(\sum_{y' \in \mathcal{Y}} k_{y',d}\big)/N$ is the expected count under independence of y and d. Cramér's V varies from 0 to 1, and a higher value represents a stronger correlation. According to Eq. (6), we calculate the strength of spurious correlations on all datasets used in the experiments and report the results in Table 13. Compared with the other datasets, the Cramér's V values on Camelyon17, FMoW, and RxRx1 are significantly smaller, indicating weaker spurious correlations.

Table 13. Strength of spurious correlations on datasets with subpopulation shifts or domain shifts.

Subpopulation shifts:  CMNIST 0.6000   Waterbirds 0.8672   CelebA 0.3073   CivilComments 0.8723
Domain shifts:         Camelyon17 0.0004   FMoW 0.1114   RxRx1 0.0067   Amazon 0.3377   MetaShift 0.4932
A.4. Results on Datasets without Spurious Correlations

In order to analyze the factors behind the performance gains of LISA, we conduct experiments on datasets without spurious correlations. Specifically, we balance the number of samples for each group under the subpopulation-shift setting. The results of ERM, Vanilla mixup, and LISA on CMNIST, Waterbirds, and CelebA are reported in Table 14. The results show that LISA performs similarly to ERM when the datasets do not have spurious correlations, whereas LISA significantly outperforms ERM when spurious correlations are present. Another interesting finding is that Vanilla mixup outperforms LISA and ERM without spurious correlations, while LISA achieves the best performance with spurious correlations. This finding strengthens our conclusion that the performance gains of LISA come from eliminating spurious correlations rather than from simple data augmentation.

Table 14. Results on datasets without spurious correlations. LISA performs similarly to ERM when there are no spurious correlations. Vanilla mixup outperforms LISA and ERM in this setting while underperforming LISA on datasets with spurious correlations, further supporting the claim that LISA's gains are not from simple data augmentation.

Dataset         CMNIST    Waterbirds   CelebA
ERM             73.67%    88.07%       86.11%
Vanilla mixup   74.28%    88.23%       88.89%
LISA            73.18%    87.05%       87.22%

A.5. Additional Invariance Analysis

A.5.1. ADDITIONAL METRICS OF INVARIANT PREDICTOR ANALYSIS

In Table 15, we report two additional metrics to measure the invariance of predictors, Risk Variance and Gradient Norm, which are defined as follows:

Risk Variance (IPvar). Motivated by Krueger et al. (2021), we use the variance of test risks across all domains to measure invariance: $\mathrm{IP}_{var} = \mathrm{Var}\big(\{\mathcal{R}_1(\theta), \dots, \mathcal{R}_D(\theta)\}\big)$, where D is the number of test domains and $\mathcal{R}_d(\theta)$ is the risk of domain d.

Gradient Norm (IPnorm). Following IRMv1 (Arjovsky et al., 2019), we use the gradient norm of the classifier to measure the optimality of the dummy classifier in each domain d. Assuming the classifier is parameterized by w, IPnorm is defined as $\mathrm{IP}_{norm} = \frac{1}{|\mathcal{D}|}\sum_{d \in \mathcal{D}} \big\|\nabla_{w}\mathcal{R}_d(\theta)\big|_{w=1.0}\big\|_2$.

Table 15. Additional invariance metrics for the invariant predictor analysis. We report the risk variance (IPvar) and gradient norm (IPnorm), where smaller values indicate stronger invariance.

                 IPvar                                            IPnorm
                 CMNIST    Waterbirds  Camelyon17  MetaShift     CMNIST    Waterbirds  Camelyon17  MetaShift
ERM              12.0486   0.2456      0.0150      1.8824        1.1162    1.5780      1.2959      1.0914
Vanilla mixup    0.2769    0.1465      0.0180      0.2659        1.5347    1.8631      0.3993      0.1985
IRM              0.0112    0.1243      0.0201      0.8748        0.0908    0.9798      0.5266      0.2320
IB-IRM           0.0072    0.2069      0.0329      0.5680        0.6225    0.8814      0.6890      0.1683
V-REx            0.0056    0.1257      0.0106      0.4220        0.0290    0.8329      0.9641      0.3680
LISA (ours)      0.0012    0.0016      9.97e-5     0.2387        0.0039    0.0538      0.3081      0.1354

Comparing LISA to the other invariant learning methods, the results of IPvar and IPnorm further confirm that LISA does indeed improve predictor invariance.

A.5.2. ANALYSIS OF LEARNED INVARIANT REPRESENTATIONS

In this section, we use the pairwise divergence of representations (IRkl) to measure representation-level invariance. Specifically, let $h_{i,d}$ denote the representation before the classifier for each sample $(x_i, y_i, d)$; we compute the KL divergence between the distributions of representations across domains. Kernel density estimation is used to estimate the probability density function $P(h^y_d)$ of representations from domain d with label y. Formally, IRkl is defined as $\mathrm{IR}_{kl} = \frac{1}{|\mathcal{Y}||\mathcal{D}|^2}\sum_{y \in \mathcal{Y}}\sum_{d, d' \in \mathcal{D}} \mathrm{KL}\big(P(h^y_D \mid D = d)\,\big\|\,P(h^y_D \mid D = d')\big)$. Smaller IRkl values indicate representations that are more invariant with respect to the labels.
We report the results on CMNIST, Waterbirds, Camelyon17, and MetaShift in Table 16. Our key observations are: (1) Compared with ERM, LISA learns stronger representation-level invariance. A potential reason is that a stronger invariant predictor implicitly entails more invariant representations; (2) LISA produces more invariant representations than the regularization-based invariant predictor learning methods, i.e., IRM, IB-IRM, and V-REx, demonstrating its capability to learn stronger invariance.

Table 16. Results of representation-level invariance IRkl (x 10^-8 for CMNIST), where a smaller IRkl value denotes stronger invariance.

                 CMNIST   Waterbirds   Camelyon17   MetaShift
ERM              1.683    3.592        8.213        0.632
Vanilla mixup    4.392    3.935        7.786        0.634
IRM              1.905    2.413        8.169        0.627
IB-IRM           3.178    3.306        8.824        0.646
V-REx            3.169    3.414        8.838        0.617
LISA (ours)      0.421    1.912        7.570        0.585

Besides the quantitative analysis, following Appendix C in Lee et al. (2019), we visualize the hidden representations of all test samples together with the decision boundary on Waterbirds, and illustrate the results in Figure 3. Compared with the other methods, the representations learned by LISA for samples with the same label are closer together regardless of their domain, which further demonstrates the promise of LISA in producing invariant representations.

[Figure 3. Visualization of sample representations and decision boundaries on the Waterbirds dataset. Panels compare ERM, Vanilla mixup, IB-IRM, and V-REx against LISA; the legend distinguishes Landbird in Land, Waterbird in Land, Landbird in Water, Waterbird in Water, and the Decision Boundary.]

A.6. Full Results of WILDS Data

Following Koh et al. (2021), we report additional results on the WILDS datasets in Tables 17-20, including validation performance and results under other metrics. According to these additional results, LISA outperforms the other baseline approaches in all scenarios. We highlight two additional findings: (1) In Camelyon17, the test data is much more visually distinctive than the validation data, resulting in a large gap (roughly 10%) between the validation and test performance of ERM (see Table 17). LISA significantly reduces this gap, showing its promise in improving OOD robustness; (2) In Amazon, although LISA performs worse than ERM in average accuracy, it achieves the best accuracy at the 10th percentile, which is regarded as a more common and important metric for evaluating whether models perform consistently well across all users (Koh et al., 2021).

Table 17. Full results on Camelyon17. We report both validation and test accuracy.

                 Validation Acc.   Test Acc.
ERM              84.9 ± 3.1%       70.3 ± 6.4%
IRM              86.2 ± 1.4%       64.2 ± 8.1%
IB-IRM           80.5 ± 0.4%       68.9 ± 6.1%
V-REx            82.3 ± 1.3%       71.5 ± 8.3%
Coral            86.2 ± 1.4%       59.5 ± 7.7%
GroupDRO         85.5 ± 2.2%       68.4 ± 7.3%
DomainMix        83.5 ± 1.1%       69.7 ± 5.5%
Fish             83.9 ± 1.2%       74.7 ± 7.1%
LISA (ours)      81.8 ± 1.3%       77.1 ± 6.5%

Table 18. Full results on FMoW. We report the average accuracy and the worst-domain accuracy on both the validation and test sets.

                 Validation                      Test
                 Avg. Acc.      Worst Acc.       Avg. Acc.      Worst Acc.
ERM              59.5 ± 0.37%   48.9 ± 0.62%     53.0 ± 0.55%   32.3 ± 1.25%
IRM              57.4 ± 0.37%   47.5 ± 1.57%     50.8 ± 0.13%   30.0 ± 1.37%
IB-IRM           56.1 ± 0.48%   45.0 ± 0.62%     49.5 ± 0.49%   28.4 ± 0.90%
V-REx            55.3 ± 1.75%   44.7 ± 1.31%     48.0 ± 0.64%   27.2 ± 0.78%
Coral            56.9 ± 0.25%   47.1 ± 0.43%     50.5 ± 0.36%   31.7 ± 1.24%
GroupDRO         58.8 ± 0.19%   46.5 ± 0.25%     52.1 ± 0.50%   30.8 ± 0.81%
DomainMix        58.6 ± 0.29%   48.9 ± 1.15%     51.6 ± 0.19%   34.2 ± 0.76%
Fish             57.8 ± 0.15%   49.5 ± 2.34%     51.8 ± 0.32%   34.6 ± 0.18%
LISA (ours)      58.7 ± 0.92%   48.7 ± 0.74%     52.8 ± 0.94%   35.5 ± 0.65%

Table 19. Full results on RxRx1. ID: in-distribution; OOD: out-of-distribution.

                 Validation Acc.   Test ID Acc.   Test OOD Acc.
ERM              19.4 ± 0.2%       35.9 ± 0.4%    29.9 ± 0.4%
IRM              5.6 ± 0.4%        9.9 ± 1.4%     8.2 ± 1.1%
IB-IRM           4.3 ± 0.7%        7.9 ± 0.5%     6.4 ± 0.6%
V-REx            5.2 ± 0.6%        9.3 ± 0.9%     7.5 ± 0.8%
Coral            18.5 ± 0.4%       34.0 ± 0.3%    28.4 ± 0.3%
GroupDRO         15.2 ± 0.1%       28.1 ± 0.3%    23.0 ± 0.3%
DomainMix        19.3 ± 0.7%       39.8 ± 0.2%    30.8 ± 0.4%
Fish             7.5 ± 0.6%        12.7 ± 1.9%    10.1 ± 1.5%
LISA (ours)      20.1 ± 0.4%       41.2 ± 1.0%    31.9 ± 0.8%

Table 20. Full results on Amazon. Both the average accuracy and the 10th-percentile accuracy are reported.

                 Validation                      Test
                 Avg. Acc.     10th Per. Acc.    Avg. Acc.     10th Per. Acc.
ERM              72.7 ± 0.1%   55.2 ± 0.7%       71.9 ± 0.1%   53.8 ± 0.8%
IRM              71.5 ± 0.3%   54.2 ± 0.8%       70.5 ± 0.3%   52.4 ± 0.8%
IB-IRM           72.4 ± 0.4%   55.1 ± 0.6%       72.2 ± 0.3%   53.8 ± 0.7%
V-REx            72.7 ± 1.2%   53.8 ± 0.7%       71.4 ± 0.4%   53.3 ± 0.0%
Coral            72.0 ± 0.3%   54.7 ± 0.0%       70.0 ± 0.6%   52.9 ± 0.8%
GroupDRO         70.7 ± 0.6%   54.7 ± 0.0%       70.0 ± 0.6%   53.3 ± 0.0%
DomainMix        71.9 ± 0.2%   54.7 ± 0.0%       71.1 ± 0.1%   53.3 ± 0.0%
Fish             72.5 ± 0.0%   54.7 ± 0.0%       71.7 ± 0.1%   53.3 ± 0.0%
LISA (ours)      71.6 ± 0.4%   55.1 ± 0.6%       70.8 ± 0.3%   54.7 ± 0.0%
B. Proofs of Theorem 1 and Theorem 2

Outline of the proof. We first find the mis-classification errors based on the population version of OLS under the different mixup strategies. Next, we develop the convergence rate of the empirical OLS based on n samples towards its population version. These two steps together give us the empirical mis-classification errors of the different methods. We will separately show that the upper bounds in Theorem 1 and Theorem 2 hold for the two selective augmentation strategies of LISA, and hence hold for any $p_{sel} \in [0, 1]$. Let LL denote intra-label LISA and LD denote intra-domain LISA. Let $\pi_{(1)} = P(y_i = 1)$ and $\pi_{(0)} = P(y_i = 0)$ denote the marginal class proportions in the training samples, and let $\pi_{(R)} = P(d_i = R)$ and $\pi_{(G)} = P(d_i = G)$ denote the marginal subpopulation proportions. Let $\pi_{G|1} = P(d_i = G \mid y_i = 1)$ and define $\pi_{G|0}$, $\pi_{R|1}$, and $\pi_{R|0}$ similarly. We consider the setting where $\alpha := \pi_{(1,G)} = \pi_{(0,R)}$ is relatively small and $\pi_{(1)} = \pi_{(0)} = \pi_{(G)} = \pi_{(R)} = 1/2$.

B.1. Decomposing the loss function

Recall that $\Delta = \mu^{(1,G)} - \mu^{(0,G)} = \mu^{(1,R)} - \mu^{(0,R)}$. We further define $\widetilde{\Delta} = \mu^{(1)} - \mu^{(0)}$, $\theta^{(G)} = \mu^{(0,G)} - E[x_i]$, and $\theta^{(R)} = \mu^{(0,R)} - E[x_i]$. For the mixup estimators, we will repeatedly use the fact that $\lambda_i$ has a symmetric distribution with support $[0, 1]$.

For the ERM estimator based on $(X, y)$, where $b_0 = \frac{1}{2} - E[x_i]^T b$, we have

$$(\mu^{(0,G)})^T b + b_0 = (\mu^{(0,G)} - E[x_i])^T b + \tfrac{1}{2} = (\theta^{(G)})^T b + E[y_i],$$
$$(\mu^{(1,G)})^T b + b_0 = (\mu^{(1,G)} - E[x_i])^T b + \tfrac{1}{2} = \Delta^T b + (\theta^{(G)})^T b + E[y_i].$$

Notice that, based on the estimator $(b, b_0)$, for $d \in \{G, R\}$,

$$\mathcal{E}^{(1,d)}(b, b_0) = \Phi\Big(\frac{-\Delta^T b - (\theta^{(d)})^T b}{\sqrt{b^T \Sigma b}}\Big) \quad\text{and}\quad \mathcal{E}^{(0,d)}(b, b_0) = \Phi\Big(\frac{(\theta^{(d)})^T b}{\sqrt{b^T \Sigma b}}\Big). \tag{7}$$

In the extreme case where $\pi_{(0,R)} = \pi_{(1,G)} = 0$, we have $\widetilde{\Delta} = \mu^{(1,R)} - \mu^{(0,G)}$, $\theta^{(G)} = -\tfrac{1}{2}\widetilde{\Delta}$, $\theta^{(R)} = \tfrac{1}{2}\widetilde{\Delta} - \Delta$, and $\Delta_0 := \mu^{(0,G)} - \mu^{(0,R)} = \Delta - \widetilde{\Delta}$. Combining the group-wise errors in (7), the worst-group error of $(b, b_0)$ is $\mathcal{E}^{(wst)}(b, b_0) = \max_{y \in \{0,1\},\, d \in \{G,R\}} \mathcal{E}^{(y,d)}(b, b_0)$; in particular, in the extreme case,

$$\mathcal{E}^{(wst)}_0 = \max\Big\{\Phi\Big(\frac{\tfrac12\widetilde{\Delta}^T b - \Delta^T b}{\sqrt{b^T \Sigma b}}\Big),\ \Phi\Big(\frac{-\tfrac12\widetilde{\Delta}^T b}{\sqrt{b^T \Sigma b}}\Big)\Big\}$$

for the benchmark (ERM) estimator.

B.2. Classification errors of the four methods with infinite training samples

We first provide the limits of the classification errors as $n \to \infty$.

B.2.1. BASELINE METHOD: ERM

For the training data, it is easy to show that

$$\mathrm{var}(x) = E[\mathrm{var}(x \mid y)] + \mathrm{var}(E[x \mid y]) = \Sigma + E\big[\mathrm{var}(E[x \mid y, D] \mid y)\big] + \mathrm{var}\big((\mu^{(1)} - \mu^{(0)})\,y\big)$$
$$= \Sigma + E\big[\mathrm{var}\big((\mu^{(0,R)} - \mu^{(0,G)})\,1(D = R) \mid y\big)\big] + \widetilde{\Delta}^{\otimes 2}\,\pi_{(1)}\pi_{(0)}$$
$$= \Sigma + \tfrac12\big(\mu^{(0,R)} - \mu^{(0,G)}\big)^{\otimes 2}\big(\pi_{R|1}\pi_{G|1} + \pi_{R|0}\pi_{G|0}\big) + \widetilde{\Delta}^{\otimes 2}\,\pi_{(1)}\pi_{(0)},$$
$$\mathrm{cov}(x, y) = \mathrm{cov}(E[x \mid y], y) = \mathrm{cov}\big(\mu^{(0)} + \widetilde{\Delta}\,y,\ y\big) = \widetilde{\Delta}\,\pi_{(1)}\pi_{(0)}.$$

Letting $a_0 = \tfrac12\big(\pi_{R|1}\pi_{G|1} + \pi_{R|0}\pi_{G|0}\big)$ and $\Delta_0 = \mu^{(0,G)} - \mu^{(0,R)}$, the ERM estimator has slope and intercept

$$b = \mathrm{var}(x)^{-1}\mathrm{cov}(x, y) \propto \big(\Sigma + a_0\,\Delta_0^{\otimes 2}\big)^{-1}\widetilde{\Delta} = \Sigma^{-1}\widetilde{\Delta} - \frac{a_0\,\widetilde{\Delta}^T\Sigma^{-1}\Delta_0}{1 + a_0\,\Delta_0^T\Sigma^{-1}\Delta_0}\,\Sigma^{-1}\Delta_0, \qquad b_0 = E[y] - E[x]^T b.$$
B.2.2. BASELINE METHOD: VANILLA MIXUP

Vanilla mixup does not use the group information. Let $i_1$ be a random draw from $\{1, \dots, n\}$, and let $i_2$ be a random draw from $\{1, \dots, n\}$ independent of $i_1$. Let $\tilde{y}_i = \lambda_i y_{i_1} + (1 - \lambda_i) y_{i_2}$ and $\tilde{x}_i = \lambda_i x_{i_1} + (1 - \lambda_i) x_{i_2}$. We can find that

$$\mathrm{cov}(\tilde{x}_i, \tilde{y}_i) = \mathrm{cov}\big(\lambda_i x_{i_1} + (1-\lambda_i)x_{i_2},\ \lambda_i y_{i_1} + (1-\lambda_i)y_{i_2}\big) = \mathrm{cov}(\lambda_i x_{i_1}, \lambda_i y_{i_1}) + \mathrm{cov}\big((1-\lambda_i)x_{i_2}, (1-\lambda_i)y_{i_2}\big) = \big(E[\lambda_i^2] + E[(1-\lambda_i)^2]\big)\,\mathrm{cov}(x_i, y_i),$$
$$\mathrm{cov}(\tilde{x}_i) = \big(E[\lambda_i^2] + E[(1-\lambda_i)^2]\big)\,\mathrm{cov}(x_i).$$

Hence, the population-level slope is the same as the slope of the benchmark method, and it is easy to show that the population-level intercept is also the same. Hence, $\mathcal{E}^{(wst)}_{mix} = \mathcal{E}^{(wst)}_0$.
B.3. Intra-label LISA (LISA-L): mixup across domains

Define $x^{(\lambda)}_i = \lambda_i x^{(y_i,G)}_{i_1} + (1-\lambda_i)x^{(y_i,R)}_{i_2}$, where $i_1$ is a random draw from $\{l : y_l = y_i, d_l = G\}$ and $i_2$ is a random draw from $\{l : y_l = y_i, d_l = R\}$. We then perform OLS based on $(x^{(\lambda)}_i, y_i)$, $i = 1, \dots, n$. We can calculate that

$$\mathrm{cov}(x^{(\lambda)}_i, y_i) = \mathrm{cov}\big(E[x^{(\lambda)}_i \mid y_i],\ y_i\big) = \mathrm{cov}\big(\tfrac12\mu^{(y_i,G)} + \tfrac12\mu^{(y_i,R)},\ y_i\big) = \Delta\,\mathrm{var}(y_i) = \Delta\,\pi_{(1)}\pi_{(0)},$$
$$\mathrm{cov}(x^{(\lambda)}_i) = E\big[\mathrm{cov}(x^{(\lambda)}_i \mid y_i, \lambda_i)\big] + \mathrm{cov}\big(E[x^{(\lambda)}_i \mid y_i, \lambda_i]\big) = 2E[\lambda_i^2]\,\Sigma + \mathrm{cov}\big(\lambda_i(\mu^{(0,G)} - \mu^{(0,R)}) + \Delta\,y_i\big) = 2E[\lambda_i^2]\,\Sigma + \mathrm{var}(\lambda_i)\,\Delta_0^{\otimes 2} + \pi_{(1)}\pi_{(0)}\,\Delta^{\otimes 2}.$$

B.4. Intra-domain LISA (LISA-D): mixup within each domain

The interpolated samples can be written as

$$(\tilde{y}_i, \tilde{x}_i) = \big(\lambda_i,\ \lambda_i x^{(1,G)}_{i_1} + (1-\lambda_i)x^{(0,G)}_{i_2}\big) \ \text{ if } d_i = G, \qquad (\tilde{y}_i, \tilde{x}_i) = \big(\lambda_i,\ \lambda_i x^{(1,R)}_{i_1} + (1-\lambda_i)x^{(0,R)}_{i_2}\big) \ \text{ if } d_i = R,$$

where $i_1$ is a random draw from $\{l : d_l = d_i, y_l = 1\}$ and $i_2$ is a random draw from $\{l : d_l = d_i, y_l = 0\}$. We consider regressing $\tilde{y}_i$ on $\tilde{x}_i$. We have

$$\mathrm{cov}(\tilde{x}_i, \tilde{y}_i \mid d_i = G) = \mathrm{cov}\big(E[\tilde{x}_i \mid \tilde{y}_i, d_i = G],\ \tilde{y}_i \mid d_i = G\big) = \mathrm{var}(\tilde{y}_i)\,\big(\mu^{(1,G)} - \mu^{(0,G)}\big),$$
$$\mathrm{var}(\tilde{x}_i \mid d_i = G) = E\big[\mathrm{var}(\tilde{x}_i \mid \lambda_i, d_i = G)\big] + \mathrm{var}\big(E[\tilde{x}_i \mid \lambda_i, d_i = G]\big) = 2E[\lambda_i^2]\,\Sigma + \mathrm{var}\big(\lambda_i\mu^{(1,G)} + (1-\lambda_i)\mu^{(0,G)}\big) = 2E[\lambda_i^2]\,\Sigma + \mathrm{var}(\tilde{y}_i)\,\Delta^{\otimes 2}.$$

We further have

$$\mathrm{cov}(\tilde{x}_i, \tilde{y}_i) = E\big[\mathrm{cov}(\tilde{x}_i, \tilde{y}_i \mid d_i)\big] + \mathrm{cov}\big(E[\tilde{x}_i \mid d_i], E[\tilde{y}_i \mid d_i]\big) = \mathrm{cov}(\tilde{x}^{(G)}_i, \tilde{y}^{(G)}_i)\,\pi_{(G)} + \mathrm{cov}(\tilde{x}^{(R)}_i, \tilde{y}^{(R)}_i)\,\pi_{(R)} = \mathrm{var}(\tilde{y}_i)\,\Delta,$$
$$\mathrm{var}(\tilde{x}_i) = E[\mathrm{var}(\tilde{x}_i \mid d_i)] + \mathrm{var}(E[\tilde{x}_i \mid d_i]) = \mathrm{var}(\tilde{x}^{(G)}_i)\,\pi_{(G)} + \mathrm{var}(\tilde{x}^{(R)}_i)\,\pi_{(R)} + \big(E[\tilde{x}^{(G)}_i] - E[\tilde{x}^{(R)}_i]\big)^{\otimes 2}\pi_{(G)}\pi_{(R)} = 2E[\lambda_i^2]\,\Sigma + \mathrm{var}(\lambda_i)\,\Delta^{\otimes 2} + \big(\mu^{(0,G)} - \mu^{(0,R)}\big)^{\otimes 2}\pi_{(G)}\pi_{(R)}.$$

Hence, with $a_{LD} = \frac{\pi_{(R)}\pi_{(G)}}{2E[\lambda_i^2]}$,

$$b_{LD} = \mathrm{var}(\tilde{x}_i)^{-1}\mathrm{cov}(\tilde{x}_i, \tilde{y}_i) \propto \big(\Sigma + a_{LD}\,\Delta_0^{\otimes 2}\big)^{-1}\Delta = \Sigma^{-1}\Delta - \frac{a_{LD}\,\Delta^T\Sigma^{-1}\Delta_0}{1 + a_{LD}\,\Delta_0^T\Sigma^{-1}\Delta_0}\,\Sigma^{-1}\Delta_0 \propto \Sigma^{-1}\widetilde{\Delta} + c_{LD}\,\Sigma^{-1}\Delta, \quad c_{LD} = \frac{1 + a_{LD}\,\Delta_0^T\Sigma^{-1}\Delta_0 - a_{LD}\,\Delta^T\Sigma^{-1}\Delta_0}{a_{LD}\,\Delta^T\Sigma^{-1}\Delta_0}.$$

Moreover, $b_0 = E[\tilde{y}_i] - E[\tilde{x}_i]^T b = \tfrac12 - E[\tilde{x}_i]^T b$. Notice that

$$E[\tilde{x}_i] = \tfrac14\big(\mu^{(0,G)} + \mu^{(1,G)} + \mu^{(0,R)} + \mu^{(1,R)}\big) = \tfrac14\big(2\mu^{(0,G)} + \Delta + 2\mu^{(1,R)} - \Delta\big) = \tfrac12\big(\mu^{(0,G)} + \mu^{(1,R)}\big) = E[x_i].$$

Method comparison. We only need to compare $\mathcal{E}^{(wst)}_{ERM}$, $\mathcal{E}^{(wst)}_{LL}$, and $\mathcal{E}^{(wst)}_{LD}$. For simplicity, let $\|v\|_\Sigma = \sqrt{v^T\Sigma^{-1}v}$ and $\mathrm{cor}(b, v) = \frac{v^Tb}{\|v\|_\Sigma\sqrt{b^T\Sigma b}}$. For the ERM, $a_0 \leq 2\alpha$ and

$$b_{ERM} \propto \Sigma^{-1}\widetilde{\Delta} - c_0\,\Sigma^{-1}\Delta_0, \qquad c_0 = \frac{a_0\,\widetilde{\Delta}^T\Sigma^{-1}\Delta_0}{1 + a_0\,\Delta_0^T\Sigma^{-1}\Delta_0}.$$

Let $c_1 = |c_0|\,\|\Delta_0\|_\Sigma/\|\widetilde{\Delta}\|_\Sigma$, so that $c_1 \leq C\alpha$. We first lower bound

$$\mathrm{cor}(b_{ERM}, \widetilde{\Delta}) = \frac{\|\widetilde{\Delta}\|^2_\Sigma - c_0\,\Delta_0^T\Sigma^{-1}\widetilde{\Delta}}{\|\widetilde{\Delta}\|_\Sigma\,\|\widetilde{\Delta} - c_0\Delta_0\|_\Sigma} \geq \frac{1 - c_1}{1 + c_1} \geq 1 - C\alpha.$$

Similarly, we have

$$\mathrm{cor}(b_{ERM}, \Delta) = \frac{\xi\,\|\widetilde{\Delta}\|_\Sigma\|\Delta\|_\Sigma - c_0\,\Delta_0^T\Sigma^{-1}\Delta}{\|\Delta\|_\Sigma\,\|\widetilde{\Delta} - c_0\Delta_0\|_\Sigma} \leq \frac{\xi + c_1}{1 - c_1} \leq \xi + C\alpha.$$

Therefore,

$$\mathcal{E}^{(wst)}_{ERM} \geq \max\Big\{\Phi\Big(\big(\tfrac12 - C\alpha\big)\|\widetilde{\Delta}\|_\Sigma - (\xi + C\alpha)\|\Delta\|_\Sigma\Big),\ \Phi\Big(-\big(\tfrac12 + C\alpha\big)\|\widetilde{\Delta}\|_\Sigma\Big)\Big\} \tag{9}$$

for some constant C depending on the true parameters.

For method LISA-L, using the fact that $\Delta_0 = \Delta - \widetilde{\Delta}$, with $a_{LL} = \mathrm{var}(\lambda_i)/(2E[\lambda_i^2])$,

$$b_{LL} \propto \Sigma^{-1}\Delta - \frac{a_{LL}\,\Delta^T\Sigma^{-1}\Delta_0}{1 + a_{LL}\,\Delta_0^T\Sigma^{-1}\Delta_0}\,\Sigma^{-1}\Delta_0 \propto \Sigma^{-1}\widetilde{\Delta} + c_{LL}\,\Sigma^{-1}\Delta, \qquad c_{LL} = \frac{1 + a_{LL}\,\Delta_0^T\Sigma^{-1}\Delta_0 - a_{LL}\,\Delta^T\Sigma^{-1}\Delta_0}{a_{LL}\,\Delta^T\Sigma^{-1}\Delta_0}.$$

Then

$$\mathrm{cor}(b_{LL}, \widetilde{\Delta}) = \frac{\widetilde{\Delta}^Tb_{LL}}{\|\widetilde{\Delta}\|_\Sigma\sqrt{b_{LL}^T\Sigma b_{LL}}} = \frac{\|\widetilde{\Delta}\|_\Sigma + c_{LL}\,\xi\,\|\Delta\|_\Sigma}{\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma}, \qquad \mathrm{cor}(b_{LL}, \Delta) = \frac{\Delta^Tb_{LL}}{\|\Delta\|_\Sigma\sqrt{b_{LL}^T\Sigma b_{LL}}} = \frac{\xi\,\|\widetilde{\Delta}\|_\Sigma + c_{LL}\,\|\Delta\|_\Sigma}{\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma}.$$

To have $\mathcal{E}^{(wst)}_{LL} < \mathcal{E}^{(wst)}_{ERM}$, it suffices to require that

$$-\big(\tfrac12 + C\alpha\big)\|\widetilde{\Delta}\|_\Sigma < \big(\tfrac12 - C\alpha\big)\|\widetilde{\Delta}\|_\Sigma - (\xi + C\alpha)\|\Delta\|_\Sigma$$

and

$$\tfrac12\,\mathrm{cor}(b_{LL}, \widetilde{\Delta})\,\|\widetilde{\Delta}\|_\Sigma - \mathrm{cor}(b_{LL}, \Delta)\,\|\Delta\|_\Sigma \leq \big(\tfrac12 - C\alpha\big)\|\widetilde{\Delta}\|_\Sigma - (\xi + C\alpha)\|\Delta\|_\Sigma, \qquad -\tfrac12\,\mathrm{cor}(b_{LL}, \widetilde{\Delta})\,\|\widetilde{\Delta}\|_\Sigma \leq \big(\tfrac12 - C\alpha\big)\|\widetilde{\Delta}\|_\Sigma - (\xi + C\alpha)\|\Delta\|_\Sigma.$$

A sufficient condition is

$$\tfrac12\,\mathrm{cor}(b_{LL}, \widetilde{\Delta})\,\|\widetilde{\Delta}\|_\Sigma \geq (\xi + C\alpha)\|\Delta\|_\Sigma, \qquad \mathrm{cor}(b_{LL}, \Delta) \geq \xi + C\alpha, \qquad \mathrm{cor}(b_{LL}, \widetilde{\Delta}) \geq 1 - 2C\alpha.$$

We can find that a further sufficient condition is

$$\|\Delta\|_\Sigma \geq C\alpha, \quad c_{LL} > 0, \quad \xi\,\frac{\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma - \|\widetilde{\Delta}\|_\Sigma}{c_{LL}\,\|\Delta\|_\Sigma} \leq 1 - \epsilon_1\alpha, \tag{10}$$
$$\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma \geq \|\widetilde{\Delta}\|_\Sigma, \quad \xi \leq \frac{c_{LL}\,\|\Delta\|_\Sigma}{\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma + \|\widetilde{\Delta}\|_\Sigma} - \epsilon_1\alpha, \tag{11}$$
$$\tfrac12\,\mathrm{cor}(b_{LL}, \widetilde{\Delta})\,\|\widetilde{\Delta}\|_\Sigma \geq (\xi + C\alpha)\,\|\Delta\|_\Sigma. \tag{12}$$

We first find sufficient conditions for the statements in (10) and (11). Parameterizing $t = c_{LL}\,\|\Delta\|_\Sigma/\|\widetilde{\Delta}\|_\Sigma$, so that $\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma = \|\widetilde{\Delta}\|_\Sigma\sqrt{1 + t^2 + 2t\xi}$, we further simplify the conditions in (10) and (11) as

$$\xi < \min\Big\{\frac{\|\widetilde{\Delta}\|_\Sigma}{\|\Delta\|_\Sigma},\ 1\Big\} - C\alpha, \quad t(t + 2\xi) > 0, \quad \frac{\sqrt{1 + t^2 + 2t\xi} - 1}{t} \geq \epsilon_1\alpha, \quad \xi\,\frac{1 + \sqrt{1 + t^2 + 2t\xi}}{t + 2\xi} \leq 1 - \epsilon_1\alpha.$$

We only need to require $t \geq \max\{0, -2\xi\}$ and $\xi \leq \min\{\|\widetilde{\Delta}\|_\Sigma/\|\Delta\|_\Sigma, 1\} - C\alpha$. Some tedious calculation shows that $t \geq \max\{0, -2\xi\}$ can be guaranteed by $a_{LL} \geq \frac{1}{\|\Delta\|^2_\Sigma + \|\Delta\|_\Sigma\|\widetilde{\Delta}\|_\Sigma}$ and $\xi \leq \|\widetilde{\Delta}\|_\Sigma/\|\Delta\|_\Sigma$.

It is left to consider the constraint in (12). Notice that it holds for any $\xi \leq 0$. When $\xi > 0$, we can see

$$\mathrm{cor}(b_{LL}, \widetilde{\Delta}) = \frac{\|\widetilde{\Delta}\|_\Sigma + \xi\,c_{LL}\|\Delta\|_\Sigma}{\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma} = \frac{1 + t\xi}{\sqrt{1 + t^2 + 2t\xi}}.$$

Hence, it suffices to guarantee that $\big(2 - \frac{\|\widetilde{\Delta}\|_\Sigma}{\|\Delta\|_\Sigma}\big)\xi < \frac{\|\widetilde{\Delta}\|_\Sigma}{\|\Delta\|_\Sigma} - C\alpha$. If $\|\widetilde{\Delta}\|_\Sigma/\|\Delta\|_\Sigma \geq 2$, the left-hand side is negative and the inequality holds. If $1 \leq \|\widetilde{\Delta}\|_\Sigma/\|\Delta\|_\Sigma < 2$, the inequality reduces to $\xi < 1 - C\alpha$. If $\|\widetilde{\Delta}\|_\Sigma/\|\Delta\|_\Sigma < 1$, it reduces to $\xi \leq \|\widetilde{\Delta}\|_\Sigma/\|\Delta\|_\Sigma - C\alpha$. Because we have required $\xi < \min\{\|\widetilde{\Delta}\|_\Sigma/\|\Delta\|_\Sigma, 1\} - C\alpha$ for some large enough C, the constraint (12) always holds.

To summarize, $\mathcal{E}^{(wst)}_{LL} < \mathcal{E}^{(wst)}_{ERM}$ given that $\xi \leq \min\{\|\widetilde{\Delta}\|_\Sigma/\|\Delta\|_\Sigma, 1\} - C\alpha$ for some large enough C and $a_{LL} \geq 1/\big(\|\widetilde{\Delta}\|^2_\Sigma + \|\widetilde{\Delta}\|_\Sigma\|\Delta\|_\Sigma\big)$. For method LISA-D, we can similarly show that $\mathcal{E}^{(wst)}_{LD} \leq \mathcal{E}^{(wst)}_{ERM}$ given that $\xi < \min\{\|\widetilde{\Delta}\|_\Sigma/\|\Delta\|_\Sigma, 1\} - C\alpha$ for some large enough C and $a_{LD} \geq 1/\big(\|\widetilde{\Delta}\|^2_\Sigma + \|\widetilde{\Delta}\|_\Sigma\|\Delta\|_\Sigma\big)$.

B.5. Finite sample analysis

The empirical loss can be written as

$$P\Big(1\big\{(x^{(G)}_i)^T\hat{b} + \hat{b}_0 > \tfrac12\big\} \neq y^{(G)}_i\Big) = \tfrac12\,P\Big((x^{(G)}_i)^T\hat{b} + \hat{b}_0 > \tfrac12 \ \Big|\ y^{(G)}_i = 0\Big) + \tfrac12\,P\Big((x^{(G)}_i)^T\hat{b} + \hat{b}_0 < \tfrac12 \ \Big|\ y^{(G)}_i = 1\Big), \tag{13}$$

where

$$P\Big((x^{(G)}_i)^T\hat{b} + \hat{b}_0 > \tfrac12 \ \Big|\ y^{(G)}_i = 0\Big) = \Phi\Big(\frac{(\mu^{(0,G)})^T\hat{b} + \hat{b}_0 - \tfrac12}{\sqrt{\hat{b}^T\Sigma\hat{b}}}\Big), \qquad P\Big((x^{(G)}_i)^T\hat{b} + \hat{b}_0 < \tfrac12 \ \Big|\ y^{(G)}_i = 1\Big) = \Phi\Big(\frac{\tfrac12 - (\mu^{(1,G)})^T\hat{b} - \hat{b}_0}{\sqrt{\hat{b}^T\Sigma\hat{b}}}\Big).$$

First notice that $\hat{b}_0 = \bar{y} - \bar{x}^T\hat{b}$.
Accordingly,

$$(\mu^{(0,G)})^T\hat{b} + \hat{b}_0 = (\mu^{(0,G)} - \bar{x})^T\hat{b} + \bar{y} = (\mu^{(0,G)} - E[x_i])^T\hat{b} + \tfrac12 + \underbrace{(\bar{y} - \bar{x}^T\hat{b}) - (E[y_i] - E[x_i]^T\hat{b})}_{R_1},$$
$$(\mu^{(1,G)})^T\hat{b} + \hat{b}_0 = \Delta^T\hat{b} + (\mu^{(0,G)} - E[x_i])^T\hat{b} + \tfrac12 + R_1.$$

Therefore, according to (13), the empirical loss equals

$$\tfrac12\,\Phi\Big(\frac{(\theta^{(G)})^T\hat{b} + R_1}{\sqrt{\hat{b}^T\Sigma\hat{b}}}\Big) + \tfrac12\,\Phi\Big(\frac{-\Delta^T\hat{b} - (\theta^{(G)})^T\hat{b} - R_1}{\sqrt{\hat{b}^T\Sigma\hat{b}}}\Big) = \tfrac12 - \tfrac12\,\hat{L}(\hat{b}), \quad \hat{L}(\hat{b}) := \Phi\Big(\frac{\Delta^T\hat{b} + (\theta^{(G)})^T\hat{b} + R_1}{\sqrt{\hat{b}^T\Sigma\hat{b}}}\Big) - \Phi\Big(\frac{(\theta^{(G)})^T\hat{b} + R_1}{\sqrt{\hat{b}^T\Sigma\hat{b}}}\Big),$$

so the larger $\hat{L}(\hat{b})$, the smaller the mis-classification error. We first find that

$$|\hat{L}(\hat{b}) - L(b)| \leq C\,\underbrace{\Big|\frac{(\theta^{(G)})^T\hat{b} + R_1}{\sqrt{\hat{b}^T\Sigma\hat{b}}} - \frac{(\theta^{(G)})^Tb}{\sqrt{b^T\Sigma b}}\Big|}_{T_1} + C\,\underbrace{\Big|\frac{\Delta^T\hat{b} + (\theta^{(G)})^T\hat{b} + R_1}{\sqrt{\hat{b}^T\Sigma\hat{b}}} - \frac{\Delta^Tb + (\theta^{(G)})^Tb}{\sqrt{b^T\Sigma b}}\Big|}_{T_2}.$$

On the event that $\|\Sigma^{1/2}(\hat{b} - b)\|_2 = o(1)$, since $\max_{y,d}\|\mu^{(y,d)}\|_2 \leq C$ and $\Sigma$ is positive definite, for the denominator we have

$$|b^T\Sigma b - \hat{b}^T\Sigma\hat{b}| \leq \big(2\|\Sigma^{1/2}b\|_2 + \|\Sigma^{1/2}(\hat{b}-b)\|_2\big)\|\Sigma^{1/2}(\hat{b}-b)\|_2 \leq 2(1+o(1))\,\|\Sigma^{1/2}b\|_2\,\|\Sigma^{1/2}(\hat{b}-b)\|_2,$$
$$\Big|\sqrt{\hat{b}^T\Sigma\hat{b}} - \sqrt{b^T\Sigma b}\Big| \leq \frac{|\hat{b}^T\Sigma\hat{b} - b^T\Sigma b|}{\sqrt{b^T\Sigma b}} \leq 2(1+o(1))\,\|\Sigma^{1/2}(\hat{b}-b)\|_2.$$

For the numerator, using $\theta^{(G)} = -\tfrac12\widetilde{\Delta}$, we have

$$\big|-\tfrac12\widetilde{\Delta}^T\hat{b} + R_1 + \tfrac12\widetilde{\Delta}^Tb\big| \leq |R_1| + \tfrac12\,\|\Sigma^{-1/2}\widetilde{\Delta}\|_2\,\|\Sigma^{1/2}(\hat{b}-b)\|_2.$$

We arrive at

$$T_1 \leq \frac{(1+o(1))\big(|R_1| + \tfrac12\|\Sigma^{-1/2}\widetilde{\Delta}\|_2\,\|\Sigma^{1/2}(\hat{b}-b)\|_2\big)}{\|\Sigma^{1/2}b\|_2} + (1+o(1))\,\frac{|\widetilde{\Delta}^Tb|}{b^T\Sigma b}\,\|\Sigma^{1/2}(\hat{b}-b)\|_2,$$
$$T_2 \leq \frac{(1+o(1))\big(|R_1| + \tfrac12(\|\Sigma^{-1/2}\widetilde{\Delta}\|_2 + \|\Sigma^{-1/2}\Delta\|_2)\,\|\Sigma^{1/2}(\hat{b}-b)\|_2\big)}{\|\Sigma^{1/2}b\|_2} + (1+o(1))\,\frac{|\tfrac12\widetilde{\Delta}^Tb - \Delta^Tb|}{b^T\Sigma b}\,\|\Sigma^{1/2}(\hat{b}-b)\|_2.$$

Moreover, $|R_1| \lesssim \|\hat{b} - b\|_2 + O_P(1/\sqrt{n})$. To summarize,

$$|\hat{L}(\hat{b}) - L(b)| \leq (1+o(1))\,C\,\big(\|\hat{b} - b\|_2 + \tfrac{1}{\sqrt{n}}\big).$$

In the following, we upper bound $\|\hat{b} - b\|_2$ for each method.

For the ERM method, $\hat{b} = \{(X - \bar{X})^T(X - \bar{X})\}^{-1}(X - \bar{X})^T(y - \bar{y})$. It is easy to show that

$$\|\hat{b} - b\|_2^2 = O_P\Big(\frac{p\sum_{i=1}^n\mathrm{var}(y_i \mid x_i)}{n^2}\Big) = O_P\Big(\frac{p}{n}\Big).$$

For the vanilla mixup method, we first see that

$$\frac1n\sum_{i=1}^n\tilde{x}_i = \frac1n\sum_{i=1}^n\big(\lambda_ix_{i_1} + (1-\lambda_i)x_{i_2}\big) = \bar{x} + O_P(n^{-1/2}) = \mu + O_P(n^{-1/2}), \qquad \frac1n\sum_{i=1}^n\tilde{y}_i = \pi_{(1)} + O_P(n^{-1/2}),$$
$$\frac1n\sum_{i=1}^n\tilde{x}_i\tilde{y}_i = \frac1n\sum_{i=1}^n\big\{\lambda_i^2x_{i_1}y_{i_1} + (1-\lambda_i)^2x_{i_2}y_{i_2} + \lambda_i(1-\lambda_i)x_{i_1}y_{i_2} + \lambda_i(1-\lambda_i)x_{i_2}y_{i_1}\big\},$$

and we decompose

$$\frac1n\sum_{i=1}^n\tilde{x}_i\tilde{y}_i - E[\tilde{x}_i\tilde{y}_i] = \underbrace{\frac1n\sum_{i=1}^n\tilde{x}_i\tilde{y}_i - E[\tilde{x}_i\tilde{y}_i \mid X, y]}_{E_1} + \underbrace{E[\tilde{x}_i\tilde{y}_i \mid X, y] - E[\tilde{x}_i\tilde{y}_i]}_{E_2},$$

where the leading part of $E_2$ is $\frac{2E[\lambda_i^2]}{n}\sum_{i=1}^nx_iy_i - 2E[\lambda_i^2]\,E[x_iy_i]$ (with analogous cross terms). Hence $\|E_2\|_2^2 = O_P(p/n)$.

For $E_1$, conditioning on $(X, y)$, the summands $\lambda_i^2x_{i_1}y_{i_1} - \frac{E[\lambda_i^2]}{n}\sum_{i=1}^nx_iy_i$ (and their counterparts) are independent sub-Gaussian vectors. The sub-Gaussian norm of $\frac1n\sum_{i=1}^n\lambda_i^2x_{i_1,j}y_{i_1} - \frac{E[\lambda_i^2]}{n}\sum_{i=1}^nx_{i,j}y_i$ (conditioning on $(X, y)$) can be upper bounded by $c\max_{i\leq n}|x_{i,j}|/\sqrt{n}$. Hence

$$P\big(\|E_1\|_2 \geq t \mid X, y\big) \leq 2p\,\exp\Big\{-\frac{c_2\,n\,t^2}{\max_{j\leq p}\max_{i\leq n}x_{i,j}^2}\Big\}.$$

As the $x_{i,j}$ are Gaussian distributed, we know that $P\big(\max_{j\leq p}\max_{i\leq n}x_{i,j}^2 \geq C\log(pn)\big) \leq \exp\{-c_3\log n\}$. Hence, with probability at least $1 - \exp(-c_1\log n)$, $\|E_1\|_2 \leq C\sqrt{p\log n/n}$. To summarize,

$$\Big\|\frac1n\sum_{i=1}^n\tilde{x}_i\tilde{y}_i - \Big(\frac1n\sum_{i=1}^n\tilde{x}_i\Big)\Big(\frac1n\sum_{i=1}^n\tilde{y}_i\Big) - \mathrm{cov}(\tilde{x}_i, \tilde{y}_i)\Big\|_2 = O_P\Big(\sqrt{\frac{p\log n}{n}}\Big).$$

Similarly, we can show that

$$\Big\|\frac1n\sum_{i=1}^n\tilde{x}_i\tilde{x}_i^T - \Big(\frac1n\sum_{i=1}^n\tilde{x}_i\Big)\Big(\frac1n\sum_{i=1}^n\tilde{x}_i\Big)^T - \mathrm{cov}(\tilde{x}_i)\Big\|_2 = O_P\Big(\sqrt{\frac{p\log n}{n}}\Big),$$

and hence $\|\hat{b} - b\|_2^2 = O_P(p\log n/n)$.

For LISA-L, we first see that

$$\frac1n\sum_{i=1}^nx^{(\lambda)}_i = \frac1n\sum_{i=1}^n\Big[1_{y_i=1}\big(\lambda_ix^{(1,G)}_{i_1} + (1-\lambda_i)x^{(1,R)}_{i_2}\big) + 1_{y_i=0}\big(\lambda_ix^{(0,G)}_{i_1} + (1-\lambda_i)x^{(0,R)}_{i_2}\big)\Big] \approx \tfrac12\big(\bar{x}^{(1,G)} + \bar{x}^{(1,R)}\big)\hat\pi_1 + \tfrac12\big(\bar{x}^{(0,G)} + \bar{x}^{(0,R)}\big)\hat\pi_0,$$

and decompose

$$\frac1n(X^{(\lambda)})^Ty - \bar{y}\cdot\frac1n\sum_{i=1}^nx^{(\lambda)}_i - \mathrm{cov}(x^{(\lambda)}_i, y_i) = \underbrace{\frac1n(X^{(\lambda)})^Ty - \bar{y}\cdot\frac1n\sum_{i=1}^nx^{(\lambda)}_i - \mathrm{cov}(x^{(\lambda)}_i, y_i \mid X, y)}_{E_1} + \underbrace{\mathrm{cov}(x^{(\lambda)}_i, y_i \mid X, y) - \mathrm{cov}(x^{(\lambda)}_i, y_i)}_{E_2}.$$

For $E_2$,

$$E_2 = \tfrac12\big(\bar{x}^{(1,G)} + \bar{x}^{(1,R)}\big)\hat\pi_1 - \Big(\tfrac12\big(\bar{x}^{(1,G)} + \bar{x}^{(1,R)}\big)\hat\pi_1 + \tfrac12\big(\bar{x}^{(0,G)} + \bar{x}^{(0,R)}\big)\hat\pi_0\Big)\hat\pi_1 - \mathrm{cov}(x^{(\lambda)}_i, y_i) = \tfrac12\big(\bar{x}^{(1,G)} + \bar{x}^{(1,R)} - \bar{x}^{(0,G)} - \bar{x}^{(0,R)}\big)\hat\pi_1\hat\pi_0 - \Delta\,\pi_{(1)}\pi_{(0)}.$$

It is easy to show that $\|E_2\|_2^2 = O_P\big(p/\min_{y,e}n_{(y,e)}\big)$. For $E_1$, conditioning on $X$ and $y$, the $x^{(\lambda)}_iy_i - E[x^{(\lambda)}_iy_i \mid X, y]$ are independent sub-Gaussian vectors with mean zero. The sub-Gaussian norm of $\frac1n\sum_{i=1}^nx^{(\lambda)}_{i,j}y_i$ (conditioning on $X$ and $y$) can be upper bounded by $c\max_{i\leq n}|x_{i,j}|/\sqrt{n}$, so that

$$P\big(\|E_1\|_2 \geq t \mid X, y\big) \leq 2p\,\exp\Big\{-\frac{c_2\,n\,t^2}{\max_{j\leq p}\max_{i\leq n}x_{i,j}^2}\Big\}, \qquad \|E_1\|_2 = O_P\Big(\sqrt{\frac{\sum_{j=1}^p\max_{i\leq n}x_{i,j}^2}{n}}\Big) = O_P\Big(\sqrt{\frac{p\log n}{n}}\Big).$$

To summarize,

$$\Big\|\frac1n(X^{(\lambda)})^Ty - E[x^{(\lambda)}_iy_i]\Big\|_2^2 = O_P\Big(\frac{p}{\min_{y,e}n_{(y,e)}} + \frac{p\log n}{n}\Big).$$

We can use a similar analysis to bound $\|\frac1n(X^{(\lambda)})^TX^{(\lambda)} - E[x^{(\lambda)}_i(x^{(\lambda)}_i)^T]\|_2$: the sub-exponential norm of $\frac1n\sum_{i=1}^nx^{(\lambda)}_{i,j}x^{(\lambda)}_{i,k}$ (conditioning on $X$) can be upper bounded by $\max_{i\leq n}|x_{i,j}||x_{i,k}|/\sqrt{n}$, and we can show that

$$\Big\|\frac1n(X^{(\lambda)})^TX^{(\lambda)} - E\big[x^{(\lambda)}_i(x^{(\lambda)}_i)^T\big]\Big\|_2 = O_P\Big(\sqrt{\frac{p}{\min_{y,e}n_{(y,e)}}} + \sqrt{\frac{p\log n}{n}}\Big).$$

For LISA-D, we first see that

$$\frac1n\sum_{i=1}^n\tilde{x}_i = \frac1n\sum_{i=1}^n\Big[1_{d_i=G}\big(\lambda_ix^{(1,G)}_{i_1} + (1-\lambda_i)x^{(0,G)}_{i_2}\big) + 1_{d_i=R}\big(\lambda_ix^{(1,R)}_{i_1} + (1-\lambda_i)x^{(0,R)}_{i_2}\big)\Big] \approx \tfrac12\big(\bar{x}^{(1,G)} + \bar{x}^{(0,G)}\big)\hat\pi_G + \tfrac12\big(\bar{x}^{(1,R)} + \bar{x}^{(0,R)}\big)\hat\pi_R,$$
$$\frac1n\sum_{i=1}^n\tilde{x}_i\tilde{y}_i = \frac1n\sum_{i:\,d_i=G}\big\{\lambda_i^2x^{(1,G)}_{i_1} + \lambda_i(1-\lambda_i)x^{(0,G)}_{i_2}\big\} + \frac1n\sum_{i:\,d_i=R}\big\{\lambda_i^2x^{(1,R)}_{i_1} + \lambda_i(1-\lambda_i)x^{(0,R)}_{i_2}\big\},$$

and decompose

$$\frac1n\sum_{i=1}^n\tilde{x}_i\tilde{y}_i - \bar{\tilde{x}}\,\bar{\tilde{y}} - \mathrm{cov}(\tilde{x}_i, \tilde{y}_i) = \underbrace{\frac1n\sum_{i=1}^n\tilde{x}_i\tilde{y}_i - \bar{\tilde{x}}\,\bar{\tilde{y}} - \mathrm{cov}(\tilde{x}_i, \tilde{y}_i \mid X, y)}_{E_1} + \underbrace{\mathrm{cov}(\tilde{x}_i, \tilde{y}_i \mid X, y) - \mathrm{cov}(\tilde{x}_i, \tilde{y}_i)}_{E_2},$$

where

$$E_2 = \hat\pi_{(G)}\Big(E[\lambda_i^2]\big(\bar{x}^{(1,G)} - \bar{x}^{(0,G)}\big) + \tfrac12\bar{x}^{(0,G)}\Big) + \hat\pi_{(R)}\Big(E[\lambda_i^2]\big(\bar{x}^{(1,R)} - \bar{x}^{(0,R)}\big) + \tfrac12\bar{x}^{(0,R)}\Big) - \tfrac14\big(\bar{x}^{(1,G)} + \bar{x}^{(0,G)}\big)\hat\pi_G - \tfrac14\big(\bar{x}^{(1,R)} + \bar{x}^{(0,R)}\big)\hat\pi_R - \mathrm{var}(\lambda_i)\,\Delta$$
$$= \hat\pi_{(G)}\,\mathrm{var}(\lambda_i)\big(\bar{x}^{(1,G)} - \bar{x}^{(0,G)}\big) + \hat\pi_{(R)}\,\mathrm{var}(\lambda_i)\big(\bar{x}^{(1,R)} - \bar{x}^{(0,R)}\big) - \mathrm{var}(\lambda_i)\,\Delta.$$
Notice that $E_2$ is a sub-Gaussian vector with sub-Gaussian norm upper bounded by

$$\sqrt{\frac{\hat\pi_G^2}{n_{(1,G)}} + \frac{\hat\pi_G^2}{n_{(0,G)}} + \frac{\hat\pi_R^2}{n_{(1,R)}} + \frac{\hat\pi_R^2}{n_{(0,R)}}} \leq \sqrt{\frac{4}{n}\max_{y,d}\frac{\pi_d}{\pi_{y|d}}}.$$

Using sub-Gaussian concentration, we can show that

$$\|E_2\|_2 = O_P\Big(\sqrt{\frac{p}{n}\max_{y,d}\frac{\pi_d}{\pi_{y|d}}}\Big).$$

Notice that $\max_{y,d}\pi_d/\pi_{y|d} \geq 1$. For $E_1$, conditioning on $X$ and $y$, the $\tilde{x}_i\tilde{y}_i - E[\tilde{x}_i\tilde{y}_i \mid X, y]$ are independent sub-Gaussian vectors with mean zero. The sub-Gaussian norm of $\frac1n\sum_{i=1}^n\tilde{x}_{i,j}\tilde{y}_i$ (conditioning on $X$ and $y$) can be upper bounded by $c\max_{i,j}|x_{i,j}|/\sqrt{n}$. A similar analysis on $E_1$ leads to

$$\Big\|\frac1n\sum_{i=1}^n\tilde{x}_i\tilde{y}_i - \bar{\tilde{x}}\,\bar{\tilde{y}} - \mathrm{cov}(\tilde{x}_i, \tilde{y}_i)\Big\|_2 = O_P\Big(\sqrt{\frac{p\log n}{n}\max_{y,d}\frac{\pi_d}{\pi_{y|d}}}\Big).$$

For the sample covariance matrix, we can also show that

$$\Big\|\frac1n\sum_{i=1}^n\tilde{x}_i\tilde{x}_i^T - \Big(\frac1n\sum_{i=1}^n\tilde{x}_i\Big)\Big(\frac1n\sum_{i=1}^n\tilde{x}_i\Big)^T - \mathrm{cov}(\tilde{x}_i)\Big\|_2 = O_P\Big(\sqrt{\frac{p\log n}{n}\max_{y,d}\frac{\pi_d}{\pi_{y|d}}}\Big).$$

B.6. A ξ-dependent lower bound for $\mathcal{E}^{(wst)}_{ERM} - \mathcal{E}^{(wst)}_{LL}$

Next, we provide a ξ-dependent lower bound for $\mathcal{E}^{(wst)}_{ERM} - \mathcal{E}^{(wst)}_{LL}$. Based on our previous analysis,

$$\mathcal{E}^{(wst)}_{ERM} - \mathcal{E}^{(wst)}_{LL} \geq c_1\min\Big\{\tfrac12\big(1 - \mathrm{cor}(b_{LL}, \widetilde{\Delta})\big)\|\widetilde{\Delta}\|_\Sigma + \big(\mathrm{cor}(b_{LL}, \Delta) - \xi - C\alpha\big)\|\Delta\|_\Sigma,\ \tfrac12\,\mathrm{cor}(b_{LL}, \widetilde{\Delta})\,\|\widetilde{\Delta}\|_\Sigma - (\xi + C\alpha)\|\Delta\|_\Sigma\Big\},$$

where $c_1$ is a positive constant given by the derivative of $\Phi(\cdot)$. Plugging in the expressions of $\mathrm{cor}(b_{LL}, \widetilde{\Delta})$ and $\mathrm{cor}(b_{LL}, \Delta)$, the first term of the minimum is no smaller than

$$\tfrac12\Big(1 - 2C\alpha - \frac{1 + t\xi}{\sqrt{1 + t^2 + 2t\xi}}\Big)\|\widetilde{\Delta}\|_\Sigma + \Big(\frac{\xi + t}{\sqrt{1 + t^2 + 2t\xi}} - \xi - C\alpha\Big)\|\Delta\|_\Sigma \geq \frac{t^2}{2(1+t)^2}(1 - \xi^2)\,\|\widetilde{\Delta}\|_\Sigma + \frac{t^2}{(1+t)^2}(1 - \xi)\,\|\Delta\|_\Sigma - C\alpha\big(\|\widetilde{\Delta}\|_\Sigma + \|\Delta\|_\Sigma\big),$$

where the last step is due to the current constraint that $t > \max\{0, -2\xi\}$. The second term is no smaller than $\tfrac12\|\widetilde{\Delta}\|_\Sigma - \xi\|\Delta\|_\Sigma - C\alpha\big(\|\widetilde{\Delta}\|_\Sigma + \|\Delta\|_\Sigma\big)$. Notice that $t^2/(1+t)^2 \geq \min\{t^2/4,\ 1/4\}$. We can then show that, if $t \geq \|\Delta\|_\Sigma/\|\widetilde{\Delta}\|_\Sigma$,

$$\mathcal{E}^{(wst)}_{ERM} - \mathcal{E}^{(wst)}_{LL} \geq c_3\min\Big\{\Big(\frac{\|\widetilde{\Delta}\|_\Sigma}{\|\Delta\|_\Sigma} - \xi\Big)\|\Delta\|_\Sigma,\ (1 - \xi)\|\Delta\|_\Sigma,\ (1 - \xi)\frac{\|\Delta\|^2_\Sigma}{\|\widetilde{\Delta}\|_\Sigma}\Big\} - c_4\,\alpha\big(\|\widetilde{\Delta}\|_\Sigma + \|\Delta\|_\Sigma\big).$$

B.7. Domain shifts: Proof of Theorem 2

It still holds that $-\tfrac12\widetilde{\Delta}^* = \mu^{(0,*)} - E[x^{(\lambda)}_i] = \mu^{(0,*)} - E[\tilde{x}_i]$, i.e., the analogue of $\theta^{(G)} = -\tfrac12\widetilde{\Delta}$ continues to hold in the new domain. It is easy to show that the worst-group mis-classification error in this new environment is

$$\mathcal{E}^{(wst,*)}_A = \max\Big\{\Phi\Big(\frac{-\tfrac12(\widetilde{\Delta}^*)^Tb_A}{\sqrt{b_A^T\Sigma b_A}}\Big),\ \Phi\Big(\frac{\tfrac12(\widetilde{\Delta}^*)^Tb_A - \Delta^Tb_A}{\sqrt{b_A^T\Sigma b_A}}\Big)\Big\},$$

where $A \in \{ERM, mix, LL, LD\}$. Notice that

$$\widetilde{\Delta}^* = \big(\mu^{(0,G)} + \mu^{(1,R)}\big) - 2\mu^{(0,*)} = \widetilde{\Delta} + 2\big(\mu^{(0,G)} - \mu^{(0,*)}\big).$$

We assume $\|\widetilde{\Delta}^*\|_2 = \|\widetilde{\Delta}\|_2$. Let $\xi^* = \mathrm{cor}(\Delta, \widetilde{\Delta}^*)$ and $\gamma = \mathrm{cor}(\widetilde{\Delta}, \widetilde{\Delta}^*)$. We have

$$\mathrm{cor}(b_{ERM}, \widetilde{\Delta}^*) = \frac{\gamma\,\|\widetilde{\Delta}\|_\Sigma\|\widetilde{\Delta}^*\|_\Sigma - c_0\,(\widetilde{\Delta}^*)^T\Sigma^{-1}\Delta_0}{\|\widetilde{\Delta}^*\|_\Sigma\,\|\widetilde{\Delta} - c_0\Delta_0\|_\Sigma} \geq \frac{\gamma\,\|\widetilde{\Delta}\|_\Sigma - |c_0|\,\|\Delta_0\|_\Sigma}{\|\widetilde{\Delta}\|_\Sigma + |c_0|\,\|\Delta_0\|_\Sigma} \geq \gamma - C\alpha,$$

and hence

$$\mathcal{E}^{(wst,*)}_{ERM} \geq \max\Big\{\Phi\Big(\big(\tfrac{\gamma}{2} - C\alpha\big)\|\widetilde{\Delta}\|_\Sigma - (\xi^* + C\alpha)\|\Delta\|_\Sigma\Big),\ \Phi\Big(-\big(\tfrac{\gamma}{2} + C\alpha\big)\|\widetilde{\Delta}\|_\Sigma\Big)\Big\} \tag{16}$$

for some constant C depending on the true parameters. Moreover,

$$\mathrm{cor}(b_{LL}, \widetilde{\Delta}^*) = \frac{(\widetilde{\Delta}^*)^Tb_{LL}}{\|\widetilde{\Delta}^*\|_\Sigma\sqrt{b_{LL}^T\Sigma b_{LL}}} = \frac{\gamma\,\|\widetilde{\Delta}\|_\Sigma + c_{LL}\,\xi^*\,\|\Delta\|_\Sigma}{\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma}.$$

To have $\mathcal{E}^{(wst,*)}_{LL} < \mathcal{E}^{(wst,*)}_{ERM}$, it suffices to require that

$$-\big(\tfrac{\gamma}{2} + C\alpha\big)\|\widetilde{\Delta}\|_\Sigma < \big(\tfrac{\gamma}{2} - C\alpha\big)\|\widetilde{\Delta}\|_\Sigma - (\xi^* + C\alpha)\|\Delta\|_\Sigma$$

and

$$\tfrac12\,\mathrm{cor}(b_{LL}, \widetilde{\Delta}^*)\,\|\widetilde{\Delta}\|_\Sigma - \mathrm{cor}(b_{LL}, \Delta)\,\|\Delta\|_\Sigma \leq \big(\tfrac{\gamma}{2} - C\alpha\big)\|\widetilde{\Delta}\|_\Sigma - (\xi^* + C\alpha)\|\Delta\|_\Sigma, \qquad -\tfrac12\,\mathrm{cor}(b_{LL}, \widetilde{\Delta}^*)\,\|\widetilde{\Delta}\|_\Sigma \leq \big(\tfrac{\gamma}{2} - C\alpha\big)\|\widetilde{\Delta}\|_\Sigma - (\xi^* + C\alpha)\|\Delta\|_\Sigma.$$

A sufficient condition is

$$\tfrac12\,\mathrm{cor}(b_{LL}, \widetilde{\Delta}^*)\,\|\widetilde{\Delta}\|_\Sigma \geq (\xi^* + C\alpha)\|\Delta\|_\Sigma, \qquad \mathrm{cor}(b_{LL}, \Delta) \geq \xi^* + C\alpha, \qquad \mathrm{cor}(b_{LL}, \widetilde{\Delta}^*) \geq \gamma - 2C\alpha.$$

We can find that a further sufficient condition is

$$\|\Delta\|_\Sigma \geq C\alpha, \quad c_{LL} > 0, \quad \xi^* \leq \gamma\,\frac{\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma - \|\widetilde{\Delta}\|_\Sigma}{c_{LL}\,\|\Delta\|_\Sigma} - \epsilon_1\alpha, \tag{17}$$
$$\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma \geq \|\widetilde{\Delta}\|_\Sigma, \quad \xi^* \leq \frac{c_{LL}\,\|\Delta\|_\Sigma}{\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma + \|\widetilde{\Delta}\|_\Sigma} - \epsilon_1\alpha, \tag{18}$$
$$\tfrac12\,\mathrm{cor}(b_{LL}, \widetilde{\Delta}^*)\,\|\widetilde{\Delta}\|_\Sigma \geq (\xi^* + C\alpha)\,\|\Delta\|_\Sigma. \tag{19}$$

We first find sufficient conditions for the statements in (17) and (18). Parameterizing $t = c_{LL}\,\|\Delta\|_\Sigma/\|\widetilde{\Delta}\|_\Sigma$, we further simplify the conditions in (17) and (18) as

$$t > 0, \qquad \xi^* \leq \gamma\,\frac{\sqrt{1 + t^2 + 2t\xi} - 1}{t} - \epsilon_1\alpha, \qquad \xi^* \leq \frac{t}{1 + \sqrt{1 + t^2 + 2t\xi}} - \epsilon_1\alpha.$$

We only need to require $t \geq \max\{0, -2\xi\}$, $\xi^* < \min\big\{\tfrac{1+\gamma}{2}\,\tfrac{\|\widetilde{\Delta}\|_\Sigma}{\|\Delta\|_\Sigma},\ 1\big\} - C\alpha$, and $\xi^* \leq \gamma\,\xi$. Some tedious calculation shows that $t \geq \max\{0, -2\xi\}$ can be guaranteed by $a_{LL} \geq \frac{1}{\|\Delta\|^2_\Sigma + \|\Delta\|_\Sigma\|\widetilde{\Delta}\|_\Sigma}$ and $\xi \leq \|\widetilde{\Delta}\|_\Sigma/\|\Delta\|_\Sigma$. It is left to consider the constraint in (19). Notice that it holds for any $\xi^* \leq 0$.
When $\xi^* > 0$, we can see

$$\mathrm{cor}(b_{LL}, \widetilde{\Delta}^*) = \frac{\gamma\,\|\widetilde{\Delta}\|_\Sigma + \xi^*\,c_{LL}\,\|\Delta\|_\Sigma}{\|\widetilde{\Delta} + c_{LL}\Delta\|_\Sigma} = \frac{\gamma + t\,\xi^*}{\sqrt{1 + t^2 + 2t\xi}}.$$

Hence, it suffices to guarantee that $\tfrac{\gamma}{2}\,\tfrac{\|\widetilde{\Delta}\|_\Sigma}{\|\Delta\|_\Sigma} \geq \xi^* + C\alpha$. To summarize, it suffices to require

$$a_{LL} \geq \frac{1}{\|\Delta\|^2_\Sigma + \|\Delta\|_\Sigma\|\widetilde{\Delta}\|_\Sigma}, \qquad 0 \leq \xi^* \leq \gamma\,\xi, \qquad \xi^* < \min\Big\{\frac{\gamma}{2}\cdot\frac{\|\widetilde{\Delta}\|_\Sigma}{\|\Delta\|_\Sigma},\ \gamma\Big\} - C\alpha.$$

For LISA-D, we can similarly show that $\mathcal{E}^{(wst,*)}_{LD} < \mathcal{E}^{(wst,*)}_{ERM}$ given that

$$a_{LD} \geq \frac{1}{\|\Delta\|^2_\Sigma + \|\Delta\|_\Sigma\|\widetilde{\Delta}\|_\Sigma}, \qquad 0 \leq \xi^* \leq \gamma\,\xi, \qquad \xi^* < \min\Big\{\frac{\gamma}{2}\cdot\frac{\|\widetilde{\Delta}\|_\Sigma}{\|\Delta\|_\Sigma},\ \gamma\Big\} - C\alpha,$$

which completes the proof of Theorem 2.