# Out-of-Domain Robustness via Targeted Augmentations

Irena Gao*¹, Shiori Sagawa*¹, Pang Wei Koh²³, Tatsunori Hashimoto¹, Percy Liang¹

*Equal contribution. ¹Stanford University, ²University of Washington, ³Google Brain. Correspondence to: Irena Gao, Shiori Sagawa. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

**Abstract.** Models trained on one set of domains often suffer performance drops on unseen domains, e.g., when wildlife monitoring models are deployed in new camera locations. In this work, we study principles for designing data augmentations for out-of-domain (OOD) generalization. In particular, we focus on real-world scenarios in which some domain-dependent features are robust, i.e., some features that vary across domains are predictive OOD. For example, in the wildlife monitoring application above, image backgrounds vary across camera locations but indicate habitat type, which helps predict the species of photographed animals. Motivated by theoretical analysis in a linear setting, we propose targeted augmentations, which selectively randomize spurious domain-dependent features while preserving robust ones. We prove that targeted augmentations improve OOD performance, allowing models to generalize better with fewer domains. In contrast, existing approaches such as generic augmentations, which fail to randomize domain-dependent features, and domain-invariant augmentations, which randomize all domain-dependent features, both perform poorly OOD. In experiments on three real-world datasets, we show that targeted augmentations set new state-of-the-art OOD performance by 3.2–15.2%.

## 1. Introduction

Real-world machine learning systems are often deployed on domains unseen during training. However, distribution shifts between domains can substantially degrade model performance. For example, in wildlife conservation, where ecologists use machine learning to identify animals photographed by static camera traps, models suffer large performance drops on cameras not included during training (Beery et al., 2018). Out-of-domain (OOD) generalization in such settings remains an open challenge, with recent work showing that current methods do not perform well (Gulrajani & Lopez-Paz, 2020; Koh et al., 2021).

One approach to improving robustness is data augmentation, but how to design augmentations for OOD robustness remains an open question. Training with generic augmentations developed for in-domain (ID) performance (e.g., random crops and rotations) has sometimes improved OOD performance, but gains are often small and inconsistent across datasets (Gulrajani & Lopez-Paz, 2020; Wiles et al., 2021; Hendrycks et al., 2021). Other work has designed augmentations to encourage domain invariance, but gains can be limited, especially on real-world shifts (Yan et al., 2020; Zhou et al., 2020a; Gulrajani & Lopez-Paz, 2020; Ilse et al., 2021; Yao et al., 2022). Some applied works have shown that heuristic, application-specific augmentations can improve OOD performance on specific tasks (Tellez et al., 2018; 2019; Ruifrok et al., 2001). However, it is unclear what makes these augmentations successful or how to generalize the approach to other OOD problems.

In this work, we study principles for designing data augmentations for OOD robustness.
We focus on real-world scenarios in which some domain-dependent features are robust, i.e., where some features that vary across domains are predictive out-of-domain. For example, in the wildlife monitoring application above, image backgrounds vary across cameras but also contain features that divulge the static camera's habitat (e.g., savanna, forest, etc.). This information is predictive across all domains, as wild animals only live in certain habitats; it can also be necessary for prediction when foreground features are insufficient (e.g., when animals are blurred or obscured).

How might data augmentations improve OOD robustness in such settings? We first theoretically analyze a linear regression setting and show that unaugmented models incur high OOD risk when the OOD generalization problem is underspecified, i.e., when there are fewer training domains than the dimensionality of the domain-dependent features. This insight motivates targeted augmentations, which selectively randomize spurious domain-dependent features while preserving robust ones, reducing the effective dimensionality and bringing the problem to a fully specified regime. We prove that targeted augmentations improve OOD risk in expectation, allowing us to generalize with fewer domains. In contrast, existing approaches such as generic augmentations, which fail to randomize domain-dependent features, and domain-invariant augmentations, which randomize all domain-dependent features, both suffer high OOD risk: the former fails to address the underspecification issue, and the latter eliminates robust domain-dependent features that are crucial for prediction.

To our knowledge, our analysis is the first to characterize how different augmentation strategies affect OOD risk and its scaling with the number of domains. It also introduces a natural theoretical setting for OOD generalization, in which the distribution shift arises from sampling finite training domains, departing from prior work that considers worst-case shifts (Rosenfeld et al., 2020; Chen et al., 2021b).

Empirically, we show targeted augmentations are effective on three real-world datasets spanning biomedical and wildlife monitoring applications: CAMELYON17-WILDS (Bandi et al., 2018; Koh et al., 2021), IWILDCAM2020-WILDS (Beery et al., 2021; Koh et al., 2021), and BIRDCALLS, which we curate from ornithology datasets (Navine et al., 2022; Hopping et al., 2022; Kahl et al., 2022). Targeted augmentations outperform both generic augmentations and domain invariance baselines to achieve state-of-the-art results by substantial margins: 33.3% → 36.5% on IWILDCAM2020-WILDS, 75.3% → 90.5% on CAMELYON17-WILDS, and 31.8% → 37.8% on BIRDCALLS. Overall, our work derives principles for designing data augmentations that can substantially improve out-of-domain performance in the wild.

## 2. Problem setting

**Domain generalization.** In domain generalization, our goal is to generalize to domains unseen during training. In particular, we seek a model $\theta \in \Theta$ that minimizes the OOD risk under a meta distribution $P$, where

$$R^{\text{OOD}}(\theta) \triangleq \mathbb{E}_{P}\left[\ell(\theta; (x, y))\right], \tag{1}$$

and $P$ comprises data from all possible domains $\mathcal{D}_{\text{all}}$:

$$P(x, y) = \sum_{d \in \mathcal{D}_{\text{all}}} P(x, y \mid d)\, P(d), \tag{2}$$

where we assume $\mathcal{D}_{\text{all}}$ is countable to keep notation simple. To obtain training domains $\mathcal{D}_{\text{train}} \subseteq \mathcal{D}_{\text{all}}$, we sample $D$ domains without replacement from the meta distribution $P$.
This yields the training distribution comprising $\mathcal{D}_{\text{train}}$,

$$P^{\text{train}}(x, y) = \sum_{d \in \mathcal{D}_{\text{train}}} P(x, y \mid d)\, P^{\text{train}}(d), \tag{3}$$

where $P^{\text{train}}(d)$ is the probability of drawing domain $d$ from the training domains $\mathcal{D}_{\text{train}}$ at training time. The challenge is to generalize from the sampled training domains $\mathcal{D}_{\text{train}}$ to all possible domains $\mathcal{D}_{\text{all}}$ that make up the underlying meta distribution. In real-world experiments and simulations, we estimate OOD performance by evaluating on held-out domains $\mathcal{D}_{\text{test}}$, where $\mathcal{D}_{\text{test}} \cap \mathcal{D}_{\text{train}} = \emptyset$.

**Feature decomposition.** In many real-world shifts, such as those in Section 2.1, domain-dependent features contain predictive information that generalizes across all domains. To capture such settings, we introduce the feature decomposition $x = f(x_{\text{obj}}, x_{\text{noise}}, x_{\text{d:robust}}, x_{\text{d:spu}})$ (Figure 1, left). Here, features are split along two axes: whether they are robust (i.e., predictive out-of-domain), and whether they are domain-dependent (i.e., varying across domains). We formalize these two criteria by (in)dependence with the label $y$ and the domain $d$, respectively, in the meta distribution $P$:

$$x_{\text{obj}}, x_{\text{d:robust}} \not\perp y \qquad x_{\text{noise}}, x_{\text{d:spu}} \perp y \qquad x_{\text{d:robust}}, x_{\text{d:spu}} \not\perp d \qquad x_{\text{obj}}, x_{\text{noise}} \perp d. \tag{4}$$

For example, $y$ depends on the robust features $x_{\text{obj}}$ and $x_{\text{d:robust}}$, but is independent of the non-robust features $x_{\text{noise}}$ and $x_{\text{d:spu}}$, which yields $P(y \mid x) = P(y \mid x_{\text{obj}}, x_{\text{d:robust}})$. We note that these independencies need not hold in the training distribution $P^{\text{train}}$ due to finite-domain effects; for instance, when $D$ is small, there may be a dependence between the label $y$ and a spurious feature $x_{\text{d:spu}}$ in the training distribution $P^{\text{train}}$, leading models to learn such features and generalize poorly out-of-domain.
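To make this setup concrete, below is a minimal synthetic sketch of the generative story, mirroring the linear setting that Section 4.1 analyzes. The Gaussian parameter values are illustrative, and the mixing function $f$ is assumed, for simplicity, to be concatenation of the four feature blocks; none of this code is from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p_obj, p_noise, p_robust, p_spu = 5, 5, 5, 20
tau, sigma = 1.0, 0.3  # domain-attribute scale vs. within-domain feature noise

def sample_domain():
    # Each domain is summarized by latent attributes; the robust part also drives y.
    return {"mu_robust": rng.normal(0, tau, p_robust),
            "mu_spu": rng.normal(0, tau, p_spu)}

beta_obj = rng.normal(size=p_obj)
beta_robust = rng.normal(size=p_robust)

def sample_example(domain):
    x_obj = rng.normal(size=p_obj)
    x_noise = rng.normal(size=p_noise)
    x_robust = domain["mu_robust"] + rng.normal(0, sigma, p_robust)
    x_spu = domain["mu_spu"] + rng.normal(0, sigma, p_spu)
    # Only x_obj and x_d:robust (via mu_robust) carry signal about y.
    y = x_obj @ beta_obj + domain["mu_robust"] @ beta_robust + rng.normal(0, 0.1)
    x = np.concatenate([x_obj, x_noise, x_robust, x_spu])  # f = concatenation
    return x, y

train_domains = [sample_domain() for _ in range(10)]    # D sampled training domains
test_domains = [sample_domain() for _ in range(100)]    # held-out domains for OOD eval
```

With few training domains, the pairing of `mu_robust` and `mu_spu` within each domain induces exactly the finite-domain dependence between $y$ and $x_{\text{d:spu}}$ described above.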
### 2.1. Real-world datasets

We study three real-world datasets (Figure 1, right), which have both robust and spurious domain-dependent features.

**Species classification from camera trap images (IWILDCAM2020-WILDS).** In iWildCam (Beery et al., 2021; Koh et al., 2021), the task is to classify an animal species $y$ from an image $x$ captured by a static camera trap $d$. There are 243 cameras in $\mathcal{D}_{\text{train}}$. Images from the same camera share nearly identical backgrounds. While low-level details of each domain's background are generally spurious (e.g., whether there are two trees or three), backgrounds also contain habitat features, which are predictive across domains. For example, in Figure 1, cameras 23 and 97 are installed in dry Kenyan savannas, while camera 54 observes a leafy Guatemalan forest. The two regions have different label distributions: in practice, wild African elephants are very unlikely to set foot in Guatemala. Further, habitat features are often necessary for prediction; foregrounds are often blurry or occluded (see Figure 8), so randomizing all domain-dependent features discards useful information.

| Feature | iWildCam2020-WILDS (D=243 cameras) | Camelyon17-WILDS (D=3 hospitals) | BirdCalls (D=9 microphones) |
|---|---|---|---|
| $x_{\text{obj}}$ (label-dependent, domain-independent) | animal foreground | cell morphology | bird calls |
| $x_{\text{d:spu}}$ (label-independent, domain-dependent) | low-level background features | stain color | microphone gain settings, low-level noise |
| $x_{\text{d:robust}}$ (label- and domain-dependent) | habitat features in background | cancer stage, tumor size and density | habitat noise (other fauna, rain levels) |
| $x_{\text{noise}}$ (label- and domain-independent) | xy-position | patch orientation, cell xy-positions | x-position |
| Targeted augmentation | Copy-Paste (Same Y) | Stain Color Jitter (Tellez et al., 2018) | Copy-Paste + Jitter (Region) |

Figure 1. We model inputs as $x = f(x_{\text{obj}}, x_{\text{d:robust}}, x_{\text{d:spu}}, x_{\text{noise}})$, where each of the four types of features is either (i) dependent on the domain $d$ or not and (ii) dependent on the output label $y$ or not, both in the meta distribution $P$. We study targeted augmentations, which randomize $x_{\text{d:spu}}$ but preserve $x_{\text{d:robust}}$, and we consider three real-world datasets (Beery et al., 2021; Bandi et al., 2018; Koh et al., 2021), each of which has both robust and spurious domain-dependent features.

**Tumor identification in histopathology slides (CAMELYON17-WILDS).** In Camelyon17 (Bandi et al., 2018; Koh et al., 2021), the task is to classify whether a patch of a histopathology slide contains a tumor. Slides are contributed by hospitals $d$. Variations in imaging technique result in domain-specific stain colorings, which spuriously correlate with $y$ in the training set (see Figure 6). Domains also vary in distributions of patient cancer stage. In Camelyon17's 3 training hospitals, most patients in Hospitals 1 and 2 have earlier-stage pN1 breast cancer, whereas nearly half of the patients in Hospital 3 have later-stage pN2 cancer. The pN stage relates to the size and number of lymph node metastases, which is correlated with other histological tumor features. These useful tumor features thus depend on both $d$ and $y$.

**Bird species recognition from audio recordings (BIRDCALLS).** To monitor bird populations, ornithologists use machine learning to identify birds by their calls in audio recordings. However, generalizing to recordings from new microphones can be challenging (Joly et al., 2021). We introduce a new bird recognition dataset curated from publicly released data (see Appendix A.3 for details). The task is to identify the bird species $y$ vocalizing in audio clip $x$ recorded by microphone $d$. There are 9 microphones in $\mathcal{D}_{\text{train}}$, which vary in their model and location. While low-level noise and microphone settings (e.g., gain levels) only spuriously correlate with $y$, other background noises indicate habitat, like particular insect calls in the Amazon Basin that are absent from other regions (Figure 1). As in iWildCam, these habitat indicators reliably predict $y$. We train models on mel-spectrograms of audio clips.

## 3. Data augmentation

**Augmentation types.** We use the feature decomposition from Section 2 to model three types of data augmentations. Generic augmentations designed for in-domain settings often do not randomize domain-dependent features. For example, horizontal flips modify object orientation; this feature varies across examples but is typically distributed similarly across domains. We model generic augmentations as varying $x_{\text{noise}}$, which is label- and domain-independent:

$$A_{\text{gen}}(x) = f(x_{\text{obj}}, x'_{\text{noise}}, x_{\text{d:robust}}, x_{\text{d:spu}}), \tag{5}$$

where $x'_{\text{noise}}$ is drawn from some augmentation distribution. Domain-invariant augmentations $A_{\text{inv}}$ aim to randomize all domain-dependent features $x_{\text{d:robust}}$ and $x_{\text{d:spu}}$:

$$A_{\text{inv}}(x) = f(x_{\text{obj}}, x_{\text{noise}}, x'_{\text{d:robust}}, x'_{\text{d:spu}}), \tag{6}$$

where $x'_{\text{d:robust}}, x'_{\text{d:spu}}$ are drawn from some distribution. Finally, targeted augmentations $A_{\text{tgt}}$ preserve $x_{\text{d:robust}}$ while aiming to randomize $x_{\text{d:spu}}$:

$$A_{\text{tgt}}(x) = f(x_{\text{obj}}, x_{\text{noise}}, x_{\text{d:robust}}, x'_{\text{d:spu}}), \tag{7}$$

where $x'_{\text{d:spu}}$ is drawn from some distribution. Applying generic, domain-invariant, and targeted augmentations to the training distribution $P^{\text{train}}$ yields new distributions over examples $P^{\text{train}}_{\text{gen}}$, $P^{\text{train}}_{\text{inv}}$, and $P^{\text{train}}_{\text{tgt}}$, respectively. Intuitively, when augmentations preserve labels, they break any dependence between the randomized features and the label $y$.
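Continuing the synthetic sketch from Section 2, the three augmentation families reduce to resampling different blocks of $x$. The helper names below are our own, and, as in Section 4.1, the augmentation distribution is assumed to match the marginal feature distribution:

```python
def split(x):
    # Recover the four blocks from the concatenated input (inverse of f above).
    i1, i2, i3 = p_obj, p_obj + p_noise, p_obj + p_noise + p_robust
    return x[:i1], x[i1:i2], x[i2:i3], x[i3:]

def a_gen(x):
    # Generic augmentation (Eq. 5): resample only x_noise.
    x_obj, _, x_robust, x_spu = split(x)
    return np.concatenate([x_obj, rng.normal(size=p_noise), x_robust, x_spu])

def a_inv(x):
    # Domain-invariant augmentation (Eq. 6): resample both domain-dependent blocks.
    x_obj, x_noise, _, _ = split(x)
    scale = np.sqrt(sigma**2 + tau**2)  # marginal std of x_d:robust and x_d:spu
    return np.concatenate([x_obj, x_noise,
                           rng.normal(0, scale, p_robust),
                           rng.normal(0, scale, p_spu)])

def a_tgt(x):
    # Targeted augmentation (Eq. 7): resample x_d:spu, preserve x_d:robust.
    x_obj, x_noise, x_robust, _ = split(x)
    scale = np.sqrt(sigma**2 + tau**2)
    return np.concatenate([x_obj, x_noise, x_robust, rng.normal(0, scale, p_spu)])
```

In all three cases the label $y$ is left unchanged, so an augmentation severs any training-time dependence between the resampled block and $y$.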
**Training.** Given $N$ training examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ drawn from $P^{\text{train}}$, we learn a model that minimizes the average loss on the (augmented) training data:

$$\hat{\theta}^{(\text{unaug})} = \arg\min_{\theta} \mathbb{E}_{\hat{P}^{\text{train}}}\left[\ell(\theta; (x, y))\right] \tag{8}$$

$$\hat{\theta}^{(\text{aug})} = \arg\min_{\theta} \mathbb{E}_{\hat{P}^{\text{train}}_{\text{aug}}}\left[\ell(\theta; (x, y))\right], \tag{9}$$

where $\hat{P}^{\text{train}}$ and $\hat{P}^{\text{train}}_{\text{aug}}$ are the empirical distributions over the unaugmented and augmented training data, respectively. The superscript aug can stand for gen, inv, or tgt.

Figure 2. Augmentation examples for the three real-world datasets (iWildCam, Camelyon17, BirdCalls). Generic augmentations (RandAugment, MixUp, CutMix, Cutout for the image datasets; SpecAugment, MixUp, noise reduction, and random pass filters for audio) are designed for in-distribution performance and often only randomize $x_{\text{noise}}$; domain-invariant augmentations randomize both $x_{\text{d:robust}}$ and $x_{\text{d:spu}}$; targeted augmentations, Copy-Paste (Same Y) for iWildCam, Stain Color Jitter for Camelyon17, and Copy-Paste + Jitter (Region) for BirdCalls, randomize $x_{\text{d:spu}}$ but preserve $x_{\text{d:robust}}$. In Section 5.1, we compare to modified Copy-Paste augmentations in the ablation column.

### 3.1. Targeted augmentations for real-world datasets

We instantiate targeted augmentations on the real-world datasets from Section 2.1. Full details are in Appendix B.

**Species classification from camera trap images (IWILDCAM2020-WILDS).** In iWildCam, image backgrounds are domain-dependent features with both spurious and robust components. While low-level background features are spurious, habitat features are robust. Copy-Paste (Same Y) transforms input $(x, y)$ by pasting the animal foreground onto a random training set background, but only onto backgrounds from training cameras that also observe $y$ (Figure 2). This randomizes low-level background features while roughly preserving habitat. We use segmentation masks from Beery et al. (2021).

**Tumor identification in histopathology slides (CAMELYON17-WILDS).** In Camelyon17, stain color is a spurious domain-dependent feature, while stage-related features are robust domain-dependent features. Stain Color Jitter (Tellez et al., 2018) transforms $x$ by jittering its color in the hematoxylin and eosin staining color space (Figure 2). In contrast, domain-invariant augmentations can distort cell morphology to attain invariance.

**Bird species recognition from audio recordings (BIRDCALLS).** In BirdCalls, low-level noise and gain levels are spurious domain-dependent features, while habitat-specific noise is a robust domain-dependent feature. Copy-Paste + Jitter (Region) leverages time-frequency bounding boxes to paste bird calls onto other training set recordings from the same geographic region (Southwestern Amazon Basin, Hawai'i, or Northeastern United States) (Figure 2). After pasting the bird call, we also jitter hue levels of the spectrogram to simulate randomizing microphone gain settings.
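As one plausible implementation of Stain Color Jitter, the sketch below follows the per-channel affine jitter scheme of Tellez et al. (2018) using scikit-image's H&E color deconvolution; the jitter strength `sigma` is illustrative rather than the value used in the paper's experiments.

```python
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def stain_color_jitter(img, sigma=0.05, rng=np.random.default_rng()):
    """Jitter an H&E patch in the hematoxylin-eosin-DAB stain space.

    img is a float RGB array in [0, 1]. Each stain channel s is mapped to
    alpha * s + beta with alpha ~ U(1 - sigma, 1 + sigma), beta ~ U(-sigma, sigma),
    which perturbs stain color while leaving cell morphology intact.
    """
    hed = rgb2hed(img)                        # color deconvolution to stain space
    alpha = rng.uniform(1 - sigma, 1 + sigma, size=3)
    beta = rng.uniform(-sigma, sigma, size=3)
    jittered = hed * alpha + beta             # per-channel affine jitter
    return np.clip(hed2rgb(jittered), 0, 1)   # back to RGB
```

Because the transform acts only in stain space, it randomizes the spurious feature (stain color) without distorting the robust, morphology-related features.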
## 4. Analysis and simulations

We now motivate targeted augmentations and illustrate the shortcomings of generic and domain-invariant augmentations by analyzing a linear setting extended from Section 2. To our knowledge, our analysis is the first to characterize how different augmentation strategies affect OOD risk and its scaling with the number of domains. It also proposes a natural theoretical setting for OOD generalization, in which the distribution shift arises from finite-domain effects, departing from prior work that considers worst-case shifts (Rosenfeld et al., 2020; Chen et al., 2021b).

### 4.1. Linear regression setting

**Data distribution.** We model each domain $d$ as having latent attributes $\mu^{(d)} \triangleq [\mu^{(d)}_{\text{robust}}, \mu^{(d)}_{\text{spu}}]$, which affect the distribution of the corresponding domain-dependent features $x_{\text{d:robust}}, x_{\text{d:spu}}$. In iWildCam, $\mu^{(d)}_{\text{robust}}$ intuitively corresponds to a habitat indicator and label prior. In the linear setting, these domain attributes are drawn as

$$\mu^{(d)}_{\text{robust}} \sim \mathcal{N}(0, \tau^2 I) \qquad \mu^{(d)}_{\text{spu}} \sim \mathcal{N}(0, \tau^2 I). \tag{10}$$

The dimensionality of $\mu^{(d)}$ is $p_{\text{dom}}$, and the dimensionality of $\mu^{(d)}_{\text{robust}}$ is $p_{\text{robust}}$. Following the feature decomposition in Figure 1, we consider inputs $x = [x_{\text{obj}}, x_{\text{noise}}, x_{\text{d:robust}}, x_{\text{d:spu}}]$. The training data is drawn uniformly from $D$ training domains. Within each domain, inputs $x$ are drawn according to the following distribution:

$$x_{\text{obj}} \sim \mathcal{N}(0, I) \qquad x_{\text{noise}} \sim \mathcal{N}(0, I) \qquad x_{\text{d:robust}} \mid d \sim \mathcal{N}(\mu^{(d)}_{\text{robust}}, \sigma^2 I) \qquad x_{\text{d:spu}} \mid d \sim \mathcal{N}(\mu^{(d)}_{\text{spu}}, \sigma^2 I). \tag{11}$$

The domain-dependent features $x_{\text{d:robust}}$ and $x_{\text{d:spu}}$ are centered around the corresponding domain attributes $\mu^{(d)}_{\text{robust}}$ and $\mu^{(d)}_{\text{spu}}$, while the domain-independent features $x_{\text{obj}}$ and $x_{\text{noise}}$ are not. We define the variance ratio $\gamma^2 \triangleq \tau^2 / \sigma^2$, the ratio between the variance of the domain attributes $\mu^{(d)}$ and the feature noise. When $\gamma^2 > 1$, examples within a domain tend to be more similar to each other than to examples from other domains; we consider the typical setting in which $\gamma^2 > 1$. The output $y \in \mathbb{R}$ is a linear function of both $x_{\text{obj}}$ and the robust domain attribute $\mu^{(d)}_{\text{robust}}$:

$$y = \beta_{\text{obj}}^{\top} x_{\text{obj}} + \beta_{\text{robust}}^{\top} \mu^{(d)}_{\text{robust}} + \mathcal{N}(0, \sigma^2_{\varepsilon}). \tag{12}$$

For convenience, we define the parameters for the domain-dependent components as $\beta_{\text{dom}} \triangleq [\beta_{\text{robust}}, \beta_{\text{spu}}]$, where $\beta_{\text{spu}} = 0$. Although $y$ depends on the domain attributes $\mu^{(d)}$, models cannot directly observe $\mu^{(d)}$, and instead only observe the noised features $x_{\text{d:robust}}, x_{\text{d:spu}}$. Because there are finitely many domains in the training distribution, $\mu^{(d)}_{\text{robust}}$ and $\mu^{(d)}_{\text{spu}}$ are coupled: models can infer $\mu^{(d)}_{\text{robust}}$ not only from $x_{\text{d:robust}}$, but also from $x_{\text{d:spu}}$ by memorizing the $(\mu^{(d)}_{\text{robust}}, \mu^{(d)}_{\text{spu}})$ pairings. The fewer domains present during training, the simpler memorization is for the model, as there are fewer $(\mu^{(d)}_{\text{robust}}, \mu^{(d)}_{\text{spu}})$ pairings. However, since $\mu^{(d)}_{\text{robust}}$ and $\mu^{(d)}_{\text{spu}}$ are independent in the true data generating process, relying on $x_{\text{d:spu}}$ does not generalize OOD.

**Augmentations.** Recall from Section 3 that generic, domain-invariant, and targeted augmentations replace components of $x$ with draws from an augmentation distribution. We preserve $y$ when augmenting and fix the augmentation distributions to match the data generating distribution:

$$x'_{\text{noise}} \sim \mathcal{N}(0, I) \qquad x'_{\text{d:robust}} \sim \mathcal{N}(0, (\sigma^2 + \tau^2) I) \qquad x'_{\text{d:spu}} \sim \mathcal{N}(0, (\sigma^2 + \tau^2) I).$$

**Models.** We study linear models, specifically ordinary least squares linear regression in theoretical analysis (Section 4.2) and ridge regression in simulations (Section 4.3).

### 4.2. Theory

In this section, we first show that unaugmented models fail to generalize OOD when the domain generalization problem is underspecified (Theorem 1), i.e., when there are fewer training domains than the dimensionality of the domain-dependent features, as is typically the case in real-world domain generalization problems. This motivates targeted augmentations; by eliminating spurious domain-dependent features, targeted augmentations bring the problem to a fully specified regime. We prove that targeted augmentations improve OOD risk in expectation (Theorems 2 and 3), whereas generic and domain-invariant augmentations incur high OOD risk (Corollary 1 and Theorem 4). Our analysis assumes infinite data per domain, but finite training domains. This allows us to focus on the effects of OOD generalization while simplifying traditional sample complexity issues, which are better understood.
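As a worked step connecting the setting to the estimators quoted in the proof sketches below (this unrolling is ours, under the stated infinite-data-per-domain assumption): restricted to the domain-dependent coordinates $x_{\text{dom}} \triangleq [x_{\text{d:robust}}, x_{\text{d:spu}}]$, the population least-squares solution solves the normal equations

$$\mathbb{E}\big[x_{\text{dom}} x_{\text{dom}}^{\top}\big]\,\theta_{\text{dom}} = \mathbb{E}\big[x_{\text{dom}}\, y\big], \quad \text{where} \quad \mathbb{E}\big[x_{\text{dom}} x_{\text{dom}}^{\top}\big] = \sigma^2 I + M, \quad M \triangleq \frac{1}{D}\sum_{d=1}^{D} \mu^{(d)} \mu^{(d)\top}.$$

Since $y$ depends on $x_{\text{dom}}$ only through $\mu^{(d)}$ and $\beta_{\text{spu}} = 0$, we have $\mathbb{E}[x_{\text{dom}}\, y] = \frac{1}{D}\sum_{d=1}^{D} \mu^{(d)} \mu^{(d)\top} \beta_{\text{dom}} = M \beta_{\text{dom}}$, which yields the estimator $\hat{\theta}^{(\text{unaug})}_{\text{dom}} = (\sigma^2 I + M)^{-1} M \beta_{\text{dom}}$ that appears in the proof of Theorem 1 below.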
**Overview.** We study the expected excess OOD risk $\mathbb{E}\left[R^{\text{OOD}}(\hat{\theta}) - R^{\text{OOD}}(\theta^{\star})\right]$, where the expectation is over random draws of training domains, and $\theta^{\star} \in \arg\min_{\theta} R^{\text{OOD}}(\theta)$ is the oracle model that attains optimal performance on the meta distribution $P$. To show that targeted augmentations improve the expected OOD risk, we lower bound the expected excess risk for unaugmented models, upper bound it for models with targeted augmentations, and then demonstrate a gap between the two bounds. Proofs are in Appendix C.

**Lower bound for excess OOD risk with no or generic augmentations.** When the number of domains is smaller than the dimensionality of the domain-dependent features ($D < p_{\text{dom}}$), unaugmented models perform poorly OOD.

**Theorem 1** (Excess OOD risk without augmentations). *If $D < p_{\text{dom}}$, the expected excess OOD risk of the unaugmented model is bounded below as*

$$\mathbb{E}\left[R^{\text{OOD}}(\hat{\theta}^{(\text{unaug})}) - R^{\text{OOD}}(\theta^{\star})\right] \geq \frac{\tau^2 \gamma^2 \|\beta_{\text{robust}}\|^2}{1 + \gamma^2}\left(1 - \frac{D}{p_{\text{dom}}}\right).$$

*Proof sketch.* The learned estimator has weights $\hat{\theta}^{(\text{unaug})}_{\text{dom}} = (\sigma^2 I + M)^{-1} M \beta_{\text{dom}}$, where $M \triangleq \frac{1}{D}\sum_{d=1}^{D} \mu^{(d)} \mu^{(d)\top}$ is a random Wishart matrix. As we only observe $D < p_{\text{dom}}$ training domains, $M$ is not full rank, with nullity $p_{\text{dom}} - D$. We lower bound the overall excess risk by the excess risk incurred in the null space of $M$, which can be written as $\frac{\tau^2 \gamma^2}{1 + \gamma^2} \sum_{i=1}^{p_{\text{dom}} - D} (u_i^{\top} \beta_{\text{dom}})^2$; each $u_i$ is an eigenvector with a zero eigenvalue, and the summation term is thus the squared norm of a projection of $\beta_{\text{dom}}$ onto the null space of $M$. In expectation, the squared norm is $\|\beta_{\text{dom}}\|^2 (1 - \frac{D}{p_{\text{dom}}})$ because $M$ has spherically symmetric eigenvectors. Finally, $\|\beta_{\text{dom}}\| = \|\beta_{\text{robust}}\|$ because $\beta_{\text{spu}} = 0$.

To contextualize the bound, we discuss the relative scale of the excess OOD risk with respect to the OOD risk of the oracle model $R^{\text{OOD}}(\theta^{\star}) = \sigma^2_{\varepsilon} + \tau^2 \|\beta_{\text{robust}}\|^2 / (1 + \gamma^2)$, where the first term is the irreducible error from noise in the output $y$. The excess error of the unaugmented model is higher than the second term by a factor of $\gamma^2 (1 - D/p_{\text{dom}})$, where $\gamma^2 > 1$ is the variance ratio and $D$ is the number of domains. Thus, in typical settings where $D$ is small relative to $p_{\text{dom}}$ and the variance ratio $\gamma^2$ is large, unaugmented models suffer substantial OOD error. Models trained with generic augmentations have the same lower bound (Corollary 1 in Appendix C.4), as applying generic augmentations results in the same model as unaugmented training in the infinite-data setting. Our analysis captures the shortcomings of generic augmentations, which primarily improve sample complexity; as evident in the high OOD risk even in the infinite-data setting, improving sample complexity alone fails to achieve OOD robustness.

**Motivating targeted augmentations.** The core problem above is underspecification, in which the number of domains is smaller than the dimensionality of the domain-dependent features ($D < p_{\text{dom}}$); there are fewer instances of $\mu^{(d)}$ than its dimensionality (although $\mathbb{E}[x x^{\top}]$ is full rank due to feature noise). In such regimes, it is not possible to approximate $\beta_{\text{dom}}$ well, and models incur high OOD risk. We can mitigate this via targeted augmentations, which randomize the spurious domain-dependent features. This decreases the effective dimensionality from $p_{\text{dom}}$ to $p_{\text{robust}}$, the dimensionality of only the robust components, as models would no longer use the spurious feature.
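To see the rank deficiency concretely, the short NumPy check below (our own illustration, not from the paper) builds $M$ from $D < p_{\text{dom}}$ random domain attributes and verifies that the estimator above recovers nothing of the component of $\beta_{\text{dom}}$ lying in the null space of $M$:

```python
import numpy as np

rng = np.random.default_rng(0)
p_dom, D, tau, sigma = 50, 10, 1.0, 0.3   # underspecified: D < p_dom

mu = rng.normal(0, tau, size=(D, p_dom))  # D domain attribute vectors
M = mu.T @ mu / D                         # random Wishart matrix
print(np.linalg.matrix_rank(M))           # D = 10, so nullity is p_dom - D = 40

beta_dom = rng.normal(size=p_dom)
theta = np.linalg.solve(sigma**2 * np.eye(p_dom) + M, M @ beta_dom)

# Project beta_dom onto the null space of M: this component is invisible
# during training, and theta contains none of it.
U, s, _ = np.linalg.svd(M)
null_basis = U[:, D:]                     # eigenvectors with zero eigenvalue
lost = null_basis.T @ beta_dom
print(np.linalg.norm(lost)**2 / np.linalg.norm(beta_dom)**2)  # ~ 1 - D/p_dom
print(np.linalg.norm(null_basis.T @ theta))                   # ~ 0
```

The fraction of $\|\beta_{\text{dom}}\|^2$ lost to the null space concentrates around $1 - D/p_{\text{dom}}$, matching the lower bound in Theorem 1.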
**Upper bound for excess OOD risk with targeted augmentations.** With targeted augmentations, the problem (even without feature noise) is no longer underspecified when the number of training domains $D$ is large enough relative to $p_{\text{robust}} < p_{\text{dom}}$. In this fully specified regime, we can upper bound the expected excess OOD risk as $O(\log D / D)$. This resembles the standard rates for random design linear regression up to a log factor (Hsu et al., 2011; Györfi et al., 2002); standard analysis shows that excess ID risk has a $O(1/N)$ convergence rate, where $N$ is the number of samples, and we show that excess OOD risk has an analogous convergence rate as a function of the number of domains instead of examples.

**Theorem 2** (Excess OOD risk with targeted augmentations). *Assume $\gamma^2 > 1$. For any $0 < r < 1$ and large enough $D$ such that $D > 2(p_{\text{robust}} + 2)\log(4 D p_{\text{robust}})/(1 - r)^2$, the excess OOD risk is bounded above as*

$$\mathbb{E}\left[R^{\text{OOD}}(\hat{\theta}^{(\text{tgt})}) - R^{\text{OOD}}(\theta^{\star})\right] \leq \tau^2 \gamma^2 \|\beta_{\text{robust}}\|^2 \left(\frac{1}{D} + \frac{2\log(4 D p_{\text{robust}})(p_{\text{robust}} + 2)}{D(1 + \gamma^2 r)^2}\right).$$

*Proof sketch.* The learned estimator has weights $\hat{\theta}^{(\text{tgt})}_{\text{spu}} = 0$ and $\hat{\theta}^{(\text{tgt})}_{\text{robust}} = (\sigma^2 I + M_{\text{robust}})^{-1} M_{\text{robust}} \beta_{\text{robust}}$, where $M_{\text{robust}} \triangleq \frac{1}{D}\sum_{d=1}^{D} \mu^{(d)}_{\text{robust}} \mu^{(d)\top}_{\text{robust}}$ is a random Wishart matrix. The excess risk can be written as $\sum_{i=1}^{p_{\text{robust}}} \frac{\sigma^4 (\tau^2 - \lambda_i)^2}{(\sigma^2 + \tau^2)(\lambda_i + \sigma^2)^2} (u_i^{\top} \beta_{\text{robust}})^2$, where $\lambda_i$ and $u_i$ are eigenvalues and eigenvectors of $M_{\text{robust}}$, respectively. Note that this excess risk is low when $D$ is sufficiently large relative to $p_{\text{robust}}$ such that the eigenvalues are sufficiently close to their expected value $\tau^2$. We upper bound the excess OOD risk by applying concentration of measure arguments from Zhu (2012) to the eigenvalues of $M_{\text{robust}}$.

Compared to the lower bound for unaugmented models (Theorem 1), this upper bound has qualitatively different behavior. It depends on $p_{\text{robust}}$ instead of $p_{\text{dom}}$, and it converges to 0 at a fast rate of $O(\log D / D)$, whereas the lower bound decays only linearly in the number of domains $D$.

**Targeted augmentations improve expected OOD risk.** We now combine the lower and upper bounds to show that targeted augmentations improve expected OOD risk.

**Theorem 3** (Targeted augmentations improve OOD risk). *If $\gamma^2 > 1$ and $p_{\text{robust}}$ is small relative to $p_{\text{dom}}$ such that*

$$p_{\text{robust}} < \frac{p_{\text{dom}}}{\log(2 p_{\text{dom}})} \cdot \frac{1}{4\left(1 + \gamma^4/(\gamma^2 - 1)^2\right)},$$

*then for $D$ such that*

$$4\left(1 + \frac{\gamma^4}{(\gamma^2 - 1)^2}\right)(p_{\text{robust}} + 2)\log(2 p_{\text{dom}}) \leq D < \frac{p_{\text{dom}}}{4(p_{\text{robust}} + 2)\log(2 p_{\text{dom}})},$$

*the improvement in expected OOD risk is positive:*

$$\mathbb{E}\left[R^{\text{OOD}}(\hat{\theta}^{(\text{unaug})}) - R^{\text{OOD}}(\hat{\theta}^{(\text{tgt})})\right] > 0.$$

As expected, the minimum and maximum number of domains for which there is a provable gap are proportional to $p_{\text{robust}}$ and $p_{\text{dom}}$, respectively. However, there is some looseness in the bound; in simulations (Section 4.3), we see a substantial gap consistent with the above result, including for $D$ outside the proven range.

**Domain-invariant augmentations incur high OOD error.** Finally, we show that domain-invariant augmentations incur high OOD risk in expectation.

**Theorem 4** (OOD error with domain-invariant augmentations). *For all $D$, the expected excess OOD risk is*

$$\mathbb{E}\left[R^{\text{OOD}}(\hat{\theta}^{(\text{inv})}) - R^{\text{OOD}}(\theta^{\star})\right] = \frac{\tau^2 \gamma^2 \|\beta_{\text{robust}}\|^2}{1 + \gamma^2}.$$

Because domain-invariant augmentations randomize all domain-dependent features, models do not use any domain-dependent features, including the robust components that are crucial for prediction. As a result, the expected OOD risk is high (higher than the lower bound for unaugmented models in Theorem 1), and the error does not decay with the number of domains $D$.
Figure 3. Targeted augmentations (red line) improve OOD error substantially, while generic (orange) or unaugmented (blue) models require many training domains to attain low OOD error. Domain-invariant augmentations (green line) have constant high error. We plot OOD test RMSE against the number of training domains $D$ in a high-sample regime ($N = 100{,}000$) and a low-sample regime ($N = 5{,}000$), with standard errors over 10 random seeds. We also plot the risk bounds from Section 4.2 (the Theorem 1 unaugmented lower bound and the Theorem 2 targeted upper bound) for the high-sample regime; because the bounds assume infinite data, we do not plot them for the low-sample case. The plotted Theorem 2 bound is a more general version (Appendix C.5).

### 4.3. Simulations

The analysis in Section 4.2 assumes infinite data per domain. We now present simulation results with finite data in a high-sample ($N = 100{,}000$) and low-sample ($N = 5{,}000$) regime, where $N$ is the total number of examples across all domains. We fix $\gamma^2 = 10$, $p_{\text{robust}} = 5$, and $p_{\text{spu}} = 500$. Additional details and results are in Appendix D.

**High-sample regime ($N = 100{,}000$).** In Figure 3 (left), we plot OOD RMSE against the number of training domains $D$, together with our upper bound for targeted augmentations (a more general version of Theorem 2 in Appendix C) and lower bound for unaugmented training (Theorem 1). We observe the trends suggested by our theory. When $D$ is small, the unaugmented model (blue) has high OOD error, and as $D$ increases, OOD error slowly decays. Training with generic augmentation (orange) does not improve over unaugmented training. In contrast, training with targeted augmentation (red) significantly reduces OOD error. There is a substantial gap between the red and orange/blue lines, which persists even when $D$ is outside of the window guaranteed by Theorem 3. Finally, domain-invariant augmentations result in high OOD error (green) that does not decrease with increasing domains, as in Theorem 4.

**Low-sample regime ($N = 5{,}000$).** In Figure 3 (right), we plot OOD RMSE against the number of training domains $D$ when the sample size is small. The unaugmented and targeted models follow the same trends as in the high-sample regime. However, in the low-sample regime, generic augmentation does reduce OOD error compared to the unaugmented model. When the total number of examples $N$ is small, models are incentivized to memorize individual examples using $x_{\text{noise}}$. Generic augmentation prevents this behavior, resulting in an ID and OOD improvement over unaugmented training (also see Figure 11 in Appendix D). However, the OOD error of generic augmentation decays only slowly with $D$ and is significantly higher than targeted augmentation for $D < 1000$. Domain-invariant augmentation results in a constant level of OOD error, which improves over the unaugmented and generic models for small values of $D$, but underperforms once $D$ is larger. Overall, our simulations corroborate the theory and show that targeted augmentations offer significant OOD gains in the linear regression setting. In contrast, generic and domain-invariant augmentations improve over unaugmented training only in the low-sample regime.
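A compact version of this simulation can be written directly from the setup in Section 4.1. The sketch below is our own reconstruction, with ridge regression via scikit-learn, $x_{\text{obj}}$ and $x_{\text{noise}}$ omitted for brevity, and illustrative hyperparameters; it compares unaugmented and targeted training at a single value of $D$:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
p_robust, p_spu = 5, 500
tau, sigma = 1.0, np.sqrt(0.1)          # gamma^2 = tau^2 / sigma^2 = 10
beta_robust = rng.normal(size=p_robust)

def sample_domains(D, n, targeted=False):
    # Draw n examples from each of D freshly sampled domains.
    X, y = [], []
    for _ in range(D):
        mu_r = rng.normal(0, tau, p_robust)
        mu_s = rng.normal(0, tau, p_spu)
        xr = mu_r + rng.normal(0, sigma, (n, p_robust))
        xs = mu_s + rng.normal(0, sigma, (n, p_spu))
        if targeted:
            # Targeted augmentation: replace x_d:spu with draws from its marginal.
            xs = rng.normal(0, np.sqrt(tau**2 + sigma**2), (n, p_spu))
        X.append(np.hstack([xr, xs]))
        y.append(np.full(n, mu_r @ beta_robust) + rng.normal(0, 0.1, n))
    return np.vstack(X), np.concatenate(y)

D, n = 50, 100                           # D << p_spu: underspecified without targeting
for targeted in (False, True):
    Xtr, ytr = sample_domains(D, n, targeted)
    model = Ridge(alpha=1.0).fit(Xtr, ytr)
    Xte, yte = sample_domains(200, 20)   # fresh domains = OOD evaluation
    rmse = np.sqrt(np.mean((model.predict(Xte) - yte) ** 2))
    print("targeted" if targeted else "unaugmented", round(rmse, 3))
```

Sweeping `D` and plotting the two RMSE curves reproduces the qualitative gap between the red and blue lines in Figure 3.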
## 5. Experiments on real-world datasets

We return to the real-world datasets (IWILDCAM2020-WILDS, CAMELYON17-WILDS, BIRDCALLS) and augmentations introduced in Section 2.1, where we compare targeted augmentations to unaugmented training, generic augmentations, and domain invariance baselines.

**Generic augmentations.** On the image datasets iWildCam and Camelyon17, we compare to RandAugment (Cubuk et al., 2020), CutMix (Yun et al., 2019), MixUp (Zhang et al., 2017), and Cutout (DeVries & Taylor, 2017). On the audio dataset BirdCalls, we compare to MixUp, SpecAugment (Park et al., 2019), random low/high pass filters, and noise reduction via spectral gating (Sainburg, 2022). Since the targeted augmentation for BirdCalls (Copy-Paste + Jitter (Region)) includes color jitter as a subroutine, we also include a baseline of augmenting with only color jitter.

**Domain invariance baselines.** We compare to LISA (Yao et al., 2022), a data augmentation strategy that aims to encourage domain invariance by applying either MixUp or CutMix to inputs of the same class across domains. We also compare to other domain invariance algorithms that do not involve augmentation: (C)DANN (Long et al., 2018; Ganin et al., 2016), Deep CORAL (Sun & Saenko, 2016; Sun et al., 2017), and IRM (Arjovsky et al., 2019).

Samples of the augmentations are shown in Figure 2. Additional experimental details can be found in Appendix E.2. Code and BIRDCALLS are released at this link.

### 5.1. Results

Figure 4 plots the average ID versus OOD performance of each method. On all three datasets, targeted augmentations significantly improve OOD performance. Compared to the best-performing baseline, targeted augmentations improve OOD Macro F1 on iWildCam from 33.3% → 36.5%, OOD average accuracy on Camelyon17 from 75.3% → 90.5%, and OOD Macro F1 on BirdCalls from 31.8% → 37.8%.

Figure 4. We plot the in-domain (ID) performance of methods against their out-of-domain (OOD) performance on iWildCam2020-WILDS (ID Test vs. OOD Test Macro F1, with the Miller et al. (2021) linear fit), Camelyon17-WILDS (ID Val vs. OOD Test average accuracy), and BirdCalls (ID Test vs. OOD Test Macro F1). Error bars are standard errors over replicates. Targeted augmentations significantly improve OOD performance over the nearest baseline, improving OOD Macro F1 on iWildCam from 33.3% → 36.5%, OOD average accuracy on Camelyon17 from 75.3% → 90.5%, and OOD Macro F1 on BirdCalls from 31.8% → 37.8%. Tables and additional details can be found in Appendix E.

On iWildCam and Camelyon17, which are part of the WILDS benchmark, these targeted augmentations set new state-of-the-art performances (Koh et al., 2021).¹ Several generic augmentations were also able to improve OOD performance, although by smaller amounts than targeted augmentations; this matches our simulations in the low-sample regime in Section 4.3. RandAugment (Cubuk et al., 2020) performs strongly on iWildCam and Camelyon17, and both noise reduction and random high/low pass filters perform well on BirdCalls.
Some generic augmentations degraded performance (MixUp, CutMix, and SpecAugment), which may reflect the fact that these augmentations can also distort $x_{\text{obj}}$ and $x_{\text{d:robust}}$, e.g., by mixing cell morphologies on Camelyon17.

**Effective robustness.** On iWildCam, Miller et al. (2021) showed that the ID and OOD performances of models across a range of sizes are linearly correlated; we plot their linear fit on Figure 4 (left). We found that our targeted augmentation Copy-Paste (Same Y) confers what Miller et al. (2021) termed effective robustness, which is represented in the plot by a vertical offset from the line. In contrast, generic augmentations improve OOD performance along the plotted line. While the domain invariance methods also show effective robustness, they mostly underperform the unaugmented model in raw performance numbers. Although neither Camelyon17 nor BirdCalls have associated linear fits, we observe similar trends in Figure 4, with targeted augmentations offering significant OOD gains even at similar ID performances as other methods.

¹BirdCalls is a new dataset, so targeted augmentations are state-of-the-art against the baselines reported here.

Table 1. Randomizing habitat features in IWILDCAM2020-WILDS and BIRDCALLS degrades performance.

| Dataset | Method | ID Test Macro F1 | OOD Test Macro F1 |
|---|---|---|---|
| iWildCam | Unaugmented | 46.5 (0.4) | 30.2 (0.3) |
| | Copy-Paste (All Backgrounds) | 47.1 (1.1) | 34.7 (0.5) |
| | Copy-Paste (Same Y) | 50.2 (0.7) | 36.5 (0.4) |
| BirdCalls | Unaugmented | 70.0 (0.5) | 27.8 (1.2) |
| | Copy-Paste + Jitter (All Regions) | 76.0 (0.3) | 33.7 (1.0) |
| | Copy-Paste + Jitter (Same Region) | 75.6 (0.3) | 37.8 (1.0) |

Table 2. Finetuning CLIP ViT-L/14 with targeted augmentations improves OOD performance on CAMELYON17-WILDS (accuracy) and IWILDCAM2020-WILDS (Macro F1). Results are averaged over 5 seeds with standard errors.

| Dataset | Method | ID Performance | OOD Performance |
|---|---|---|---|
| Camelyon17 | Unaugmented | 99.5 (0.0) | 96.0 (0.2) |
| | Stain Color Jitter | 99.4 (0.0) | 97.1 (0.0) |
| iWildCam | Unaugmented | 55.6 (0.8) | 43.5 (0.7) |
| | Copy-Paste (Same Y) | 56.6 (0.7) | 45.5 (0.3) |

**Ablation on $x_{\text{d:robust}}$.** To further demonstrate the utility of preserving $x_{\text{d:robust}}$, we ran ablations on the targeted augmentations for iWildCam and BirdCalls, which both preserve habitat features. On iWildCam, Copy-Paste (Same Y) selectively pastes animal foregrounds onto backgrounds from domains that also observe $y$ in the training set; as an ablation, we studied Copy-Paste (All Backgrounds), which draws backgrounds from all training domains, including cameras in which $y$ was not observed. Similarly, on BirdCalls, Copy-Paste + Jitter (Region) only pastes calls onto recordings from the original microphone's region; as an ablation, we studied Copy-Paste + Jitter (All Regions), which merges recordings indiscriminately. In Table 1, we see that preserving habitat features is useful: randomizing this feature, as in our ablations, decreases OOD performance by 1.8% on iWildCam and 4.1% on BirdCalls.

**Targeted augmentations improve OOD performance when finetuning CLIP.** We also applied our targeted augmentations to CLIP ViT-L/14 (Radford et al., 2021), a large-scale vision-language model (Table 2). Targeted augmentations offer 1.1% and 2% average OOD gains over unaugmented finetuning on Camelyon17 and iWildCam, respectively.

## 6. Related work

**Data augmentations for OOD robustness.**
Prior work has shown that generic augmentations designed for ID performance can improve OOD performance, but this effect is inconsistent across datasets (Gulrajani & Lopez-Paz, 2020; Hendrycks et al., 2021; Wiles et al., 2021). Other work has sought to design augmentations specifically for robustness (Puli et al., 2022; Wang et al., 2022). Many augmentations are inspired by domain invariance and aim to randomize all domain-dependent features, including $x_{\text{d:robust}}$. For example, inter-domain MixUp interpolates inputs from different domains, possibly within the same class (Wang et al., 2020; Xu et al., 2020; Yan et al., 2020; Yao et al., 2022). Ilse et al. (2021) propose to select transformations which maximally confuse a domain classifier. Several works train generative models to transform images between domains by learning to modify all domain-dependent features (Hoffman et al., 2018; Zhou et al., 2020b; Robey et al., 2021). In contrast, we preserve $x_{\text{d:robust}}$ in targeted augmentations.

**Analysis on data augmentations and domain generalization.** Existing work usually analyzes augmentations in the standard i.i.d. setting (Dao et al., 2019; He et al., 2019; Chen et al., 2020; Lyle et al., 2020), where augmentations improve sample complexity and reduce variance. We instead analyze the effect of data augmentation on OOD performance. There is limited theoretical work in this setting: Ilse et al. (2021) use augmentations to simulate interventions on domains, and Wang et al. (2022) show that one can recover a causal model given a set of augmentations encoding the relevant invariances. These works are part of a broader thread of analysis which emphasizes robustness to worst-case domain shifts; the aim is thus to recover models that only rely on causal features. In contrast, we seek to generalize to unseen domains on average. Our analysis on generalization to a meta-distribution is related to work on meta-learning (Chen et al., 2021a; Jose & Simeone, 2021); however, these analyses focus on adaptation to new tasks instead of out-of-domain generalization.

**Failures of domain invariance.** To improve OOD robustness, the domain invariance literature focuses on learning models which are invariant to domain-dependent features, such that representations are independent of domain either marginally (Ganin et al., 2016; Albuquerque et al., 2019) or conditioned on $y$ (Long et al., 2018; Zhao et al., 2019). Several works have pointed out failure modes of domain invariance (Zhao et al., 2019; Johansson et al., 2019; Akuzawa et al., 2020), such as when label distributions vary across domains. Mahajan et al. (2021) focus on cases where the distribution of causal features varies across domains; we additionally allow for $x_{\text{d:robust}}$ to be non-causal, such as habitat features in iWildCam and BirdCalls.

**Targeted augmentations in application literature.** Many existing domain-specific augmentations fit the proposed framework of targeted augmentations. For example, Stain Color Jitter is sourced from the biomedical literature and was designed for OOD robustness (Tellez et al., 2018; 2019; Miller et al., 2021). Copy-Paste (non-selective) has been previously applied to a smaller, single-habitat camera trap dataset (Beery et al., 2020). Our contribution lies in interpreting and formalizing why these targeted augmentations are effective OOD.

**Underspecification.** D'Amour et al.
(2020) point out the underspecification issue in out-of-domain generalization, in which multiple models are optimal on the training data but generalize very differently out of domain. While our theoretical setting does not precisely fit the above definition of underspecification, we observe a related phenomenon; although there is a unique optimal model due to feature noise, OOD error can be high when the noiseless version of the regression problem is underspecified.

## 7. Conclusion

We studied targeted augmentations, which randomize spurious domain-dependent features while preserving robust ones. In theoretical analysis and experiments on real-world datasets, we showed that targeted augmentations can significantly improve OOD performance over generic and domain-invariant augmentations. These results illustrate the power of leveraging application knowledge to design targeted augmentations: when the out-of-domain generalization problem is underspecified, prior knowledge can provide additional structure and make the out-of-domain generalization problem more tractable. Future work could also explore methods for learning, rather than hand-designing, targeted augmentations; such approaches could leverage high-level prior knowledge on $x_{\text{d:robust}}$, or directly infer $x_{\text{d:robust}}$ from the training domains.

## Acknowledgements

We are grateful to Henrik Marklund, Holger Klinck, and Sara Beery for their advice. This work was supported by NSF Frontier and Open Philanthropy awards. Shiori Sagawa was supported by the Apple Scholars in AI/ML PhD Fellowship.

## References

Akuzawa, K., Iwasawa, Y., and Matsuo, Y. Adversarial invariant feature learning with accuracy constraint for domain generalization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 315–331. Springer, 2020.

Albuquerque, I., Monteiro, J., Darvishi, M., Falk, T. H., and Mitliagkas, I. Generalizing to unseen domains via distribution matching. arXiv preprint arXiv:1911.00804, 2019.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Bandi, P., Geessink, O., Manson, Q., Van Dijk, M., Balkenhol, M., Hermsen, M., Bejnordi, B. E., Lee, B., Paeng, K., Zhong, A., et al. From detection of individual metastases to classification of lymph node status at the patient level: the CAMELYON17 challenge. IEEE Transactions on Medical Imaging, 38(2):550–560, 2018.

Beery, S., Van Horn, G., and Perona, P. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 456–473, 2018.

Beery, S., Morris, D., and Yang, S. Efficient pipeline for camera trap image review. arXiv preprint arXiv:1907.06772, 2019.

Beery, S., Liu, Y., Morris, D., Piavis, J., Kapoor, A., Joshi, N., Meister, M., and Perona, P. Synthetic examples improve generalization for rare classes. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 863–873, 2020.

Beery, S., Agarwal, A., Cole, E., and Birodkar, V. The iWildCam 2021 competition dataset. arXiv preprint arXiv:2105.03494, 2021.

Birodkar, V., Lu, Z., Li, S., Rathod, V., and Huang, J. The surprising impact of mask-head architecture on novel class segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7015–7025, 2021.

Chen, Q., Shui, C., and Marchand, M. Generalization bounds for meta-learning: An information-theoretic analysis. Advances in Neural Information Processing Systems, 34:25878–25890, 2021a.
Chen, S., Dobriban, E., and Lee, J. H. A group-theoretic framework for data augmentation. The Journal of Machine Learning Research, 21(1):9885–9955, 2020.

Chen, Y., Rosenfeld, E., Sellke, M., Ma, T., and Risteski, A. Iterative feature matching: Toward provable domain generalization with logarithmic environments. arXiv preprint arXiv:2106.09913, 2021b.

Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703, 2020.

D'Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., Chen, C., Deaton, J., Eisenstein, J., Hoffman, M. D., et al. Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research, 2020.

Dao, T., Gu, A., Ratner, A., Smith, V., De Sa, C., and Ré, C. A kernel theory of modern data augmentation. In International Conference on Machine Learning, pp. 1528–1537. PMLR, 2019.

Denton, T., Wisdom, S., and Hershey, J. R. Improving bird classification with unsupervised sound separation. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 636–640. IEEE, 2022.

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

Gontijo-Lopes, R., Smullin, S. J., Cubuk, E. D., and Dyer, E. Affinity and diversity: Quantifying mechanisms of data augmentation. arXiv preprint arXiv:2002.08973, 2020.

Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.

Györfi, L., Kohler, M., Krzyzak, A., Walk, H., et al. A Distribution-Free Theory of Nonparametric Regression, volume 1. Springer, 2002.

He, Z., Xie, L., Chen, X., Zhang, Y., Wang, Y., and Tian, Q. Data augmentation revisited: Rethinking the distribution gap between clean and augmented data. arXiv preprint arXiv:1909.09148, 2019.

Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349, 2021.

Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A., and Darrell, T. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pp. 1989–1998. PMLR, 2018.

Hopping, W. A., Kahl, S., and Klinck, H. A collection of fully-annotated soundscape recordings from the Southwestern Amazon Basin, September 2022. URL https://doi.org/10.5281/zenodo.7079124.

Hsu, D., Kakade, S. M., and Zhang, T. An analysis of random design linear regression. arXiv preprint arXiv:1106.2363, 2011.

Ilse, M., Tomczak, J. M., and Forré, P. Selecting data augmentation for simulating interventions. In International Conference on Machine Learning, pp. 4555–4562. PMLR, 2021.
Johansson, F. D., Sontag, D., and Ranganath, R. Support and invertibility in domain-invariant representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 527–536. PMLR, 2019.

Joly, A., Goëau, H., Kahl, S., Picek, L., Lorieul, T., Cole, E., Deneu, B., Servajean, M., Durso, A., Bolon, I., et al. Overview of LifeCLEF 2021: an evaluation of machine-learning based species identification and species distribution prediction. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 12th International Conference of the CLEF Association, CLEF 2021, Virtual Event, September 21–24, 2021, Proceedings, pp. 371–393. Springer, 2021.

Jose, S. T. and Simeone, O. Information-theoretic generalization bounds for meta-learning and applications. Entropy, 23(1):126, 2021.

Kahl, S., Charif, R., and Klinck, H. A collection of fully-annotated soundscape recordings from the Northeastern United States, August 2022. URL https://doi.org/10.5281/zenodo.7079380.

Koh, P. W., Sagawa, S., Marklund, H., Xie, S. M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R. L., Gao, I., et al. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp. 5637–5664. PMLR, 2021.

Kumar, A., Shen, R., Bubeck, S., and Gunasekar, S. How to fine-tune vision models with SGD. arXiv preprint arXiv:2211.09359, 2022.

Long, M., Cao, Z., Wang, J., and Jordan, M. I. Conditional adversarial domain adaptation. Advances in Neural Information Processing Systems, 31, 2018.

Lyle, C., van der Wilk, M., Kwiatkowska, M., Gal, Y., and Bloem-Reddy, B. On the benefits of invariance in neural networks. arXiv preprint arXiv:2005.00178, 2020.

Mahajan, D., Tople, S., and Sharma, A. Domain generalization using causal matching. In International Conference on Machine Learning, pp. 7313–7324. PMLR, 2021.

Miller, J. P., Taori, R., Raghunathan, A., Sagawa, S., Koh, P. W., Shankar, V., Liang, P., Carmon, Y., and Schmidt, L. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning, pp. 7721–7735. PMLR, 2021.

Navine, A., Kahl, S., Tanimoto-Johnson, A., Klinck, H., and Hart, P. A collection of fully-annotated soundscape recordings from the Island of Hawai'i, September 2022. URL https://doi.org/10.5281/zenodo.7078499.

Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., and Le, Q. V. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.

Puli, A., Joshi, N., He, H., and Ranganath, R. Nuisances via negativa: Adjusting for spurious correlations via data augmentation. arXiv preprint arXiv:2210.01302, 2022.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Robey, A., Pappas, G. J., and Hassani, H. Model-based domain generalization. Advances in Neural Information Processing Systems, 34:20210–20229, 2021.

Rosenfeld, E., Ravikumar, P., and Risteski, A. The risks of invariant risk minimization. arXiv preprint arXiv:2010.05761, 2020.

Ruifrok, A. C., Johnston, D. A., et al. Quantification of histochemical staining by color deconvolution. Analytical and Quantitative Cytology and Histology, 23(4):291–299, 2001.
Sagawa, S., Koh, P. W., Lee, T., Gao, I., Xie, S. M., Shen, K., Kumar, A., Hu, W., Yasunaga, M., Marklund, H., et al. Extending the WILDS benchmark for unsupervised adaptation. arXiv preprint arXiv:2112.05090, 2021.

Sainburg, T. Noise reduction in Python using spectral gating, 2022. URL https://github.com/timsainb/noisereduce.

Sun, B. and Saenko, K. Deep CORAL: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, pp. 443–450. Springer, 2016.

Sun, B., Feng, J., and Saenko, K. Correlation alignment for unsupervised domain adaptation. In Domain Adaptation in Computer Vision Applications, pp. 153–171. Springer, 2017.

Tellez, D., Balkenhol, M., Otte-Höller, I., van de Loo, R., Vogels, R., Bult, P., Wauters, C., Vreuls, W., Mol, S., Karssemeijer, N., et al. Whole-slide mitosis detection in H&E breast histology using PHH3 as a reference to train distilled stain-invariant convolutional networks. IEEE Transactions on Medical Imaging, 37(9):2126–2136, 2018.

Tellez, D., Litjens, G., Bándi, P., Bulten, W., Bokhorst, J.-M., Ciompi, F., and van der Laak, J. Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology. Medical Image Analysis, 58, 2019.

Wang, R., Yi, M., Chen, Z., and Zhu, S. Out-of-distribution generalization with causal invariant transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 375–385, 2022.

Wang, Y., Li, H., and Kot, A. C. Heterogeneous domain generalization via domain mixup. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3622–3626. IEEE, 2020.

Wiles, O., Gowal, S., Stimberg, F., Rebuffi, S.-A., Ktena, I., Dvijotham, K. D., and Cemgil, A. T. A fine-grained analysis on distribution shift. In International Conference on Learning Representations, 2021.

Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998. PMLR, 2022.

Xu, M., Zhang, J., Ni, B., Li, T., Wang, C., Tian, Q., and Zhang, W. Adversarial domain adaptation with domain mixup. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 6502–6509, 2020.

Yan, S., Song, H., Li, N., Zou, L., and Ren, L. Improve unsupervised domain adaptation with mixup training. arXiv preprint arXiv:2001.00677, 2020.

Yao, H., Wang, Y., Li, S., Zhang, L., Liang, W., Zou, J., and Finn, C. Improving out-of-distribution robustness via selective augmentation. arXiv preprint arXiv:2201.00299, 2022.

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032, 2019.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Zhao, H., Des Combes, R. T., Zhang, K., and Gordon, G. On learning invariant representations for domain adaptation. In International Conference on Machine Learning, pp. 7523–7532. PMLR, 2019.
Zhou, K., Yang, Y., Hospedales, T., and Xiang, T. Deep domain-adversarial image generation for domain generalisation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 13025–13032, 2020a.

Zhou, K., Yang, Y., Hospedales, T., and Xiang, T. Learning to generate novel domains for domain generalization. In European Conference on Computer Vision, pp. 561–578. Springer, 2020b.

Zhu, S. A short note on the tail bound of Wishart distribution. arXiv preprint arXiv:1212.5860, 2012.

## A. Additional notes on datasets

In this appendix, we provide additional analysis justifying the decomposition of robust and spurious domain-dependent features in the real-world datasets. We also provide details on the construction of BIRDCALLS.

### A.1. IWILDCAM2020-WILDS

**Analysis on domain-dependent features.** Figure 8 depicts a sample of images from the iWildCam training set. This figure illustrates that animal foregrounds, which are often blurry, occluded, or camouflaged, are alone insufficient for prediction. Extracting habitat features from the background gives useful signal on which species (out of 182 classes) are likely for an image. We emphasize that $x_{\text{d:robust}}$ is reliable under realistic distribution shifts for this application: since camera traps monitor wild animals in their natural habitats, adversarial shifts as dramatic as swapping animals between Kenya and Guatemala (Figure 8) are unlikely. Further, we show in Section 5.1 that being too conservative against this adversarial shift can reduce OOD performance on the relevant, widespread shifts (across cameras).

### A.2. CAMELYON17-WILDS

**Analysis on domain-dependent features.** Figure 9 depicts a sample of images from the Camelyon17 training set. This figure illustrates that cell morphologies are affected by the distributions of patients and their breast cancer stages; Figure 5 concretizes how the distribution of cancer stages varies across domains. We note that unlike IWILDCAM2020-WILDS and BIRDCALLS, domains in CAMELYON17-WILDS have the same (class-balanced) label distribution. To understand why models are incentivized to memorize stain color in this task, we plot the class-separated color histograms for the three training domains in Figure 6. We see that, on train, models can learn a threshold function based on the class color means for prediction.

Figure 5. Hospitals 1–3 vary in the distribution of cancer stages (pN0(i+), pN1mi, pN1, pN2) they observe in patients, due to the different patient distributions they service. This in turn affects the causal feature for cancer prediction (cell morphology).

Figure 6. Class-separated color histograms for CAMELYON17-WILDS.

### A.3. BIRDCALLS

**Problem setting.** To monitor the health of bird populations and their habitats, ornithologists collect petabytes of acoustic recordings from the wild each year. Machine learning can automate analysis of these recordings by learning to recognize species from audio recordings of their vocalizations. However, several features vary across the microphones that collect these recordings, such as microphone model, sampling rate, and recording location. These shifts can degrade model performance on unseen microphones.

**Dataset construction and statistics.** To study targeted augmentations for this setting, we curate a bird recognition dataset by combining publicly released datasets.²
A.3. BIRDCALLS

Problem setting. To monitor the health of bird populations and their habitats, ornithologists collect petabytes of acoustic recordings from the wild each year. Machine learning can automate the analysis of these recordings by learning to recognize species from audio recordings of their vocalizations. However, several features vary across the microphones that collect these recordings, such as microphone model, sampling rate, and recording location. These shifts can degrade model performance on unseen microphones.

Dataset construction and statistics. To study targeted augmentations for this setting, we curate a bird recognition dataset by combining publicly released datasets.² The original data is sourced from 32 kHz long recordings from Navine et al. (2022), Hopping et al. (2022), and Kahl et al. (2022), which were released alongside expert-annotated time-frequency bounding boxes around observed bird calls. To build our dataset from these long recordings, we extracted all 5-second chunks in which a single (or no) species makes a call, and then we undersampled majority classes to achieve a more balanced class distribution. Our curated dataset, BIRDCALLS, contains 4,897 audio clips from 12 microphones distributed between the Northeastern United States, the Southwest Amazon Basin, and Hawai'i. Each clip features one of 31 bird species, or no bird (we include an extra class for "no bird recorded"). The dataset is split as follows:

1. Train: 2,089 clips from 9 microphones
2. ID Validation: 407 clips from 8 of the 9 microphones in the training set
3. ID Test: 1,677 clips from the 9 microphones in the training set
4. OOD Test: 724 clips from 3 different microphones

²We release BIRDCALLS at this link.

To train classification models, we convert the 5-second audio clips into Mel spectrograms and train an EfficientNet-B0 on these images, following prior work (Denton et al., 2022); a code sketch of this pipeline follows below. We evaluate ID and OOD performance on their corresponding test sets. The label distribution of this dataset is shown in Figure 7; to account for remaining class imbalance, we report Macro F1 as the evaluation metric. We show additional samples of the data in Figure 10.

Verifying performance drops. We ran checks to verify that the observed ID-to-OOD performance drops were due to distribution shift, and not due to having an innately more difficult OOD Test set. For these analyses, we further split the OOD Test set into three temporary splits: OOD Train (365 clips), OOD Validation (69 clips), and OOD Test (290 clips). We then compared the (subsetted) OOD Test performance of models trained on the (ID) Train split and selected on the ID Validation split against models trained on the OOD Train split and selected on the OOD Validation split. The results are shown in Table 3. We see that models perform quite well on OOD Test if trained on the same distribution of data (OOD Train). This verifies that the ID-to-OOD performance drops are due to distribution shift.

Table 3. Test-to-test comparison on BIRDCALLS.
Method | ID Test Avg Acc | ID Test Macro F1 | OOD Test Avg Acc | OOD Test Macro F1
Train on OOD data | 16.7 (0.2) | 4.1 (0.1) | 84.4 (0.7) | 51.9 (0.9)
Train on ID data | 79.8 (0.4) | 70.8 (0.6) | 44.6 (0.8) | 23.9 (1.0)

Analysis on domain-dependent features. Figure 10 depicts a sample of spectrograms from the BirdCalls training set. This figure shows how habitat features distinctly vary across domains. Since fine-grained bird species are almost disjoint across regions, habitat features help indicate which species are likely. Correspondingly, we show in Section 5.1 that retaining habitat features improves both ID and OOD performance.

Figure 7. Label distribution of BIRDCALLS (Train and OOD Test splits).
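To make the training pipeline above concrete, here is a minimal sketch assuming torchaudio and torchvision; the spectrogram parameters (n_fft, hop_length, n_mels) are illustrative choices, not necessarily the paper's exact settings.

```python
# Sketch of the BIRDCALLS pipeline: 5-second clips -> Mel spectrograms
# -> ImageNet-pretrained EfficientNet-B0.
import torch
import torchaudio
from torchvision.models import efficientnet_b0

SAMPLE_RATE = 32_000  # the source recordings are 32 kHz
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=2048, hop_length=512, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

def clip_to_input(waveform: torch.Tensor) -> torch.Tensor:
    """Convert a (1, 5 * SAMPLE_RATE) waveform into a 3-channel image tensor."""
    spec = to_db(to_mel(waveform))                 # (1, n_mels, time)
    spec = (spec - spec.mean()) / (spec.std() + 1e-6)
    return spec.repeat(3, 1, 1)                    # tile to 3 channels for an ImageNet backbone

model = efficientnet_b0(weights="IMAGENET1K_V1")
# 31 species plus the "no bird recorded" class.
model.classifier[-1] = torch.nn.Linear(model.classifier[-1].in_features, 32)
```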
Figure 8. A sample of iWildCam training images from five cameras (columns), spanning forest (Guatemala) and savanna (Kenya) habitats. Across domains, both low-level background details x_{d:spu} and high-level habitat features x_{d:robust} vary. Since x_{d:robust} depends on d, domain invariance may eliminate habitat information. In contrast, a targeted augmentation, Copy-Paste (Same Y), randomizes backgrounds between cameras in similar habitats, preserving the model's ability to use x_{d:robust}. This is necessary for performance, as foregrounds x_obj can be too camouflaged, distant, blurred, dark, or occluded for even a human annotator's eye. (All images in this figure contain an animal.)

Figure 9. A sample of Camelyon17 training patches from three hospitals (columns) and six patients. The top two rows depict non-cancerous patches (y = 0), while the bottom three rows are cancerous patches (y = 1). Across domains (columns), several features, including distributions of the causal feature (cell morphology), vary. Cell morphology is impacted by the patient distribution of each hospital, as some hospitals have patients with more aggressive cancer staging (Figure 5). This leads to different distributions of cell morphologies across domains. While domain invariance would thus eliminate this causal feature, targeted augmentations only randomize features independent of y, such as stain color.

Figure 10. A sample of BirdCalls training spectrograms from five microphones (columns) in the Amazon, Hawaii, and the Northeast. Across domains, recordings vary in their habitat features, such as calls from local insects (left two columns, high frequencies), stronger wind levels (center two columns), or rainfall levels. These habitat features can act as a useful bias for deciding likely labels. Targeted augmentations randomize background noise between microphones located in the same region, preserving this robust feature, while domain invariance eliminates it.

B. Augmentation details

In this appendix, we provide implementation details for the targeted augmentations we study on the real-world datasets.

B.1. Copy-Paste (Same Y) on IWILDCAM2020-WILDS

The full Copy-Paste protocol is given in Algorithm 1; a code sketch follows the segmentation-mask details below. We consider two strategies for selecting the set of valid empty backgrounds B^(i):

1. Copy-Paste (All Backgrounds): all empty train split images, B^(i) = {(x, y, d) ∈ D_train : y = "empty"}, i.e., all augmented examples share a single distribution of backgrounds. There is a large set of training backgrounds to choose from when executing this procedure: of the 129,809 training images, 48,021 are empty.

2. Copy-Paste (Same Y): empty train split images from cameras that have observed y^(i). Let Y(d) denote the set of labels domain d observes. Then B^(i) = {(x, y, d) ∈ D_train : y = "empty" and y^(i) ∈ Y(d)}.

Algorithm 1 Copy-Paste
Input: labeled example (x^(i), y^(i), d^(i)), binary segmentation mask m^(i), set B^(i) of empty images to use as backgrounds
if y^(i) = "empty" or |B^(i)| = 0 then
    return x^(i)
end if
Copy out the foreground by applying the segmentation mask: f^(i) := m^(i) ⊙ x^(i)
Randomly select a background b ∈ B^(i)
Paste f^(i) onto b and return x̃^(i) := Paste(f^(i), b)

Segmentation masks. The iWildCam dataset is curated from real camera trap data collected by the Wildlife Conservation Society and released by Beery et al. (2021); Koh et al. (2021). Beery et al. (2021) additionally compute and release segmentation masks for all labeled examples in iWildCam. These segmentation masks were extracted by running the dataset through MegaDetector (Beery et al., 2019) and then passing regions within detected boxes through an off-the-shelf, class-agnostic segmentation model, DeepMAC (Birodkar et al., 2021). We use these segmentation masks for our Copy-Paste augmentation.
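Below is a minimal sketch of Algorithm 1. The array layout (uint8 images of shape (H, W, 3), binary masks of shape (H, W)) and function names are assumptions for illustration, not the released implementation.

```python
# Sketch of Algorithm 1 (Copy-Paste).
import random
import numpy as np

def copy_paste(x, y, mask, backgrounds):
    """Paste the masked foreground of x onto a randomly chosen empty background."""
    if y == "empty" or len(backgrounds) == 0:
        return x
    b = random.choice(backgrounds)         # random empty background from B^(i)
    m = mask[..., None].astype(bool)       # broadcast the mask over color channels
    return np.where(m, x, b)               # foreground pixels from x, the rest from b
```

For Copy-Paste (Same Y), `backgrounds` would be restricted to empty images from cameras whose observed label set contains y^(i); for Copy-Paste (All Backgrounds), it would contain all empty training images.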
Comparison to swapping within countries. To confirm that Copy-Paste (Same Y) acts to preserve geographic habitat features, we ran an oracle experiment comparing its performance to applying Copy-Paste within geographic regions. Beery et al. (2021) released noisy geocoordinates for around half of the locations in IWILDCAM2020-WILDS. Using these coordinates, we inferred the country each camera trap was located in (we merged all cameras with unknown locations into one group, "unknown country"). We then applied Copy-Paste, pasting animals only onto backgrounds from the same country. Table 4 shows that Copy-Paste (Same Y) and this oracle perform comparably, suggesting that the Same Y policy indeed preserves geographic habitat features.

Table 4. Pasting onto backgrounds from cameras that have observed the same class during training achieves similar ID and OOD performance to pasting within countries.
Method | ID Test Macro F1 | OOD Test Macro F1
Copy-Paste (Same Y) | 50.2 (0.7) | 36.5 (0.4)
Copy-Paste (Same Country) | 49.3 (0.9) | 36.7 (0.7)

B.2. Stain Color Jitter on CAMELYON17-WILDS

The full Stain Color Jitter protocol, originally from Tellez et al. (2018), is given in Algorithm 2; a code sketch appears at the end of this appendix. The augmentation uses a pre-specified optical density (OD) matrix from Ruifrok et al. (2001) to project images from RGB space to a three-channel hematoxylin, eosin, and DAB space before applying a random linear combination.

Algorithm 2 Stain Color Jitter
Input: labeled example (x^(i), y^(i), d^(i)), normalized OD matrix M (Ruifrok et al., 2001), strength σ, tolerance ε = 10⁻⁶
S = −log(x^(i) + ε) M⁻¹
Sample α ∼ Uni(1 − σ, 1 + σ)
Sample β ∼ Uni(−σ, σ)
P = exp(−(αS + β)M) − ε
Return P with each cell clipped to [0, 255]

B.3. Copy-Paste + Jitter (Region) on BIRDCALLS

After transforming audio clips into Mel spectrograms, we use the time-frequency bounding boxes included in the dataset to extract the pixels of bird calls. We then paste these pixels onto spectrograms from the "empty" (no bird recorded) class, applying Algorithm 1. Finally, we apply color jitter to the spectrograms. The goal of the jitter is to simulate changes in gain settings across microphones, which affect the coloring of spectrograms. We consider two strategies for selecting the set of valid empty backgrounds B^(i) (a sketch of the second strategy follows below):

1. Copy-Paste + Jitter (All Regions): all empty train split recordings, B^(i) = {(x, y, d) ∈ D_train : y = "empty"}, i.e., all augmented examples share a single distribution of backgrounds.

2. Copy-Paste + Jitter (Region): empty train split recordings from microphones in the same region. Let R(d) denote the region (Hawaii, Southwest Amazon Basin, or Northeastern United States) in which domain d is located; we provide these annotations in BIRDCALLS. Then B^(i) = {(x, y, d) ∈ D_train : y = "empty" and R(d^(i)) = R(d)}.
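The following is a sketch of Algorithm 2 (Stain Color Jitter) from B.2. The 3×3 stain matrix below contains the commonly used H&E(+DAB) stain vectors attributed to Ruifrok et al. (2001), included for illustration; we read the garbled tolerance in the source as ε = 10⁻⁶, and we sample α and β per stain channel, a common implementation choice.

```python
# Sketch of Algorithm 2 (Stain Color Jitter).
import numpy as np

# Normalized optical density matrix (hematoxylin, eosin, DAB stain vectors).
M = np.array([[0.65, 0.70, 0.29],
              [0.07, 0.99, 0.11],
              [0.27, 0.57, 0.78]])
EPS = 1e-6  # tolerance; assumed value

def stain_color_jitter(x, sigma, rng=None):
    """x: float RGB image of shape (H, W, 3) with values in [0, 255]."""
    rng = rng or np.random.default_rng()
    S = -np.log(x + EPS) @ np.linalg.inv(M)            # RGB -> stain space
    alpha = rng.uniform(1 - sigma, 1 + sigma, size=3)  # per-stain scale
    beta = rng.uniform(-sigma, sigma, size=3)          # per-stain shift
    P = np.exp(-(alpha * S + beta) @ M) - EPS          # stain -> RGB
    return np.clip(P, 0, 255)
```

With α = 1 and β = 0 the transform is the identity (up to the ε tolerance), so σ directly controls the jitter strength.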
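And here is a sketch of the background-selection rule for Copy-Paste + Jitter (Region) from B.3. The data layout (`train_set` as a list of (spectrogram, label, domain) triples) and the `region_of` helper are assumptions for illustration.

```python
# Sketch of region-matched background selection for BIRDCALLS.
def empty_backgrounds_same_region(train_set, d_i, region_of):
    """Empty-class spectrograms recorded by microphones in d_i's region."""
    return [x for (x, y, d) in train_set
            if y == "empty" and region_of(d) == region_of(d_i)]
```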
C. Proofs

We present the proofs for the results presented in Section 4.2.

C.1. Analyzing domain-dependent features only

In the proofs, we analyze only the domain-dependent features x_dom = [x_{d:robust}, x_{d:spu}], disregarding the object features x_obj and noise features x_noise, since the latter two do not affect our results. To show this, we first consider the full setting with x = [x_obj, x_noise, x_{d:robust}, x_{d:spu}] and compute the model estimate θ̂ by applying the normal equations. The relevant quantities are

$$E[xx^\top] = \begin{bmatrix} I & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & A \end{bmatrix}, \qquad E[xy] = \begin{bmatrix} \beta_{obj} \\ \beta_{noise} \\ B\beta_{dom} \end{bmatrix},$$

where the blocks correspond to the object features x_obj, the noise features x_noise, and the domain-dependent features [x_{d:robust}, x_{d:spu}], and the matrices A and B depend on the augmentation strategy. Applying the normal equations yields

$$\hat\theta = \begin{bmatrix} \beta_{obj} \\ \beta_{noise} \\ A^{-1}B\beta_{dom} \end{bmatrix}.$$

This means that in our infinite-data, finite-domain setting, models perfectly recover β_obj and β_noise for all augmentation strategies. Thus, the model incurs zero error from the object and noise dimensions, so these features can be disregarded in the error computation. In the rest of the proofs, we focus on the domain-dependent features; without loss of generality, we assume that the dimensionalities of the object and noise features are 0. In other words, we consider x = [x_{d:robust}, x_{d:spu}], β = β_dom = [β_robust, β_spu], and θ = θ_dom = [θ_robust, θ_spu], all of which are of length p_dom.

C.2. Models

Proposition 1 (Estimator without augmentation). Unaugmented training yields the model

$$\hat\theta^{(unaug)} = (\Sigma + M)^{-1} M \beta, \tag{16}$$

where $M = \frac{1}{D}\sum_{d=1}^D \mu^{(d)}\mu^{(d)\top}$ and $\Sigma = \sigma^2 I$.

Proof.
$$\hat\theta^{(unaug)} = E[xx^\top]^{-1} E[xy] \tag{17}$$
$$= \Big( \frac{1}{D}\sum_{d=1}^D \big(\Sigma + \mu^{(d)}\mu^{(d)\top}\big) \Big)^{-1} \frac{1}{D}\sum_{d=1}^D E\big[x\,(\beta^\top \mu^{(d)} + \varepsilon)\big] \tag{18}$$
$$= \Big( \Sigma + \frac{1}{D}\sum_{d=1}^D \mu^{(d)}\mu^{(d)\top} \Big)^{-1} \frac{1}{D}\sum_{d=1}^D \mu^{(d)}\mu^{(d)\top} \beta \tag{19}$$
$$= (\Sigma + M)^{-1} M \beta. \tag{20}$$

Proposition 2 (Estimator with generic augmentation). Applying generic augmentation yields the model

$$\hat\theta^{(gen)} = (\Sigma + M)^{-1} M \beta, \tag{21}$$

where M and Σ are as in Proposition 1.

Proof. Applying generic augmentations does not change the data distribution over the domain-dependent features. Thus θ̂^(gen) = θ̂^(unaug), and applying Proposition 1 yields the result.

Proposition 3 (Estimator with targeted augmentation). Applying targeted augmentation yields the model

$$\hat\theta^{(tgt)} = \begin{bmatrix} (\Sigma_{robust} + M_{robust})^{-1} M_{robust}\, \beta_{robust} \\ 0 \end{bmatrix}, \tag{22}$$

where $M_{robust} = \frac{1}{D}\sum_{d=1}^D \mu^{(d)}_{robust}\mu^{(d)\top}_{robust}$ and $\Sigma_{robust} = \sigma^2 I$.

Proof. In the augmented training distribution, the input x in domain d is distributed as

$$x \sim N\left( \begin{bmatrix} \mu^{(d)}_{robust} \\ 0 \end{bmatrix},\ \Sigma^{(tgt)} \right), \qquad \Sigma^{(tgt)} = \begin{bmatrix} \sigma^2 I & 0 \\ 0 & (\sigma^2 + \tau^2) I \end{bmatrix}. \tag{23}$$

Applying the normal equations on the augmented training distribution, we compute θ̂^(tgt) as

$$\hat\theta^{(tgt)} = E[xx^\top]^{-1} E[xy] = \big(\Sigma^{(tgt)} + M^{(tgt)}\big)^{-1} M^{(tgt)} \beta, \qquad M^{(tgt)} = \begin{bmatrix} M_{robust} & 0 \\ 0 & 0 \end{bmatrix}. \tag{24, 25}$$

Since we can invert block-diagonal matrices block by block,

$$\big(\Sigma^{(tgt)} + M^{(tgt)}\big)^{-1} = \begin{bmatrix} (\sigma^2 I + M_{robust})^{-1} & 0 \\ 0 & \frac{1}{\sigma^2+\tau^2} I \end{bmatrix}. \tag{26}$$

As a result of the block structure, θ̂^(tgt) simplifies to

$$\hat\theta^{(tgt)} = \begin{bmatrix} (\sigma^2 I + M_{robust})^{-1} M_{robust}\, \beta_{robust} \\ 0 \end{bmatrix}. \tag{27}$$

Proposition 4 (Estimator with domain-invariant augmentations). Applying domain-invariant augmentation yields the model

$$\hat\theta^{(inv)} = 0. \tag{28}$$

Proof. In the augmented training distribution, the input x in domain d is distributed as

$$x \sim N(0,\ \Sigma + T). \tag{29}$$

Since E[xy] = 0, applying the normal equations yields θ̂^(inv) = 0.

Proposition 5 (Oracle model). Recall that θ* ∈ argmin_θ R_OOD(θ) is the oracle model that attains optimal performance on the meta distribution P. The oracle model is

$$\theta^{*} = (\Sigma + T)^{-1} T \beta, \tag{30}$$

where Σ = σ²I and T = τ²I.

Proof. As the number of domains D → ∞, M converges to T. Applying the normal equations yields the result.
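The four closed-form estimators above can be checked numerically. Below is a small sketch: it draws D training domains, forms M, computes the estimators of Propositions 1, 3, 4, and 5, and evaluates them with the OOD risk formula derived in Proposition 6 below (dropping the irreducible σ_ε² term). Dimensions are scaled down, and β_spu = 0 as in the setting (consistent with ‖β‖ = ‖β_robust‖ in Proposition 8).

```python
# Numerical sanity check of Propositions 1-6 (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
p_robust, p_spu, sigma2, tau2, D = 3, 20, 0.1, 1.0, 10
p = p_robust + p_spu
beta = np.concatenate([rng.normal(size=p_robust), np.zeros(p_spu)])
mus = rng.normal(scale=np.sqrt(tau2), size=(D, p))   # domain means mu^(d) ~ N(0, tau^2 I)

Sigma, T = sigma2 * np.eye(p), tau2 * np.eye(p)
M = mus.T @ mus / D

def ood_risk(theta):  # Proposition 6, without sigma_eps^2
    return theta @ Sigma @ theta + (beta - theta) @ T @ (beta - theta)

theta_unaug = np.linalg.solve(Sigma + M, M @ beta)               # Proposition 1
Mr = M[:p_robust, :p_robust]                                     # Proposition 3
theta_tgt = np.zeros(p)
theta_tgt[:p_robust] = np.linalg.solve(sigma2 * np.eye(p_robust) + Mr, Mr @ beta[:p_robust])
theta_star = np.linalg.solve(Sigma + T, T @ beta)                # Proposition 5

for name, th in [("unaugmented", theta_unaug), ("targeted", theta_tgt),
                 ("domain-invariant", np.zeros(p)), ("oracle", theta_star)]:
    print(name, ood_risk(th))
```

With D < p_dom, as here, the unaugmented estimator's OOD risk typically exceeds the targeted one's, previewing Theorems 1 and 2.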
C.3. Computation of ID and OOD errors

Proposition 6 (OOD error as a function of θ). The OOD error of a model θ is

$$R_{OOD}(\theta) = \sigma_\varepsilon^2 + \theta^\top \Sigma \theta + (\beta - \theta)^\top T (\beta - \theta), \tag{31}$$

where Σ = σ²I and T = τ²I.

Proof.
$$R_{OOD}(\theta) = E_{x,y,d}\big[(y - \theta^\top x)^2\big] \tag{32}$$
$$= E_d\Big[ E_{x,y\mid d}\big[(y - \theta^\top x)^2\big] \Big] \tag{33}$$
$$= E_d\Big[ E_{x,y\mid d}\Big[\big(\beta_{robust}^\top \mu^{(d)}_{robust} + \varepsilon - \theta^\top x\big)^2\Big] \Big] \tag{34}$$
$$= \sigma_\varepsilon^2 + E_d\Big[ \big(\beta^\top \mu^{(d)}\big)^2 + \theta^\top\big(\Sigma + \mu^{(d)}\mu^{(d)\top}\big)\theta - 2\,\beta^\top \mu^{(d)}\,\theta^\top \mu^{(d)} \Big] \tag{35}$$
$$= \sigma_\varepsilon^2 + \theta^\top \Sigma \theta + (\beta-\theta)^\top E\big[\mu^{(d)}\mu^{(d)\top}\big] (\beta-\theta) \tag{36}$$
$$= \sigma_\varepsilon^2 + \theta^\top \Sigma \theta + (\beta-\theta)^\top T (\beta-\theta). \tag{37}$$

Proposition 7 (ID error as a function of θ). The ID error of a model θ is

$$R_{ID}(\theta) = \sigma_\varepsilon^2 + \theta^\top \Sigma \theta + (\beta - \theta)^\top M (\beta - \theta), \tag{38}$$

where $M = \frac{1}{D}\sum_{d=1}^D \mu^{(d)}\mu^{(d)\top}$ and Σ = σ²I.

Proof. The computation is identical to that of Proposition 6, with the empirical expectation Ê over the D training domains in place of E_d:

$$R_{ID}(\theta) = \hat E_{x,y,d}\big[(y-\theta^\top x)^2\big] = \sigma_\varepsilon^2 + \theta^\top\Sigma\theta + (\beta-\theta)^\top \hat E\big[\mu^{(d)}\mu^{(d)\top}\big](\beta-\theta) = \sigma_\varepsilon^2 + \theta^\top\Sigma\theta + (\beta-\theta)^\top M(\beta-\theta). \tag{39–44}$$

Proposition 8 (OOD error of the oracle). The OOD error of the oracle model θ* is

$$R_{OOD}(\theta^*) = \sigma_\varepsilon^2 + \frac{\tau^2\sigma^2}{\sigma^2+\tau^2}\,\|\beta_{robust}\|^2. \tag{45}$$

Proof. Applying Proposition 5 and Proposition 6 yields

$$R_{OOD}(\theta^*) = \sigma_\varepsilon^2 + \theta^{*\top}\Sigma\theta^* + (\beta-\theta^*)^\top T(\beta-\theta^*) \tag{46}$$
$$= \sigma_\varepsilon^2 + \frac{\tau^2\sigma^2}{\sigma^2+\tau^2}\,\|\beta\|^2 \tag{47}$$
$$= \sigma_\varepsilon^2 + \frac{\tau^2\sigma^2}{\sigma^2+\tau^2}\,\|\beta_{robust}\|^2. \tag{48}$$

C.4. Proof for Theorem 1

Theorem 1 (Excess OOD error without augmentations). If D < p_dom, the expected excess OOD error of the unaugmented model is bounded below as

$$E\big[R_{OOD}(\hat\theta^{(unaug)}) - R_{OOD}(\theta^*)\big] \ \ge\ \frac{\tau^2\gamma^2\,\|\beta_{robust}\|^2}{1+\gamma^2}\cdot\frac{p_{dom}-D}{p_{dom}}, \tag{49}$$

where γ = τ/σ.

Proof. The goal is to lower bound the excess OOD error of the unaugmented estimator:

$$R_{OOD}(\hat\theta^{(unaug)}) - R_{OOD}(\theta^*) \tag{50}$$
$$= \sigma_\varepsilon^2 + \hat\theta^{(unaug)\top}\Sigma\hat\theta^{(unaug)} + (\beta-\hat\theta^{(unaug)})^\top T(\beta-\hat\theta^{(unaug)}) - R_{OOD}(\theta^*) \tag{51}$$
$$= \beta^\top M(\Sigma+M)^{-1}\Sigma(\Sigma+M)^{-1}M\beta + \beta^\top\Sigma(\Sigma+M)^{-1}T(\Sigma+M)^{-1}\Sigma\beta - \frac{\tau^2\sigma^2}{\sigma^2+\tau^2}\|\beta\|^2. \tag{52, 53}$$

We first eigendecompose M as

$$M = U\,\mathrm{diag}(\lambda)\,U^\top. \tag{54}$$

Using this eigendecomposition, we can write the excess OOD error as

$$R_{OOD}(\hat\theta^{(unaug)}) - R_{OOD}(\theta^*) = \beta^\top U\,\mathrm{diag}(v)\,U^\top\beta, \tag{55–58}$$

$$v_i = \begin{cases} \dfrac{\sigma^4(\tau^2-\lambda_i)^2}{(\sigma^2+\tau^2)(\lambda_i+\sigma^2)^2}, & i \le D, \\[2mm] \dfrac{\tau^4}{\sigma^2+\tau^2}, & i > D, \end{cases} \tag{59}$$

where the second case uses the fact that M has rank at most D, so λ_i = 0 for i > D. In the above expression, the eigenvectors u_i and eigenvalues λ_i are random variables, with randomness coming from the draw of domains. Writing out the quadratic form,

$$R_{OOD}(\hat\theta^{(unaug)}) - R_{OOD}(\theta^*) = \sum_{i \le D} \frac{\sigma^4(\tau^2-\lambda_i)^2}{(\sigma^2+\tau^2)(\lambda_i+\sigma^2)^2}(u_i^\top\beta)^2 + \sum_{i > D} \frac{\tau^4}{\sigma^2+\tau^2}(u_i^\top\beta)^2. \tag{60–62}$$

The first term is always nonnegative, so we can lower bound it by 0, yielding

$$R_{OOD}(\hat\theta^{(unaug)}) - R_{OOD}(\theta^*) \ \ge\ \sum_{i=D+1}^{p_{dom}} \frac{\tau^4}{\sigma^2+\tau^2}(u_i^\top\beta)^2. \tag{63, 64}$$

Finally, we compute the expected excess OOD error by plugging in $E[(u_i^\top\beta)^2] = \|\beta\|^2/p_{dom}$ from Lemma 1, which uses the spherical symmetry of M's eigenvectors:

$$E\big[R_{OOD}(\hat\theta^{(unaug)}) - R_{OOD}(\theta^*)\big] \ \ge\ \frac{\tau^4}{\sigma^2+\tau^2}\sum_{i=D+1}^{p_{dom}} E\big[(u_i^\top\beta)^2\big] = \frac{\tau^4}{\sigma^2+\tau^2}\cdot\frac{p_{dom}-D}{p_{dom}}\,\|\beta\|^2 = \frac{\tau^2\gamma^2\,\|\beta_{robust}\|^2}{1+\gamma^2}\cdot\frac{p_{dom}-D}{p_{dom}}. \tag{65–72}$$

Lemma 1. Let θ ∈ R^m be a fixed vector, and let u_i be the eigenvector with the i-th largest eigenvalue of the random matrix $A = \frac{1}{k}\sum_{d=1}^k z^{(d)}z^{(d)\top}$, where z^{(d)} ∼ N(0, s²I_m). For all i = 1, …, m,

$$E\big[(\theta^\top u_i)^2\big] = E\big[(\theta^\top u_i)^2 \mid \lambda_1, \ldots, \lambda_m\big] = \frac{\|\theta\|^2}{m}. \tag{73}$$

Proof. Since z^{(d)} is sampled from an isotropic Gaussian, A's unit eigenvectors are uniformly distributed on the unit sphere, so E[u_i u_i^⊤] = I/m. Thus,

$$E\big[(\theta^\top u_i)^2\big] = \theta^\top E\big[u_i u_i^\top\big]\theta = \frac{\|\theta\|^2}{m}. \tag{74}$$

By the same symmetry argument, we get the same expected value even when conditioning on the eigenvalues:

$$E\big[(\theta^\top u_i)^2 \mid \lambda_1,\ldots,\lambda_m\big] = \frac{\|\theta\|^2}{m}. \tag{75}$$
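Lemma 1 is easy to verify by Monte Carlo. The following sketch (illustrative parameter values) averages (θ^⊤u_i)² over many draws of A and compares against ‖θ‖²/m.

```python
# Monte Carlo check of Lemma 1: E[(theta^T u_i)^2] = ||theta||^2 / m.
import numpy as np

rng = np.random.default_rng(0)
m, k, s, trials = 6, 4, 1.3, 20_000
theta = rng.normal(size=m)

acc = np.zeros(m)
for _ in range(trials):
    Z = rng.normal(scale=s, size=(k, m))
    A = Z.T @ Z / k
    _, U = np.linalg.eigh(A)     # columns of U are orthonormal eigenvectors
    acc += (theta @ U) ** 2      # squared projection of theta onto each u_i
print(acc / trials)              # each entry is close to ||theta||^2 / m
print(theta @ theta / m)
```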
Corollary 1 (Excess OOD error with generic augmentations). If D < p_dom, the expected excess OOD error of the generic model is bounded below as

$$E\big[R_{OOD}(\hat\theta^{(gen)}) - R_{OOD}(\theta^*)\big] \ \ge\ \frac{\tau^2\gamma^2\,\|\beta_{robust}\|^2}{1+\gamma^2}\cdot\frac{p_{dom}-D}{p_{dom}}. \tag{76}$$

Proof. This follows from Theorem 1 and Proposition 2.

C.5. Proof for Theorem 2

We first present Theorem 2 and its proof, including a more general statement before it was simplified for the main text.

Theorem 2 (Excess OOD error with targeted augmentations). Assume γ² > 1. For any 0 < r₀ ≤ 1 and large enough D such that D > 2(p_robust + 2) log(4D p_robust / r₀), the excess OOD error is bounded as

$$E\big[R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*)\big] \ \le\ \frac{\tau^2\gamma^2\,\|\beta_{robust}\|^2}{1+\gamma^2}\left[\frac{r_0}{D} + \frac{2(p_{robust}+2)\log(4Dp_{robust}/r_0)}{D\big(1+\gamma^2(1-\eta)\big)^2}\right], \quad \eta = \sqrt{\frac{2(p_{robust}+2)\log(4Dp_{robust}/r_0)}{D}}. \tag{77, 78}$$

Furthermore, for any 0 < r < 1 and large enough D such that D > 2(p_robust + 2) log(4D p_robust)/(1 − r)²,

$$E\big[R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*)\big] \ \le\ \frac{\tau^2\gamma^2\,\|\beta_{robust}\|^2}{1+\gamma^2}\left[\frac{1}{D} + \frac{2\log(4Dp_{robust})(p_{robust}+2)}{D(1+\gamma^2 r)^2}\right]. \tag{79}$$

Proof. Applying Proposition 9 and Lemma 4 yields

$$E\big[R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*)\big] \ \le\ \frac{\tau^2\gamma^2}{1+\gamma^2}\,\|\beta_{robust}\|^2\left[\frac{\eta^2}{(1+\gamma^2(1-\eta))^2} + \delta\right], \tag{80–83}$$

with $\eta = \sqrt{2(p_{robust}+2)\log(4p_{robust}/\delta)/D}$. We will discuss the assumptions needed to apply Proposition 9 and Lemma 4 in a subsequent paragraph. Before we do that, we pick δ = r₀/D for a constant 0 < r₀ ≤ 1, in which case 0 < δ < 1 for D > 1. Then, we can simplify the expression as

$$E\big[R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*)\big] \ \le\ \frac{\tau^2\gamma^2}{1+\gamma^2}\,\|\beta_{robust}\|^2\left[\frac{r_0}{D} + \frac{2(p_{robust}+2)\log(4Dp_{robust}/r_0)}{D\big(1+\gamma^2(1-\eta)\big)^2}\right]. \tag{84–86}$$

In order to apply Proposition 9 and Lemma 4 above, we need η < 1, which is equivalent to

$$D > 2(p_{robust}+2)\log(4Dp_{robust}/r_0). \tag{87}$$

This concludes the proof of the general statement. Now, we simplify the expression for clarity. First, set r₀ = 1. This yields

$$E\big[R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*)\big] \ \le\ \frac{\tau^2\gamma^2}{1+\gamma^2}\,\|\beta_{robust}\|^2\left[\frac{1}{D} + \frac{2(p_{robust}+2)\log(4Dp_{robust})}{D\big(1+\gamma^2(1-\eta)\big)^2}\right]. \tag{88–90}$$

Now, we bound η = √(2(p_robust+2)log(4Dp_robust)/D) by 1 − r for any 0 < r < 1. To do so, we further assume D large enough that D > 2(p_robust+2)log(4Dp_robust)/(1−r)². Then 1 + γ²(1−η) ≥ 1 + γ²r, and we can simplify the bound as

$$E\big[R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*)\big] \ \le\ \frac{\tau^2\gamma^2}{1+\gamma^2}\,\|\beta_{robust}\|^2\left[\frac{1}{D} + \frac{2(p_{robust}+2)\log(4Dp_{robust})}{D(1+\gamma^2 r)^2}\right]. \tag{91, 92}$$

Proposition 9. Let λ_min, λ_max be the minimum and maximum eigenvalues of M_robust, respectively. If σ < τ and τ²(1−η) ≤ λ_min ≤ λ_max ≤ τ²(1+η+η²) with probability greater than 1−δ, and η < 1, then

$$E\big[R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*)\big] \ \le\ \frac{\tau^2\gamma^2}{1+\gamma^2}\,\|\beta_{robust}\|^2\left[\frac{\eta^2}{(1+\gamma^2(1-\eta))^2} + \delta\right]. \tag{93}$$

Proof. The excess OOD error of θ̂^(tgt) is

$$R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*) \tag{94}$$
$$= \sigma_\varepsilon^2 + \hat\theta^{(tgt)\top}\Sigma\hat\theta^{(tgt)} + (\beta - \hat\theta^{(tgt)})^\top T(\beta - \hat\theta^{(tgt)}) - R_{OOD}(\theta^*) \tag{95}$$
$$= \hat\theta^{(tgt)\top}_{robust}\Sigma_{robust}\hat\theta^{(tgt)}_{robust} + (\beta_{robust} - \hat\theta^{(tgt)}_{robust})^\top T_{robust}(\beta_{robust} - \hat\theta^{(tgt)}_{robust}) - \frac{\tau^2\sigma^2}{\sigma^2+\tau^2}\|\beta_{robust}\|^2 \tag{96, 97}$$
$$= \beta_{robust}^\top M_{robust}(\Sigma_{robust}+M_{robust})^{-1}\Sigma_{robust}(\Sigma_{robust}+M_{robust})^{-1}M_{robust}\beta_{robust} \tag{98}$$
$$\quad + \beta_{robust}^\top \Sigma_{robust}(\Sigma_{robust}+M_{robust})^{-1}T_{robust}(\Sigma_{robust}+M_{robust})^{-1}\Sigma_{robust}\beta_{robust} - \frac{\tau^2\sigma^2}{\sigma^2+\tau^2}\|\beta_{robust}\|^2. \tag{99}$$

We first eigendecompose M_robust as

$$M_{robust} = U\,\mathrm{diag}(\lambda)\,U^\top. \tag{100}$$
Using this eigendecomposition, we can compute the excess OOD error as

$$R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*) = \beta_{robust}^\top U\,\mathrm{diag}(v)\,U^\top \beta_{robust}, \tag{101–104}$$

$$v_i = \frac{\sigma^2\lambda_i^2 + \sigma^4\tau^2}{(\lambda_i+\sigma^2)^2} - \frac{\tau^2\sigma^2}{\sigma^2+\tau^2} = \frac{\sigma^4(\tau^2-\lambda_i)^2}{(\sigma^2+\tau^2)(\lambda_i+\sigma^2)^2}. \tag{105, 106}$$

We can thus rewrite the excess OOD error as

$$R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*) = \sum_{i=1}^{p_{robust}} \frac{\sigma^4(\tau^2-\lambda_i)^2}{(\sigma^2+\tau^2)(\lambda_i+\sigma^2)^2}\,(\beta_{robust}^\top u_i)^2. \tag{107–110}$$

We now bound the excess OOD error by applying the bounds on λ_min and λ_max. Recall that we assume τ²(1−η) ≤ λ_min ≤ λ_max ≤ τ²(1+η+η²) with probability greater than 1−δ. Applying Lemma 3, if these eigenvalue bounds hold and η < 1, then

$$R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*) \ \le\ \frac{\sigma^4\tau^4\eta^2}{(\sigma^2+\tau^2)\big(\tau^2(1-\eta)+\sigma^2\big)^2}\,\|\beta_{robust}\|^2 = \frac{\tau^2\gamma^2\,\eta^2}{(1+\gamma^2)\big(1+\gamma^2(1-\eta)\big)^2}\,\|\beta_{robust}\|^2. \tag{111–114}$$

We now bound the expected value of the excess OOD error. Because the above bound holds with probability greater than 1−δ (the probability with which the eigenvalue bounds hold), we apply the law of total expectation:

$$E\big[R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*)\big] \tag{115}$$
$$\le (1-\delta)\,E\big[\,\cdot \mid \tau^2(1-\eta) \le \lambda_{min} \le \lambda_{max} \le \tau^2(1+\eta+\eta^2)\,\big] + \delta\,E\big[\,\cdot \mid \lambda_{min} < \tau^2(1-\eta)\ \text{or}\ \lambda_{max} > \tau^2(1+\eta+\eta^2)\,\big] \tag{116, 117}$$
$$\le \frac{\tau^2\gamma^2\,\eta^2}{(1+\gamma^2)(1+\gamma^2(1-\eta))^2}\,\|\beta_{robust}\|^2 + \delta\,\frac{\sigma^4\tau^4}{(\sigma^2+\tau^2)\sigma^4}\,\|\beta_{robust}\|^2 \tag{118–120}$$
$$= \frac{\tau^2\gamma^2}{1+\gamma^2}\,\|\beta_{robust}\|^2\left[\frac{\eta^2}{(1+\gamma^2(1-\eta))^2} + \delta\right]. \tag{121}$$

In the second-to-last step, we upper bound the second term by the maximum value of v_i over λ_i ∈ [0, ∞), using the fact that λ_i ≥ 0 since M_robust is positive semidefinite. From Lemma 2, this maximum is the larger of the value at λ_i = 0, which is σ⁴τ⁴/((σ²+τ²)σ⁴)‖β_robust‖², and the limit as λ_i → ∞, which is σ⁴/(σ²+τ²)‖β_robust‖². Because γ² > 1, the former is larger, i.e., the more conservative upper bound.

Lemma 2. Let f(z) = (τ²−z)²/(σ²+z)². The derivative of f is

$$\frac{d}{dz}f(z) = -\frac{2(\tau^2-z)(\sigma^2+\tau^2)}{(\sigma^2+z)^3}, \tag{122}$$

so f is decreasing on (−σ², τ²) and increasing on (τ², ∞).

Proof. Taking the derivative directly yields Equation (122); the sign of f′ is determined by −(τ²−z), which is negative for z < τ² and positive for z > τ². (123)

Lemma 3. For z, η, σ, τ such that τ²(1−η) ≤ z ≤ τ²(1+η+η²), σ < τ, and 0 ≤ η < 1 + σ²/τ²,

$$\frac{(\tau^2-z)^2}{(\sigma^2+z)^2} \ \le\ \frac{\tau^4\eta^2}{(\sigma^2+\tau^2(1-\eta))^2}. \tag{124}$$

Proof. Let f(z) = (τ²−z)²/(σ²+z)². Because f(z) is decreasing for −σ² < z < τ² and increasing for z > τ² (Lemma 2), we can bound f(z) for τ²(1−η) ≤ z ≤ τ²(1+η+η²) by its values at the endpoints,

$$f(z) \ \le\ \max\left( \frac{\tau^4\eta^2}{(\sigma^2+\tau^2(1-\eta))^2},\ \frac{\tau^4(\eta+\eta^2)^2}{(\sigma^2+\tau^2(1+\eta+\eta^2))^2} \right), \tag{125}$$

provided η < 1 + 1/γ² (so that the left endpoint lies in (−σ², ∞)). We now show that

$$\frac{\tau^4\eta^2}{(\sigma^2+\tau^2(1-\eta))^2} \ \ge\ \frac{\tau^4(\eta+\eta^2)^2}{(\sigma^2+\tau^2(1+\eta+\eta^2))^2} \tag{126}$$

for η > 0. The difference between the two sides simplifies to

$$\frac{\eta^3(\eta+2)(\sigma^2+\tau^2)\big({-\sigma^2}+2\tau^2\eta+\tau^2\big)}{(\sigma^2-\tau^2\eta+\tau^2)^2\,(\sigma^2+\tau^2\eta^2+\tau^2\eta+\tau^2)^2}, \tag{127, 128}$$

up to a positive factor. The above is positive if −σ² + 2τ²η + τ² > 0, which is the case for η > 0 and τ² > σ².
Lemma 4. Let λ_min, λ_max be the minimum and maximum eigenvalues of M_robust, respectively. With probability greater than 1−δ, the eigenvalues can be bounded as

$$\tau^2(1-\eta) \ \le\ \lambda_{min} \ \le\ \lambda_{max} \ \le\ \tau^2(1+\eta+\eta^2), \qquad \eta = \sqrt{\frac{2(p_{robust}+2)\log(4p_{robust}/\delta)}{D}}. \tag{129–131}$$

Proof. We apply Equations 1 and 6 from Zhu (2012) together with a union bound over the upper and lower tails.

C.6. Proof for Theorem 3

Theorem 3 (Targeted augmentations improve OOD risk). If γ² > 1 and p_robust is small relative to p_dom, in the sense that

$$p_{robust} + 2 \ <\ \frac{p_{dom}}{4\log(2p_{dom})\big(1+\gamma^4/(\gamma^2-1)^2\big)},$$

then for D such that

$$\frac{4\gamma^4}{(\gamma^2-1)^2}\,(p_{robust}+2)\log(2p_{dom}) \ <\ D\ <\ p_{dom} - 4(p_{robust}+2)\log(2p_{dom}),$$

the improvement in expected OOD risk is positive:

$$E\big[R_{OOD}(\hat\theta^{(unaug)}) - R_{OOD}(\hat\theta^{(tgt)})\big] > 0.$$

Proof. First, we simplify the upper bound of Theorem 2 further, by picking r = 1/γ² and by bounding D < p_dom:

$$E\big[R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*)\big] \ \le\ \frac{\tau^2\gamma^2\|\beta_{robust}\|^2}{1+\gamma^2}\left[\frac{1}{D} + \frac{2\log(4Dp_{robust})(p_{robust}+2)}{D(1+\gamma^2 r)^2}\right] \tag{132, 133}$$
$$\le\ \frac{\tau^2\gamma^2\|\beta_{robust}\|^2}{1+\gamma^2}\cdot\frac{2+\log(4p_{dom}p_{robust})(p_{robust}+2)}{2D}, \tag{134, 135}$$

where the last step uses (1+γ²r)² = 4 for r = 1/γ² and log(4Dp_robust) ≤ log(4p_dom p_robust) for D < p_dom.

Now, we compare with the lower bound (Corollary 1). The gap is

$$E\big[R_{OOD}(\hat\theta^{(gen)})\big] - E\big[R_{OOD}(\hat\theta^{(tgt)})\big] \tag{136}$$
$$= E\big[R_{OOD}(\hat\theta^{(gen)}) - R_{OOD}(\theta^*)\big] - E\big[R_{OOD}(\hat\theta^{(tgt)}) - R_{OOD}(\theta^*)\big] \tag{137}$$
$$\ge\ \frac{\tau^2\gamma^2\|\beta_{robust}\|^2}{1+\gamma^2}\left[1 - \frac{D}{p_{dom}} - \frac{2+\log(4p_{dom}p_{robust})(p_{robust}+2)}{2D}\right]. \tag{138}$$

We apply Lemma 5, noting that 1 ≤ log(2p_dom)(p_robust+2) holds if p_dom ≥ 2, i.e., as long as we have at least one robust and one spurious domain-dependent feature:

$$E\big[R_{OOD}(\hat\theta^{(gen)})\big] - E\big[R_{OOD}(\hat\theta^{(tgt)})\big] \ \ge\ \frac{\tau^2\gamma^2\|\beta_{robust}\|^2}{1+\gamma^2}\left[1 - \frac{D}{p_{dom}} - \frac{2(p_{robust}+2)\log(2p_{dom})}{D}\right]. \tag{139, 140}$$

We now find the conditions under which the right-hand side of Equation (140) is positive. Writing K = (p_robust+2)log(2p_dom) and multiplying through by D p_dom > 0, positivity is equivalent to

$$D^2 - p_{dom} D + 2p_{dom}K \ <\ 0, \tag{141, 142}$$

i.e.,

$$\frac{p_{dom} - \sqrt{p_{dom}^2 - 8p_{dom}K}}{2} \ <\ D\ <\ \frac{p_{dom} + \sqrt{p_{dom}^2 - 8p_{dom}K}}{2}. \tag{143}$$

A sufficient range is

$$4K \ <\ D\ <\ p_{dom} - 4K, \tag{144}$$

where the last step applies $\sqrt{p_{dom}^2 - 8p_{dom}K} \ge p_{dom} - 8K$, which holds whenever 8K ≤ p_dom. For the above computation to go through, we need the term under the square root to be positive,

$$p_{dom}^2 - 8p_{dom}(p_{robust}+2)\log(2p_{dom}) \ >\ 0, \tag{145}$$

which, with some algebra, is equivalent to

$$p_{robust} \ <\ \frac{p_{dom}}{8\log(2p_{dom})} - 2. \tag{146}$$

In addition, we need to satisfy the assumption of Theorem 2,

$$D \ >\ \frac{2(p_{robust}+2)\log(4Dp_{robust})}{(1-1/\gamma^2)^2}, \tag{147}$$

which is implied by

$$D \ >\ \frac{4(p_{robust}+2)\log(2p_{dom})}{(1-1/\gamma^2)^2} \tag{148}$$

for D < p_dom. Comparing this minimum value of D with the minimum value of D for which there is a gap, we see that the former is larger by a factor of (1−1/γ²)^{−2}. Thus, we can show a gap when

$$\frac{4(p_{robust}+2)\log(2p_{dom})}{(1-1/\gamma^2)^2} \ <\ D\ <\ p_{dom} - 4(p_{robust}+2)\log(2p_{dom}). \tag{149}$$

Finally, we want to show that the above is a non-empty range, i.e.,

$$\frac{4(p_{robust}+2)\log(2p_{dom})}{(1-1/\gamma^2)^2} \ <\ p_{dom} - 4(p_{robust}+2)\log(2p_{dom}), \tag{150}$$

which is equivalent to

$$p_{robust} \ <\ \frac{p_{dom}}{4\log(2p_{dom})\big(1+(1-1/\gamma^2)^{-2}\big)} - 2. \tag{151}$$

Comparing with the earlier condition on p_robust (Equation 146), we see that this is the stronger condition; note that (1−1/γ²)^{−2} = γ⁴/(γ²−1)², matching the theorem statement. Because θ̂^(unaug) = θ̂^(gen), the same result applies in comparison to θ̂^(unaug) as well.

Lemma 5 (Negative polynomial lower bound for the gap term). If 1 ≤ log(2p_dom)(p_robust+2) and D p_dom > 1, then

$$1 - \frac{D}{p_{dom}} - \frac{2+\log(4p_{dom}p_{robust})(p_{robust}+2)}{2D} \ \ge\ 1 - \frac{D}{p_{dom}} - \frac{2(p_{robust}+2)\log(2p_{dom})}{D}. \tag{152}$$
Proof. Since p_robust ≤ p_dom, we have $\frac{1}{2}\log(4p_{dom}p_{robust}) \le \log(2p_{dom})$, and by assumption 1 ≤ log(2p_dom)(p_robust+2). Therefore,

$$2 + \log(4p_{dom}p_{robust})(p_{robust}+2) \ \le\ 2\log(2p_{dom})(p_{robust}+2) + 2\log(2p_{dom})(p_{robust}+2) = 4(p_{robust}+2)\log(2p_{dom}). \tag{153–159}$$

Dividing both sides by 2D and subtracting from 1 − D/p_dom yields the claim.

C.7. Proof for Theorem 3

Theorem 3 (OOD error with domain-invariant augmentations). For all D, the expected excess OOD risk is

$$E\big[R_{OOD}(\hat\theta^{(inv)}) - R_{OOD}(\theta^*)\big] \ =\ \frac{\tau^2\gamma^2\,\|\beta_{robust}\|^2}{1+\gamma^2}. \tag{160}$$

Proof.
$$R_{OOD}(\hat\theta^{(inv)}) - R_{OOD}(\theta^*) = \sigma_\varepsilon^2 + \hat\theta^{(inv)\top}\Sigma\hat\theta^{(inv)} + (\beta-\hat\theta^{(inv)})^\top T(\beta-\hat\theta^{(inv)}) - R_{OOD}(\theta^*) \tag{161}$$
$$= \sigma_\varepsilon^2 + \beta^\top T\beta - R_{OOD}(\theta^*) \tag{162}$$
$$= \sigma_\varepsilon^2 + \tau^2\|\beta\|^2 - \sigma_\varepsilon^2 - \frac{\tau^2}{1+\gamma^2}\|\beta\|^2 \tag{163}$$
$$= \frac{\tau^2\gamma^2\,\|\beta\|^2}{1+\gamma^2} \tag{164}$$
$$= \frac{\tau^2\gamma^2\,\|\beta_{robust}\|^2}{1+\gamma^2}. \tag{165}$$

Since θ̂^(inv) = 0 does not depend on the draw of domains, the excess risk equals its expectation.

D. Extended simulation results

In this section, we provide additional details about the simulations in Section 4.3, as well as plots of the ID RMSE for both the high- and low-sample regimes.

D.1. Additional simulation details

For all experiments below, we fix σ² = 0.1, τ² = 1, p_robust = 5, p_spu = 500, and p_noise = 500. Models are evaluated by their RMSE on two test sets: an ID test set of held-out examples from D_train, and an OOD test set that draws examples from 1,000 new domains D_test. We train with ℓ2 regularization; penalty strengths are tuned on an ID validation set. When applying an augmentation to a training set, we run the augmentation over all inputs 5 times, so that the final training set contains 5N samples. A code sketch of this protocol follows Figure 11 below.

We plot ID RMSEs for varying ranges of D in Figure 11. Training with targeted augmentation results in similar ID error to generic and unaugmented training, although targeted augmentations result in slightly higher ID error when D is small. This is because memorizing x_{d:spu} improves ID performance. Domain-invariant augmentation results in high, constant ID error.

Figure 11. In-domain RMSE across values of D, in the high-sample (N = 100,000) and low-sample (N = 5,000) regimes, for unaugmented, generic, domain-invariant, and targeted augmentation. Plots are averaged over 10 random seeds with standard errors.
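Below is a scaled-down sketch of this simulation protocol, assuming the data-generating process analyzed in Appendix C (y = β_robust^⊤ μ_robust^(d) + ε). For brevity it fixes the ridge penalty, augments in place rather than replicating the training set 5×, and uses smaller dimensions; the noise level σ_ε = 0.1 is an illustrative choice.

```python
# Sketch of the D.1 simulation: unaugmented vs. targeted augmentation.
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 0.1, 1.0
p_robust, p_spu, p_noise = 5, 50, 50     # the paper uses p_spu = p_noise = 500
D, N = 20, 2000
p_dom = p_robust + p_spu

beta_r = rng.normal(size=p_robust)
mus = rng.normal(scale=np.sqrt(tau2), size=(D, p_dom))   # training domain means

def sample(domains, n):
    d = rng.integers(len(domains), size=n)
    x_dom = domains[d] + rng.normal(scale=np.sqrt(sigma2), size=(n, p_dom))
    x_noise = rng.normal(scale=np.sqrt(sigma2), size=(n, p_noise))
    y = domains[d][:, :p_robust] @ beta_r + rng.normal(scale=0.1, size=n)
    return np.hstack([x_dom, x_noise]), y

def ridge(X, y, lam=1.0):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

X, y = sample(mus, N)
# Targeted augmentation: re-randomize the spurious block, matching Sigma^(tgt).
X_aug = X.copy()
X_aug[:, p_robust:p_dom] = rng.normal(scale=np.sqrt(sigma2 + tau2), size=(N, p_spu))

mus_test = rng.normal(scale=np.sqrt(tau2), size=(1000, p_dom))   # 1,000 new domains
X_te, y_te = sample(mus_test, 5000)
for name, th in [("unaugmented", ridge(X, y)), ("targeted", ridge(X_aug, y))]:
    print(name, "OOD RMSE:", np.sqrt(np.mean((X_te @ th - y_te) ** 2)))
```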
E. Experimental details

In this appendix, we provide tabular forms of the results visualized in Figure 4. We also summarize core experimental details for each dataset, including the hyperparameter tuning and model selection protocols.

E.1. Extended results

Table 5. Results on IWILDCAM2020-WILDS.
Method | ID Test Macro F1 | OOD Test Macro F1
Unaugmented | 46.5 (0.4) | 30.2 (0.3)
RandAugment | 48.9 (0.2) | 33.3 (0.2)
MixUp | 45.5 (0.6) | 28.9 (0.3)
CutMix | 45.2 (0.7) | 28.4 (0.5)
Cutout | 47.9 (0.7) | 32.6 (0.4)
LISA | 45.4 (0.7) | 29.6 (0.4)
CDAN | 41.2 (0.6) | 28.6 (0.2)
Deep CORAL | 42.4 (1.2) | 30.3 (0.6)
IRM | 39.4 (0.4) | 27.8 (0.1)
Copy-Paste (Same Y) | 50.2 (0.7) | 36.5 (0.4)

Table 6. Results on CAMELYON17-WILDS.
Method | ID Val Avg Acc | OOD Test Avg Acc
Unaugmented | 89.3 (2.0) | 65.2 (2.6)
RandAugment | 94.9 (1.0) | 75.3 (1.7)
MixUp | 86.9 (2.2) | 69.4 (2.1)
CutMix | 84.7 (2.6) | 60.9 (2.2)
LISA | 91.0 (1.6) | 73.6 (1.4)
DANN | 86.1 (2.1) | 64.5 (1.9)
Deep CORAL | 92.3 (1.1) | 62.3 (3.0)
IRM | 88.0 (2.3) | 62.4 (3.1)
Stain Color Jitter | 96.7 (0.1) | 90.5 (0.9)

Table 7. Results on BIRDCALLS.
Method | ID Test Macro F1 | OOD Test Macro F1
Unaugmented | 70.0 (0.5) | 27.8 (1.2)
SpecAugment | 71.4 (0.4) | 22.8 (1.0)
MixUp | 74.0 (0.4) | 26.3 (1.0)
LISA | 69.7 (0.5) | 29.4 (1.1)
Noise Reduction | 75.4 (0.3) | 31.6 (0.9)
Random Pass | 71.2 (2.0) | 31.8 (1.2)
CDAN | 64.7 (0.5) | 27.0 (1.2)
Deep CORAL | 69.2 (0.5) | 27.7 (0.9)
IRM | 69.2 (0.4) | 28.3 (0.8)
Color Jitter | 73.8 (0.2) | 26.1 (0.9)
Copy-Paste + Jitter (Region) | 75.6 (0.3) | 37.8 (1.0)

E.2. Hyperparameters

iWildCam. All experiments used a ResNet-50, pretrained on ImageNet, with no weight decay and batch size 24, following Sagawa et al. (2021); Koh et al. (2021). Model selection and early stopping were done on the OOD validation split of iWildCam, which measures performance on a held-out set of cameras D_val that is disjoint from both D_train and D_test. We tuned all methods by fixing a budget of 10 tuning runs per method with one replicate each; the hyperparameter grids are given in Table 8. Final results are reported over 5 random seeds. For CDAN, we tuned the classifier and discriminator learning rates and fixed the featurizer learning rate to be a tenth of the classifier's, following Sagawa et al. (2021). We applied all data augmentations stochastically with a tuned transform probability, since we found that doing so improved performance, as in prior work (Gontijo-Lopes et al., 2020). For all augmentations, we also stochastically apply a random horizontal flip with the learned transform probability.

Table 8. Hyperparameter search spaces for methods on IWILDCAM2020-WILDS.
Method | Hyperparameters
ERM | Learning rate ~ 10^Uni(−5, −2)
Copy-Paste | Learning rate ~ 10^Uni(−5, −2); Transform probability ~ Uni(0.5, 0.9)
LISA | Learning rate ~ 10^Uni(−5, −2); Transform probability ~ Uni(0.5, 0.9); Interpolation method ∈ {MixUp, CutMix}
Vanilla MixUp | Learning rate ~ 10^Uni(−5, −2); Transform probability ~ Uni(0.5, 0.9); α ∈ {0.2, 0.4}
Vanilla CutMix | Learning rate ~ 10^Uni(−5, −2); Transform probability ~ Uni(0.5, 0.9); α ∈ {0.5, 1.0}
RandAugment | Learning rate ~ 10^Uni(−5, −2); Transform probability ~ Uni(0.5, 0.9); k ∈ {1, 2}
Cutout | Learning rate ~ 10^Uni(−5, −2); Transform probability ~ Uni(0.5, 0.9); Version ∈ {Original, Bounding box-aware}
CDAN | Classifier learning rate ~ 10^Uni(−5.5, −4); Discriminator learning rate ~ 10^Uni(−5.5, −4); λ ~ 10^Uni(−0.3, 1)

Camelyon17. All experiments used a randomly initialized DenseNet-121, with weight decay 0.01 and batch size 168, following Sagawa et al. (2021); Koh et al. (2021). We also fixed the learning rate to that of Sagawa et al. (2021), which was selected by the authors of that paper after a random search over the distribution 10^Uni(−4, −2). For Camelyon17, we found that the choice of learning rate affected the relative ID vs. OOD accuracies of methods. To remove this confounder, we standardized the learning rate across augmentations and algorithms for fair comparison; separately tuning the learning rate for each algorithm did not significantly improve performance. Because Camelyon17 is class-balanced, we ran experiments on DANN (rather than CDAN).
For DANN, we used the learning rate fixed across all methods for the featurizer and set the classifier learning rate to be 10× higher, following Sagawa et al. (2021). Model selection and early stopping were done on the OOD validation split of Camelyon17, which measures performance on a held-out hospital D_val that is disjoint from both D_train and D_test. We tuned the remaining hyperparameters by fixing a budget of 10 tuning runs per method with one replicate each; the hyperparameter grids are given in Table 9. Because of the large variance in performance between random seeds for some algorithms on Camelyon17 (Koh et al., 2021; Miller et al., 2021), we ran 20 replicates in the final results.

Table 9. Hyperparameter search spaces for methods on CAMELYON17-WILDS.
Method | Hyperparameters
Stain Color Jitter | Augmentation strength ∈ [0.05, 0.1]
LISA | Interpolation method ∈ {MixUp, CutMix}
Vanilla MixUp | α ∈ {0.2, 0.4}
Vanilla CutMix | α ∈ {0.5, 1.0}
RandAugment | k ∈ {1, 2}
Cutout | (no additional hyperparameters)
DANN | Discriminator learning rate ~ 10^Uni(−4, −2); λ ~ 10^Uni(−1, 0)

BirdCalls. All experiments used an EfficientNet-B0, pretrained on ImageNet, with batch size 64. Model selection and early stopping were done on an ID validation split, which measures performance on held-out examples from D_train. We tuned all methods by fixing a budget of 10 tuning runs per method with five replicates each; the hyperparameter grids are given in Table 10. Because of its small size, BirdCalls has relatively high variance between results; we thus report final results averaged over 20 random seeds. For CDAN, we tuned the classifier and discriminator learning rates and fixed the featurizer learning rate to be a tenth of the classifier's, matching our policy on iWildCam. For all augmentations, we also stochastically apply a random horizontal flip with the learned transform probability.

Table 10. Hyperparameter search spaces for methods on BIRDCALLS.
Method | Hyperparameters
ERM | Learning rate ~ 10^Uni(−4, −3); Weight decay ∈ {0, 0.001, 0.1, 1}
Copy-Paste | Learning rate ~ 10^Uni(−4, −3); Weight decay ∈ {0, 0.001, 0.1, 1}; Transform probability ~ Uni(0.5, 0.9)
LISA | Learning rate ~ 10^Uni(−4, −3); Weight decay ∈ {0, 0.001, 0.1, 1}; Transform probability ~ Uni(0.5, 0.9)
Vanilla MixUp | Learning rate ~ 10^Uni(−4, −3); Weight decay ∈ {0, 0.001, 0.1, 1}; Transform probability ~ Uni(0.5, 0.9); α ∈ {0.2, 0.4}
SpecAugment | Learning rate ~ 10^Uni(−4, −3); Weight decay ∈ {0, 0.001, 0.1, 1}; Transform probability ~ Uni(0.5, 0.9); k ∈ {1, 2}; F ∈ {10, 20, …, 100}; T ∈ {10, 20, …, 100}
Random Pass | Learning rate ~ 10^Uni(−4, −3); Weight decay ∈ {0, 0.001, 0.1, 1}
Noise Reduction | Learning rate ~ 10^Uni(−4, −3); Weight decay ∈ {0, 0.001, 0.1, 1}
CDAN | Classifier learning rate ~ 10^Uni(−5, −2); Weight decay ∈ {0, 0.001, 0.1, 1}; Discriminator learning rate ~ 10^Uni(−5, −2); λ ~ 10^Uni(−0.3, 1)

E.3. CLIP Experiments

In our experiments finetuning CLIP on iWildCam and Camelyon17, we used OpenAI's CLIP ViT-L/14 at 224 × 224 pixel resolution. Early stopping and model selection were done on the OOD validation splits. Hyperparameters are given in Table 11 for iWildCam and Table 12 for Camelyon17; we based the Camelyon17 hyperparameters on Kumar et al. (2022) and the iWildCam hyperparameters on Wortsman et al. (2022). We tuned all methods by fixing a budget of 10 tuning runs per method. Results are averaged over five seeds.

Table 11. Hyperparameter search spaces for CLIP experiments on iWildCam.
Method | Hyperparameters
ERM | Learning rate ~ 10^Uni(−6, −4); Weight decay ~ 10^Uni(−4, −0.2); Optimizer = AdamW
Copy-Paste (Same Y) | Learning rate ~ 10^Uni(−6, −4); Weight decay ~ 10^Uni(−4, −0.2); Transform probability ~ Uni(0.5, 0.9); Optimizer = AdamW

Table 12. Hyperparameter search spaces for CLIP experiments on Camelyon17.
Method | Hyperparameters
ERM | Learning rate ~ 10^Uni(−6, −3); Weight decay = 0.01; Optimizer = SGD
Stain Color Jitter | Learning rate ~ 10^Uni(−6, −3); Weight decay = 0.01; Augmentation strength ∈ [0.05, 0.1]; Optimizer = SGD
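To make the search-space notation in Tables 8–12 concrete: "10^Uni(a, b)" denotes a log-uniform draw, with the exponent sampled uniformly from [a, b]. Below is a minimal sketch of one such tuning budget; the specific keys are illustrative.

```python
# Sketch of sampling one random-search configuration per tuning run.
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    return {
        "lr": 10 ** rng.uniform(-5, -2),           # learning rate ~ 10^Uni(-5, -2)
        "transform_prob": rng.uniform(0.5, 0.9),   # augmentation probability ~ Uni(0.5, 0.9)
        "alpha": rng.choice([0.2, 0.4]),           # discrete choice, e.g., MixUp's alpha
    }

configs = [sample_config() for _ in range(10)]     # budget of 10 tuning runs
```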