Statistical Learning under Heterogenous Distribution Shift

Max Simchowitz * 1 Anurag Ajay * 1 Pulkit Agrawal 1 Akshay Krishnamurthy 2

*Equal contribution 1CSAIL, Massachusetts Institute of Technology 2Microsoft Research, New York City. Correspondence to: Max Simchowitz . Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract. This paper studies the prediction of a target z from a pair of random variables (x, y), where the ground-truth predictor is additive: E[z | x, y] = f⋆(x) + g⋆(y). We study the performance of empirical risk minimization (ERM) over functions f + g, f ∈ F and g ∈ G, fit on a given training distribution, but evaluated on a test distribution which exhibits covariate shift. We show that, when the class F is "simpler" than G (measured, e.g., in terms of its metric entropy), our predictor is more resilient to heterogenous covariate shifts in which the shift in x is much greater than that in y. These results rely on a novel Hölder-style inequality for the Dudley integral which may be of independent interest. Moreover, we corroborate our theoretical findings with experiments demonstrating improved resilience to shifts in simpler features across numerous domains.

1. Introduction

Modern machine learning systems are routinely deployed under distribution shift (Taori et al., 2020; Koh et al., 2021). However, statistical learning theory has primarily focused on studying the generalization error in the situation where the test and the training distributions are identical (Bartlett and Mendelson, 2002; Vapnik, 2006). In the setting of covariate shift (where only the features/covariates change between training and testing, but the target function remains fixed), guarantees from statistical learning theory can be applied via a reweighting argument, leading to the classical bound involving density ratios depicted in Eq. (3.2). However, this approach may be overly pessimistic and may not account for the relative differences in performance degradation between different distribution shift settings. Well-specified linear regression is perhaps the simplest setting that admits favorable distribution shift behavior (Lei et al., 2021). Here, out-of-distribution generalization is controlled by the alignment between the second moment matrices of the training and test distributions, rather than the significantly worse density ratios. Beyond the linear setting, ML models including neural networks often suffer from spurious correlation (c.f., Arjovsky et al., 2019), where the model exploits correlations in the training distribution to learn an accurate-but-incorrect predictor that fails to generalize to a de-correlated distribution. Though this phenomenon and other related ones are well-documented experimentally, a general theory of distribution shift, particularly one that explains the behavior of deep learning models in practice, has remained undeveloped. A useful theory of distribution shift should make predictions as to which shifts a learned model is most sensitive to in possible test environments, given properties of the model which can be evaluated from training data (Xiao et al., 2020; Koh et al., 2021; Rahimian and Mehrotra, 2019). We illustrate this point with the following example. Example 1.1. Consider a quadruped carrying different payloads across multiple terrains, with a policy trained via reinforcement learning.
Should one expect more degradation in performance with new shapes or sizes of payloads? Or should a policy suffer more from novel terrains? If our policy requires camera inputs, should we expect that changes in lighting conditions or times of day have more of an effect? Or if the policy relies on tactile sensation, should we expect changes in weather (e.g., rain on the tactile sensors) to present more of an obstacle? Contributions. This paper gestures towards a richer theory of generalization under covariate shift; one that makes such actionable predictions about the relative resilience of a model to the kinds of multifarious shifts illustrated in Example 1.1. More specifically, we highlight a setting we call heterogenous covariate shift, where the distribution of one feature shifts more than another. Theoretically, we study supervised prediction from a pair of (possibly non-independent) random variables (x, y). We think of x as corresponding to simple features and y to more complex ones. We show greater resilience to heterogenous distribution shifts in which the shift in the marginal of y is significantly smaller than that of the joint Learning under Heterogenous Distribution Shift distribution. Specifically, our analysis restricts its attention to regression functions which decompose additively as f(x) + g(y). We show that empirical risk minimization (ERM) over functions of the form f(x) + g(y) leads to much more favorable generalization guarantees than those obtained via the naïve covariate shift bound. In the most favorable setting, we obtain a test error bound that scales only with the covariate shift in the marginal of the complex feature y, so that even though spurious correlations between x and y are present, they play no role in the generalization performance of the ERM. While limited, the additive framework proposes a useful metric to evaluate relative complexity of the features: the richness of their associated function classes. This suggests a more general hypothesis that can be formulated without the additivity assumption: we can determine resilience to shifts in a given feature by evaluating the complexity of a model s dependence on that feature. Using in-distribution generalization as a proxy for model complexity, we find that deep learning models are consistenly more resilient to shifts in simpler features than they are to shifts in complex features; this finding holds across a range of tasks, including synthetic settings, computer vision benchmarks, and imitation learning. We hope that, taken together, our theoretical and experimental results initiate a further dialogue between the field of statistical learning theory and the study of distribution shift in machine learning more broadly. Proof Techniques. The technical challenge to obtaining favorable distribution shift is correlation between x and y, which, among other things, leads to unidentifiability of the generalizing predictor. We show that when G is sufficiently expressive, the simple predictor f can be learned, up to a bias arising from identifiability, at a rate that exhibits a lower order dependence on G. Although this predictor is affected by distribution shifts in x, the low complexity of the function class F and the lower order dependence on G implies that the impact on the overall performance is rather small. Then ERM can learn a g that corrects for the bias in f and is unaffected by distribution shifts in x. 
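Concretely, writing νx,y and νy for the density-ratio coefficients of the joint distribution and of the y-marginal (Definition 2.1), and Rtrain[f], Rtrain(f, g) for the per-function and joint excess risks introduced later in Section 3, the two-stage argument above rests on the decomposition that appears below as Eq. (3.1):

```latex
\mathcal{R}_{\mathrm{test}}(f,g) \;\le\; 2\big(\nu_{x,y}\,\mathcal{R}_{\mathrm{train}}[f] \;+\; \nu_{y}\,\mathcal{R}_{\mathrm{train}}(f,g)\big).
```

In words, the large joint-shift coefficient νx,y multiplies only the error in recovering f (which is small when F is simple), while the full training risk is scaled only by the milder marginal coefficient νy.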
The core technical result for this argument is the generalization bound for F which disentangles the correlations between x and y; this result relies, among other things, on a novel Hölder-style inequality for the Dudley integral of products of function classes, which may be of independent interest. Related Work. Our results and techniques are very much in the spirit of classical statistical learning theory (Bartlett et al., 2005; Bousquet and Elisseeff, 2002; Bartlett and Mendelson, 2002; Vapnik, 2006), but also have the flavor of more recent work on orthogonal/double machine learning (Chernozhukov et al., 2017; Foster and Syrgkanis, 2019; Mackey et al., 2018). In that parlance, we can view g as a nuisance parameter for estimating f and our re- sults show similar (but not quite matching) recovery guarantees without explicit double-training interventions. We discuss comparisons to orthogonal ML in the sequel. Resilience to distribution shift has received considerable attention in recent years (Miller et al., 2021; Taori et al., 2020; Santurkar et al., 2020; Koh et al., 2021; Zhou et al., 2022), with the vast majority of the work being empirical. While the present work focuses on studying vanilla empirical risk minimization, there have been many methods produced to explicity tackle distribution shift including CORAL (Sun and Saenko, 2016), IRM (Arjovsky et al., 2019), and distributionally robust optimization, the latter having seen recent advances on both empirical and theoretical fronts (Schmidt et al., 2018; Rahimian and Mehrotra, 2019; Sinha et al., 2018). Though the statistical properties of distribution shift under empirical risk minimization has garnered substantially less attention, recent work has given precise characterizations of the effects of covariate shift for certain specific function classes, notably kernels (Ma et al., 2022) and Hölder smooth classes (Pathak et al., 2022). Our work complements these by considering structural situations in which interesting generalization phenomena arise for arbitrary function classes. Lastly, (Dong and Ma, 2023) establish Laplacian-like connectivity conditions under which test-error of additive predictors f(x) + g(x) (as in this work) can be bounded in terms of train-error, focusing on (a) situations where the marginals over x, y between testand train-distibutions coincide but joint distributions differ and (b) discrete Gaussian-distributed features. By contrast, our work allows for changes in both joint and marginal distributions (albeit with cruder measures of shift), general feature distributions, and exposes statistical phenomena not addressed by the former work. 2. Theoretical Setup We study the prediction of a scalar z R from two covariates x X, y Y under distribution shift. We postulate a pair of testing and training environments denoted e {test, train}, each of which index laws Pe over (x, y, z), and whose expectation operators are denoted by Ee. We assume the environments do not differ in the Bayes regression function, i.e., they exhibit only covariate shift: Assumption 2.1 (Covariate Shift). We assume that, for all x, y, Etrain[z | x, y] = Etest[z | x, y]. Next, we assume that the we have access to a class of functions that capture the conditional expectations Etrain[z | x, y] via additive structure. Specifically, we assume access to classes F : X R and G : Y R for which (x, y) 7 Etrain[z | x = x, y = x] F + G. This is typically referred to as being realizable or well-specified. Assumption 2.2 (Additive well-specification). 
For some Learning under Heterogenous Distribution Shift f F and g G, it holds thats Ptrain[z | x = x, y = x] N(f (x) + g (y), σ2) (2.1) Via universality of Gaussian processes, our results can be extended to general subgaussian noise. Since the model is well-specified, a natural performance measure of a predictor (f, g) is its excess square-loss risk, denoted Re(f, g): Re(f, g) := Ee((f f )(x) + (g g )(y))2. Empirical Risk Minimization. We study the excess risk under Ptest of square-loss empirical risk minimizers, or ERMs, for Ptrain. Given a number n N, we collect (xi, yi, zi)i [n] i.i.d Ptrain samples and let ( ˆfn, ˆgn) denote (any) empirical risk minimizer of the samples: ( ˆfn, ˆgn) arg min(f,g) F G ˆLn(f, g), where ˆLn(f, g) := 1 n Pn i=1(f(xi) + g(yi) zi)2. (2.2) Distribution Shift. Although we have samples from Ptrain, we are primarily interested in the excess square loss under Ptest. For simplicity, the body of this paper focuses on when the density ratios between these distributions are upper bounded; as discussed in Section 3.3, these conditions can be weakened considerably. Definition 2.1. Define the density ratio coefficients νx,y, νy 1 to be the smallest scalars such that for all measurable sets A X Y and B Y, Ptest[(x, y) A] νx,y Ptrain[(x, y) A] Ptest[y B] νy Ptrain[y B]. The interesting regime is where νx,y, νy are finite. A standard covariate shift argument upper bounds the excess risk on Ptest by the joint density ratio, νx,y, times the excess risk on Ptrain. Our aim is to show that much better bounds are possible. Specifically, if the class F is smaller than the class G, then the excess risk on Ptest is less sensitive to shifts in the joint distribution (i.e., νx,y) than it is to shifts in the y-marginal (i.e., νy). Such an improvement is most interesting in the regime where νx,y νy, which requires that x is not a measurable function of y. Controlling distribution shift via bounded density ratios is popular in the offline reinforcement learning, where such terms are called concentrability coefficients (Xie and Jiang, 2020; Xie et al., 2022). We stress that the uniform density ratio bounds in this section are merely for convenience; we discuss generalizations at length in Section 3.3. Conditional Completeness. Notice that (f , g ) may not be identifiable in the model Eq. (2.1). The most glaring counterexample occurs when x = y, and f + g F G. Then, (f, g) = (f + g , 0) and (f, g) = (0, f + g ) are both optimal pairs of predictors. However, this setting is uninteresting for our purposes, since x = y implies that νx,y = νy. On the other hand, when x and y are independent, the model is identifiable up to a constant offset, i.e., (f + c, g c) is an optimal pair. This line of reasoning suggests that the indentifiable part of f in Eq. (2.1) corresponds to the part of x that is orthogonal to y. To capture this effect, we introduce the conditional bias of f given y under the training distribution: βf( ) = Etrain[(f f )(x) | y = ]. (2.3) Note that this is a function of y, not x. One can check that Rtrain(f, g) = 0 if and only if (f(x), g(y)) = (f (x)+βf(y), g (y) βf(y)) with probability one over (x, y) Ptrain. Note, in particular, that this requires βf is almost surely (under Ptrain) equal to a measurable function of x. This allows, for example, (f, g) = (f c, g + c) for constants c R, and, in particular, (f , g ) meet these requirements since βf = 0. We now introduce our final, and arguably only nonstandard, assumption. Assumption 2.3 (γ-Conditional Completeness). 
There exists some γ > 0 such that, for any (f, g) F G satisfying Rtrain(f, g) γ2, it holds that g βf G. Conditional completeness is somewhat non-intuitive but it is satisfied in some natural cases. We list them here informally, and defer formal exposition to Appendix A.1. First, as aluded to above, when x y, βf(y) is constant in y and so conditional completeness holds as long as G is closed under affine translation. Second, it holds when F and G are linear classes and x and y are jointly Gaussian; this follows since the conditional distribution Etrain[x | y = y] is linear in y. The latter example extends to nonparametric settings: conditional completeness holds if the conditional expectations x | y are smooth and G contains correspondingly smooth functions. The restriction to Rtrain(f, g) γ2 allows us to make the assumption compatible with the following, standard boundedness assumption (for otherwise we would need to have g kβf G for all k N, see Remark A.1.) Assumption 2.4 (Boundedness). We assume that for all f F and g G, |f(x)| and |g(y)| are uniformly bounded by some B > 0. For simplicity, we also assume F and G contain the zero predictor. Notation. We use a b to denote inequality up to universal constants, and use O ( ) and e O ( ) as informal notation suppressing problem-dependent constants and logarithmic factors, respectively. A scalar-valued random variable is standard normal if Z N(0, 1) and Rademacher Learning under Heterogenous Distribution Shift if Z is uniform on { 1, 1}. For v = (v1, . . . , vn) Rn and q [1, ), define the normalized q-norms v q,n = ( 1 n Pn i=1 |vi|q)1/q and v ,n = v = maxi [n] |vi|. We let W = X Y with elements w W, so we can view classes f F, g G, and βf as mappings with type W R. Given h H and a sequence w1:n Wn, define the evaluation vector h[w1:n] := (h(w1), . . . , h(wn)) Rn and evaluated class H[w1:n] := {h[w1:n] : h H} Rn. All of our results follow from the same schematic: we argue that if F is simpler than G, it is much easier to recover f than it is to recover g , subject to the identifiability issues introduced by βf. To express this, we introduce the per-function risks, for e {train, test}: Re[f] := Ee[(f f βf)2]. Re[g; f] := Ee[(g g + βf)2]. Our schematic shows that Rtrain[ ˆfn] Rtrain[ˆgn; ˆfn], with precise convergence rates. The expression Re[f] reflects that f is identifiable only up to a bias, while Re[g; f] can be thought of as the residual error after accounting for the bias in f. A straightforward consequence of these definitions is the following risk decomposition: Lemma 3.1. Let (f, g) F G. Then, under Assms. 2.1 and 2.2, Rtrain(f, g) = Rtrain[f] + Rtrain[g; f]. Morever, Rtest(f, g) 2(Rtest[f] + Rtest[g; f]). Therefore, Rtest(f, g) 2(νx,y Rtrain[f] + νy Rtrain(f, g)). (3.1) Eq. (3.1) is the starting point for our results. By comparison, the standard distribution shift bound is Rtest(f, g) νx,y Rtrain(f, g). (3.2) Hence, Eq. (3.1) leads to sharper estimates for ERM in the regime where νy νx,y and Rtrain[ ˆfn] Rtest[ ˆfn, ˆgn), i.e., when the shift in y is less than the shift in the joint distribution and when the estimate of f is more accurate than the estimate of f + g . The bulk of the analysis involves obtaining sharp bounds on Rtrain[ ˆfn], this is sketched in Section 4. In the remainder of this section, we describe implications for various settings of interest. 3.1. Nonparametric Rates We begin by demonstrating improvements in the nonparametric regime, where we measure the complexity of function classes by their metric entropies. 
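As a concrete illustration of the objects just defined, the following minimal numpy sketch fits the additive ERM of Eq. (2.2) in a toy linear setting and evaluates the excess risk Re(f, g) on shifted distributions. The dimensions, the correlation mechanism between x and y, and the mean-shift test distributions are illustrative assumptions rather than the paper's experimental setup; with linear classes, the ERM reduces to ordinary least squares on the concatenated features.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 2000, 3, 30

# Ground truth E[z | x, y] = f*(x) + g*(y); both pieces are linear here, with
# dim(x) << dim(y) playing the role of "F simpler than G".
f_star = rng.normal(size=dx)
g_star = rng.normal(size=dy)

def sample(m, x_shift=0.0, y_shift=0.0):
    y = y_shift + rng.normal(size=(m, dy))
    x = 0.5 * y[:, :dx] + x_shift + rng.normal(size=(m, dx))  # x and y are correlated
    z = x @ f_star + y @ g_star + 0.1 * rng.normal(size=m)
    return x, y, z

x_tr, y_tr, z_tr = sample(n)

# ERM over additive predictors f(x) + g(y) (Eq. (2.2)): for linear classes this
# is ordinary least squares on the concatenated features.
theta, *_ = np.linalg.lstsq(np.hstack([x_tr, y_tr]), z_tr, rcond=None)
f_hat, g_hat = theta[:dx], theta[dx:]

def excess_risk(x, y):
    # R_e(f, g) = E_e[((f - f*)(x) + (g - g*)(y))^2]
    return np.mean((x @ (f_hat - f_star) + y @ (g_hat - g_star)) ** 2)

print("train                       :", excess_risk(x_tr, y_tr))
print("shift in x (y-marginal fixed):", excess_risk(*sample(n, x_shift=3.0)[:2]))
print("shift in y                   :", excess_risk(*sample(n, y_shift=3.0)[:2]))
```

In this construction, shifting x changes the joint distribution of (x, y) while leaving the y-marginal unchanged (the regime νy much smaller than νx,y), whereas shifting y moves the y-marginal as well.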
Recall that an ϵ-cover of a set V in a norm is a set V V such that, for any v V, there exists v V for which v v ϵ. The covering number of V at scale ϵ in norm is the minimal cardinality of an ϵ-cover, denoted N (V, , ε). Metric entropies of function classes are defined via the logarithm of the covering number. Definition 3.1 (Metric Entropy). We define the q-norm metric entropy of a function class H : W R as Mq(ϵ, H) := supn supw1:n log N (H[w1:n], q,n, ε). As in classical results in statistical learning theory, rates of convergence depend on function class complexity primarily through the growth rate of the metric entropy, i.e., how Mq(ϵ, H) scales as a function of ϵ. We state our first main result informally, in line with this tradition. Theorem 1 (Informal). Under Assm. 2.1-Assm. 2.4, with high probability, Rtest( ˆfn, ˆgn) is at most e O νx,y(raten,2(F) + raten, (G)2) + νyraten,2(G) with high probability, where we define raten,q(H) = d n Mq(ϵ, H) = O (d log(1/ϵ)) n 2 2+p Mq(ϵ, H) = O (ϵ p) , p 2 n 1 p Mq(ϵ, H) = O (ϵ p) , p > 2 and raten, (H) = n (1/2 1/p) for M (ϵ, H) = O (ϵ p). A formal statement is given in Appendix E. As a preliminary point of comparison, the classical analysis would yield a bound of the form Rtest( ˆfn, ˆgn) e O (νx,y(raten,2(F) + raten,2(G))) , which can be worse than the above bound when νy νx,y and raten, (G)2 raten,2(G). It is crucial to note that in our bound, G and νx,y interact via the squared convergence rate for G, which we expect to be significantly smaller than raten,2(G). This benign dependence on G is similar in spirit to results in orthogonal machine learning (c.f., Foster and Syrgkanis, 2019), where it arises from a similar phenomenon: the ability to recover f is only weakly impacted by the presence of g . We discuss orthogonal ML in more detail in Section 3.4. We should also mention one caveat to the bound: the ideal bound would replace raten, (G)2 with raten,2(G)2 which is smaller most notably when G is small. Unfortunately, our bound departs from this ideal in that raten, (G) involves the ℓ metric entropy and also that it does not exploit localization to go beyond the n 1/2 rate for small classes. It is not clear if this discrepancy reflects a limitation in our analysis or is a fundamental limitation of ERM. On the other hand, localization can only improve the constant factors but not the dependence on n, since the raten, (G) term is squared and since raten,2(F) is always at least 1/n. 3.2. Finite Function Classes When F and G are finite function classes with log |F| d1 and log |G| d2, an application of Theorem 1 gives Learning under Heterogenous Distribution Shift the rate of Rtest( ˆfn, ˆgn) νx,y d1+d2 n , which is precisely what one obtains via naive change of measure arguments. Although direct application of Theorem 1 does not yield improvements precisely because of the lack of localization as discussed above we can improve upon this bound with an additional hypercontractivity assumption, often popular in the statistical learning literature (Mendelson, 2015). We defer formal definitions, a formal theorem statement, and proofs to Appendix D; the following informal theorem summarizes our findings. Theorem 2 (Informal). Under certain hypercontractivity conditions detailed in Appendix D, it holds with high probability that Rtest( ˆfn, ˆgn) 1 n (νx,yd1 + νyd2 + νx,yd2 φn(d1, d2)) , where φn(d1, d2) = ( d2 n )c1 +( d1 d2 )c2, for constants c1, c2 > 0 depending on the hypercontractivity exponents. 
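To make the comparison concrete, the following back-of-the-envelope computation plugs stylized numbers into the two rates. The exponents used for φn(d1, d2) are placeholders (Theorem 2 only guarantees the existence of some positive constants c1, c2), so the snippet illustrates the regime νy much smaller than νx,y and d1 much smaller than d2 rather than computing the actual bound.

```python
import math

# d1, d2 are log-cardinalities; nu_xy, nu_y are density-ratio coefficients.
# c1 = 1/2, c2 = 1 below are placeholder exponents in phi_n, chosen only to
# make the arithmetic concrete.
n, d1, d2 = 50_000, 10, 500
nu_xy, nu_y = 100.0, 2.0

phi = math.sqrt(d2 / n) + d1 / d2                       # stand-in for phi_n(d1, d2)
naive = nu_xy * (d1 + d2) / n                           # classical rate, Eq. (3.2) style
improved = (nu_xy * d1 + nu_y * d2 + nu_xy * d2 * phi) / n  # Theorem 2 flavour

print(f"naive change-of-measure rate : {naive:.3f}")     # 1.02
print(f"heterogeneous-shift rate     : {improved:.3f}")  # 0.16
```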
When d2 d1, the bound replaces the dimension term d2νx,y with d2φ(d1, d2)νx,y + νyd2, a strict improvement when νy νx,y and φ(d1, d2) 1. The above bound can be extended to function classes with parametric metric entropy (Remark D.1). In all cases, φn(d1, d2) d2 n , which is still weaker than an idealized version of Theorem 1 where raten,2(G)2 replaces raten, (G)2. 3.3. Refined Measures of Distribution Shift The decomposition in Lemma 3.1 and all subsequent guarantees can be refined considerably. First, we can replace uniform bounds on the density ratios (Definition 2.1) with the following function-dependent quantities: ν1 := sup f F Etest[(f f βf)2] Etrain[(f f βf)2] (3.3) ν2 := sup f F,g G Etest[(g g βf)2] Etrain[(g g βf)2], (3.4) Corollary 3.1. Immediately from Lemma 3.1, it holds that Rtest(f, g) 2(ν1Rtrain[f] + ν2Rtrain(f, g)) Both Theorem 1 and Theorem 2 continue to hold using ν1 and ν2 instead of νx,y and νy. Note that ν1 νx,y and ν2 νy always, but they can be much smaller as demonstrated by the follow upper bounds on ν1. Lemma 3.2. Suppose x y under Ptrain. Then ν1 νx := sup A X Ptest[x A]/ Ptrain[x A]. Lemma 3.3. Assume (a) X is a Hilbert space, (b) the functions f F are linear in x and (c) there are constants ν lin > 0 such that, with βx(y) := Etrain[x | y], Etest[(x βx(y))(x βx(y))H] ν lin Etrain[(x βx(y))(x βx(y))H]. Then, ν1 ν lin. Appendix A.2 proves both lemmas. Importantly, ν lin can be finite even when νx,y is infinite, e.g. if the distribution over x is discrete under Ptrain, but continuous under Ptest. Beyond Uniform Ratios. Eqs. (3.3) and (3.4) can be generalized further to allow for additive error. Corollary 3.2. Suppose that, for all f F and g G, Etest[(f f βf)2] ν1Etrain[(f f βf)2] + 1 Etest[(g g βf)2] ν2Etrain[(g g βf)2] + 2, Then, immediately from Lemma 3.1, Rtest(f, g) 2(ν1Rtrain[f] + ν2Rtrain(f, g) + 1 + 2). This deceptively simple modification allows for situations when the density ratios between the test and train distributions are not uniformly bounded, or possibly even infinite. Appendix A.3 details the many consequences of this observation. We highlight a key one here: Lemma 3.4. Suppose Assm. 2.4 holds. Then, for Rtrain[f], Rtrain(f, g) sufficiently small, Rtest(f, g) 8B p Rtrain[f] χ2(Ptest(x, y), Ptrain(x, y)) Rtrain(f, g) χ2(Ptest(y), Ptrain(y)), where χ2(Ptest(x, y), Ptrain(x, y)) denotes the χ2 divergence (see e.g. Polyanskiy and Wu (2022, Chapter 2)) between the joint distribution of (x, y) under test and train distributions, and χ2(Ptest(y), Ptrain(y)) denotes χ2 divergence restricted to the marginals of y. The above lemma is qualitatively similar to Corollary 3.1 and Lemma 3.1: If Rtrain[f] Rtrain(f, g) (as ensured by our analysis, under appropriate assumptions), then we ensure more resilience to the χ2 divergence between the joint distributions of (x, y) than would naively be expected. 3.4. Comparison with Orthogonal ML The style of our results is similar to those appearing in the literature on Neyman orthogonalization (also referred to as Double/Debiased ML or orthogonal statistical learning) (c.f., Chernozhukov et al., 2017; Mackey et al., 2018; Foster and Syrgkanis, 2019). At a high level, orthogonal ML considers a situation with an unknown pair (f , g ), where we are primarily interested in learning f , referring to g as a nuisance function. The main difference in setting is that orthogonal ML leverages an auxiliary supervision mechanism to learn g , which helps resolve confounding effects. 
Qualitatively, the rates are similar: as with our bound, orthogonal ML allows one to learn f with a quadratic dependence on the complexity G. However quantatively the orthogonal ML bound is typically better in that (a) the complexity of G is measured via the ideal Learning under Heterogenous Distribution Shift raten,2(G)2 compared with our raten, (G)2, and (b) the distribution shift coefficient is νx compared with our νx,y. On the other hand, our bound is somewhat more practical, as it applies to ERM directly and does not require algorithmic modifications or an auxiliary supervision signal. The key difference here is that whereas orthogonal ML aims for inference consistent recovery of f we care only about the prediction error of f + g . Thus, we need not address the identifiability challenges present in orthogonal ML. As a consequence, we bypass algorithmic modifications that typically require more precise modeling of the data generating process, and which typically render orthogonal ML more susceptible to misspecification issues. Finally, we should note that in canonical settings for orthogonal learning, we can show that our main assumption, conditional completeness, holds. In this sense, our work shows that, in typically settings for orthogonal learning, one can obtain similar statistical improvements with ERM alone and without auxiliary supervision. Please see Appendix A for a detailed discussion. 4. Analysis Overview We begin this section with a formal precursor to Theorem 1 in terms of Dudley integrals (Dudley, 1967), stated as Theorem 3. The rest of the section provides an overview of the proof. Section 4.1 contains the necessary preliminaries, notably Rademacher and Gaussian complexities and their associated critical radii. Section 4.2 provides the roadmap for the proof of Theorem 3, focusing on our novel excess risk bound for Rtrain[ ˆfn] in terms of a cross critical radius term. We bound this term in Section 4.3 via a Hölder style inequality for Rademacher complexity. For convenience define the centered classes Fcnt := {f βf f : f F}, Gcnt := {g g + βf : f F, g G} and Hcnt := {f + g (f + g ) : f F, g G}. Formal Main Result. We define the Dudley functional, a standard measure of statistical complexity. Definition 4.1 (Dudley Functional). Let radq(V) := supv V v q,n be the q-norm radius and Mq(V; ) be the metric entropy in the induced ℓq norm (Definition E.1). Given V Rn define Dudley s chaining functional (in the q-norm) as Dn,q(V) := inf δ radq(V) 2δ + 4 n R radq(V) δ p Mq(V; ε/2)dε . Furthermore, given a function class H and letting H[r, w1:n] denotes the empirically localized class (Definition 4.2 below), define the Dudley critical radius δn,D(H, c) := inf n r : supw1:n Dn,2(H[r, w1:n]) r2 We now state the formal version of our main result. We obtain Theorem 1 by bounding the Dudley functionals using standard statistical learning arguments. Theorem 3. Suppose Assms. 2.1 to 2.3 hold. Let σB := max{B, σ}, let ν1, ν2 be as in Eqs. (3.3) and (3.4), and let c1 be a sufficiently small universal constant. Then if δn,D(Hcnt, σB)2 + σ2 B log(1/δ) n c1γ, it holds that probability at least 1 δ, Rtest( ˆfn, ˆgn) is at most ν1 δn,D(Fcnt, σB)2 + supw1:n Dn, (Gcnt[w1:n])2 + ν2 δn,D(Hcnt, σB)2 + (ν1 + ν2)σ2 B log(1/δ) n . 4.1. Learning-Theoretic Preliminaries We state all definitions for a general class of functions H mapping W R. We define two key notions of localized and product classes. Definition 4.2 (Product and Localized Classes). Let H, H : W R. 
Define the (Hadamard) product class H H := {h h : h H, h H }. Given sequence w1:n Wn, define the empirically localized class H[r, w1:n] := {h H : 1 n Pn i=1 h(wi)2 r} and population localized class H(r) := {h H : Etrain[h2] r}. Next, we define the standard Rademacher and Gaussian complexities and associated quantities (c.f., Rakhlin, 2022; Wainwright, 2019; Bartlett et al., 2005). For convenience, we state these quantities for a set of n-length vectors V Rn and then instantiate the definition to obtain function class variants. Definition 4.3 (Rademacher and Gaussian Complexities: Sets). Let n N, and let ε1:n and ξn denote i.i.d. sequences of Rademacher and standard Normal random variables, respectively. The Rademacher and Gaussian complexities of a subset V Rn are defined as Rn(V) := 1 n Eε supv V Pn i=1 εivi and Gn(V) := 1 n Eξ supv V Pn i=1 ξivi. Gaussian and Rademacher complexities of function classes can be defined in terms of Definition 4.3. For example, we may consider Rn(H[w1:n]), or localized variants like Rn(H[r, w1:n]). For the latter, we define the critical radius quantities, which are central to localization arguments in statistical learning (Bartlett et al., 2005). Definition 4.4 (Critical Radii). We define the following worst-case critical radii: δn,R(H, c) := inf r : supw1:n Rn(H[r, w1:n]) r2 2c , δn,G (H, c) := inf r : supw1:n Gn(H[r, w1:n]) r2 The following lemma verifies that the Rademacher and Gaussian complexities are upper bounded by the Dudley functional (the proof is standard, but see also Appendix B.7 for completeness.) Learning under Heterogenous Distribution Shift Lemma 4.1. For any V Rn, we have Gn(V) Rn(V) Dn,2(V), and hence, for all c > 0, δn,R(H, c) δn,G (H, c) δn,D(H, c). 4.2. Proof Overview of Theorem 3. We begin with the following generic upper bound on the joint risk of Rtrain[ ˆfn, ˆgn], proved in Appendix B.2. Proposition 4.2. With probability at least 1 δ, we have that Rtrain[ˆgn; ˆfn] Rtrain( ˆfn, ˆgn) γn(δ)2, where we define γn(δ)2 := δn,R(Hcnt, B)2 + δn,G (Hcnt, σ)2 + (B2+σ2) log(1/δ) n . Hence, ˆgn Gcnt(γn(δ)). The localized Gaussian complexity of the class Hcnt, δn,G (Hcnt, σ)2, appears in the sharpest analyses of ERM. The dependence on δn,R(Hcnt, B)2 is suboptimal in general (see, e.g. Rakhlin (2022, Chapter 21)), but is convenient and essentially sharp in the regime where σ2 B2. However, to take advantage of Eq. (3.1), we require sharper control over Rtrain[ ˆfn]. This involves a novel term, unique to our setting; the cross-critical radius. Definition 4.5 (Cross Critical Radii). Given the classes Fcnt, and another class H, we define δn,cross(Fcnt; H) := inf n r : supw1:n Rn(Fcnt[r, w1:n] H[w1:n]) r2 The cross-critical radius measures the complexity of products f h, where f Fcnt and h H, and thus captures the extent to which h can obfuscate recovery of f . We invoke the cross-critical radii with H = Gcnt(γ), for some γ. It is crucial that the localization r > 0 is only on the class Fcnt and not on the class H. With the cross-critical radius in hand, the following is proved in Appendix B.3. Proposition 4.3. Suppose that (F, G) satisfy γ-conditional completeness. Then, whenever Rtrain[ˆgn; ˆfn] γ, the following holds with probability at least 1 δ, Rtrain[ ˆfn] δn,cross(Fcnt; Gcnt(γ))2 + δn,R(Fcnt, B)2 + δn,G (Fcnt, σ)2 + (σ2 + B2) log(1/δ) Upper bounding Rademacher and Gaussian critical radii by the Dudley radius, and using σB = max{σ, B}, we have that with probability 1 δ/2, Rtrain[ˆgn; ˆfn] δn,D(σB, Hcnt)2 + σ2 B log(1/δ)/n. 
Hence, if δn,D(σB, Hcnt)2 + σ2 B log(1/δ)/n c1γ for a small enough c1 > 0, Proposition 4.3 yields that with probability 1 δ/2, Rtrain[ ˆfn] is at most δn,cross(Fcnt; Gcnt(γ))2 + δn,D(Fcnt, σB)2 + σ2 B log 1 δ n δn,cross(Fcnt; Gcnt)2 + δn,D(Fcnt, σB)2 + σ2 B log 1 Invoking Corollary 3.1 and summing these bounds, with probability at least 1 δ, Rtest( ˆfn, ˆgn) is at most ν1(δn,cross(Fcnt; Gcnt)2 + δn,D(Fcnt, σB)2) + ν2δn,D(σB, Hcnt)2 + (ν1+ν2)σ2 B log(1/δ) n , (4.1) The key step is to now apply Corollary 4.1, stated in the following section, to upper bound the cross critical radius: δn,cross(Fcnt; Gcnt)2 δn,D(Fcnt, 2B)2 + 16 supw1:n Dn, (Gcnt[w1:n])2. By Lemma C.4 and the bound B σB, δn,D(Fcnt, 2B)2 δn,D(Fcnt, B)2 δn,D(Fcnt, σB)2. Combining with Eq. (4.1) concludes the proof of Theorem 3. 4.3. Controlling Cross Critical Radius via a Hölder-Inequality for Dudley s integral Recall Lemma 4.1, which restates the well-known fact that the Rademacher and Gaussian complexities of a function class can be upper bounded by Dudley functional defined in Definition 4.1 (c.f., Dudley, 1967; Wainwright, 2019, Chapter 5). We establish a Hölder style generalization of this upper bound. In what follows, given p, q [2, ], we say (p, q) are square Hölder conjugates if (p/2, q/2) are regular Hölder conjugates, i.e. 2 p + 2 q = 1. Examples include (p, q) = (2, ), (p, q) = (4, 4), and p, q = ( , 2). If v, u Rn are two vectors, then Hölder s inequality implies that for any square Hölder conjugates p, q, v u 2,n v 2,p u q,n. It may be tempting to generalize Lemma 4.1 to product classes via Rn(V U) Dn,p(V) Dn,q(U), (4.2) where we recall V U := {v u, v V, u U}. Our key result is that Eq. (4.2) can be sharpened considerably. The following technical result is proved in Appendix B.6. Proposition 4.4 (Dudley Estimate for Hadamard Products (Sets)). Let p, q [2, ] satisfy 1/p+1/q 1/2, and (for simplicitly) suppose 0 V W. Then, Rn(V U) radq(U)Dn,p(V) + radp(V)Dn,q(U). The same bound holds for Rn replaced by any process defined where the εi are 1-sub Gaussian variables (e.g. Gaussian complexity Gn). Notice that rather than having Dn,p(V) and Dn,q(U) multiply each other as in Eq. (4.2), each is only multiplied by the (Hölder square conjugate) radius term. This is in general considerably sharper, as typically radp(V) Dn,p(V) unless V is exceedingly small. By taking U = {(1, 1, . . . , 1) Rn}, Proposition 4.4 implies the standard Dudley bound, Lemma 4.1, as a corollary (see Appendix B.7). We now use the above proposition to upper bound the cross-critical radius. Learning under Heterogenous Distribution Shift Corollary 4.1 (Generic Cross-Critical Radius Bound). For any class H, it holds that δn,cross(Fcnt; H)2 δn,D(Fcnt, 2B)2 + 16 supw1:n Dn, (H[w1:n])2. Proof of Corollary 4.1. Recall the defintion of δn,cross (Definition 4.5) By Proposition 4.4 with (p, q) = (2, ), Rn(Fcnt[r, w1:n] H[w1:n]) rad2(Fcnt[r, w1:n])Dn, (H[w1:n]) + rad (H)Dn,2(Fcnt[r, w1:n]) r Dn, (H[w1:n]) + BDn,2(Fcnt[r, w1:n]), where we use that rad2(Fcnt[r, w1:n]) r by the definition of localization. In particular, if r satisfies r2 4 B supw1:n Dn,2(Fcnt[r, w1:n]) and r 4 supw1:n Dn, (H[w1:n]), then δn,cross(Fcnt; H) r. Thus, δn,cross(Fcnt; H) is at most the maximum of inf{r : supw1:n Dn,2(Fcnt[r, w1:n]) r2 4B } and 4 supw1:n Dn, (H[w1:n]), which is at most the maximum of inf{r : supw1:n Dn,2(Fcnt[r, w1:n]) r2 4B } = δn,D(Fcnt, 2B) and 4 supw1:n Dn, (H[w1:n]). The bound follows by squaring. Remark 4.1. 
Because we consider the -norm Dudley integral of Gcnt, it is hard to take advantage of localization of Gcnt at Gcnt(γ) in the L2 norm. The absence of localization leads to the suboptimal dependence on leading constants compared to what is obtained through Double ML (Foster and Syrgkanis, 2019). Appendix D shows that, for finite-function classes, one can take advantage of localization with strong hypercontractivity assumptions. 5. Experiments In this section we present experiments to validate our theoretical findings and demonstrate how the conceptual takeaways that predictive models are more resilient to distribution shifts in simple features applies to a broad range of practical settings. All of our experiments have a similar form: we (a) identify simple and complex features and justify these choices and (b) measure how the performance of a predictive model changes with distribution shifts in these features. We experiment with neural network models on tasks ranging from synthetic regression problems to computer vision benchmarks to imitation learning in a robotics simulator. We take the following operational definition of simplicity: Feature 1 is simpler than feature 2 if the generalization error, without distribution shift, on a predictive task involving feature 1 is smaller than that for an analogous task involving feature 2. We believe that this is the correct empirical correlate for the complexity measures adopted by our theoretical results. Across domains, we consistently find that predictive models are more resilient to shifts in simpler features, thus defined. We now summarize the experimental results, deferring details to Appendix F. In all experiments, we report average performance and standard error across 4 replicates. Synthetic Regression with Additive Structure. To closely mirror our theory, we predict z = f (x) + g (y) from an input (x, y), where f : Rdx R and g : Rdy R are randomly initialized 2-layered multi-layer perceptions (MLP) having hidden dimensions df and dg, respectively. We sample x and y, respectively, independently from Gaussian mixtures px = N(1dx, I) + (1 px)N( 1dx, Idx) and py = N(1dy, Idy) + (1 py)N( 1dy, Idy) where px, py are mixing probabilties, and 1d Rd and I Rd d are the all ones vector and identity matrix. To make x simpler, we either have dx < dy or df < dg: Appendix F.1 shows that the auxiliary task of predicting f (x) has lower generalization error than that of predicting g (y). We train a predictor ˆz = fθ(x) + gθ(y) to minimize mean-square error (MSE) under a training distribution with px = py = .01. We then measure the MSE on shifted distributions, where we hold the mixing probability of one of {x, y} fixed, and vary the other in the range {0.1, 0.2, 0.5, 0.9, 0.99}. Corroborating our theory, MSE declines less with shift in px than with shift in py (Figure 1). Appendix F.1 shows similar results with predictor ˆz = hθ(x, y) (using concatenated features (x, y)) and contains further implementation details. Waterbird & Functional Map of World datasets. We next test our hypothesis on the paradigmatic Waterbird dataset (Sagawa et al., 2019). The predictive task is to classify images of birds as waterbirds or landbirds, against a background of either land or water. We consider birdtype (resp. background) as the complex (resp. simple) feature. 
This choice is both intuitive and consistent with our definition of simplicity: Appendix F.2 shows that, in the absense of distribution shift, the auxiliary task of predicting background has lower generalization error than that of predicting birdtype. We then test (Figure 1) the prediction accuracy of birdtype under shifts in the proportions of background and birdtype, finding prediction accuracy degrades less for the former than the latter. See Appendix F.2 for details. Appendix F.3 applies the same methodology to the Functional Map of World (FMo W) dataset (Koh et al., 2021), with similar findings. Logical operators on Celeb A dataset. The Celeb A dataset (Liu et al., 2015) consists of celebrity faces labeled with the presence or absence of 40 different attributes (e.g., baldness, mustache). Here we re-purpose Celeb A to learn logical operators OR and XOR for two at- Learning under Heterogenous Distribution Shift Waterbird Classification Logical operators on Celeb A Imitation on pusher arm Synthetic Regression mt Q=z = f?(x) + g?(y) Figure 1. Testing resiliency of learned predictive models. We calculate the performance of learned models as we shift the distribution of either complex features (orange color) or simple features (blue color). To shift distribution of {x, y} in synthetic regression, we vary the mixing probabilities of their distributions. To shift distribution of a feature in Waterbirds and Celeb A datsets, we vary its proportion. To shift distribution of a feature in robotic pusher arm control, we increase standard deviation of its sampling distribution. Corroborating our theoretical expectations, we show that these models are more robust to distribution shift in simple features. tributes. We first train and test a multi-head binary classifier that detects presence of 40 different attributes, with one head per attribute, on images from the Celeb A standard training set (Celeb A-STS). We select bald as simple due to its low generalization error and big_lips as complex due to its larger generalization error. In Figure 1, we predict targets f OR = bald OR big_lips and f XOR = bald XOR big_lips, training on Celeb A-STS, but testing on distributions where the proportions of bald and big_lips are varied (Appendix F.4). Results show greater resilience to shift in the simpler feature bald than the complex feature big_lips. Further details are deferred to Appendix F.4, where we perform the same experiment for (pale_skin, narrow_eyes) with similar findings. Imitation Learning for Pusher Control. In Pusher Control simulator, we learn an agent that controls a robotic arm to push an object to its goal location. We fix the goal and starting object locations across episodes. We vary object mass m (m N(60, 15)) and joint dampness d (d N(0.5, 0.1)) of the robot. The agent observes {m, d} at beginning of every episode and aims to learn an optimal policy condition on these variables. To determine the simpler feature, we measure generalization error on the auxiliary task of predicting next-step dynamics where one of {m, d} is held fixed and the other is drawn from a distribution that is fixed across training and testing. This methodology ascribes m as the simpler feature and d as complex (Appendix F.5). We train a policy πθ(a|s, m, d) using 1000 expert trajectories where each trajectory contains a new m and d sampled from N(60, 15) and N(0.5, 0.1) respectively. 
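The following is a minimal behavior-cloning sketch of the training step just described. The network width, the MSE regression loss, the optimizer, and the state/action dimensions are illustrative assumptions rather than the exact setup used for the pusher experiments.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; expert (states, params, actions) tensors are assumed
# to be loaded elsewhere, with params holding the per-episode (m, d) pair.
state_dim, act_dim = 20, 7

policy = nn.Sequential(
    nn.Linear(state_dim + 2, 256), nn.ReLU(),   # +2 for the (m, d) conditioning
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_step(states, params, actions):
    """One behavior-cloning step: regress expert actions from (s, m, d)."""
    pred = policy(torch.cat([states, params], dim=-1))
    loss = nn.functional.mse_loss(pred, actions)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Smoke test with random tensors standing in for a batch of expert data.
s = torch.randn(64, state_dim); p = torch.randn(64, 2); a = torch.randn(64, act_dim)
print(bc_step(s, p, a))
```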
We then shift distribution of m and d by increasing their standard deviation, one at a time while keeping the distribution of the other factor fixed, and test policy πθ(a|s, m, d). We show the results in Figure 1 and observe that the success rate of πθ(a|s, m, d) deteriorates less when we shift distribution of m while keeping distribution of d fixed. Thus, we show that the policy is more resilient to distribution shift in the simpler feature. Appendix F.5 contains further details, including the precise distributions used to determine m as the simpler feature. 6. Discussion This paper sheds new light on the issue of spurious correlation that arises when considering out of distribution generalization. We discover that predictive models are more resilient to distribution shift in simpler features, which we capture via notions of statistical capacity in our experiments and via generalization when predicting the feature itself in our experiments. We find that, in most of our experiments, this latter operational notion is predictive of how deep learning models behave under heterogeneous distribution shift. We hope that our work inspires future efforts toward a fine-grained theoretical and experimental understanding of distribution shift in modern machine learning. Acknowledgements MS acknowledges support from Amazon.com Services LLC grant; PO# 2D-06310236. AA and PA acknowledge supported from a DARPA Machine Common Sense grant, a MURI grant from the Army Research Office under the Cooperative Agreement Number W911NF-21-1-0097, and an MIT-IBM grant. The authors thank Adam Block for his assistance in navigating the relevant learning theory literature. Learning under Heterogenous Distribution Shift Anurag Ajay, Abhishek Gupta, Dibya Ghosh, Sergey Levine, and Pulkit Agrawal. Distributionally adaptive meta reinforcement learning. In Advances in Neural Information Processing Systems, 2022. Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ar Xiv:1907.02893, 2019. Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2002. Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. The Annals of Statistics, 2005. Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013. Olivier Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique, 2002. Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2002. Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. Double/debiased/Neyman machine learning of treatment effects. American Economic Review, 2017. Kefan Dong and Tengyu Ma. First steps toward understanding the extrapolation of nonlinear models to unseen domains. In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=7wrq3v Hc MM. Richard M Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1967. Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. ar Xiv:1901.09036, 2019. Abhishek Gupta, Russell Mendonca, Yu Xuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. 
Advances in Neural Information Processing Systems, 2018. Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. ar Xiv:1812.05905, 2018. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016. Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. ar Xiv:1704.04861, 2017. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv:1412.6980, 2014. Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavnass, Wei Guo, Berton Earnshaw, Imran Haque, Sara M Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, 2021. Qi Lei, Wei Hu, and Jason Lee. Near-optimal linear regression under distribution shift. In International Conference on Machine Learning, pages 6164 6174. PMLR, 2021. Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Localization through offset rademacher complexity. In Conference on Learning Theory, 2015. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision, 2015. Cong Ma, Reese Pathak, and Martin J Wainwright. Optimally tackling covariate shift in RKHS-based nonparametric regression. ar Xiv:2205.02986, 2022. Lester Mackey, Vasilis Syrgkanis, and Ilias Zadik. Orthogonal machine learning: Power and limitations. In International Conference on Machine Learning, 2018. Shahar Mendelson. Learning without concentration. Journal of the ACM, 2015. John P Miller, Rohan Taori, Aditi Raghunathan, Shiori Sagawa, Pang Wei Koh, Vaishaal Shankar, Percy Liang, Yair Carmon, and Ludwig Schmidt. Accuracy on the line: on the strong correlation between out-ofdistribution and in-distribution generalization. In International Conference on Machine Learning, 2021. Learning under Heterogenous Distribution Shift Reese Pathak, Cong Ma, and Martin Wainwright. A new similarity measure for covariate shift with applications to nonparametric regression. In International Conference on Machine Learning, 2022. Yury Polyanskiy and Yihong Wu. Information theory: From coding to learning, 2022. Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. arxiv:1908.05659, 2019. Alexander Rakhlin. IDS.160 Mathematical Statistics: A non-asymptotic approach, 2022. URL http://www.mit.edu/~rakhlin/courses/ mathstat/rakhlin_mathstat_sp22.pdf. Peter M Robinson. Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, 1988. Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. 
ar Xiv preprint ar Xiv:1911.08731, 2019. Shibani Santurkar, Dimitris Tsipras, and Aleksander Madry. Breeds: Benchmarks for subpopulation shift. ar Xiv:2008.04859, 2020. Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. Advances in Neural Information Processing Systems, 2018. Aman Sinha, Hongseok Namkoong, and John C. Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018. Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision, 2016. Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 2020. Aad W van der Vaart and Jon A Wellner. Weak convergence. Springer, 1996. Vladimir Vapnik. Estimation of dependences based on empirical data. Springer Science & Business Media, 2006. Roman Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge University Press, 2018. Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019. Kai Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: The role of image backgrounds in object recognition. ar Xiv preprint ar Xiv:2006.09994, 2020. Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550 559. PMLR, 2020. Tengyang Xie, Dylan J Foster, Yu Bai, Nan Jiang, and Sham M Kakade. The role of coverage in online reinforcement learning. ar Xiv preprint ar Xiv:2210.04157, 2022. Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. Learning under Heterogenous Distribution Shift A. Discussion of Assumptions In this section we provide additional details about our main assumptions, conditional completeness. As this is closely related to orthogonal ML, we also provide some additional context and comparisons. A.1. On Conditional Completeness Our results hinge on the conditional completeness assumption, where G is expressive enough to capture the conditional bias functions βf. To clarify when this assumption may hold, we discuss an example. Partially Linear Regression. First, let us consider a paradigmatic model in econometrics, statistics, and causal inference. This model, known as the partial linear regression (PLR) model (Chernozhukov et al., 2017; Robinson, 1988), specifies a joint distribution over tuples (z, x, y) via the structural equations: z = x, f + h1(y) + ε x = h0(y) + τ y P, E[ε | x, y]= 0, E[τ | y]= 0 Here x belongs to a (finite dimensional) vector space, say Rd, and f is a linear function, so we use f both to describe the mapping and the vector itself. In the learning setting for PLR, we are given access to function class H0, H1 such that h0 H0, h1 H1, and we assume that f has bounded norm, say f B. This model is amenable to our techniques whenever H1 consists of linear projections of H0, i.e., H0 := {y 7 v, h1(y) : v Rd, h1 H1}. Clearly we can apply ERM with F as the linear class and G = H0. But we must verify that conditional completeness holds. 
This follows because, for any f, we have βf(y) := E[ f f , x | y]= f f , E[x | y] = f f , h0(y) G. Thus, our results demonstrate favorable guarantees for estimating f via ERM in this setting. Remark A.1 (Compatibility of conditional completeness and boundedness). If conditional-completeness were stipulated as a global condition, i.e. for all (f, g) F G, g βf G, then necessarily (f, g kβf) F G for all k N. Therefore, G would not in general consist only of functions which are uniformly bounded (except in the special case where βf = 0 for all f F). By imposing the restriction Rtrain(f, g) γ2, we avoid this pathology because, unles βf 0, limk Rtrain(f, g kβf) = . A.2. Proof of Lemmas 3.2 and 3.3 Proof of Lemma 3.2. Suppose that if x y under Ptrain. Then βf(y) = E[f(x) f (x) | y = y] = E[f(x) f (x)] is a constant in y. ν1 := sup f F Etest[(f f βf)2] Etrain[(f f βf)2] sup f F,c R Etest[(f f c)2] Etrain[(f f c)2] Ptest[x A] Ptrain[x A] = νx, as the functions f f c are functions of x alone. Proof of Lemma 3.3. By linearity (assumption (a) of Lemma 3.3, we can write f(x) = v, x and f (x) = v , x for Learning under Heterogenous Distribution Shift some v, v H. Then, setting βx(y) = Etrain[x | y], Etest[(f(x) f (x) βf(x))2] = Etest[ v v , x βx(y) 2] ν lin Etrain[ v v , x βx(y) 2] (Lemma 3.3, assumption (c)) = ν lin Etrain[(f(x) f (x) Etrain[f(x) f (x) | y] | {z } =βf (y) A.3. Beyond Uniform Density Bounds In this section, we consider a generalization of Eqs. (3.3) and (3.4) to allow for additive slack. We show that this allows for generalizations of Lemma 3.1 and Corollary 3.1 which accomodate density ratios which are possibly unbouned, or even take the value with positive probability (Appendix A.3.1). Finally, we show that our guarantees imply guarantees when the χ2 divergence (or more generally, power divergence) are bounded (Appendix A.3.2), thereby establishing the proof of Lemma 3.4. We begin by restating Corollary 3.2. Corollary 3.2. Suppose that, for all f F and g G, Etest[(f f βf)2] ν1Etrain[(f f βf)2] + 1 Etest[(g g βf)2] ν2Etrain[(g g βf)2] + 2, Then, immediately from Lemma 3.1, Rtest(f, g) 2(ν1Rtrain[f] + ν2Rtrain(f, g) + 1 + 2). Though seemingly simple, the accomodation of additive slack is deceptively flexible. Given probability laws P, Q on a space X = {X X}, recall their Radon-Nikodym derivative (see, e.g. Polyanskiy and Wu (2022, Chapter 2)) d P(X)/d Q(X) (which may take value ). In the language of Radon-Nikodym derivatives, the worst-case density ratios νx,y and νy in Definition 2.1 can be defined as νx,y := sup x,y d Ptest(x, y) d Ptrain(x, y), νy := sup y d Ptest(y) d Ptrain(y). (A.1) The following corollary, whose proof we give in Appendix A.3.3, shows that we can replace the dependence on the worstcase density ratios in Lemma 3.1 with a bound that depends only on the tails of those density ratios: Corollary A.1. Recall the boundedness assumption Assm. 2.4, such that all of |f|, |g|, |f |, |g | are uniformly at most B in magntinude. For any functions f F and g G, it holds that Rtest(f, g) inf t1,t2>0 2(t1Rtrain[f] + t2Rtrain(f, g) + 16B2 x,y(t1) + 16B2 y(t2)), where we define x,y(t) := P(x,y) Ptest d Ptest(x, y) d Ptrain(x, y) > t , y(t) := Py Ptest d Ptrain(y) > t . Remark A.2. 
As in Section 3.3, the above bound can be refined in a function-class dependent fashion by replacing the density ratio tail-bounds in the definitions of x,y(t) and y(t) with the restriction of these density ratios to the σ-algebra generated by the function classes {(x, y) 7 f(x) f (x) βf(y) : f F} and {y 7 g(y) g (y) βf(y) : f F, g G}. When ν1 and ν2, as defined in Eqs. (3.3) and (3.4), are bounded, then taking t1 = ν1 and t2 = ν2 recovers the bounds obtained in Section 3.3. A.3.1. EXAMPLES WITH UNBOUNDED DENSITY RATIOS. We give two simple examples demonstrating how the bound in Corollary A.1 can be finite even when νx,y and νy are not. Our first illustrative example shows that Corollary A.1 can be finite even if there are values of y for which the density ratios are infinte. Learning under Heterogenous Distribution Shift Example A.1 (Infinite Density Ratios). Consider a discrete setting where, for simplicity, x = 1 is deterministic, and y [n + 1] = {1, . . . , n + 1}. Suppose that y under Ptrain is distributed uniformly on [n], and uniformly on [n + 1] under Ptest. For discrete distributions, density ratios are just ratios of probabilities: Ptest(x, y) Ptrain(x, y) = d Ptest(y) d Ptrain(y) = ( n n+1 y [n] y = [n + 1]. (A.2) Thus, the density ratios are not uniformly bounded, and may indeed take the value . Still, Py Ptest[y = n + 1] = 1 n, so Corollary A.1 with t1 = t2 = n n+1 1yields Rtest(f, g) 2(Rtrain[f] + Rtrain(f, g)) + 64B2 n + 1, (A.3) which is not only not infinite, but indeed decays with n. Our second example shows the sample can happen even if the density ratios are finite, but unbounded. Example A.2. Let γ1, γ2 (0, 1). Again, consider deterministic x = 1 and discrete y Z 0 with geometric distributions. Ptrain[y = k] = (1 γ1)γk 1, Ptest[y = k] = (1 γ2)γk 2. Then, d Ptest(y = k) d Ptrain(y = k) = 1 γ2 When γ2 > γ1, supy d Ptest(y) d Ptrain(y) = . Still, Corollary A.1 is non vacuous, yielding Rtest(f, g) inf t>0 2t(Rtrain[f] + Rtrain(f, g)) + 64B2 Ptest With some algebra1, we can compute Rtest(f, g) inf t>0 2t(Rtrain[f] + Rtrain(f, g)) + 64B2 t 1 γ1 α , α := log(1/γ2) log γ2/γ1 > 0 An interesting feature of this example is that, even thought the density ratio in Eq. (A.4) appears to grow exponentially k, the tradeoff in t is polynomial due to the exponential decay of y Ptest. A.3.2. CONSEQUENCES FOR CERTAIN f-DIVERGENCES, AND PROOF OF LEMMA 3.4 We show that Corollary A.1 can be instantiated for a subclass of f-divergences, which include the χ2-divergence (with arguments ordered appropriately) as a special case. To avoid confusion with our function class f, we shall replace f with the functions φ : R>0 R 0. Definition A.1 (φ-divergence, Chapter 2 in (Polyanskiy and Wu, 2022)). Given measures P, Q on the same probability space X = {X}, and φ : R>0 R, we define Dφ(Q, P) := Z where d Q(X) and d P(X) denote the Radon-Nikodym derivatives of Q and P, respectively, evaluated at X X. By convention, it is typically required that φ is convex, φ(1) = 0, and that φ(0) is defined via φ(0) = limt 0+ φ(t). Observe that if the function φ satisfies φ + M is non-negative for some M R, Markov s inequality implies > u Dφ(Q, P) 1The missing steps are as follows. We have Ptest h 1 γ2 1 γ1 γ2 γ1 y > t i = Ptest y > log( 1 γ1 1 γ2 )+log t , which can be bounded using the distribution of y under Ptest[y > u] Ptest[y u] = Ptest[y u ] (γ2)u = exp(u log γ2). 
Learning under Heterogenous Distribution Shift And, if in addition φ(u) is strictly decreasing, n d P(X) d Q(X) > t o = n d Q(X) d P(X) < 1 t o = n φ( d Q(X) d P(X) ) < φ( 1 t ) o . Thus, d Q(X) > t Dφ(Q, P) φ(1/t) , φ(1/t) M. (A.4) Then, Corollary A.1 directly implies the following consequence. Corollary A.2. Consider any non-negative and strictly decreasing functions φ1, φ2 : R>0 [ M, ); the other axioms of the φ-divergence need not be met. Then, for any t1, t2 > 0 for which φ1(1/t1), φ2(1/t2) M, Rtest(f, g) 2(t1Rtrain[f] + t2Rtrain(f, g)) + 32B2 Dφ1(Ptrain(x, y), Ptest(x, y)) φ1(1/t1) + Dφ2(Ptrain(y), Ptest(y)) where Dφ1(Ptrain(x, y), Ptest(x, y) denotes the φ1-divergence between the joint distribution of (x, y) under Q = Ptrain and P = Ptest, and Dφ2(Ptrain(y), Ptest(y)) the φ2-divergence between the marginal of y under these measures. We show now describe two special cases of interest. Example A.3 (Power Divergences). A special case is when φi(t) = 1 t αi i 1, αi > 0,2 then for any t1, t2 1, Rtest(f, g) 2(t1Rtrain[f] + t2Rtrain(f, g)) + 32B2 Dφ1(Ptrain(x, y), Ptest(x, y)) tα1 1 + Dφ2(Ptrain(y), Ptest(y)) We obtain Lemma 3.4 as a special case: Proof of Lemma 3.4. An archetypical example of the special case of power-divergences described above is φ1(u) = φ2(u) = 1 u 1. Then, it can be shown that Dφ(Q, P) = χ2(P, Q), where χ2( , ) denotes the χ2 divergence (note the reversed order of the arguments). Thus, specializing Example A.3 further yields that, for any t1, t2 1, Rtest(f, g) 2(t1Rtrain[f] + t2Rtrain(f, g)) + 32B2 χ2(Ptest(x, y), Ptrain(x, y)) t1 + χ2(Ptest(y), Ptrain(y)) When Rtrain[f], Rtrain(f, g) are sufficiently small relative to the above χ2( , ) divergences, we can minimize over t1, t2 without the t 1 constraint: Rtest(f, g) 8B p Rtrain[f] χ2(Ptest(x, y), Ptrain(x, y)) + 8B p Rtrain(f, g) χ2(Ptest(y), Ptrain(y)). A.3.3. PROOF OF COROLLARY A.1 We begin with a standard change-of-measure bound. Lemma A.1 (Change of Measure). For any measurable, bounded function h : X [0, M], EX P[h(X)] inf t>0 M PX P d Q(X) > t + t EX Q[h(X)]. 2Note that this ensures that φi( ) is strictly decreasing, φi + 1 0, as well as the f-divergence axioms φi(1) = 0 and φi is convex. Learning under Heterogenous Distribution Shift Proof of Lemma A.1. Fix an event E X, and suppose that 0 h( ) M. EX P[h(X)] EX P[h(X)I{E}] + EX P[h(X)I{Ec}] EX P[h(X)I{E}] + M P[Ec] = M P[Ec] + Z h(X)I{E}d P(X) = M P[Ec] + Z h(X)I{E}d Q(X) d P(X) M P[Ec] + Z h(X)I{E}d Q(X) sup X E M P[Ec] + EX Q[h(X)] sup X E To conclulde, we take E := { d P(X) Proof of Corollary A.1. We aim to establish the following: Etest[(f f βf)2] ν1Etrain[(f f βf)2] + 1 (A.5) Etest[(g g βf)2] ν2Etrain[(g g βf)2] + 2, (A.6) In view of Corollary 3.2, it suffices to check that for any t1, t2 > 0, Eqs. (A.5) and (A.6) hold with (ν1, ν2) (t1, t2), and ( 1, 2) (16B2 x,y(t1), 16B2 y(t2))). Consider the functions of the form h1(x, y) = (f(x) f (x) βf(y))2 and h2(y) = (g(y) g (y) βf(y))2. By Assm. 2.4, it holds that the image of both functions lies in [0, 16B2]. The result now follows by applying Lemma A.1 with M 16B2 to the functions h1 (resp h2) with t t1 (resp. t t2). A.4. Correspondence with Double ML As we have mentioned, our results have a similar flavor to those in the literature on Neyman orthogonalization. In this section, we expand on this comparison, following the treatment in (Foster and Syrgkanis, 2019). As terminology, we refer to these idea broadly as orthogonal (statistical) learning. 
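Before expanding on the orthogonal-learning comparison, here is a short numerical illustration of the χ²-based bound derived in Appendix A.3.2 above (proof of Lemma 3.4). The discrete distributions, the constant B, and the two training-risk values below are illustrative placeholders, not quantities from the paper.

```python
import numpy as np

def chi2(p, q):
    """chi^2(P, Q) = sum_x (p(x) - q(x))^2 / q(x) for discrete P, Q with q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum((p - q) ** 2 / q)

# Illustrative discrete joint distributions over a 3 x 4 grid of (x, y) values.
rng = np.random.default_rng(0)
P_train_xy = rng.dirichlet(np.ones(12)).reshape(3, 4)
P_test_xy = rng.dirichlet(np.ones(12)).reshape(3, 4)

chi2_xy = chi2(P_test_xy.ravel(), P_train_xy.ravel())         # chi^2 of the joint
chi2_y = chi2(P_test_xy.sum(axis=0), P_train_xy.sum(axis=0))  # chi^2 of the y-marginal

# Placeholder values for the boundedness constant and the two training risks.
B, R_train_f, R_train_fg = 1.0, 1e-3, 1e-2
bound = 8 * B * np.sqrt(R_train_f * chi2_xy) + 8 * B * np.sqrt(R_train_fg * chi2_y)
print(f"chi2 joint = {chi2_xy:.3f}, chi2 y-marginal = {chi2_y:.3f}, bound = {bound:.4f}")
# The joint chi^2 multiplies the (small) risk R_train[f] attached to the simple class,
# while only the y-marginal chi^2 multiplies R_train(f, g), as in Lemma 3.4.
```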
The orthogonal learning setup describes a similar situation in which a pair (f*, g*) is unknown, but we are primarily interested in f*, referring to g* as a nuisance function. For example, we may have E[z | x, y] = f*(x) + g*(y) as in our setup, though orthogonal learning is more general in this respect. Unlike our setting, however, orthogonal learning requires some auxiliary mechanism or data to learn ĝ such that E[(ĝ(y) − g*(y))²] ≲ rate_n(G). Given this initial estimate, orthogonal learning describes an algorithm and conditions under which one can learn f̂ satisfying E[(f̂(x) − f*(x))²] ≲ rate_n(F) + rate_n(G)². Here rate_n(·) should be thought of as the standard "fast" rate for learning with the function class, i.e., rate_n(F) ≍ log|F|/n when |F| < ∞. This guarantee naturally leads to a distribution shift bound of the form

R_test(f̂, ĝ) ≲ ν_x (rate_n(F) + rate_n(G)²) + ν_y rate_n(G).

As with our bound, the complexity of the class G does not interact with the distribution shift ν_x in a significant way. Indeed, the error rate for f̂ has only a quadratic dependence on that for ĝ, which is typically lower order in distribution shift settings. Thus, at a conceptual level, orthogonal learning can provide a similar robustness to heterogenous distribution shifts as our results.

Quantitatively, the bound should be compared with our Theorem 3. The general takeaway is that our bound is worse in at least two respects. First, the quadratic dependence on the rate for G cannot exploit localization in our setup, so our bound is weaker when G is small. This is why we need hypercontractivity conditions to obtain favorable rates when G is finite/parametric, conditions which are not required for orthogonal learning. Second, when considering distribution shift, we incur a dependence on ν_{x,y} rather than just ν_x. This arises from the identifiability issues inherent in our setting, which are resolved in the orthogonal learning setting by the auxiliary mechanism for estimating g*.

On the other hand, our results compare favorably at a qualitative level. Most importantly, our bound applies to ERM directly, while orthogonal learning requires algorithmic modifications, which in turn require more modeling of the data-generating process. Additionally, while we do not believe the assumptions are formally comparable, we view ours as somewhat more practical: it is rather uncommon that auxiliary information for estimating the nuisance parameter is available, yet in canonical settings for orthogonal learning we can show that conditional completeness holds. One such example of the latter is the PLR model above.

B. Proof of Main Technical Results

This section provides the proofs of the most significant technical results in the paper. Specifically, Appendix B.1 gives the proofs of the excess error decompositions in Lemma 3.1. Appendix A.2 establishes Lemmas 3.2 and 3.3, which refine the upper bounds on the distribution-shift term ν1. Next, we prove Proposition 4.2, which upper bounds R_train[f, g] and thus, in view of Lemma 3.1, R_train[g; f]. Appendix B.3 gives the proof of Proposition 4.3, which provides refined control of R_train[f]; the key step is an (empirical) excess-risk decomposition, Lemma B.3, which we prove in Appendix B.5. Finally, Appendix B.6 establishes our Hölder inequality for Rademacher complexities of Hadamard product classes (Proposition 4.4). Lastly, Appendix B.7 derives the standard Dudley integral bound from the aforementioned proposition for product classes.
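Before turning to the proofs, the following is a minimal partialling-out sketch that makes the two-stage structure of orthogonal learning from Appendix A.4 concrete, for a partially linear model E[z | x, y] = ⟨θ*, x⟩ + g*(y) in the spirit of the PLR example referenced above. The data-generating process, the auxiliary split for the nuisance estimates, and the group-mean nuisance regressions are all illustrative choices; this is not the ERM procedure analyzed in this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_aux, n_main = 3, 5000, 5000
theta_star = np.array([1.0, -2.0, 0.5])
g_star = lambda y: np.sin(y) + 0.3 * y              # nuisance function of the complex feature

def sample(n):
    y = rng.integers(0, 10, size=n)                 # discrete complex feature
    x = 0.5 * y[:, None] + rng.normal(size=(n, d))  # x correlated with y
    z = x @ theta_star + g_star(y) + 0.1 * rng.normal(size=n)
    return x, y, z

# Stage 1: learn the nuisance conditional means E[z | y] and E[x | y] on auxiliary data.
xa, ya, za = sample(n_aux)
mz = np.array([za[ya == k].mean() for k in range(10)])
mx = np.vstack([xa[ya == k].mean(axis=0) for k in range(10)])

# Stage 2: partial out the nuisance on fresh data and regress residuals on residuals.
x, y, z = sample(n_main)
rz = z - mz[y]
rx = x - mx[y]
theta_orth = np.linalg.lstsq(rx, rz, rcond=None)[0]

# Naive baseline: regress z on x directly, ignoring the nuisance g*(y).
theta_naive = np.linalg.lstsq(x, z, rcond=None)[0]

print("theta*        ", theta_star)
print("orthogonalized", np.round(theta_orth, 3))    # close to theta*
print("naive         ", np.round(theta_naive, 3))   # biased by the x-y correlation
```

The two stages mirror the discussion above: the nuisance is fit from auxiliary data, and the error in that estimate enters the second-stage estimate only at second order.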
B.1. Proof of Lemma 3.1 Write β = βf for simplicity. For any environment e and any triplet (f, g, β), the polarization identity yields Re(f, g) = Ee[(f + g f g )2] = Ee[(f f β + g g + β)2] = Ee[(f f β)2] + Ee[(g g + β)2] + 2Ee[(f f β)(g g + β)] = Re[f] + Re[g; f] + 2Ee[(f f β)(g g + β)]. For any e (in particular, e = test), we have Ee[(f f β)(g g + β)] Ee[(f f β)2]1/2 Ee[(g g + β)2]1/2 Ee[(f f β)2] + Ee[(g g + β)2] = Re[f] + Re[g; f]. Re(f, g) 2(Re[f] + Re[g; f]). When e = train, the fact that β(y) = Etrain[(f f )(x) | y] implies Etrain[(f f β)(g g + β)] = Etrain[((f f )(x) β(y)) (g g + β)(y) | y] = Etrain[(Etrain[(f f )(x) | y] β(y)) (g g + β)(y)] = 0. Rtrain(f, g) = Rtrain[f] + Rtrain[g; f]. This proves the first two parts of the lemma. For the last part, we have Rtest(f, g) 2Rtest[f] + 2Rtest[g; f] (Lemma 3.1) (i) 2νx,y Rtrain[f] + 2νy Rtrain[g; f] (B.1) (ii) 2νx,y Rtrain[f] + 2νy Rtrain(f, g), where (i) invokes Definition 2.1, and where in (ii), Rtrain[f] 0 and Rtrain(f, g) = Rtrain[f] + Rtrain[g; f] implies Rtrain[g; f] Rtrain(f, g). Learning under Heterogenous Distribution Shift B.2. Proof of Proposition 4.2 This section proves Proposition 4.2, which we use to upper bound Rtrain[f, g], and thus, by way of Lemma 3.1, Rtrain[g; f]. We begin by establishing the following more or less standard guarantee (see, e.g. (Liang et al., 2015)) for a generic function class H. which controls the so-called basic inequality in square-loss learning (see, e.g. Wainwright (2019, Chapter 13)). Lemma B.1. Let H be a functions from W [ B, B] containing the zero function h0(w) 0. Fix σ > 0, τ 1. Then, for any probability measure P, w1, . . . , wn i.i.d P and i.i.d. standard normal random variables ξ1 . . . , ξn, the following holds with probability 1 δ 1 4 h 2 L2(P ) h(w1:n) 2 2,n + 2τσ i=1 ξih(wi) γn,δ,σ(H)2, where C > 0 is a universal constant and γn,σ,τ(H, δ)2 = δn,R(H, B)2 + τ 2δn,G (H, σ)2 + (τ 2σ2 + B2) log(1/δ) Proof. We have 1 4 h 2 L2(P ) h(w1:n) 2 2,n + 2τσ i=1 ξih(wi) h 2 L2(P ) 2 h(w1:n) 2 2,n | {z } Term1 2 h(w1:n) 2 2,n + 2τσ i=1 ξih(wi) | {z } Term2 By Lemma C.14 and Lemma C.10, respecitively, the following holds with probability at least 1 δ, Term1 δn,R(H, B)2 + B2 log(1/δ) Term2 τ 2 σ2 log(1/δ) n + δn,G (H, σ)2 . Summing concludes. Proof of Proposition 4.2. Let wi = (xi, yi) and w = (x, y), and set h := f + g and ˆhn := ˆfn + ˆgn. Note that ˆhn h Hcnt. It follows from the so-called basic ineqality Wainwright (2019, Eq. 13.36) that (ˆhn h )(w1:n) 2 2,n 2ξi(ˆhn h )(wi) 0, Thus, by adding and subtracting 1 4E[ˆhn(w)2] and rearranging E[(ˆhn h )(w)2] 4 E[(ˆhn h )(w)2] + (ˆhn h )(w1:n) 2 2,n + 2ξi(ˆhn h )(wi) . Passing to the supremum over all h Hcnt := F + G (f + g ) and invoking Lemma B.1 shows that Rtrain( ˆfn, ˆgn) γn(δ)2 := δn,R(Hcnt, B)2 + δn,G (Hcnt, σ)2 + (B2 + σ2) log(1/δ) Learning under Heterogenous Distribution Shift B.3. Proof of Proposition 4.3 This section proves Proposition 4.3, which we use to bound Rtrain[f]. It is considerably more involved than the proof of Proposition 4.2, as we need to argue that the large class G does not heavily obfuscate recovery in the class F. Throughout, let us use w = (x, y), wi = (xi, yi), and W = X Y. Recall the sets Fcnt := {f βf f : f F}, Gcnt := {g + βf f : f F, g G}. We begin by uniformly bounding the elements of Fcnt and Gcnt. Lemma B.2. For any h Fcnt Gcnt, supw H |h(w)| 4B, where B is as in Assm. 2.4. Proof. Follows directly from Assm. 2.4 and the fact that βf(y) := E[(f f )(x) | y = y]. 
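The key orthogonality step in the proof of Lemma 3.1 above — that the cross term vanishes under Ptrain once βf(y) = Etrain[(f − f*)(x) | y] is subtracted — is easy to verify numerically. The sketch below uses a discrete y so that the conditional expectation is an exact group mean; the particular choices of f, f*, g, g* are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = rng.integers(0, 5, size=n)                  # discrete y => exact conditional expectations
x = y + rng.normal(size=n)                      # x correlated with y

# Arbitrary illustrative choices of (f, f*) and (g, g*).
f, f_star = lambda x: np.tanh(x), lambda x: 0.5 * x
g, g_star = lambda y: y**2, lambda y: np.cos(y)

df = f(x) - f_star(x)
dg = g(y) - g_star(y)
beta = np.array([df[y == k].mean() for k in range(5)])[y]   # beta_f(y) = E[(f - f*)(x) | y]

R_f = np.mean((df - beta) ** 2)                 # R_train[f]
R_g_f = np.mean((dg + beta) ** 2)               # R_train[g; f]
R_fg = np.mean((df + dg) ** 2)                  # R_train(f, g)
cross = np.mean((df - beta) * (dg + beta))      # cross term under P_train

# Because beta is the within-group mean of df, the empirical cross term is zero up to
# floating-point error, and R_train(f, g) = R_train[f] + R_train[g; f] exactly.
print(f"cross term        = {cross:+.2e}")
print(f"R_train(f,g)      = {R_fg:.4f}")
print(f"R_train[f]+R[g;f] = {R_f + R_g_f:.4f}")
```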
In Appendix B.5, we prove the following, which shows that the conditional completeness allows us to decompose Rtrain[ ˆfn] across these two terms, the first of which represents a standard excess risk in terms of F, and the latter of which measures the contamination due to errors in G. This bound can be thought of as a careful refinement of the standard basic inequality (Wainwright, 2019, Eq. 13.36). Lemma B.3. If (F, G) satisfies γ-conditional completeness, then for any empirical risk minimizer ( ˆfn, ˆgn) for which Rtrain( ˆfn, ˆgn) γ2, the following bound holds deterministically: Rtrain[ ˆfn] 8 sup h1 Fcnt 1 4E[h1(w)2] 1 2 h1(w1:n) 2 2,n 2σ 1 i=1 ξih1(wi) + 32 sup h1 Fcnt,h2 Gcnt(γ) 2E[h1(w)2] 4 i=1 h1(wi) h2(wi) In the remainder of the proof, we apply various learning-theoretic tools to upper bound the right-hand side of Lemma B.3. These tools, and their proofs, are detailed in Appendix C. While the tools themselves are more-or-less standard, deriving from the offset-Rademacher arguments in (Liang et al., 2015), their application to the refined decomposition in Lemma B.3 yields the novelty of Proposition 4.3. Proof of Proposition 4.3. We prove the variant of the lemma of the lemma involving δn,cross, and explain how to modify the proof to obtain dependence on δn,cross at the end. The first term in the above display can be bounded directly from Lemma B.1 with τ set to 1, yielding Term1 δn,R(Fcnt, B)2 + δn,G (Fcnt, σ)2 + (σ2 + B2) log(1/δ) To bound the second term, we use a localization argument. Define the doubly-localized term Ψ(r, γ) := sup h1 Fcnt(r),h2 Gcnt(γ) i=1 h1(wi) h2(wi) Following the same argument as in Lemma C.9, one can check that Term2 = 32 sup r>0 Ψ(r, γ), Term2 32 inf r2 : Ψ(r, γ) r2 Learning under Heterogenous Distribution Shift Hence, let us exhibit an r for which Ψ(r, γ) 0 with high probability. Define the shorthand Hr,γ := Fcnt(r) Gcnt(γ). For an r > 0 to be chosen, Lemma C.8 implies that with probability 1 δ/4, Ew1:n[Rn(Hr,γ[w1:n])] + r n + B log(1/δ) where c 1 is a universal constant. By AM-GM, we have Ψ(r, γ) c Ew1:n[Rn(Hr,γ[w1:n])] + r2 16c + (8c + B) log(1/δ) = c Ew1:n[Rn(H[w1:n])] r2 16c + (8c + B) log(1/δ) We now compute Ew1:n P [Rn(Hr,γ[w1:n])] r2 = Ew1:n Eε1:n i=1 εih(wi) = Ew1:n Eε1:n sup h1 Fcnt(r),h2 Gcnt(γ) i=1 εih1(wi)h2(wi) 16c (Definition of Hr,γ) = Ew1:n Eε1:n sup h1 Fcnt(r),h2 Gcnt(γ) i=1 εih1(wi)h2(wi) 1 16c E[h1(w)2] (Localization of Fcnt(r)) 1 16c T1 + 1 32c T2, where we define T1 := Ew1:n Eε1:n sup h1 Fcnt(r),h2 Gcnt(γ) i=1 εih1(wi)h2(wi) 1 2 h1(w1:n) 2 2,n T2 := Ew1:n sup h1 Fcnt h1(w1:n) 2 2,n 2E[h1(w)2] . Bounding T1. To being, we remove the localization of Fcnt in the term T1, upper bounding T1 T 1 := Ew1:n Eε1:n sup h1 Fcnt,h2 Gcnt(γ) i=1 εih1(wi)h2(wi) 1 2 h1(w1:n) 2 2,n For any fixed w1:n Wn, consider the process i=1 εih1(wi)h2(wi). Introduce the class H[ρ] := {h1 h2 : h1 Fcnt, h2 Gcnt(γ), h1(w1:n) 2,n ρ}, (B.5) which localizes h1 at empirical L2-norm ρ. Note that that, for any h H[ρ], we can write h = h1 h2 where h 2 2,n = 1 i=1 h1(wi)2h2(wi)2 4B2 i=1 h1(wi)2 = 4B2 h1(w1:n) 2 2,n 4B2ρ2, Learning under Heterogenous Distribution Shift where the first inequality is by Lemma B.2 and second by definition of H[ρ]. Again by Lemma B.2, maxi sup h H[ρ] 4B2. 
It follows by Lemma C.7 that the following holds with probability 1 δ Z[ρ, w1:n] := sup h1 Fcnt: h1(w1:n) 2,n ρ sup h2 Gcnt(γ) i=1 εi h(wi) i=1 εi h(wi) i=1 εi h(wi) log(1/δ)/n + B2 Applying Lemma C.9 with the classes V = H[w1:n], ui = (εi, i), Φ := {ui 7 εih2(wi) : h2 Gcnt(γ)} and constants c1 1, c2 B, c2 B2, σ = 1, τ 1, we conclude T 1 sup w1:n Eε1:n sup h1 Fcnt,h2 Gcnt(γ) i=1 εih1(wi)h2(wi) 1 2 h1(wi) 2 2,n δ2 n := inf ρ2 : sup w1:n E[Z[ρ, w1:n]] ρ2 sup w1:n E[Z[ρ, w1:n]] := sup w1:n Rn(Fcnt[ρ, w1:n] Gcnt(γ)[w1:n]), δ2 n = inf{ρ2 : sup w1:n Rn(Fcnt[ρ, w1:n] Gcnt(γ)[w1:n]) ρ2 := δ2 n,cross(Fcnt; Gcnt(γ)), (Definition 4.5) so that by Eq. (B.6), T1 T 1 δ2 n,cross(Fcnt; Gcnt(γ)) + B2 Bounding T2. This second term can be bounded by Lemma C.13 and is at most n + δn,R(Fcnt, 4B)2 B2 n + δn,R(Fcnt, B)2 (B.8) where the first inequality also uses Lemma B.2 to bound supw |h(w)| 4B for h Fcnt(r), and the second uses Lemma C.3 to remove the factor of 4. Hence, with probability 1 δ/4, the following inequality holds for any fixed r > 0: Learning under Heterogenous Distribution Shift Concluding the proof Combining Eqs. (B.4), (B.7) and (B.8) gives that with probability 1 δ, Ψ(r, γ) T1 + T2 + (1 + B) log(1/δ) B2 + (1 + B) log(1/δ) n + δn,R(Fcnt, B)2 + δ2 n,cross(Fcnt; Gcnt(γ)) (1 + B2) log(1/δ) n + δn,R(Fcnt, B)2 + δ2 n,cross(Fcnt; Gcnt(γ)). Hence, if for a sufficiently large constant c , we take r2 := c (1 + B2) log(1/δ) n + δn,R(Fcnt, B)2 + δ2 n,cross(Fcnt; Gcnt(γ)) , then Ψ(r, γ) r2 2 with probability at least 1 δ. Therefore by Eq. (B.3), we conclude that with probability 1 δ, Term2 (1 + B2) log(1/δ) n + δn,R(Fcnt, B)2 + δ2 n,cross(Fcnt; Gcnt(γ)). Combining with the bound on Term1 due to Eq. (B.2) concludes the proof. B.4. A modification of Proposition 4.3 For sharper rates with finite function classes (Appendix D), we modify Proposition 4.3 as follows. Definition B.1 (Population-Localized Cross-Critical Radius). δn,cross(Fcnt; H) := inf r : Ew1:n Rn((Fcnt(r) H)[w1:n]) r2 Proposition B.4. Suppose that (F, G) satisfy γ-conditional completeness. Then, whenever Rtrain[ˆgn; ˆfn] γ, the following holds with probability at least 1 δ, Rtrain[ ˆfn] δ2 n,cross(Fcnt; Gcnt(γ))2 + δn,R(Fcnt, B)2 + δn,G (Fcnt, σ)2 + (σ2 + B2) log(1/δ) Modification to obtain dependence on δn,cross. To obtain a dependence on δn,cross, we change our bound the term T1 above to use Lemma C.12 instead of Lemma C.9. The details are very similar. B.5. Proof of Lemma B.3 This section establishes the generalized excess-risk decomposition which forms the basis of the argument in the previous section, and which decouples - via conditional-completeness - the recovery of f F with conflation by g G. Recall w = (x, y) and for f F, define hf(w) := (f f )(x) βf(y), Learning under Heterogenous Distribution Shift and note that hf Fcnt. Then, for any f F, g F, and g0 G, we have ˆLn(f, g) ˆLn(f , g0) i=1 (f(xi) + g0(yi) zi)2 (f (xi) + g0(yi) zi)2 i=1 ((f f )(xi) σξi + (g g )(yi))2 ( σξi + (g0 g )(yi))2 i=1 ((f f )(xi) βf(yi) σξi + (g g + βf)(yi))2 ( σξi + (g0 g )(yi))2 i=1 ((f f )(xi) β(yi) σξi + (g g + βf)(yi))2 ( σξi + (g g + βf)(yi))2 i=1 ( σξi + (g g + βf)(yi))2 ( σξi + (g0 g )(yi))2 i=1 ((f f )(xi) βf(yi))2 2σ 1 i=1 ξi((f f )(xi) βf(yi)) i=1 ((f f )(xi) βf(yi)) (g g + βf)(yi)) i=1 ( σξi + (g g + βf)(yi))2 ( σξi + (g0 g )(yi))2 | {z } Remainder(g0;f,g) Applying the definition of hf, the above admits the more compact form ˆLn(f, g) ˆLn(f , g0) := 1 n hf(w1:n) 2 2,n 2σ 1 i=1 ξihf(wi) + 2 i=1 hf(wi) (g g + βf)(yi)) + Remainder(g0; f, g). 
By γ-conditional completeness, we have that if Rtrain[f, g] γ2, then we may select g = g + βf G so that Remainder(g0; f, g) = 0. Similarly, if we now consider ˆfn, ˆgn to be empirical risk minimizers of ˆLn(f, g), it must hold that ˆLn( ˆfn, ˆgn) infg0 G ˆLn(f , g0) 0. Thus, 0 ˆLn(f, g) inf g0 G ˆLn(f , g0) ˆLn(f, g) ˆLn(f , g) n h ˆ fn(w1:n) 2 2,n 2σ 1 i=1 ξih ˆ fn(wi) + 2 i=1 h ˆ fn(wi) (g g + β ˆ fn)(yi)) n h ˆ fn(w1:n) 2 2,n 2σ 1 i=1 ξih ˆ fn(wi) + 2 i=1 h ˆ fn(wi) (g g + β ˆ fn)(yi)) Note that E[h ˆ fn(w)2] is precisely equal to Rtrain[ ˆfn]. Adding and substracting an η multiple of this term for η tunable, ηRtrain[ ˆfn] 1 n h ˆ fn(w1:n) 2 2,n ηE[h ˆ fn(w)2] 2σ 1 i=1 ξih ˆ fn(wi) i=1 h ˆ fn(wi) (g g + β ˆ fn)(yi)). Learning under Heterogenous Distribution Shift Rearranging, Rtrain[ ˆfn] η 1 ηE[h ˆ fn(w)2] 1 n h ˆ fn(w1:n) 2 2,n 2σ 1 i=1 ξih ˆ fn(wi) i=1 h ˆ fn(wi) (g g + β ˆ fn)(yi)) 2ηE[h ˆ fn(w)2] 1 n h ˆ fn(w1:n) 2 2,n 2σ 1 i=1 ξih ˆ fn(wi) n E[h ˆ fn(w)2] 1 i=1 h ˆ fn(wi) (g g + β ˆ fn)(yi)) Finally, note that h ˆ fn Fcnt. As established above, g β ˆ fn G by conditional completeness, so g g + β ˆ fn Gcnt. In fact, the condition Rtrain( ˆfn, ˆgn) γ2 implies via Lemma 3.1 that E[(g β ˆ fn g )2] γ2, so that g g +β ˆ fn Gcnt(γ). Thus, we may pass to a supremum on the right-hand side equations: Rtrain[ ˆfn] η 1 sup h1 Fcnt 2ηE[h1(w)2] 1 n h1(w1:n) 2 2,n 2σ 1 i=1 ξih1(wi) + η 1 sup h1 Fcnt,h2 Gcnt n E[h1(w)2] 1 i=1 h1(wi) h2(wi) Selecting η = 1/8 conclues. B.6. Proof of Proposition 4.4 This section establishes the Hölder-style inequality for Rademacher complexities of product classes. For completeness, we begin by reproducing a standard bound on the Rademacher complexity of finite function classes. Lemma B.5. Let V Rn be a finite set. Then, Rn(V) rad2(V) min{1, p 2 log |V|/n} The same bounds also hold for ˆ Gn, Gn, and more generally, whenever the variables εi in the definition of the Rademacher complexities are replaced by arbitrary 1-sub Gaussian variables.3 Proof. Let us bound ˆ Rn, with εi replaced by arbitrary 1-sub Gaussian random variables. Recall the definition of a 1sub Gaussian variable ε: log E[exp(λε)] 1 2λ2 (it is standard that Gaussian random variables and Rademacher variables satisfy this inequality). By Taylor expanding log E[exp(λε)] = log(1 + P i 1(λE[εi]/i!), it follows that E[ε] = 0, and E[ε2] 1. Hence, by Cauchy-Schwartz, Rn(V) = E[sup v V i=1 εivi] sup v V v 2,n i=1 ε2 i rad2(V). The second bound is a consequence of standard sub-Gaussian maximal inequality (see, e.g. , Theorem 2.5 in Lugosi) and the fact that 1 n Pn i=1 εivi is 1 n v 2 2,n-sub Gaussian (e.g., the discussion in Boucheron et al. (2013, Chapter 2.3)). We now turn to the proof of Proposition 4.4. We first state two useful lemmas. The first is a direct consequence of Hölder s inequality and the fact that p,n p ,n for p p. Lemma B.6 (Variant of Hölder s inequality). For any p, q 2 satisfying 1/p + 1/q 1/2, v, u Rn, v u 2,n v p,n u q,n (B.10) 3Recall a variable ε is 1-sub Gaussian if, for all λ 0, log E[exp(λε)] λ2/2. Learning under Heterogenous Distribution Shift The second bounds the Rademacher complexity of Hadamard products in terms of a finite cover. It is stated with a factor of 2 for convenience when applied below. Lemma B.7. Let V, U Rn, p, q 2 satisfy 1/p + 1/q 1/2, δ1, δ2 0, and let V be a 2δ1-net of V in p,n and U an 2δ2-net of U in q,n. 
Then, Rn(V U) Rn(V U ) + 2(radp(V)δ1 + radq(U)δ2), The above bound also holds for the Gn, and more generaly, any analogous complexity using suprema over 1-sub Gaussian random variables. Proof of Lemma B.7. Observe that, for any (v, u) V U, there exists a (v , u ) V U with v v p,n δ1 and u u q,n δ2. Hence, by Eq. (B.10) followed by Eq. (B.12), v u (v ) (u ) 2,n v (v v ) 2,n + v (u u) 2,n 2(radq(U)δ1 + radp(V)δ2) Rn(V U) = E sup (v,u) V U E sup (v,u) Vj1 Uk1 i=1 εiuivi E sup (v,u) V1 Uj1 inf (v ,u ) V U 1 n i=1 εi(uivi u iv i) Rn(Vj1 Uk1) E sup w: w 2,n 2(radq(U)δ1+radp(V)δ2) Rn(Vj1 Uk1) + 2(radq(U)δ1 + radp(V)δ2), (Lemma B.5) We now turn to the proof of the main result of this section. Proof of Proposition 4.4. Recall that radp(V) and radq(U) denote the radii of V and U in the p,n and q,n norms, respectively, assuming 0 U V. The only properties of Rademacher variables we use are those assumed by Lemma B.5, i.e. 1-sub Gaussianity, so our bound holds for Gaussian complexity and other sub Gaussian ensembles. We begin with the classical construction of Dudley s integral. Fix δ1 radp(V), δ2 radq(U) j1 := sup{j : 2 jradp(V) δ1}, k1 := sup{k : 2 kradq(U) δ2} For each j [j1] {0}, let Vj denote a minimal 2 jradp(V) covering of V in p,n. Note that since 0 V, we can take V0 = {0}, so |V0| = 0. Define πj1(v; V) = v for v Vj1, and recursively set πj 1(v; V) arg minv Vj 1 v πj(v; V) p,n. Set 0(v; V) = π0(v; V), and for j 1, set j(v; V) := πj(v; V) πj 1(v; V) for j [j1]. Repeat the construction to construct Vk, projection πk(U), and remainders k(u; U) analogously, but replacing p,n the its conjugate q,n. The for all u Uj1, v Uk1. j=0 j(v; V), u = k=0 k(u; U). (B.11) Lastly, as a shorthand, set εj(V) := radp(V)2 j, εk(U) = 2 kradq(U), noting that δ1 εj1(V) 2δ1, δ2 εk1(U) 2δ2 (B.12) Learning under Heterogenous Distribution Shift By Lemma B.7, it suffices to bound ˆ Rn(Vj1 Uk1). For, (u, v) Vj1 Uk1, ˆ Rn(Vj1 Uk1) = sup (u,v) Vj1 Uk1 = sup (u,v) Vj1 Uk1 k=0 εi j(v; U) k(u; U) k=0 sup (u,v) Vj1 Uk1 i=1 εi j(v; U) k(u; U) k=1 ˆ Rn(Wj,k) where we define Wj,k := { j(v; U) k(u; U) : v Vj1, u Uk1}. Taking expectations yields Rn(Vj1 Uk1) k=0 Rn(Wj,k) (B.13) From our construnction, we can bound log |Wj,k| log(|Vj||Vj 1||Uk||Uk 1|) log(|Vj|2|Uk|2) log(2|Vj|2|Uk|2) = 2(Mp(V; εj(V)) + Mq(U; εk(U))) where we use 1 |Vj 1| |Vj| = Mp(V; εj(V)), and similarly for the sets Uk. Moreover, Eq. (B.10) rad2(Wj,k) = sup{ j(v; U) k(u; U) 2,n : v Vj1, u Uk1} sup v Vj j(v; U) p,n sup u Uk k(u; U) q,n Thus, Lemma B.5 yields Rn(Wj,k) εj(V)εk(U) (2(Mp(V; εj(V)) + Mq(U; εk(U))) εj(V)εk(U) q Mp(V; εj(V) + εj(V)εk(U) q Mq(U; εk(U)) radq(U)2 kεj(V) q Mp(V; εj(V) + radp(V)2 jεk(U) q Mq(U; εk(U)) Hence, Eq. (B.13) and evaluating convergent sums yields Rn(Vj1 Uk1) 2 n radq(U)2 kεj(V) q Mp(V; εj(V) + radp(V)2 jεk(U) q Mq(U; εk(U)) j=1 εj(V) q Mp(V; εj(V)) + 4radp(V) n k=1 εk(U) q Mq(U; εk(U)) where in the second-to-last line, we use that we have Mp(V; ε0(U)) = log |V0| = 0. To simplify, we invoke the following claim. Learning under Heterogenous Distribution Shift Claim B.8 (Sum-to-Integral Coversion). Let φ be a non-increasing function, and let εj = 2 j R for some R > 0. Then, for ja jb, j=ja εjφ(εj) 2 Z εja εjb+1 φ(ε)dε = Z 2εja εjb φ(ε/2)dε. Proof. The first inequality follows since φ is non-increasing, and the second line uses a change of variables j=ja εjφ(εj) εj+1 φ(ε)dε = 2 εj+1 φ(ε)dε 2 Z εja εjb+1 φ(ε)dε 2εjb+1 φ(ε/2)dε = Z 2εja εjb φ(ε/2)dε. 
In particular, since metric entropies are non-increasing in their scale factors, j=1 εj(V) q Mp(V; εj(V)) εj(V) εj(V) εj+1(V) Mp(V; ε)dε = 2 Z ε1(V) = 2 Z radp(V)/2 = Z radp(V) Mp(V; ε/2)dε Z radp(V) Mp(V; ε/2)dε, where the last inequality uses Eq. (B.12). Invoking a similar bound for the analogus U-term, we conclude Rn(Vj1 Uk1) 4radq(U) n Mp(V; ε/2)dε + 4radp(V) n Mq(U; ε/2)dε Combining with Lemma B.7 and taking the infinum over valid δ1, δ2, Rn(V U) radq(U) inf δ1 radp(V) Mp(V; ε/2)dε | {z } Dn,p(V) + radp(V) inf δ2 radq(U) Mq(U; ε/2)dε | {z } Dn,p(U) B.7. Derivation of Lemma 4.1 from Proposition 4.4 We consider the Rademacher complexity, as Proposition 4.4 guarantees the same holds of the Gaussian complexity. Let U = {w 7 (1, 1, . . . , 1) Rn}. Applying Proposition 4.4 with the square-Hölder conjugates p = 2 and q = . As the construction of U ensures V U = V, this yields Rn(V) = Rn(V U) rad (U)Dn,2(V) + rad2(V)Dn, (U). Learning under Heterogenous Distribution Shift Notice that rad (U) = (1, 1, . . . , 1) = 1. Moreover, the covering number of U is 1, so its log-covering numbers are zero. Thus, the integral in Dn, (U) vanishes. This concludes the demonstration that Rn(V) Dn,2(V) (B.14) As a consequence, δn,R(H, c) := inf r : Rn(H[r, w1:n]) r2 inf r : Dn,2(H[r, w1:n]) r2 (Eq. (B.14)) := δn,D(H, c). C. Technical Tools This section enumerates the accompanying technical results applied in the proofs in Appendix B. Whereas Appendix B highlights conceptually novel arguments, this section massages more standard material into the most convenient form for adoption in the prior section. The results in this section are stated at the following level of generality: Throughout, let H : W R denote a class of functions, and P be a measure over W, with Var and E its corresponding expectation and variance functionals with respect to P. We say H contains zero if the function h0(w) 0 lies in H. Many definitions results below involve star-hulls and convex-hulls. Definition C.1. Let V Rn. We let conv(V) denote its convex hull and star(V) := {t v : t [0, 1], v V}. Similarly, for a function class H : [0, 1] R, we let star(H) := {t h, t [0, 1], h H}, convex hull as conv(H) as the minimal convex set containing H. C.1. Basic Empirical Process Results Properties of Rademacher and Gaussian complexities. We recall a couple standard facts about the Rademacher complexity. First is that Rademacher complexity is invariant under the convex hull operation, and also under the star-hull operation if the set contains zero. Lemma C.1 (Convex Hulls). If V V, Rn(V ) Rn(V). Moreover, Rn(V) = Rn(conv(V)), and if in addition, 0 V, Rn(V) = Rn(star(V)). Proof. Recall Rn(V) := 1 n Eε supv V Pn i=1 εivi. It is then clear that if V V, then Rn(V ) Rn(V). Then is establishes the first point. The second follows because the maximum of a linear function v 7 Pn i=1 εivi occurs on the extreme points of V, which are the same as those of conv(V). Lastly, if V contains 0, conv(V) star(V) V. Next, we state a classical Lipschitz contraction for Rademacher complexity. Lemma C.2 (Rademacher Contraction, Lemma 29 in (Rakhlin, 2022) ). Let φ be any L-Lipschitz function, and given V Rn, let φ(V) := {φ(v) : v V}. Then, Rn(φ(V)) LRn(V). The following lemma is standard (see, e.g. Wainwright (2019, Chapter 14) or, examine the proof of Rakhlin (2022, Lemma 30)). Lemma C.3. Let H be star-shaped. Then H(cr) c H(r) for all c 1. Hence, for c 1, it holds that δn,R(H, c B) cδn,R(H, B), and similarly δn,G (H, cσ) cδn,G (H, σ). 
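The basic facts above lend themselves to quick Monte Carlo checks. The sketch below estimates the Rademacher complexity of an arbitrary finite set V ⊂ R^n, compares it to the bound of Lemma B.5, and illustrates the convex-hull invariance of Lemma C.1; the set V and all sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 200, 20, 2000
V = rng.normal(size=(m, n))                     # finite set of m vectors in R^n

def rademacher_complexity(V, trials, rng):
    """Monte Carlo estimate of R_n(V) = E_eps sup_{v in V} (1/n) sum_i eps_i v_i."""
    n = V.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(trials, n))
    return np.max(eps @ V.T / n, axis=1).mean()

R_V = rademacher_complexity(V, trials, rng)

# Lemma B.5: R_n(V) <= rad_2(V) * min{1, sqrt(2 log|V| / n)}, rad_2(V) = max_v ||v||_{2,n}.
rad2 = np.sqrt((V ** 2).mean(axis=1)).max()
print(f"R_n(V) ~ {R_V:.4f}   bound = {rad2 * min(1.0, np.sqrt(2 * np.log(m) / n)):.4f}")

# Lemma C.1: the complexity is unchanged by adding (sampled) convex combinations of V,
# since a linear functional is maximized at the extreme points.
W = rng.dirichlet(np.ones(m), size=500) @ V
R_hull = rademacher_complexity(np.vstack([V, W]), trials, rng)
print(f"R_n(V with hull samples) ~ {R_hull:.4f}  (matches R_n(V) up to Monte Carlo error)")
```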
The following lemma shows a similar property for the Dudley functional, this time without the constraint that H is starshaped. Lemma C.4. For any class H and c 1, it holds that Dn,2(H[r, w1:n]) c 1Dn,2(H[cr, w1:n]). Thus, for any c 1 and B > 0, δn,D(H, c B) cδn,D(H, B). Learning under Heterogenous Distribution Shift Proof. We then have, recalling Definition 4.1 and using rad2(H[r, w1:n])) = r that Dn,2(H[r, w1:n]) := inf δ r M2(V; ε/2)dε M2(V; ε/2c)dε M2(V; ε/2c)dε M2(V; ε/2)dε c Dn,2(H[cr, w1:n]), where in (i) we use anti-monotonicity of covering numbers M2(V; ε/2c) M2(V; ε/2) for c 1. Recall from Definition 4.1 the definition δn,D(H, c B) := inf r : sup w1:n Dn,2(H[r, w1:n]) r2 = inf cr : sup w1:n Dn,2(H[cr, w1:n]) cr2 = c inf r : c 1 sup w1:n Dn,2(H[cr, w1:n]) r2 c inf r : sup w1:n Dn,2(H[r, w1:n]) r2 = cδn,D(H, c B), where the inequality above follows from the first part of the lemma. Lemma C.5. Let H : W R be an arbitrary class of functions, and let h0 : W R be arbitrary. Then, the class H + h0 := {h + h0 : h H} satisfies, for all n N, c > 0, q 1, and w1:n Wn the equalities, δn,D(H + h0, c) = δn,D(H, c), Dn,q((H + h0)[w1:n]) = Dn,q(H[w1:n]). In particular, recalling the notation of Section 4.2, for any c > 0, δn,D(Fcnt, c) = δn,D(Fβ, c) and supw1:n Dn, ((G g )[w1:n])2 = supw1:n Dn, (G[w1:n])2. Proof. The proof is immediate from the fact that translation by a single element leaves the covering numbers, and hence metric entropies, unchanged. C.2. Deviation Inequalities for Empirical Processes The following is a standard maximal inequality for empirical processes. Lemma C.6 (Empirical Process Inequality, Theorem 2.3 in (Bousquet, 2002)). Let H : W [ B, B] and let P be a measure on W such that suph H |Ew P [h(w)]| = 0. Let Z := 1 n suph H Pn i=1 h(wi), and let r2 := suph H Ew P [h(w)2]. Then, for any choice of parameter ϵ > 0. Z (1 + ϵ)E[Z] r By examining the proof of Theorem 2.3 in (Bousquet, 2002) from Theorem 2.1 in that same work, one can check that the concusion of Lemma C.6 holds verbtaim in the folllowing more general setup: the class of functions H : W [n] [ B, B] are index-dependent, the process is Z := 1 n sup h H Pn i=1 h(wi, i), and where we define r2 := 1 n E[Pn i=1 h(wi, i)2] as the average variance. A special case of this generalization applies to Rademacher processes Z := 1 n suph H Pn i=1 εih(wi), where εi plays the roll of the random variable wi, and where h(εi, i) = εih(wi). Learning under Heterogenous Distribution Shift Lemma C.7. Let H : W [ B, B]. Fix any w1:n W, and let Z := 1 n suph H Pn i=1 εih(wi), and let r2 := suph H h(w1:n) 2 2,n. Then, for any choice of parameter ϵ > 0. Z (1 + ϵ)E[Z] r We shall also need a related lemma that bounds deviations in terms of Rademacher complexity. Lemma C.8 (Uniform Convergence, Theorem 2.1 in (Bartlett et al., 2005)). Let H : W [ B, B] be a family of uniformly bounded functions with |h| B and suph H Var[h2] r2 and let P be a measure over W. Then, with probability at least 1 δ, any n=1 Ew P [h(w)] h(wi) 6Ew1:n P [Rn(H[w1:n])] + r n + 11B log(1/δ) In particular, if H = H(r) for some r, then the above holds for ν2 = r2. C.3. Fixed-Design Guarantees This section concerns various measures of complexity for a function class when its arguments ( design points ) w1:n are treated as deterministic. We begin with the following general lemma, which abstracts away the function class H[w1:n] evaluated on the design points with a set V Rn. This lemma measure offset complexities , were a mean zero process involving v V is offset by norms v 2. 
This lemma implies important consequences of this lemma for Gaussian and Rademacher compelxities. Lemma C.9 (Fixed-Design Master Lemma). Let V Rn be a containing 0 Rn, with V[r] := {v V : v 2,n r}, and let Φ : U R be an arbitrary function classes (possibly even of cardinality one). Let u1 . . . , un be a random variables taking values in U, and define the processes Z(r) := sup v V[r] sup φ Φ i=1 viφ(ui) (localized maximal process) Y(τ) := sup v V sup φ Φ i=1 viφ(ui) 1 . (offset maximal process) Lastly, define a modification of Y which replaces the offset by v 2 2,n with the offset by r: Y(r; τ) := sup v V[r] sup φ Φ i=1 viφ(ui) 2 = 2τZ(r) r2 Then, the following are true. (a) With probability one, Y(τ) = sup r>0 Y(r; τ). (b) With probability one, Y(τ) inf r2 : Y(r; τ) r2 (c) Suppose that, for any choice of r > 0, Z(r) satisfies the following concentration inequality with parameters c1 1 and c2, c3 > 0: Z(r) c1E[Z(τ)] + c2r n + c3 log(1/δ) Learning under Heterogenous Distribution Shift Then, for any τ 1 and σ > 0, the following holds probability 1 δ, the following holds Y(στ) c2 1τ 2 δn(σ)2 + (τ 2σ2c2 2 + (c3/c2)2) log(1/δ) δn(σ) := inf r : E[Z(r)] r2 Thus, by itegrating, EY(στ) c2 1τ 2 δn(σ)2 + (τ 2σ2(c2 2 + 1) + c3) Proof. We prove the lemma in parts. First, however, we observe that we may assume without loss of generality that V is star-shaped. Note that star(V[r]) = star(V)[r]. Then, by the same logic as in the proof of Lemma C.1, the inclusion 0 V[r] implies conv(V[r]) star(V[r]) = star(V)[r] V[r], which establishes Z(r) := sup v V[r] sup φ Φ i=1 viφ(ui) = sup v conv(V[r]) sup φ Φ i=1 viφ(ui) sup v star(V)[r] sup φ Φ i=1 viφ(ui) sup v V[r] sup φ Φ i=1 viφ(ui) = Z(r), Z(r) = sup v star(V)[r] sup φ Φ i=1 viφ(ui). Similarly, one can show that Y(τ) := sup v star(V) sup φ Φ i=1 viφ(ui) 1 , Y(r; τ) := sup v star(V)[r] sup φ Φ i=1 viφ(ui) Hence, we can apply the entire lemma to star(V), and then convert back to V by the above reduction. Part (a). As r2 v 2 2,n for all v V[r], it is immediate that Y(τ) supr>0 Y(r; τ). We prove the other direction. Fix an arbitrary ϵ > 0 and suppose that that v V and φ Φ satisfy i=1 viφ(ui) 1 2 v 2 2,n Y(τ) ϵ. Letting r := v 2,n, it holds that i=1 viφ(ui) r2 2 Y(r; τ) sup r>0 Y(r; τ). As ϵ was arbitrary, Y(τ) Y(r; τ). Part (b). Suppose that r satisfies Learning under Heterogenous Distribution Shift Fix v V. Since V is star-shaped, either v V[r], or αv V[r] for α = r/ v 2,n < 1. In the first case, i=1 viφ(ui) 1 2 v 2 2,n sup φ Φ i=1 viφ(ui) i=1 viφ(ui) r2/2 + r2/2 Y(r; τ) | {z } In the second case, recalling α = r/ v 2,n < 1, i=1 viφ(ui) 1 2 v 2 2,n = 1 i=1 αviφ(ui) 1 i=1 αviφ(ui) 1 2 2α 1 α 2 r2 2 where the second inequality uses maxx(2x x2) 1 This concludes the proof of part (b). Part (c). Applying the AM-GM inequality twice to Eq. (C.1), the following holds with probability 1 δ, Z(r) c1E[Z(r)] + r2 4τσ + (2τσc2 2 + c3) log(1/δ) = c1E[Z(r)] + r2 4τσ + (2τ 2σ2c2 2 + τσc3) log(1/δ) c1E[Z(r)] + r2 4τσ + (3τ 2σ2c2 2 + c2 3/2c2 2) log(1/δ) n τσ . Consequently, setting α := 8c1τ 1, Y(r; στ) = 2στZ(r) r2 2 2c1στE[Z(r)] r2 4 + (3τ 2σ2c2 2 + c2 3/2c2 2) log(1/δ) n + (3τ 2σ2c2 2 + c2 3/2c2 2) log(1/δ) n r2 To conclude it suffices to show that the above expression is non-positive for the choice r2 := max (24τ 2σ2c2 2 + 4(c3/c2)2) log(1/δ) n , α2( δn(σ)2 + ϵ) . 
Note that this choice makes the second term in the previous display vanishes, so Y(r; στ) σα E[Z(α δn(σ)2)] α δn(σ)2 αE[Z( δn(V, Φ, σ)2)] α δn(σ)2 E[Z( δn(σ)2)] δn(σ)2 where (i) uses that V is star-shaped, so E[Z(µr)] µE[Z(r)] for any µ 1. The proof now follows by subsituting in α = 8c1τ. Learning under Heterogenous Distribution Shift Consequences of the Master Lemma. Our first consequence is for Gaussian complexities. Lemma C.10 (Offset Gaussian Complexity Bound). Let H : W R be a function class containing the zero function. Fix any δ (0, 1) and σ > 0 and τ 1. Then, there exists a constant c > 0 such that sup w1:n Pξ1:n i=1 ξih(wi) 1 2n h(w1:n) 2 2,n > cτ 2 σ2 log(1/δ) n + δn,G (H, σ)2 # where above ξ1:n are i.i.d. standard Normal. Proof. Recall the set H[r, w1:n] = {h H : h 2 2,n r2}. Then, the random variable n sup h H[r;w1:n] i=1 ξih(wi) satisfies, by Gaussian-Lipschitz concentration (e.g. Boucheron et al. (2013, Chapter 2.3) or Wainwright (2019, Chapter 3)), P[Z(r0) EξZ(r0) + r0 p 2 log(1/δ)/n] δ. (C.2) The bound now follows from Lemma C.9, where ui = xi and Φ is a singleton consisting of the identity function. We establish a similar guarantee for Rademacher variables. Lemma C.11 (Offset Rademacher Complexity Bound). Let H : W [ B, B] be a function class containing zero, and let ε1:n be i.i.d. Rademacher random variables. Then, for any δ (0, 1/2), σ > 0 and τ 1, the following holds with probability 1 δ i=1 εih(wi) 1 i=1 h(wi)2 (τ 2σ2 + B2) log(1/δ) n + τ 2δn,R(H, σ)2 In particular, by integrating, i=1 εih(wi) 1 i=1 h(wi)2 # n + τ 2δn,R(H, σ)2 Proof. Recall the set H[r, w1:n] = {h H : h 2 2,n r2} and let n sup h H[r;w1:n] i=1 ξih(wi) By Lemma C.7, for some universal constant c > 0, P Z(r0) > c E[ Z(r0)] + r0 p log(1/δ)/n + B log(1/δ) The bound now follows from Lemma C.9. C.4. Random-Design Complexities The following is an analogue of Lemma C.9 for random design. It s proof is nearly identical, with the key difference between that localization occurs based on the empirical L2-norm E[h(w)]2 and not h(w1:n) 2 2,n. 4 4This remark is under the identification V := H[w1:n]. We further not that Lemma C.12 implies Lemma C.9 by choosing the measure P to be a dirac-delta. However, to avoid confusion of the subtle differences in localization, we state these two lemmas separately. Learning under Heterogenous Distribution Shift Lemma C.12 (Random-Design Master Lemma). Let P be a measure over random variables (u, w), and let Φ : U R and H : W R be function classes, and recall H(r) := {h H : Ew P h(w)2 r2}. For (u1, w1), . . . , (un, wn) i.i.d P, define the processes Z(r) := sup h H(r) sup φ Φ i=1 h(wi)φ(ui) Y(τ) := sup h H sup φ Φ i=1 h(wi)φ(ui) E[h(w)2] Y(r; τ) := 2τZ(r) r2 Then, the conclusions of the fixed-design master lemma Lemma C.9 hold verbatim with the above definitions. Next, we establish two lemmas which give control on the complexities of relevant random-design (i.e. w1:n P)quantities involving quadratic terms such as E[h(w)2]. Lemma C.13 (Quadratic Loss Symmetrization). Let H : W [ B, B] be a function class containing zero, let P be a distribution over W, and let η > 0 be arbitrary. Consider the (very similar) terms T1(η) := Ew1:n i.i.d P h(w1:n) 2 2,n (1 + η)Ew P [h(w )2] T2(η) := Ew1:n i.i.d P Ew P [h(w )2] (1 + η) h(w1:n) 2 2,n T3(η) := Ew1:n,w 1:n i.i.d P h(w1:n) 2 2,n (1 + η) h(w 1:n) 2 2,n , as well as the term T4(η) := Ew1:n sup h H Eε1:n i=1 εih(wi) 2E[ h(w1:n) 2 2,n] max{T1(η), T2(η), T3(η), T4(η)} η(1 + η 1) 2 n + δn,R(H, B)2 . Proof. By Liang et al. 
(2015, Lemma 14) (modifying the constant of 4B to 2B to account for the fact that we consider the uncentered h H, and not centered h h H, and reparameterizing η η/2), it holds that η εih(wi) h(wi)2 # The same argument can be modified to show that T1(η) satisfies the same upper bound, as T1(η) satisfies the same intermediate inequality obtained via Jensen s inequality (the third line of the proof in Liang et al. (2015, Lemma 14)), and the same argument extends to T3(η) because this expresssion is precisely the consequence of applying Jensen s inequality. Thus, max {T1(η), T2(η), T3(η)} η η εih(wi) h(wi)2 # i=1 4B(2 + η 1)εih(wi) 1 (i) η(2 + η 1)2 n + δn,R(H, B)2 , η(1 + η 1)2 n + δn,R(H, B)2 , Learning under Heterogenous Distribution Shift where the inequality (i) is by Lemma C.11 with σ = B, and τ = 2(1 + η 1). The bound on T4(η) follows from a similar application of Lemma C.11. Lemma C.14 (Quadratic Lower Bound). Let H : W [ B, B] be a function class containing zero, and let P be a measure over W. Then, there is a universal constant c > 0 such that for any δ > 0, it holds that P sup h H h 2 L2(P ) 2 h(w1:n) 2 2,n crquad(H, δ)2 1 δ, rquad(H, δ)2 := δn,R(H, B)2 + B2 log(1/δ) Proof. In view of Lemma C.1, the fact that H contains zero means we may assume without loss of generality that H is star-shaped (indeed, apply the lemma to star(H), and note that δn,R(H, B) = δn,R( (H), B)). Introduce the class of function Hr := {E[h2] h2 : h H(r)} (here, we use subscript r to distinguish from the standard localization notation). Then, Hr : W [ B2, B2], and E[ h(w)2] B2r2 for h Hr. Note that i=1 h(wi) = sup h H(r) h 2 L2(P ) h(w1:n) 2 2,n Hence, by Lemma C.6 and AM-GM, the following holds with probability 1 δ and for a universal constant c > 0 and any τ 0: sup h H(r) h 2 L2(P ) h(w1:n) 2 2,n sup h Hr h 2 L2(P ) h(w1:n) 2 2,n + τr2 + (τ 1 + 1)B2 log(1/δ) sup h H(r) (1 τ) h 2 L2(P ) h(w1:n) 2 2,n + 2τr2 + (τ 1 + 1)B2 log(1/δ) where the second inequality uses h 2 L2(P ) r2. Let τ 1/2 and let η be such that (1 τ) = 1 1+η. By Lemma C.13, sup h H(r) (1 τ) h 2 L2(P ) h(w1:n) 2 2,n = 1 1 + η E sup h H(r) h 2 L2(P ) (1 + η) h(w1:n) 2 2,n η(1 + η 1) 2 n + δn,R(H, B) = (1 η 1) B2 n + δn,R(H, B)2 n + δn,R(H, B)2 , where in the last line, we use that η = 1 (1 τ) 1 τ for τ 1/2. In sum, there is a universal constant c such that, for all τ 1/2, the following holds with probability 1 δ: sup h H(r) h 2 L2(P ) h(w1:n) 2 2,n c 1 τ n + δn,R(H, B) 2 + τr2 + (τ 1 + 1)B2 log(1/δ) By making τ a sufficiently small universal constant, we can ensure that there is a universal constants c , c such that, whenever r2 = c δn,R(H, B)2 + B2 log(1/δ) rquad(H, δ). Learning under Heterogenous Distribution Shift we have that with probability 1 δ, sup h H(r) h 2 L2(P ) h(w1:n) 2 2,n r2 We claim that in fact, with probability 1 δ, it holds that the above holds for all h H, that is sup h H h 2 L2(P ) 2 h(w1:n) 2 2,n r2 Indeed, it suffices to check Eq. (C.3) implies the inequality for h / H(r). Since H is star-shaped, there exists some α such that αh H(r) and in fact αh 2 L2(P ) = r2. Then, on Eq. (C.3) αh 2 L2(P ) αh(w1:n) 2 2,n r2 2 = αh 2 L2(P ) 2 so by rearranging h 2 L2(P ) 2 h(w1:n) 2 2,n. D. Rates for finite function classes. In this section, we establish sharper bounds for finite function classes H1 and H2. Because errors for finite function classes already attain the O (1/n) parametric rate, we require an additional assumption to achieve improvement. 
Specifically, we need a hypercontractivity condition which states that higher moments of E[h(w)q] for h H are controller by lower order moments E[h(ws)] for s < q. This is the first notion of hypercontractivity, defined below. Definition D.1 (Hypercontractivity). We say a class H {W R} satisfies (κ, P, s, q)-hypercontractivity if, for all h H, E[|h(w)|q]1/q κE[h(w)s]1/s. We achieve even faster rates under a stronger variant of hypercontractivity, defined below. Definition D.2 (sub Gaussian Hypercontractivity). We say a class H {W R} satisfies (κ, P)-sub Gaussian hypercontractivity if, for all h H, h(w) E[h(w)] is κ2E[h(w)2] sub Gaussian. 5 Under the various hypercontractivity assumptions, we attain the following bound, which is the formal statement of Theorem 2. Theorem 4. Let F : X [ 1, 1] and G : Y [ 1, 1] be finite function classes, and suppose 1 d1 d2 satisfy log |F| d1 and log |G| d2. Define the class6 F := {f βf f : f F}, and casing on the hypercontractivity assumptions with parameter κ, define φn(d1, d2) := q1 general 1 q2 = 2, F satisfies (κ, Ptrain, 2, q1) hypercontractivity d2 4 F satisfies (κ, Ptrain, 2, 4) hypercontractivity n (d1 + log n) F satisfies (κ, Ptrain)-sub Gaussian hypercontractivity Then, as long as σ2 1, for any δ (e 10d2, e 1), the following hold simultaneously with probability at least 1 δ: Rtrain[ˆgn; ˆfn] Rtrain( ˆfn, ˆgn) d2 Rtrain[ ˆfn] κ2φn(d1, d2) d2 n + log(1/δ) Rtest( ˆfn, ˆgn) νx,y d1 + log(1/δ) n + (νy + νx,y κ2φn(d1, d2)) d2 n 5This is equivalent to O (κ)-hypercontractivity in the L2 ψ2 norms, where ψ2 is the sub Gaussian (Orlicz) norm (see e.g. Boucheron et al. (2013, Exercise 2.18).) 6note that Fcnt := star( F) Learning under Heterogenous Distribution Shift Notice that, as promised by Theorem 2 φn(d1, d2) tends to 0 as n and as the ratio of the class complexities d1/d2 tends to 0, such that with high probability. Moreover, under sub Gaussian hypercontractivity, limn φn(d1, d2) for any d1, d2 fixed. Remark D.1 (Extension to Parametric Classes). Up to logarithmic factors in n, the above bound can be extended easily extended to infinite-cardinality parametric function classes (that is, function classes whose metric entropies scale as logarithmic in the scale ϵ). The guarantee of Theorem 4 also holds under the more general assumption that F and G are contained in the respective convex hulls of function classes F and G, where log | F| d1 and log | G| d2. This includes, for example, many natural linear classes. D.1. Proof of Theorem 4 We start with localization for finite classes. Lemma D.1 (Localization for Finite Classes). Let H be a finite function class uniformly bounded by 1, and let d = log |H|. Then, for any probability measure P over W, δ2 n,R(H, B) B2d n , δn,G (H, σ) dσ2 Proof of Lemma D.1. From Lemma C.8 and finiteness of H, we have for any w1:n Wn that Rn(H[r; w1:n]) rad2(H[r; w1:n] p It then follows that δ2 n,R(H, B) = sup{r2 : Rn(H[r; w1:n]) r2 2B } B2d n . The bound on δn,G similarly yields δ2 n,R(H) = sup{r2 : Rn(H[r; w1:n]) r2 We continue with a generic bound on the following cross-critical radius. The next proposition is proved in Appendix D.2 below. Proposition D.2. For i {1, 2}, let Hi {W [ 1, 1]} be finite function classes with di = log |Hi|. Assume for simplicity that d1 d2, and let γ2 d2/n. Finally, let P be a distribution of W. Define the shorthand δn(γ) := inf r : Ew1:n[Rn((H1(r) H2(γ))[w1:n])] r2 Then, it holds that Let 1/q1 + 1/q2 = 1/2 be square Hölder conjugates. 
If H1 satisfies (κ, P, 2, q1) hypercontractivity, δn(γ)2 κ2γ4/q2 d2 n + γ2/q2 r In particular, if γ2 d2/n, q1 general 1 4 q1 = q2 = 4. If H1 satsfies (κ, P)-sub Gaussian hypercontractivity, δn(γ)2 κ2γ2d2 n (d1 + log n). In particular, if γ2 d2/n, the above scales as κ2(d2/n)2 (d1 + log n). Learning under Heterogenous Distribution Shift Next, we recall standard localization bounds for finite function classes. Proof of Theorem 4. As d2 d1, log |F + G| log(|F| + |G|) d2. Taking B = 1, σ2 1, and δ = e d2, Lemma D.1 and Proposition 4.2 allow us to bound Rtrain[ˆgn; ˆfn] Rtrain( ˆfn, ˆgn) d2 n w.p. 1 e d2 By the same token, applying Proposition 4.3 and Lemma D.1, and making similar simplifications (B = 1, σ2 1), the following holds with probabilty 1 δ Rtrain[ ˆfn] δn,cross(γ)2 + d1 + log(1/δ) Bounding δn,cross(γ)2 by Proposition D.2, on the same event we have Rtrain[ ˆfn] κ2φn(d1, d2)d2 n + d1 + log(1/δ) where we recall φn(d1, d2) defined in the theorem statement. When both events hold, Lemma 3.1 entails Rtest( ˆfn, ˆgn) νx,y κ2φn(d1, d2)d2 n + log(1/δ) D.2. Proof of Proposition D.2 Define the radius of a class H : W R be a class, and let P be a measure over W. Define radq(H) := sup h H E[|h(w)|q]1/q, radq(H[w1:n]) := sup h H h[w1:n] q,n Part 1. Bounds on the empircal norms. The next lemma bounds the magnitude of the empirical q-norm radius. Lemma D.3. Let H {W [ 1, 1]} be a finite class, and take δ (0, 1). With probability at least 1 δ, radq(H[w1:n]) 2radq(H) + log |H|/δ In particular, if rad2(H) r and H satisfes (κ, P, 2, q) hypercontractivity, then with probability 1 δ radq(H[w1:n]) 2κr + log 1/δ Proof. We observe that radq(H[w1:n])q radq(H)q + E suph H 1 n Pn i=1 |h(wi)|q E[|h(wi)q|]. As suph |h| 1, suph H E[|h(wi)|2q] suph H E[|h(wi)|q] = radq(H)q. By Bernstein s inequality and a union bound, with probability at least 1 δ, i=1 |h(wi)|q E[|h(wi)|q] 2radq(H)q log(|H|/δ) n + log |H|/δ radq(H)q + log |H|/δ n . (AM-GM, and 1 Hence, via the previous two displays, with probability 1 δ it holds that radq(H[w1:n])q 2radq(H)q + log |H|/δ Taking the q-th root and using of (x + y)1/q y1/q + x1/q for q 1 and x, y 0 concludes the proof. Learning under Heterogenous Distribution Shift When H satisfies (κ, P)-sub Gaussian hypercontractivity, we can improve this bound. Lemma D.4. Suppose H {W [ 1, 1]} be a finite class which satisfies (κ, P)-sub Gaussian hypercontractivity. Then, P[rad (H[w1:n]) κ p log |H| + log(n/δ) rad2(H)] 1 δ. Proof. By Gaussian concentration and (κ, P)-su Bgaussian hypercontractivity, for any h H, i [n] and δ > 0, we have P |h(w)i| κrad2(H) log(1/δ) P h |h(w)i| κE[h(wi)2]1/2 log(1/δ) i δ Union bounding over h H and i [n] concluds the proof. Part 2. Controlling the Rademacher Complexities Next, we turn to bounding the Rademacher complexity in terms of empirical radii. Lemma D.5. Let 1/q1 + 1/q2 1/2 be squared Hölder conjugates. Then, Rn,P(H1(r) H2(γ)) p 2(d1 + d2)E [radq1(H1,r[w1:n]) radq2(H2,γ[w1:n])] Proof. This is a direct consequence of Lemma B.5, and the fact that log |H1(r) H2(γ)| log |H1| + log |H2| = d1 + d2. Part 3. Conclusion the proof. We can now conclude. Proposition D.2. Let us start with the case that H1 satisfies (κ, P, 2, q1) hypercontractivity. Uing boundedness of |h| 1 and q2 1 we get radq2(H2,γ)q2 = sup h H2,γ E[|h(w)|q2] sup h H2,γ E[|h(w)|2] γ2, so radq2(H2,γ) γ2/q2. Hence, as d2/n γ2 by assumption, radq2(H2,γ) + (d2 1 q2 γ2/q2 + (d2/n)1/q2 2γ2/q2. 
(D.1) Next, by Lemma D.3, hypercontractivity of H1, and the above bound implies that, for all δ > 0, both radq1(H1,r[w1:n]) 2κ1r + d1 + log 1/δ radq2(H1,r[w1:n]) 2γ2/q2 + d2 + log 1/δ hold with probability at least 1 δ. Taking the product, integrating the tail over δ, and invoking Eq. (D.1) implies E [radq1(H1,r[w1:n]) radq2(H2,γ[w1:n])] γ2/q2 Note the resulting constant does not depend on q1 or q2, as 0 1/q1, 1/q2 1 are both bounded. Consequently, Lemma D.5 and d1 d2 entails Rn,P( H1(r) H2(γ)) Learning under Heterogenous Distribution Shift inf r2 : Rn,P( H1(r) H2(γ)) r2 n + γ2/q2 r Next, let s consider the sub Gaussian hypercontractive case. Here, we replace Lemma D.3 with Lemma D.4. A similar computation yields E [rad (H1,r[w1:n]) rad2(H2,γ[w1:n])] κrγ p d1 + log n. Hence, Lemma D.5 (with d1 d2) gives Rn,P( H1(r) H2(γ)) d1 + log n. We may then conclude inf r2 : Rn,P( H1(r) H2(γ)) r2 n κ(d1 + log n). E. Formal Guarantees for Nonparametric Classes (formal statement of Theorem 1) In this section, give a formal statement of Theorem 1. Recall the definition of the normalized q-norms: For v = (v1, . . . , vn) Rn and q [1, ), we have v q,n = ( 1 i=1 |vi|q)1/q, v ,n = v = max i [n] |vi|. We now define the radii and metric entropies in these norms, with a definition that expands upon Definition 3.1. Definition E.1 (q-norms, radii, and metric entropies). Given a subset V Rn and q [1, ], define the the radius radq(V) = maxv V v q,n, define the covering number N (V, q,n, ε) as the cardinality of minimal-cardinality ε-cover of V in the norm q,n, and define the metric entropy Mq(V, ε) = log N (V, q,n, ε) as the logarithmic of the covering number. For a function class H : W R, we define its q-norm metric entropy as Mq(H, ε) := sup n N sup w1:n Wn Mq(H[w1:n], ε). We make the following mild compactness assumption, which holds whenever Theorems 1 and 6 is non-vacuous. Assumption E.1. For all ε > 0, M2(F, ε) < . We also introduce a strictly optional second assumption, but codifies a way in which F is simpler than G, and enables further simplifications when it holds. Assumption E.2. For all ε > 0, M (F, ε) M (G, ε). Lastly, we define the class βF := {βf : f F}. (E.1) E.1. An Intermediate Dudley Bound Recall the Dudley functional from Definition 4.1 We begin by defining an upper bound on the Dudley critical radius δn,D of a function class H Learning under Heterogenous Distribution Shift Definition E.2 (Upper Bound on Dudley Critical Radius). We define Dn,2(H, r) := inf δ r M2(H; ε/4)dε , δn,D(H, c) := inf{r : Dn,2(H, r) r2 Lastly, we define Dn,q(H) := inf δ B Mq(H; ε/2)dε , B := sup w |h(w)|. Recall ν1, ν2 from Eqs. (3.3) and (3.4). We have the following theorem. Theorem 5. Recall the class βF := {βf : f F}, and suppose Assm. E.1 holds. Then, for any δ (0, 1), if δn,D(F, σB)2 + δn,D(G, σB)2 + σ2 B log(1/δ) n c1γ. Then it holds that with probability at least 1 δ, Rtest( ˆfn, ˆgn) (ν1 + ν2) δn,D(F, σB)2 + ν1( Dn, (G)2 + min{ Dn, (F)2, Dn, (βF)2}) + ν2 δn,D(G, σB)2 + (ν1 + ν2)σ2 B log(1/δ) n . In particular, if Assm. E.2 also holds, then Rtest( ˆfn, ˆgn) (ν1 + ν2) δn,D(F, σB)2 + ν1( Dn, (G)2 + ν2 δn,D(G, σB)2 + (ν1 + ν2)σ2 B log(1/δ) n . And in addition, when we upper bound ν1 νx,y and ν2 νy, Rtest( ˆfn, ˆgn) νx,y( δn,D(F, σB)2 + Dn, (G)2) + νy δn,D(G, σB)2 + νx,yσ2 B log(1/δ) E.1.1. PROOF OF THEOREM 5 We sketch the proof of Theorem 5, deferring supporting proofs to Appendix E.1.2. 
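Before the proof sketch, the empirical q-norms and covering numbers of Definition E.1, and the entropy integrals they feed into, can be illustrated numerically. The sketch below uses a one-parameter class h_a(w) = tanh(a·w) purely as an example, a greedy cover (which upper bounds the minimal covering number), and a crude dyadic sum as a stand-in for the Dudley functional of Definitions 4.1 and E.2; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
w = rng.uniform(-1, 1, size=n)                          # sample points w_{1:n}

# H[w_{1:n}] for an illustrative class: h_a(w) = tanh(a * w), a on a grid.
H = np.array([np.tanh(a * w) for a in np.linspace(-3, 3, 500)])

def q_norm(v, q):
    """Normalized q-norm ||v||_{q,n} = ((1/n) sum |v_i|^q)^(1/q); q = inf gives the max."""
    return np.abs(v).max() if np.isinf(q) else (np.mean(np.abs(v) ** q)) ** (1 / q)

def greedy_cover_size(H, eps, q):
    """Greedy eps-cover; its size upper bounds the covering number N(H[w_{1:n}], ||.||_{q,n}, eps)."""
    centers = []
    for h in H:
        if all(q_norm(h - c, q) > eps for c in centers):
            centers.append(h)
    return len(centers)

rad2 = max(q_norm(h, 2) for h in H)
eps_grid = rad2 * 2.0 ** (-np.arange(1, 8))             # dyadic scales, as in the chaining proof
entropies = [np.log(greedy_cover_size(H, e, 2)) for e in eps_grid]

# Crude dyadic-sum surrogate for (1/sqrt(n)) * int sqrt(M_2(eps)) d eps.
dudley = sum(e * np.sqrt(M) for e, M in zip(eps_grid, entropies)) / np.sqrt(n)
for e, M in zip(eps_grid, entropies):
    print(f"eps = {e:.3f}   log N_2(eps) ~ {M:.2f}")
print(f"Dudley-type sum ~ {dudley:.4f}")
```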
From Theorem 3, it suffices to establish the following inequalities for c 1: δn,D(Hsum, c) δn,D(F, c) + δn,D(G, c) δn,D(Fcnt, c) δn,D(F, c) sup w1:n Dn, (Gcnt[w1:n]) min{ Dn, (F), Dn, (βF)} + Dn, (G). (E.2) We first upper bound all relevant non-barred Dudley integrals in terms of barred integrals. Lemma E.1. For any class H, δn,D(H, c) δn,D(H, c), and Dn, (H) sup w1:n Dn, (H[w1:n]). Next, we give a technical lemma which allows us to relate the Dudley integral/critical radius of class H in terms of classes which upper bound its metric entropy. Lemma E.2. Fix a 1. (a) Suppose that H, H1, H2 satisfy the ℓ2-metric entropy inequality, for all ε > 0, M2(H, ε) M2(H1, ε/a) + M2(H2, ε/a). Then, it holds that δn,D(H, c) a maxi [2] δn,D(Hi, 2c/a). In particular, if a 2, δn,D(H, c) a max i [2] δn,D(Hi, c). Learning under Heterogenous Distribution Shift (b) Suppose instead H, H1, H2 satisfy the ℓ -metric entropy inequality, for all ε > 0, M (H, ε) M (H1, ε/a) + M (H2, ε/a). Then, Dn, (H) a Pn i=1 Dn, (Hi) To apply these, we require control over the metric entropy of βF. Lemma E.3. Let βF := {βf : f F}. Then, as long as M2(F, ε ) is finite for all ε , M2(βF, ε) inf ε <ε M2(F, ε ), and M (βF, ε) inf ε <ε M (F, ε ) To prove Lemma E.3, we require the following qualitative statement, which can be derived from a Glivenko-Cantelli Theorem (e.g. van der Vaart and Wellner (1996, Theorem 2.8.1), with the substitution F (H H)2). Proposition E.4 (Uniform Covergence of L2 measures). Let P be any measure over W, let H be any class for which M2(H, ε) is finite for all ε. Then, for all t > 0 lim n Pw1:n P sup h,h H h(w1:n) h (w1:n) 2,n Ew P [(h(w) h (w))2] t Proof of Lemma E.3. Fix w1:n = (xi, yi)1:n Wn and a slack parameter t > 0. Introduce the measure P to be the mixture distribution P = 1 n Pn i=1 Ptrain[x = | y = yi]. Recall H2 = {βf : f F}, βf[w1:n] βf [w1:n] 2 2,n = i=1 (Etrain[f(x) f (x) | y = yi])2 i=1 Etrain[(f(x) f (x))2 | y = yi] = Ex P Etrain[(f(x) f (x))2] Proposition E.4 implies that there must exists some number m n and x1:m X m such that, for all f, f F, Ex P Etrain[(f(x) f (x))2] t + f[ x1:m] f [ x1:m] 2 2,m M2(H2[w1:n], ε + t) M2(F[ x1:m], ε) M2, ε). As t, w1:n are arbitrary, the first bound follows. To prove the second, we apply a similar argument, bounding βf[w1:n] βf [w1:n] 2 max i [n] Etrain[(f(x) f (x))2 | y = yi], For each yi, there exists some mi and a sequence x(i) 1:mi such that for all f, f F, Etrain[(f(x) f (x))2 | y = yi] t + f[ x1:m] f [ x1:m] 2 2,m t max j [mi] | x(i) j ] f [ x(i) j ]|2 Hence, introduce the finite set X := Sn i=1 Smi j=1{x(i) j }, we have that for all f, f F, max i Etrain[(f(x) f (x))2 | y = yi] t + max x X |f( x) f ( x)|2. The bound follows. Eq. (E.2) is now an immediate consequence of Lemma E.1 and the following lemma. Learning under Heterogenous Distribution Shift Lemma E.5. The following bounds hold: (a) δn,D(Hsum, c) 2 max{ δn,D(F, c), δn,D(G, c)} (b) δn,D(Fcnt, c) 4 δn,D(F, c). (c) Dn, (Gcnt) min{ Dn, (F), Dn, (βF)} + Dn, (G). Proof of Lemma E.5. For all points, we apply Lemma E.2. For (a), we can verify by the triangle inequality that for any H1, H2, M2(H1 + H2, ε) M2(H1, ε/2) + M2(H2, ε/2), (E.3) which yields part (a) when specializing to H1 = F, H2 = G, and applying Lemma E.2. For part (b), we observe that Fcnt (F f ) βF. Thus, Fact E.1, followed by Eq. (E.3) and finally Lemma E.3 imply M2(Fcnt, ε) M2(Fcnt βF, ε/2) (Fact E.1) M2(F f , ε/4) + M2(βF, ε/4) (Eq. 
(E.3)) = M2(F, ε/4) + M2(βF, ε/4) M2(Fcnt, ε/4) + inf b>1 M2(F, ε/4b) (Lemma E.3) inf b>1 M2(Fcnt, ε/4b) + M2(F, ε/4b). The result now follows from Lemma E.2 with a 4b, and taking b 1. The proof of part (c) is similar: M2(Fcnt, ε) = M2(G g + βF, ε) (Fact E.1) M2(G g , ε/2) + M2(βF, ε/2) (Eq. (E.3)) = M2(G, ε/2) + M2(βF, ε/2) M2(Fcnt, ε/2) + inf b>1 M2(F, ε/2b) (Lemma E.3) inf b>1 M2(Fcnt, ε/4b) + M2(F, ε/2b), and follows from similar steps as part (c). E.1.2. PROOFS OF SUPPORTING LEMMAS FOR THEOREM 5 Fact E.1. [Exercise 4.2.10 in (Vershynin, 2018)] For any sets V V, ε-covering number of V in any norm is at most the ε/2 covering number of V. Proof of Lemma E.1. The proof of Dn, (H) supw1:n Dn, (H[w1:n]). is straightforward. To check δn,D(H, c) δn,D(H, c), we invoke Fact E.1: sup w1:n Mq(H[r, w1:n]; ε/2) sup w1:n Mq(H[w1:n]; ε/4) Mq(H, ε/4). (E.4) sup w1:n Dn,2(H[r, w1:n]) := sup w1:n inf δ R Mq(H[r, w1:n]; ε/2)dε , where R = rad2(H[r, w1:n]) sup w1:n inf δ r Mq(H[r, w1:n]; ε/2)dε (rad2(H[r, w1:n] r by localization) Mq(H[r, w1:n]; ε/2)dε Mq(H; ε/4)dε (Eq. (E.4)) := Dn,2(H, r). Learning under Heterogenous Distribution Shift δn,D(H, c) := inf r : sup w1:n Dn,2(H[r, w1:n]) r2 inf{r : Dn,2(H, r) r2 2c} = δn,D(H, c). Lemma E.6 (Concavity of D). For all a 1, Dn,2(H, ar) a Dn,2(H, r). Hence, if Dn,2(H, r) r2 2c, then Dn,2(H, r ) (r )2 2c for r r. Moreover, if b 1, δn,D(H, br) b δn,D(H, r). Proof of Lemma E.6. Dn,2(H, ar) = inf δ ar Mq(H; ε/4)dε Mq(H; aε/4)dε Mq(H; ε/4)dε Mq(H; ε/4)dε = a Dn,2(H, r). The rest of the result follows similarly to Lemma C.3. Proof of Lemma E.2. We prove part (a), the calculate for part (b) is near-identical. Dn,2(H, r) inf δ r Mq(H; ε/4)dε Mq(Hi; ε/4a)dε (by assumption) Mq(Hi; ε/4)dε = a inf δ r/a Mq(Hi; ε/4)dε = a inf δ1,δ2 r/a 2 max{δ1, δ2} + 4 n Mq(Hi; ε/4)dε a inf δ1,δ2 r Mq(Hi; ε/4)dε i=1 Dn,2(Hi, r Learning under Heterogenous Distribution Shift Consequently, δn,D(H, c) := inf{r : Dn,2(H, r) r2 inf{r : Dn,2(H1, r a) + Dn,2(H2, r = a inf{r : Dn,2(H1, r) + Dn,2(H2, r) ar2 = a inf{r : max i [2] Dn,2(H1, r) ar2 By Lemma E.6, the above is at most a maxi [2] inf{r : Dn,2(H1, r) r2 2c} a maxi [2] δn,D(Hi, 2c/a). E.2. Instantiating the Rates Next, we formally define families of function classes we all entropy families, which are characterized by upper bounds of their metric entropies. These entropy families formally capture the entropy rates depicted in Theorem 1. Definition E.3 (Entropy Families). Let H : W R, and let q [1, ], and let τ = (τ0, τ1, τ2) R3 denote a vector of parameters. We say that H Ent Famq(p, τ; R) if radq(H) R and either p = 0, and for all ε > 0, Mq(H) τ0 + τ1 log(τ2/ε) or p > 0, and for all ε > 0, Mq(H, ε) τ0 + τ1ε p. Notice that the sets Ent Famq(p, τ) are non-increasing in q, and non-decreasing in the coordinates of τ, and (up to constants) non-increasing in p. We now define complexities measures that upper bound the localized and unlocalized Dudley integrals for function classes in a given entropy family. Definition E.4 (Key Complexities). Let τ = (τ0, τ1, τ2) R3 0, p [0, ), and R, c > 0. We define the global complexity term Ξn,glob(p, τ, R) := Rτ0 n + τ1 log(e+τ2/R) n p (0, 2) p τ1 n log(e + R p n/τ1) p = 2 ( τ1 n ) 1 p (p/2 1) 2 and the local complexity term Ξn,loc(p, τ, c) := c2(1 + τ0)2 c2τ1 log(e+ nτ2/c) n p = 0 c2 (1 p/2)2 τ1 n 2 2+p p (0, 2) τ1 log(e+c n/τ1) n ) 1 p p > 2. ( n 2 2+p p 2 n 1 Lastly, we define the rate functionals. Definition E.5 (Rate Functionals). 
We now define complexity measures that upper bound the localized and unlocalized Dudley integrals for function classes in a given entropy family.

Definition E.4 (Key Complexities). Let $\tau = (\tau_0, \tau_1, \tau_2) \in \mathbb{R}^3_{\ge 0}$, $p \in [0, \infty)$, and $R, c > 0$. We define the global complexity term
$$
\Xi_{n,\mathrm{glob}}(p, \tau, R) := \frac{R\tau_0}{\sqrt{n}} + \begin{cases} R\sqrt{\dfrac{\tau_1 \log(e + \tau_2/R)}{n}} & p = 0 \\[6pt] \dfrac{1}{1 - p/2}\, R^{1 - p/2}\sqrt{\dfrac{\tau_1}{n}} & p \in (0, 2) \\[6pt] \sqrt{\dfrac{\tau_1}{n}}\, \log\big(e + R\sqrt{n/\tau_1}\big) & p = 2 \\[6pt] \Big(\dfrac{\tau_1}{n}\Big)^{1/p} (p/2 - 1)^{-2/p} & p > 2, \end{cases}
$$
and the local complexity term
$$
\Xi_{n,\mathrm{loc}}(p, \tau, c) := \frac{c^2(1 + \tau_0)^2}{n} + \begin{cases} \dfrac{c^2 \tau_1 \log(e + \sqrt{n}\,\tau_2/c)}{n} & p = 0 \\[6pt] \Big(\dfrac{c^2 \tau_1}{(1 - p/2)^2\, n}\Big)^{\frac{2}{2+p}} & p \in (0, 2) \\[6pt] c\,\sqrt{\dfrac{\tau_1}{n}}\, \log\big(e + c\sqrt{n}/\tau_1\big) & p = 2 \\[6pt] c\,\Big(\dfrac{\tau_1}{n}\Big)^{1/p} (p/2 - 1)^{-2/p} & p > 2. \end{cases}
$$
Lastly, we define the rate functionals.

Definition E.5 (Rate Functionals). We define the following rate functionals:
$$
\mathrm{rate}_{n,q}(\mathcal{H}, c) := \inf_{p, \tau, R}\big\{\Xi_{n,\mathrm{loc}}(p, \tau, c) : \mathcal{H} \in \mathrm{EntFam}_q(p, \tau; R)\big\}, \qquad \mathrm{rate}_{n,\infty}(\mathcal{H}) := \inf_{p, \tau, R}\big\{\Xi_{n,\mathrm{glob}}(p, \tau, R) : \mathcal{H} \in \mathrm{EntFam}_\infty(p, \tau; R)\big\}.
$$
That is, $\mathrm{rate}_{n,q}(\mathcal{H}, c)$ is the smallest possible local complexity term subject to $\mathcal{H}$ lying in the appropriate entropy family, and $\mathrm{rate}_{n,\infty}(\mathcal{H})$ is the smallest possible global complexity term, always taken with metric entropy in the $\infty$-norm. We are now ready to state our main theorem with explicit rates.

Theorem 6. Suppose Assumptions 2.1 to 2.3 hold. Let $\sigma_B := \max\{B, \sigma\}$, let $\nu_1, \nu_2$ be as in Eqs. (3.3) and (3.4), and let $c_2$ be a sufficiently small universal constant. Then if
$$
\mathrm{rate}_{n,2}(\mathcal{G}, \sigma_B) + \mathrm{rate}_{n,2}(\mathcal{F}, \sigma_B) + \frac{\sigma_B^2 \log(1/\delta)}{n} \le c_2,
$$
it holds with probability at least $1 - \delta$ that (recalling the class $\beta\mathcal{F} := \{\beta f : f \in \mathcal{F}\}$)
$$
\mathcal{R}_{\mathrm{test}}(\hat f_n, \hat g_n) \lesssim (\nu_1 + \nu_2)\,\mathrm{rate}_{n,2}(\mathcal{F}, \sigma_B) + \nu_1\Big(\mathrm{rate}_{n,\infty}(\mathcal{G})^2 + \min\big\{\mathrm{rate}_{n,\infty}(\mathcal{F})^2, \mathrm{rate}_{n,\infty}(\beta\mathcal{F})^2\big\}\Big) + \nu_2\,\mathrm{rate}_{n,2}(\mathcal{G}, \sigma_B) + (\nu_1 + \nu_2)\frac{\sigma_B^2 \log(1/\delta)}{n}.
$$
If in addition Assumption E.2 holds, we obtain
$$
\mathcal{R}_{\mathrm{test}}(\hat f_n, \hat g_n) \lesssim (\nu_1 + \nu_2)\,\mathrm{rate}_{n,2}(\mathcal{F}, \sigma_B) + \nu_1\,\mathrm{rate}_{n,\infty}(\mathcal{G})^2 + \nu_2\,\mathrm{rate}_{n,2}(\mathcal{G}, \sigma_B) + (\nu_1 + \nu_2)\frac{\sigma_B^2 \log(1/\delta)}{n},
$$
and if we further upper bound $\nu_1 \le \nu_{x,y}$ and $\nu_2 \le \nu_y$, we obtain (as $\nu_y \le \nu_{x,y}$)
$$
\mathcal{R}_{\mathrm{test}}(\hat f_n, \hat g_n) \lesssim \nu_{x,y}\Big(\mathrm{rate}_{n,2}(\mathcal{F}, \sigma_B) + \mathrm{rate}_{n,\infty}(\mathcal{G})^2\Big) + \nu_y\,\mathrm{rate}_{n,2}(\mathcal{G}, \sigma_B) + \nu_{x,y}\frac{\sigma_B^2 \log(1/\delta)}{n}.
$$

Proof. The result is a direct consequence of Theorem 3 and the following two lemmas, which bound the relevant Dudley integrals and critical radii; the computations are essentially standard, and we prove them in Appendix E.3.

Lemma E.7. Suppose that $\mathcal{H} \in \mathrm{EntFam}_q(p, \tau; R)$. Then $D_{n,q}(\mathcal{H}) \lesssim \Xi_{n,\mathrm{glob}}(p, \tau; R)$. Hence, $D_{n,\infty}(\mathcal{H}) \lesssim \mathrm{rate}_{n,\infty}(\mathcal{H})$.

Lemma E.8. Suppose that $\mathcal{H} \in \mathrm{EntFam}_2(p, \tau; R)$, and let $c > 0$ be arbitrary. Then $\delta_{n,D}(\mathcal{H}, c)^2 \lesssim \Xi_{n,\mathrm{loc}}(p, \tau, c)$. Hence, $\delta_{n,D}(\mathcal{H}, c)^2 \lesssim \mathrm{rate}_{n,2}(\mathcal{H}, c)$.
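Before turning to the proofs of Lemmas E.7 and E.8, we record a schematic instantiation of Theorem 6 (ours; constants, logarithmic factors, the global $\mathrm{rate}_{n,\infty}$ terms, and the $\log(1/\delta)$ term are all suppressed, and the entropy-family memberships below are assumptions made purely for illustration). If $\mathcal{F}$ is parametric, say $\mathcal{F} \in \mathrm{EntFam}_2(0, (d_{\mathcal{F}}, d_{\mathcal{F}}, R); R)$, and $\mathcal{G}$ is nonparametric with $\mathcal{G} \in \mathrm{EntFam}_2(p_{\mathcal{G}}, \tau_{\mathcal{G}}; R)$ for some $p_{\mathcal{G}} \in (0, 2)$, then roughly
$$
\mathrm{rate}_{n,2}(\mathcal{F}, \sigma_B) \approx \frac{\sigma_B^2\, d_{\mathcal{F}}}{n}, \qquad \mathrm{rate}_{n,2}(\mathcal{G}, \sigma_B) \approx n^{-\frac{2}{2+p_{\mathcal{G}}}},
$$
so the final bound of Theorem 6 behaves like
$$
\mathcal{R}_{\mathrm{test}}(\hat f_n, \hat g_n) \lesssim \nu_{x,y}\, \frac{\sigma_B^2\, d_{\mathcal{F}}}{n} + \nu_y\, n^{-\frac{2}{2+p_{\mathcal{G}}}}:
$$
the larger shift factor $\nu_{x,y}$ multiplies only the fast parametric rate, while the slow nonparametric rate is paired with the smaller factor $\nu_y$.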
E.3. Proof of Dudley Bounds (Lemmas E.7 and E.8)

Proof of Lemma E.7. From the definition of the Dudley functional, Definition 4.1, it is clear that the additive $\tau_0$-term in the metric entropy bound contributes at most an additive $\frac{R\tau_0}{\sqrt{n}}$ term to the integral. Consequently, we handle what is left over, assuming throughout that $\tau_0 = 0$. For a class $\mathcal{H}$, let $\phi(\varepsilon)$ be the upper bound on $M_q(\mathcal{H}, \varepsilon)$ prescribed by Definition E.3 (again, setting $\tau_0 = 0$). Then, from Definition 4.1,
$$
D_{n,q}(\mathcal{H}) \lesssim \inf_{\delta \le R}\Big\{\delta + \frac{1}{\sqrt{n}} \int_\delta^{R} \sqrt{\phi(\varepsilon)}\, d\varepsilon\Big\},
$$
where the last inequality uses similar changes of variables as in Lemma E.2. When $p < 2$, we take $\delta = 0$ and obtain
$$
D_{n,q}(\mathcal{H}) \lesssim \frac{1}{\sqrt{n}} \int_0^{R} \sqrt{\phi(\varepsilon)}\, d\varepsilon \lesssim \frac{R\tau_0}{\sqrt{n}} + \begin{cases} R\sqrt{\dfrac{\tau_1 \log(e + \tau_2/R)}{n}} & p = 0, \\[6pt] \dfrac{1}{1 - p/2}\, R^{1 - p/2} \sqrt{\dfrac{\tau_1}{n}} & p \in (0, 2). \end{cases}
$$
Next, for the case $p = 2$, we pick $\delta = \sqrt{\tau_1/n}$ to get
$$
\inf_{\delta \le R}\Big\{\delta + \frac{1}{\sqrt{n}} \int_\delta^R \sqrt{\phi(\varepsilon)}\, d\varepsilon\Big\} \lesssim \frac{R\tau_0}{\sqrt{n}} + \inf_{\delta \le R}\Big\{\delta + \sqrt{\tau_1/n}\, \log(R/\delta)\Big\} \lesssim \frac{R\tau_0}{\sqrt{n}} + \sqrt{\tau_1/n}\, \log\big(e + R\sqrt{n/\tau_1}\big).
$$
Finally, consider $p > 2$. Then
$$
\inf_{\delta \le R}\Big\{\delta + \frac{1}{\sqrt{n}}\int_\delta^R \sqrt{\phi(\varepsilon)}\, d\varepsilon\Big\} \lesssim \frac{R\tau_0}{\sqrt{n}} + \inf_{\delta \le R}\Big\{\delta + \frac{1}{p/2 - 1}\sqrt{\frac{\tau_1}{n}}\,\delta^{1 - p/2}\Big\}.
$$
Balancing $\delta = \frac{1}{p/2 - 1}\sqrt{\tau_1/n}\,\delta^{1 - p/2}$, we choose $\delta = \big(\tau_1/n\big)^{1/p}(p/2 - 1)^{-2/p}$, yielding
$$
\inf_{\delta \le R}\Big\{\delta + \frac{1}{\sqrt{n}}\int_\delta^R \sqrt{\phi(\varepsilon)}\, d\varepsilon\Big\} \lesssim \frac{R\tau_0}{\sqrt{n}} + \Big(\frac{\tau_1}{n}\Big)^{1/p}(p/2 - 1)^{-2/p}.
$$
This concludes the proof.

Proof of Lemma E.8. Modifying the computation in Lemma E.7 shows that $D_{n,2}(\mathcal{H}, r) \lesssim \Xi_{n,\mathrm{glob}}(p, \tau; r)$. For $p = 0$, we use that for $r \ge c/\sqrt{n}$,
$$
\Xi_{n,\mathrm{glob}}(0, \tau; r) \lesssim \frac{r}{\sqrt{n}}\Big(\tau_0 + \sqrt{\tau_1\log\big(e + \sqrt{n}\,\tau_2/c\big)}\Big),
$$
so that
$$
\delta_{n,D}(\mathcal{H}, c)^2 = \inf\Big\{r^2 : D_{n,2}(\mathcal{H}, r) \le \tfrac{r^2}{2c}\Big\} \lesssim \frac{c^2\big(\tau_0^2 + \tau_1\log(e + \sqrt{n}\,\tau_2/c)\big)}{n} \le \Xi_{n,\mathrm{loc}}(0, \tau; c).
$$
For $p \in (0, 2)$,
$$
\Xi_{n,\mathrm{glob}}(p, \tau; r) = \frac{r\tau_0}{\sqrt{n}} + \frac{1}{1 - p/2}\, r^{1 - p/2}\sqrt{\frac{\tau_1}{n}}.
$$
Going forward, let $c_0$ be a universal constant for which $D_{n,2}(\mathcal{H}, r) \le c_0\,\Xi_{n,\mathrm{glob}}(p, \tau; r)$. One can then check that
$$
\inf\Big\{r^2 : D_{n,2}(\mathcal{H}, r) \le \tfrac{r^2}{2c}\Big\} \le \max\{r_1^2, r_2^2\},
$$
where $r_1$ and $r_2$ balance the following equations:
$$
\frac{r_1(1 + \tau_0)}{\sqrt{n}} = \frac{r_1^2}{4c_0 c}, \qquad \frac{r_2^{1 - p/2}}{1 - p/2}\sqrt{\frac{\tau_1}{n}} = \frac{r_2^2}{4c_0 c}.
$$
Solving yields
$$
r_1^2 = \frac{16 c_0^2 c^2 (1 + \tau_0)^2}{n}, \qquad r_2^2 = \Big(\frac{4 c_0 c}{1 - p/2}\sqrt{\frac{\tau_1}{n}}\Big)^{\frac{2}{1 + p/2}} = \Big(\frac{16 c_0^2 c^2\, \tau_1}{(1 - p/2)^2\, n}\Big)^{\frac{2}{2+p}},
$$
which gives
$$
\max\{r_1^2, r_2^2\} \lesssim \frac{c^2(1 + \tau_0)^2}{n} + \Big(\frac{c^2 \tau_1}{(1 - p/2)^2\, n}\Big)^{\frac{2}{2+p}} = \Xi_{n,\mathrm{loc}}(p, \tau; c), \qquad p \in (0, 2).
$$
For $p = 2$, we note that
$$
\Xi_{n,\mathrm{glob}}(2, \tau; r) \le \frac{r\tau_0}{\sqrt{n}} + \sqrt{\frac{\tau_1}{n}}\,\log\big(e + r\sqrt{n/\tau_1}\big),
$$
so, similarly, $\inf\{r^2 : D_{n,2}(\mathcal{H}, r) \le r^2/2c\} \le \max\{r_1^2, r_3^2\}$, where $r_1$ is as above and $r_3$ is the smallest value satisfying
$$
\sqrt{\frac{\tau_1}{n}}\,\log\big(e + r_3\sqrt{n}/\tau_1\big) \le \frac{r_3^2}{4 c_0 c}.
$$
Using Jensen's inequality, one can check that we may take $r_3 \lesssim c_0 c$, so it suffices that $r_3^2 = 4 c_0 c\,\sqrt{\tau_1/n}\,\log(e + 4 c_0 c\sqrt{n}/\tau_1)$; suppressing the universal constant $c_0$,
$$
r_1^2 + r_3^2 \lesssim \frac{c^2(1 + \tau_0)^2}{n} + c\,\sqrt{\frac{\tau_1}{n}}\,\log\big(e + c\sqrt{n}/\tau_1\big) = \Xi_{n,\mathrm{loc}}(2, \tau; c).
$$
For $p > 2$, repeating the same arguments, we have $\inf\{r^2 : D_{n,2}(\mathcal{H}, r) \le r^2/2c\} \le \max\{r_1^2, r_4^2\}$, where $r_1$ is as above and $r_4$ satisfies
$$
\frac{r_4^2}{4 c_0 c} = \Big(\frac{\tau_1}{n}\Big)^{1/p}(p/2 - 1)^{-2/p}, \quad\text{so that}\quad r_4^2 \lesssim c\,\Big(\frac{\tau_1}{n}\Big)^{1/p}(p/2 - 1)^{-2/p}.
$$
Following similar steps concludes the proof.

F. Experiment Details

This section describes experiment details regarding model architectures and training hyperparameters.

F.1. Regression

Analyzing the simple feature. To make x simpler, we either set dx < dy or df < dg. We compare the generalization error (i.e., the difference between test and train mean squared error) for the auxiliary task of predicting f⋆(x) and g⋆(y). We train fφ(x) to predict f⋆(x) and gφ(y) to predict g⋆(y). fφ incurs a generalization error of 2.5 × 10⁻⁵. When dx < dy, gφ incurs a generalization error of 4.9 × 10⁻⁵. When df < dg, gφ incurs a generalization error of 7.8 × 10⁻⁵. In both cases, we observe that the auxiliary task of predicting f⋆(x) has lower generalization error.

Testing resiliency of predictors. We independently train two predictors ẑ = fθ(x) + gθ(y) and ẑ = hθ(x, y) (the latter using the concatenated features (x, y)) to minimize mean squared error (MSE) under a training distribution with px = py = 0.01. We then measure the MSE of both predictors on shifted distributions, where we hold the mixing probability of one of {x, y} fixed and vary the other's probability in the range {0.1, 0.2, 0.5, 0.9, 0.99}. Figure 2 shows that the MSE of both predictors degrades less under shifts in px than under shifts in py, corroborating our theoretical expectations.

Figure 2. Mean squared error of predictors of the form ẑ = fθ(x) + gθ(y) (Left) and ẑ = hθ(x, y) (Right) on shifted distributions, where we hold the mixing probability of one of {x, y} fixed and vary the other's probability. The MSE of both predictors degrades less under shifts in px than under shifts in py.

Implementation details. We use dx = 3, and dy is either 3 (when dx = dy) or 6 (when dx < dy). Both f⋆(x) and g⋆(y) are 2-layer multi-layer perceptrons (MLPs) with ReLU activations. While f⋆ has a hidden dimension of df = 32, g⋆ has a hidden dimension of either dg = 32 (when df = dg) or dg = 4096 (when df < dg). We parameterize fθ(x) and fφ(x) with 2-layer MLPs with ReLU activations and the same hidden dimension as f⋆. We parameterize gθ(y) and gφ(y) with 2-layer MLPs with ReLU activations and the same hidden dimension as g⋆. We parameterize hθ(x, y) with a 2-layer MLP with ReLU activations and the same hidden dimension as g⋆. We collect 100k data points with px = py = 0.01 for training hθ. We train hθ with the Adam optimizer (Kingma and Ba, 2014) for 100 epochs using a learning rate of 0.001 and a batch size of 50. At test time, we increase px and py to {0.1, 0.2, 0.5, 0.9, 0.99} and evaluate the model using 1000 data points.
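To make the setup above concrete, the following is a minimal sketch of the synthetic regression comparison (our reconstruction in PyTorch, not the authors' code; the data generator `sample_batch` is a hypothetical placeholder, since the exact mixture construction behind the probabilities px and py is not reproduced here):

```python
# Sketch of the F.1 comparison between the additive predictor f(x) + g(y) and the joint
# predictor h(x, y). Architectures and optimizer settings follow the implementation details
# above; the data generator is a hypothetical stand-in.
import torch
import torch.nn as nn

d_x, d_y, hidden_f, hidden_g = 3, 6, 32, 4096  # the d_x < d_y / d_f < d_g configuration

def mlp(d_in, hidden, d_out=1):
    # 2-layer MLP with ReLU activations, as in the implementation details.
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

class AdditivePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = mlp(d_x, hidden_f)  # simple branch, depends only on x
        self.g = mlp(d_y, hidden_g)  # complex branch, depends only on y
    def forward(self, x, y):
        return self.f(x) + self.g(y)

class JointPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.h = mlp(d_x + d_y, hidden_g)  # same hidden width as g, on concatenated features
    def forward(self, x, y):
        return self.h(torch.cat([x, y], dim=-1))

def sample_batch(batch_size, p_x=0.01, p_y=0.01):
    # Hypothetical placeholder for the paper's mixture distribution over (x, y, z);
    # p_x and p_y are the shift parameters of that mixture and are not modeled here.
    x = torch.randn(batch_size, d_x)
    y = torch.randn(batch_size, d_y)
    z = x.sum(dim=-1, keepdim=True) + torch.tanh(y).sum(dim=-1, keepdim=True)
    return x, y, z

def train(model, epochs=100, lr=1e-3, batch_size=50, steps_per_epoch=2000):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            x, y, z = sample_batch(batch_size)
            opt.zero_grad()
            nn.functional.mse_loss(model(x, y), z).backward()
            opt.step()
    return model

# additive, joint = train(AdditivePredictor()), train(JointPredictor())
# x, y, z = sample_batch(1000, p_x=0.5)  # evaluate on a shifted distribution, as in Figure 2
# print(nn.functional.mse_loss(additive(x, y), z).item(),
#       nn.functional.mse_loss(joint(x, y), z).item())
```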
F.2. Binary Classification with the Waterbird Dataset

Setup. Our task is to classify images of birds as waterbirds or landbirds, against a background of either land or water. These images have two high-level features: birdtype and background. We first empirically determine which feature is simple. We then empirically test whether the classifier is more resilient to distribution shift in the simple feature.

Determining the simple feature. To determine which feature is simple, we learn two classifiers: fbird, predicting birdtype, and fback, predicting background. We train them on the standard training set of the Waterbird dataset and test them on a sampled test set that has the same distribution as the training set. While both fbird and fback have a training accuracy of 1.0, fbird has a test accuracy of 0.88 and fback has a test accuracy of 0.96. Since fback has the lower generalization error, we consider background to be the simple feature.

Testing classifier resiliency. We test the resiliency of fbird as we shift the distribution of one feature while keeping the distribution of the other feature fixed. Specifically, we vary the proportion (in percentages) of images with waterbirds (resp. land backgrounds) in the test set while keeping the proportion of images with land backgrounds (resp. waterbirds) fixed. We show the results in Figure 1. We observe that the test accuracy of fbird varies less when we shift the distribution of background while keeping the distribution of birdtype fixed.

Implementation details. We parameterize fbird and fback with a ResNet50 model (He et al., 2016) and train them with stochastic gradient descent for 300 epochs using a learning rate of 0.001, a batch size of 128, momentum of 0.9, and ℓ2 regularization of 0.0001. We took these hyperparameters and architectural choices from Sagawa et al. (2019), which introduced the Waterbird dataset. We used the github repo https://github.com/kohpangwei/group_DRO for running our experiments.
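As a concrete illustration of the training recipe above, here is a minimal sketch (our reconstruction, not the authors' code, which lives in the group_DRO repository); the `WaterbirdDataset` class is a hypothetical placeholder for that repository's data loader:

```python
# Sketch of the Waterbird classifiers f_bird / f_back in F.2. A torchvision ResNet50 is
# fine-tuned with SGD using the hyperparameters listed above.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torchvision.models import resnet50

class WaterbirdDataset(Dataset):
    """Placeholder: should yield (image_tensor, label) pairs, label = birdtype or background."""
    def __init__(self, split="train", target="birdtype"):
        self.items = []  # populate from the Waterbird metadata in practice
    def __len__(self):
        return len(self.items)
    def __getitem__(self, idx):
        return self.items[idx]

def make_classifier(num_classes=2):
    model = resnet50(weights="IMAGENET1K_V1")  # ImageNet-pretrained backbone
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def train(model, dataset, epochs=300, lr=1e-3, batch_size=128, momentum=0.9, weight_decay=1e-4):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss_fn(model(images), labels).backward()
            opt.step()
    return model

# f_bird predicts birdtype; f_back predicts background (same architecture, different labels).
# f_bird = train(make_classifier(), WaterbirdDataset(target="birdtype"))
# f_back = train(make_classifier(), WaterbirdDataset(target="background"))
```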
F.3. Multi-class Classification with FMoW

Setup. Our task is to predict landtype from images with a top-down satellite view. These images also contain information about their geographical region (Africa, the Americas, Oceania, Asia, or Europe). Hence, they have two high-level features: landtype and region. We first empirically determine which feature is simple. We then empirically test whether the classifier is more resilient to distribution shift in the simple feature.

Determining the simple feature. To determine which feature is simple, we learn two classifiers: fland, predicting landtype, and fgeo, predicting region. We train and test them on the standard training and test sets of the FMoW dataset. While fland and fgeo have training accuracies (i.e., number of correct predictions / number of datapoints) of 0.71 and 0.74, they have test accuracies of 0.59 and 0.69, respectively. Since fgeo has the lower generalization error, we consider region to be the simple feature.

Testing classifier resiliency. We test the resiliency of fland as we shift the distribution of one feature while keeping the distribution of the other feature fixed. To shift the distribution of landtype, we vary the proportion of images labelled as zoo (a landtype). Similarly, to shift the distribution of region, we vary the proportion of images from the Africa region. We show the results in Figure 3. We observe that the test accuracy of fland varies less (i.e., fland is more resilient) when we shift the distribution of region while keeping the distribution of landtype fixed.

Figure 3. (Left) Visualization of the FMoW dataset. (Right) Test accuracy of the landtype classifier as we shift the distribution of region (resp. landtype) while keeping the distribution of landtype (resp. region) fixed. The results again show that the classifier is more robust to distribution shift in the simple feature, region.

Implementation details. We parameterize fland and fgeo with a DenseNet121 model (Huang et al., 2017) and train them with the Adam optimizer (Kingma and Ba, 2014) for 60 epochs using a learning rate of 0.0001 and a batch size of 32. We took these hyperparameters and architectural choices from Koh et al. (2021). We used the github repo https://github.com/plambda/wilds for running our experiments.

F.4. Learning Logical Operators with CelebA

Setup. We re-purpose the CelebA dataset to learn the logical operators OR and XOR over two attributes. We first empirically determine which attributes are simpler than others. We then learn logical operators combining a simple attribute and a complex attribute. Finally, we empirically test the resilience of the logical operators against distribution shifts in the simple and complex attributes.

Determining simple attributes. We first train and test a multi-head binary classifier that detects the presence of 40 different attributes, with one head per attribute, on images from the CelebA standard training set (CelebA-STS). We select the attributes {bald, pale_skin} as simple due to their low generalization error, and {big_lips, narrow_eyes} as complex due to their larger generalization error. Table 1 shows the complete list of training and test accuracies of the multi-head binary classifier for each attribute.

Testing resiliency of logical operators. We first learn logical operators f_OR and f_XOR over a pair of simple and complex attributes. We use two such pairs in our experiments: {bald, big_lips} and {pale_skin, narrow_eyes}. We represent these logical operators as binary classifiers and train them on CelebA-STS. We obtain the labels for these logical operators by applying the corresponding logical operation to the labels of the simple and the complex attribute. We then test the resiliency of these logical operators by shifting the distribution of the simple attribute and of the complex attribute, one at a time. Specifically, we vary the proportion (in percentages) of images with the simple attribute (or complex attribute) in the test set while keeping the proportion of images with the complex attribute (or simple attribute) fixed. We show the results in Figure 1 and Figure 4. We observe that the success rate of the logical operators varies less when we shift the distribution of the simple attribute of each pair ({bald, big_lips} and {pale_skin, narrow_eyes}).

Figure 4. Test accuracy of logical operators over a pair of (simple, complex) attributes. We vary the proportion (in percentages) of images with the simple attribute (resp. complex attribute) in the test set while keeping the proportion of images with the complex attribute (resp. simple attribute) fixed. We use the pairs {bald, big_lips} (Left) and {pale_skin, narrow_eyes} (Right).
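For concreteness, here is a minimal sketch of the label construction and test-set shifting described above (our reconstruction, not the authors' code). It assumes attribute annotations in CelebA's standard `list_attr_celeba.txt` format, and the subsampling routine is a simplification of the protocol above:

```python
# Sketch of the OR / XOR label construction in F.4. CelebA attribute annotations are
# {-1, +1}-valued, one row per image and one column per attribute.
import pandas as pd

def load_attributes(path="list_attr_celeba.txt"):
    # Standard CelebA attribute table; the first data column (image name) becomes the index.
    df = pd.read_csv(path, sep=r"\s+", skiprows=1)
    return (df > 0)  # convert {-1, +1} to booleans

def logical_labels(attrs, simple="Bald", complex_="Big_Lips", op="OR"):
    a, b = attrs[simple], attrs[complex_]
    return (a | b) if op == "OR" else (a ^ b)  # OR / XOR of the two attribute labels

def shifted_test_split(attrs, shift_attr, proportion, n=10_000, seed=0):
    # Subsample a test set in which `shift_attr` is positive with the requested proportion,
    # leaving the other attribute untouched (a simplification of the paper's protocol).
    pos = attrs[attrs[shift_attr]].sample(int(proportion * n), random_state=seed)
    neg = attrs[~attrs[shift_attr]].sample(n - len(pos), random_state=seed)
    return pd.concat([pos, neg])

# Example: labels for f_OR on the pair {bald, big_lips}, and a shifted test split in which
# 50% of images are bald while the big_lips marginal is left as-is.
# attrs = load_attributes()
# y_or = logical_labels(attrs, "Bald", "Big_Lips", op="OR")
# test_attrs = shifted_test_split(attrs, "Bald", proportion=0.5)
```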
Implementation details. We parameterize the multi-head binary classifier, which predicts the presence of the 40 different facial attributes, with a MobileNet (Howard et al., 2017). We train it with the Adam optimizer (Kingma and Ba, 2014) for 100 epochs using a learning rate of 0.001 and a batch size of 64. We use the same architecture and training hyperparameters for learning the logical operators f_OR and f_XOR. We borrow these hyperparameters and the code for running our experiments from the github repo https://github.com/suikei-wang/Facial-Attributes-Classification.

F.5. Imitation Learning on a Robotic Pusher Arm Environment

Environment description. We use a robotic pusher arm environment adapted from Ajay et al. (2022); Gupta et al. (2018), where the goal is to push the red cube to the green circle. When the red cube reaches the green circle, the agent receives a reward of +1. The state space is 12-dimensional, consisting of the mass of the red cube (1), a dampness parameter for the joints (1), the joint angles (3) and velocities (3) of the gripper, the COM of the gripper (2), and the position of the red cube (2). The green circle's position is fixed, at an initial distance of 0.5 from the COM of the gripper. The red cube (of size 0.03) is initially at a distance of 0.1 from the COM of the gripper and at an angle of π/4. During training, at the beginning of every episode, we sample m ∼ N(60, 15) and d ∼ N(0.5, 0.1). The task horizon is 60 timesteps.

Expert policy. To obtain an expert policy that provides data for imitation learning and for training the dynamics models (pφi(st+1 | st, at, i), i ∈ {d, m}), we train a policy πexp(a | s, m, d) with Soft Actor-Critic (Haarnoja et al., 2018) for 10e6 environment steps.

Determining the simple factor. To determine which of the two factors is simpler, we measure the generalization error on the auxiliary task of predicting next-step dynamics, where one of {m, d} is held fixed and the other is drawn from a fixed distribution, the same at training and test time. We learn two dynamics models, pφm(st+1 | st, at, m) and pφd(st+1 | st, at, d), on two separate datasets D_train_dyn,m and D_train_dyn,d. Both D_train_dyn,m and D_train_dyn,d contain 100 expert trajectories each, with varying m and d respectively, while keeping the other factor fixed: m ∼ N(60, 15) and d = 0.5 in D_train_dyn,m, while d ∼ N(0.5, 0.1) and m = 60 in D_train_dyn,d. To evaluate the learned dynamics models, we generate D_test_dyn,m and D_test_dyn,d in the same way as their training counterparts. On the training datasets, we find the mean squared errors ((1/|D|) Σ ||st+1 − pφi(st+1 | st, at, i)||²₂, i ∈ {m, d}) of pφm and pφd to be 0.062 and 0.183, respectively. On the test datasets, we find the mean squared errors of pφm and pφd to be 0.081 and 0.237, respectively. Since pφm(st+1 | st, at, m) has the smaller generalization error, we consider the object mass to be the simple factor. Intuitively, the object mass only affects the dynamics of the system when the robotic arm is in contact with the object. In contrast, the joints' dampness affects the way the robotic arm moves, and hence affects the system's dynamics regardless of whether the robotic arm is in contact with the object.

Implementation details. We parameterize the dynamics models (pφi(st+1 | st, at, i), i ∈ {d, m}), the expert policy πexp(a | s, m, d), and the imitator policy πθ(a | s, m, d) with 3-layer multi-layer perceptrons (MLPs) with a hidden dimension of 512 and ReLU activations. We train the dynamics models and the imitator policy with the Adam optimizer (Kingma and Ba, 2014) for 100 epochs using a learning rate of 0.001 and a batch size of 128.
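To illustrate the procedure for determining the simple factor, here is a minimal sketch (our reconstruction, not the authors' code); the action dimension and the dataset objects are hypothetical placeholders:

```python
# Sketch of the dynamics-model comparison used in F.5 to decide which factor is "simple".
# Each model is a 3-layer MLP that takes (s_t, a_t, factor) and predicts s_{t+1}.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 12, 2  # ACTION_DIM is an assumption; the paper does not state it here.

class DynamicsModel(nn.Module):
    def __init__(self, factor_dim=1, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM + factor_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, STATE_DIM),
        )
    def forward(self, s, a, factor):
        return self.net(torch.cat([s, a, factor], dim=-1))

def mse(model, dataset):
    # dataset: iterable of (s_t, a_t, factor, s_next) tensor tuples
    errs = [((model(s, a, f) - s_next) ** 2).sum(-1).mean() for s, a, f, s_next in dataset]
    return torch.stack(errs).mean().item()

def generalization_gap(model, train_set, test_set):
    # The factor whose dynamics model has the smaller gap is treated as the "simple" one.
    return mse(model, test_set) - mse(model, train_set)

# gap_m = generalization_gap(model_m, D_train_dyn_m, D_test_dyn_m)  # mass varied, dampness fixed
# gap_d = generalization_gap(model_d, D_train_dyn_d, D_test_dyn_d)  # dampness varied, mass fixed
# simple_factor = "mass" if gap_m < gap_d else "dampness"
```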
Attribute | Train Accuracy | Test Accuracy
5_o_Clock_Shadow | 0.952 ± 0.003 | 0.945 ± 0.008
Arched_Eyebrows | 0.879 ± 0.01 | 0.84 ± 0.006
Attractive | 0.846 ± 0.001 | 0.828 ± 0.003
Bags_Under_Eyes | 0.872 ± 0.004 | 0.845 ± 0.006
Bald * | 0.991 ± 0.002 | 0.988 ± 0.004
Bangs | 0.968 ± 0.006 | 0.96 ± 0.005
Big_Lips * | 0.8 ± 0.008 | 0.716 ± 0.009
Big_Nose | 0.867 ± 0.002 | 0.839 ± 0.004
Black_Hair | 0.918 ± 0.007 | 0.896 ± 0.006
Blond_Hair | 0.963 ± 0.005 | 0.957 ± 0.003
Blurry | 0.966 ± 0.008 | 0.961 ± 0.005
Brown_Hair | 0.886 ± 0.009 | 0.885 ± 0.009
Bushy_Eyebrows | 0.929 ± 0.007 | 0.922 ± 0.002
Chubby | 0.964 ± 0.003 | 0.948 ± 0.005
Double_Chin | 0.971 ± 0.002 | 0.961 ± 0.006
Eyeglasses | 0.998 ± 0.003 | 0.996 ± 0.006
Goatee | 0.978 ± 0.002 | 0.974 ± 0.003
Gray_Hair | 0.984 ± 0.005 | 0.98 ± 0.002
Heavy_Makeup | 0.939 ± 0.007 | 0.918 ± 0.005
High_Cheekbones | 0.898 ± 0.006 | 0.876 ± 0.004
Male | 0.989 ± 0.001 | 0.979 ± 0.003
Mouth_Slightly_Open | 0.957 ± 0.004 | 0.936 ± 0.008
Mustache | 0.975 ± 0.005 | 0.97 ± 0.007
Narrow_Eyes * | 0.915 ± 0.002 | 0.875 ± 0.004
No_Beard | 0.971 ± 0.009 | 0.96 ± 0.007
Oval_Face | 0.795 ± 0.002 | 0.758 ± 0.005
Pale_Skin * | 0.97 ± 0.008 | 0.967 ± 0.005
Pointy_Nose | 0.795 ± 0.011 | 0.774 ± 0.009
Receding_Hairline | 0.952 ± 0.003 | 0.939 ± 0.008
Rosy_Cheeks | 0.959 ± 0.004 | 0.95 ± 0.009
Sideburns | 0.981 ± 0.002 | 0.978 ± 0.003
Smiling | 0.948 ± 0.006 | 0.928 ± 0.005
Straight_Hair | 0.856 ± 0.003 | 0.831 ± 0.009
Wavy_Hair | 0.87 ± 0.003 | 0.833 ± 0.004
Wearing_Earrings | 0.923 ± 0.004 | 0.9 ± 0.002
Wearing_Hat | 0.994 ± 0.009 | 0.989 ± 0.005
Wearing_Lipstick | 0.946 ± 0.003 | 0.934 ± 0.007
Wearing_Necklace | 0.895 ± 0.004 | 0.87 ± 0.006
Wearing_Necktie | 0.97 ± 0.001 | 0.965 ± 0.002
Young | 0.912 ± 0.002 | 0.877 ± 0.004

Table 1. Train and test accuracy of the multi-head binary classifier for each attribute in the CelebA dataset. We consider attributes with low generalization error as simple and attributes with high generalization error as complex. We highlight (with *) the selected simple (Bald, Pale_Skin) and complex (Big_Lips, Narrow_Eyes) attributes. We report the mean and standard error across 4 replicates.