# context_is_environment__b19c8815.pdf

Published as a conference paper at ICLR 2024

CONTEXT IS ENVIRONMENT

Sharut Gupta Meta AI, MIT CSAIL sharut@mit.edu

Stefanie Jegelka MIT CSAIL stefje@mit.edu

David Lopez-Paz, Kartik Ahuja Meta AI {dlp,kartikahuja}@meta.com

Two lines of work are taking center stage in AI research. On the one hand, the community is making increasing efforts to build models that discard spurious correlations and generalize better in novel test environments. Unfortunately, a hard lesson so far is that no proposal convincingly outperforms a simple empirical risk minimization baseline. On the other hand, large language models (LLMs) have erupted as algorithms able to learn in-context, generalizing on-the-fly to the eclectic contextual circumstances that users enforce by prompting. We argue that context is environment, and posit that in-context learning holds the key to better domain generalization. Via extensive theory and experiments, we show that paying attention to context unlabeled examples as they arrive allows our proposed In-Context Risk Minimization (ICRM) algorithm to zoom-in on the test environment risk minimizer, leading to significant out-of-distribution performance improvements. Furthermore, training with context helps the model learn a better featurizer. From all of this, two messages are worth taking home: researchers in domain generalization should consider environment as context, and harness the adaptive power of in-context learning. Researchers in LLMs should consider context as environment, to better structure data towards generalization. Code is available at https://github.com/facebookresearch/ICRM.

1 INTRODUCTION

One key problem in AI research is to build systems that generalize across a wide range of test environments. In principle, these algorithms should discard spurious correlations present only in certain training environments, and capture invariant patterns appearing across conditions. For example, we would like to build self-driving systems that, while trained on certain weather conditions, levels of traffic, and driving rules, can perform satisfactorily in new circumstances. Unfortunately, this has proven challenging for instance, these models often fail to drive in unseen weather conditions (Lechner et al., 2022), creating immediate hazards. Despite its importance, how to perform well beyond the distribution of the training data remains a burning question. In fact, major international conferences offer well-attended workshops dedicated to the issue (Wald et al., 2023), and news articles remind us of the profound societal impact of the failures of ML systems (Angwin et al., 2016).

Research efforts have so far led to domain generalization algorithms that fall into two broad categories. On the one hand, invariance proposals (Ganin et al., 2016; Peters et al., 2016; Arjovsky et al., 2019), illustrated in Figure 1a, discard all environment-specific information, thus removing excessive signal about the problem. On the other hand, marginal transfer proposals (Blanchard et al., 2011; Li et al., 2016a; Zhang et al., 2020; Bao and Karaletsos, 2023), illustrated in Figure 1b, summarize observed inputs in each environment as a coarse embedding, thus diluting important signal at the example level. So far, the bitter lesson is that no algorithm geared towards out-of-distribution generalization convincingly outperforms a simple empirical risk minimization (ERM) baseline, which pools the data from environments, when evaluated across standard real-world benchmarks (Gulrajani and Lopez-Paz, 2020; Gagnon-Audet et al., 2023; Yao et al., 2022). Has the generalization project hit a dead end?

In parallel, large language models (Open AI, 2023; Touvron et al., 2023, LLMs) are taking the world by storm. LLMs are next-token predictors built with transformers (Vaswani et al., 2017) and trained on enormous amounts of natural language. One impressive capability of LLM systems is their ability to learn in-context, that is, to generalize on-the-fly to the eclectic circumstances that users enforce by prompting (Brown et al., 2020). For example, a trained LLM would complete the sequence

Published as a conference paper at ICLR 2024

xe 1 xe i 1 xe i

(a) Invariance DG.

xe 1 xe i 1 xe i

j=1 ϕ(xe j)

(b) Marginal transfer DG.

xe 1 xe i 1 xe i

transformer

(c) In-context DG (ours).

Figure 1: Three frameworks for domain generalization (DG), predicting the target ye i from the input xe i in test environment e. Depicted in blue, the last example xe i 1 contains relevant features for the current prediction. (a) Invariance DG discards all of the previously observed information from the test environment, removing too much predictive signal. (b) Marginal transfer DG summarizes all of the previously observed test inputs as a coarse embedding, diluting predictive signal found at the example level. (b) Our in-context DG directly observes all of the previous test inputs, allowing the search of needle-in-the-haystack signals, such as the relevant one, i.e., xe i 1.

France-Paris Italy-Rome Spainwith the sequence Madrid, effectively learning, from the input itself, that the user is demanding a capital prediction task. When interacting with LLMs, one feels closer towards solving the puzzle of out-of-distribution (OOD) generalization. Could LLMs hold a key piece to the OOD puzzle?

This paper suggests a positive answer, establishing a strong parallel between the concept of environment in domain generalization, and the concept of context in next-token prediction. In fact, different environments describe varying contextual circumstances such as time, location, experimental intervention, and other background conditions. On the one hand, describing environments as context opens the door to using powerful next-token predictors off-the-shelf, with their adaptability to learn in-context, to address domain generalization problems. This allows us to move from coarse domain indices to fine and compositional contextual descriptions, helpful to amortize learning across similar environments. On the other hand, using context as environment can help LLM researchers to use various domain generalization methods such as distributionally robust optimization (Sagawa et al., 2019; Xie et al., 2023, DRO) across varying contexts.

Based on these insights, we propose a natural algorithm, In-Context Risk Minimization (ICRM), illustrated in Figure 1c. Given examples (xe i, ye i ) from environment e, we propose to address out-ofdistribution prediction as in-distribution context-based prediction, training a machine:

ye i h(xe i; xe 1, . . . , xe i 1 | {z } environment context

While the requested prediction ye i concerns only the input xe i, the machine can now pay attention to the test experience so far, extracting relevant environment information from instance and distributional features. Our theoretical results show that such in-context learners can utilize context to zoom-in on the empirical risk minimizer of the test environment, achieving competitive out-of-distribution performance. Further, we show that the extended input-context feature space in ICRM can reveal invariances that ERM-based algorithms ignore. A standout feature of ICRM is its capability to improve feature learning through context-based training enabling ICRM to outperform counterparts even in the absence of context. Our extensive experiments demonstrate the efficacy of ICRM, and extensive ablations dissect and deepen our understanding of it.

We organize the rest of the exposition as follows. Section 2 reviews the fundamentals of domain generalization, centered around the concept of environment. Section 3 explains the basics of nexttoken prediction, with an emphasis on learning from context. Section 4 knits these threads together to propose ICRM, a framework to learn from contexts from multiple environments, supported by a

Published as a conference paper at ICLR 2024

host of theory. Section 6 showcases the efficacy of our ideas in a variety of domain generalization benchmarks, and Section 7 closes the exposition with some topics for future discussion.

2 THE PROBLEM OF DOMAIN GENERALIZATION

The goal of domain generalization (DG) is to learn a predictor that performs well across a set of domains or environments E (Muandet et al., 2013). Environment indices e E list different versions of the data collection process variations that may occur due to time, location, experimental interventions, changes in background conditions, and other contextual circumstances leading to distribution shifts (Arjovsky et al., 2019).

During training we have access to a collection of triplets D = {(xi, yi, ei)}n i=1. Each triplet contains a vector of features xi, a target label yi, and the index of the corresponding training environment ei Etr E. Each example (xi, yi) is sampled independently from a joint distribution P e(X, Y ). Using the dataset D, we learn a predictor h that maps features to labels, while minimizing the worst risk across the set of all environments E:

h = arg min h max e E Re(h), (2)

where Re(h) = E(X,Y ) P e[ℓ(h(X), Y )] is the risk of the predictor h in environment e, as measured by the expectation of the loss function ℓwith respect to the environment distribution P e.

As one example, we could train a self-driving model h to classify images xi into a label yi indicating the presence of a pedestrian. Each training example (xi, yi) is hereby collected from one of the training cities ei Etr, with its own weather conditions. The goal of Equation (2) is to obtain a predictor that correctly classifies x in new cities e Ete observed during test time. This has proved to be challenging (Lechner et al., 2022), as predictors often exhibit penurious performance in unseen weather conditions.

Domain generalization is challenging because we do not have access to test environments during training time, rendering Equation (2) challenging to estimate. Therefore, to address the DG problem in practice, researchers have proposed various algorithms that make different assumptions about the invariances shared between Etr and Ete. In broad strokes, domain generalization algorithms fall in one of the two following categories. In the first category, domain generalization algorithms based on invariance (Muandet et al., 2013; Ganin et al., 2016; Peters et al., 2016; Arjovsky et al., 2019), illustrated in Figure 1a, regularize predictors h(xe i) to not contain any information about the environment e. This however results in removing a lot of signal about the prediction task. In the second category, domain generalization algorithms based on marginal transfer (Blanchard et al., 2011; Li et al., 2016a; Zhang et al., 2020; Bao and Karaletsos, 2023) extract environment-specific information. These methods implement predictors h(xe i, ϕe i), where ϕe i = 1 i 1 Pi 1 j=1 ϕ(xe j) coarsely summarizes the environment e in terms of previously observed instances. Different choices for ϕ include kernel functions (Blanchard et al., 2011, MTL), convolutional neural networks (Zhang et al., 2020, ARM), and patch embeddings (Bao and Karaletsos, 2023, Context-Vi T). Alas, all of these alternatives in the second category dilute relevant features found in individual examples. For example, the size of the representation ϕ would have to grow linearly with the size of the training data to describe aspects corresponding to a small group of examples, such as extreme value statistics.

As a result, and despite all efforts, no proposal so far convincingly outperforms a simple empirical risk minimization baseline (Vapnik, 1998, ERM) across standard benchmarks (Gulrajani and Lopez-Paz, 2020; Gagnon-Audet et al., 2023; Yao et al., 2022). Effectively, ERM simply pools all training data together and seeks the global empirical risk minimizer:

h = arg min h

e Etr P(E = e) Re(h). (3)

Does the efficacy of ERM suggest that environmental information is useless? We argue that this is not the case. The key to our answer resides in a recently discovered emergent ability of next-token predictors, namely, in-context learning.

Published as a conference paper at ICLR 2024

paradigm training data testing data estimates

ERM x, y xe P(Y | X) IRM x, y, e xe P(Y | ϕinv(X)) LLM z zt and context zj<t P(Zt+1 | Zt, . . . , Z1) ICRM x, y, e xe t and context ce t = (xe j )j<t P(Y |X, C) P e (Y | X)

Table 1: Different learning paradigms discussed in this work, together with their training data and testing data formats, as well as the estimated predictors. In our ICRM, we amortize the current input xe and its context ce , containing previously experienced unlabeled examples from the same environment e , and zoom-in ( ) to the appropriate local risk minimizer.

3 NEXT-TOKEN PREDICTORS AND IN-CONTEXT LEARNING

In next-token prediction, we aim to learn the conditional distribution

P(Zt+1 = zt+1 | Zt = zt, . . . Z1 = z1), (4)

describing the probability of observing the token zt+1 after having observed the sequence of tokens (z1, . . . , zt). The quintessential next-token prediction task is language modeling (Bengio et al.,

2000), where the sequence of tokens represents a snippet of natural language text. Most LLMs estimate Equation (4) via a transformer zt+1 h(zt; zt 1, . . . , z1) (Vaswani et al., 2017).

LLMs exhibit a certain ability, termed in-context learning (ICL), relevant to our interests. ICL is the ability to describe and learn about a learning problem from the sequence of tokens (typically labeled (x, y) pairs) itself, called the context or prompt. As an example, trained LLM would complete the sequence France-Paris Italy-Rome Spainwith the sequence Madrid, demonstrating its ability to infer from a few input-output pairs that the user is demanding a capital prediction task. To illustrate this ability for contexts without labels, consider the two following sequences:

You are talking to a teenager. | {z } context c1

Write a poem on gravitational fields. | {z } x1 You are talking to a Physics graduate. | {z } context c2

Write a poem on gravitational fields. | {z } x2

As widely observed, LLMs answer differently to these two sequences, producing two poems, say y1 and y2, each adapted to the assumed audience. While nothing unexpected is happening here at the sequence level the model simply produces a high-likelihood continuation to each of the two prompts we observe a degree of compositional generalization, because the LLM can provide different but correct answers to the same question x1 = x2 when presented under two contexts c1 and c2. By addressing the general task of in-distribution language modeling, LLMs can attain significant out-of-distribution abilities in a multitude of specific tasks such as writing poems. ICL is reminiscent of meta-learning (Schmidhuber, 1987; Finn et al., 2017), yet it seamlessly accommodates to contexts without labels, and does not require updating the parameters of the model. ICL is also similar in spirit to test-time adaptation (Wang et al., 2020, TTA); however, TTA often requires updating model parameters with an externally hand-crafted objective.

Notably, ICL emerges without supervision. The training corpus does not contain any explicit division between questions and their context beyond the natural order of the words within each snippet of language in the training data. However, since we train the machine to produce an enormous amount of completions, some of which start with partially overlapping contexts, the predictor has the opportunity to amortize learning to a significant degree i.e. use the trained model to generalize across unseen distributions rather than explicitly optimizing a separate model for each distribution. While the machine may have never observed the context c1 = You are now speaking to a teenager, its semantic similarity to c1 above plus other similar contexts where the word speaking appears may endow OOD generalization. This is the desired ability to generalize over environments described in the previous section, which remained out of reach when using coarse domain indices.

Published as a conference paper at ICLR 2024

4 ADAPTIVE DOMAIN GENERALIZATION VIA IN-CONTEXT LEARNING

Our exposition has so far laid out two threads. First, Section 2 motivated the need for domain generalization algorithms capable of extracting relevant environment-specific features, at both the example and distributional levels. To this end, we have argued to move away from coarse environment indices, and towards rich and amortizable descriptions shared in new circumstances. Second, Section 3 suggests understanding context as an opportunity to describe environments in precisely this manner. We now knit these threads together with a protocol to address domain generalization with in-context learners.

In-Context Risk Minimization (ICRM, Figure 1c):

Collect a dataset of triplets D = {(xi, yi, ei)}n i=1 as described in Section 2. Initialize a context-based predictor ˆy = h(x; c), tasked with predicting the label y associated to the input x, as supported by the context c.

During training, select e Etr at random. Draw t examples from this environment at random, construct one input sequence (xe 1, . . . , xe t) and its associated target sequence (ye 1, . . . , ye t ). Update the context-based predictor to minimize the auto-regressive loss Pt j=1 ℓ(h(xe j; ce j), ye j), where the context is ce j = (xe 1, . . . , xe j 1), for all j = 2, . . . , t, and ce 1 = .

During test time, a sequence of inputs (xe 1 , . . . , xe t ) arrives for prediction, one by one, all from the test environment e Ete. We predict ˆye j = h(xe j , ce j ) for xe j , where the context ce j = (xe 1 , . . . , xe j 1), for all j = 2, . . . , t , and ce 1 = .

A few critical remarks about the above proposal are in order. The idea of using contextual information to aid the prediction through attention mechanisms in transformers has been used in earlier works on neural processes (Kim et al., 2019; Nguyen and Grover, 2022), prior data fitted networks (M uller et al., 2021) and more recent works that study the mechanisms underlying in-context learning (Garg et al., 2022; Aky urek et al., 2022). These works leverage labeled contextual data. Our proposal embraces the challenge of domain generalization and only uses unlabeled data from the environments both at train and test time. The most natural way to construct contexts is to use past samples that appear in the natural order in which data was collected (e.g., video). Since existing DG datasets do not provide such a refined ordering, we build contexts using environment indices that are more readily available. The proposal also requires the data at test time to be sampled from the same or slowly changing environments. While the proposal is strongly inspired from LLMs in that both pay attention to current query and the contextual information, there are differences namely we predict the label of the input and not the next x in the sequence.

Next, we develop theoretical guarantees on the behavior of ICRM. The results below concern the joint distribution of (X1, Xt), (Y1, . . . , Yt), E , where each Xj, Yj is an independent draw from environment E with distribution P E(X, Y ) . For query Xj, the context preceding it is Cj = (X1, , Xj 1) and the environment underlying this context is E. To orient ourselves around these results, we recall three predictors featured in the exposition so far. First, the global empirical risk minimizer over the pooled training data, denoted by h in Equation (3), estimates P(Y | X). Second, the environment risk minimizer estimates P(Y | X, E). Third, our in-context risk minimizer, denoted by

h = arg min h

j=1 E(Xj,Cj,Yj)[ℓ(h(Xj; Cj), Yj)]. (5)

estimates the conditional expectation E(Y | X, C). The sequel focuses on the binary cross-entropy loss ℓ. Our first result shows that, in the absence of context, ICRM zooms-out to behave conservatively.

Proposition 1 (Zoom-out). In the absence of context, ICRM behaves as the global empirical risk minimizer across the support of the training environments, i.e., h( ; ) = h ( ).

The above result is built on the insight that ICRM is Bayes optimal at all context lengths and ERM is Bayes optimal for context c = . Having established the connection between ICRM and ERM in

Published as a conference paper at ICLR 2024

the absence of any context, we now study the benefits of ICRM in the presence of sufficiently long contexts. The following result shows that, when provided with context from a training environment e Etr, our ICRM zooms-in and behaves like the appropriate environment risk minimizer, as shown in Table 1. We assume that P(Y = 1 | X = x, E = e) is parametrized by a function h (x, θe x), where θe x describes features of the environment relevant to the query x, for all e E. We also assume there exists an ideal amortization function b that takes as input the query X and context Ct preceding it both sampled from environment E and approximates θE X. Formally, the sequence of random variables b(X, Ct) indexed by t converges almost surely to the random variable θE X. Theorem 1 (Full iid zoom-in). Let h (x, θe x) describe P(Y = 1 | X = x, E = e) for all e E. Further, we assume the existence of an amortization function b(X, Ct) a.s. θE X. Then, ICRM zooms-in on the environment risk minimizer and achieves a cross-entropy loss over the training distribution

lim t H(Y | X, Ct) = H(Y | X, E).

Further, if I(Y ; E | X) > 0, ICRM has better performance than the global risk minimizer.

Theorem 1 states that ICRM converges to empirical risk minimizer of the environment under infinitely long contexts. Next, we show that ICRM can partially zoom-in on the appropriate environment risk minimizer even with contexts of length of one. Theorem 2 (Partial iid zoom-in). Suppose the joint distribution ((X1, Xt), (Y1, . . . , Yt), E) is Markov with respect to a Bayesian network. The query X and the environment E are statistically dependent and form the Markov blanket of Y . Then ICRM partially zooms-in on the environment risk minimizer, improving over the performance of the global empirical risk minimizer in terms of the cross-entropy loss. Further, the improvement is strictly monotonic in context length t.

Next, we move to the out-of-distribution setting where the test environments can be different from the training environments. To provide theory for a domain generalization result, we must place some assumptions on the data generation process. In particular, and for all e E, let z | y, e N(µy e, Σy e), and x g(z) where the latent variables z are sampled conditional on the label y and environment e from a Gaussian distribution with mean and covariance depending on (y, e), and are then mixed by a map g to generate the observations x. We summarize the environment as a parameter vector γe = (py e, µy e, Σy e)y {0,1} , where py e is the probability of label y in environment e. Our next result shows that ICL algorithms that learn h(x; c) exhibit robust behavior under distribution shifts. In contrast, such guarantees are not known for algorithms that generate predictors of the form h(x).

Define δe to be a permutation of γe that swaps its two components. We construct the Voronoi cells corresponding to the points in the union of sets {γe}e Etr and {δe}e Etr. The set of points in the Voronoi cells corresponding to the set of points {γe}e Etr define the Voronoi cells of the training environments. Next, we show that there exists an ICL algorithm, which takes the data from multiple environments as input and outputs a predictor that takes current query and context as input, whose output predictors perform well in novel test environments even those that are sufficiently far away from the training environments, so long as they are in the Voronoi cells of training environments. Theorem 3 (Full OOD zoom-in). Consider data triplets (x, y, e) generated from z N(µy e, Σy e) and x g(z), e E, where g is the identity map (see Appendix A for extension to general diffeomorphism g). There exists an ICL algorithm that in the limit of infinitely long contexts produces Bayes optimal predictions for all the test environments in the Voronoi cells of the training environments.

5 ICRM UNDER THE LENS OF INVARIANCE

Common advice in domain generalization recommends following the invariance principle to learn robust predictors (Peters et al., 2016; Arjovsky et al., 2019). One simple version of the invariance principle is to select those inputs leading to stable predictors across training environments. At first sight, one could argue that the proposed ICRM does not adhere to such a principle, as it is adapting to environment-specific information provided in the form of context. As we shall now illustrate, ICRM s implementation of ERM on the extended input-context feature space reveals invariant predictors that a vanilla implementation of ERM on the standard feature space fails to find. To see this, consider a linear least-squares regression problem mapping two dimensional inputs x = (x1, x2) into a target y under environments e E as y = α x1 + β µ2 e + ε, where µi e = E[Xi | E = e], the pair (α, β)

Published as a conference paper at ICLR 2024

are invariant regression coefficients, and ε is an independent noise term. We make one simplifying assumption for pedagogic purposes. During training, we provide ICRM training directly with the relevant extended feature space (x1, x2, µ1 e, µ2 e), instead of requiring the algorithm to learn such representation from general-form sequential context.

In this setup, ICRM learns to predict using α x1 + 0 x2 + 0 µ1 e + β µ2 e. In contrast, ERM trains a linear model on (x1, x2) and predicts using α x1 + β x2. The main point is: if β = 0 and cov(X1, X2) = 0, then α = α, and the error of ERM in a new environment grows with the variance of x1. On the other hand, ICRM estimates the true invariant coefficient α. and the resulting error is independent of variance of x1, even in the absence of context during test time. As a result, ICRM exhibits better out-of-distribution performance than ERM without any contextual information at test time. For a derivation and generalization of these claims, see Appendix A.

We believe that ICRM, and more generally ICL, provide a novel view on invariance. On the one hand, prior DG algorithms advocated to remove features as a guide to reveal invariance. On the other hand, in-context learners suggest that extending features with context affords invariance otherwise unnoticed. This needs further clarification: while the process of zooming-in to an environment risk minimizer does not provide us with an invariant predictor over the original feature space, the process of zooming-in is often an invariant mechanism over the extended feature space. These points are reminiscent of the concept of fragility in the philosophy of causation (Menzies and Beebee, 2020). Does smoking cause cancer? Not invariably across all contexts or environments. Yet, smoking does cause cancer invariably across all contexts or environments when extending the feature space as to include additional causes such as diet, genetic predispositions, and the number of smoked cigarettes. The ever-growing collection of causes approaches what Mill (1856) called the total cause, a large context sharpening invariance at the expense of constraining the diameter of the environment. In the extreme, when constraining the environment to contain only one smoker, the outcome of lung cancer disease invariably follows.

6 EXPERIMENTS

To evaluate the efficacy of ICRM, our experiments address the following questions:

1. How does ICRM fare against competitive DG algorithms, for different context sizes? 2. What is the impact of model architecture on ICRM s gains? 3. Can ICRM search for query relevant signals in the context?

In our experiments, we compare ICRM against marginal transfer methods such as Adaptive Risk Minimization (Zhang et al., 2020, ARM), and test-time adaptation proposals such as TENT (Wang et al., 2020). As a strong baseline, we also include ERM in our experimental protocol. To ensure a fair comparison across different algorithms for each dataset, we use a standardized neural network backbone (Conv Net or Res Net-50 depending on the dataset) as described in Appendix C.4. For ICRM, the same backbone is used to featurize the input, which is then processed by the decoder-only GPT2 (Radford et al., 2019). During both training and inference for ICRM, data in a sequence is sampled from the same environment. For fair comparisons, we adhere to Domain Bed s protocols for training, hyper-parameter tuning, and testing (Gulrajani and Lopez-Paz, 2020), details in Appendix C.4. We assess these methods across six image classification benchmarks, featuring diversity shift FEMNIST (Cohen et al., 2017) contains MNIST digits and handwritten letters from individual writers as environments. Rotated MNIST concerns varied rotational angles as environments. Tiny Image Net-C and CIFAR10-C (Hendrycks and Dietterich, 2019) introduce diverse image corruptions to create multiple environments. WILDS Camelyon17 (Koh et al., 2021) studies tumor detection with data from multiple hospitals as distinct environments and Imagenet-R (Hendrycks et al., 2021) contains various renditions (e.g., paintings, embroidery, etc.) of Image Net object classes as domains. More details are provided in Appendix C.3. We consider addressing correlation shifts (Ye et al., 2022) as a future work of our paper, as it involves scenarios where the test domain contains naturally occurring subpopulations or time-based shifts without clear domain separation

6.1 ADAPTATION TO DISTRIBUTION SHIFT

To study the adaptation of other approaches to distribution shifts, we report the average performance on four datasets across three independent runs of the entire sweep for test context lengths of 0, 25, 50,

Published as a conference paper at ICLR 2024

Table 2: Average/worst OOD test accuracy for different context lengths, for Adaptive Risk Minimization (ARM), Empirical Risk Minimization (ERM), Test Entropy Minimization (TENT) and our ICRM on FEMNIST, Rotated MNIST, WILDS Camelyon17 and Tiny-Image Net-C.

Data / method Average test accuracy Worst case test accuracy

FEMNIST 0 25 50 75 100 0 25 50 75 100 ARM 49.5 83.9 84.4 84.7 84.6 23.6 59.5 60.7 57.0 58.8 TENT 78.1 77.9 81.2 82.5 83.3 55.2 57.2 63.3 65.9 67.2 ERM 79.3 79.3 79.3 79.3 79.3 59.0 59.0 59.0 59.0 59.0 ICRM 78.7 87.2 87.4 87.5 87.8 59.8 69.3 70.6 70.6 70.6

Rotated MNIST 0 25 50 75 100 0 25 50 75 100 ARM 36.5 94.2 95.1 95.3 95.5 28.2 85.3 87.2 87.9 87.9 TENT 94.1 88.0 91.9 93.8 94.3 80.2 88.5 88.5 80.2 81.3 ERM 94.2 94.2 94.2 94.2 94.2 80.8 80.8 80.8 80.8 80.8 ICRM 93.6 96.1 96.2 96.2 96.2 82.5 88.5 88.5 88.8 88.8

WILDS Camelyon17 0 25 50 75 100 0 25 50 75 100 ARM 61.2 59.5 59.7 59.7 59.7

same as average accuracy TENT 67.9 81.8 87.2 89.4 89.4 ERM 68.6 68.6 68.6 68.6 68.6 ICRM 92.0 90.7 90.8 90.8 90.8

Tiny Image Net-C 0 25 50 75 100 0 25 50 75 100 ARM 30.8 31.0 31.0 31.0 31.0 8.2 8.3 8.2 8.3 8.2 TENT 31.7 1.6 1.7 2.0 2.1 9.4 1.2 1.4 1.6 1.6 ERM 31.8 31.8 31.8 31.8 31.8 9.5 9.5 9.5 9.5 9.5 ICRM 38.3 39.2 39.2 39.2 39.2 18.8 19.2 19.5 19.5 19.4

75, and 100 samples. The results for other datasets and more DG algorithms are reported in Table 7 and Table 8. As Table 2 shows, ICRM outperforms all methods across context lengths, except at null context length on MNIST datasets, where ERM exceeds by 1%. Further, our gains persist over both the worst group and average accuracy across testing environments. Figure 5 zooms into the model s performance between no-context and 25 context samples, highlighting the consistent superiority of ICRM even with small contexts. Additionally, ICRM demonstrates gains in performance even in the absence of test context. Specifically for both WILDS Camelyon17 and Tiny Image Net-C, ICRM outperforms baselines despite not leveraging any context from the test environment. We hypothesize that ICRM training still benefits from contexts as to find contextual features that ERM ignores.

6.2 UNDERSTANDING THE IMPACT OF ARCHITECTURE

To dissect the performance gains potentially arising from ICRM s transformer architecture, we explore two additional competitors. First, we train an ERM baseline, ERM+ using an identical architecture to ICRM, but without context. Second, we train an ARM baseline, ARM+, where the input and context coarse summary are sent to a GPT-2 such that the model now attends to the summary through attention layers.

Table 3: Worst group OOD test accuracies for ARM+ and ERM+ in contrast to their base algorithms, ARM and ERM across FEMNIST, Rotated MNIST, WILDS, Camelyon17, and Tiny-Image Net-C.

Dataset ARM ARM+ ERM ERM+

# Context Samples 0 100 0 100 0 100 0 100

FEMNIST 23.6 58.8 51.7 62.0 59.0 59.0 53.3 53.3 Rotated MNIST 28.2 87.9 71.4 81.1 80.8 80.8 81.9 81.9 WILDS Camelyon17 61.2 59.7 55.8 55.0 68.6 68.6 50.1 50.1 Tiny Image Net-C 8.2 8.2 1.9 1.9 9.5 9.5 8.3 8.3

Table 3 presents the performance of both ERM+ and ARM+ relative to ERM and ARM, across the four datasets. ARM+ demonstrates superior zero-shot performance over ARM on both FEMNIST and Rotated MNIST. However, ARM maintains a performance advantage over ARM+ across varying counts of in-context samples on WILDS Camelyon17 and Tiny Image Net-C, with a pronounced

Published as a conference paper at ICLR 2024

difference on the latter. Similarly, ERM either matches or outperforms ERM+ on all four datasets. Therefore, even with similar architectures, prior protocols fall short of the proposed ICRM.

0.12 0.06 0.32 0.10 0.15 0.10 0.00 0.00 0.11 0.05 Query

0.16 Beach Wagon

0.24 Academic Gown

0.12 Walking stick

0.15 Volleyball

0.00 0.00 0.00 0.11 Volleyball Query Volley ball

Query Image

Figure 2: Attention scores for random test sequences, for ICRM on FEMNIST (top) and Tiny Image Net-C (bottom), with the class label shown below. Samples following the query have an attention score of 0.0 because of the causal attention mechanism.

6.3 INVESTIGATING ATTENTION IN ICRM

As discussed in Section 2, ICRM can learn an amortization function by paying attention to the input query and context. To understand this better, we construct a random data sequence from the test environment and analyze attention scores between each example in this context and a novel input query. Figure 2 illustrates attention scores from a single head for a query image (marked in blue) for FEMNIST and Tiny Image Net-C. Note that samples following the query have an attention score of 0.0 because of the causal attention mechanism. The top row reveals that the model selectively attends to images featuring at least two curved arcs (marked in green) while paying little attention to a partial circle (highlighted in red). Similarly, in the bottom row, the model effectively discerns individuals across samples within the sequence and also indicates a semantic understanding of similarity.

7 DISCUSSION

We introduced In-Context Risk Minimization (ICRM), a framework to address domain generalization as context-based prediction. ICRM learns in-context about environmental features by paying attention to unlabeled instances as they arrive. In such a away, ICRM dynamically zooms-in on the test environment risk minimizer, enabling competitive out-of-distribution generalization.

ICRM provides a new perspective on invariance. While prior work on DG focused on information removal as a guide to generalization, ICRM suggests that extending the feature space with the relevant environment information affords further invariance. By addressing the general problem of context-based prediction in-distribution, we amortize the performance over a multitude of specific out-of-distribution tasks. More generally, by framing DG as next-token prediction, our approach can be adapted to fully exploit data in natural order (such as in video or text, ordered by time and position), more closely mimicking the human learning experience as L eon Bottou once said, Nature does not shuffle data. That said, we view extending ICRM to scenarios where the environment contains naturally occurring subpopulations or time-based shifts without clear separation as an exciting future direction. As a word of caution, we must conduct research to guarantee that in-context learners do not zoom-in on toxic spurious correlations with high predictive power in certain environments. We close with a quote from Andersen et al. (2022), for whom zooming-in

refers to a cognitive agent s ability to intelligently ignore irrelevant information and zero in on those aspects of the world that are relevant to their goals. The relevance realization framework suggests that the brain achieves this feat by attempting to balance the competing goals of remaining efficient in the current environment while also being resilient in the face of environmental perturbations.

Paralleling the examples from Andersen et al. (2022), we would like to further understand how nexttoken prediction and in-context learning serves as a powerful mechanism to amortize and dynamically navigate trade-offs such as such as efficiency-resiliency, exploration-exploitation, specializationgeneralization, and focusing-diversifying.

Published as a conference paper at ICLR 2024

ACKNOWLEDGEMENTS

SG and SJ acknowledge funding from the Office of Naval Research grant N00014-20-1-2023 (MURI ML-SCOPE) and NSF award CCF-2112665 (TILOS AI Institute). We are thankful to Martin Arjovsky, L eon Bottou, Elvis Dohmatob, Badr Youbi Idrissi, Maxime Oquab, and Ahmed Touati for their valuable feedback and help.

Kartik Ahuja and David Lopez-Paz. A closer look at in-context learning under distribution shifts. ar Xiv, 2023.

Kartik Ahuja, Karthikeyan Shanmugam, Kush Varshney, and Amit Dhurandhar. Invariant risk minimization games. In ICML, 2020.

Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Jean-Christophe Gagnon-Audet, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. Invariance principle meets information bottleneck for out-ofdistribution generalization. Neur IPS, 2021.

Ekin Aky urek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. ar Xiv preprint ar Xiv:2211.15661, 2022.

Brett P Andersen, Mark Miller, and John Vervaeke. Predictive processing and relevance realization: exploring convergent solutions to the frame problem. Phenomenology and the Cognitive Sciences, 2022.

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. Pro Publica, 2016. URL https://www.propublica.org/article/ machine-bias-risk-assessments-in-criminal-sentencing.

Martin Arjovsky, L eon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. ar Xiv, 2019.

Robert B Ash and Catherine A Dol eans-Dade. Probability and measure theory. 2000.

Yujia Bao and Theofanis Karaletsos. Contextual Vision Transformers for Robust Representation Learning. ar Xiv, 2023.

Yoshua Bengio, R ejean Ducharme, and Pascal Vincent. A neural probabilistic language model. Neur IPS, 2000.

Gilles Blanchard, Aniket Anand Deshmukh, Urun Dogan, Gyemin Lee, and Clayton Scott. Domain generalization by marginal transfer learning. JMLR, 2011.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Neur IPS, 2020.

Shiyu Chang, Yang Zhang, Mo Yu, and Tommi S Jaakkola. Invariant rationalization. In ICML, 2020.

Yining Chen, Elan Rosenfeld, Mark Sellke, Tengyu Ma, and Andrej Risteski. Iterative feature matching: Toward provable domain generalization with logarithmic environments. Neur IPS, 2022.

Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In IJCNN, 2017.

Cian Eastwood, Alexander Robey, Shashank Singh, Julius Von K ugelgen, Hamed Hassani, George J Pappas, and Bernhard Sch olkopf. Probable domain generalization via quantile risk minimization. Neur IPS, 2022.

Cian Eastwood, Shashank Singh, Andrei Liviu Nicolicioiu, Marin Vlastelica, Julius von K ugelgen, and Bernhard Sch olkopf. Spuriosity didn t kill the classifier: Using invariant predictions to harness spurious features. ar Xiv preprint ar Xiv:2307.09933, 2023.

Published as a conference paper at ICLR 2024

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.

Jean-Christophe Gagnon-Audet, Kartik Ahuja, Mohammad Javad Darvishi Bayazi, Pooneh Mousavi, Guillaume Dumas, and Irina Rish. WOODS: Benchmarks for out-of-distribution generalization in time series. TMLR, 2023.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Franc ois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.

Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583 30598, 2022.

Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International conference on machine learning, pages 1704 1713. PMLR, 2018a.

Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. ar Xiv preprint ar Xiv:1807.01622, 2018b.

Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. ar Xiv, 2020.

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. ar Xiv, 2019.

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340 8349, 2021.

Pavel Izmailov, Polina Kirichenko, Nate Gruver, and Andrew G Wilson. On feature learning in the presence of spurious correlations. Neur IPS, 2022.

Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Enforcing predictive invariance across structured biomedical domains, 2020.

Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvarinen. Variational autoencoders and nonlinear ica: A unifying framework. In AISTATS, 2020.

Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. ar Xiv preprint ar Xiv:1901.05761, 2019.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv, 2014.

Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. ar Xiv, 2022.

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In ICML, 2021.

Masanori Koyama and Shoichiro Yamaguchi. Out-of-distribution generalization with maximal invariant predictor. ar Xiv, 2020.

David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation. ar Xiv, 2020.

S ebastien Lachapelle, Pau Rodriguez, Yash Sharma, Katie E Everett, R emi Le Priol, Alexandre Lacoste, and Simon Lacoste-Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ica. In Conference on Causal Learning and Reasoning, 2022.

Published as a conference paper at ICLR 2024

Mathias Lechner, Ramin Hasani, Alexander Amini, Tsun-Hsuan Wang, Thomas A Henzinger, and Daniela Rus. Are all vision models created equal? a study of the open-loop to closed-loop causality gap. ar Xiv, 2022.

Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Domain generalization via conditional invariant representations. In Proceedings of the AAAI conference on artificial intelligence, 2018.

Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. ar Xiv, 2016a.

Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. ar Xiv preprint ar Xiv:1603.04779, 2016b.

Divyat Mahajan, Shruti Tople, and Amit Sharma. Domain generalization using causal matching. ar Xiv, 2020.

Maggie Makar, Ben Packer, Dan Moldovan, Davis Blalock, Yoni Halpern, and Alexander D Amour. Causally motivated shortcut removal using auxiliary labels. In AISTATS, 2022.

Peter Menzies and Helen Beebee. Counterfactual Theories of Causation. In The Stanford Encyclopedia of Philosophy. 2020.

John Stuart Mill. A System of Logic, Ratiocinative and Inductive, volume 1. 1856.

Krikamol Muandet, David Balduzzi, and Bernhard Sch olkopf. Domain generalization via invariant feature representation. In ICML, 2013.

Jens M uller, Robert Schmier, Lynton Ardizzone, Carsten Rother, and Ullrich K othe. Learning robust models using the principle of independent causal mechanisms. ar Xiv, 2020.

Samuel M uller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers can do bayesian inference. ar Xiv preprint ar Xiv:2112.10510, 2021.

Tung Nguyen and Aditya Grover. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. ar Xiv preprint ar Xiv:2207.04179, 2022.

Open AI. GPT-4 Technical Report. ar Xiv, 2023.

Giambattista Parascandolo, Alexander Neitz, Antonio Orvieto, Luigi Gresele, and Bernhard Sch olkopf. Learning explanations that are hard to vary. In ICLR, 2021.

Jonas Peters, Peter B uhlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2016.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. Open AI blog, 2019.

Alexandre Rame, Corentin Dancette, and Matthieu Cord. Fishr: Invariant gradient variances for out-of-distribution generalization. In ICML, 2022.

Alexander Robey, George J Pappas, and Hamed Hassani. Model-based domain generalization. ar Xiv, 2021.

Mateo Rojas-Carulla, Bernhard Sch olkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. JMLR, 2018.

Walter Rudin. Principles of mathematical analysis. 1953.

Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. ar Xiv, 2019.

Published as a conference paper at ICLR 2024

Olawale Salaudeen and Oluwasanmi Koyejo. Target conditioned representation independence (tcri); from domain-invariant to domain-general representations. ar Xiv preprint ar Xiv:2212.11342, 2022.

J urgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. Ph D thesis, Technische Universit at M unchen, 1987.

Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. Advances in neural information processing systems, 33:11539 11551, 2020.

Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, and Peng Cui. Towards out-of-distribution generalization: A survey. ar Xiv, 2021.

Yuge Shi, Jeffrey Seely, Philip HS Torr, N Siddharth, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. Gradient matching for domain generalization. ar Xiv preprint ar Xiv:2104.09937, 2021.

Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In ECCV Workshops, 2016.

Damien Teney, Ehsan Abbasnejad, and Anton van den Hengel. Unshuffling data for improved generalization. ar Xiv, 2020.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ar Xiv preprint ar Xiv:2307.09288, 2023.

Vladimir Vapnik. Statistical learning theory. Wiley, 1998.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Neur IPS, 2017.

Victor Veitch, Alexander D Amour, Steve Yadlowsky, and Jacob Eisenstein. Counterfactual invariance to spurious correlations in text classification. Neur IPS, 2021.

Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, Jo ao Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151 35174. PMLR, 2023.

Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. Neur IPS, 2021.

Yoav Wald, Gal Yona, Uri Shalit, and Yair Carmon. Malign overfitting: Interpolation can provably preclude invariance. ar Xiv preprint ar Xiv:2211.15724, 2022.

Yoav Wald, Claudia Shi, Aahlad Puli, Amir Feder, Limor Gultchin, Mark Goldstein, Maggie Makar, Victor Veitch, and Uri Shalit. Workshop on spurious correlations, invariance and stability. ICML, 2023. URL https://icml.cc/virtual/2023/workshop/21493.

Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. ar Xiv, 2020.

Haoxiang Wang, Haozhe Si, Bo Li, and Han Zhao. Provable domain generalization via invariantfeature subspace recovery. In ICML, 2022.

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. ar Xiv, 2023.

Sidney J Yakowitz and John D Spragins. On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 1968.

Shen Yan, Huan Song, Nanxiang Li, Lincan Zou, and Liu Ren. Improve unsupervised domain adaptation with mixup training. ar Xiv preprint ar Xiv:2001.00677, 2020.

Published as a conference paper at ICLR 2024

Huaxiu Yao, Caroline Choi, Bochuan Cao, Yoonho Lee, Pang Wei W Koh, and Chelsea Finn. Wild-time: A benchmark of in-the-wild distribution shift over time. Neur IPS, 2022.

Nanyang Ye, Kaican Li, Haoyue Bai, Runpeng Yu, Lanqing Hong, Fengwei Zhou, Zhenguo Li, and Jun Zhu. Ood-bench: Quantifying and understanding two dimensions of out-of-distribution generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7947 7958, 2022.

Marvin Zhang, Henrik Marklund, Nikita Dhawan, Abhishek Gupta, Sergey Levine, and Chelsea Finn. Adaptive risk minimization: Learning to adapt to domain shift. Neur IPS, 2020.

Yihua Zhang, Pranay Sharma, Parikshit Ram, Mingyi Hong, Kush Varshney, and Sijia Liu. What is missing in IRM training and evaluation? challenges and solutions. ar Xiv, 2023.

Published as a conference paper at ICLR 2024

A Theorems and Proofs 15

A.1 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

A.2 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.3 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

A.4 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

A.5 Extension of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

A.6 Comparing ICRM and ERM under the lens of invariance . . . . . . . . . . . . . . 27

A.7 Illustration of failure of existing MTL methods . . . . . . . . . . . . . . . . . . . 29

B Related work 30

C Supplementary experimental details and assets disclosure 30

C.1 Assets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

C.2 Hardware and setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

C.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

C.3.1 Federated Extended MNIST (FEMNIST) . . . . . . . . . . . . . . . . . . 30

C.3.2 Rotated MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

C.3.3 WILDS Camelyon17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

C.3.4 CIFAR10-C and Tiny Image Net-C . . . . . . . . . . . . . . . . . . . . . . 31

C.3.5 Image Net R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

C.4 Experimental protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

D Additional experiments 33

D.1 Adaptation curves of various algorithms . . . . . . . . . . . . . . . . . . . . . . . 33

D.2 Domain generalization accuracies per algorithm and dataset . . . . . . . . . . . . . 34

D.2.1 Adaptation to distribution shift . . . . . . . . . . . . . . . . . . . . . . . . 34

D.2.2 Robustness of ICRM in the absence of environment labels . . . . . . . . . 37

D.2.3 Understanding the impact of architecture . . . . . . . . . . . . . . . . . . 37

D.3 Comparison of ICRM with In-Context Learning . . . . . . . . . . . . . . . . . . . 41

D.4 Investigating the features learned by ICRM . . . . . . . . . . . . . . . . . . . . . 41

A THEOREMS AND PROOFS

A.1 PROOF OF PROPOSITION 1

Lemma 1. ICRM is Bayes optimal at all context lengths. Suppose ℓis the binary cross-entropy loss and the labels Y are binary. The optimal in-context learner h (equation 5) satisfies the following condition, i.e., for each k [t]

h(xk; ck) = E[Y |Xk = xk, Ck = ck], (6)

Published as a conference paper at ICLR 2024

for almost all (ck, xk) in the support of training distribution except over a set of a measure zero, and where the expectation is over Y conditional on [ck, xk]. In other words, the in-context learner is Bayes optimal at each context length.

Proof. In this result, we consider the problem of binary classification. Suppose h(xk; ck) is the predicted probability of class Y = 1 conditional on xk and ck. Define h(xk; ck) = h(xk; ck), 1 h(xk; ck) describing the probability of both the classes.

From equation 5, recall that the objective of ICRM is to minimize

j=1 E(Xj,Cj,Yj)[ℓ(h(Xj; Cj), Yj)]. (7)

Consider one of the terms in the sum above - E ℓ(h(Xk; Ck), Yk) . Substituting ℓas the cross-entropy in this term, we obtain

E ℓ(h(Xk; Ck), Yk) = H(Yk|Xk, Ck) + E KL P(Yk|Xk, Ck) h(Xk; Ck) .

If h(Xk; Ck) = P(Yk|Xk, Ck), then the second term in the above is zero and E ℓ(h(Xk; Ck), Yk)

equals H(Yk|Xk; Ck). Since KL divergence is always non-negative, H(Yk|Xk, Ck) corresponds to the lowest value that can be achieved by E ℓ(h(Xk; Ck), Yk) . If h(Xk; Ck) = P(Yk|Xk, Ck) for all k [t], then each of the terms in the sum in equation 7 are minimized. As a result, h(Xk; Ck) = P(Yk|Xk, Ck) for all k [t] is a solution to equation 5.

Consider another minimizer h of equation 5 and define the corresponding distribution h . For each k [t], the second term E KL(P(Yk|Xk, Ck) h (Xk; Ck) has to be zero for h to be a minimizer.

If E KL(P(Yk|Xk, Ck) h (Xk; Ck) = 0, then we claim that h (xk; ck) = P(Yk|Xk = xk, Ck = ck) for almost all (xk, ck) in the support of training distribution except over a set of measure zero. If the probability measure associated with Xk, Ck is absolutely continuous w.r.t Lebesgue measure, then this claim follows from Theorem 1.6.6 (Ash and Dol eans-Dade, 2000). If the probability measure associated with Xk, Ck is absolutely continuous w.r.t counting measure, then this claim trivially follows.

We proved the above result for classification and cross-entropy loss for measures over X, C that are either absolutely continuous w.r.t Lebesgue measure or the counting measure. It is easy to extend the above result for regressions and least square loss; see Lemma 1 in Ahuja and Lopez-Paz (2023).

Proposition 1 (Zoom-out). In the absence of context, ICRM behaves as the global empirical risk minimizer across the support of the training environments, i.e., h( ; ) = h ( ).

Proof. From Lemma 1, it follows that h(xk; ck) = E[Y |Xk = xk, Ck = ck]. The solution to empirical risk minimization is h (x) = E[Y |X1 = x], where the expectation is computed over the training distribution of Y conditional on x. When the context is empty, then we have h(x; ) = E[Y |X1 = x] = h (x) for almost all x in the support of training distribution except over a set of measure zero.

A.2 PROOF OF THEOREM 1

Before stating the proof of Theorem 1, we provide an example of an ideal amortization map b( ).

Example of ideal amortization map. Consider the example from equation ??, where y = α x1 + β µ2 e + ε. P(Y = y|X = x, E = e) = pε(y αx1 βµ2 e), where pε is the probability density of noise. Observe that P(Y = y|X = x, E = e) is parametrized in terms of µ2 e and the sequence of random variables b(X, Ct) = 1 t 1 Pt 1 j=1 X2 j converge almost surely to µ2 e, where X2 j is the second component of Xj and Ct = (X1, , Xt 1).

Published as a conference paper at ICLR 2024

For ease of exposition, we start with the case when all the concerned random variables X, Y, Ct, E, b(X, Ct), where X is the current query and Y is its label and Ct is the context preceeding it sampled from environment E, and b( ) is the ideal amortization map, are discrete-valued with a finite support. Subsequently, we study more general settings. Theorem 1 (Full iid zoom-in). Let h (x, θe x) describe P(Y = 1 | X = x, E = e) for all e E. Further, we assume the existence of an amortization function b(X, Ct) a.s. θE X. Then, ICRM zooms-in on the environment risk minimizer and achieves a cross-entropy loss over the training distribution

lim t H(Y | X, Ct) = H(Y | X, E).

Further, if I(Y ; E | X) > 0, ICRM has better performance than the global risk minimizer.

Proof. As stated above, in this proof, we work with discrete-valued X, Y, Ct, E, b(X, Ct) that also have finite support, where X is the current query and Y is its label and Ct is the context preceeding it sampled from environment E, and b( ) is the ideal amortization map. Subsequently, we study more general settings.

Since each (Xj, Yj) is sampled independently given a training environment E, we can conclude I(Y ; Ct|X, E) = 0. Therefore,

I(Y ; Ct|X, E) = 0 = H(Y |X, E) = H(Y |X, E, Ct).

Observe that for all t Z+ H(Y |X, E) = H(Y |X, E, Ct) H(Y |X, Ct) H(Y |X, b(X, Ct)), (8)

where Z+ is the set of all positive integers. The first inequality in the above follows from the fact that conditioning reduces entropy. For the second inequality, we use the following property. Consider U, V as two random variables and define W = a(V ). Observe that I(U; W|V ) = 0 = H(U|V ) = H(U|V, W) H(U|W).

Since the inequality above equation 8 holds for all t, we obtain

H(Y |X, E) lim t H(Y |X, Ct) lim t H(Y |X, b(X, Ct)). (9)

In the above, we use the following property. If an bn, n Z+ and limn an and limn bn exist, then limn an limn bn. In what follows, we will show that both the limits limt H(Y |X, Ct) and limt H(Y |X, b(X, Ct)) exist. First observe that H(Y |X, Ct+1) H(Y |X, Ct) for all t as a result the sequence is decreasing bounded below by 0 and thus from monotone convergence theorem (Rudin, 1953) limt H(Y |X, Ct) exists. Next, we will show that limt H(Y |X, b(X, Ct)) = H(Y |X, E). We will then combine it equation 9 to obtain what we intend to prove, i.e., limt H(Y |X, Ct) = H(Y |X, E).

For each X = x and E = e in the support of training distribution, we argue that b(X, Ct) a.s. θe x. Suppose this was not true. This implies that the probability that P(limt b(X, Ct) = θe x|X = x, E = e) = β > 0. Since X = x, E = e occurs with a finite probability (as X and E are discrete-valued and x, e is in the support) say α, then αβ fraction of sequences of b(X, Ct) do not converge to θe x, which contradicts the assumption that b(X, Ct) a.s. θE X.

Consider a (x, θ) from the support of (X, θE X), where X is the current query and E is the environment from which X and context preceeding it is sampled. Let us consider the distribution P(Y |X, b(X, Ct)).

P(Y = y|X = x, b(X, Ct) = θ) = P(Y = y, X = x, b(X, Ct) = θ)

P(X = x, b(X, Ct) = θ) (10)

We simplify limt P(Y |X, b(X, Ct)) below.

lim t P(Y = y|X = x, b(X, Ct) = θ) = limt P(Y = y, X = x, b(X, Ct) = θ)

limt P(X = x, b(X, Ct) = θ) (11)

Published as a conference paper at ICLR 2024

We show that the limits of the numerator and denominator exist (and non-zero for the denominator) and we simplify these separately below.

lim t P(Y = y, X = x, b(X, Ct) = θ) = lim t

e P(Y = y, X = x, E = e, b(X, Ct) = θ)

e P(Y = y|X = x, E = e) lim t P(X = x, E = e, b(X, Ct) = θ)

e P(Y = y|X = x, E = e)P(X = x, E = e) lim t P(b(X, Ct) = θ|X = x, E = e)

In the simplification above, we firstly used the fact that we can interchange sum and limits, this is true because e only takes finitely many values. In the simplification above, we also use the fact Y Ct|X, E. Since b(X, Ct) converges to θe x almost surely, the distribution limt P(b(X, Ct) = θ|X = x, E = e) takes a value one if θ = θe x and zero otherwise. As a result, the above expression becomes

lim t P(Y = y, X = x, b(X, Ct) = θ) = X

e Ex,θ P(Y = y|X = x, E = e)P(X = x, E = e).

(13) where Ex,θ is the set of all the environments observed conditional on X = x with θe x = θ. Observe that all the environments in Ex,θ have the same P(Y = 1|X = x, E = e) given by h (x, θ). We can write

lim t P(Y = 1, X = x, b(X, Ct) = θ) = h (x, θ) X

e Ex,θ P(X = x, E = e). (14)

We simplify limt P(X = x, b(X, Ct) = θ) in a similar manner to obtain

lim t P(X = x, b(X, Ct) = θ) = X

e Ex,θ P(X = x, E = e). (15)

Observe that the denominator is positive and not zero because x, θ is in support of X, θE X. We use equation 14 and equation 15 to obtain

lim t P(Y = 1|X = x, b(X, Ct) = θ) = limt P(Y = 1, X = x, b(X, Ct) = θ)

limt P(X = x, b(X, Ct) = θ)

= h (x, θ) P

e Ex,θ P(X = x, E = e) P

e Ex,θ P(X = x, E = e) = h (x, θ). (16)

Therefore, lim t P(Y = 1|X = x, b(X, Ct) = θ) = P(Y = 1|X = x, E = e). (17)

where e is any environment in Ex,θ, i.e., it is in the support of data sampled with X = x and that also satisfies θe x = θ.

lim t H(Y |X, b(X, Ct)) = X

x,θ lim t P(X = x, b(X, Ct) = θ) lim t H(Y |X = x, b(X, Ct) = θ)

e Ex,θ P(X = x, E = e) lim t H(Y |X = x, b(X, Ct) = θ)

Published as a conference paper at ICLR 2024

In the above simplification, we again swap limits and sum because the summation is over a finite set of values. From equation 17, it follows that limt H(Y |X = x, b(X, Ct) = θ) = H(Y |X = x, E = e), where e is any environment in Ex,θ. We use this in the above to get

lim t H(Y |X, b(X, Ct)) = X

e Ex,θ P(X = x, E = e) H(Y |X = x, E = e)

e Ex,θ P(X = x, E = e) H(Y |X = x, E = e)

x, e P(X = x, E = e)H(Y |X = x, E = e) = H(Y |X, E).

We combine the above with equation 9 to obtain limt H(Y |X, Ct) = H(Y |X, E). Finally, observe that if I(Y ; E|X) > 0 = H(Y |X, E) < H(Y |X). Since limt H(Y |X, Ct) = H(Y |X, E), ICRM impoves over ERM that attains a cross-entropy loss of H(Y |X).

This completes the proof.

We now extend the argument to setting beyond discrete random variables. In particular, we consider settings where X, E, b(X, Ct) can be either discrete or continuous random variables. In the notation to follow, we use d P to denote the Radon-Nikodym derivatives. For discrete random variable, the Radon-Nikodym derivatives correspond to the standard probability mass function and for continuous random variables it would correspond to standard probability density functions. We operate under some regularity assumptions. We assume that the support of E has a finite volume and the support of (X, b(X, Ct)) has a finite volume for all t. Further, we assume that the Radon-Nikodym derivative of the joint d P(X = x, E = e, b(X, Ct) = θ) is bounded above. While much of the proof that follows is same as the previous proof, we repeat the arguments for completeness. Theorem 4. Let h (x, θe x) describe d P(Y = 1 | X = x, E = e) for all e E. Further, we assume the existence of an amortization function b(X, Ct) a.s. θE X. Then, ICRM zooms-in on the environment risk minimizer and achieves a cross-entropy loss over the training distribution

lim t H(Y | X, Ct) = H(Y | X, E).

Further, if I(Y ; E | X) > 0, ICRM has better performance than the global risk minimizer.

Proof. Since each (Xj, Yj) is sampled independently given a training environment E, we can conclude I(Y ; Ct|X, E) = 0. Therefore,

I(Y ; Ct|X, E) = 0 = H(Y |X, E) = H(Y |X, E, Ct).

Observe that for all t Z+

H(Y |X, E) = H(Y |X, E, Ct) H(Y |X, Ct) H(Y |X, b(X, Ct)), (20)

where Z+ is the set of all positive integers. The first inequality in the above follows from the fact that conditioning reduces entropy. For the second inequality, we use the following property. Consider U, V as two random variables and define W = a(V ). Observe that I(U; W|V ) = 0 = H(U|V ) = H(U|V, W) H(U|W).

Since the inequality above equation 20 holds for all t, we obtain

H(Y |X, E) lim t H(Y |X, Ct) lim t H(Y |X, b(X, Ct)). (21)

In the above, we use the following property. If an bn, n Z+ and limn an and limn bn exist, then limn an limn bn. In what follows, we will show that both the limits limt H(Y |X, Ct) and limt H(Y |X, b(X, Ct)) exist. First observe that H(Y |X, Ct+1) H(Y |X, Ct) for all t as a result the sequence is decreasing bounded below by 0 and thus from Monotone convergence theorem limt H(Y |X, Ct) exists. Next, we will show that

Published as a conference paper at ICLR 2024

limt H(Y |X, b(X, Ct)) = H(Y |X, E). We will then combine it with equation 21 to obtain what we intend to prove, i.e., limt H(Y |X, Ct) = H(Y |X, E).

For each X = x and E = e in the support except over a set of probability measure zero, we argue that b(X, Ct) a.s. θe x. Suppose this was not true. Define Γ to be the set of values of x, e for

which b(X, Ct) a.s. θe x. Let P((X, E) Γ) > 0 and the probability that P(limt b(X, Ct) = θE X|(X, E) Γ) > 0. If this is true then P(limt b(X, Ct) = θe x) > 0 contradicts the fact that b(X, Ct) a.s. θE X. Therefore, P((X, E) Γ) = 0.

Consider a (x, θ) from the support of (X, θE X) except from Γ, where X is the current query and E is the environment from which X and context preceeding it is sampled. Let us consider the distribution d P(Y |X, b(X, Ct)).

d P(Y = y|X = x, b(X, Ct) = θ) = d P(Y = y, X = x, b(X, Ct) = θ)

d P(X = x, b(X, Ct) = θ) (22)

We simplify limt d P(Y = y|X = x, b(X, Ct) = θ) below.

lim t d P(Y = y|X = x, b(X, Ct) = θ) = limt d P(Y = y, X = x, b(X, Ct) = θ)

limt d P(X = x, b(X, Ct) = θ) (23)

We simplify the numerator and the denominator of the above separately.

lim t d P(Y = y, X = x, b(X, Ct) = θ) = lim t

e d P(Y = y, X = x, E = e, b(X, Ct) = θ)

e d P(Y = y|X = x, E = e) lim t d P(X = x, E = e, b(X, Ct) = θ)

e d P(Y = y|X = x, E = e)d P(X = x, E = e) lim t d P(b(X, Ct) = θ|X = x, E = e)

In the above, we use dominated convergence theorem (Ash and Dol eans-Dade, 2000) to swap limit and the integrals (to use dominated convergence theorem, we use the fact that the d P(X = x, E = e, b(X, Ct) = θ) is bounded and support of E has a finite volume). In the simplification above, we also use the fact Y Ct|X, E. Since b(X, Ct) converges to θe x almost surely, the distribution limt d P(b(X, Ct) = θ|X = x, E = e) evaluates to probability one when θ = θe x and is zero otherwise. As a result, the above expressions become

lim t d P(Y = y, X = x, b(X, Ct) = θ) = Z

e Ex,θ d P(Y = y|X = x, E = e)d P(X = x, E = e).

where Ex,θ is the set of all the environments observed conditional on X = x with θe x = θ. Observe that all the environments in Ex,θ have the same d P(Y = 1|X = x, E = e) given by h (x, θ). Similarly,

lim t d P(X = x, b(X, Ct) = θ) = Z

e Ex,θ d P(X = x, E = e). (26)

As a result, we can write

lim t d P(Y = 1, X = x, b(X, Ct) = θ) = h (x, θ) Z

e Ex,θ d P(X = x, E = e).

Published as a conference paper at ICLR 2024

We use this to obtain

lim t d P(Y = 1|X = x, b(X, Ct) = θ) = limt d P(Y = 1, X = x, b(X, Ct) = θ)

limt d P(X = x, b(X, Ct) = θ)

= h (x, θ) R

e Ex,θ d P(X = x, E = e) R

e Ex,θ d P(X = x, E = e) = h (x, θ). (27)

lim t d P(Y = y|X = x, b(X, Ct) = θ) = d P(Y = y|X = x, E = e). (28)

where e is any environment that is in the support of data sampled with X = x and that also satisfies θe x = θ.

lim t H(Y |X, b(X, Ct)) = Z

x,θ lim t d P(X = x, b(X, Ct) = θ) lim t H(Y |X = x, b(X, Ct) = θ)

e Ex,θ d P(X = x, E = e) lim t H(Y |X = x, b(X, Ct) = θ)

In the above, we use dominated convergence theorem to swap the limits and integrals (Recall that d P(X = x, E = e, b(X, Ct) = θ) is bounded say by say ς and the volume of the support of E is bounded say by φ. As a result, d P(X = x, b(X, Ct) = θ)H(Y |X = x, b(X, Ct) = θ) ςφ log(2).). From equation 28, it follows that limt H(Y |X = x, b(X, Ct) = θ) = H(Y |X = x, E = e), where e is any environment in Ex,θ. We use this in the above to get

lim t H(Y |X, b(X, Ct)) = Z

e Ex,θ d P(X = x, E = e) H(Y |X = x, E = e)

e Ex,θ d P(X = x, E = e) H(Y |X = x, E = e)

x, e d P(X = x, E = e)H(Y |X = x, E = e) = H(Y |X, E).

We combine the above with equation 21 to obtain limt H(Y |X, Ct) = H(Y |X, E). Finally, observe that if I(Y ; E|X) > 0 = H(Y |X, E) < H(Y |X). Since limt H(Y |X, Ct) = H(Y |X, E), ICRM impoves over ERM that attains a cross-entropy loss of H(Y |X).

This completes the proof.

A.3 PROOF OF THEOREM 2

Theorem 2 (Partial iid zoom-in). Suppose the joint distribution ((X1, Xt), (Y1, . . . , Yt), E) is Markov with respect to a Bayesian network. The query X and the environment E are statistically dependent and form the Markov blanket of Y . Then ICRM partially zooms-in on the environment risk minimizer, improving over the performance of the global empirical risk minimizer in terms of the cross-entropy loss. Further, the improvement is strictly monotonic in context length t.

Proof. Let us consider the setting where the context is of length one. We denote the current query as X with corresponding label Y and environment E. The example in the context is X which has corresponding label Y and it shares the same environment E. Recall that as part of the context, the learner only sees X and not Y . Both Y and E are real-valued scalars and X is a d dimensional vector.

Published as a conference paper at ICLR 2024

Following the assumption in the theorem, the distribution of ( X, Y , X, Y, E) is Markov with respect to a Bayesian network. We first establish that E cannot be a child of any variable in the directed acyclic graph (DAG). The assumption (X, Y ) ( X, Y )|E implies X X|E and Y Y |E. Suppose E is a child variable of Y . Due to the symmetry, (X, Y, E) and ( X, Y , E) follow the same distribution. As a result, E is also a child variable of Y , which implies Y Y |E (since E is a collider on the path from Y to Y ). This contradicts Y Y |E. Suppose E is a child variable of some component of X say Xi. Due to symmetry, E is also a child variable of Xi, which implies Xi Xi|E. This contradicts X X|E. Therefore, E cannot be a child of any of the variables in the DAG.

Since both X and E form the Markov blanket of Y , there are two possible cases. Either E is directly connected to Y or E is connected to Y through some element of X.

In the first case, E can only have an arrow into Y and not the other way around as E is not a child of any other node. Let us consider the setting when E is one of the parents of Y and denote it as E Y . Since X ( X) is on the Markov Blanket of Y ( Y ), we claim that each component of X is either a parent of Y or a child of Y . Suppose this was not the case. This implies that there exists a component of X say Xi, which is on the Markov Blanket as a parent of E. But that would make E a child of Y . However, E cannot be a child variable as shown above. As a result, each component of X is either a parent or a child of Y . We now consider two subcases.

Let us consider the setting when there exists a child Xi of Y . Observe that Xi is a child of Y and it has a path to E and as a result it has a path to Y . This path from elements of Xi to Y passes through E. This path has no colliders and does not contain any element of X on it (We show this case in Figure 3a). As a result, Y Xi|X. Thus I(Y ; X|X) > 0 (use chain rule of mutual information).

Let us consider the other setting when each Xi is a parent of Y (shown in Figure 3b). In this case, E has to have a path to some element of X, say Xj as otherwise E X, which contradicts the assumption that E X. Consider the path Xj to E to Y . Observe that this path is not blocked. As a result, I(Y ; X|X) > 0.

Let us consider the other possibility when Y is connected to E through X. Here the only way this is possible is if some element of X say Xi is a child of Y and E is a parent of that element (as shown in Figure 3c). Therefore, we know that Xi is connected to Y through E and Xi.

Observe that this path from Xi to Y is not blocked as Xi is a collider. Therefore, I(Y ; X|X) > 0. We showed the result so far assuming that the context length was one. Suppose that the context has k 1 examples denoted as Ck = [X1 , Xk 1]. The chain rule of mutual information tells us I(Y ; Ck|X) = I(Y ; Xk 1|X) + I(Y ; Ck 1|X, Xk 1). The proof above already demonstrates that the first term I(Y ; Xk 1|X) is strictly positive. Since mutual information is non-negative, we can conclude that I(Y ; Ck|X) > 0.

Next, we want to argue that entropy strictly reduces as context length increases. In other words,

H(Y |X, Ck) < H(Y |X, Ck 1) I(Y ; Xk|X, Ck 1) > 0.

We want to show Y Xk|(X, Ck 1). In the proof above, we had three cases shown in Figure 3. In each of these cases, we argued that the path from Xk to Y is not blocked. Even if we condition on contexts Ck 1 this continues to be the case. In the first two cases, the path from Xk to Y is direct and does not contain any element from the conditioning set. In the third case, the direct path involves a collider X from the conditioning set and thus is also not blocked. As a result, Y Xk|(X, Ck 1). This completes the proof.

Remark on the Theorem 2 It is possible to extend Theorem 2 to the case when only a subset of X and E form the Markov blanket. Observe that the analysis of Case a) and Case c) in Figure 3a, Figure 3c does not change. The analysis of Case b) is more nuanced now. In Case b), we used the fact that E is connected to X that is on the Markov blanket. This need not be the case if only a subset of X is on the Markov blanket. Suppose XMB denote the set of X that are on the Markov Blanket. If E is connected to any member of XMB, the same analysis as Case b) continues to hold. Consider the

Published as a conference paper at ICLR 2024

(a) Case 1.

(b) Case 2.

(c) Case 3.

Figure 3: Illustrating the different key cases for Theorem 2.

case when E is connected to some other member of X that is not in XMB. Denote this member as Xi. Observe that the same element Xi from X will have a direct path into Y through E that is not blocked. As a result, even in this case conditioning on X helps.

A.4 PROOF OF THEOREM 3

Theorem 3 (Full ood zoom-in) Consider data triplets (x, y, e) generated from z N(µy e, Σy e) and x g(z), for all environments e E, where g is the identity map. There exists an ICL algorithm that in the limit of infinitely long contexts produces Bayes optimal predictions for all the test environments that fall in the Voronoi cells of the training environments.

Proof. The learning algorithm works as follows. For each e, y pair in the training data, define the set of x s as De,y x . Maximize the likelihood of De,y x assuming that the underlying distribution is Gaussian. This can be stated as

ˆµy e, ˆΣy e = arg min µy e,Σy e

h x µy e 2 (Σy e) 1 i log(det(Σy e)) .

The solution to the above are standard sample mean based estimators of means and covariance. Also, use a sample mean based estimator to estimate the probability of each class in environment e and denote it as ˆpy e. Define ˆγe = [(ˆp0 e, ˆµ0 e, ˆΣ0 e), (ˆp1 e, ˆµ1 e, ˆΣ1 e)]. The model at test time works as follows.

We are given samples De x at test time from some environment e Ete. Estimate the parameters of Gaussian mixture model with two mixture components to maximize the likelihood of observing De x . We denote the estimated parameters as θe = [pe , µe , Σe , pe , µe , Σe ]. Define a permutation of θe as βe = [ pe , µe , Σe , pe , µe , Σe ].

Find the closest environment to the estimated parameters in the training set.

min{ θe ˆγe , βe ˆγe } (31)

Suppose e is the closest training environment that solves the above. If θe is closer to ˆγ e than βe , then pe , µe , Σe correspond to the label 0 and pe , µe , Σe correspond to the label 1. For the query x, the probability assigned to label 0 is

c(x) = pe e x µe 2 (Σe ) 1

pe e x µe 2 (Σe ) 1 + pe e x µe 2 ( Σe ) 1 .

If βe is closest to this environment, then pe , µe , Σe correspond to the label 1 and pe , µe , Σe is the label 0. For the query x, the probability assigned to label 0 is 1 c(x).

For the training environments, in the limit of infinitely long contexts the estimated parameters take exact values, i.e., ˆγe = γe, for all e Etr.

Published as a conference paper at ICLR 2024

Voronoi cell of training environment

Figure 4: Illustration of Voronoi cells of training environment.

For the test environment, the true set of parameters that generate the data are γe , where γe = (p0 e , µ0 e , Σ0 e ), (p1 e , µ1 e , Σ1 e ) . Define the permutation of γe as δe = (p1 e , µ1 e , Σ1 e ), (p0 e , µ0 e , Σ0 e ) .

There can be two types of test environments. One in which the mean and covariance for both classes are identical. The method above assigns a probability of 1

2 to both the classes, which is the Bayes optimal prediction. Let us consider the latter environments, where the class conditional parameters for x are not the same. In the limit of infinitely long contexts at test time, there are two possible values θe can take, either θe = γe or θe = δe . This follows from identifiability of Gaussian mixtures, Yakowitz and Spragins (1968).

Consider the first case, θe = γe . In this case, the equation 31 becomes

min{ γe γe , δe γe } .

Suppose some environment e solves the above optimization. Following the assumption in we know that γe falls in the Voronoi region of some γ e and thus γe is closer to γ e than δ e (see Figure 4). As a result, p0 e , µ0 e , Σ0 e is associated with class 0, which is actually correct and thus the final predictor would match the Bayes optimal predictor for the test environment. In the second case, θe = δe . Therefore, βe = γe and p1 e , µ1 e , Σ1 e would be correctly associated with class one thus leading to Bayes optimal predictions. This completes the argument we set out to prove.

We now briefly explain how the method fails if test parameter is outside the Voronoi cell of training parameters. Suppose θe = γe but γe is in Voronoi region of some δe. In this case, βe would be closest to γe and p0 e , µ0 e , Σ0 e would be incorrectly associated with class 1. This shows that beyond the Voronoi region the proposed algorithm fails.

A.5 EXTENSION OF THEOREM 3

In the previous theorem, we assumed that g is identity. We now describe how the result can be extended to general non-linear mixing maps g. For this result, we leverage the theoretical results from identifiable variational autoencoders (i-VAE) (Khemakhem et al., 2020).

A short review of identifiable variational autoencoders (Khemakhem et al., 2020) We are provided with observations x s that are generated from a latent variable z using an injective map g,

Published as a conference paper at ICLR 2024

where x g(z). The theory of i-VAE provides with a method and the conditions under which the underlying true latent variables z can be identified up to permutation and scaling. In i-VAEs, it is assumed that along with each sample x, we are provided with auxiliary information, which they term as u. For our results, auxiliary information is available to us in the form of the environment index and the label of the data point. In the theory of i-VAE, the distribution of the latent variables are assumed to follow a conditionally factorial exponential distribution stated as follows.

p T,λ(z|u) = Y

Qi(zi) Mi(u) exp k X

j=1 Ti,j(zi)λi,j(u) (32)

where Ti = (Ti,1, , Ti,k) are the sufficient statistics, λi(u) = (λi,1(u), , λi,k(u)) are the parameters of the distribution that vary with u, Qi is a base measure and Mi is a normalizing constant. We concatenate Ti s and λ is across d latent dimensions to make construct dk dimensional vectors denoted as λ(u) and T(z). Thus the data generation process is summarized as

z p T,λ( |u), x g(z), (33)

where g, T, λ are the parameters. We now revisit the data generation process that we consider and explain how it falls under the umbrella of the data generation processes considered in i-VAE. For all e E,

z|y, e N(µy e, Σy e), x g(z), (34)

where the latent variables z are sampled conditional on the label y and environment e from a Normal distribution whose mean and covariance depend on both y, e. Define X as the image of g, i.e., X = g(Rd). We further assume that the covariance matrix has a diagonal structure as stated below.

Assumption 1. Each Σy e is a diagonal matrix.

Since Σy e is a diagonal matrix, we denote the ith diagonal element as (σy e(i))2. Similarly, the ith component of µy e is denoted as µy e(i). Observe that the distribution of z conditional on y, e belongs to the family conditionally factorial exponential distributions studied in i-VAE (Khemakhem

et al., 2020). If we substitute Qi(zi) = 1

2π, Mi(y, e) = e (µy e(i))2/(σy e (i))2 , λi,1(y, e) = 2µy e(i) (σy e (i))2 , λi,2(y, e) = 1 (σy e (i))2 , Ti,1(z) = z and Ti,2(z) = z2, then we obtain the distribution of z described by equation 34.

Definition 1. We define an equivalence relation between sets of parameters of the model as follows.

(g, T, λ) ( g, T, λ) A, c | T(g 1(x)) = A T( g 1(x)) + c, x X. (35)

If A is invertible, then we denote the relation by A. If A is a block permutation matrix, then we denote it by P .

We now state some key results from (Khemakhem et al., 2020).

Theorem 5. Assume that the data is sampled from the data generation in equation 33 according to with parameters (g, T, λ). Assume the following holds

The mixing function g is injective

The sufficient statistics Ti,j are differentiable almost everywhere, and (Ti,j)1 j k are linearly independent on any subset of X of measure greater than zero.

There exists dk + 1 distinct points u0, , udk such that the matrix

L = (λ(u1) λ(u0), , λ(udk) λ(u0))

of size dk dk is invertible.

Published as a conference paper at ICLR 2024

then the parameters (g, T, λ) are A identifiable. Theorem 6. Assume the hypotheses of the Theorem 5 holds, and k 2. Further assume:

The sufficient statistics Ti,j are twice differentiable.

The mixing function g is C2-diffeomorphism.

then the parameters (g, T, λ) are P identifiable.

We can leverage the above two theorems (Theorem 5, Theorem 6 and Theorem 4 from Lachapelle et al. (2022)) and arrive at the following corollary for the Gaussian data generation process from equation 34. Theorem 7. If the data generation process follows equation 34, where g is a C2-diffeomorphism. Suppose there exist 2d + 1 points u0 = (y0, e0), , u2d = (y2d, e2d) in the support of (y, e) observed in training distribution such that

(λ(u1) λ(u0), , λ(u2d) λ(u0))

is invertible. If pg,T,λ( |y, e) = p g, T , λ( |y, e) for all y, e in the support of (y, e) in the training distribution, then z = ΛΠz + r, where z = g 1(x) and z = g 1(x).

Proof. We equate the probability of observations x under two models g, T, λ and g, T, λ for each y, e. Consider a z p T,λ( |y, e) and the corresponding x = g(z). These x s follow p g, T , λ( |y, e) since pg,T,λ( |y, e) = p g, T , λ( |y, e). Define z = g 1(x) and these z follow p T , λ( |y, e). We can write z = a(z), where a = g 1 g.

Observe pz(z|y, e) = p z(a(z)|y, e)det(Da(z)) and

log pz z|yk, ek = log p z(a(z)|yk, ek) + log det(Da(z)),

log pz z|y0, e0 = log p z(a(z)|y0, e0) + log det(Da(z)),

log pz z|yk, ek log pz(z|y0, e0) = log p z(a(z)|yk, ek) log pˆz(a(z)|y0, e0) .

Substituting the exponential form we obtain that

T(z) [λ(yk, ek) λ(y0, e0))] = T( z) [ λ(yk, ek) λ(y0, e0))]

If we use sufficient variability conditions, we obtain T(z) = AT( z) + c. We now use the fact that sufficient statistics T(z) = (z, z2) are minimal to conclude that

T(z) = AT( z) + c

where A is invertible. In the above, we use the line of reasoning used in in the proof of Theorem 4 in (Lachapelle et al., 2022).

After this point, we leverage Theorem 6 to conclude that

Ti(zi) = ATj( zj) + c.

We can expand the above to write

= D zi z2 i

Note that the above relationship holds for all z Z. If zj depends on z2 i , then z2 j would be a degree four polynomial in zi and it would be equated to a degree 2 polynomial zi stated in the RHS. This cannot be true for all zi in the support. As a result, zj is a scalar multiple of zi. Since for every i there is such a j, it follows that z = ΛΠz + r.

Published as a conference paper at ICLR 2024

Theorem 8. (Zoom-in [ood]) Consider the data generation process in equation 34. We make a few additional assumptions on the data generation stated below.

Each Σy e is a diagonal matrix

There exist 2d + 1 points u0 = (y0, e0), , u2d = (y2d, e2d) in the support of (y, e) observed in training distribution such that

(λ(u1) λ(u0), , λ(u2d) λ(u0))

is invertible.

g is a C2-diffeomorphism.

Under the above assumptions, we can guarantee that there exists an in-context learning algorithm that in the limit of infinitely long contexts generates Bayes optimal predictions for all the test environments that fall in Voronoi cells of training parameters weighted by a certain vector.

Proof. The training proceeds as follows. Train an autoencoder on training data under the constraint that the output of the encoder follow a Gaussian distribution with independent components conditional on each y, e. This is stated as the following minimization.

ˆg, ˆf, ˆµy e, ˆΣy e = arg min g, f,{µy e,Σy e} E[ ( g f(x) x) 2] + α X

y,e KL p z( |y, e) N(µy e, Σy e) (37)

where z = f(x), p z( |y, e) is the distribution of z. The first term is standard reconstruction loss and the second term is the KL divergence between distribution of z and a Normal distribution with independent components. Also, estimate the class probabilities for each environment and denote them as ˆpy e. Similar to the proof of Theorem 3 define ˆγe = [(ˆpy e, ˆµy e, ˆΣy e)y {0,1}]

The model at test time works as follows. We first use the trained encoder ˆf and generate z for test time inputs. After this the model operates in exactly the same way on z s as in the proof of Theorem 3. Basically the output of encoder takes place of raw x s in the procedure described in proof of Theorem 3.

The assumptions in this theorem along with following i) z follows a Gaussian distribution with independent components, ii) g( z) follows distribution of x conditional on y, e for each y, e, implies we can use the previous result in Theorem 7 to conclude that z = ΛΠz + r. Observe that z also follows a Gaussian distribution with independent components conditional on each y, e. In the limit of infinitely long contexts, ˆγe is equal to scaled means of original training environments and covariances also scaled componentwise according to the transform ΛΠ. We can now apply the previous Theorem 3 on z s as follows. If the parameters of the test environment are in the Voronoi cell of the train distribution of z s, then the procedure described above continues to generate Bayes optimal predictions in those environments.

A.6 COMPARING ICRM AND ERM UNDER THE LENS OF INVARIANCE

The label y is related to x1 and mean of x2 in environment e as follows.

y αx1 + βµ2 e + ε (38)

ERM learns a linear model on two dimensional feature vector x = (x1, x2). The closed form solution for linear regression is Λ 1ρ, where Λ = E[XX ], which is assumed to be invertible, and

ρ = E[XY ]. The covariance matrix of X is defined as Σ = σ2 1 σ12 σ12 σ2 2

Proposition 2. Let E[X1|E = e] = 0 for all e E. If Σ is invertible, β = 0, σ12 = 0, µ2 e = 0 for some e Etr, then the coefficient estimated by ERM for x1 is not the same as the invariant coefficient α.

Published as a conference paper at ICLR 2024

Proof. We compute ρ first.

ρ = E[XY ] = αE[(X1)2] + βE[µ2 EX1] αE[X1X2] + β[µ2 EX2]

= α σ2 1 σ12 + β

where δ = E[(µ2 E)2].

Next, we compute Λ.

Λ = E[XX ] = σ2 1 σ12 σ12 σ2 2 + δ

The solution to ERM is

= α (σ2 2 + δ)σ2 1 σ2 12

σ2 2 + δ σ12 σ12 σ2 1

σ2 1 σ12 + β

Simplifying the above, we obtain the coefficient for x1 to be

α = α σ12βE[(µ2 E)2] σ2 1 σ2 2 + E[(µ2 E)2] σ2 12 . (42)

Owing to the assumptions, β = 0, σ12 = 0 and µ2 e for some e we obtain that the second term in the above is not zero. As a result, the estimate computed by ERM for α is biased.

Proposition 3. Let E[X1|E = e] = 0 for all e E. If Σ is invertible, β = 0, σ12 = 0, µ2 e = 0. The error of ERM in test environment increases in σ2 1

Proof. The error of ERM is given as

E[(αX1 + βµ2 e α X1 β X2)2] + σ2 ε = (α α )2σ2 1 + β2E[(µ2 E)2] + (β )2E[(X2)2] 2ββ E[(µ2 E)2] 2(α α )βσ12 + σ2 ε, (43)

where σ2 ε is the variance of the noise variable ε. If we take the derivative of the above error w.r.t σ2 1, we obtain (α α )2, which is positive. This completes the proof.

ICRM learns a linear model on (x1, x2, µ1 e, µ2 e). We study two settings to analyze the error of ICRM at test time. If at test time, the model has seen sufficiently long contexts, then it knows the means corresponding to x1 and x2 and the model achieves the test error of σ2 ε. On the other hand, if the context is empty, then also note that the expected error of the model is β2 µ2 e 2 (assuming the model uses a default value of zero for the mean in the absence of any context), where µ2 e is the mean of x2 in environment e . Since the error of ICRM in the absence of any context is independent of variance of x1, the error of ERM can be much worse than that of ICRM in this setting as well.

Extending the above example beyond linear settings. Let us consider a more general setting.

y = u(x1, µ2 e) + ε,

x2 = v(µ2 e, ϑ), (44)

where u( ) and v( ) are maps (potentially non-linear), ε and ϑ are independent zero mean noise variables. Following the same line of thought as the above example. ICRM learns a non-linear model on (x1, x2, µe 1, µe 2) and learns E[Y |x1, x2, µ1 e, µ2 e]. From equation 44, it follows that

Y (X2, µ1 E)|(X1, µ2 E) = E[Y |x1, x2, µ1 e, µ2 e] = E[Y |x1, µ2 e] = u(x1, µ2 e).

Published as a conference paper at ICLR 2024

From the above it follows that ICRM learns u(x1, µ2 e). In comparison, consider standard ERM learns a non-linear model on (x1, x2). Consider the DAG corresponding to setting equation 44. We assume that the joint distribution described in equation 44 is Markov w.r.t to the following DAG X1 Y µ2 E X2. As a result, Y X2|X1. This follows from the fact there is a path Y to X2 through µ2 E and is not blocked by X1. From Y X2|X1 it follows that ERM learns a predictor that relies on both x1 and x2. Therefore, ICRM learns the right invariant model and does not rely on x2 and ERM relies on spurious feature x2.

A.7 ILLUSTRATION OF FAILURE OF EXISTING MTL METHODS

In this section, we provide a simple example to show the failure mode of marginal transfer learning (MTL) methods that are based on averaging 1 |c| P

xi c Φ( ) to summarize information about the environment. These methods can be summarized to take the following form:

xi c Φ(xi), x . (45)

We are only going to consider maps Φ that are differentiable.

Example. Suppose we want to learn the following function

w(x, c) = 1

xi C I(x < xi), (46)

where xi is the ith input in the context and x is the current query and I( ) is indicator function that takes a value of one if the argument inside is true and zero otherwise. We claim that if

xi c Φ(xi), x = w(x, c) for all x R, c R|c|, then the output dimension of Φ grows in

context length |c|. Suppose this was not the case. If Φ s output dimension is smaller than |c|, then Φ cannot be a differentiable bijection. As a result, there exists two contexts c and c of same length for which P

xi c Φ(xi) = P

xi c Φ(xi). We argue that there exists an x such that w(x, c) = w(x, c ).

This would lead to a contradiction as f 1 |c| P

xi c Φ(xi), x = w(x, c) for all x, c. Without loss

of generality, suppose that the smallest value of context c is smaller than that in context c . If x is larger than smallest value of c but lesser than smallest value of c , then w(x, c ) = 1 on the other hand w(x, c) 1 1

We can translate the insight from the above example into more general settings. Consider any map w(x, c), that satisfies the following property. For no two distinct contexts c and c , w(x, c) = w(x, c )

for all x R. Maps of the form f 1 |c| P

xi c Φ(xi), x can only approximate such w s provided

dimension of Φ grows in length of c.

We explain how the above example can be described by attention-based architectures with much

fewer parameters. First take the current query x and transform it through a linear map x = x 1

transform the past context values through a linear map as well to obtain a transform for xi to xi = 1 xi

. We set the Query Q and Key K matrices such that Q K = 1 0 0 1

and thus x Q K xi =

( x + xi). Instead of softmax, if we pass the output through a sigmoid, we obtain σ(τ x Q K xi) = 1 1+e τ(xi x) . If τ is sufficiently large, then this approximates I(x < xi). Therefore, one layer linear attention with sigmoid and sufficiently high τ achieves the target, i.e., P

xi c σ(τ x Q K xi) P

xi C I(x < xi).

Published as a conference paper at ICLR 2024

B RELATED WORK

A brief tour of domain generalization. Muandet et al. (2013) developed kernel methods to learn transformations such that the distance between the feature distributions across domains is minimized and the information between the features and the target labels is preserved. The pioneering work of Ganin et al. (2016) proposes a method inspired from generative adversarial networks to learn feature representations that are similar across domains. Sun and Saenko (2016) developed a method based on a natural strategy to match the means and covariances of feature representations across domains. Li et al. (2018) went a step further to enforce invariance on the distribution of representations conditional on the labels. In a parallel line of work, led by Peters et al. (2016); Rojas-Carulla et al. (2018); Arjovsky et al. (2019), the proposals sought to learn representations such that the distribution of labels conditional on the representation are invariant across domains. These works were followed by several interesting proposals to enforce invariance (Teney et al., 2020; Krueger et al., 2020; Ahuja et al., 2020; Jin et al., 2020; Chang et al., 2020; Mahajan et al., 2020; Koyama and Yamaguchi, 2020; M uller et al., 2020; Parascandolo et al., 2021; Robey et al., 2021; Wald et al., 2021; Chen et al., 2022; Wang et al., 2022; Zhang et al., 2023; Eastwood et al., 2022; Rame et al., 2022; Veitch et al., 2021; Makar et al., 2022; Wald et al., 2022; Salaudeen and Koyejo, 2022; Eastwood et al., 2023) which is an incomplete representative list. See Shen et al. (2021) for a more comprehensive survey of these works. Most of the above works have focused on learning features that enable better generalization. Recently there been an intriguing line of work from Kirichenko et al. (2022); Izmailov et al. (2022) that shifts the focus from feature learning to last layer retraining. These works show that under certain conditions (e.g., avaiability of some data that does not carry spurious correlations) one can carry out last layer retraining and achieve significant out-of-distribution performance improvements.

In the main body of the paper, we already discussed the other prominent line of work in domain generalization on marginal transfer learning, where the focus is to leverage the distributional features and learn environment specific relationships. This line of work was started by the notable work of Blanchard et al. (2011) and has been followed up by several important proposals such as Zhang et al. (2020); Bao and Karaletsos (2023).

Context-supported prediction frameworks. Existing works have exploited contextual information to develop a variety of prediction frameworks. The works on neural processes and conditional neural processes (Garnelo et al., 2018b;a) combined the uncertainty estimation capabilities of Gaussian processes with function aprpoximation capabilities of neural networks and showed promising results on meta-learning. These works were later improved through transformer based archictectures in attentive neural processes (Kim et al., 2019; Nguyen and Grover, 2022). Context-based architectures have also been used to study in-context learning in (Garg et al., 2022; Aky urek et al., 2022; Von Oswald et al., 2023). These works use labeled data in the context to enable adaptation. In contrast, our work adheres to the constraints of domain generalization and only leverages unlabeled data.

C SUPPLEMENTARY EXPERIMENTAL DETAILS AND ASSETS DISCLOSURE

We do not introduce new data in the course of this work. Instead, we use publicly available widely used image datasets for the purposes of benchmarking and comparison.

C.2 HARDWARE AND SETUP

Each experiment was performed on 8 NVIDIA Tesla V100 GPUs with 32GB accelerator RAM for a single training run. The CPUs used were Intel Xeon E5-2698 v4 processors with 20 cores and 384GB RAM. All experiments use the Py Torch deep-learning framework.

C.3 DATASETS

C.3.1 FEDERATED EXTENDED MNIST (FEMNIST)

Building on the Extended MNIST (EMNIST) dataset, which includes images of handwritten uppercase and lowercase alphabets along with digits, FEMNIST (Zhang et al., 2020) enriches

Published as a conference paper at ICLR 2024

this data by attributing each data point to its originating writer. This extension associates each 28 28-sized image in the dataset to one of the 62 classes. In our setup, each writer serves as a distinct environment. We evaluate the performance of each method based on both worst-case and average accuracy across a set of 35 test users, who are distinct from the 262 training users and 50 validation users. Unlabelled data from an environment in this dataset could provide cues about the writing style of the user and disambiguate data points.

C.3.2 ROTATED MNIST

We employ a customized version of the MNIST dataset as in Zhang et al. (2020). The dataset contains images rotated in increments of 10 degrees, ranging from 0 to 130 degrees. Each degree of rotation constitutes a separate environment, effectively acting as a distinct value. The training set for the two most extreme rotations, 120 and 130 degrees, contains only 108 data points each. For rotations between 90 and 110 degrees, each environment includes 324 data points. The total training set comprises 32,292 points. For evaluation, test images are generated from the MNIST test set, and are duplicated for each environment. Performance metrics include both worst-case and average accuracy across these testing domains. Analogous to FEMNIST, unlabeled samples from an environment within this dataset can assist in distinguishing images that may seem similar due to their rotated orientations.

C.3.3 WILDS CAMELYON17

We use the Camelyon17 dataset, part of the WILDS benchmark (Koh et al., 2021), which features image patches derived from whole-slide lymph node sections of patients with potential metastatic breast cancer. Each patch is labeled to indicate the presence or absence of a tumor. In our experimental design, each participating hospital is treated as a distinct environment. The dataset is partitioned in alignment with the official WILDS configuration: three hospitals contribute to the training set, a fourth is designated for validation, and the remaining hospital s data is used for testing.

C.3.4 CIFAR10-C AND TINY IMAGENET-C

Adapting the methodology from Hendrycks and Dietterich (2019), we introduce 56 distinct distortions to the training set, treating each as a separate environment. For evaluation, we use a non-overlapping set of 22 test distortions, largely differing in nature from those used in training. For Tiny Image Net-C, each 64 64-sized distorted image is associated with one of the 200 classes in the dataset. The same set of corruptions are employed to augment the CIFAR10 dataset, resulting in 32 32-sized images for the CIFAR10-C dataset.This setup permits an investigation into whether exposure to distortions during training equips the model to better manage novel distortions during testing. We assess performance through both worst-case and average accuracies across these test distortions.

C.3.5 IMAGENET R

Image Net-R comprises a diverse range of artistic and creative content, encompassing art, cartoons, deviant art, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and video game interpretations of Image Net classes. This dataset consists of renditions for 200 Image Net classes, totaling 30,000 images. For our experiments, we utilize images from categories of cartoons, paintings, stickers, graphics, sculptures, sketches, tattoos, toys, and video games for training. Validation is conducted using images from embroidery, miscellaneous, and graffiti categories, while the test environments incorporate images from art, deviant art, and origami categories. This segregation introduces extreme real-world distribution shifts, where the data features during testing differ significantly from those observed during training.

Published as a conference paper at ICLR 2024

C.4 EXPERIMENTAL PROTOCOLS

To ensure a fair comparison across different algorithms for each dataset, we use a standardized neural network backbone. The details for these architectures are provided in Table 4 and Table 5. We use the Conv Net architecture as outlined in Zhang et al. (2020).

For ICRM, the same backbone is used to featurize the input, which is then processed by the decoderonly Transformer (Vaswani et al., 2017) architecture from the GPT-2 Transformer family (Radford et al., 2019). Our model is standardized to have 12 layers, 4 attention heads, and a 128-dimensional embedding space across all datasets. Linear layers are employed to map both the input sequence to the transformer s latent embedding and the model s predicted output vector to the output label. For training ICRM on larger datasets like Image Net R, CIFAR10-C, WILDS Camelyon17 and Tiny Image Net-C, we start with a Res Net50 model pre-trained on Image Net (as shown in Table 4) and freeze all batch normalization layers before fine-tuning.

We adopt the same Context Network as used in ARM, specifically retaining their choice of output channels one for smaller datasets like FEMNIST and Rotated MNIST, and three for the others.

For TENT, all reported metrics are based on its episodic version, where the model is reset to its trained state after processing each batch. This ensures a fair comparison with other methods. Additionally, during testing, the model s parameters are updated for 10 steps using stochastic gradient descent by minimization test entropy across all datasets.

Table 4: Network architectures for each dataset.

Dataset Architecture

ICRM Others

FEMNIST Conv Net + GPT2 Transformer Conv Net Rotated MNIST

CIFAR10-C Camelyon17 Res Net-50 + GPT2 Transformer Res Net-50 Tiny Image Net-C Image Net R

Table 5: Conv Net architecture for (Zhang et al., 2020). We use 2 2 kernels and same padding.

1 Conv2D (in=d, out=128) 2 Batch Norm2d (dim=129) 3 Re LU 4 Max Pooling (2) 5 Conv2D (in=128, out=128) 6 Batch Norm2d (dim=128) 7 Re LU 8 Max Pooling (2) 9 Global average-pooling

We list all hyperparameters, their default settings, and search boundaries for random sweeps in Table 6. The maximum context length, or support, is fixed at 100 for all algorithms. All models are optimized using the Adam optimizer (Kingma and Ba, 2014). To ensure a fair comparison, we perform a random search of 5 trials across the hyperparameter range (refer to Table 6) for each algorithm. The model with the highest validation set accuracy is selected for each run. We then report the average of this number across three independent runs of the entire sweep, and its corresponding standard error.

Table 6: Hyperparameters, their default values and distributions for random search.

Condition Parameter Default value Random distribution

Res Net learning rate 0.0001 10Uniform( 5, 3.5)

weight decay 0 10Uniform( 6, 2)

not Res Net learning rate 0.0001 10Uniform( 4.5, 2.5)

weight decay 0 10Uniform( 6, 2)

Published as a conference paper at ICLR 2024

D ADDITIONAL EXPERIMENTS

D.1 ADAPTATION CURVES OF VARIOUS ALGORITHMS

0 10 20 30 40 50 # In-context examples

Worst Group Test Accuracy

0 10 20 30 40 50 # In-context examples

Average Test Accuracy

ARM ERM ICRM TENT

0 10 20 30 40 50 # In-context examples

Worst Group Test Accuracy

0 10 20 30 40 50 # In-context examples

Average Test Accuracy

ARM ERM ICRM TENT

0 10 20 30 40 50 # In-context examples

Worst Group Test Accuracy

0 10 20 30 40 50 # In-context examples

Average Test Accuracy

ARM ERM ICRM TENT

0 10 20 30 40 50 # In-context examples

Worst Group Test Accuracy

0 10 20 30 40 50 # In-context examples

Average Test Accuracy

ARM ERM ICRM TENT

Figure 5: Accuracy adaptation curves for worst accuracy (left) and average accuracy (right) across the test environment as a function of increasing count of context samples. Showing results in order for FEMNIST(top), Rotated MNIST, WILDS Camelyon17 and Tiny Image Net-C(bottom). The average and worst-case accuracy plots for WILDS Camelyon17 are identical since the dataset contains only a single test environment.

Published as a conference paper at ICLR 2024

D.2 DOMAIN GENERALIZATION ACCURACIES PER ALGORITHM AND DATASET

D.2.1 ADAPTATION TO DISTRIBUTION SHIFT

In our experiments, we compare ICRM against marginal transfer methods such as Adaptive Risk Minimization (Zhang et al., 2020, ARM), test-time adaptation proposals such as TENT (Wang et al., 2020) and Empirical Risk Minimization (Vapnik, 1998, ERM). We also include comparisons with six additional baselines, including approaches that follow the invariance-based paradigm like Fish (Shi et al., 2021) and IB-IRM (Ahuja et al., 2021), alongside those that use contextual information differently, such as BN Adapt (Schneider et al., 2020; Li et al., 2016b) and Bayesian BN Adapt (Schneider et al., 2020). BN Adapt replaces the global batch normalization statistics learned during training with test batch statistics at inference. On the other hand, Bayesian BN Adapt assumes the global statistics of the training data as a prior and linearly interpolates between these statistics and the test batch statistics during inference. Additionally, we evaluate methods that employ classic regularization techniques such as Mixup (Yan et al., 2020) and IB-ERM (Ahuja et al., 2021). Table 2 shows the average performance attained by these methods across four benchmark datasets. Further, Table 7 and Table 8 demonstrate the average and worst group out-of-distribution performance, respectively, accompanied by the corresponding standard errors. These statistics are computed across three independent runs of the entire sweep, wherein the model selected for evaluation is the one with hyper-parameters yielding the highest validation accuracy.

Published as a conference paper at ICLR 2024

Table 7: Average out-of-distribution test accuracies along with their corresponding standard errors for various counts of context samples. The methods compared include Adaptive Risk Minimization (ARM), Empirical Risk Minimization (ERM), Test Entropy Minimization (TENT), Batch Norm Adaptation (BN Adapt), Bayesian Batch Norm Adaptation (Bayesian BN Adapt), Fish, IB-ERM, IB-IRM, Mixup and our method ICRM on FEMNIST, Rotated MNIST, WILDS Camelyon17, Tiny-Image Net-C, Image Net R and CIFAR10-C.

Dataset / algorithm Average test accuracy (by # in-context examples)

FEMNIST 0 25 50 75 100 ARM 49.5 1.0 83.9 0.5 84.4 0.5 84.7 0.6 84.6 0.3 TENT 78.1 1.2 77.9 1.2 81.2 0.9 82.5 0.9 83.3 0.8 BN Adapt 78.3 0.6 76.9 1.4 80.3 0.9 81.5 0.7 82.4 0.6 Bayesian BN Adapt 78.3 1.8 79.6 1.0 81.3 0.6 82.2 0.7 82.9 0.8 Fish 77.2 0.6 77.2 0.6 77.2 0.6 77.2 0.6 77.2 0.6 IB-ERM 79.0 1.5 79.0 1.5 79.0 1.5 79.0 1.5 79.0 1.5 IB-IRM 79.0 0.4 79.0 0.4 79.0 0.4 79.0 0.4 79.0 0.4 Mixup 78.6 0.9 78.6 0.9 78.6 0.9 78.6 0.9 78.6 0.9 ERM 79.3 0.4 79.3 0.4 79.3 0.4 79.3 0.4 79.3 0.4 ICRM 78.7 0.5 87.2 0.4 87.4 0.5 87.5 0.2 87.8 0.2

Rotated MNIST 0 25 50 75 100 ARM 36.5 5.2 94.2 0.7 95.1 0.4 95.3 0.4 95.5 0.3 TENT 94.1 0.3 88.0 0.4 91.9 0.3 93.8 0.2 94.3 0.2 BN Adapt 94.6 0.8 87.0 2.3 91.5 1.5 93.7 1.2 94.3 1.0 Bayesian BN Adapt 94.6 1.0 91.2 1.6 93.4 1.2 94.3 1.0 94.7 1.0 Fish 94.8 0.4 94.8 0.4 94.8 0.4 94.8 0.4 94.8 0.4 IB-ERM 92.2 0.5 92.2 0.5 92.2 0.5 92.2 0.5 92.2 0.5 IB-IRM 91.0 1.1 91.0 1.1 91.0 1.1 91.0 1.1 91.0 1.1 Mixup 93.6 0.0 93.6 0.0 93.6 0.0 93.6 0.0 93.6 0.0 ERM 94.2 0.3 94.2 0.3 94.2 0.3 94.2 0.3 94.2 0.3 ICRM 93.6 0.2 96.1 0.1 96.2 0.1 96.2 0.1 96.2 0.1

WILDS Camelyon17 0 25 50 75 100 ARM 61.2 5.2 59.5 4.2 59.7 4.2 59.7 4.3 59.7 4.2 TENT 67.9 7.6 81.8 1.1 87.2 1.1 89.4 1.1 89.4 1.0 BN Adapt 67.5 5.9 82.0 0.3 87.4 0.3 89.7 0.3 89.9 0.3 Bayesian BN Adapt 67.5 6.1 82.0 0.3 87.3 0.3 89.6 0.2 89.7 0.3 Fish 53.9 2.8 53.9 2.8 53.9 2.8 53.9 2.8 53.9 2.8 IB-ERM 51.8 1.0 51.8 1.0 51.8 1.0 51.8 1.0 51.8 1.0 IB-IRM 53.9 1.3 53.9 1.3 53.9 1.3 53.9 1.3 53.9 1.3 Mixup 62.8 5.7 62.8 5.7 62.8 5.7 62.8 5.7 62.8 5.7 ERM 68.6 7.8 68.6 7.8 68.6 7.8 68.6 7.8 68.6 7.8 ICRM 92.0 0.6 90.7 0.8 90.8 0.8 90.8 0.8 90.8 0.8

Tiny Image Net-C 0 25 50 75 100 ARM 30.8 0.2 31.0 0.2 31.0 0.2 31.0 0.2 31.0 0.2 TENT 31.7 0.5 1.6 0.1 1.7 0.1 2.0 0.1 2.1 0.1 BN Adapt 31.7 0.7 1.7 0.1 1.7 0.1 1.9 0.1 2.1 0.1 Bayesian BN Adapt 31.7 0.8 2.2 0.1 2.1 0.1 2.3 0.1 2.4 0.1 Fish 33.7 0.8 33.7 0.8 33.7 0.8 33.7 0.8 33.7 0.8 IB-ERM 35.5 0.4 35.5 0.4 35.5 0.4 35.5 0.4 35.5 0.4 IB-IRM 35.4 0.3 35.4 0.3 35.4 0.3 35.4 0.3 35.4 0.3 Mixup 35.5 0.3 35.5 0.3 35.5 0.3 35.5 0.3 35.5 0.3 ERM 31.8 0.6 31.8 0.6 31.8 0.6 31.8 0.6 31.8 0.6 ICRM 38.3 0.1 39.2 0.3 39.2 0.3 39.2 0.3 39.2 0.3

Image Net R 0 25 50 75 100 ARM 56.3 0.8 58.1 0.3 58.8 0.8 59.8 0.8 59.0 0.3 TENT 58.9 0.5 10.1 0.2 10.7 0.1 12.1 0.2 13.0 0.1 BN Adapt 58.9 0.5 9.9 0.1 10.9 0.1 12.2 0.2 13.1 0.1 Bayesian BN Adapt 58.9 0.5 11.9 0.1 12.3 0.1 13.9 0.3 14.6 0.1 Fish 58.6 1.0 58.6 1.0 58.6 1.0 58.6 1.0 58.6 1.0 IB-ERM 58.5 0.5 58.5 0.5 58.5 0.5 58.5 0.5 58.5 0.5 IB-IRM 57.8 0.0 57.8 0.0 57.8 0.0 57.8 0.0 57.8 0.0 Mixup 58.8 0.8 58.8 0.8 58.8 0.8 58.8 0.8 58.8 0.8 ERM 58.9 0.5 58.9 0.5 58.9 0.5 58.9 0.5 58.9 0.5 ICRM 57.4 0.4 59.7 0.4 59.6 0.6 59.4 0.4 60.5 0.3

CIFAR10-C 0 25 50 75 100 ARM 65.9 1.3 66.0 1.3 66.0 1.3 66.0 1.3 66.0 1.3 TENT 66.1 1.6 63.9 2.1 68.4 2.1 69.9 2.0 70.5 2.0 BN Adapt 66.1 1.6 63.9 2.1 68.4 2.1 69.9 2.0 70.1 2.0 Bayesian BN Adapt 66.1 1.6 65.1 2.1 68.9 2.0 69.8 2.0 70.0 2.0 Fish 72.3 1.0 72.3 1.0 72.3 1.0 72.3 1.0 72.3 1.0 IB-ERM 65.2 2.9 65.2 2.9 65.2 2.9 65.2 2.9 65.2 2.9 IB-IRM 64.3 2.6 64.3 2.6 64.3 2.6 64.3 2.6 64.3 2.6 Mixup 72.8 0.4 72.8 0.4 72.8 0.4 72.8 0.4 72.8 0.4 ERM 66.1 1.6 66.1 1.6 66.1 1.6 66.1 1.6 66.1 1.6 ICRM 70.6 0.2 71.0 0.2 71.0 0.2 71.0 0.2 71.0 0.3

Published as a conference paper at ICLR 2024

Table 8: Worst environment out-of-distribution test accuracies along with their corresponding standard errors for various counts of context samples. The methods compared include Adaptive Risk Minimization (ARM), Empirical Risk Minimization (ERM), Test Entropy Minimization (TENT), Batch Norm Adaptation (BN Adapt), Bayesian Batch Norm Adaptation (Bayesian BN Adapt), Fish, IB-ERM, IB-IRM, Mixup and our method ICRM on FEMNIST, Rotated MNIST, WILDS Camelyon17, Tiny-Image Net-C, Image Net R and CIFAR10-C.

Dataset / algorithm Worst case test accuracy (by # in-context examples)

FEMNIST 0 25 50 75 100 ARM 23.6 1.7 59.5 3.5 60.7 3.8 57.0 7.3 58.8 4.0 TENT 55.2 2.5 57.2 2.2 63.3 0.4 65.9 0.6 67.2 1.0 BN Adapt 52.7 6.2 56.2 2.5 61.9 0.1 64.7 2.5 65.3 0.9 Bayesian BN Adapt 54.3 2.6 60.4 1.2 64.7 0.9 65.5 2.2 66.3 1.2 Fish 52.8 1.2 52.8 1.2 52.8 1.2 52.8 1.2 52.8 1.2 IB-ERM 58.6 3.4 58.6 3.4 58.6 3.4 58.6 3.4 58.6 3.4 IB-IRM 57.3 2.6 57.3 2.6 57.3 2.6 57.3 2.6 57.3 2.6 Mixup 57.0 1.9 57.0 1.9 57.0 1.9 57.0 1.9 57.0 1.9 ERM 59.0 0.2 59.0 0.2 59.0 0.2 59.0 0.2 59.0 0.2 ICRM 59.8 0.7 69.3 0.0 70.6 2.3 70.6 1.5 70.6 0.7

Rotated MNIST 0 25 50 75 100 ARM 28.2 2.1 85.3 1.6 87.2 1.0 87.9 1.0 87.9 0.9 TENT 80.2 1.3 88.5 0.8 88.5 0.9 80.2 1.0 81.3 1.0 BN Adapt 80.5 2.4 70.9 2.8 76.9 2.5 79.8 2.7 80.9 2.3 Bayesian BN Adapt 80.5 2.9 75.4 3.1 79.2 2.6 80.7 2.8 81.3 2.5 Fish 83.2 1.9 83.2 1.9 83.2 1.9 83.2 1.9 83.2 1.9 IB-ERM 72.0 0.9 72.0 0.9 72.0 0.9 72.0 0.9 72.0 0.9 IB-IRM 69.9 3.4 69.9 3.4 69.9 3.4 69.9 3.4 69.9 3.4 Mixup 81.2 0.7 81.2 0.7 81.2 0.7 81.2 0.7 81.2 0.7 ERM 80.8 1.1 80.8 1.1 80.8 1.1 80.8 1.1 80.8 1.1 ICRM 82.5 0.5 88.5 0.5 88.5 0.5 88.8 0.5 88.8 0.4

WILDS Camelyon17 0 25 50 75 100 ARM 61.2 5.2 59.5 4.2 59.7 4.2 59.7 4.3 59.7 4.2 TENT 67.9 7.6 81.8 1.1 87.2 1.1 89.4 1.1 89.4 1.0 BN Adapt 67.5 5.9 82.0 0.3 87.4 0.3 89.7 0.3 89.9 0.3 Bayesian BN Adapt 67.5 6.1 82.0 0.3 87.3 0.3 89.6 0.2 89.7 0.3 Fish 53.9 2.8 53.9 2.8 53.9 2.8 53.9 2.8 53.9 2.8 IB-ERM 51.8 1.0 51.8 1.0 51.8 1.0 51.8 1.0 51.8 1.0 IB-IRM 53.9 1.3 53.9 1.3 53.9 1.3 53.9 1.3 53.9 1.3 Mixup 62.8 5.7 62.8 5.7 62.8 5.7 62.8 5.7 62.8 5.7 ERM 68.6 7.8 68.6 7.8 68.6 7.8 68.6 7.8 68.6 7.8 ICRM 92.0 0.6 90.7 0.8 90.8 0.8 90.8 0.8 90.8 0.8

Tiny Image Net-C 0 25 50 75 100 ARM 8.2 0.3 8.3 0.3 8.2 0.3 8.3 0.3 8.2 0.3 TENT 1.2 0.4 1.4 0.0 1.6 0.1 1.6 0.0 1.6 0.0 BN Adapt 9.4 0.7 1.3 0.0 1.4 0.0 1.6 0.0 1.7 0.0 Bayesian BN Adapt 9.4 0.7 1.6 0.2 1.6 0.1 1.8 0.0 1.8 0.0 Fish 11.1 0.1 11.1 0.1 11.1 0.1 11.1 0.1 11.1 0.1 IB-ERM 15.8 0.6 15.8 0.6 15.8 0.6 15.8 0.6 15.8 0.6 IB-IRM 15.9 0.4 15.9 0.4 15.9 0.4 15.9 0.4 15.9 0.4 Mixup 11.3 0.5 11.3 0.5 11.3 0.5 11.3 0.5 11.3 0.5 ERM 9.5 0.4 9.5 0.4 9.5 0.4 9.5 0.4 9.5 0.4 ICRM 18.8 0.2 19.2 0.1 19.5 0.2 19.5 0.1 19.4 0.2

Image Net R 0 25 50 75 100 ARM 47.4 1.1 45.3 0.4 47.2 1.9 49.8 1.2 47.4 1.0 TENT 48.0 1.0 8.6 0.1 8.4 0.1 8.9 0.1 9.1 0.1 BN Adapt 48.0 1.0 8.5 0.1 8.5 0.1 8.9 0.1 9.1 0.0 Bayesian BN Adapt 48.0 1.0 10.5 0.2 10.3 0.2 10.7 0.2 10.9 0.2 Fish 46.0 2.1 46.0 2.1 46.0 2.1 46.0 2.1 46.0 2.1 IB-ERM 47.2 1.3 47.2 1.3 47.2 1.3 47.2 1.3 47.2 1.3 IB-IRM 47.2 0.4 47.2 0.4 47.2 0.4 47.2 0.4 47.2 0.4 Mixup 47.9 2.1 47.9 2.1 47.9 2.1 47.9 2.1 47.9 2.1 ERM 48.0 1.0 48.0 1.0 48.0 1.0 48.0 1.0 48.0 1.0 ICRM 45.4 0.7 48.0 0.2 47.2 0.8 46.9 0.4 50.6 1.3

CIFAR10-C 0 25 50 75 100 ARM 39.3 1.7 39.3 1.7 39.4 1.7 39.3 1.7 39.4 1.7 TENT 39.8 2.5 45.4 2.1 48.9 2.1 49.7 2.0 52.6 2.0 BN Adapt 39.8 2.5 43.8 2.1 45.1 2.0 47.8 2.0 48.6 2.0 Bayesian BN Adapt 39.8 2.5 44.5 2.0 46.8 2.0 49.6 2.0 51.0 2.1 Fish 49.9 1.5 49.9 1.5 49.9 1.5 49.9 1.5 49.9 1.5 IB-ERM 44.9 3.4 44.9 3.4 44.9 3.4 44.9 3.4 44.9 3.4 IB-IRM 43.3 2.3 43.3 2.3 43.3 2.3 43.3 2.3 43.3 2.3 Mixup 53.9 2.4 53.9 2.4 53.9 2.4 53.9 2.4 53.9 2.4 ERM 39.8 2.5 39.8 2.5 39.8 2.5 39.8 2.5 39.8 2.5 ICRM 54.6 0.4 56.0 0.5 55.8 0.5 55.8 0.5 55.9 0.5

Published as a conference paper at ICLR 2024

D.2.2 ROBUSTNESS OF ICRM IN THE ABSENCE OF ENVIRONMENT LABELS

As outlined in Section 4, the training regimen of ICRM assumes a dataset D = {(xi, yi, ei)}n i=1 collected under multiple training environments ei Etr. However, in scenarios lacking such domain separation during training, does ICRM continue to show an edge over ERM baselines? To study this question, we modify the sampling strategy: rather than constructing context vectors containing examples from one environment, we construct context vectors containing iid samples from all of the environments pooled together. To continue to test for out-of-distribution generalization, however, we evaluate the performance on examples from a novel test environment. We term this modified approach ICRM-Mix.

Table 9 and Table 10 contrasts the performance of ICRM with ICRM-Mix. ICRM consistently outperforms ICRM-Mix across varying counts of in-context samples on both FEMNIST and Rotated MNIST. Surprisingly, ICRM-Mix and ICRM perform similarly on WILDS Camelyon17 and Tiny Image Net-C. Consider a setting where the model benefits the most attending to examples from the same class or related classes. If classes are distributed uniformly across domains, then ICRM and ICRM-mix are bound to perform similarly. Consider another setting where the model benefits the most by attending to environment-specific examples such as characters drawn by the same user. In such a case, ICRM and ICRM-mix have very different performances.

Table 9: Average out-of-distribution test accuracies along with their corresponding standard errors for ICRM and ICRM-Mix across FEMNIST, Rotated MNIST, WILDS Camelyon17 and Tiny-Image Net C. ICRM-Mix trains on sequences with samples drawn i.i.d. from the unified dataset comprising various environments.

Dataset / algorithm Average test accuracy (by # in-context examples)

FEMNIST 0 25 50 75 100 ICRM 78.7 0.5 87.2 0.4 87.4 0.5 87.5 0.2 87.8 0.2 ICRM-Mix 77.6 0.8 81.1 0.2 81.1 0.2 80.9 0.3 80.9 0.1

Rotated MNIST 0 25 50 75 100 ICRM 93.6 0.2 96.1 0.1 96.2 0.1 96.2 0.1 96.2 0.1 ICRM-Mix 88.9 1.4 92.6 0.3 92.7 0.2 92.6 0.3 92.7 0.2

WILDS Camelyon17 0 25 50 75 100 ICRM 92.0 0.6 90.7 0.8 90.8 0.8 90.8 0.8 90.8 0.8 ICRM-Mix 92.9 0.3 90.7 0.6 90.8 0.5 90.7 0.5 90.7 0.5

Tiny Image Net-C 0 25 50 75 100 ICRM 38.3 0.1 39.2 0.3 39.2 0.3 39.2 0.3 39.2 0.3 ICRM-Mix 38.4 0.2 39.3 0.2 39.3 0.2 39.3 0.2 39.3 0.2

Imagenet R 0 25 50 75 100 ICRM 57.4 0.4 59.7 0.4 59.6 0.6 59.4 0.4 60.5 0.3 ICRM-Mix 54.9 1.0 54.9 1.0 57.8 1.0 57.8 1.0 57.8 1.0

CIFAR10-C 0 25 50 75 100 ICRM 70.6 0.2 71.0 0.2 71.0 0.2 71.0 0.2 71.0 0.3 ICRM-Mix 69.2 0.2 69.4 0.3 69.4 0.3 69.4 0.3 69.4 0.3

D.2.3 UNDERSTANDING THE IMPACT OF ARCHITECTURE

Table 3 presents the average performance of both ERM+ and ARM+ relative to ERM and ARM, across the four datasets. Further, Table 11 and Table 12 demonstrate the average and worst group out-of-distribution performance of these approaches, respectively, along with the corresponding standard errors. These statistics are computed across three independent runs of the entire sweep, wherein the model selected for evaluation is the one with hyper-parameters yielding the highest validation accuracy.

Published as a conference paper at ICLR 2024

Table 10: Worst environment out-of-distribution test accuracies along with their corresponding standard errors for ICRM and ICRM-Mix across FEMNIST, Rotated MNIST, WILDS Camelyon17 and Tiny-Image Net-C. ICRM-Mix trains on sequences with samples drawn i.i.d. from the unified dataset comprising various environments.

Dataset / algorithm Worst case test accuracy (by # in-context examples)

FEMNIST 0 25 50 75 100 ICRM 59.8 0.7 69.3 0.0 70.6 2.3 70.6 1.5 70.6 0.7 ICRM-Mix 57.5 1.4 62.7 1.1 65.0 0.3 64.1 1.5 62.9 2.3

Rotated MNIST 0 25 50 75 100 ICRM 82.5 0.5 88.5 0.5 88.5 0.5 88.8 0.5 88.8 0.4 ICRM-Mix 68.8 3.8 77.1 0.7 76.8 0.9 76.4 0.9 76.6 0.9

WILDS Camelyon17 0 25 50 75 100 ICRM 92.0 0.6 90.7 0.8 90.8 0.8 90.8 0.8 90.8 0.8 ICRM-Mix 92.9 0.3 90.7 0.6 90.8 0.5 90.7 0.5 90.7 0.5

Tiny Image Net-C 0 25 50 75 100 ICRM 18.8 0.2 19.2 0.1 19.5 0.2 19.5 0.1 19.4 0.2 ICRM-Mix 18.7 0.2 19.2 0.2 19.4 0.1 19.5 0.1 19.4 0.1

Imagenet R 0 25 50 75 100 ICRM 45.4 0.7 48.0 0.2 47.2 0.8 46.9 0.4 50.6 1.3 ICRM-Mix 44.4 1.7 46.9 0.4 48.1 1.6 46.6 0.6 48.7 1.0

CIFAR10-C 0 25 50 75 100 ICRM 54.6 0.4 56.0 0.5 55.8 0.5 55.8 0.5 55.9 0.5 ICRM-Mix 53.3 0.0 54.2 1.1 54.2 1.1 54.3 1.1 54.2 1.1

Published as a conference paper at ICLR 2024

Table 11: Average out-of-distribution test accuracies along with their corresponding standard errors for ARM+ and ERM+ in contrast to their base algorithms, ARM and ERM across FEMNIST, Rotated MNIST, WILDS Camelyon17 and Tiny-Image Net-C.

Dataset / algorithm Average test accuracy (by # in-context examples)

FEMNIST 0 25 50 75 100 ARM 49.5 1.0 83.9 0.5 84.4 0.5 84.7 0.6 84.6 0.3 ARM+ 71.4 1.2 83.4 0.2 84.0 0.2 83.8 0.2 83.5 0.1

ERM 79.3 0.4 79.3 0.4 79.3 0.4 79.3 0.4 79.3 0.4 ERM+ 77.4 1.3 77.4 1.3 77.4 1.3 77.4 1.3 77.4 1.3

Rotated MNIST 0 25 50 75 100 ARM 36.5 5.2 94.2 0.7 95.1 0.4 95.3 0.4 95.5 0.3 ARM+ 86.9 2.0 92.6 0.7 92.7 0.6 92.8 0.6 92.8 0.6

ERM 94.2 0.3 94.2 0.3 94.2 0.3 94.2 0.3 94.2 0.3 ERM+ 94.3 0.4 94.3 0.4 94.3 0.4 94.3 0.4 94.3 0.4

WILDS Camelyon17 0 25 50 75 100 ARM 61.2 5.2 59.5 4.2 59.7 4.2 59.7 4.3 59.7 4.2 ARM+ 55.8 0.8 55.1 1.7 55.0 1.7 55.0 1.8 55.0 1.8

ERM 68.6 7.8 68.6 7.8 68.6 7.8 68.6 7.8 68.6 7.8 ERM+ 50.1 0.1 50.1 0.1 50.1 0.1 50.1 0.1 50.1 0.1

Tiny Image Net-C 0 25 50 75 100 ARM 30.8 0.2 31.0 0.2 31.0 0.2 31.0 0.2 31.0 0.2 ARM+ 5.5 0.2 5.7 0.2 5.7 0.2 5.7 0.2 5.7 0.2

ERM 31.8 0.6 31.8 0.6 31.8 0.6 31.8 0.6 31.8 0.6 ERM+ 29.7 0.3 29.7 0.3 29.7 0.3 29.7 0.3 29.7 0.3

Imagenet R 0 25 50 75 100 ARM 56.3 0.8 58.1 0.3 58.8 0.8 59.8 0.8 59.0 0.3 ARM+ 1.8 0.2 1.5 0.2 1.5 0.2 1.5 0.2 1.5 0.1

ERM 58.9 0.5 58.9 0.5 58.9 0.5 58.9 0.5 58.9 0.5 ERM+ 57.0 0.4 57.0 0.4 57.0 0.4 57.0 0.4 57.0 0.4

CIFAR10-C 0 25 50 75 100 ARM 65.9 1.3 66.0 1.3 66.0 1.3 66.0 1.3 66.0 1.3 ARM+ 42.7 0.1 44.2 0.1 44.3 0.1 44.3 0.1 44.3 0.1

ERM 66.1 1.6 66.1 1.6 66.1 1.6 66.1 1.6 66.1 1.6 ERM+ 66.5 1.2 66.5 1.2 66.5 1.2 66.5 1.2 66.5 1.2

Published as a conference paper at ICLR 2024

Table 12: Worst environment out-of-distribution test accuracies along with their corresponding standard errors for ARM+ and ERM+ in contrast to their base algorithms, ARM and ERM across FEMNIST, Rotated MNIST, WILDS Camelyon17 and Tiny-Image Net-C.

Dataset / algorithm Worst case test accuracy (by # in-context examples)

FEMNIST 0 25 50 75 100 ARM 23.6 1.7 59.5 3.5 60.7 3.8 57.0 7.3 58.8 4.0 ARM+ 51.7 2.2 63.0 2.1 64.0 0.8 60.7 1.6 62.0 0.8

ERM 59.0 0.2 59.0 0.2 59.0 0.2 59.0 0.2 59.0 0.2 ERM+ 53.3 2.7 53.3 2.7 53.3 2.7 53.3 2.7 53.3 2.7

Rotated MNIST 0 25 50 75 100 ARM 28.2 2.1 85.3 1.6 87.2 1.0 87.9 1.0 87.9 0.9 ARM+ 71.4 2.6 80.9 1.8 81.0 1.8 81.2 1.9 81.1 1.8

ERM 80.8 1.1 80.8 1.1 80.8 1.1 80.8 1.1 80.8 1.1 ERM+ 81.9 0.7 81.9 0.7 81.9 0.7 81.9 0.7 81.9 0.7

WILDS Camelyon17 0 25 50 75 100 ARM 61.2 5.2 59.5 4.2 59.7 4.2 59.7 4.3 59.7 4.2 ARM+ 55.8 0.8 55.1 1.7 55.0 1.7 55.0 1.8 55.0 1.8

ERM 68.6 7.8 68.6 7.8 68.6 7.8 68.6 7.8 68.6 7.8 ERM+ 50.1 0.1 50.1 0.1 50.1 0.1 50.1 0.1 50.1 0.1

Tiny Image Net-C 0 25 50 75 100 ARM 8.2 0.3 8.3 0.3 8.2 0.3 8.3 0.3 8.2 0.3 ARM+ 1.9 0.1 1.9 0.1 1.9 0.1 1.9 0.1 1.9 0.1

ERM 9.5 0.4 9.5 0.4 9.5 0.4 9.5 0.4 9.5 0.4 ERM+ 8.3 0.3 8.3 0.3 8.3 0.3 8.3 0.3 8.3 0.3

Imagenet R 0 25 50 75 100 ARM 47.4 1.1 45.3 0.4 47.2 1.9 49.8 1.2 47.4 1.0 ARM+ 1.4 0.2 0.9 0.3 1.0 0.4 0.8 0.3 1.0 0.3

ERM 48.0 1.0 48.0 1.0 48.0 1.0 48.0 1.0 48.0 1.0 ERM+ 45.3 1.0 45.3 1.0 45.3 1.0 45.3 1.0 45.3 1.0

CIFAR10-C 0 25 50 75 100 ARM 39.3 1.7 39.3 1.7 39.4 1.7 39.3 1.7 39.4 1.7 ARM+ 24.5 0.7 24.7 0.6 24.8 0.5 24.8 0.5 24.8 0.6

ERM 39.8 2.5 39.8 2.5 39.8 2.5 39.8 2.5 39.8 2.5 ERM+ 40.4 1.8 40.4 1.8 40.4 1.8 40.4 1.8 40.4 1.8

Published as a conference paper at ICLR 2024

D.3 COMPARISON OF ICRM WITH IN-CONTEXT LEARNING

The first and most popular conception of In-context learning (Brown et al., 2020) involves providing the model with contextual information, typically a few sample (x, y) pairs that represent a specific task . In contrast, our approach ICRM introduces an alternative perspective on in-context learning, where unlabeled inputs x act as the contextual backdrop for a task, also known as an environment. Note that, in order to benefit from in-context learning in domain generalization, the context itself must be a sequence of unlabeled inputs since at test-time the learner only has access to the unlabeled x s from the test environment and not the labels y s.

Our method, ICRM, can seamlessly adapt to supervised settings, functioning with input sequences containing both (x, y) pairs, instead of just x. We refer to this approach as Supervised ICRM or ICL. We evaluate both Supervised ICRM and ICRM on FEMNIST and Rotated MNIST datasets. As anticipated, Supervised ICRM demonstrates superior performance compared to ICRM. However, it is not suitable for domain generalization settings where data labels are unavailable at inference.

Table 13: Average/worst OOD test accuracy for different context lengths, for ICRM and Supervised ICRM on FEMNIST and Rotated MNIST. Supervised ICRM refers to ICRM trained on labeled input sequences containing (x, y) pairs as context.

Data / method Average test accuracy Worst case test accuracy

FEMNIST 0 25 50 75 100 0 25 50 75 100 ICRM 78.7 87.2 87.4 87.5 87.8 59.8 69.3 70.6 70.6 70.6 Supervised ICRM 79.0 87.8 87.7 88.2 87.9 61.2 72.2 73.5 74.5 74.9

Rotated MNIST 0 25 50 75 100 0 25 50 75 100 ICRM 93.6 96.1 96.2 96.2 96.2 82.5 88.5 88.5 88.8 88.8 Supervised ICRM 93.3 96.3 96.3 96.3 96.3 82.0 89.0 89.0 89.1 89.3

D.4 INVESTIGATING THE FEATURES LEARNED BY ICRM

Figure 2 presents attention maps for two randomly sampled sequences for FEMNIST and Tinu Image Net-C datasets. Figure 6 shows similar visualization for two other random sequences. Particularly in the second row, attention is predominantly allocated to lines of length similar to that of the query (also in green), thereby disregarding shorter lines (shown in red). Similarly, the third row shows that the model, when presented with a query image of a train , attends not only on other trains but also on a bus indicating a semantic understanding of similarity.

The key takeaway from this visualization is that ICRM effectively learns to attend to a select few samples in an input sequence. Interestingly, these samples either belong to the same class or exhibit similar features, despite potentially belonging to different classes.

To gain deeper insights into the features learned by ICRM, we examine its capability to transition from broad domain indices to nuanced, compositional contextual descriptions of environments. This analysis is crucial for understanding how ICRM facilitates amortization across similar environments. In particular, we extract embeddings from the penultimate layer of our trained model, for data from every environment within the training set. Subsequently, we train a linear classifier to predict the corresponding environment from each embedding. We repeat the experiment for models trained using both ICRM and ICRM-Mix.

As illustrated in Figure 7, the linear model using embeddings from ICRM attains an accuracy of up to 75% on FEMNIST and 98% on Rotated MNIST. This efficacy suggests that ICRM embeds representations that are linearly separable with respect to their environmental origins. Additionally, ICRM not only exhibits superior accuracy but also a faster rate of convergence in comparison to ICRM-Mix. This advantage is likely due to ICRM s i.i.d data sampling from each environment. In

Published as a conference paper at ICLR 2024

0.08 0.15 0.10 0.04 0.21 0.03 0.00 0.19 0.15 0.04

0.12 0.06 0.32 0.10 0.15 0.10 0.00 0.00 0.11 0.05 Query

0.16 Beach Wagon

0.24 Academic Gown

0.12 Walking stick

0.15 Volleyball

0.00 0.00 0.00 0.11 Volleyball Query Volley ball

0.15 School bus

0.13 Mantis

0.13 Rocking chair

0.23 Bullet train

0.16 Bullet train

0.00 0.00 0.00 0.00 Query Bullet Train

Query Image

Figure 6: Attention scores for random test sequences, for ICRM on FEMNIST (top two rows) and Tiny Image Net-C (bottom two rows).

contrast, a linear model trained on embeddings from a ICRM-Mix model also achieves significant accuracy, reaching 70% on FEMNIST and 91% on Rotated MNIST. This further explains ICRMMix s robust performance in out-of-distribution generalization even in the absence of explicit domain separation during training, as analyzed in Appendix D.2.2.

0 1000 2000 3000 4000 5000 # Training epochs

ICRM ICRM-Mix

0 1000 2000 3000 4000 5000 # Training epochs

ICRM ICRM-Mix

Figure 7: Evolution of the classification accuracy of the linear model trained on the retrieved embeddings as a function of training epochs for (a) FEMNIST (b) Rotated MNIST