# When is Transfer Learning Possible?

My Phan, Kianté Brantley *1, Stephanie Milani *2, Soroush Mehri *3, Gokul Swamy *2, Geoffrey J. Gordon 2

We present a general framework for transfer learning that is flexible enough to capture transfer in supervised, reinforcement, and imitation learning. Our framework enables new insights into the fundamental question of when we can successfully transfer learned information across problems. We model the learner as interacting with a sequence of problem instances, or environments, each of which is generated from a common structural causal model (SCM) by choosing the SCM's parameters from restricted sets. We derive a procedure that can propagate restrictions on SCM parameters through the SCM's graph structure to other parameters that we are trying to learn. The propagated restrictions then enable more efficient learning (i.e., transfer). By analyzing the procedure, we are able to challenge widely-held beliefs about transfer learning. First, we show that having sparse changes across environments is neither necessary nor sufficient for transfer. Second, we show an example where the common heuristic of freezing a layer in a network causes poor transfer performance. We then use our procedure to select a more refined set of parameters to freeze, leading to successful transfer learning.

1. Introduction

Transfer learning is a well-studied problem with many variations such as covariate shift, representation learning, imitation learning, and finding invariant predictors. In this work, we present a framework that unifies multiple transfer learning problems and can be used broadly to analyze them. As a result, we challenge commonly-held assumptions about transfer, such as when sparse mechanism shift and layer freezing can be helpful.

*Equal contribution. 1Cornell University, 2Carnegie Mellon University, 3Elementera AI. Correspondence to: My Phan. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Our framework also helps us design models and decide which components to transfer. We model the transfer learning problem by viewing the world as a structural causal model (SCM; Pearl et al., 2000) with unknown constraints on its parameters. Learning about these constraints is what enables transfer. Within a world, the learner interacts with a sequence of environments. In each environment, the parameters of the SCM are chosen from the constrained domains, and the data is generated from the SCM with the chosen parameters. That is, each environment can be seen as a conditional intervention on a base SCM that represents the world. From each environment's data, we learn its parameters, and thus gain information about the unknown constraints. This view can be used to describe covariate shift, representation learning, and transfer learning in supervised learning, reinforcement learning, and imitation learning (see Section 3). The generation of environments from the world follows the Independent Causal Mechanisms (ICM) Principle (Schölkopf et al., 2021): the model is composed of independent modules, and within each module we can have constraints on its parameters (called local constraints). We show how constraints on parameters across the model (called global constraints) can be computed from local constraints, allowing us to learn faster in new environments.

Contributions.
In this work, we make the following contributions. First, we present a novel formalism for transfer learning that helps unify previously distinct areas of research, including covariate shift, representation learning, few-shot learning, and zero-shot learning, in which environments are generated from worlds by choosing local parameters of an SCM from constrained domains (Section 3). Second, we show how the constraints of global distributions can be computed from the constraints of local distributions, which enables us to analyze whether transfer is feasible and to derive a meta-algorithm to perform transfer (Section 4). Finally, we present case studies to illustrate how our work may be used to analyze and even challenge widely-held beliefs about transfer learning. We show that sparse mechanism shifts are neither necessary nor sufficient for transfer, and we show that freezing a layer of a network may either succeed or fail at transfer (Section 5).

Figure 1. Relationships among problems in Section 3.1. We focus on transfer learning through the world update problem.

2. Motivation

Transfer learning means using knowledge from previous environments to speed up learning in a similar but different environment. For example, an agent can be trained to classify cows using photos where cows are often associated with pasture, and tested on photos where cows are often associated with sand (Beery et al., 2018). A robot may be trained to walk on Earth, then asked to perform on Mars, where the gravity constant is different. Using our formalism, a world can consist of a set of possible associations between cows and the background, or a set of possible gravity constants, and environments correspond to specific values of the association or the gravity constant. Parameters learnt from previous environments can therefore be used as data points to learn about the constrained domain specified by the world. We call this the world update problem (Figure 1) and provide more details in Section 3.1. Formally, we define our transfer learning problem as the study of the speedups we can get by learning constraints from previous environments to reduce the hypothesis class in the current environment, ruling out possible parameter values so that we need less data or less computation for learning.

3. Problem Formalism

To that end, define a world as an SCM with a set of constrained domains for its parameters. We generate an environment from the world by choosing parameters from their domains. The learner starts out with some but not all information about the world: it might know something about the structure and/or the constraints. It then sees data from a sequence of environments. Parameters learnt in these environments become data points to estimate the domains, so that we can speed up learning in new environments. For one example, the learner might know that some parameters of a model are constant in all environments but might not know their shared value. After seeing enough data in one environment, the learner can transfer these parameters to the next (parameter freezing). In this example, the transfer strategy is clear. It is not obvious in general how to learn unknown constraints for transfer; we discuss this question in App. A, and give a more complex example next.

Figure 2. Causal graph for Example 3.1: the chain $X_1 \to X_2 \to X_3$ with edge parameters $w^*_2$ and $w^*_3$.

Example 3.1. In the SCM of Figure 2, $X_3 = w^*_3 X_2 \in \mathbb{R}^1$, $X_2 = w^*_2 X_1 \in \mathbb{R}^2$, and $X_1 \in \mathbb{R}^3$ is drawn from a Gaussian distribution with mean $w^*_1$.
In each environment, the parameters $w^*_1$, $w^*_2$, and $w^*_3$ are selected from sets $\mathcal{W}^*_1$, $\mathcal{W}^*_2$, and $\mathcal{W}^*_3$. Here, $\mathcal{W}^*_3$ is the set of all $1 \times 2$ matrices, $\mathcal{W}^*_2 = \{[1, 2, 3; 4, 5, 6]\}$ is a singleton $2 \times 3$ matrix (the same in all environments), and $\mathcal{W}^*_1$ is $\mathbb{R}^3$. Suppose the learner aims to predict $X_3$ from $X_1$ when $X_2$ is unobserved; it knows $w^*_2$ is constant in all environments but doesn't know its value. Starting from a conditional linear model, $X_3 = w_{3|1} X_1$, the optimal (unknown) parameter value is $w_{3|1} = w^*_3 w^*_2$ in each environment. Since $w^*_2$ is the same in all environments, $w_{3|1}$ is always in the row space of $w^*_2$. Using examples of $w_{3|1}$ learnt from previous environments, the learner can discover this row space, then use it as the hypothesis class for $w_{3|1}$ in new environments to expedite learning.

We now present the formal definition of worlds and environments, extending Buesing et al. (2019). Upper-case letters denote random variables, lower-case letters their values, and bold-faced letters denote tuples.

Definition 3.1 (World). Let $\mathbf{X} = (X_1, \ldots, X_n) \in \Omega_1 \times \cdots \times \Omega_n$ be $n$ variables, where $\Omega_i$ denotes the domain of $X_i$. Let $\mathbf{O} \subseteq \mathbf{X}$ denote the observable variables and $\mathbf{Z} \subseteq \mathbf{X}$ the unobservable ones. Let $\mathbf{U} = (U_1, \ldots, U_n)$, where $U_i$ is a vector of $d_i$ independent uniform noise variables.¹ Let $f_w$ be a parameterized function class and $\mathbf{P}_i \subseteq \{X_1, \ldots, X_{i-1}\}$ be the set of possible causal parents of $X_i$, so that $X_i = f_{w^*_i}(\mathbf{P}_i, U_i)$ for $1 \le i \le n$. (We call $w^*_i$ an atomic parameter, to distinguish it from aggregated parameters like $w_{3|1}$ in Ex. 3.1.) Let $D^*_i$ be the unconstrained domain of $w^*_i$ and $\mathcal{W}^*_i \subseteq D^*_i$ be the constrained domain of values for $w^*_i$. (When relevant, we call these atomic domains.) The world is the set of all possible SCMs obtained by selecting atomic parameters from their constrained domains: $w^*_i \in \mathcal{W}^*_i$ for all $i$, $1 \le i \le n$.

The learner sees a sequence of $k$ environments $\{\mathcal{E}_e\}_{e=1}^{k}$.

Definition 3.2 (Environment). To make an environment $\mathcal{E}_e$, we divide the observable variables $\mathbf{O}$ into ones the learner cannot intervene on (states) $\mathbf{S}_e \subseteq \mathbf{O}$ and ones the learner can intervene on (actions) $\mathbf{A}_e \subseteq \mathbf{O}$. We then select the parameter $w^e_i$ from $\mathcal{W}^*_i$ for all $i$ ($1 \le i \le n$).²

¹$U_i$ provides randomness for $X_i$. Using inverse CDF transform sampling, joint distributions can be expressed as SCMs (Buesing et al., 2019).

²This can be interpreted as conditional interventions, as described in Correa & Bareinboim (2020).

In each environment, the learner learns about conditional distributions in order to make predictions. We leave the exact prediction task unspecified; instead we focus on the conditionals, and assume that discovering these will lead to task success. We represent conditionals with parameter vectors:

Definition 3.3 (Description). Let $I$ and $K$ be tuples of indices, and write $X_I$ and $X_K$ for the corresponding tuples of variables. The learner represents the conditional distribution of $X_I$ given $X_K$, $P^{I|K}_{w_{I|K}}(\cdot)$, using a vector of parameters $w_{I|K}$.³ Many parameter vectors might be equivalent in a given environment $\mathcal{E}_e$, either because they yield identical conditional distributions, or because $\mathcal{E}_e$'s distribution over $X_K$ puts no weight on places where they differ (see Example 3.2 below). We write $P^{I|K}_{w} \equiv P^{I|K}_{w'}$, and we say that all of these vectors are descriptions of the same conditional distribution in $\mathcal{E}_e$. We use a bar to denote a set of equivalent descriptions: e.g., $\bar{w}$ is the set of descriptions that are equivalent to $w$. If the conditional is for a single variable $X_i$ given its parents, we call it an atomic description.
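To make Example 3.1 concrete, here is a minimal numerical sketch (Python with numpy; the sample sizes, the particular $w^*_2$, and all names are illustrative assumptions, not the paper's implementation). Descriptions $w_{3|1}$ learnt in earlier environments serve as data points for recovering the row space of $w^*_2$, which then becomes a two-dimensional hypothesis class for a new environment:

```python
import numpy as np

rng = np.random.default_rng(0)
w2_star = np.array([[1., 2., 3.], [4., 5., 6.]])  # shared 2x3 parameter (same in all environments)

def run_environment(n=2000):
    """Generate one environment and return the least-squares estimate of w_{3|1}."""
    w1_star = rng.normal(size=3)          # environment-specific mean of X1
    w3_star = rng.normal(size=(1, 2))     # environment-specific 1x2 parameter
    X1 = rng.normal(loc=w1_star, size=(n, 3))
    X2 = X1 @ w2_star.T
    X3 = X2 @ w3_star.T
    w31_hat, *_ = np.linalg.lstsq(X1, X3, rcond=None)  # estimate of w_{3|1} = w3* w2*
    return w31_hat.T                                    # shape (1, 3)

# World update: descriptions from previous environments reveal the row space of w2_star.
W = np.vstack([run_environment() for _ in range(5)])    # 5 learnt descriptions, each in R^3
_, _, Vt = np.linalg.svd(W)
basis = Vt[:2]                                          # estimated 2-dim row space of w2_star

# New environment: learn only 2 coefficients (w_{3|1} = c @ basis) instead of 3.
w1_star = rng.normal(size=3)
w3_star = rng.normal(size=(1, 2))
X1 = rng.normal(loc=w1_star, size=(50, 3))              # few samples suffice
X3 = (X1 @ w2_star.T) @ w3_star.T
c, *_ = np.linalg.lstsq(X1 @ basis.T, X3, rcond=None)   # constrained regression
print("constrained estimate:", c.T @ basis)
print("true w_{3|1}:        ", w3_star @ w2_star)
```

Because the constrained regression estimates two coefficients instead of three, fewer samples are needed in the new environment; this is the kind of reduction that the feasibility definitions in Section 3.2 quantify.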
For brevity, we sometimes write the conditional distribution as XI|XK or I|K, leaving the parameters implicit. After learning descriptions of some conditional distributions, the learner can update its estimates of the corresponding constrained domains. It can use these estimates to try to transfer to future environments. We discuss this process in more detail below (Sec. 4); there we address issues like how to reconcile descriptions learnt in different environments, and how to reason about the way different constrained domains relate to one another. To this end, we will give a toolbox of basic operations that can combine descriptions to build new ones step by step. These operations let us compute general domains from atomic ones: that is, w I|K lies in a domain WI|K DI|K that we can compute from the domains W i . As mentioned above, whether two conditional distributions for XI|XK are distinct can depend on the domain of XK. The following example illustrates this: Example 3.2 (Covariate Shift). We consider the SCM X2 = w 2X1 where w 2 = (1, 1). Let D 1, the unrestricted domain of w 1, be {u, v}. If w 1 = u, X 1 is uniformly distributed on the line segment from (0, 0) to (1, 2), and if w 1 = v, X 1 is normally distributed on R2. Therefore in Environment 1 when w 1 = u, w 2 = {w | w(1, 2)T = 3} and in Environment 2 when w 1 = v, w 2 = {(1, 1)}. Note that w 2 depends on w 1 even when w 2 is chosen independently from w 1. From Environment 1, the agent might learn ˆw 2 = (2, 0.5), which is a description of w 2 and incorrectly transfers it to the next environment. 3For example, with discrete variables, w I|K could be the conditional probability table; or with Gaussian variables, w I|K could be the best linear predictor. In Section 3.2 and Section 4.3, for simplicity, we assume that each w results in a unique distribution (Assumption 3.4); for example, this happens when the joint distribution is Gaussian. We discuss the case when the assumption does not hold in Section 4.4. Assumption 3.4 (Unique Description). For any w 1, . . . , w n D 1, . . . , D n, for any two sets of disjoint indexes I, K, w , w DI|K, if P I|K w (.) P I|K w (.) then w = w . 3.1. Landscape around Transfer Learning In this section, we differentiate transfer learning from related problems (Figure 1) and discuss how it relates to common concepts. Let XQ be some query variables that we wish to estimate: for example, the label in supervised learning or the expected return (Q function) in reinforcement learning. To help predict XQ, the learner can solve several problems: Causal discovery: Learn the graphical structure of the underlying SCM. Predictor selection: Choose evidence variables XE to estimate XQ|XE.4 Parameter learning: Given a hypothesis set c WQ|E, learn we Q|E c WQ|E. World update: c W0 Q|E is the initial knowledge of the learner about w Q|E before any environment. Given estimated values of w1 Q|E, . . . , we Q|E learnt from previous environments E1, . . . , Ee, compute c We Q|E as an estimate of WQ|E to be used in parameter learning. In this work, we focus on the world update problem, but the other problems are also important. The causal discovery problem is widely studied in the causal inference literature. Predictor selection is studied by works that focus on finding the invariant predictor (Zhang et al., 2020; Peters et al., 2015; Arjovsky et al., 2019). (Note that our framework is applicable even when there is no invariant predictor (see the Complex Colored MNIST example in Section 5.2.2).) 
Supervised learning and reinforcement learning methods focus on parameter learning. Independent Causal Mechanism (ICM) principle and Sparse Mechanism Shift (SMS) hypothesis. The ICM principle (Sch olkopf et al., 2021) states that the causal generative model consists of modules that do not inform or influence each other. The SMS hypothesis (Sch olkopf et al., 2021) states that changes affect the causal factorization locally in- 4Evidence variables can be chosen to be either effects or causes of the query variables Q. For example, medical tests (which measure effects) are often performed to diagnose an illness (the cause). We discuss in Example 5.3 how choosing the evidence variables can be a trade-off between complexity and achievable accuracy. When is Transfer Learning Possible? stead of affecting all factors simultaneously. Our framework follows the ICM principle, but does not assume the SMS hypothesis. In Section 5.1 we show that the SMS hypothesis is not enough to guarantee transfer. Few-shot and Zero-shot Learning. Zero-shot learning happens when the constrained domain learnt from previous environments contains only one point (or a small neighborhood), and therefore can be used directly in the new environment. Few-shot learning happens when the learnt domain has low complexity, and therefore it takes a small number of samples to learn the parameter. Covariate Shift. This is a special case of our setting where the model is X Y and the conditional distribution P(Y | X) is the same in all environments. In equations, Y = fw Y (X, UY ); X = fw X(UX), where w Y is the same in all environments and w X changes in different environments. Representation Learning. Representation learning is a special case of our setting where X is the evidence and there exists a representation variable Z. Either Z = fw Z(X, UZ) and w Z is the same in all environments or X = fw X(Z, UX) where w X is the same in all environments. Transfer in Supervised Learning. There is no action variable. Time series. We have an SCM that consists of repeated copies of a base structure, one per time step; parameters and their domains are often shared across time steps. Examples include hidden Markov models and dynamic Bayes nets. Transfer in Reinforcement Learning. There are action variables. There may be multiple time steps. The setting includes Markov decision processes (MDPs, Puterman, 1990) when there are no hidden variables. When there are hidden variables the setting includes partially-observable MDPs (POMDPs, Kaelbling et al., 1998). Imitation Learning. Action variables change from nonintervenable in a previous environment to intervenable in the current environment. 3.2. Feasibility of Transfer Learning In order for transfer to be feasible, our world updates must somehow reduce the difficulty of learning in new environments. Depending on the setting, we can measure this difficulty in different ways: e.g., estimates of the number of samples needed or of the computational resources required. For example, it is common to bound the required number of samples in terms of a complexity measure such as VC dimension in supervised learning (Vapnik et al., 1994) or Bellman-Eluder dimension in reinforcement learning (Jin et al., 2021). Transfer is feasible when our updated domain c We Q|E improves on the original one c W0 Q|E according to the appropriate complexity measure. We formalize this idea below. For simplicity, we assume unique generators (Assumption 3.4), and we assume realizability, i.e., WQ|E c W0 Q|E. 
We believe these assumptions can be lifted, but we leave this direction to future work.

Definition 3.5 (Transfer feasibility of the world). Given a complexity measure comp : $2^{D_{Q|E}} \to \mathbb{R}$, we say that the world is transfer feasible if comp$(\mathcal{W}_{Q|E}) <$ comp$(\widehat{\mathcal{W}}^0_{Q|E})$.

In a transfer-feasible world, our initial knowledge of the domains is meaningfully weaker than what we might eventually learn. This doesn't guarantee that learning will actually happen; for that, we need a transfer algorithm.

Definition 3.6 (Transfer feasibility of an algorithm). Let $\widehat{\mathcal{W}}^e_{Q|E} \subseteq D_{Q|E}$ be the estimate of $\mathcal{W}_{Q|E}$ after environments $\mathcal{E}_1, \ldots, \mathcal{E}_e$ output by algorithm A. Given a complexity measure comp : $2^{D_{Q|E}} \to \mathbb{R}$, if $\mathcal{W}_{Q|E} \subseteq \widehat{\mathcal{W}}^e_{Q|E}$ and comp$(\widehat{\mathcal{W}}^e_{Q|E}) <$ comp$(\widehat{\mathcal{W}}^0_{Q|E})$, then A is called transfer feasible after $e$ environments.

Zero-shot and few-shot learning are examples of transfer feasibility: they happen when $\widehat{\mathcal{W}}^e_{Q|E}$ is extremely simple, so that parameter learning is unnecessary (zero-shot) or takes only a small number of samples (few-shot).

4. Main Results

Following the ICM principle, the generators $w^e_i$ are chosen independently from their domains $\mathcal{W}^*_i$. In this section we show that the generator of interest, $w^e_{Q|E}$, is also chosen from a constrained domain $\mathcal{W}_{Q|E}$ which can be computed from the $\mathcal{W}^*_i$. We present a procedure that describes how to compute the constrained domain $\mathcal{W}_{Q|E}$ from the atomic constrained domains $\mathcal{W}^*_i$ (Sections 4.1 to 4.3). Then, given some knowledge about the atomic constrained domains $\mathcal{W}^*_i$, we present a meta-algorithm that computes the constrained domain $\mathcal{W}_{Q|E}$ (Section 4.4) for the world update problem. The meta-algorithm helps to design the architecture of the model and to identify the components to transfer, which we demonstrate in Section 5.

4.1. Operations and Generators

In this section we define operations to compute a description of the distribution Q|E from the atomic parameters $w^*_i$ of the SCM. Using the operations, we define the computation procedure in Section 4.2. First, we assume there exist operations that compute the product, sum-out, and conditional ($\otimes$, $\Sigma_J$ and $\Gamma_J$) on the descriptions of the distributions. Using these operations, we show in Theorem 4.2 that we can compute a description of Q|E from descriptions of each $X_i$ given its parents. Second, we show a description of $X_i$ given its parents that depends only on $w^*_i$. Therefore, we can compute a description of Q|E from the atomic parameters $w^*_i$. In Section 4.3 we show how to use this computation to compute the constraint on this description of Q|E from the atomic constraints $\mathcal{W}^*_i$.

First we define the description of the distribution of $X_i$ given its parents that depends only on $w^*_i$. We call this the generator.

Definition 4.1 (Generator). Let $\Omega_i$ and $\Omega^p_i$ denote the domains of $X_i$ and $\mathbf{P}_i$ (the parents of $X_i$) defined by the world. We assume that there exists a function $P^i_w(\cdot, \cdot) : \Omega_i \times \Omega^p_i \to \mathbb{R}$ parameterized by $w$ such that for any $p \in \Omega^p_i$, $P^i_w(\cdot, p) : \Omega_i \to \mathbb{R}$ represents the distribution of $X_i = f_w(p, U_i)$.⁵ Then $P^i_{w^*_i}(\cdot, p) : \Omega_i \to \mathbb{R}$ represents the distribution of $X_i = f_{w^*_i}(p, U_i)$ defined by the SCM. We call $P^i_{w^*_i}(\cdot, \cdot)$ the canonical (atomic) conditional distribution of $X_i$ given its parents, and $w^*_i$ its (atomic) generator.

Table 1. Operations. $w_{I|K}$ is equivalent to $(w; I|K)$.
$w_{I,J|K} = u_{I|J,K} \otimes v_{J|K}$ : computing the product $(I,J|K)$
$w_{I|K} = \Sigma_J\, u_{I,J|K}$ : summing out $J$
$w_{I|J,K} = \Gamma_J\, u_{I,J|K}$ : conditioning on $J$
$w_{I|K} = u_{I|J,K} \circ v_{J|K}$ : computing $I|K$ from $I|J,K$ and $J|K$
Let the expression $(u; I|K)$, where $u \in D_{I|K}$, denote that $u$ is a description of $I|K$, and return $u$.

Product $\otimes$: Given $(u; I|J,K)$ where $u \in D_{I|J,K}$ and $(v; J|K)$ where $v \in D_{J|K}$, $\otimes$ returns $(w; I,J|K)$, where $I,J|K$ specifies the distribution $\otimes$ computes. Assuming that $u$ is a description of the distribution $I|J,K$ and $v$ is a description of the distribution $J|K$, $\otimes$ returns a description $w$ of $I,J|K$. If $u$ is the generator of $I|J,K$ and $v$ is the generator of $J|K$, then we call $w$ the generator of $I,J|K$.

Sum-out $\Sigma_J$: Given $(u; I,J|K)$ where $u \in D_{I,J|K}$, $\Sigma_J$ returns $(w; I|K)$, where $I|K$ specifies the distribution $\Sigma_J$ computes. Assuming that $u$ is a description of the distribution $I,J|K$, $\Sigma_J$ returns a description $w$ of $I|K$. If $u$ is the generator of $I,J|K$ then we call $w$ the generator of $I|K$. Example: if $w_{I,J|K}$ is a probability table, then $\Sigma_J$ creates the probability table $w_{I|K}$ by summing over the appropriate entries in $w_{I,J|K}$.

⁵For example, with a linear-Gaussian SCM, $P^i_w(\cdot, p)$ could be the Gaussian pdf and $w$ could be the linear coefficient.

Figure 3. The causal graph and the scope of $w_{7|6,2}$ in yellow. The formula for $w_{7|6,2}$ only uses the $w^*_i$ of variables in the scope.

Conditioning $\Gamma_J$: Given $(u; I,J|K)$ where $u \in D_{I,J|K}$, $\Gamma_J$ returns $(w; I|J,K)$, where $I|J,K$ specifies the distribution $\Gamma_J$ computes. Assuming that $u$ is a description of the distribution $I,J|K$, $\Gamma_J$ returns a description $w$ of $I|J,K$. If $u$ is the generator of $I,J|K$ then we call $w$ the generator of $I|J,K$. Example: let $(X, Y) \sim \mathcal{N}(\mu_{X,Y}, \Sigma_{X,Y})$ with the precision matrix partitioned as $\Sigma_{X,Y}^{-1} = \begin{pmatrix} \Lambda_{XX} & \Lambda_{XY} \\ \Lambda_{YX} & \Lambda_{YY} \end{pmatrix}$ and $\mu_{X,Y} = (\mu_X, \mu_Y)$. Then $w_{X,Y} = (\mu_{X,Y}, \Sigma_{X,Y})$ and $w_{Y|X} = \Gamma_X\, w_{Y,X} = \big(\mu_Y - \Lambda_{YY}^{-1}\Lambda_{YX}(x - \mu_X),\ \Lambda_{YY}^{-1}\big)$ (Bishop, 2006).

For convenience, we define the operation $\circ$ from $\otimes$ and $\Sigma$:
$(u; I|J,K) \circ (v; J|K) := \Sigma_J\big((u; I|J,K) \otimes (v; J|K)\big) \quad (1)$
$= (w; I|K). \quad (2)$
Given $(u; I|J,K)$ where $u \in D_{I|J,K}$ and $(v; J|K)$ where $v \in D_{J|K}$, $\circ$ returns $(w; I|K)$, where $I|K$ specifies the distribution $\circ$ computes. Assuming that $u$ is a description of the distribution $I|J,K$ and $v$ is a description of the distribution $J|K$, $\circ$ returns a description $w$ of $I|K$. If $u$ is the generator of $I|J,K$ and $v$ is the generator of $J|K$, then we call $w$ the generator of $I|K$.

We summarize the operations in Table 1. To simplify notation, we omit the indexes $I$, $J$ and $K$ from the formulas and write $u \otimes v$ instead of $(u; I|J,K) \otimes (v; J|K)$ whenever the context is clear. Similarly we write $\Sigma_J u$, $\Gamma_J u$ and $u \circ v$ without the indexes. We reuse $w_{I|K}$ from Definition 3.3 to denote the generator of $I|K$.⁶

4.2. Scopes and Computation Trees

The result in this section does not depend on whether Assumption 3.4 holds. We present a theorem that computes the generator $w_{Q|E}$ from the atomic generators $w^*_i$ using the defined operations. Using this computation procedure, we can then compute the constraint on this description of Q|E from the constraints $\mathcal{W}^*_i$ of the atomic parameters in Section 4.3.

⁶Note that $w^*_i$ denotes the generator of $X_i$ given its parents, while $w_i$ denotes the generator of the marginal of $X_i$.

Naively, we could compute $w_{Q|E}$ from the joint distribution $w_{Q,E}$, which could be calculated from the set of all $w^*_i$ using variable elimination. Instead, we derive a smaller set of variables $w^*_i$ from which $w_{Q|E}$ can be computed, called the scope of $w_{Q|E}$ and denoted Scope(Q|E). We illustrate the scope in Figure 3 and Figure 4. Our method recursively computes $w_{Q|E}$ by building a computation tree top-down. It starts with $w_{Q|E}$ as the root, recursively chooses a node in the tree, and decomposes it into its children using the operations until all of the leaf nodes are of the form $w^*_i$.
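Before formalizing scopes, the operations above can be made concrete for discrete distributions, where descriptions are conditional probability tables. The sketch below (Python; the table shapes, names, and the chain X -> Z -> Y are hypothetical choices, not taken from the paper) implements the product, sum-out, conditioning, and composition operations on two-variable tables:

```python
import numpy as np

# Conditional probability tables as arrays: axis 0 indexes the child variable (I),
# axis 1 indexes the conditioning/joint partner (J).
p_z_given_x = np.array([[0.9, 0.2],   # P(Z | X), shape (z, x)
                        [0.1, 0.8]])
p_y_given_z = np.array([[0.7, 0.3],   # P(Y | Z), shape (y, z)
                        [0.3, 0.7]])
p_x = np.array([0.6, 0.4])            # P(X)

def product(p_i_given_j, p_j):
    """Product: description of the joint I,J from descriptions of I|J and J."""
    return p_i_given_j * p_j[None, :]

def sum_out_j(p_ij):
    """Sum-out: marginal of I from the joint I,J (sum over axis 1, the J axis)."""
    return p_ij.sum(axis=1)

def condition_on_j(p_ij):
    """Conditioning: I|J from the joint I,J, by renormalizing each column."""
    return p_ij / p_ij.sum(axis=0, keepdims=True)

def compose(p_i_given_j, p_j_given_k):
    """Composition: I|K from I|J and J|K; for tables this is a matrix product."""
    return p_i_given_j @ p_j_given_k

joint_zx = product(p_z_given_x, p_x)            # description of Z,X
print(sum_out_j(joint_zx))                      # marginal P(Z)
print(condition_on_j(joint_zx))                 # recovers P(Z|X)
print(compose(p_y_given_z, p_z_given_x))        # generator of Y|X on the chain X -> Z -> Y
```

The last line is the table analogue of the composition used in Example 3.1, where the aggregated parameter $w_{3|1}$ was obtained by composing two atomic generators along a chain.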
Let $w_{\text{Scope}(Q|E)} := \{w^*_i \mid i \in \text{Scope}(Q|E)\}$. Variable elimination is a special case of our method if we perform $w_{Q|E} = \Gamma_E\, w_{Q,E}$ in the first step and then compute the joint $w_{Q,E}$ from the atomic generators using $\otimes$ and $\Sigma$ in the remaining steps. We compare our method with variable elimination in Figure 4 and Appendix E.

Theorem 4.2 (Top-down computation tree, informal). $w_{Q|E}$ can be calculated from the $w^*_i$ in the scope of $w_{Q|E}$ using $\otimes$, $\Sigma$ and $\Gamma$, so that each $w^*_i$ appears at most once, in reverse topological order. We denote the resulting formula $g_{Q|E}(w_{\text{Scope}(Q|E)})$, also called a computation tree of $w_{Q|E}$.⁷

For clarity, we show some examples of computation trees in Figure 4. We also show other examples (including the computation tree for Q-learning) in App. E.

4.3. Computing Global Constraints from Local Constraints

Given the computation tree, we can compute the set $\mathcal{W}_{Q|E}$ that $w_{Q|E}$ is chosen from. Let $\mathcal{W}_{\text{Scope}(Q|E)} = \{\mathcal{W}^*_i \mid w^*_i \in w_{\text{Scope}(Q|E)}\}$ be the corresponding constrained domain of $w_{\text{Scope}(Q|E)}$. From Theorem 4.2, $w_{Q|E}$ is computed by selecting $w_{\text{Scope}(Q|E)}$ and computing $g_{Q|E}(w_{\text{Scope}(Q|E)})$. Therefore $\mathcal{W}_{Q|E}$ is the set of all possible $w_{Q|E}$ computed by selecting $w_{\text{Scope}(Q|E)}$ from $\mathcal{W}_{\text{Scope}(Q|E)}$ and then applying the computation tree. We define a notation for the set of function values over a set of inputs: for any set $S$ and a function $f$ such that $S$ is in the domain of $f$, $f(S) := \{f(s) \mid s \in S\}$. Another approach to computing $g_{Q|E}(\mathcal{W}_{\text{Scope}(Q|E)})$ in the case of supervised learning is discussed in Appendix E.4.

Corollary 4.3. $\mathcal{W}_{Q|E}$ can be computed from $\mathcal{W}_{\text{Scope}(Q|E)} \subseteq \{\mathcal{W}^*_1, \ldots, \mathcal{W}^*_n\}$: $\mathcal{W}_{Q|E} = g_{Q|E}(\mathcal{W}_{\text{Scope}(Q|E)})$. Therefore the transfer feasibility of the world can be determined from $\mathcal{W}^*_1, \ldots, \mathcal{W}^*_n$ by computing comp$(\mathcal{W}_{Q|E})$.⁸

When the learner does not know the exact formula, we introduce the following properties to analyze how the local constraints can propagate throughout the network. We illustrate in Sec. 5 how to use the properties to derive constraints on the quantity we would like to estimate.

⁷There can be multiple computation trees, depending on the choices the learner makes.

⁸The second statement of Corollary 4.3 assumes Assumption 3.4 holds, while the first statement does not depend on Assumption 3.4.

Property 4.3.1 (Propagatable). Let $D_1$, $D_2$, and $D_{1 \circ 2}$ be the unconstrained domains of $w_1$, $w_2$, and $w_1 \circ w_2$ respectively, where $\circ : D_1 \times D_2 \to D_{1 \circ 2}$ is an operation. We define how a constraint can propagate to the left and to the right. Let $F \subseteq D_2 \cap D_{1 \circ 2}$. An operation is called F-left-propagatable if for any $w_2 \in F$, $w_1 \circ w_2 \in F$. Now, let $F \subseteq D_1 \cap D_{1 \circ 2}$. An operation is called F-right-propagatable if for any $w_1 \in F$, $w_1 \circ w_2 \in F$.

4.4. Transfer Process

Now we are equipped to derive a process to learn $\widehat{\mathcal{W}}^e_{Q|E}$ from the $\widehat{w}^1_{Q|E}, \ldots, \widehat{w}^e_{Q|E}$ learnt from multiple environments $\mathcal{E}_1, \ldots, \mathcal{E}_e$. Let $\widehat{\mathcal{W}}^*_i$, for some $i$, $1 \le i \le n$, be some local information the learner knows about $\mathcal{W}^*_i$ (if the learner does not know anything about $\mathcal{W}^*_i$ it can assume $\widehat{\mathcal{W}}^*_i = D^*_i$). Let $\widehat{\text{Scope}}(Q|E)$ be an estimate of Scope(Q|E) such that Scope(Q|E) $\subseteq \widehat{\text{Scope}}(Q|E)$. Without any knowledge we can set $\widehat{\text{Scope}}(Q|E)$ to be all variables. Let $\widehat{g}_{Q|E}$ be an approximation of $g_{Q|E}$, obtained from knowledge or from the propagation properties, such that $g_{Q|E}(\mathcal{W}_{\text{Scope}(Q|E)}) \subseteq \widehat{g}_{Q|E}(\mathcal{W}_{\widehat{\text{Scope}}(Q|E)})$ for any input $\mathcal{W}$. For example, if all operations are F-right-propagatable and the left-most term is in F, $\widehat{g}_{Q|E}$ can output F. We call a set $\mathcal{W}$ parameterized by a parameter $\theta$ if $\mathcal{W}$ is determined by $\theta$. For example, if $\mathcal{W}$ is the set of all $w$ such that $f_\theta(w) = 0$ for a known class of functions $f_\theta$, then $\mathcal{W}$ is parameterized by $\theta$.
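Before stating the meta-algorithm, here is a small numerical illustration of Property 4.3.1 under the assumption (as in Example 3.1) that the relevant operation $\circ$ is matrix multiplication of linear-model coefficients: the set F of matrices whose rows lie in a fixed subspace is F-left-propagatable, since any product $w_1 \circ w_2$ with $w_2 \in F$ stays in F. The check below is a sketch with hypothetical dimensions, not a proof:

```python
import numpy as np

rng = np.random.default_rng(1)

# A fixed 2-dimensional subspace of R^3, spanned by the rows of B.
B = np.array([[1., 0., 2.],
              [0., 1., -1.]])

def in_row_space(w, B, tol=1e-9):
    """True if every row of w lies in the row space of B."""
    coeffs = np.linalg.lstsq(B.T, w.T, rcond=None)[0]
    return np.allclose(B.T @ coeffs, w.T, atol=tol)

# F = matrices whose rows lie in the subspace; composition is the matrix product.
for _ in range(100):
    w2 = rng.normal(size=(2, 2)) @ B      # any w2 in F (its rows are combinations of B's rows)
    w1 = rng.normal(size=(1, 2))          # arbitrary left factor
    assert in_row_space(w2, B)
    assert in_row_space(w1 @ w2, B)       # the product stays in F

print("F is left-propagatable under matrix composition (checked on 100 random draws).")
```

This is the propagation that lets the learner in Example 3.1 conclude that $w_{3|1}$ always lies in the row space of $w^*_2$ without knowing the exact computation tree.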
Algorithm 1 captures our procedure to compute $\widehat{\mathcal{W}}^e_{Q|E}$. In App. A, we describe an algorithm to learn the constraint from the approximations of the parameters in previous environments.

If Assumption 3.4 does not hold, Algorithm 1 needs to be modified. Consider the case in Example 3.2, when $w^*_2 = (1, 1)$ but $\bar{w}^*_2 = \{w : w(1, 2)^T = 3\}$, and $w^*_2$ is the same in all environments. From Environment 1, the agent might learn $\hat{w}^*_2 = (2, 0.5)$, which is a description of $w^*_2$, and incorrectly transfer it to the next environment. Instead, for all environments $e$, the agent needs to learn the set of all possible descriptions in $\widehat{\mathcal{W}}^e_{Q|E}$; in this case $\widehat{w}^i_{Q|E}$ does not denote just a point but the set of all possible, equivalent descriptions. In Example 3.2, this amounts to learning $\bar{w}^*_2$ from Environment 1. The learner then uses the sets of all possible descriptions $\widehat{w}^i_{Q|E}$ learned from previous environments to learn the set $\Theta$ of all $\theta$ such that, for every $i = 1, \ldots, e$, there exists $w^i \in \widehat{w}^i_{Q|E}$ with $f_\theta(w^i) = 0$. Let $\widehat{\mathcal{W}}^e_{Q|E} = \bigcup_{\theta \in \Theta}\{w : f_\theta(w) = 0\}$. The agent then learns the set of all possible descriptions $\widehat{w}^{e+1}_{Q|E}$ within $\widehat{\mathcal{W}}^e_{Q|E}$.

Figure 4. Causal graph and the computation trees for $w_{3|0,2}$. For clarity, we do not draw the uniform noise variables. (Left) Causal graph and the scope of $w_{3|2,0}$ in yellow. (Middle) A computation tree of $w_{3|2,0}$ constructed from Thm. 4.2, which is equivalent to calculating $p(X_3|X_2, X_0) = \sum_{x_1} p(X_3|X_0, X_1 = x_1)\, p(X_1 = x_1|X_2, X_0)$, with $p(X_1 = x_1|X_2, X_0) = \frac{p(X_1 = x_1, X_2|X_0)}{\sum_{x'_1} p(X_1 = x'_1, X_2|X_0)}$ and $p(X_1 = x_1, X_2|X_0) = p(X_2|X_0, x_1)\, p(x_1|X_0)$ for discrete distributions. Variable elimination cannot construct this computation tree because conditioning ($\Gamma$) is performed to compute $p(X_1 = x_1|X_2, X_0)$ before sum-out and product ($\Sigma$, $\otimes$). $w^*_3$ goes through only 1 operation. (Right) A computation tree of $w_{3|2,0}$ constructed by naive variable elimination on the joint distribution $w_{0,2,3}$. Each leaf goes through at least 2 operations (at least one to construct the joint distribution and then the conditional $\Gamma$).

Algorithm 1 Meta-Algorithm for Transfer
Input: Estimated constrained domains $\widehat{\mathcal{W}}^*_1, \ldots, \widehat{\mathcal{W}}^*_n$; parameters learnt from previous environments $\widehat{w}^1_{Q|E}, \ldots, \widehat{w}^e_{Q|E}$; estimated $\widehat{\text{Scope}}(Q|E)$; estimated computation tree $\widehat{g}_{Q|E}$ to compute $w_{Q|E}$ from the $w^*_i$.
Output: $\widehat{\mathcal{W}}^e_{Q|E}$, an estimate of $\mathcal{W}_{Q|E}$.
1. Calculate $\widehat{\mathcal{W}}^e_{Q|E} = \widehat{g}_{Q|E}(\widehat{\mathcal{W}}_{\widehat{\text{Scope}}(Q|E)})$, parameterized by an unknown parameter $\theta$.
2. Learn $\theta$ using $\widehat{w}^1_{Q|E}, \ldots, \widehat{w}^e_{Q|E}$ and compute $\widehat{\mathcal{W}}^e_{Q|E}$ from $\theta$.

5. Case Studies

We now demonstrate how our framework can be used to analyze widely-held beliefs about transfer learning.

5.1. Sparse Mechanism Shift

The Sparse Mechanism Shift (SMS) Hypothesis states that changes are often sparse and local in a causal graph (Schölkopf et al., 2021; Lopez et al., 2023; Lachapelle et al., 2022; Bereket & Karaletsos, 2024). We demonstrate that SMS does not always imply transfer feasibility, and vice versa.

Example 5.1 (The Chain Problem). Assume $X_0$ is normally distributed on $\mathbb{R}^2$. For $i = 1, \ldots, 100$, $X_i = f_{w^*_i}(X_{i-1})$, and $X_1, \ldots, X_{99}$ are unobserved. Consider the linear model where $f_{w^*_i}(X_{i-1}) = w^*_i X_{i-1}$. The learner observes $X = X_0$ and $Y = X_{100}$ and would like to predict $Y$ from $X$. The learner fits a linear model $Y = wX$, where $w$ is any $2 \times 2$ matrix, making $\widehat{\mathcal{W}}^0_{100|0}$ the set of all $2 \times 2$ matrices. The complexity measure comp is the number of parameters that need to be learned, with comp$(\widehat{\mathcal{W}}^0_{100|0}) = 4$. App.
F shows that if w i ( 2 i 99) is the same in all environments (sparse changes), transfer learning may not be feasible. However, if only w 1 is the same in all environments (non-sparse changes), transfer may be feasible, reducing comp(c W1 100|0) the dimension of the hypothesis class after environment 1 to 2. 5.2. Freezing Layers of Neural Networks Freezing a layer of a neural network learnt from previous environments is standard in transfer learning (Yosinski et al., 2014; Long et al., 2015; Kirichenko et al., 2023; Tajbakhsh et al., 2016; Guo et al., 2019). Our framework helps to explain when this practice works (Section 5.2.1). In the next few subsections, we show that this standard practice may be ill-advised: performance can depend strongly on exactly which parameters we freeze. 5.2.1. THE CHAIN PROBLEMS We sketch an example where freezing layers of a NN is correct for transfer learning see Appendix F for more details. We use the result in (Rolnick & Kording, 2020), where it was shown that if the Linear Region Assumption is satisfied, Re LU networks can be identified up to permutation and scaling. Example 5.2 (Freezing layers for neural networks). Modify the SCM in Example 5.1 so that fw i is a single Re LU layer with weights w i . Assume that in all environments Ee, the Linear Region Assumption (Rolnick & Kording, 2020) is satisfied and that X0 s distribution satisfies that the description set of we 100|0 consists of only equivalent neural networks up to permutation and scaling. is stacking layers: for i > j > k, wi|k = wi|j wj|k = (wi|j, wj|k). Therefore the generator w100|0 = (w 100, . . . , w 1). For any i, if w i is the same in all environments, W100|0 is the set of (w 100, . . . , w 1) up to permutation and scaling with the same w i in all environments. Thus, we can learn layer w i from the first environment and freeze it in later environments. When is Transfer Learning Possible? We show in App. F.2 that even with slight modifications from the above example, θ no longer corresponds to an entire layer, and therefore freezing a layer results in poor performance. In the next section, we present a more complex example to explore this phenomenon. 5.2.2. THE COMPLEX COLORED MNIST PROBLEM We now show when freezing an entire layer of the network leads to poor transfer, while freezing a subset of the layer leads to successful transfer. This example also demonstrates our approach where we treat transfer learning as a new problem separated from finding an invariant predictor which are often pursued in previous works (Zhang et al., 2020; Peters et al., 2015; Arjovsky et al., 2019). It also demonstrates a case of representation learning, where the distribution of the image X given its parents is assumed to be the same in all environments. Example 5.3 (Complex Colored MNIST). This binary classification problem is based on MNIST: instances are colorized images of digits, and the colors are spuriously correlated with labels. In the original dataset (Arjovsky et al., 2020), binary labels are based on the digit with probability 0.25 and flipped otherwise. In 2 environments, the image is colored with a color that correlated with the binary label with probability 0.1 and 0.9 respectively. We modify the environments to make it more difficult. In E1, the correlation between digit identity and label is 0.1, and the correlation between color and label is 0.2. In E2, the correlation is 1 between digit and label, and 0.4 between color and label. 
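Mechanically, the transfer strategies compared in this example (freezing an entire layer versus freezing only part of it) differ by a single flag per parameter in most deep learning frameworks. The PyTorch-style sketch below uses hypothetical layer sizes chosen to match the parameter counts quoted later in this subsection (flattened 3-channel 28 x 28 inputs and $n_p = 20$); it illustrates the mechanics only and is not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Hypothetical two-layer model standing in for the 2-layer implementation in Example 5.3.
model = nn.Sequential(
    nn.Linear(28 * 28 * 3, 20),   # first layer: linear coefficient (47040 weights) + bias (20)
    nn.Softmax(dim=-1),
    nn.Linear(20, 2),             # second layer
)

# Strategy A: freeze the entire first layer (weight and bias) learnt in Environment 1.
for p in model[0].parameters():
    p.requires_grad = False

# Strategy B: freeze only the first layer's linear coefficient; keep its bias trainable.
model[0].weight.requires_grad = False
model[0].bias.requires_grad = True

# Only parameters with requires_grad=True are updated when training on Environment 2.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
```

Which of these strategies is justified depends on which part of the network corresponds to an invariant piece of the SCM; the derivation below identifies that piece for the Complex Colored MNIST problem.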
We illustrate the trade-off between achievable accuracy and sample complexity when choosing different predictors (Section 3.1) using the original Colored MNIST problem. Predicting the label using the digit in Environment 1 allows zero-shot learning in Environment 2 with a maximum accuracy of 0.75. Predicting the label using the color in Environment 2 requires samples from Environment 2 but enables a maximum theoretical accuracy of 0.9. Therefore, predictor choice affects the achievable accuracy and sample complexity in the new environment. In the Complex Colored MNIST problem, the label distribution given the digit changes across environments, making previous methods (Zhang et al., 2020; Peters et al., 2015; Arjovsky et al., 2019) that identify invariant predictors for zero-shot learning ineffective.

Transfer Algorithm. Let $\upsilon(X)$ denote the flattened vector of $X$. We first derive the model for $\widehat{\mathcal{W}}^1_{Y|X}$. We assume that (1) the parents $P_X$ of the image $X$ are discrete random variables; (2) conditioning on $P_X = p$, $\upsilon(X)$ is in the exponential family,
$p(x \mid p) = \frac{1}{s}\, h\!\left(\frac{\upsilon(x)}{s}\right) g(\lambda_p) \exp\!\left(\frac{1}{s}\, \lambda_p^T \upsilon(x)\right),$
where the scale parameter $s$ does not depend on $p$ (Section 4 in Bishop, 2006); (3) $X$ is d-separated from $Y$ given $P_X$; and (4) the distribution $X|P_X$ is unchanged in all environments.⁹ Via Theorem 4.2 we compute $P(Y|X)$ (Appendix F.3):

$P(Y \mid X) = \frac{\sum_p \exp\!\left(\frac{1}{s}\lambda_p^T \upsilon(X) + \log g(\lambda_p) + \log P(p, Y)\right)}{\sum_{p, y} \exp\!\left(\frac{1}{s}\lambda_p^T \upsilon(X) + \log g(\lambda_p) + \log P(p, y)\right)}, \quad (3)$

where $\sum_p$ denotes the summation over all possible values $p$ of the parents of $X$. The terms involving $\lambda_p$, $g(\lambda_p)$ and $s$ are determined by $w^*_X$, while the $\log P(p, Y)$ and $\log P(p, y)$ terms are not. Eq. 3 can be implemented as a 1-layer network, but in this implementation, freezing a layer is the same as freezing the entire model. This would lead to poor performance because the label in Environment 2 is almost the opposite of that in Environment 1. So, to allow more flexibility in freezing and to give the competing model in which the entire layer is frozen a better chance, we derive a 2-layer implementation of Eq. 3. Rearranging the terms, we get the following 2-layer network: the first layer implements
$P(p \mid X) = \frac{\exp\!\left(\frac{1}{s}\lambda_p^T \upsilon(X) + \log g(\lambda_p) + \log P(p)\right)}{\sum_{p'} \exp\!\left(\frac{1}{s}\lambda_{p'}^T \upsilon(X) + \log g(\lambda_{p'}) + \log P(p')\right)},$
and the second layer implements $P(Y \mid X) = \sum_p P(p \mid X)\, P(Y \mid p)$.

To learn, we make two additional assumptions: (5) for any choice of the parameters in the constrained domain, the description set of any network in the constrained domain is the set of all equivalent weights in which the linear coefficients differ only by permutation (the biases could differ in other ways); and (6) the optima of the loss function are in the description set. Given these assumptions, our method recommends transferring by freezing the linear coefficient in the first layer: from (6), minimizing the loss will result in a description, and from (5), similar to the discussion in Section F.2, we can freeze the first layer's linear coefficient to transfer. This reflects the fact that the linear coefficient represents $P(X|P_X)$, which is the same in all environments. We compare to other transfer strategies that might seem reasonable without our derivations, such as freezing the entire first layer.

Transfer Feasibility of the Algorithm. Let the complexity measure comp be the number of parameters to be learned. Assuming that $n_p = 20$, without transfer learning, comp = 47040 + 20 + 40; with transfer learning, comp = 20 + 40. The details are provided in Appendix F.3.

Experiment. Fig.
5 compares transfer performance of several methods: learning the model in the new environment without transfer, freezing the first layer of the two-layer 9As an example, a detailed hypothetical model that satisfies our assumptions is given in Appendix F.3. When is Transfer Learning Possible? Figure 5. The Complex Colored MNIST problem, Ex. 5.3. (Left): Zero-shot transfer performance of model trained in Env. 1. Accuracy in Env. 2 is poor because the labels in Env. 1 and Env. 2 are very different. (Right) Performance of various transfer methods as they train on Env. 2. Transferring the first layer s linear coefficient results in the best performance. model, and freezing just the linear coefficient of the first layer. App. F.3 gives implementation details using Ray Tune (Liaw et al., 2018). We show that the strategy recommended by our derivation (freezing just the linear coefficient of the first layer) performs best. We also compare to a random transfer strategy (freezing the first layer s linear coefficient to a random value instead of a learned one) to provide a baseline. Note that when the sample size available in Env. 2 grows larger than in the plot (5000) it might be possible that the model without transfer might outperform the model that transfers layer 1 s coefficient due to limited samples in the first environment, failure of the model to find the true coefficient in the first environment or other factors. 5.3. Reinforcement Learning We now illustrate how our framework applies to reinforcement learning with a simple example. Suppose the dynamics P(st+1|st, at) remain constant across environments, and the expected reward is E[Rt|st, at] = w T ϕ(st, at), where ϕ is constant across environments and w varies (Barreto et al., 2017). Suppose that states and actions are discrete, so that the dynamics are a table of size ns nsna and ϕ is a table of size d nsna, where ns, na and d are the number of states, actions and dimension of ϕ. Observe that the Q function of any policy in any environment can be written as Qπ(s, a) = E[Ri|s, a] + γE[V (st+1|s, a)]. Considered as a vector of length nsna indexed by (s, a), the first term on the RHS is in the row space of the ϕ table, and the second term is in the row space of the dynamics table. So Qπ is in the span of these two row spaces. Suppose there are na = 4 actions and ns = 5 states, and ϕ has dimension d = 2. Then Qπ (of dimension 20) is in a subspace of dimension ns + d = 7. Once we evaluate enough policies in different environments, we can a basis for this subspace. In a new environment, we can constrain Q(s, a) to be in this subspace, potentially saving both data and computation: e.g., we can estimate the environment s w from observations (saving data compared to estimating the vector E[R(s, a)]), then run value iteration or Q iteration entirely in the subspace (saving data by not estimating the dynamics, and computation by working in the subspace). Barreto et al. (2017) compute an improved policy but not the optimal policy in the new environment. Brantley et al. (2021) compute the optimal policy, but do not address learning. 6. Conclusion Our paper provides a framework for transfer learning that can guide how to perform transfer and analyze if common heuristics such as freezing a layer may or may not succeed. Our framework can be used to model a wide range of transfer scenarios including those in supervised learning, imitation learning and reinforcement learning. 
It can also help users design models that enable transfer or select components to transfer. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. When is Transfer Learning Possible? Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization, 2019. URL https: //arxiv.org/abs/1907.02893. Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization, 2020. Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H., and Silver, D. Successor features for transfer in reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, pp. 4058 4068, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Beery, S., Van Horn, G., and Perona, P. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. Bereket, M. and Karaletsos, T. Modelling cellular perturbations with the sparse additive mechanism shift variational autoencoder. Advances in Neural Information Processing Systems, 36, 2024. Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 0387310738. Brantley, K., Mehri, S., and Gordon, G. J. Successor feature sets: Generalizing successor representations across policies. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11774 11781, 2021. Buesing, L., Weber, T., Zwols, Y., Heess, N., Racaniere, S., Guez, A., and Lespiau, J.-B. Woulda, coulda, shoulda: Counterfactually-guided policy search. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=BJG0vo C9YQ. Chen, Y. and B uhlmann, P. Domain adaptation under structural causal models. The Journal of Machine Learning Research, 22(1):11856 11935, 2021. Correa, J. and Bareinboim, E. A calculus for stochastic interventions:causal effect identification and surrogate experiments. Proceedings of the AAAI Conference on Artificial Intelligence, 34(06):10093 10100, Apr. 2020. doi: 10. 1609/aaai.v34i06.6567. URL https://ojs.aaai. org/index.php/AAAI/article/view/6567. Dietterich, T. G. Hierarchical reinforcement learning with the MAXQ value function decomposition, 1999. URL https://arxiv.org/abs/cs/9905014. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks, 2016. Gulrajani, I. and Lopez-Paz, D. In search of lost domain generalization. Co RR, abs/2007.01434, 2020. URL https://arxiv.org/abs/2007.01434. Guo, Y., Shi, H., Kumar, A., Grauman, K., Rosing, T., and Feris, R. Spottune: Transfer learning through adaptive fine-tuning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4800 4809, Los Alamitos, CA, USA, jun 2019. IEEE Computer Society. doi: 10.1109/CVPR.2019.00494. URL https://doi.ieeecomputersociety. org/10.1109/CVPR.2019.00494. Huang, B., Feng, F., Lu, C., Magliacane, S., and Zhang, K. Adarl: What, where, and how to adapt in transfer reinforcement learning, 2021. URL https://arxiv. org/abs/2107.02729. Jin, C., Liu, Q., and Miryoosefi, S. Bellman eluder dimension: New rich classes of rl problems, and sampleefficient algorithms, 2021. Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. 
Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99 134, 1998. Kirichenko, P., Izmailov, P., and Wilson, A. G. Last layer re-training is sufficient for robustness to spurious correlations. In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=Zb6c8A-Fghk. Lachapelle, S., Rodriguez, P., Sharma, Y., Everett, K. E., Le Priol, R., Lacoste, A., and Lacoste-Julien, S. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ica. In Conference on Causal Learning and Reasoning, pp. 428 484. PMLR, 2022. Li, X., Grandvalet, Y., and Davoine, F. Explicit inductive bias for transfer learning with convolutional networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2825 2834. PMLR, 10 15 Jul 2018. URL https://proceedings.mlr.press/v80/ li18a.html. Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J. E., and Stoica, I. Tune: A research platform for distributed model selection and training. ar Xiv preprint ar Xiv:1807.05118, 2018. When is Transfer Learning Possible? Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML 15, pp. 97 105. JMLR.org, 2015. Lopez, R., Tagasovska, N., Ra, S., Cho, K., Pritchard, J., and Regev, A. Learning causal representations of single cells via sparse mechanism shift modeling. In 2nd Conference on Causal Learning and Reasoning, 2023. URL https: //openreview.net/forum?id=IOWJs PJ2x Gd. Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. Causal transfer learning. Co RR, abs/1707.06422, 2017. URL http://arxiv. org/abs/1707.06422. Myung, S., Huh, I., Jang, W., Choe, J. M., Ryu, J., Kim, D., Kim, K.-E., and Jeong, C. PAC-net: A model pruning approach to inductive transfer learning. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 16240 16252. PMLR, 17 23 Jul 2022. URL https://proceedings.mlr. press/v162/myung22a.html. Pearl, J. and Bareinboim, E. Transportability of causal and statistical relations: A formal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, pp. 247 254, 2011. Pearl, J. et al. Models, reasoning and inference. Cambridge, UK: Cambridge University Press, 19(2), 2000. Peters, J., B uhlmann, P., and Meinshausen, N. Causal inference using invariant prediction: identification and confidence intervals, 2015. Puterman, M. L. Markov decision processes. Handbooks in operations research and management science, 2:331 434, 1990. Rolnick, D. and Kording, K. Reverse-engineering deep Re LU networks. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 8178 8187. PMLR, 13 18 Jul 2020. URL https://proceedings.mlr.press/ v119/rolnick20a.html. Saengkyongam, S., Thams, N., Peters, J., and Pfister, N. Invariant policy learning: A causal perspective. Co RR, abs/2106.00808, 2021. URL https://arxiv.org/ abs/2106.00808. Sch olkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. 
Toward causal representation learning. Proceedings of the IEEE, 109(5): 612 634, 2021. Shen, K., Jones, R., Kumar, A., Xie, S. M., Hao Chen, J. Z., Ma, T., and Liang, P. Connect, not collapse: Explaining contrastive learning for unsupervised domain adaptation, 2022. Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. Fixmatch: Simplifying semi-supervised learning with consistency and confidence, 2020. Tajbakhsh, N., Shin, J. Y., Gurudu, S. R., Hurst, R. T., Kendall, C. B., Gotway, M. B., and Liang, J. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35:1299 1312, 2016. URL https://api. semanticscholar.org/Corpus ID:32710. Vapnik, V., Levin, E., and Le Cun, Y. Measuring the vcdimension of a learning machine. Neural computation, 6 (5):851 876, 1994. Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips. cc/paper_files/paper/2014/file/ 375c71349b295fbe2dcdca9206f20a06-Paper. pdf. Zhang, A., Lyle, C., Sodhani, S., Filos, A., Kwiatkowska, M., Pineau, J., Gal, Y., and Precup, D. Invariant causal prediction for block MDPs. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 11214 11224. PMLR, 13 18 Jul 2020. URL https://proceedings.mlr. press/v119/zhang20t.html. When is Transfer Learning Possible? All possible parameter values W Figure 6. Constraint learning problem without noise. Each marker represents a possible environment parameter (a value of w), and each teal curve represents a possible constraint (a value of θ). Supplementary Material: When is Transfer Learning Possible? In Section A we describe an algorithm to learn the constraint using we s from previous environment. In Section B we discuss related works. In Section C we discuss future works. In Section D we discuss more details of the setting related to reinforcement learning. In Section E we provide details for Section 4 including the proof of Theorem 4.2 and more examples of computation trees including Q-Learning. In Section F we provide details for the case studies (Section 5) including the details of the Colored MNIST experiments. In Section G we discuss the limitation of our paper. A. Constraint Learning In this section we describe a simple method for learning constraints from examples. This is an important prerequisite for transfer learning: the tools we describe in the main paper can help us propagate constraints, so that we know how a constraint on one parameter of our model might imply a constraint on other parameters, but we have to combine this sort of propagation with a way to extract constraints from observed data. There are many ways that one might hope to improve on the constraint learning method we describe here. The exact method of constraint learning is not central to the current paper; all that is important here is that some learning algorithm proposes constraints that fit into our tools for transfer. So, our intent is not to be state-of-the-art, but merely to demonstrate that this kind of learning is possible. We leave it to future work to refine and optimize our learning method. 
We have two main goals in designing our constraint learning method. First, we d like to be agnostic to how the environments are generated: for example, we don t want to assume that they are sampled i.i.d. from a fixed distribution over the set of allowed environments. Second, we d like to maximize the benefits we see from transfer: we d like to learn helpful constraints that narrow down the set of parameters that we have to learn from, but we d like to avoid costly mistakes where incorrect constraints limit performance in new environments. We ll build up the constraint learning problem in several steps, introducing new issues at each step. The first version of our learning problem is illustrated in Figure 6. We are trying to learn some parameter w that takes a different (unknown) value wi in each new environment Ei. We assume a family of possible constraints that might apply to w: fθ(w) = 0, where θ is an (unknown) parameter vector. If we want to enforce multiple constraints on w, we can have f return a vector; in this case θ contains the parameters of all constraints, concatenated together. As we gain experience in a sequence of environments E1, E2, . . ., we can hope to learn a constraint that describes the feasible values of w1, w2, . . .. For example, after we see data from E1, we can learn an estimate of w1, which lets us narrow down the possible values of θ. In the figure, θ0 is inconsistent with w1, since w1 does not lie on the curve fθ0(w) = 0. On the other hand, θ1, θ2, and θ3 are all consistent with w1. When is Transfer Learning Possible? All possible parameter values W from w3 ℰ3 fθ3(w) = 0 Figure 7. Constraint learning problem with noise. As we see additional environments, we continue to narrow down the possible values of θ. If we next learn w2 by seeing data from environment E2, we can rule out θ1: the constraint fθ1(w) = 0 passes through w1 but not w2. If we then learn w3 by visiting environment E3, we rule out θ2, leaving fθ3(w) = 0 as the only one of the illustrated constraints that is still consistent. If we assume that θ3 is the correct constraint, then in E4 we can use this information to learn w4 faster: the set fθ3(w) = 0 has lower dimension than the full parameter space W, so we will need fewer samples to distinguish the value of w4 that is the best fit within the constraint. If we leave aside computational considerations, we can write a very simple algorithm that learns the correct constraint in our setup so far: we just keep track of which values of θ are still consistent after seeing w1, w2, . . .. When only a single value of θ remains consistent, we are done. Of course, this algorithm will not work very well in practice, since we have ignored an important aspect of constraint learning: due to estimation error, we will not get a perfect estimate of wi from any finite amount of data in Ei. So, the picture will look more like Figure 7: the true constraint fθ3(w) = 0 will not pass exactly through our estimates w1, w2, w3, but will only pass close to them. To handle estimation errors like this, we propose a simple no-regret algorithm. We start from a set Θ of possible constraint parameters. Before seeing each environment Ei, we propose a hypothesis θi Θ as described below. Then we find out wi, and we get a penalty Pi(θi) = fθi(wi) 2 This penalty is zero if and only if w satisfies the constraint fθ(wi) = 0. If we fail to satisfy the constraint, then Pi(θ) is strictly positive, with a larger value if we are farther from satisfying the constraint. 
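As a minimal instance of this idea, the sketch below runs online gradient descent (one of the no-regret learners discussed next) on the penalty $P_i(\theta) = \|f_{\theta}(w_i)\|^2$ over a stream of noisy per-environment estimates. The affine constraint family, the unit-norm projection used to rule out the trivial constraint $\theta = 0$, and all names are illustrative assumptions rather than the paper's choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical constraint family: f_theta(w) = theta . [w, -1], an affine constraint
# on the 2-d environment parameter w.  The unknown true constraint is
# w[0] + 2*w[1] = 3, i.e. theta proportional to (1, 2, 3).
true_theta = np.array([1.0, 2.0, 3.0])
true_theta /= np.linalg.norm(true_theta)

def penalty_grad(theta, w):
    """Gradient of the per-environment penalty P_i(theta) = f_theta(w_i)^2."""
    z = np.append(w, -1.0)
    return 2.0 * (theta @ z) * z

theta = rng.normal(size=3)
theta /= np.linalg.norm(theta)    # keep theta on the unit sphere to exclude theta = 0
lr = 0.05

for i in range(500):              # a stream of environments
    w0 = rng.normal()
    w = np.array([w0, (3.0 - w0) / 2.0])       # true parameter lies on the constraint
    w_hat = w + 0.01 * rng.normal(size=2)      # noisy estimate learnt in environment i
    theta -= lr * penalty_grad(theta, w_hat)   # online gradient step on P_i
    theta /= np.linalg.norm(theta)

print("learned theta (up to sign):", theta)
print("true theta:                ", true_theta)
```

The projection onto the unit sphere is one simple way to avoid collapsing onto a vacuous constraint; the discussion below of weak constraints and volume penalties addresses the same concern more generally.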
Our goal is to get a small total penalty P(θ) = P i Pi(θ). To accomplish this, we can choose θi using any no-regret algorithm that is, any algorithm where the regret regret = max θ Θ i Pi(θi) P(θ) grows sublinearly in T. Depending on the sort of guarantees we want, we can pick our no-regret algorithm one of two ways. The first possibility is a learner like online gradient descent. Depending on the form of fθ, OGD gives us different guarantees: if we have linear constraints, then Pi(θ) will be convex, and we can get convergence to the regret of the global optimum. On the other hand, for general fθ, the functions Pi(θ) may not be convex, in which case we might get stuck in local optima. Either way, OGD When is Transfer Learning Possible? True constraint Confidence set from C1 ℰ1 Confidence set All possible parameter values W Figure 8. Constraint learning problem with inequality constraints and confidence sets. is computationally cheap: the main cost is computing the gradient or subgradient of the penalty Pi(θ). The second possibility is to choose a no-regret algorithm that has strong guarantees in the face of non-convex learning problems. One such approach is to tile the space of constraints Θ with an ϵ-net, and run a no-regret method such as multiplicative weights to pick among all of the points in the net. The size of the ϵ-net will be exponential in the dimension of Θ, so this approach will typically only be computationally feasible for low-dimensional constraint sets.10 On the other hand, we will get a much stronger guarantee: for MW, after T environments, we will propose at least one parameter θi whose penalty is at most P(θi) P(θ) + O( T ln | Θ |) where θ is the best (post-hoc) parameter value in Θ. Therefore, the average penalty P(θi)/T for the best proposal will approach the optimal value P(θ)/T as T . No matter how we choose our no-regret learner, as our estimate θi improves, we become more and more likely to see a benefit from transfer. But early in the learning process we may see negative transfer: if we enforce a poorly-estimated constraint, it can reduce our performance at estimating wi in some environment Ei. To guard against negative transfer, we can calculate both the best constrained wi (the one that minimizes loss subject to fθi(wi) = 0) as well as the best unconstrained wi (the one that minimizes loss over the entire parameter space W). We can then choose between the constrained and unconstrained estimates: e.g., we can run an additional no-regret learner inside each new environment to predict wi nearly as well as the better of the two estimates. At this point we have a constraint learning method that can work well in practice. But, there are two additional wrinkles that can affect performance. So, for the final version of our constraint learning problem, we ll make two changes to our setup. First, instead of just equality constraints of the form fθ(w) = 0, we ll allow inequality constraints like fθ(w) 0. This makes it easier for us to represent a wider variety of constraint shapes. Second, we will let our learning algorithm give us more information: instead of just a point estimate wi, we will let it return a confidence set Ci, with the intention that Ci covers the correct value of wi with high probability. (We can still handle point estimates by treating them as singleton sets.) With these changes, the picture looks like Figure 8. The biggest change here is that there is a more complicated relationship between the confidence sets and the constraint set. 
First, there will likely be parameter values in Ci that don't satisfy the constraint, due to limited data in environment Ei. And second, there will likely be parameter values that satisfy the constraint but fall outside of Ci, since the constraint describes possible parameter values for all environments, not just Ei.

To handle this situation, we can make some simple changes to our no-regret algorithm. First, the penalty for proposing θi becomes the penalty for the best value of w that falls within Ci. For an equality constraint, this is

Pi(θ) = min_{w ∈ Ci} ||fθ(w)||^2.

Second, if some coordinates of fθ correspond to inequality constraints, we threshold so that negative values of fθ are not penalized. If we write σ for the thresholding function, so that

σj(z) = zj if zj ≥ 0 or coordinate j is an equality constraint, and σj(z) = 0 otherwise,

then the penalty is

Pi(θ) = min_{w ∈ Ci} ||σ(fθ(w))||^2.

Finally, inequality constraints require a bit of additional care: different inequalities can be stronger or weaker, and therefore more or less desirable to learn. If we ignore this issue, we will tend to learn very weak constraints: in the limit, the trivial constraint that allows all of W will get zero penalty no matter what Ci is. (This issue is also present with some kinds of equality constraints, since we can equivalently incorporate σ into fθ and treat every constraint as an equality; but essentially every family of inequality constraints needs to take account of this problem.) Ideally we would directly measure the strength of transfer that each constraint provides: tighter constraints will get higher Pi(θ) but have the potential for stronger transfer. Unfortunately, this sort of measurement could be expensive; so, we leave it to future work to explore this avenue. Instead, here we suggest modifying Pi(θ) by adding an extra penalty that measures the weakness of the constraint that corresponds to θ: e.g., we could use a penalty proportional to the volume covered by the constraint. The scaling of such a penalty adds a hyperparameter; we hope that future work might show a way to eliminate the need for such a hyperparameter, or at least tune it at low cost.

To summarize, our final learning method is to run a no-regret algorithm over the set Θ of possible constraint parameters. For each environment Ei, this outer-loop algorithm predicts a constraint θi. Given θi, we run two learners for wi: one that enforces the predicted constraint, and one that is unconstrained. We switch between the two estimates of wi using an inner-loop no-regret algorithm. We then use the unconstrained estimate of wi to find a confidence set Ci, and pass this confidence set back to our outer-loop no-regret algorithm. To calculate this algorithm's loss, we find the most favorable value of w within Ci: the value that minimizes the penalty ||σ(fθi(w))||^2. Optionally, if different values of θ correspond to tighter or looser constraints, we add an additional penalty to the outer algorithm's loss, in order to favor tighter constraints.

10 If there is structure that we can take advantage of in the set of constraints Θ, we might be able to get the best of both worlds: computational efficiency and strong guarantees. We leave this sort of potential extension to future work.

B. Related Works

There are a number of previous works that implicitly follow a special case of our setting and that approach our problem differently (e.g., with qualitatively different assumptions or goals). Huang et al. (2021) is a special case of our setting.
This work assumes that a small set of variables is intervened on and therefore represents the changes in different environments. They therefore freeze the rest of the model learnt from previous environments, and only learn the changes in the new environment. Li et al. (2018) and Myung et al. (2022) consider the case where there is limited data in the target domain but ample data in the source domain. Both papers assume we know what to transfer across environments, while our framework allows us to explicitly determine that from expert knowledge or the data. MAXQ (Dietterich, 1999) approaches transfer differently: it is concerned with both computational and statistical benefits. Their setting for statistical benefits is reminiscent of ours: they propose a way to constrain the parameters of a policy in a new environment by freezing part of a MAXQ hierarchy. This same strategy also provides computational benefits, since decision-theoretic planning can be extremely expensive if we have to start from scratch. Our setup could address this benefit if we had a practical measure of the computational complexity of RL in different hypothesis classes, but such a measure is a subject for future research.

Deep Learning Heuristics for Transfer. Several heuristics for learning representations for domain adaptation have been proposed in the literature, including adversarial training (Ganin et al., 2016), self-supervised techniques (Sohn et al., 2020), and contrastive learning (Shen et al., 2022). In contrast to these works, we focus on devising a framework via a causal perspective. However, it would be very interesting to apply our framework to these ideas.

Learning Invariant Features. Various works have explored the idea of learning invariant representations that use information available in all environments, with the hope that this will lead to better transfer to downstream tasks (Arjovsky et al., 2020; Peters et al., 2015; Saengkyongam et al., 2021; Zhang et al., 2020; Magliacane et al., 2017). In our framework, taking advantage of an invariance is similar to choosing predictors E such that W_{Q|E} contains only one point, thus allowing zero-shot learning. As we discuss in Section 3.1 and Section 5.3, we consider the world update problem, which is different from choosing predictors.

Causal Transfer Learning. Our work builds on prior studies of transfer learning from a causal modeling perspective, without assuming causal faithfulness or sufficiency. Unlike Chen & Bühlmann (2021), who assume a linear SCM, our framework makes no such assumption. Pearl & Bareinboim (2011) use selection variables S to model environmental differences, a special case of our framework. They allow the selected distribution of a variable given its parents to be arbitrary, placing no constraint on it. Our framework provides more details about when transfer learning is possible. Consider the causal graph among variables X, Y, Z, S with edges Z → X, Z → Y, X → Y and S → Y. This example is a case of Figure 4(b) in Pearl & Bareinboim (2011), where it was determined that transfer learning of P(Y | do(X)) is not possible (Example 6 in Pearl & Bareinboim (2011)). Variable S here could be interpreted as determining the selection of the distribution of Y given its parents X and Z in different environments. However, if P(Y | X, Z) were selected from a constrained set, transfer learning of P(Y | do(X)) might still be feasible, which can be analyzed in an extension of our framework.

C. Future Works

We describe a more general approach to learn W_{Q|E}.
We assume that the agent might have some expert knowledge about some W_{I|K} even though I or K might not be observed. Thus, the agent wants to learn W_{Q|E} from the set {W_{I|K}} for some I, K where it has some knowledge about W_{I|K}. In this paper we only consider a special case where the agent has knowledge about some atomic domains {W*_i}, and using Algorithm 1 we express W_{Q|E} as a formula of the W*_i. In the general case of learning W_{Q|E} from the set {W_{I|K}}, the problem can be divided into two steps:

1. Choosing the terms. Choose the sets of (I, K)'s whose W_{I|K}'s will be used to compute W_{Q|E}.

2. Calculation. Compute W_{Q|E} from the set of chosen W_{I|K}'s.

While it might seem intuitive to always choose the W*_i's to compute W_{Q|E}, we show an example where this is not the best choice. Consider the graph X → Y → Z where both Y | X and Z | Y are poorly conditioned, so estimating them yields higher errors, while Z | X is the identity. If the goal is to compute Z | X, it is therefore better to compute it directly rather than from Z | Y and Y | X.

Once the terms w_{I|K} are chosen, there are different possible approaches to calculate the target w_{Q|E}. One approach is to break both the w_{I|K}'s and w_{Q|E} into formulae of the atomic terms w*_i using Theorem E.3, then to reason about how the constraints on the W_{I|K}'s imply constraints on the atomic W*_i, and in turn how the constraints on the atomic W*_i imply constraints on W_{Q|E}, converting the problem to the special case we already addressed in this paper. Another approach is to directly express w_{Q|E} as a formula of the chosen terms w_{I|K}. We leave this extension to future work.

Model Selection. If the agent has no expert knowledge with which to select the hypothesis class of the function gθ that describes W_{I|K}, or if the expert knowledge is not specific enough, selecting the hypothesis class of gθ from the expert knowledge can result in a large over-estimate of W_{I|K}. In this case the agent can employ model selection strategies to select the hypothesis class from a nested set of hypothesis classes using the incoming data w^e. If the agent first assumes that w^e lies in a large set but later notices that all incoming w^e's lie in a small subset, the agent can narrow down the hypothesis class. The expert knowledge, if it exists, places an upper bound on the model selection strategy. For example, suppose that w_{100|0} = w*_{100} · · · w*_1 is unchanged in all environments, but the agent's expert knowledge is only that w*_1 is unchanged in all environments, from which it infers that the w^e_{100|0} have the same row space in all environments. Then the agent can search through a nested hypothesis class for W_{100|0}, with the largest class being the set of matrices sharing a common (unknown) row space and the smallest class being a single unknown matrix. After observing that all of the incoming data w^1_{100|0}, . . . , w^e_{100|0} are the same matrix, the agent can pick the smallest hypothesis class and learn the value of the unknown matrix. If the agent has no expert knowledge at all, it has to search through a nested hypothesis class with the largest class being the set of all matrices and the smallest class being the set of a single matrix. We leave this extension to future work.

D. Details about Section 3

The SCM and the ordering λ generate a causal graph. Let T be the time horizon. Let λ be an ordering of the n variables X. At time t = 0, each variable is initialized with an initial value X^0. At time t, at step i, 1 ≤ i ≤ n, the SCM generates the variable X^t_{λ(i)}.
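Before introducing the formal parent notation, here is a minimal sketch (not from the paper) of this unrolling for a two-variable SCM; the linear mechanisms, noise scale, and parameter values are made up for illustration, and the ordering λ determines whether X2 sees the current or the previous value of X1.

```python
import numpy as np

# Minimal sketch (not from the paper) of unrolling an SCM over time according
# to an ordering lambda. The two-variable SCM and its linear mechanisms are
# made up for illustration: X1 = w1 + U1 and X2 = w2 * X1 + U2, with the
# atomic parameters w*_i held fixed across time steps.
rng = np.random.default_rng(0)

w = {1: 0.5, 2: 2.0}       # atomic parameters w*_1, w*_2
order = (2, 1)             # the ordering lambda; compare with order = (1, 2)
T = 3

# values[i] always holds the most recently generated value of X_i, so a
# variable generated earlier in the same round serves as a time-t parent,
# while a variable generated later is still at its time-(t-1) value.
values = {1: 0.0, 2: 0.0}  # initial values X^0
history = [dict(values)]

def mechanism(i, vals, noise):
    if i == 1:
        return w[1] + noise           # X1 depends only on its noise
    return w[2] * vals[1] + noise     # X2 depends on the current value of X1

for t in range(1, T + 1):
    for i in order:                   # generate X^t_{lambda(1)}, X^t_{lambda(2)}, ...
        values[i] = mechanism(i, values, rng.normal(scale=0.1))
    history.append(dict(values))

print(history)
```

Under the ordering (1, 2), X2 at time t uses the freshly generated X1 at time t; under (2, 1), it uses the value from time t - 1, which is exactly the distinction the parent sets below make precise.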
Let P t 1,t i Xt 1 Xt denote the set of parents of Xt i. Then we have: Xt i = fwt i (P t 1,t i , U t i ) where wt i = w i . For examples, consider the following SCM: X1 = fw 1(U1) (4) X2 = fw 2(X1, U2) (5) If the ordering λ = (1, 2), then the parents of X1 2 is P t 1,t 2 = {X1 1} and X1 2 = fw1 2 (X0 1, U 1 2 ) since X1 1 is generated before X1 2. If the ordering λ = (2, 1), then the parents of X1 2 is P t 1,t 2 = {X0 1} and X1 2 = fw1 2 (X0 1, U 1 2 ) since X0 1 is generated before and X1 1 is generated after X1 2. The SCM and the ordering λ induces a causal graph. The causal graph is defined such that there are directed edges from each node in the set of parents P t 1,t i of Xt i to Xt i (even when W i contains the value of w i such that Xt i does not depend on its parents). We provide an example of the graph for Q Learning in Figure 10b. The causal graph induces a partial order a partial order on the variables Xt i s: for any A, B {Xt i, 1 t T, 1 i n}, A B if there exists a directed path from A to B. To simplify the notations, in the main text we remove the superscript t by re-indexing the variables according to such that Xi Xj if i j. E. Details about Section 4 E.1. Computation Trees In this section we provide the proof of Theorem 4.2, which is restated by Theorem E.3. d-separation between variables (denotes d) and the partial order is defined on the causal graph (Section D. For any 2 sets X and Y, X Y if X Y X X, Y Y. For any two variables X = Y , X is called an ancestor of Y and Y a descendant of X if there is a directed path from X to Y in the causal graph. Let the ancestor set A(I) of I be the set of all ancestors of variables in I, excluding variables in I. Let A (I) := A(I) I. The parent set of I is the set of all parents of variables in I, excluding variables in I. A set J is called a sufficient ancestor set of a set I if J A(I), J I and I d A(J) | J. The empty set is considered a sufficient ancestor set of any other set. Conditioning on a sufficient ancestor set J of I, we can compute I from the atomic parameters without needing the ancestors of J (Theorem E.3). Given I, K where I K = , a set J is called a sufficient ancestor set of I | K if J K and J is the sufficient ancestor set of I (K\J). We show first that there is a maximal sufficient ancestor set of I | K. Lemma E.1. There exists a maximal sufficient ancestor set J of I | K such that J is a sufficient ancestor set of I | K and J contains any other sufficient ancestor set of I | K. Proof. We will show that if J1 and J2 are sufficient ancestor sets of I | K then J := J1 J2 is also a sufficient ancestor set of I | K. Since J1 K, J2 K, we have J K. By definition, I K\Ji is d-separated from A(Ji) given Ji for i = 1, 2. Therefore I K\J is d-separated from A(Ji) given J for i = 1, 2 by weak union by moving J1\J2 from I K\J2 to J2 (or J2\J1 from I K\J1 to J1). Therefore I K\J is d-separated from A(J1) A(J2) given J by composition. Since A(J) = A(J1 J2) A(J1) A(J2), I K\J is d-separated from A(J) given J by decomposition. By definition, J is a sufficient ancestor set of I | K. When is Transfer Learning Possible? Therefore the scope of I | K is defined by computing the scope of I (K\J) | J where J K is the maximal sufficient ancestor set of I | K. Let K be an independent set of I | K if K K and I d K | K\K . There exists a set K such that K is an independent set of I | K and K contains any other independent set of I | K, called the maximal independent set of I | K. 
Such a maximal set exists because of the intersection property of d-separation: If A d B | C D and A d C | B D then A d B C | D. Suppose there exist K1 K and K2 K such that I d K1 | K\K1 and I d K2 | K\K2. Then I d K2\K1 | K\(K1 K2) by weak union. Definition E.2 (Scope of w I|K). Let I and K be 2 disjoint sets. Let K K be the maximal independent set of I | K. Let J K\K be the maximal sufficient ancestor set of I | K\K . Let I := I (K\K )\J. Then Scope(I | K) := A (I )\A (J). We show that w Q|E could be computed from variables in Scope(Q | E) using the operations , , and , where wt s s appear in reverse topological order. Let w Scope(Q|E) := {wt i | (i, t) Scope(Q | E)} and W Scope(Q|E) be the corresponding constrained domain of w Scope(Q|E). Theorem E.3 (Top-down computation tree). Given Q E = , we initiate the formula with w Q|E. For each term of the form w I|K in the formula, the agent repeatedly performing the following steps. First it removes the maximal independent set K of I | K from K. To make the notations simple, we re-use K afterwards to denote the set K\K after removing K . The agent then chooses one of the following computations. Choose J I, s.t. J I\J and J K\J is a sufficient ancestor set of I\J where J is the set of variables d-separated from I\J given (J K)\J . Then calculate the product: w I|K = w I\J|(J K)\J w J|K. (6) Choose J Scope(I | K) such that I J = , K J = and sum out J from the joint distribution: w I|K = Jw I,J|K. (7) Let J K be the maximal sufficient ancestor set of I | K. If J = K, the learner then compute I | K from the joint I, K\J | J by conditioning on K\J in addition to J: w I|K = K\Jw I,K\J|J (8) These options are executed until no such non-empty J satisfying either condition exists. All variables in the formula are wt i Scope(Q | E) in reverse topological order and each wt i appear at most once in the formula. We denote the resulting formula g Q|E(w Scope(Q|E)), also called a computation tree11 of w Q|E. Proof. We show that removing K K from K if I d K | K\K does not expand the scope, i.e. Scope(I | K\K ) Scope(I | K). Let P K and P K\K be the maximal sufficient ancestor set of I | K and I | K\K . Since I (K\P) d A(P) | P, we have I (K\K \P) d A(P) | P by decomposition. Therefore P is a sufficient ancestor set of I | K\K , and P P . Then we have A (I (K\K )\P )\A (P ) | {z } Scope(I|K\K ) A (I K\P)\A (P) | {z } Scope(I|K) because I ((K\K )\P ) I (K\P) and P P. We first show that if the agent terminates, then all the terms are in the form of wt i in reverse topological order. Since the agent cannot execute the operations, the term w I|K satisfies the condition that K is a the maximal sufficient ancestor set of I. Therefore Scope(I | K) = A (I)\A (K). Since the agent cannot execute , A (I)\A (K) = I, and therefore the parent set of I is a subset of A (K). Since K is a sufficient ancestor set of I and therefore I d A(K) | K, the parent set of I is a subset of K. If I has more than 1 element, the agent can choose J to be the smallest element topologically to execute the operation because J K will contain the parent set of I\J and therefore J K will d-separated I\J from A(J K). Therefore I has only one element denoted (i, t) and K contains all of its parents, and w I|K can be substituted by wt i . Since the operation ensures that J I\J, wt i comes before wt j if (j, t ) (i, t), and therefore the terms are in reverse topological order. 11There can be multiple computation trees depending on the choice the learner made. When is Transfer Learning Possible? 
We show now that for all operations , and the scopes of the terms on the right-handed side are subset of the scope of the term on the left-hand side, and therefore all of the terms wt i s in the formula are in the scope of w Q|E. We also show that in the operation the scopes of the 2 terms in the right-handed side are disjoint, and therefore each term wt i appears at most once. First, consider the case when w I|K = w I\J|J,K w J|K. Let P K denote the maximal sufficient ancestor set of I | K and P K denote the maximal sufficient ancestor set of J | K. Since J (K\P) d A(P) | P by decomposition, P is a sufficient ancestor set of J | K. Therefore P P . We have A (J (K\P ))\A (P ) | {z } Scope(J|K) A (I (K\P))\A (P) | {z } Scope(I|K) We will now show that A (I\J)\A (J K\J ) | {z } Scope(I\J|J K\J ) A (I\J)\A (J K). We will show that for any mutually disjoint sets X, Y, Z, if X A(Y) and Y d Z | X then A (Y)\A (X) A (Y)\A (X Z). To do so, we show that any node N A (Y)\A (X) is not in A (X Z). Since A (X) A (Y) (because X A(Y)) and N A (Y)\A (X), we have N / A (X). Suppose N is in A (X Z). Then there must exists Z Z such that N A (Z ) and there is a directed path from N to Z not going through X. Since N A (Y)\A (X), there is a directed path from N to Y not going through X. Therefore conditioning on X, there is a Bayes ball path from Y to Z through N. This is a contradiction, therefore N / A (X Z). We have A (I\J)\A (J K) | {z } A superset of Scope(I\J|J K\J ) A (I (K\P))\A (P) | {z } Scope(I|K) because I\J I K\P and J K P, and therefore Scope(I\J | J K\J ) Scope(I | K). We also have Scope(I\J | J K\J ) Scope(J | K) = because A (J K), a superset of the second scope, is removed from the first scope. Therefore all leaves appear at most once in the formula. Second, consider the case when w I|K = Jw I,J|K. Let P denote the maximal sufficient ancestor set of I | K. We will first show that P is a sufficient ancestor set of I, J | K. Let I := I (K\P). By the definition of sufficient ancestor set, I is d-separated from A(P) given P. Since J A (I )\A (P), there are directed paths from J to I not through P. Suppose there are Bayes ball s paths from A(P) to J given P. Then there are Bayes ball s paths from A(P) to I given P by combining the Bayes ball s paths from A(P) to J given P and the directed path from J to I not through P. Therefore A(P) is not d-separated from I given P, a contradiction. Therefore there is no Bayes ball path from A(P) to J given P, so J is d-separated from A(P) given P. Therefore I J = I J (K\P) is also d-separated from A(P) given P, which implies P is a sufficient ancestor set of I, J | K. Let P be the maximal sufficient ancestor set of I, J | K, then we have P P . Therefore A (I J (K\P )) A (I J (K\P)) = A (I J). Since J A (I ), A (I J) = A (I ). Therefore A (I J (K\P )) A (I ), and we have A (I J (K\P ))\A (P ) | {z } Scope(I,J|K) A (I ))\A (P) | {z } Scope(I|K) Finally, when w I|K = K\Jw I,K\J|J, the scope of the term on the right-handed side is the scope of the term on the left-handed side by definition. Therefore all of the terms wt i s in the formula are in the scope of w Q|E, and each term appear at most once. E.2. 
Comparisons with Variable Elimination There are several differences: (1) Variable elimination computes the tree from the bottom-up by starting with the leaves w i s and choosing several nodes to be combine to create their parent; (2) Variable elimination does not explicitly describe the intermediate nodes; (3) Variable elimination does not describe a smaller scope and (4) In variable elimination, each leaf goes through at least 2 operations: at least 1 to build the joint distribution and the conditional operation while in our method it is possible for a leaf to go through only one operation (w 3 in Figure 4), which could make it easier to analyze if w 3 is the variable of interest. E.3. Details about Examples in Section 4.1 We provide another example of causal graph, scope and computation tree in Figure 9, Example E.1 and Example E.2. When is Transfer Learning Possible? (a) Causal graph with the scope of w3|0 in yellow. w 3 w 1 w 2 (b) A computation tree for w3|0. Figure 9. Causal graphs and their computation trees. For clarity, we do not draw the uniform noise variables. Example E.1 (Q Learning). Let At, St, Rt, and Gt denote the action, state, reward, and return at time t. The Q(s, a) function is the expected return at time t given the state at t = 0 parameterized by w Gt|S0,A0. The SCM for Q-Learning is the following: A = fw A(UA) (9) S = fw S(A, S, US) (10) R = fw R(A, S, UR) (11) G = fw G(R, G) (12) and the ordering to generate the variables λ = (S, A, R, G). At, St, Rt and Gt are interpreted as the action, state, reward and return at time t. The Q(s, a) function is the expected return at time Gt given S0 and A0. The causal graph is shown in Figure 10a and the below computation tree for w G2|A0,S0 is shown in Figure 10b. For any variable Xi, since wt i = w i for any t, we can compress all nodes wt i into a single node w i . w G2|S0,A0 = w G2|G1,R1,A1,S1 h w G1,R1,A1,S1|G0,R0,A0,S0 w G0,R0|S0,A0) i (13) = (w G (w R (w S w A))) h (w G (w R (w S w A))) (w G w R) i (14) Example E.2 (Figure 3). Scope(7 | 6, 3) = {3, 4, 5, 6, 7}. We show below a computation tree with variables in the scope: w7|6,2 = 6w7,6|2 (15) = 6(w7,6|5,4,3,2 w5,4,3|2) (16) = 6((w 7 w 6) (w 5 w 4 w 3)) (17) E.4. Details about Section 4.3 If each atomic parameter w i appear at most once after the substitution of wt i by w i as in the case of supervised learning, the formula will be simple, and constrained domain can be computed iteratively: if w Y |X = (w 3 w 2) w 1 where denote some operations, then we can first compute the constrained domain W32 of w 3 w 2, and then compute WY |X = W32 W 1 where W32 W 1 is define to be the set {u v | u W32, v W 1} since W32 are independently chosen according to the Independent Causal Mechanism. If w Y |X = (w 1 w 2) w 1 where w 1 appears twice, we cannot compute WY |X = W12 W 1 because W12 and W1 are not chosen independently. Instead we need to define WY |X = {u v u | u W 1, v W 2}. F. Details about Section 5 F.1. Details about Example 5.1 Sparse mechanism shifts do not imply transfer feasibility. Consider the first scenario where 2 i 99, w i is the same matrix satisfying Π99 i=2w i has rank 2 in all environments, while w 1 and w 100 are re-selected from all possible 2 2 matrices in each new environment. Since only w 1 and w 100 change in different environments, the changes are sparse. We show below that the constraints on w 2, . . . , w 99 do not result in any constraint on we 100|0 by showing the following property of . Lemma F.1. 
Let Mm n denote the set of all matrices with dimension m n. Let w be a 2 2 matrix of rank 2. Then: {Aw B | A M2 2, B M2 2} = M2 2. When is Transfer Learning Possible? A0 S0 A1 S1 A2 S2 (a) The variables created by the SCM of Example E.1 described in Eq. 9 and the ordering λ = (S, A, R, G). Gt is the return at time t. For clarity, we do not draw the uniform noise variables. w G w R w A w S (b) A computation graph of w G2|A0,S0 for Example E.1. For all variables X, we compress all nodes wt X into a single node w X. Figure 10. The causal graph and a computation tree for Q-Learning in Example E.1. Since w 99 w 2 has rank 2, {Aw 99 w 2B | A M2 2, B M2 2} = M2 2. The constraint that w 99, . . . , w 2 is the same matrix in all environments does not enforce any constraint on we 100|0. Transfer feasibility does not imply sparse mechanism shifts. Consider the second scenario in Example 5.1 where in each environment, 2 i 100, w i is selected from the set of all 2 2 matrices while w 1 is the same matrix of rank 1. Since w 1 is the right-most term in the computation graph, we analyze the left-propagatable property for . Matrix multiplication is F-left-propagatable where F is the set of all matrices with rows in the row space of the right term w 1. Therefore, we 100|0 is in the set of all matrices with rows in the row space of w 1. Since w 1 has rank 1, θ in Algorithm 1 is the 1 2 basis vector of the row space, which can be learnt from w1 100|1 of Environment 1. Transfer feasibility of the algorithm. Without transfer learning, comp = 4 since the matrix w100| is 2 2. With our transfer learning procedure, the learner needs to learn a 2 1 vector to compute w100|0 after knowing the 1 2 basis θ of its row space, therefore comp = 2. Representation Learning. The distribution of X1 | X0 where X0 is the predictor could be consider the representation. In the second scenario, the representation is the same across the environments, and transfer feasibility is analyzed by the left-propagatable property of the operation since the representation is the right-most term. Proof of Lemma F.1. Given w M2 2 of rank 2, we will show that for any C M2 2, there exist A, B M2 2 such that: Aw B = C. (18) Let w = UΣV T and C = U Σ V T be the SVD decompositions of w and C. Since rank(C) rank(w), there exists a When is Transfer Learning Possible? diagonal matrix D such that Σ = ΣD. Let A = U U T and B = V DV T . Then: Aw B = (U U T )(UΣV T )(V DV T ) (19) = U ΣDV T (20) = U Σ V T (21) F.2. Details about Example 5.2 We derive a computation graph: w100|0 = w 100 w 99 w 1 (23) where the computations are performed from left to right. For neural network. is stacking layer: for i > j > k, the generator wi|k = wi|j wj|k = (wi|j, wj|k). Therefore w100|0 = (w 100, , w 1) (Rolnick & Kording, 2020) show that under the Linear Regions Assumption, except for a measure-zero set of networks, Re LU networks can be recovered up to permutations and rescaling if the domain of the input is the entire multidimensional real space. The description set w can be obtained from w by replacing w(.i) k , w(i.) k+1 by cw(.i) k , 1 cw(i.) k+1 and replacing wk, wk+1 by a permutation σ of the rows of wk and the columns of wk+1. Under the Linear Region Assumption, in all environments, from Eq. 23, the description set w100|0 = {((w 100, , w 1), σ, c) | σ Σ100|0, c C100|0} where Σ100|0 and C100|0 are sets of permutations and scaling applying to (w 100, , w 1) to generate the descriptions. 
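As a quick numerical check of this permutation-and-rescaling invariance, the following sketch (not from the paper; the layer sizes, weights, and input are made up) verifies that permuting hidden units and rescaling them by positive constants leaves a two-layer ReLU network's input-output map unchanged, so both weight settings are descriptions of the same w_{2|0}.

```python
import numpy as np

# Minimal numerical check (not from the paper): permuting hidden units and
# rescaling them by c > 0 preserves the function of a two-layer ReLU network.
# The sizes, weights, and input below are made up for illustration.
rng = np.random.default_rng(0)

w1 = rng.standard_normal((4, 3))   # first layer: 3 inputs -> 4 hidden units
w2 = rng.standard_normal((2, 4))   # second layer: 4 hidden units -> 2 outputs

def net(w1, w2, x):
    return w2 @ np.maximum(w1 @ x, 0.0)

perm = np.array([2, 0, 3, 1])       # a permutation of the hidden units
c = np.array([0.5, 2.0, 3.0, 0.1])  # positive per-unit rescaling factors

# Scale row i of w1 by c[i], column i of w2 by 1/c[i], then permute the units.
w1_alt = (c[:, None] * w1)[perm]
w2_alt = (w2 / c[None, :])[:, perm]

x = rng.standard_normal(3)
print(np.allclose(net(w1, w2, x), net(w1_alt, w2_alt, x)))  # True
```

Any learner that freezes weights across environments has to account for this equivalence class of descriptions, which is the point of the discussion that follows.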
For any i, 1 i 100, if w i is the same in all environments, then for any description u = (u 1, , u 100) of w100|0, u i can be obtained from w i by permutation or rescaling, and therefore is layer i of a description of wk 100|0 in another environment Ek. Therefore we can learn layer u i of a description from the first environment and freeze it in later environments. We do not use the exact algorithm described at the end of Section 4.4 (when Assumption 3.4 does not hold) because this discussion is shorter, but in both the algorithm in Section 4.4 and this discussion, we still need to learn or infer all possible descriptions from previous environments in order to transfer correctly. We now show below that even with slight modifications from the above example, θ no longer corresponds to an entire layer, and therefore freezing a layer learnt from previous environments lead to sub-optimal performance. Example F.1. Consider the following SCM: X0, X1, X2 R2 and Xi = fw i (Xi 1) + ϵi, i {1, 2} where ϵi s are independent noise satisfying E[ϵi] = 0 and fw i is a 1-layer Re LU neural network where w i denotes the weight. In all environments, w 2 = 1 0 0 1 is the same and w 1 can be selected from all 2 2 matrices. Suppose the leaner wants to transfer knowledge of w2|0, which is the weight of a 2-layer Re LU network, from Environment 1 to Environment 2. Consider the following cases: 1. In environment E1, w 1 1 = 1 0 0 0 and X0 R2 is a multivariate Gaussian on R2. 2. In environment E1, w 1 1 = 1 0 0 1 and X0 R2 is a multivariate Gaussian on the line {[1, 1]T c|c R}. In Case 1, in Environment 1, the following 2-layer weights are both descriptions of w2|0 = (w 1, w 2) : 1 0 0 0 , 1 0 0 1 and 1 0 0 0 , 1 0 0 6 . In Case 2, in Environment 1, the following 2-layer weights are both descriptions of w2|0 = (w 2, w 1) on the domain of X0: 1 0 0 1 , 1 0 0 1 and 1 0 0 1 , 0.3 0.7 0.5 0.5 . Therefore in Case 1 and Case 2 the When is Transfer Learning Possible? agent cannot freeze w 2 learnt from Environment 1 because it may freeze an incorrect value such as 1 0 0 6 in Case 1 and 0.3 0.7 0.5 0.5 in Case 2. This is because in Case 1 and Case 2, the constrained domain that can be learnt from Environment 1 is no longer the set of multi-layer networks with a common layer w 2. Instead, the agent needs to learn the set of all possible descriptions from Environment 1. F.3. Details about Section 5.2.2 In this section we provide details for the case study of the Colored MNIST problem in Section 5.2.2. F.3.1. DERIVATION OF THE ARCHITECTURE We apply Theorem E.3 to derive the following architecture: P(Y | X) = P(X, Y ) P y P(y, X) (24) p P(X, p, Y ) P p P(X, p, y) (25) p P(X | p)P(p, Y ) P p P(X | p)P(p, y) (26) sλT p υ(X) + log g(λp) + log P(p, Y ) sλTp υ(X) + log g(λp) + log P(p, y) (27) F.3.2. DATA GENERATION Dataset and the pre-processing code to generate multiple environments of colored images from the original MNIST dataset from Gulrajani & Lopez-Paz (2020)12 were used with image size s = 28 and number of color channels k = 2. We add one channel to make k = 3. Environment 1 and 2 has 60000 and 10000 samples respectively. We divide Environments 1 to train, val and test set with ratio 0.8, 0.1, 0.1. In Environment 2, we use 5000 samples for training and validation and 5000 samples for testing. Among the 5000 samples for training and validation, we gradually take the first 100, 200, 500, 1000, 2000 and 5000 samples as the number of samples available to the agent. 
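For reference, here is a minimal sketch (not the paper's preprocessing code) of the environment splits just described, using index arrays in place of the actual colored images; shuffling Environment 1 before splitting is an assumption, and the 7:3 train/val split applied to each available sample size is the one described next.

```python
import numpy as np

# Minimal sketch (not the paper's preprocessing code) of the environment
# splits described in this subsection. Index arrays stand in for the actual
# colored-MNIST samples; shuffling Environment 1 before splitting is an
# assumption made here for illustration.
rng = np.random.default_rng(0)

env1 = np.arange(60_000)                    # Environment 1 sample indices
env2 = np.arange(10_000)                    # Environment 2 sample indices

# Environment 1: train / val / test with ratio 0.8 / 0.1 / 0.1.
env1 = rng.permutation(env1)
env1_train, env1_val, env1_test = np.split(env1, [48_000, 54_000])

# Environment 2: 5000 samples for training and validation, 5000 for testing.
env2_trainval, env2_test = env2[:5_000], env2[5_000:]

for n_available in (100, 200, 500, 1000, 2000, 5000):
    available = env2_trainval[:n_available]  # take the first n samples
    n_train = int(0.7 * n_available)         # 7:3 train/val split
    train, val = available[:n_train], available[n_train:]
    print(n_available, len(train), len(val), len(env2_test))
```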
For each available sample size 100, 200, 500, 1000, 2000 and 5000, we divide the train and val sets with ratio 7 : 3. F.3.3. TRANSFER FEASIBILITY OF THE ALGORITHM To learn c W1 Y |X, we need to learn the linear coefficient θ of υ(x). There are np coefficients θ Rn n k for each value of p. There are ny np biases biasy,p R for each value of y and p. Since we do not know np, we use np = 20 as an upper bound. The number of cells for θ is np n n k = 20 28 28 3 = 47040. The number of cells for the bias of the first layer is np = 20. The number of cells of the weight of the second layer is np ny = 20 2 = 40. Let the complexity measure comp be the dimension of the parameters needed to learn. Then, without transfer learning, comp = 47040 + 20 + 40; with transfer learning, comp = 20 + 40. F.3.4. IMPLEMENTATION In all of our training procedures, we use stochastic gradient descent and the negative log likelihood loss. We repeat for 10 times the following procedure and plot the average and the standard deviation values in the plot. First we create Environments 1 and 2 as described above. We then train a model on the train set of Environment 1 and plot its accuracy the test set of Environment 1 and Environment 2 in Figure 5. Next we perform transfer from Environment 1 to Environment 2 in Figure 5. Each line is an average over 10 runs with the shaded region denoted the standard deviation. We plot the accuracy on the test set of Env. 2. The weight in the second layer is P(y | p) and therefore needs to satisfy P y P(y | p) = 1. We satisfy this requirement by placing a softmax function on the weight. 12URL: https://github.com/facebookresearch/Domain Bed/blob/main/domainbed/datasets.py When is Transfer Learning Possible? The linear coefficients of the first layer and second are initialized uniformly from (0, 1). The bias of the first layer is initialized uniformly from ( k) where k is the number of incoming features. F.3.5. HYPERPARAMETERS TUNING We use Ray Tune to tune the hyperparameters. We sample the parameters from the configuration range in the table below, run the training process until the number of epochs is 100 or the standard deviation of the validation loss of the last 10 epoch is at most 0.01. We select the hyperparameter with the lowest validation loss, and retrain the model with the selected hyperparameter on the combined train and validation set for 100 epochs. We use the retrain model to predict the test set and to transfer. Due to limited computation resource, we use less resource to tune our method by using a smaller hyperparameter range (and therefore a smaller number of sampled hyperparameters) and use more resource to tune competing methods by using a larger range and more samples. Train in Env. 1 Train in Env.3 (No Transfer) Transfer Coefficient Transfer Layer, Transfer Random Coefficient Learning rate loguniform(0.01, 5) Momentum 0.9 Weight decay 0.0001 loguniform(0.00001, 0.01) 0.0001 loguniform(0.00001, 0.01) np 20 20 or 50 uniformly Batch size 512 Number of sampled hyperparameters F.3.6. PLOT LEGEND Without transfer. The model is train from scratch in Env. 2 the available samples 100, 200, 500, 1000, 2000, 5000 successively. Transfer coefficient. We freeze the first layer s coefficient learnt from Env. 1 and retrain the first layer s bias and the second layer in Env. 2 with the available samples 0, 100, 200, 500, 1000, 2000, 5000 successively. Transfer layer. We freeze the first layer s coefficient and bias learnt from Env. 1 and retrain the second layer in Env. 
2 with the available samples 0, 100, 200, 500, 1000, 2000, 5000 successively.

Transfer random coefficient. We generate the first layer's coefficient randomly, freeze it, and retrain the first layer's bias and the second layer in Env. 2 with the available samples 0, 100, 200, 500, 1000, 2000, 5000 successively.

F.3.7. COMPUTE

The experiments that produce the figures in this paper were performed on a personal laptop.

F.3.8. HYPOTHETICAL MODEL SATISFYING THE ASSUMPTIONS

We present a hypothetical model that satisfies the four assumptions stated in Section 5.2.2. Note that this model is only for illustration, to show that the four assumptions could be reasonably satisfied. We did not use this model in our analysis or experiments.

Example F.2 (A hypothetical model for Colored MNIST). We present a hypothetical generative model that satisfies all of our assumptions (Figure 11). Let X ∈ R^{s×s×k} denote the image, D ∈ {0, . . . , 9} the digit, Y ∈ {0, 1} the binary digit label, C ∈ {0, 1} the color, M ∈ R^{s×s×n} the digit matrix, and N ∈ R^{k×k} the color matrix. The binary label Y is constructed from D by flipping with probability 0.25. The color C is constructed by flipping Y with probability 0.1, 0.2, and 0.9 in environments 1, 2, and 3, respectively. Each image x is an s × s × k tensor where s is the image size and k is the number of color channels. For each digit d, there is a digit matrix m_d ∈ R^{s×s×k} which draws the digit in red while the other color channels of m_d are 0. For each color c, there is an associated color matrix n_c ∈ R^{k×k}. The image x is constructed by first computing m_d · n_c, where · denotes the multiplication in which each s × k slice of m_d is multiplied with n_c. Then independent standard Gaussian noise is added to each cell.

Figure 11. A hypothetical generative model of the Complex Colored MNIST dataset. Ellipse variables are observable while square variables are not observable.

G. Limitations

Our work assumes realizability, a common assumption which does not always hold in practice. Nevertheless, experiments on the Colored MNIST example show that our method still works. Relaxation of this assumption is a topic for future work. Our model does not include the case in reinforcement learning where the reward R_t at time t is a function of both the previous state S_{t−1} and the current state S_t.