# Effect-Invariant Mechanisms for Policy Generalization

Journal of Machine Learning Research 25 (2024) 1-36. Submitted 6/23; Revised 11/23; Published 01/24.

Sorawit Saengkyongam (sorawit.saengkyongam@stat.math.ethz.ch), Seminar for Statistics, ETH Zürich, Zürich, Switzerland
Niklas Pfister (np@math.ku.dk), Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark
Predrag Klasnja (klasnja@umich.edu), School of Information, University of Michigan, Ann Arbor, MI, USA
Susan Murphy (samurphy@g.harvard.edu), Department of Statistics and Department of Computer Science, Harvard University, Cambridge, MA, USA
Jonas Peters (jonas.peters@stat.math.ethz.ch), Seminar for Statistics, ETH Zürich, Zürich, Switzerland

Editor: Aapo Hyvärinen

Abstract

Policy learning is an important component of many real-world learning systems. A major challenge in policy learning is how to adapt efficiently to unseen environments or tasks. Recently, it has been suggested to exploit invariant conditional distributions to learn models that generalize better to unseen environments. However, assuming invariance of entire conditional distributions (which we call full invariance) may be too strong an assumption in practice. In this paper, we introduce a relaxation of full invariance called effect-invariance (e-invariance for short) and prove that it is sufficient, under suitable assumptions, for zero-shot policy generalization. We also discuss an extension that exploits e-invariance when we have a small sample from the test environment, enabling few-shot policy generalization. Our work does not assume an underlying causal graph or that the data are generated by a structural causal model; instead, we develop testing procedures to test e-invariance directly from data. We present empirical results using simulated data and a mobile health intervention dataset to demonstrate the effectiveness of our approach.

Keywords: distribution generalization, policy learning, invariance, causality, domain adaptation

* Part of this work was done while SS and JP were at the University of Copenhagen.

©2024 Sorawit Saengkyongam, Niklas Pfister, Predrag Klasnja, Susan Murphy, Jonas Peters. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0802.html.

1. Introduction

When learning models from data, we often use these models in scenarios that are assumed to have similar or the same characteristics as the ones generating the training data. This holds for prediction tasks such as regression and classification, but also for settings such as contextual bandits or dynamic treatment regimes. When we observe different regimes during training, we can hope to exploit this information to construct models that adapt better to an unseen environment (or task). Such problems are usually referred to as multi-task learning, domain adaptation, or domain generalization (Caruana, 1997; Crammer et al., 2008; Muandet et al., 2013; Wang et al., 2022); the nomenclature sometimes differs depending on whether one observes labeled and/or unlabeled data in the test domain. For prediction tasks, it has been suggested to learn invariant models by exploiting invariance of the conditional distributions.
Under suitable assumptions, such models generalize better to unseen environments if the changes between the environments can be modeled by interventions (e.g., Rojas-Carulla et al., 2018; Magliacane et al., 2018; Christiansen et al., 2021). A similar approach has been applied in policy learning (Saengkyongam et al., 2023), where one searches for policies that yield an invariant reward distribution. We refer to the invariance of conditional distributions as full invariance. More precisely, given covariates $X^e$ and outcome $Y^e$ from different environments $e \in \mathcal{E}$, the full invariance assumption posits the existence of a set of covariates $X^e_S$ such that
$$\forall e_1, e_2 \in \mathcal{E}: \quad Y^{e_1} \mid X^{e_1}_S \ \text{ and } \ Y^{e_2} \mid X^{e_2}_S \ \text{ are identical.} \tag{1}$$

Full invariance, however, may be too strong an assumption in practice. In prediction tasks, it has been suggested to relax the requirement of full invariance, for example to a vanishing empirical covariance, and instead use invariance as a form of regularization (e.g., Rothenhäusler et al., 2021; Jakobsen and Peters, 2022; Arjovsky et al., 2019). This approach comes with theoretical guarantees regarding generalization to bounded interventions, for example, but these results are often limited to restricted classes of models and interventions.

In this paper, we relax the full invariance assumption in a different direction and show how it can be applied to inferring optimal conditional treatments in policy learning. We illustrate our proposed relaxation with an example. Consider the following class of structural causal models (SCMs, Pearl, 2009) indexed by environments $e \in \mathcal{E} := \{-1, 1\}$, with the corresponding graph¹ shown in Figure 1:
$$
U := \epsilon_U, \qquad
X := e \cdot U + \epsilon_X, \qquad
T := \mathbb{1}(1 + X + \epsilon_T > 0), \qquad
Y := \underbrace{T(1 + X) + X}_{Y_f} + \underbrace{2e + U + \epsilon_Y}_{Y_g},
\tag{2}
$$
where $(\epsilon_U, \epsilon_X, \epsilon_T, \epsilon_Y)$ are independent standard normal random variables. Here, $Y \in \mathbb{R}$ represents the outcome or reward, $T \in \{0, 1\}$ corresponds to the treatment or action, and $X \in \mathbb{R}$ and $U \in \mathbb{R}$ are observed and unobserved covariates, respectively. The mechanism for $T$ can be considered as a fixed policy.

1. Visually, the graphical representation in Figure 1b is similar to SWIGs (Richardson and Robins, 2013), and we use the tikz-swigs LaTeX package for drawing the graph; the interpretation, however, is different.

Figure 1: (a) Graphical representation indicating that full invariance does not hold. (b) Graphical representation indicating that partial invariance holds. (c) Comparing invariance tests: the full-invariance hypothesis, see (1), has p-value 0.01; the effect-invariance hypothesis, see (5), has p-value 0.64. Here, $Y$ is the outcome (or reward), $X$ and $U$ are observed and unobserved context variables, $T$ is the treatment (or action), and $e$ represents different environments. In example (2), the outcome mechanism is generally not invariant (as the environment enters $Y$ directly), see (a). This paper introduces a type of partial invariance called e-invariance (Definition 3), which does hold here, see (b): when conditioning on $X$, the treatment effect is invariant across environments. The concepts of the paper are applicable even if the data generating process does not allow for a graphical representation. Instead, we propose testing procedures to test for e-invariance. Panel (c) compares the result of one of our proposed e-invariance tests, applied to a sample drawn from (2), with the result of the full-invariance test: while $X$ does not satisfy the full-invariance condition (1), it does satisfy the e-invariance condition (5).
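To make the example concrete, the following minimal sketch (not part of the paper; the variable names and the regression-based effect estimate are our own choices) simulates (2) in both environments and fits a linear outcome model with a treatment interaction. The fitted treatment effect is close to $1 + x$ in both environments, while the fitted main effect shifts with $e$, mirroring panel (c) of Figure 1.

```python
# Minimal simulation sketch of the introductory SCM (2); assumptions are ours.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def sample_env(e, n=50_000):
    U = rng.standard_normal(n)
    X = e * U + rng.standard_normal(n)
    T = (1 + X + rng.standard_normal(n) > 0).astype(float)
    Y = T * (1 + X) + X + 2 * e + U + rng.standard_normal(n)
    return X, T, Y

for e in (-1, 1):
    X, T, Y = sample_env(e)
    # Fit E[Y | X, T] with a treatment interaction: Y ~ a + b*X + c*T + d*X*T.
    feats = np.column_stack([X, T, X * T])
    fit = LinearRegression().fit(feats, Y)
    a, (b, c, d) = fit.intercept_, fit.coef_
    # Effect given X: tau(x) = c + d*x; e-invariance predicts c ~ 1, d ~ 1 in both envs,
    # while the intercept a ~ 2e reveals the environment-dependent main effect.
    print(f"e={e:+d}:  tau(x) ~ {c:.2f} + {d:.2f} x,   E[Y | X=0, T=0] ~ {a:.2f}")
```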
Since the environment has a direct effect on the outcome, there is no subset satisfying the full invariance condition (1): regardless of whether we condition on $\emptyset$ or $\{X\}$, the outcome distribution is not independent of the environment. Consequently, methods that rely on the full invariance assumption, such as the one proposed by Saengkyongam et al. (2023), would lead to a vacuous result. However, the criterion of full invariance is not necessary when the goal is to learn an optimal policy. Instead, it may suffice to find models that are partially invariant: in the above example, see (2), the outcome $Y$ can be additively decomposed into two components, one being a function of $U$, $e$, and $\epsilon_Y$, and another being a function of $T$ and $X$. In this case, although the outcome mechanism is not entirely invariant, it contains an invariant component. When conditioning on $X$, the effect of the treatment is the same in all environments. More specifically, the conditional average treatment effect does not depend on $e$, that is,²
$$\forall x \in \mathcal{X}: \quad \mathbb{E}^e[Y \mid X = x, T = 1] - \mathbb{E}^e[Y \mid X = x, T = 0] = 1 + x. \tag{3}$$

2. Using do-notation (Pearl, 2009), we have for all $x \in \mathcal{X}$ and $t \in \mathcal{T}$ that $\mathbb{E}^e[Y \mid X = x, \mathrm{do}(T = t)] = \mathbb{E}^e[Y \mid X = x, T = t]$.

We say that $\{X\}$ satisfies effect-invariance (e-invariance). This condition suffices to ensure that, for an unseen test environment, we can still infer the optimal treatment among policies that only depend on $X$, without having access to outcome information in the test environment. In addition, if the environments are heterogeneous enough (see Assumption 3), such a policy is worst-case optimal. We refer to this setup as zero-shot generalization. We state the class of data generating processes and provide formal results in Section 2 and Section 3 below.

Moreover, if we can acquire a small sample including observations of the outcome from the test environment, we would want to optimize the policy using the data from the test environment. Ideally, this optimization also leverages information from training data from other environments to improve the finite sample performance of the learnt policy. We discuss how e-invariant information can be beneficial in such settings. We refer to this scenario as few-shot generalization and present it as an extension of the zero-shot methodology in Section 5.

While SCMs provide a class of examples satisfying the assumptions of this work, we do not assume an underlying causal graph or SCM (but instead only require a sequential sampling procedure that ensures that the covariates $X$ causally precede the outcome). In particular, e-invariance is not read off from a known graph but instead tested from data. Figure 1c illustrates the testing result obtained by applying one of the proposed e-invariance tests to a sample from (2), where we also include a comparison with the full-invariance test proposed in (Peters et al., 2016, Method II).

The main contributions of this paper are fourfold: (1) Introducing e-invariance: In Section 2, we introduce the concept of e-invariance, which offers a relaxation of the full invariance assumption. An e-invariant set ensures that the conditional treatment effect function remains the same across different environments. (2) Utilizing e-invariance for generalization: Section 3 discusses the use of e-invariance in learning policies that provably generalize well to unseen environments.
We prove two generalization guarantees: the proposed method (i) outperforms an optimal context-free policy on new environments and (ii) outperforms any other policy in terms of worst-case performance. (3) Methods for testing e-invariance: We propose hypothesis testing procedures, presented in Section 4, to test for e-invariance from data within both linear and nonlinear model classes. (4) Semi-real-world case study: In Section 6, we demonstrate the effectiveness of our proposed policy learning methods in a semi-real-world case study of mobile health interventions. An optimal policy based on an e-invariant set is shown to generalize better to new environments than the policy that uses all the context information.

1.1 Further Related Work

Our work builds upon existing research that leverages the invariance of conditional distributions (full invariance) for generalization to unseen environments (Schölkopf et al., 2012; Rojas-Carulla et al., 2018; Magliacane et al., 2018; Arjovsky et al., 2019; Christiansen et al., 2021; Saengkyongam et al., 2023). Several relaxations of full invariance have been suggested for prediction tasks (Rothenhäusler et al., 2021; Jakobsen and Peters, 2022; Arjovsky et al., 2019; Guo and Bühlmann, 2022). In reinforcement learning, previous studies have suggested the use of invariance to achieve generalizable policies (Zhang et al., 2020; Sonar et al., 2021); however, they lack theoretical guarantees for generalization. Closely related to our work, Saengkyongam et al. (2023) have established the worst-case optimality of invariant policy learning based on the full invariance assumption, which may be too restrictive in practice.

Transportability in causal inference (e.g., Pearl and Bareinboim, 2011; Bareinboim and Pearl, 2014; Subbaswamy et al., 2019) addresses the task of identifying invariant distributions based on a known causal graph and structural differences between environments, which can be used to generalize causal findings. However, our approach differs in that we do not assume prior knowledge of the causal graph or structural differences between environments. Furthermore, our methods are applicable even if the data generating process does not allow for a graphical representation. Instead, we develop testing procedures to obtain invariant information from data. Additionally, methods based on causal graphs typically only capture full invariance information (through the Markov property), whereas our work relaxes the requirement of full invariance for policy learning.

2. Effect-invariance

2.1 Multi-environment policy learning

In this work, we consider the problem of multi-environment policy learning (or multi-environment contextual bandits) (see also Dawid, 2021; Saengkyongam et al., 2023). Given a fixed set of environments $\mathcal{E}$, we assume that for each environment $e \in \mathcal{E}$ there is a policy learning setup, where the distributions of covariates and outcome may differ between environments. Each of the setups is modelled by a three-step sequential sampling scheme: first, covariates $(X, U)$ are sampled according to a fixed distribution depending on the environment; then $X$ is revealed to an agent that uses it to select a treatment $T$ (from a finite set $\mathcal{T}$) according to a policy $\pi$; and, finally, an outcome $Y$ is sampled conditionally on $X$, $U$ and $T$. Formally, we assume the following setting throughout the paper.
Setting 1 (Multi-environment policy learning) Let $\mathcal{E} \subseteq \mathbb{R}$ be a collection of environments, $Y \in \mathbb{R}$ an outcome variable, $X \in \mathcal{X} \subseteq \mathbb{R}^d$ observed covariates, $U \in \mathcal{U} \subseteq \mathbb{R}^p$ unobserved covariates and $T \in \mathcal{T} = \{1, \ldots, k\}$ a treatment. Let $\Delta(\mathcal{T})$ denote the probability simplex over the set of treatments $\mathcal{T}$ and let $\Pi := \{\pi \mid \pi : \mathcal{X} \to \Delta(\mathcal{T})\}$ denote the set of all policies. Moreover, for all $e \in \mathcal{E}$ let $P^e_{X,U}$ be a distribution on $\mathcal{X} \times \mathcal{U}$ and for all $e \in \mathcal{E}$, $x \in \mathcal{X}$, $u \in \mathcal{U}$ and $t \in \mathcal{T}$ let $P^e_{Y \mid X=x, U=u, T=t}$ be a distribution on $\mathbb{R}$. Given $e \in \mathcal{E}$ and $\pi \in \Pi$, this defines a random vector $(Y, X, U, T)$ by $(X, U) \sim P^e_{X,U}$, $T \sim \pi(X)$, and $Y \sim P^e_{Y \mid X=X, U=U, T=T}$; see Figure 1(a) for an example. Correspondingly, $n$ observations $(Y_i, X_i, T_i, e_i, \pi_i)_{i=1}^n$ from this model are generated by the following steps. (i) Select an environment $e_i \in \mathcal{E}$ and a policy $\pi_i \in \Pi$. (ii) Sample covariates $(X_i, U_i) \sim P^{e_i}_{X,U}$. (iii) Sample the treatment $T_i \sim \pi_i(X_i)$. (iv) Sample the outcome $Y_i \sim P^{e_i}_{Y \mid X=X_i, U=U_i, T=T_i}$.³ The sampling in (ii)-(iv) is done independently for different $i$. In particular, we consider $e_i$ and $\pi_i$ to be deterministic; they are not random variables that depend on other variables. (Our results in Section 3 remain valid even if $\pi_i$ depends on previous observations $\{j : j < i\}$, see Remark 2.) Further, denote by $\mathcal{E}^{\mathrm{tr}} \subseteq \mathcal{E}$ the set of observed environments within the $n$ training observations, and for each $e \in \mathcal{E}^{\mathrm{tr}}$ denote by $n_e$ the number of observations from environment $e$. We assume that there exists a product measure $\nu$ such that, for all $e \in \mathcal{E}$, the joint distribution of $(Y, X, U, T)$ in environment $e$ under policy $\pi$ has density $p^{e,\pi}$ with respect to $\nu$, and that $P^e_X$ has full support on $\mathcal{X}$. Next, we define $t_0 \in \mathcal{T}$ as a baseline treatment, which serves as the reference point for defining the conditional average treatment effect in (4). However, and importantly, our results hold for any choice of $t_0$. Finally, we assume that the policies generating the training observations are bounded, i.e., for all $i \in \{1, \ldots, n\}$, $t \in \mathcal{T}$, and $x \in \mathcal{X}$ it holds that $\pi_i(x)(t) > 0$.

3. Consequently, the distribution in (iv) is indeed the conditional distribution of $Y$, given $X$, $U$, and $T$, justifying the notation.

Notation. When writing probabilities and expectations of the random variables $Y$, $X$, $U$ and $T$, or of the corresponding observations, we use superscripts to make explicit any possible dependence on the environment and policy, e.g., $P^{e,\pi}$ and $\mathbb{E}^{e,\pi}$. Moreover, by a slight abuse of notation, for a policy $\pi \in \Pi$ with a density, we let $\pi(x)$ denote the density rather than the distribution; we also use the commonly employed convention $\pi(t \mid x) := \pi(x)(t)$. Finally, for all $t \in \mathcal{T}$, we denote by $\pi_t \in \Pi$ the policy that always selects treatment $t$, that is, $\pi_t(\cdot \mid x) = \mathbb{1}(t = \cdot)$. If we assume the existence of an underlying SCM (see Section 2.3), we have for all $e \in \mathcal{E}$, $t \in \mathcal{T}$ and $x \in \mathcal{X}$ that $\mathbb{E}^{e,\pi_t}[Y \mid X = x] = \mathbb{E}^e[Y \mid X = x, \mathrm{do}(T = t)]$.

Remark 2 Our results in Section 3 remain valid even if $\pi_i$ in Setting 1 depends on previous observations. In this case, the sampling step (iii) is replaced by $T_i \sim \pi_i(X_i, H_i)$ with $H_i := \{(X_j, Y_j, T_j) : j < i\}$. Furthermore, in the Zero-shot setting in Section 3, we consider $D^{\mathrm{tr}} := (y^{\mathrm{tr}}_i, x^{\mathrm{tr}}_i, t^{\mathrm{tr}}_i, \pi^{\mathrm{tr}}_i(\cdot \mid \cdot, h_i), e^{\mathrm{tr}}_i)_{i=1}^n$, where $(y^{\mathrm{tr}}_1, x^{\mathrm{tr}}_1, t^{\mathrm{tr}}_1), \ldots, (y^{\mathrm{tr}}_n, x^{\mathrm{tr}}_n, t^{\mathrm{tr}}_n)$ are (jointly independent) realizations from $Q^{\mathrm{tr}}_1 := P^{e_1, \pi_1(\cdot \mid \cdot)}_{X,Y,T}$, $Q^{\mathrm{tr}}_2 := P^{e_2, \pi_2(\cdot \mid \cdot, h_2)}_{X,Y,T}$, ..., $Q^{\mathrm{tr}}_n := P^{e_n, \pi_n(\cdot \mid \cdot, h_n)}_{X,Y,T}$, respectively, with $h_i := \{(y^{\mathrm{tr}}_j, x^{\mathrm{tr}}_j, t^{\mathrm{tr}}_j) : j < i\}$ for $i \geq 2$; in Appendix A.3, we replace $\pi_i$ by $\pi_i(\cdot \mid \cdot, h_i)$.
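The three-step scheme of Setting 1 can be written down directly. The sketch below is our own illustration (the concrete distributions are placeholders, not the paper's); it makes explicit the roles of the environment-specific covariate distribution, the policy with positive treatment probabilities, and the outcome distribution conditional on $X$, $U$ and $T$.

```python
# Illustrative sketch of sampling steps (ii)-(iv) in Setting 1; placeholders are ours.
import numpy as np

rng = np.random.default_rng(1)

def sample_observations(e, policy, n=5):
    """One environment e and one policy; returns (Y, X, T) while U stays latent."""
    # (ii) covariates (X, U) ~ P^e_{X,U}: here a simple environment-shifted Gaussian.
    U = rng.standard_normal(n)
    X = e * U + rng.standard_normal(n)
    # (iii) treatment T ~ pi(X): the policy maps x to a probability vector over T = {0, 1}.
    probs = policy(X)                          # shape (n, 2), rows sum to 1, entries > 0
    T = (rng.uniform(size=n) < probs[:, 1]).astype(int)
    # (iv) outcome Y ~ P^e_{Y | X, U, T}: depends on X, U, T and the environment.
    Y = T * (1 + X) + X + 2 * e + U + rng.standard_normal(n)
    return Y, X, T

def logistic_policy(X):
    p1 = 1 / (1 + np.exp(-(0.5 + X)))          # strictly in (0, 1), so pi(x)(t) > 0
    return np.column_stack([1 - p1, p1])

Y, X, T = sample_observations(e=1, policy=logistic_policy)
```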
2.2 Invariant treatment effects

The concept of invariance has been connected to causality (Haavelmo, 1944; Pearl, 2009; Schölkopf et al., 2012), and it has been suggested to use it for causal discovery (Peters et al., 2016; Pfister et al., 2018; Heinze-Deml et al., 2018) or distribution generalization (Rojas-Carulla et al., 2018; Rothenhäusler et al., 2021; Magliacane et al., 2018). In our setting, the standard notion of invariance would correspond to invariance of the outcome mechanism (Saengkyongam et al., 2023). In practice, this notion may be too strong; for example, it does not hold if the environment directly influences the outcome (see Figure 1 for an example). In what follows, we introduce the notion of (treatment) effect-invariance, which relaxes the standard invariance condition.

To this end, we recall the notion of the conditional average treatment effect (CATE) under different environments $e \in \mathcal{E}$. The CATE in environment $e \in \mathcal{E}$ for a subset of covariates $S \subseteq \{1, \ldots, d\}$ is defined for all $x \in \mathcal{X}^S$ and $t \in \mathcal{T}$ as
$$\tau^S_e(x, t) := \mathbb{E}^{e,\pi_t}[Y \mid X^S = x] - \mathbb{E}^{e,\pi_{t_0}}[Y \mid X^S = x]. \tag{4}$$
When $S = \{1, \ldots, d\}$, we simply denote $\tau^S_e$ by $\tau_e$. In Setting 1, for some $S \subseteq \{1, \ldots, d\}$, the CATE functions, as defined in (4), may differ substantially from one environment to another. But there may exist a subset $S \subseteq \{1, \ldots, d\}$ such that the CATE functions do not change across environments. In this work, we exploit the existence of such sets, which we call e-invariant (for effect-invariant).⁴

Definition 3 (Effect-invariant sets) Assume Setting 1. A subset $S \subseteq \{1, \ldots, d\}$ is said to be effect-invariant with respect to a set of environments $\mathcal{E}' \subseteq \mathcal{E}$ (e-invariant w.r.t. $\mathcal{E}'$ for short) if the following holds:
$$\forall e_1, e_2 \in \mathcal{E}': \quad \tau^S_{e_1} \equiv \tau^S_{e_2}. \tag{5}$$
For any $\mathcal{E}' \subseteq \mathcal{E}$, we denote by $\mathcal{S}^{\text{e-inv}}_{\mathcal{E}'}$ the collection of all e-invariant sets w.r.t. $\mathcal{E}'$.

The above definition does not depend on the choice of $t_0$ in Setting 1: if condition (5) holds for one choice of $t_0$, it holds for all other choices of $t_0 \in \mathcal{T}$. In this work, we focus on discrete treatments but, in principle, one could consider the continuous case by defining the CATE function as $(x, t) \mapsto \frac{\partial}{\partial t}\, \mathbb{E}^{e,\pi_t}[Y \mid X^S = x]$ and defining effect-invariance analogously to (5). We now provide a characterization of e-invariance based on the outcome mechanism.

Proposition 4 Assume Setting 1. A subset $S \subseteq \{1, \ldots, d\}$ is e-invariant w.r.t. $\mathcal{E}'$ if and only if there exists a pair of functions $\psi^S : \mathcal{X}^S \times \mathcal{T} \to \mathbb{R}$ and $\nu^S : \mathcal{X}^S \times \mathcal{E}' \to \mathbb{R}$ such that
$$\forall e \in \mathcal{E}',\ x \in \mathcal{X}^S,\ t \in \mathcal{T}: \quad \mathbb{E}^{e,\pi_t}[Y \mid X^S = x] = \psi^S(x, t) + \nu^S(x, e), \tag{6}$$
and $\psi^S(\cdot, t_0) \equiv 0$. In particular, we have for all $e \in \mathcal{E}'$ that $\psi^S \equiv \tau^S_e$.

Proof See Appendix A.1.

The two equivalent conditions (5) and (6) provide two different viewpoints on e-invariant sets. The former shows that, when conditioning on an e-invariant set $S$, the CATE functions are invariant across environments, while the latter ensures that part of the conditional expected outcome $\mathbb{E}^{e,\pi_t}[Y \mid X^S]$ remains invariant across environments. In particular, the conditional expected outcome $\mathbb{E}^{e,\pi_t}[Y \mid X^S]$ can be additively decomposed into a fixed effect-modification term ($\psi^S$) that depends on the treatment and an environment-varying main-effect term ($\nu^S$) that does not depend on the treatment. Here, the additivity stems from the definition of the CATE; different causal contrasts correspond to other forms of decomposition.
4. As an alternative to e-invariance, one could define argmax-invariance by requiring that $\forall e_1, e_2 \in \mathcal{E}': \operatorname{argmax}_{t \in \mathcal{T}} \tau^S_{e_1}(\cdot, t) \equiv \operatorname{argmax}_{t \in \mathcal{T}} \tau^S_{e_2}(\cdot, t)$. A similar notion called invariant action prediction has been introduced by Sonar et al. (2021). This condition would ensure that the optimal treatment is robust with respect to changes in the environment (even though the treatment effect may not be). E-invariance implies argmax-invariance, but the latter condition is not sufficient to show the generalization properties that we develop in Section 3.

In Section 3, we propose a method that utilizes e-invariant sets for zero-shot generalization. If there are multiple e-invariant sets, we choose the one that yields the highest expected outcome. Our results rely on the existence of an e-invariant set. We therefore make this assumption explicit.

Assumption 1 In Setting 1, there exists a subset $S \subseteq \{1, \ldots, d\}$ such that $S$ is e-invariant w.r.t. $\mathcal{E}$.

The subsequent section connects Assumption 1 to a class of structural causal models (Pearl, 2009; Bongers et al., 2021; Dawid, 2021; Saengkyongam et al., 2023). For such models, Proposition 6 below shows that Assumption 1 is satisfied if the outcome mechanism is of a specific form and an independence assumption holds. Furthermore, using a test for e-invariance, see Section 4, Assumption 1 is testable from data for the observed environments $\mathcal{E}^{\mathrm{tr}}$.

2.3 Effect-invariance in structural causal models

Assumption 1 is satisfied in a restricted class of structural causal models (SCMs). Formally, we consider the following class of SCMs inducing the sequential sampling steps (ii)-(iv) in Setting 1:
$$
U := s_e(X, U, \epsilon_U), \qquad
X := h_e(X, U, \epsilon_X), \qquad
T := \ell_\pi(X, \epsilon_T), \qquad
Y := f(X^{\mathrm{PA}_{f,X}}, U^{\mathrm{PA}_{f,U}}, T) + g_e(X, U, \epsilon_Y), \tag{7}
$$
where $(U, X, T, Y) \in \mathcal{U} \times \mathcal{X} \times \mathcal{T} \times \mathbb{R}$, $(\epsilon_U, \epsilon_X, \epsilon_T, \epsilon_Y)$ are jointly independent noise variables, $(s_e, h_e, g_e)_{e \in \mathcal{E}}$, $f$ and $\ell_\pi$ are measurable functions such that, for all $x \in \mathcal{X}$, $\ell_\pi(x, \epsilon_T)$ is a random variable on $\mathcal{T}$ with distribution $\pi(x)$, and $\mathrm{PA}_{f,X} \subseteq \{1, \ldots, d\}$ and $\mathrm{PA}_{f,U} \subseteq \{1, \ldots, p\}$. We call $\mathrm{PA}_{f,X}$ and $\mathrm{PA}_{f,U}$ the observed and unobserved policy-relevant parents, respectively.

To determine whether e-invariance holds, it is helpful to distinguish between the parents of $Y$ that enter $f$ (these are relevant for determining optimal policies) and those parents of $Y$ that enter $g_e$.⁵ To build intuition, we therefore define a graphical representation which splits $Y$ into two nodes (as mentioned in Footnote 1, the graphical representation is similar to SWIGs (Richardson and Robins, 2013), but the interpretation is different).

Definition 5 (E-invariance graph) We represent a class of SCMs of the form (7) by an e-invariance graph. This graph contains, as usually done when representing SCMs graphically, a directed edge from variables on the right-hand side of assignments to variables on the left-hand side, with the exception that $e$ is represented by a square node and the node $Y$ is split into a part for $Y_f$ and a part for $Y_g$; see Example 1 and also Figure 1.

5. The set $\mathrm{PA}_{f,X}$ should be chosen as small as possible and, under this constraint, $f$ and $g_e$ should be chosen such that the set of parents of $Y$ entering $g_e$ is as small as possible, too (see Bongers et al., 2021, Def. 2.6).
Example 1 Consider the following SCMs:
$$
U_1 := \epsilon_{U_1}, \qquad U_2 := \epsilon_{U_2}, \qquad
X_3 := \gamma^3_e U_1 + \epsilon_{X_3}, \qquad
X_2 := \gamma^2_e U_2 + \epsilon_{X_2}, \qquad
X_1 := X_2 + \gamma^1_e U_1 + \epsilon_{X_1},
$$
$$
T := \ell_\pi(X_1, X_2, X_3, \epsilon_T), \qquad
Y := \underbrace{T(1 + 0.5 X_2 + 0.5 U_1) + X_2 + U_1}_{f} + \underbrace{\mu_e + U_2 + X_3 + \epsilon_Y}_{g_e},
$$
where $\mathcal{T} = \{0, 1\}$, $\epsilon_{U_1}, \epsilon_{U_2}, \epsilon_{X_1}, \epsilon_{X_2}, \epsilon_{X_3}, \epsilon_T, \epsilon_Y$ are jointly independent noise variables with mean zero, and $\gamma^1_e, \gamma^2_e, \gamma^3_e, \mu_e$ are environment-specific parameters. Here, $\mathrm{PA}_{f,X} = \{2\}$ and $\mathrm{PA}_{f,U} = \{1\}$ are the policy-relevant parents; the e-invariance graph is shown on the right. While in this example the environment changes the coefficients $\gamma^1_e, \gamma^2_e, \gamma^3_e$ and $\mu_e$, the generality of (7) allows for a change in the noise distributions, too.

Under the class of SCMs (7), the following proposition shows that an e-invariant set exists if the unobserved policy-relevant parents $U^{\mathrm{PA}_{f,U}}$ and the observed policy-relevant parents $X^{\mathrm{PA}_{f,X}}$ are independent, and the environments do not influence $U^{\mathrm{PA}_{f,U}}$.

Proposition 6 Assume Setting 1 and that the sequential sampling steps (ii)-(iv) are induced by the SCMs in (7). If (i) for all $e \in \mathcal{E}'$, $U^{\mathrm{PA}_{f,U}} \perp\!\!\!\perp X^{\mathrm{PA}_{f,X}}$ in $P^e_{X,U}$ and (ii) the distributions $P^e_{U^{\mathrm{PA}_{f,U}}}$ are identical across $e \in \mathcal{E}'$, then
$$\mathrm{PA}_{f,X} \text{ is e-invariant w.r.t. } \mathcal{E}'. \tag{8}$$

Proof See Appendix A.2.

Example 7 (Example 1 continued) Let $e \in \mathcal{E}$. In this example, it holds that $U^{\mathrm{PA}_{f,U}} \perp\!\!\!\perp X^{\mathrm{PA}_{f,X}}$ in $P^e_{X,U}$, where $U^{\mathrm{PA}_{f,U}} = U_1$ and $X^{\mathrm{PA}_{f,X}} = X_2$. Therefore, $\mathrm{PA}_{f,X} = \{2\}$ satisfies the e-invariance condition (5) by Proposition 6. To illustrate this, consider the expected outcome conditioned on $X_2$:
$$
\mathbb{E}^{e,\pi_t}[Y \mid X_2]
= \mathbb{E}^{e,\pi_t}[T(1 + 0.5 X_2 + 0.5 U_1) + \mu_e + U_1 + U_2 + X_2 + X_3 \mid X_2]
$$
$$
= \mathbb{1}(t \neq t_0)\big(1 + 0.5 X_2 + 0.5\, \mathbb{E}^e[U_1 \mid X_2]\big) + \mu_e + X_2 + \mathbb{E}^e[U_1 + U_2 + X_3 \mid X_2]
$$
$$
= \underbrace{\mathbb{1}(t \neq t_0)(1 + 0.5 X_2) + X_2}_{\psi^{\{2\}}(X_2, t)} + \underbrace{\mu_e + \mathbb{E}^e[U_2 + X_3 \mid X_2]}_{\nu^{\{2\}}(X_2, e)},
$$
since $U_1 \perp\!\!\!\perp X_2$ (and $U_1$ has mean zero). Thus, by Proposition 4, $\{2\}$ is e-invariant w.r.t. $\mathcal{E}$.

3. Zero-shot policy generalization through e-invariance

In this section, we consider zero-shot generalization (sometimes called unsupervised domain adaptation). We aim to find a policy that performs well (in terms of the expected outcome or reward) in a new test environment in which we have access to observations of the covariates but not of the outcome. We formally lay out the setup and objective of zero-shot policy generalization and show that a policy that optimally uses information from e-invariant sets achieves desirable generalization properties.

Setting Zero-shot Assume Setting 1 and that we are given $n \in \mathbb{N}$ training observations $D^{\mathrm{tr}} := (Y^{\mathrm{tr}}_i, X^{\mathrm{tr}}_i, T^{\mathrm{tr}}_i, \pi^{\mathrm{tr}}_i, e^{\mathrm{tr}}_i)_{i=1}^n$ from the observed environments $e^{\mathrm{tr}}_i \in \mathcal{E}^{\mathrm{tr}}$. During test time, we are given $m \in \mathbb{N}$ observations $D^{\mathrm{tst}}_X := (X^{\mathrm{tst}}_i)_{i=1}^m$ from a single test environment $e^{\mathrm{tst}} \in \mathcal{E}$. We denote by $Q^{\mathrm{tr}} := Q^{\mathrm{tr}}_1 \otimes \cdots \otimes Q^{\mathrm{tr}}_n$, where $Q^{\mathrm{tr}}_i := P^{e_i,\pi_i}_{X,Y,T}$, and $Q^{\mathrm{tst}}_X := P^{e^{\mathrm{tst}}}_X$ the distributions of $D^{\mathrm{tr}}$ and $D^{\mathrm{tst}}_X$, respectively.

We seek to find a policy that generalizes well to the test environment $e^{\mathrm{tst}}$. As we only have access to the observed covariate distribution $P^{e^{\mathrm{tst}}}_X$, and since there may be multiple potential test environments $e \in \mathcal{E}$ with $P^e_X = P^{e^{\mathrm{tst}}}_X$, we propose to evaluate the performance of a policy $\pi$ based on its expected outcome (relative to a fixed baseline policy $\pi_{t_0}$ that always chooses $t_0$) in the worst-case scenario across all environments with covariate distribution equal to $P^{e^{\mathrm{tst}}}_X$. Formally, let $[e^{\mathrm{tst}}] := \{e \in \mathcal{E} \mid P^e_X = Q^{\mathrm{tst}}_X\}$ be the equivalence class of environments under which the covariate distribution $P^e_X$ is the same as $Q^{\mathrm{tst}}_X$.
We then consider the following worst-case objective:
$$V_{[e^{\mathrm{tst}}]}(\pi) := \inf_{e \in [e^{\mathrm{tst}}]} \left( \mathbb{E}^{e,\pi}[Y] - \mathbb{E}^{e,\pi_{t_0}}[Y] \right). \tag{9}$$
The goal of (population) zero-shot generalization applied to our setting is then to find a policy that (i) is identifiable from $Q^{\mathrm{tr}}_i$ (for an arbitrary $1 \leq i \leq n$) and $Q^{\mathrm{tst}}_X$ and (ii) maximizes the worst-case performance defined in (9).

We now introduce a policy $\pi^{\text{e-inv}}$ that optimally uses information from e-invariant sets and show that $\pi^{\text{e-inv}}$ achieves the aforementioned goal under suitable assumptions. To this end, for all $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$ (see Definition 3), we denote the set of all policies that depend only on $X^S$ by
$$\Pi^S := \{\pi \in \Pi \mid \exists\, \bar\pi : \mathcal{X}^S \to \Delta(\mathcal{T}) \ \text{s.t.}\ \forall x \in \mathcal{X},\ \pi(\cdot \mid x) = \bar\pi(\cdot \mid x^S)\} \subseteq \Pi.$$
Next, for all $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$, we define $\Pi^S_{\mathrm{opt}} \subseteq \Pi^S$ to be the set of policies such that each $\pi^S \in \Pi^S_{\mathrm{opt}}$ satisfies for all $x \in \mathcal{X}$ and $t \in \mathcal{T}$ that
$$\pi^S(t \mid x) > 0 \implies t \in \operatorname*{argmax}_{t' \in \mathcal{T}} \sum_{e \in \mathcal{E}^{\mathrm{tr}}} \tau^S_e(x^S, t'). \tag{10}$$
That is, all the mass of $\pi^S(\cdot \mid x)$ is distributed on treatments that maximize the treatment effect conditioned on $X^S$. Since $\mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$ contains only e-invariant sets w.r.t. $\mathcal{E}^{\mathrm{tr}}$, we also have that $\frac{1}{|\mathcal{E}^{\mathrm{tr}}|}\sum_{e \in \mathcal{E}^{\mathrm{tr}}} \tau^S_e \equiv \tau^S_f$ for any fixed $f \in \mathcal{E}^{\mathrm{tr}}$ (but for finite samples, we approximate the former). Finally, we denote by $\Pi^{\text{e-inv}}_{\mathrm{opt}} := \{\pi \in \Pi \mid \exists S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}} \ \text{s.t.}\ \pi \in \Pi^S_{\mathrm{opt}}\}$ the collection of all such policies.

We now propose to use a policy from the collection of policies that are optimal among $\Pi^{\text{e-inv}}_{\mathrm{opt}}$, i.e., from
$$\operatorname*{argmax}_{\pi \in \Pi^{\text{e-inv}}_{\mathrm{opt}}} \mathbb{E}^{e^{\mathrm{tst}},\pi}[Y]. \tag{11}$$
Although the set (11) depends on the expected value of $Y$ in the test environment, in Proposition 8 we show that we can construct a policy, denoted by $\pi^{\text{e-inv}}$, that satisfies the argmax property (11) and is identifiable from the data available during training (i.e., i.i.d. observations from $Q^{\mathrm{tr}}$ and $Q^{\mathrm{tst}}_X$). In Theorem 9, we then prove generalization properties of an optimal e-invariant policy $\pi^{\text{e-inv}}$. This generalization result requires the following two assumptions.

Assumption 2 (Generalizing environments) It holds for all $S \subseteq \{1, \ldots, d\}$ that
$$S \text{ is e-invariant w.r.t. } \mathcal{E}^{\mathrm{tr}} \implies S \text{ is e-invariant w.r.t. } \mathcal{E}^{\mathrm{tr}} \cup [e^{\mathrm{tst}}]. \tag{12}$$
Assumption 2 imposes some commonalities between environments, which allows a transfer of e-invariance from the observed to the test environments. Similar assumptions are used when proving guarantees for other invariance-based learning methods (e.g., Rojas-Carulla et al. (2018); Magliacane et al. (2018); Christiansen et al. (2021); Pfister et al. (2021); Saengkyongam et al. (2023)).

Assumption 3 (Adversarial environment) There exist $e \in [e^{\mathrm{tst}}]$ and $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}}$ such that for all $x \in \mathcal{X}$ it holds that
$$\max_{t \in \mathcal{T}} \tau_e(x, t) = \max_{t \in \mathcal{T}} \tau^S_e(x^S, t). \tag{13}$$
Assumption 3 ensures that there exists at least one environment that does not benefit from non-e-invariant covariates; it facilitates the worst-case optimality result for our proposed optimal e-invariant policy $\pi^{\text{e-inv}}$. Without Assumption 3, relying only on e-invariant covariates can become suboptimal if other (non-e-invariant) covariates are beneficial across all environments. For example, consider Example 1 and assume that the coefficients $\gamma^1_e$ and $\gamma^3_e$ in different environments are relatively close, e.g., $\forall e \in \mathcal{E}: \gamma^1_e, \gamma^3_e \in (0.9, 1)$. In this scenario, $\{X_1, X_3\}$ is not e-invariant. Still, it is preferable to use these variables for policy learning as they provide valuable information for predicting $U_1$, which modifies the treatment effect. In the above setting, Assumption 3 does not hold; it would be satisfied if there is at least one additional environment $e \in [e^{\mathrm{tst}}]$ where $\gamma^1_e = \gamma^3_e = 0$.
The reason is that in such an environment the variables $X_1$ and $X_3$ do not offer any relevant information for predicting $U_1$. A similar assumption, known as confounding-removing interventions, is introduced by Christiansen et al. (2021) in the prediction setting.

Proposition 8 (Identifiability) Assume Setting Zero-shot and Assumptions 1 and 2. Let $e \in \mathcal{E}^{\mathrm{tr}}$ be an arbitrary training environment, for all $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$ let $\pi^S \in \Pi^S_{\mathrm{opt}}$, that is, a policy that satisfies (10), and let $S^*$ be a subset such that
$$S^* \in \mathcal{S}_{\mathrm{opt}} := \operatorname*{argmax}_{S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}} \mathbb{E}^{e^{\mathrm{tst}}}\Big[\sum_{t \in \mathcal{T}} \tau^S_e(X^S, t)\,\pi^S(t \mid X)\Big]. \tag{14}$$
Define $\pi^{\text{e-inv}} := \pi^{S^*}$. Then, the following holds: (i) the set $\mathcal{S}_{\mathrm{opt}}$ is identifiable from the distributions $Q^{\mathrm{tr}}_i$ (for an arbitrary $1 \leq i \leq n$) and $Q^{\mathrm{tst}}_X$ (which makes it possible to choose $S^*$ and $\pi^{S^*}$ during test time), and (ii) $\pi^{\text{e-inv}}$ is an element of (11).

Proof See Appendix A.3.

Theorem 9 (Generalizability) Assume Setting Zero-shot and Assumptions 1 and 2. Let $\pi^{\text{e-inv}}$ be as defined in Proposition 8. Then, the following two statements hold. (i) Let $\pi_t$, as defined in Section 2.1, be the policy that always chooses treatment $t \in \mathcal{T}$. We have that
$$\max_{t \in \mathcal{T}} \mathbb{E}^{e^{\mathrm{tst}},\pi_t}[Y] \leq \mathbb{E}^{e^{\mathrm{tst}},\pi^{\text{e-inv}}}[Y]. \tag{15}$$
(ii) Given Assumption 3, we have that
$$\forall \pi \in \Pi: \quad V_{[e^{\mathrm{tst}}]}(\pi^{\text{e-inv}}) \geq V_{[e^{\mathrm{tst}}]}(\pi). \tag{16}$$

Proof See Appendix A.4.

Theorem 9 provides two generalization properties of the policy $\pi^{\text{e-inv}}$. First, Theorem 9(i) shows that $\pi^{\text{e-inv}}$ is guaranteed to outperform, in any (unseen) test environment, an optimal policy that does not use the covariates $X$. In other words, it is always beneficial to utilize the information from e-invariant sets when generalizing treatment regimes, compared to ignoring the covariates. Second, Theorem 9(ii) shows that $\pi^{\text{e-inv}}$ maximizes the worst-case performance defined in (9); that is, if Assumption 3 holds, it outperforms every other policy when each policy is evaluated in its respective worst-case environment.

3.1 Estimation of $\pi^{\text{e-inv}}$

As shown in Proposition 8, the policy $\pi^{\text{e-inv}}$ is identifiable from $Q^{\mathrm{tr}}$ and $Q^{\mathrm{tst}}_X$. We now turn to the problem of estimating $\pi^{\text{e-inv}}$ given data $D^{\mathrm{tr}}$ and $D^{\mathrm{tst}}_X$ from $Q^{\mathrm{tr}}$ and $Q^{\mathrm{tst}}_X$, respectively. For now, assume we are given the collection $\mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$ of all e-invariant sets w.r.t. $\mathcal{E}^{\mathrm{tr}}$; we discuss how to estimate $\mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$ in Section 4. Proposition 8 suggests a plug-in estimator of $\pi^{\text{e-inv}}$ based on (14). Specifically, the estimate can be obtained as follows (see also the sketch after these steps).

(i) For all $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$, compute an estimate $\hat\tau^S$ of $\tau^S_e$, $e \in \mathcal{E}^{\mathrm{tr}}$, by pooling the data from the training environments (as the $\tau^S_e$'s are equal across environments by effect-invariance). There is a rich literature on estimating the CATE from observational data (see Zhang et al. (2021b) for a survey); one can choose an estimator that is appropriate for a given dataset. In particular, the choice of the CATE estimator can differ from the one employed in the testing step outlined in Section 4. Once an estimate $\hat\tau^S$ is obtained, we plug $\hat\tau^S$ into (10) to construct an estimate $\hat\pi^S$ of $\pi^S$, that is, $\hat\pi^S$ satisfies for all $x \in \mathcal{X}$ and $t \in \mathcal{T}$ that
$$\hat\pi^S(t \mid x) > 0 \implies t \in \operatorname*{argmax}_{t' \in \mathcal{T}} \hat\tau^S(x^S, t'). \tag{17}$$
We distribute the probabilities equally if there is more than one $t$ satisfying (17).

(ii) Find an optimal subset $\hat S$ among $\mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$, see (14):
$$\hat S \in \operatorname*{argmax}_{S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}} \frac{1}{m}\sum_{i=1}^{m} \sum_{t \in \mathcal{T}} \hat\tau^S\big((X^{\mathrm{tst}}_i)^S, t\big)\,\hat\pi^S(t \mid X^{\mathrm{tst}}_i). \tag{18}$$
If there are multiple $S$ satisfying (18), we randomly choose $\hat S$ among them.

(iii) Return $\hat\pi^{\hat S}$, which was already computed in step (i), as the estimate of $\pi^{\text{e-inv}}$.
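The following sketch illustrates steps (i)-(iii) under simplifying assumptions that are ours, not the paper's: a binary treatment with $t_0 = 0$, a simple pooled regression-based CATE estimate (the paper allows any CATE estimator, e.g., the R-learner used in Section 6), and candidate e-invariant sets that are assumed to be given.

```python
# Hypothetical plug-in estimator of pi_e-inv (steps (i)-(iii)); binary T, t0 = 0.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_cate(X, T, Y, S):
    """Step (i): pooled T-learner-style estimate of tau^S(x, 1) using columns in S."""
    XS = X[:, S]
    m1 = LinearRegression().fit(XS[T == 1], Y[T == 1])
    m0 = LinearRegression().fit(XS[T == 0], Y[T == 0])
    return lambda xs: m1.predict(xs) - m0.predict(xs)

def estimate_pi_einv(X_tr, T_tr, Y_tr, X_tst, einv_sets):
    taus = {tuple(S): fit_cate(X_tr, T_tr, Y_tr, S) for S in einv_sets}
    # Step (ii): empirical value (18) on test covariates; for binary T this equals
    # the mean of max(tau^S(x, 1), 0), since tau^S(x, 0) = 0.
    values = {S: np.mean(np.maximum(tau(X_tst[:, list(S)]), 0.0))
              for S, tau in taus.items()}
    S_hat = max(values, key=values.get)
    tau_hat = taus[S_hat]
    # Step (iii): the returned policy treats whenever the estimated effect is positive.
    def pi_hat(x_row):
        return int(tau_hat(x_row[list(S_hat)].reshape(1, -1))[0] > 0)
    return S_hat, pi_hat
```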
4. Inferring e-invariant sets

We now turn to the problem of testing the e-invariance condition (5) based on training observations $D^{\mathrm{tr}} := (Y_i, X_i, T_i, \pi_i, e_i)_{i=1}^n$ from the observed environments $e_i \in \mathcal{E}^{\mathrm{tr}}$. Throughout this section, we assume a fixed initial (or training) policy $\pi^{\mathrm{tr}}$, i.e., $\forall i \in \{1, \ldots, n\}: \pi^{\mathrm{tr}} = \pi_i$. The initial policy $\pi^{\mathrm{tr}}$ can either be given or estimated from the available data (see, e.g., Algorithm 2). Our proposed testing methods remain valid even if the initial policies $(\pi_i)_{i=1}^n$ are different, as long as they are both known and independent of all observed quantities.⁶ Furthermore, we consider discrete environments, $\mathcal{E}^{\mathrm{tr}} = \{1, \ldots, \ell\}$, and a binary treatment variable, $\mathcal{T} = \{0, 1\}$. One can generalize to a multi-level treatment variable by repeating the proposed procedures for each level $1, \ldots, k$ with the baseline treatment $t_0 = 0$ and combining the test results with a multiple testing correction method.

6. Specifically, we do not allow for data collected with adaptive algorithms, which we leave for future work, see Section 7.

To begin with, we define for all $S \subseteq \{1, \ldots, d\}$ the e-invariance null hypothesis
$$H^{\mathrm{tr}}_{0,S}: \quad S \text{ is e-invariant w.r.t. } \mathcal{E}^{\mathrm{tr}}, \tag{19}$$
see Definition 3. In Section 4.1, we propose a testing procedure under the assumption that, for all $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$, the functions $(\tau^S_e)_{e \in \mathcal{E}^{\mathrm{tr}}}$ can be modelled by linear functions, and we provide its statistical guarantees. In Section 4.2, we relax the linearity assumption by using a doubly robust pseudo-outcome learner (see, e.g., Kennedy, 2020).

4.1 Linear CATE functions

One way of creating e-invariance tests is to assume a parametric form of the CATEs. In this section, we rely on the following linearity assumption.

Assumption 4 (Linear CATEs) For all $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$, there exist coefficients $(\gamma^S_t)_{t \in \mathcal{T}} \in \mathbb{R}^{k \times |S|}$ and intercepts $(\mu^S_t)_{t \in \mathcal{T}} \in \mathbb{R}^k$ such that
$$\forall e \in \mathcal{E},\ t \in \mathcal{T},\ x^S \in \mathcal{X}^S: \quad \tau^S_e(x^S, t) = \mu^S_t + \gamma^S_t x^S. \tag{20}$$

Under Assumption 4, we now present a testing method for the e-invariance hypothesis $H^{\mathrm{tr}}_{0,S}$ for a fixed set $S \subseteq \{1, \ldots, d\}$. Let $u_e \in \{0, 1\}^{1 \times \ell}$ and $v_t \in \{0, 1\}$ be the one-hot encodings of the environment $e \in \mathcal{E}^{\mathrm{tr}}$ and the treatment $t \in \mathcal{T}$, respectively, and let $\alpha \in \mathbb{R}^{1 \times (1+d)}$, $A \in \mathbb{R}^{\ell \times (1+d)}$, $\beta \in \mathbb{R}^{1 \times (1+|S|)}$ and $B \in \mathbb{R}^{\ell \times (1+|S|)}$ be model parameters. For notational convenience, we define $\tilde X_i := [1\ X_i^\top]^\top \in \mathbb{R}^{(1+d) \times 1}$ and $\tilde X^S_i := [1\ (X^S_i)^\top]^\top \in \mathbb{R}^{(1+|S|) \times 1}$. We consider the following (potentially misspecified) response model under treatment $t \in \mathcal{T}$ and environment $e \in \mathcal{E}^{\mathrm{tr}}$:
$$\underbrace{\alpha \tilde X + (u_e A) \tilde X}_{\text{main effect}} + \underbrace{(v_t \beta) \tilde X^S}_{\text{treatment effect}} + \underbrace{(v_t u_e B) \tilde X^S}_{\text{environment-treatment effect}}. \tag{21}$$
In this model, the CATE functions $\tau^S_e$ are identical across environments $e \in \mathcal{E}^{\mathrm{tr}}$ if and only if $B = 0$. Thus, testing (19) is equivalent to testing the null hypothesis $H_0: B = 0$.

The model proposed in (21) is more restrictive than Assumption 4, as it additionally requires the main effect to be linear. To avoid this requirement, we propose a testing methodology that explicitly allows for misspecification of the main effect: we employ the centered and weighted estimation method proposed by Boruvka et al. (2018), which uses a Neyman orthogonal score (Neyman, 1959, 1979). (A standard weighted least-squares approach with weights $1/\pi^{\mathrm{tr}}(T_i \mid X_i)$ may not yield a test with the correct asymptotic level for the null hypothesis $H_0$.) More precisely, we consider the following steps:

(i) Treatment centering: We center the treatment indicators $v_{T_i}$ by an arbitrary fixed policy $\tilde\pi$ that depends only on $X^S$ (i.e., $\tilde\pi \in \Pi^S$).
More precisely, we replace $v_{T_i}$ with $v_{T_i} - \tilde\pi(1 \mid X^S_i)$. As an example, one could consider a fixed random policy $\tilde\pi(t \mid x) := q^t (1-q)^{1-t}$ for some $q \in [0, 1]$.

(ii) Weighted least squares: We estimate the model parameters via a weighted least-squares approach. The weights are defined by $W_i := \tilde\pi(T_i \mid X^S_i) / \pi^{\mathrm{tr}}(T_i \mid X_i)$, where $\tilde\pi$ is the policy chosen in step (i) and $\pi^{\mathrm{tr}}$ is the initial policy.

The use of the above steps ensures that the estimator of the treatment effects remains consistent even if the main effect is misspecified (Boruvka et al., 2018) and allows us to obtain a test with pointwise asymptotic level, see Proposition 10. Formally, we employ a generalized method of moments estimator. Define
$$\zeta_i(\alpha, A, \beta, B) := \alpha \tilde X_i + (u_{e_i} A) \tilde X_i + \big(v_{T_i} - \tilde\pi(1 \mid X^S_i)\big)\big(\beta \tilde X^S_i + (u_{e_i} B) \tilde X^S_i\big)$$
and let $\dot\zeta_i$ denote the gradient of $\zeta_i$ with respect to the (vectorized) parameters $(\alpha, A, \beta, B)$. We then estimate $\hat\alpha, \hat A, \hat\beta, \hat B$ as the solutions to the estimating equations
$$\sum_{i=1}^{n} G_i(\alpha, A, \beta, B) = 0, \tag{22}$$
where $G_i(\alpha, A, \beta, B) := W_i \big[Y_i - \zeta_i(\alpha, A, \beta, B)\big]\, \dot\zeta_i$. Under additional regularity conditions (see Appendix A.5), we have, for a vectorized $B$, that $\sqrt{n}(\hat B - B) \xrightarrow{d} \mathcal{N}(0, V[B])$. This allows us to construct a hypothesis test for $H_0: B = 0$. To this end, we estimate $V[B]$ as follows. First, for all $i \in \{1, \ldots, n\}$ define $\hat G_i := G_i(\hat\alpha, \hat A, \hat\beta, \hat B) \in \mathbb{R}^{s+q}$ and $\hat J_i := J_i(\hat\alpha, \hat A, \hat\beta, \hat B) \in \mathbb{R}^{(s+q) \times (s+q)}$, where $J_i$ is the Jacobian of $G_i$, $s := (1+d) + \ell(1+d) + (1+|S|)$ is the number of parameters in $(\alpha, A, \beta)$ and $q := \ell(1+|S|)$ is the number of parameters in $B$. Then, $V[B]$ can be consistently estimated by the lower diagonal $q \times q$ block of the sandwich matrix
$$\Big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} \hat J_i\Big)^{-1} \Big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} \hat G_i \hat G_i^\top\Big) \Big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} \hat J_i\Big)^{-\top} \tag{23}$$
(Boruvka et al., 2018, Proposition 3.1). We can then apply a Wald test for the null hypothesis $H_0: B = 0$ based on the consistent estimator $\hat V$ of $V[B]$ (see, e.g., Boos et al. (2013)). When both $\tilde\pi$ and $\pi^{\mathrm{tr}}$ are given, the covariance estimate can be obtained using standard implementations (e.g., the Huber-White covariance estimator (Huber, 1967; White, 1980)). However, when either $\tilde\pi$ or $\pi^{\mathrm{tr}}$ is estimated, one needs to adjust the covariance estimator to incorporate the additional estimation error (see Supplement C in Boruvka et al. (2018)). The full testing procedure is given in Algorithm 1.

Algorithm 1 (Wald e-invariance test) Given: a training sample $D^{\mathrm{tr}}$ of size $n$, a subset $S \subseteq \{1, \ldots, d\}$ and a significance level $\alpha \in (0, 1)$. (i) Solve the estimating equation (22), compute the covariance estimator (23) and the test statistic $T_n := n\,\hat B^\top \hat V^{-1} \hat B$ (with $\hat B$ vectorized). (ii) Return $\psi^{\mathrm{Wd}}_n(D^{\mathrm{tr}}, S, \alpha) := \mathbb{1}(T_n > q_\alpha)$, where $q_\alpha$ is the $(1-\alpha)$-quantile of a chi-squared distribution with $\ell(1+|S|)$ degrees of freedom.

Proposition 10 shows that the above results carry over to our setting, in that the proposed procedure achieves pointwise asymptotic level for testing the e-invariance hypothesis $H^{\mathrm{tr}}_{0,S}$.

Proposition 10 Assume Setting 1 and Assumption 4. Let $S \subseteq \{1, \ldots, d\}$ be a subset of interest, $\alpha \in (0, 1)$ a significance level, and $\psi^{\mathrm{Wd}}_n(D^{\mathrm{tr}}, S, \alpha)$ the Wald invariance test detailed in Algorithm 1. Under some regularity conditions (see Appendix A.5), $\psi^{\mathrm{Wd}}_n(D^{\mathrm{tr}}, S, \alpha)$ has pointwise asymptotic level for testing the e-invariance hypothesis $H^{\mathrm{tr}}_{0,S}$, that is,
$$\sup_{P \in H^{\mathrm{tr}}_{0,S}} \limsup_{n \to \infty} P\big(\psi^{\mathrm{Wd}}_n(D^{\mathrm{tr}}, S, \alpha) = 1\big) \leq \alpha. \tag{24}$$

Proof The proof follows directly from (Boruvka et al., 2018, Proposition 3.1), see Appendix A.5.
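The sketch below is a simplified version of Algorithm 1 under assumptions that are ours: the centering policy is the constant $\tilde\pi(1 \mid x) = q = 0.5$, the initial policy $\pi^{\mathrm{tr}}$ is known, the environment dummies use a reference level (so the Wald test has $(\ell-1)(1+|S|)$ degrees of freedom rather than $\ell(1+|S|)$), and the covariance is the off-the-shelf HC0 sandwich for the weighted regression rather than the adjusted estimator of Boruvka et al. (2018).

```python
# Simplified sketch of the Wald e-invariance test (Algorithm 1); assumptions are ours.
import numpy as np
import statsmodels.api as sm

def wald_einv_test(Y, X, T, env, pi_tr_1, S, q=0.5):
    """Y: (n,) outcomes, X: (n,d) covariates, T: (n,) in {0,1}, env: (n,) labels,
    pi_tr_1: (n,) known initial-policy probabilities pi_tr(1|X_i), S: column list."""
    envs = np.unique(env)
    # Environment dummies with the first environment as the reference level.
    E = (env[:, None] == envs[None, 1:]).astype(float)
    Xt = sm.add_constant(X)                      # [1, X]   -> main effect (alpha)
    XS = sm.add_constant(X[:, S])                # [1, X^S]
    Tc = (T - q)[:, None]                        # centered treatment indicator
    blocks = [Xt]                                                     # alpha
    blocks += [E[:, [j]] * Xt for j in range(E.shape[1])]             # A
    blocks += [Tc * XS]                                               # beta
    blocks += [Tc * E[:, [j]] * XS for j in range(E.shape[1])]        # B
    design = np.hstack(blocks)
    W = np.where(T == 1, q / pi_tr_1, (1 - q) / (1 - pi_tr_1))        # pi_tilde / pi_tr
    fit = sm.WLS(Y, design, weights=W).fit(cov_type="HC0")
    k_B = E.shape[1] * XS.shape[1]               # number of env-by-treatment parameters
    R = np.zeros((k_B, design.shape[1]))
    R[:, -k_B:] = np.eye(k_B)                    # H0: B = 0
    res = fit.wald_test(R, use_f=False, scalar=True)
    return float(res.statistic), float(res.pvalue)
```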
4.2 Non-linear CATE functions

This section relaxes the assumption of linear CATEs (Assumption 4) and proposes a nonparametric approach for testing the e-invariance hypothesis $H^{\mathrm{tr}}_{0,S}$. The key idea is to employ a pseudo-outcome approach to estimate non-linear CATE functions (see (4)) and to apply a conditional mean independence test based on the pseudo-outcome. In particular, we consider the Doubly Robust (DR) learner due to Kennedy (2020).

For all $e \in \mathcal{E}^{\mathrm{tr}}$, let $\bar\mu_e : \mathcal{X} \times \mathcal{T} \to \mathbb{R}$ denote a model of the conditional expected outcome $\mathbb{E}^e[Y \mid X = \cdot, T = \cdot]$ and let $\bar\pi$ denote a model of the initial policy $\pi^{\mathrm{tr}}$. Assume $t_0 = 0$. We consider, for all $e \in \mathcal{E}^{\mathrm{tr}}$, $x \in \mathcal{X}$, $t \in \mathcal{T}$ and $y \in \mathcal{Y}$, the function
$$O_e(x, t, y) = \bar\mu_e(x, 1) - \bar\mu_e(x, 0) + \frac{\mathbb{1}(t = 1)\,\big(y - \bar\mu_e(x, 1)\big)}{\bar\pi(1 \mid x)} - \frac{\mathbb{1}(t = 0)\,\big(y - \bar\mu_e(x, 0)\big)}{1 - \bar\pi(1 \mid x)}, \tag{25}$$
and generate pseudo-outcomes by plugging in the observed data. The motivation for constructing the above pseudo-outcome is that, under Setting 1, the conditional mean of $O_e(X, T, Y)$ given $X^S$ is equal to the CATE function $\tau^S_e$ if at least one of the models $\bar\mu_e$ or $\bar\pi$ is correct. Formally, we have the following result.

Proposition 11 Assume Setting 1. Let $S \subseteq \{1, \ldots, d\}$, $e \in \mathcal{E}^{\mathrm{tr}}$, $\bar\pi \in \Pi$ and let $O_e(\cdot)$ be the pseudo-outcome defined in (25). Assume $t_0 = 0$. If, for all $x \in \mathcal{X}$ and $t \in \mathcal{T}$, $\bar\mu_e(x, t) = \mathbb{E}^{e,\pi^{\mathrm{tr}}}[Y \mid X = x, T = t]$, or, for all $x \in \mathcal{X}$ and $t \in \mathcal{T}$, $\bar\pi(t \mid x) = \pi^{\mathrm{tr}}(t \mid x)$, then we have for all $x \in \mathcal{X}^S$ that
$$\mathbb{E}^{e,\pi^{\mathrm{tr}}}[O_e(X, T, Y) \mid X^S = x] = \tau^S_e(x, 1). \tag{26}$$

Proof See Appendix A.6.

Under the assumptions of Proposition 11, it holds for all $S \subseteq \{1, \ldots, d\}$ that the null hypothesis $H^{\mathrm{tr}}_{0,S}$ is equivalent to
$$\forall e_1, e_2 \in \mathcal{E}^{\mathrm{tr}},\ \forall x \in \mathcal{X}^S: \quad \mathbb{E}^{e_1,\pi^{\mathrm{tr}}}[O_{e_1}(X, T, Y) \mid X^S = x] = \mathbb{E}^{e_2,\pi^{\mathrm{tr}}}[O_{e_2}(X, T, Y) \mid X^S = x]. \tag{27}$$
We can thus test for e-invariance by using an appropriate conditional mean independence test that has correct level under the null hypothesis (27). For example, one can use the generalised covariance measure⁷ (Shah and Peters, 2020; Scheidegger et al., 2022) or the projected covariance measure (Lundborg et al., 2022). We therefore propose the following steps to construct a non-parametric test for the e-invariance hypothesis $H^{\mathrm{tr}}_{0,S}$.

Algorithm 2 (DR-learner e-invariance test) Given: a training sample $D^{\mathrm{tr}}$ of size $n$, a subset of interest $S$, a significance level $\alpha$ and a conditional mean independence test $\phi$. Let $D_1 \subseteq D^{\mathrm{tr}}$ denote a random subsample of $D^{\mathrm{tr}}$ and $D_2 := D^{\mathrm{tr}} \setminus D_1$. (i) Fit the models $\bar\mu_e$ and $\bar\pi$ on the data $D_1$. (ii) Construct the pseudo-outcomes
$$O_i := O_{e_i}(X_i, T_i, Y_i) = \bar\mu_{e_i}(X_i, 1) - \bar\mu_{e_i}(X_i, 0) + T_i\,\frac{Y_i - \bar\mu_{e_i}(X_i, 1)}{\bar\pi(1 \mid X_i)} - (1 - T_i)\,\frac{Y_i - \bar\mu_{e_i}(X_i, 0)}{1 - \bar\pi(1 \mid X_i)}$$
for each observation $(X_i, T_i, Y_i) \in D_2$. (iii) Apply the test $\phi$ to the pseudo-outcomes $O_i$ and the observations in $D_2$ at significance level $\alpha$ and return the test result.

7. The generalised covariance measure (GCM) does not directly test for conditional mean independence. However, it preserves the level guarantees under the conditional mean independence null hypothesis. Specifically, consider a random vector $(A, B, C)$. It holds that $\mathbb{E}[A \mid B, C] = \mathbb{E}[A \mid C] \implies \mathrm{Cov}(A, B \mid C) = 0 \implies \mathbb{E}[\mathrm{Cov}(A, B \mid C)] = 0$, where the first equality is the conditional mean independence hypothesis and the last equality is the null hypothesis of the GCM test.

In practice, when using Algorithm 1 or Algorithm 2 to search for all e-invariant sets $\mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$, one needs to iterate over all subsets $S \subseteq \{1, \ldots, d\}$, which can be computationally challenging when $d$ is large. To mitigate this issue, as suggested in, e.g., Peters et al. (2016), Rojas-Carulla et al. (2018) and Saengkyongam et al. (2023), we can employ a variable screening method, such as Lasso regression (Tibshirani, 1994), to filter out variables that are not relevant for estimating the CATE function and apply Algorithm 1 or Algorithm 2 to the resulting set. A compact illustration of Algorithm 2 is sketched below.
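The following sketch illustrates Algorithm 2 for two environments under assumptions that are ours: random-forest nuisance models, and a simplified GCM-style final statistic that residualizes the pseudo-outcome and an environment indicator on $X^S$ and tests whether the mean product of the residuals is zero. The paper instead uses the weighted generalised covariance measure, and in practice the residualization step would use cross-fitting rather than in-sample predictions.

```python
# Sketch of the DR-learner e-invariance test (Algorithm 2); simplifications are ours.
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def dr_einv_test(Y, X, T, env, S, seed=0):
    rng = np.random.default_rng(seed)
    n = len(Y)
    idx = rng.permutation(n)
    D1, D2 = idx[: n // 2], idx[n // 2:]                 # sample splitting
    # (i) nuisance models on D1: outcome model per (environment, treatment) and
    #     a model of the initial policy pi_tr(1|x).
    mu = {}
    for e in np.unique(env):
        for t in (0, 1):
            m = (env == e) & (T == t)
            mu[e, t] = RandomForestRegressor(random_state=0).fit(X[D1][m[D1]], Y[D1][m[D1]])
    pi = RandomForestClassifier(random_state=0).fit(X[D1], T[D1])
    # (ii) pseudo-outcomes (25) on D2.
    X2, T2, Y2, e2 = X[D2], T[D2], Y[D2], env[D2]
    p1 = np.clip(pi.predict_proba(X2)[:, 1], 0.05, 0.95)
    mu1 = np.array([mu[e, 1].predict(x[None])[0] for e, x in zip(e2, X2)])
    mu0 = np.array([mu[e, 0].predict(x[None])[0] for e, x in zip(e2, X2)])
    O = mu1 - mu0 + T2 * (Y2 - mu1) / p1 - (1 - T2) * (Y2 - mu0) / (1 - p1)
    # (iii) GCM-style statistic: residualize O and the environment indicator on X^S
    #       (no cross-fitting here, so this is only an illustration of the idea).
    XS = X2[:, S]
    rO = O - RandomForestRegressor(random_state=0).fit(XS, O).predict(XS)
    Z = (e2 == np.unique(env)[1]).astype(float)
    rZ = Z - RandomForestRegressor(random_state=0).fit(XS, Z).predict(XS)
    prod = rO * rZ
    stat = np.sqrt(len(prod)) * prod.mean() / prod.std()
    return 2 * stats.norm.sf(abs(stat))                  # two-sided p-value
```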
5. Extension: Few-shot policy generalization through e-invariance

In the Zero-shot setting, the outcome is not observed in the test environment and, as shown in Theorem 9, relying on e-invariant covariates is optimal under certain assumptions. This is no longer true if, in the test environment, we have access to observations not only of the covariates but also of the corresponding outcomes obtained after using a test policy in the test environment. We may then want to adapt to the test environment while exploiting the e-invariance information gathered in the training environments. In this section, we illustrate how our method can be extended to such a setup (called few-shot generalization), where we observe a large number of training observations from the training environments and a small number of test observations (including the outcome) from the test environment.

Setting Few-shot Assume Setting 1 and that we are given $n \in \mathbb{N}$ training observations $D^{\mathrm{tr}} := (Y^{\mathrm{tr}}_i, X^{\mathrm{tr}}_i, T^{\mathrm{tr}}_i, \pi^{\mathrm{tr}}_i, e^{\mathrm{tr}}_i)_{i=1}^n$ from the observed environments $e^{\mathrm{tr}}_i \in \mathcal{E}^{\mathrm{tr}}$ and $m \in \mathbb{N}$ test observations $D^{\mathrm{tst}} := (Y^{\mathrm{tst}}_i, X^{\mathrm{tst}}_i, T^{\mathrm{tst}}_i, \pi^{\mathrm{tst}}_i)_{i=1}^m$ from a test environment $e^{\mathrm{tst}} \in \mathcal{E}$, and assume that $m \ll n$.

The goal of few-shot policy generalization is to find a policy $\pi \in \Pi$ that maximizes the expected outcome in the test environment $e^{\mathrm{tst}}$ by exploiting the common information shared between the training and test environments. We consider Assumption 2 as the commonality shared between the environments. In what follows, we propose a constrained optimization approach to learn a policy that aims to maximize the expected outcome in the test environment while exploiting the e-invariance condition.

An optimal policy $\pi^*_{\mathrm{tst}}$ in the test environment $e^{\mathrm{tst}}$ distributes all its mass on treatments which maximize the CATE in the test environment conditioned on the covariates $X$. That is, an optimal policy $\pi^*_{\mathrm{tst}}$ satisfies for all $x \in \mathcal{X}$ and $t \in \mathcal{T}$ that
$$\pi^*_{\mathrm{tst}}(t \mid x) > 0 \implies t \in \operatorname*{argmax}_{t' \in \mathcal{T}} \tau_{e^{\mathrm{tst}}}(x, t'). \tag{28}$$
Therefore, learning an optimal policy $\pi^*_{\mathrm{tst}}$ reduces to learning the CATE function $\tau_{e^{\mathrm{tst}}}$ in the test environment. As mentioned in Section 3.1, learning $\tau_{e^{\mathrm{tst}}}$ from observational data is a well-studied problem. Here, we abstract away from a specific method and assume that we are given a function class $\mathcal{H} \subseteq \{\tau \mid \tau : \mathcal{X} \times \mathcal{T} \to \mathbb{R}\}$ and a loss function $\ell : \mathcal{Y} \times \mathcal{X} \times \mathcal{T} \times \Pi \times \mathcal{H} \to \mathbb{R}$ such that $\hat\tau \in \operatorname{argmin}_{\tau \in \mathcal{H}} \sum_{i=1}^{m} \ell(Y^{\mathrm{tst}}_i, X^{\mathrm{tst}}_i, T^{\mathrm{tst}}_i, \pi^{\mathrm{tst}}_i, \tau)$ is a consistent estimator of $\tau_{e^{\mathrm{tst}}}$ as $m \to \infty$.

We now propose to leverage Assumption 2 when estimating $\tau_{e^{\mathrm{tst}}}$ in the test environment. In particular, by Assumption 2 we have for all $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$ and for any fixed $e \in \mathcal{E}^{\mathrm{tr}}$ that
$$\forall x \in \mathcal{X}^S,\ t \in \mathcal{T}: \quad \tau^S_e(x, t) = \mathbb{E}^{e^{\mathrm{tst}}}[\tau_{e^{\mathrm{tst}}}(X, t) \mid X^S = x]. \tag{29}$$
Let $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$ and, for all $x \in \mathcal{X}^S$, $t \in \mathcal{T}$ and $\tau \in \mathcal{H}$, define $h^S(\tau, x, t) := \mathbb{E}^{e^{\mathrm{tst}}}[\tau(X, t) \mid X^S = x]$ and $\tau^S_{\mathrm{tr}}(x, t) := \tau^S_e(x, t)$ (for an arbitrary $e \in \mathcal{E}^{\mathrm{tr}}$). We then consider the following constrained optimization:
$$\hat\tau^S \in \operatorname*{argmin}_{\tau} \sum_{i=1}^{m} \ell(Y^{\mathrm{tst}}_i, X^{\mathrm{tst}}_i, T^{\mathrm{tst}}_i, \pi^{\mathrm{tst}}_i, \tau) \quad \text{s.t.} \quad \tau \in \mathcal{H} \ \text{ and } \ \tau^S_{\mathrm{tr}}(\cdot, \cdot) \equiv h^S(\tau, \cdot, \cdot). \tag{30}$$
If there are multiple $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$ satisfying e-invariance, that is, $|\mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}| > 1$, one may choose an optimal set $S$ as in (14).
We now impose the following separability assumption on the CATE function $\tau_{e^{\mathrm{tst}}}$, which allows us to find a solution to the optimization problem (30).

Assumption 5 (Separability of CATEs) Let $e^{\mathrm{tst}} \in \mathcal{E}$ be a test environment, $S \in \mathcal{S}^{\text{e-inv}}_{\mathcal{E}^{\mathrm{tr}}}$ and $N := \{1, \ldots, d\} \setminus S$. There exist function classes $\mathcal{F} \subseteq \{\mathcal{X}^S \times \mathcal{T} \to \mathbb{R}\}$ and $\mathcal{G} \subseteq \{\mathcal{X}^N \times \mathcal{T} \to \mathbb{R}\}$ and a pair of functions $f \in \mathcal{F}$ and $g \in \mathcal{G}$ such that
$$\forall x \in \mathcal{X},\ t \in \mathcal{T}: \quad \tau_{e^{\mathrm{tst}}}(x, t) = f(x^S, t) + g(x^N, t). \tag{31}$$

Under Assumptions 2 and 5, there exist $f \in \mathcal{F}$ and $g \in \mathcal{G}$ such that for all $x \in \mathcal{X}^S$ and $t \in \mathcal{T}$
$$\tau^S_{\mathrm{tr}}(x, t) = \mathbb{E}^{e^{\mathrm{tst}}}[\tau_{e^{\mathrm{tst}}}(X, t) \mid X^S = x] = \mathbb{E}^{e^{\mathrm{tst}}}[f(X^S, t) + g(X^N, t) \mid X^S = x] = f(x, t) + \mathbb{E}^{e^{\mathrm{tst}}}[g(X^N, t) \mid X^S = x],$$
which is equivalent to
$$f(x, t) = \tau^S_{\mathrm{tr}}(x, t) - \mathbb{E}^{e^{\mathrm{tst}}}[g(X^N, t) \mid X^S = x]. \tag{32}$$
Combining (32) and (31), we then have for all $x \in \mathcal{X}$ and $t \in \mathcal{T}$ that
$$\tau_{e^{\mathrm{tst}}}(x, t) = \tau^S_{\mathrm{tr}}(x^S, t) - \mathbb{E}^{e^{\mathrm{tst}}}[g(X^N, t) \mid X^S = x^S] + g(x^N, t). \tag{33}$$
Instead of optimizing over the function class $\mathcal{H}$, we now optimize over the function class $\mathcal{G}$ by replacing $\tau$ in (30) with $\tau^S_g : (x, t) \mapsto \tau^S_{\mathrm{tr}}(x^S, t) - \mathbb{E}^{e^{\mathrm{tst}}}[g(X^N, t) \mid X^S = x^S] + g(x^N, t)$. More specifically, we consider the unconstrained optimization
$$\hat g \in \operatorname*{argmin}_{g \in \mathcal{G}} \sum_{i=1}^{m} \ell(Y^{\mathrm{tst}}_i, X^{\mathrm{tst}}_i, T^{\mathrm{tst}}_i, \pi^{\mathrm{tst}}_i, \tau^S_g). \tag{34}$$
Then, $\tau^S_{\hat g}$ is a solution to the constrained optimization (30).

In practice, we estimate the conditional expectation $\mathbb{E}^{e^{\mathrm{tst}}}[g(X^N, t) \mid X^S = \cdot\,]$ by an estimator $\hat q_{g,t}$. Intuitively, if the function class $\mathcal{G}$ (see Assumption 5) has a lower complexity than $\mathcal{H}$, and $\hat q_{g,t}$ has good finite-sample properties, one may expect an improvement (e.g., $\tau^S_{\hat g}$ has a lower variance) from this approach over an estimator that does not take the training sample into account. Without additional assumptions on $\mathcal{G}$, the optimization problem (34) requires the computation of $\hat q_{g,t}$ at each iteration (since $\hat q_{g,t}$ depends on $g$). In Appendix B, we present an example demonstrating that the optimization simplifies when imposing an additional assumption, such as linearity; a simple linear special case is also sketched below.
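The following sketch is our own illustration of (33)-(34) in a linear special case (it is not the construction in Appendix B): a binary treatment with $t_0 = 0$, $\tau_{e^{\mathrm{tst}}}(x, 1)$ linear in $x$, a squared-error loss against an inverse-probability-weighted pseudo-outcome on the small test sample, and $\tau^S_{\mathrm{tr}}$ taken as given (e.g., estimated accurately from the large training sample).

```python
# Linear special case of the few-shot reparameterization (33)-(34); assumptions are ours.
import numpy as np
from sklearn.linear_model import LinearRegression

def few_shot_cate(X_tst, T_tst, Y_tst, pi_tst_1, S, N, tau_S_tr):
    """tau_S_tr: callable mapping an (m,|S|) array to tau^S_tr(x_S, 1);
    pi_tst_1: P(T=1|X) under the known test policy; S, N: column index lists."""
    XS, XN = X_tst[:, S], X_tst[:, N]
    # IPW pseudo-outcome whose conditional mean given X is tau_tst(x, 1) (binary T).
    O = Y_tst * (T_tst / pi_tst_1 - (1 - T_tst) / (1 - pi_tst_1))
    # Estimate E[X_N | X_S] on the test covariates and form residuals x_N - E[X_N | X_S].
    cond = LinearRegression().fit(XS, XN)
    R = XN - cond.predict(XS)
    # Fit only the low-dimensional g-part: regress O - tau^S_tr(x_S) on the residuals.
    offset = tau_S_tr(XS)
    c = LinearRegression(fit_intercept=False).fit(R, O - offset).coef_
    # Few-shot CATE estimate via the reparameterization (33).
    def tau_hat(x):
        xs, xn = x[S], x[N]
        return tau_S_tr(xs.reshape(1, -1))[0] + c @ (xn - cond.predict(xs.reshape(1, -1))[0])
    return tau_hat
```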
6. Experiments

This section presents empirical experiments conducted on both simulated and real-world datasets. Firstly, we demonstrate through simulations that the testing methods introduced in Section 4 provide level guarantees that hold empirically in finite samples. Secondly, we demonstrate the effectiveness of our e-invariance approach in a semi-real-world case study of mobile health interventions, where it outperforms the baselines in terms of generalization to a new environment. The code for all experiments is available at https://github.com/sorawitj/effect-invariance.

6.1 Testing for e-invariance (simulated data)

We now conduct simulated experiments to validate the e-invariance tests proposed in Section 4. We generate datasets of size $n \in \{1000, 2000, 4000, 8000\}$ according to the SCM in Example 1 with two training environments $\mathcal{E}^{\mathrm{tr}} = \{0, 1\}$. Each of the noise variables $(\epsilon_{U_1}, \epsilon_{U_2}, \epsilon_{X_1}, \epsilon_{X_2}, \epsilon_{X_3}, \epsilon_T, \epsilon_Y)$ is independently drawn from a standard Gaussian distribution. The environment-specific parameters $(\gamma^1_e, \gamma^2_e, \gamma^3_e, \mu_e)$ are drawn independently from a uniform distribution on $[-3, 3]$. As for the initial policy, we consider a policy that depends on the full covariate set $\{X_1, X_2, X_3\}$. More precisely, for all $x \in \mathcal{X}$, the initial policy $\pi^{\mathrm{tr}}$ selects a treatment according to $\pi^{\mathrm{tr}}(T = 1 \mid X = x) = 1/\big(1 + e^{-(0.5 + x_1 - 0.5 x_2 + 0.3 x_3)}\big)$.

Moreover, we explore a scenario where the assumption of linear main effects in Equation (21) is violated. Specifically, we modify the structural assignment of $Y$ in Example 1 as
$$Y := T(1 + 0.5 X_2 + 0.5 U_1) + U_1 + X_2 + \mu_e + U_2 - 0.5 X_2 X_3 + X_3 + \epsilon_Y.$$
Lastly, we also consider a setting where the treatment effect itself is nonlinear. In this case, the structural assignment for $Y$ is defined as
$$Y := T\big(1 + 0.5 X_2^2 + 0.5 X_2^3 + 0.5 U_1\big) + U_1 + X_2 + \mu_e + U_2 - 0.5 X_2 X_3 + X_3 + \epsilon_Y.$$
We then conduct the Wald and DR-learner e-invariance tests (Wald test and DR test for short, respectively) for all candidate subsets according to Algorithm 1 and Algorithm 2, where we assume that the initial policy $\pi^{\mathrm{tr}}$ is given. For the DR test, we estimate the conditional mean function ($\bar\mu_e$) with a random forest (Breiman, 2001) and use the weighted generalised covariance measure (Shah and Peters, 2020; Scheidegger et al., 2022) as the final test $\phi$ in Algorithm 2.

Figure 2 reports the rejection rates at the 5% significance level for each candidate set under the various settings. Recall that in Example 1, $\{X_2\}$ is the only e-invariant set. The results indicate that, for finite sample sizes, both of the proposed methods hold the correct level at 5% in all settings (the rejection rates for the e-invariant set $\{X_2\}$ are approximately 5% in all settings), except in the bottom-left setting: here, the linear CATEs assumption (Assumption 4) is violated and the Wald test fails to maintain the correct level. When the linear main effect and treatment effect assumptions in (21) are specified correctly (top row), the Wald test shows superior performance compared to the DR test (that is, the Wald test rejects the non-e-invariant sets more often). When the linear main effect assumption is violated (middle row), the Wald test remains valid but its power drops significantly; nonetheless, it still performs slightly better than the DR test in terms of power in this setting.

Figure 2: Rejection rates (at the 5% significance level) of the proposed e-invariance tests from Section 4 for varying sample sizes $n \in \{1000, 2000, 4000, 8000\}$ and candidate sets $\{X_1\}$, $\{X_2\}$, $\{X_3\}$, $\{X_1, X_2\}$, $\{X_1, X_3\}$, $\{X_2, X_3\}$, $\{X_1, X_2, X_3\}$; the left column shows the Wald test and the right column the DR test. (Top) The main effect and treatment effect, see (21), are linear. (Middle) The main effect is nonlinear while the treatment effect is linear. (Bottom) Both the main effect and the treatment effect are nonlinear. In all settings, the DR-learner test achieves the correct level, i.e., the e-invariant set $\{X_2\}$ has a 5% rejection rate. Similarly, the Wald test keeps the rejection rate of the set $\{X_2\}$ at the 5% level in all scenarios except the bottom one, due to the violation of Assumption 4. As the sample size increases, all other sets are rejected with increasing empirical probability.

6.2 A case study using the Heart Steps V1 dataset

We apply our proposed approach to the study of a mobile health intervention for promoting physical activity called Heart Steps V1 (Klasnja et al., 2019). Heart Steps V1 was a 42-day micro-randomized trial with 37 adults that aimed to optimize the effectiveness of two intervention components for promoting physical activity.
One of the interventions was context-aware activity suggestions, delivered as push notifications, which aimed to encourage short bouts of walking throughout the day. Each participant was equipped with a wearable tracker linked to the mobile application, which gathered sensor data and contextual information about the user. This information was used to tailor the content of the activity suggestions that users received and to determine whether the user was available to receive an activity suggestion (e.g., if the sensor data indicated that the user was currently walking, they would not be sent a suggestion). The application randomized the delivery of activity suggestions up to five times a day at user-selected times spaced approximately 2.5 hours apart. If the contextual information indicated that the person was unavailable for the intervention, no suggestion was sent.

In this paper, we consider users as environments. We filter out users who had zero interactions with the application, resulting in a total of 27 users. For each user $u \in \{1, \ldots, 27\}$, we have the user's trajectory $(X_{u,1}, T_{u,1}, Y_{u,1}), \ldots, (X_{u,\ell_u}, T_{u,\ell_u}, Y_{u,\ell_u})$ of length $\ell_u$ (on average, $\ell_u$ is 160), where the covariates $X_{u,i}$ are the contextual information about the user at time step $i$, the treatment $T_{u,i} \in \{0, 1\}$ indicates whether an activity suggestion is delivered, and the outcome $Y_{u,i}$ is the log transformation of the 30-minute step count after the decision time. In this analysis, we make Assumption 4 and consider the following approximation for the conditional mean of $Y_{u,i}$:
$$\alpha_u^\top g(X_{u,i}) + \beta_u^\top f(X_{u,i})\, T_{u,i}, \tag{35}$$
where $g(X_{u,i})$ is a (known) baseline feature vector and $f(X_{u,i})$ is a (known) feature vector for the treatment effect. We allow the main effect to be misspecified. As for the feature vectors, we consider the same features (with minor modifications⁸) as in Liao et al. (2020); the vector $f(X_{u,i})$ contains Decision Bucket (DB) (bucketized decision time), Application Engagement (AE) (indicating how frequently users interact with the application), Location (LC) (indicating whether users are at home, at work or somewhere else) and Variation Indicator (VI) (the variation level of the step count 60 minutes around the current time slot over the past 7 days). The baseline vector $g(X_{u,i})$ contains $f(X_{u,i})$ along with the prior 30-minute step count, the previous day's total step count and the current temperature.

8. We replace the dosage variable with the bucketized decision time variable to account for potential nonlinear time dependency.

Since, for a given user, the outcome model (35) does not change over time, we can combine all users' trajectories and obtain combined observations under multiple environments; that is, we have the dataset $D := (X_i, T_i, Y_i, e_i)_{i=1}^n$, with $n = \sum_{u \in \{1, \ldots, 27\}} \ell_u$, collected from multiple environments (users) $\mathcal{E}$, where $e_i \in \mathcal{E}$ for all $i \in \{1, \ldots, n\}$. In particular, we do not account for potential temporal dependencies that are not captured by the bucketized decision times. In practice, one may allow for dependence across time in the observations $(X_{u,i}, T_{u,i}, Y_{u,i})_{i=1}^{\ell_u}$ within each user, which we leave for future work, see Section 7.

6.3 Inferring e-invariant sets (Heart Steps V1)

We begin our analysis of the Heart Steps V1 data by conducting the Wald e-invariance test detailed in Algorithm 1 to find subsets of the treatment-effect feature vector $f(X)$ that satisfy the e-invariance condition (5). As a comparison, we also apply the invariance test proposed in Peters et al. (2016, Method II), which tests for full invariance instead of our proposed e-invariance (see Figure 1).
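A hypothetical sketch of this subset search is shown below; it reuses the wald_einv_test sketch from Section 4.1 above, and the file name, column names and data layout are illustrative placeholders, not the actual Heart Steps V1 data format.

```python
# Hypothetical layout of the pooled analysis dataset and the subset search of Section 6.3;
# reuses wald_einv_test from the Section 4.1 sketch. Column names are illustrative only.
from itertools import combinations
import pandas as pd

f_features = ["DB", "AE", "LC", "VI"]            # treatment-effect features f(X)
df = pd.read_csv("heartsteps_pooled.csv")        # hypothetical pooled per-decision file
Y = df["log_steps_30min"].to_numpy()
T = df["suggestion_sent"].to_numpy()
env = df["user_id"].to_numpy()
X = df[f_features].to_numpy()
pi_tr_1 = df["rand_prob"].to_numpy()             # known micro-randomization probability

pvals = {}
for k in range(1, len(f_features) + 1):
    for subset in combinations(range(len(f_features)), k):
        _, p = wald_einv_test(Y, X, T, env, pi_tr_1, S=list(subset))
        pvals[tuple(f_features[j] for j in subset)] = p

# Candidate e-invariant sets: subsets not rejected at the 5% level.
einv_sets = [s for s, p in pvals.items() if p > 0.05]
```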
6.3 Inferring e-invariant sets (Heart Steps V1)

We begin our analysis of the Heart Steps V1 data by conducting the Wald e-invariance test detailed in Algorithm 1 to find subsets of the treatment effect feature vector f(X) that satisfy the e-invariance condition (5). As a comparison, we also apply the invariance test proposed in Peters et al. (2016, Method II), which tests for full invariance instead of our proposed e-invariance (see Figure 1).

[Figure 3, left panel omitted: log(p-values) of all subsets under the full-invariance and e-invariance hypotheses, distinguishing subsets that contain AE from those that do not.]

e-invariant set    p-value
{DB}               0.61
{DB, VI}           0.58
{VI}               0.53
{DB, LC}           0.09
{DB, LC, VI}       0.08

Figure 3: (Left) P-values for all subsets, considering the full- and e-invariance hypotheses. (Right) All five subsets for which we do not reject the e-invariance hypothesis.

Figure 3(Left) reports the p-values of all subsets for the full-invariance and e-invariance tests. The p-values for the full-invariance hypothesis are all below the 5% level, and hence no subset satisfies the full-invariance hypothesis. However, we find several subsets that satisfy the e-invariance condition (those with p-values of the e-invariance hypothesis greater than the 5% level). Interestingly, all subsets that contain Application Engagement (AE) have p-values close to zero, suggesting that AE is a variable that renders the conditional treatment effect unstable between environments if included in the model. We report all subsets for which we do not reject the e-invariance hypothesis at the 5% significance level in Figure 3(Right).

The above finding demonstrates that the relaxed notion of invariance that we propose can be beneficial in practice. The full-invariance condition may be too strict in that there is no full-invariant set. But if our goal is to learn a generalizable policy, it may suffice to test for the weaker notion of e-invariance, which the following section investigates using semi-real data.

6.4 Zero-shot generalization (augmented Heart Steps V1)

As the Heart Steps V1 study has been completed, it is not possible to implement and test a proposed policy on a new subject. In this section, we instead conduct a simulation study using the Heart Steps V1 data to illustrate the use of e-invariance for zero-shot generalization, see Section 3. To evaluate the performance of a policy, we consider leave-one-environment-out cross-validation. Specifically, we first choose e ∈ E as a test environment (user) and split the dataset D into the test set D^tst := {(X^tst_i, T^tst_i, Y^tst_i, e^tst_i)}_{i=1}^{n^tst} and the training set D^tr := {(X^tr_i, T^tr_i, Y^tr_i, e^tr_i)}_{i=1}^{n^tr}, where e^tst_i = e and e^tr_i ∈ E^tr := E \ {e} for all i. We then conduct the training and testing procedure as follows.

Training phase: Using the training data D^tr, we find all sets that are not rejected by the Wald e-invariance test detailed in Algorithm 1. Using the inferred e-invariant sets, we then compute an estimate of π^e-inv as discussed in Section 3.1, where we use the R-learner due to Nie and Wager (2021) as the CATE estimator, based on the implementation in the econml Python package (Battocchi et al., 2019). As a baseline, we include an optimal policy which utilizes all variables in f(X) (denoted as 'full-set'). This baseline is computed by pooling all data from the training environments and fitting the R-learner CATE estimator on the complete covariate set. Additionally, we include a uniformly random policy, denoted as 'random', as another baseline for comparison.
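Schematically, the training phase just described can be organized as in the following sketch. This is not the paper's implementation: `wald_einvariance_test` is a hypothetical stand-in for Algorithm 1 (returning a p-value), and the CATE estimate uses a simple T-learner-style plug-in rather than the R-learner via econml used in the experiments.

```python
from itertools import combinations

from sklearn.ensemble import RandomForestRegressor

def candidate_subsets(features):
    """All non-empty subsets of the treatment-effect features f(X)."""
    return [list(c) for r in range(1, len(features) + 1)
            for c in combinations(features, r)]

def fit_cate(train_df, cols):
    """Simplified plug-in CATE estimate on the pooled training environments,
    restricted to the covariates in `cols` (a T-learner-style stand-in for
    the R-learner used in the paper)."""
    treated = train_df["T"] == 1
    m1 = RandomForestRegressor().fit(train_df.loc[treated, cols], train_df.loc[treated, "Y"])
    m0 = RandomForestRegressor().fit(train_df.loc[~treated, cols], train_df.loc[~treated, "Y"])
    return lambda X: m1.predict(X[cols]) - m0.predict(X[cols])

def train_e_invariant_policies(train_df, f_cols, wald_einvariance_test, alpha=0.05):
    """Keep the subsets of f(X) not rejected by the e-invariance test and
    return one CATE-based policy per retained subset: treat whenever the
    estimated effect is positive."""
    retained = [S for S in candidate_subsets(f_cols)
                if wald_einvariance_test(train_df, S) > alpha]
    policies = {}
    for S in retained:
        cate = fit_cate(train_df, S)
        policies[tuple(S)] = lambda X, cate=cate: (cate(X) > 0).astype(int)
    return policies
```

For the binary treatment considered here, treating whenever the estimated conditional effect is positive corresponds to the argmax policy over T = {0, 1}.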
To illustrate this procedure for the test user e = 1, consider the set of training users E^tr = {2, . . . , 27}. Using the observations from E^tr, we apply Algorithm 1 to obtain the inferred e-invariant sets S^e-inv_{E^tr} = {{DB}, {DB, VI}, {VI}, {DB, LC, VI}}. For each S ∈ S^e-inv_{E^tr}, we then train a policy π̂_S as in (17) using the R-learner as the CATE estimator and choose an optimal Ŝ as in (18). We then use π̂_Ŝ as the final estimate of π^e-inv.

Testing phase: To perform policy evaluation, we create a semi-real test environment, following Liao et al. (2020). Given a test dataset D^tst, the value V(π) of a policy π ∈ Π is computed by the following procedure.
(1) Fit the regression model (35) on D^tst,
Y^tst_i = α^tst⊤ g(X^tst_i) + β^tst⊤ f(X^tst_i) T^tst_i + ε_i, (36)
and obtain pairs of covariates and residuals {(X^tst_i, ε̂_i)}_{i=1}^{n^tst} and parameters α̂^tst and β̂^tst.
(2) Generate more pairs to obtain a total of 1000 observations {(X̃^tst_i, ε̃_i)}_{i=1}^{1000} by uniformly sampling with replacement from the original pairs.
(3) For each i, the treatment T̃_i is selected based on the covariates X̃^tst_i according to π.
(4) For each i, the reward Ỹ^tst_i is defined by
Ỹ^tst_i = α̂^tst⊤ g(X̃^tst_i) + β̂^tst⊤ f(X̃^tst_i) T̃_i + ε̃_i, (37)
where the coefficients α̂^tst and β̂^tst are obtained from the regression model fitted in step (1).

The value is then given as the average reward V̂(π) = (1/1000) Σ_{i=1}^{1000} Ỹ^tst_i. The performance of a policy π is then computed as V̂(π) − V̂(π_0), where π_0 is the policy that always selects not to deliver a suggestion. This corresponds to an empirical version of the expected relative reward as in (9).

Figure 4(Left) shows the performance of different policies trained on the data available during training. Our proposed approach (e-inv) shows a slight improvement over the baseline approaches in terms of the mean and median performance over all users. Furthermore, as presented in Figure 4(Right), the e-invariance policy π^e-inv yields a higher relative reward than the policy that uses all variables in f(X) for the majority of users (17 out of 27). We use the Wilcoxon signed-rank test (Wilcoxon, 1945) to compare the performance of the proposed e-inv policy with that of the full-set policy; it yields a p-value of 0.008, indicating that the improvement is statistically significant.

[Figure 4 panels omitted: (Left) relative reward V̂(π) − V̂(π_0) of the e-inv, full-set and random policies; (Right) improvement of the e-inv policy over the full-set policy for each test user (sorted).]

Figure 4: (Left) The out-of-environment (here: out-of-user) performance of different policies, including the proposed e-invariance policy (e-inv), an optimal policy that uses all variables in f(X) (full-set), and a uniformly random policy (random). (Right) Comparing the out-of-environment performance between the e-inv policy and the full-set policy for each test user. For the majority of test users, the e-inv policy outperforms the baseline (p-value 0.008).

7. Conclusion and future work

This work addresses the challenge of adjusting for distribution shifts between environments in the context of policy learning. We propose an approach that leverages e-invariance, a relaxation of the full invariance assumption commonly used in the causal inference literature. We show that, despite being a weaker assumption, e-invariance is sufficient for building policies that generalize better to unseen environments than other policies; that is, under suitable assumptions, an optimal e-invariance policy is worst-case optimal. Additionally, we present a method for leveraging e-invariance information in the few-shot generalization setting, when a sample from the test environment is available.
To enable the practical use of e-invariance, we propose two testing procedures: one to test for e-invariance in linear model classes and one in nonlinear model classes. Moreover, we validate the effectiveness of our policy learning methods through a semi-real-world case study in the domain of mobile health interventions. Our experiments show that an optimal policy based on an e-invariant set outperforms policies that rely on the complete context information when it comes to generalizing to new environments.

There are several promising directions for future research. It might be worthwhile to develop e-invariance testing procedures that can handle more complex temporal dependencies, especially when the data are collected by adaptive algorithms such as contextual bandit algorithms. Existing works have proposed inference methods to handle such scenarios (e.g., Zhang et al., 2021a; Hadad et al., 2021), but how to incorporate these methods effectively into our framework remains an open question. Another interesting area of future work is how best to use the e-invariant set S (see (14)) in order to warm-start a contextual bandit algorithm. In the digital health field, one frequently conducts a series of optimization trials (each on a different set of users) in the process of optimizing a full digital health intervention. The data from each trial is used to inform the design of the subsequent trial. In the case of Heart Steps, three trials (V1, V2 and V3) were conducted, beginning with Heart Steps V1. Heart Steps V2 and V3 deployed a Bayesian Thompson sampling algorithm (Russo et al., 2018; Liao et al., 2020), which uses a prior distribution on the parameters to warm-start the algorithm. Clearly, knowledge of an optimal e-invariant set S should guide the formation of the prior. Determining the most effective approach to achieve this is still an open question.

Lastly, our work also contributes to the field of causal inference by introducing a relaxation of the full invariance assumption. We believe that there are other scenarios where the full invariance assumption is too restrictive, and a relaxation of the assumption may be sufficient to address the task at hand. Further investigating the potential for relaxation in different causal inference settings would be a promising future research direction.

Acknowledgments

We thank Eura Shin for providing the code used to preprocess the Heart Steps V1 dataset. During part of this project, SS and JP were supported by a research grant (18968) from VILLUM FONDEN. NP is supported by a research grant (0069071) from Novo Nordisk Fonden. SM's research is supported by the National Institutes of Health grants P50DA054039 and P41EB028242. PK is supported by the National Institutes of Health grants R01HL125440, U01CA229445 and R01LM013107.

References

M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Elias Bareinboim and Judea Pearl. Transportability from multiple environments with limited experiments: Completeness results. Advances in Neural Information Processing Systems, 27, 2014.

Keith Battocchi, Eleanor Dillon, Maggie Hei, Greg Lewis, Paul Oka, Miruna Oprescu, and Vasilis Syrgkanis. EconML: A Python package for ML-based heterogeneous treatment effects estimation. https://github.com/microsoft/EconML, 2019. Version 0.x.

S. Bongers, P. Forre, J. Peters, and J. M. Mooij. Foundations of structural causal models with cycles and latent variables.
Annals of Statistics, 49(5):2885–2915, 2021.

Dennis D. Boos, Leonard A. Stefanski, et al. Essential Statistical Inference. Springer, 2013.

Audrey Boruvka, Daniel Almirall, Katie Witkiewitz, and Susan A. Murphy. Assessing time-varying causal effect moderation in mobile health. Journal of the American Statistical Association, 113(523):1112–1121, 2018.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

Rune Christiansen, Niklas Pfister, Martin Emil Jakobsen, Nicola Gnecco, and Jonas Peters. A causal framework for distribution generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021.

Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9(8), 2008.

Philip Dawid. Decision-theoretic foundations for statistical causality. Journal of Causal Inference, 9(1):39–77, 2021.

Rick Durrett. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.

Zijian Guo and Peter Bühlmann. Two stage curvature identification with machine learning: Causal inference with possibly invalid instrumental variables. arXiv preprint arXiv:2203.12808, 2022.

T. Haavelmo. The probability approach in econometrics. Econometrica, 12:S1–S115 (supplement), 1944.

Vitor Hadad, David A. Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments. Proceedings of the National Academy of Sciences, 118(15):e2014602118, 2021.

C. Heinze-Deml, J. Peters, and N. Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2):1–35, 2018.

Peter J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability: Weather Modification, page 221. University of California Press, Berkeley, CA, USA, 1967.

M. Jakobsen and J. Peters. Distributional robustness of K-class estimators and the PULSE. The Econometrics Journal, 25(2):404–432, 2022.

Edward H. Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497, 2020.

Predrag Klasnja, Shawna Smith, Nicholas J. Seewald, Andy Lee, Kelly Hall, Brook Luers, Eric B. Hekler, and Susan A. Murphy. Efficacy of contextually tailored suggestions for physical activity: A micro-randomized optimization trial of HeartSteps. Annals of Behavioral Medicine, 53(6):573–582, 2019.

Peng Liao, Kristjan Greenewald, Predrag Klasnja, and Susan Murphy. Personalized HeartSteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22, 2020.

Anton Rask Lundborg, Ilmun Kim, Rajen D. Shah, and Richard J. Samworth. The projected covariance measure for assumption-lean variable significance testing. arXiv preprint arXiv:2211.02039, 2022.

S. Magliacane, T. van Ommen, T. Claassen, S. Bongers, P. Versteeg, and J. M. Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems 31 (NeurIPS), pages 10846–10856. Curran Associates, Inc., 2018.

Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation.
In Proceedings of the 30th International Conference on Machine Learning, pages 10–18. PMLR, 2013.

Jerzy Neyman. Optimal asymptotic tests of composite statistical hypotheses. Probability and Statistics, pages 416–444, 1959.

Jerzy Neyman. C(α) tests and their use. Sankhyā: The Indian Journal of Statistics, Series A, pages 1–21, 1979.

Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, USA, 2nd edition, 2009.

Judea Pearl and Elias Bareinboim. Transportability of causal and statistical relations: A formal approach. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.

J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference using invariant prediction: Identification and confidence intervals. Journal of the Royal Statistical Society: Series B (with discussion), 78(5):947–1012, 2016.

N. Pfister, P. Bühlmann, and J. Peters. Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114(527):1264–1276, 2018.

Niklas Pfister, Evan G. Williams, Jonas Peters, Ruedi Aebersold, and Peter Bühlmann. Stabilizing variable selection and regression. The Annals of Applied Statistics, 15(3):1220–1246, 2021.

Thomas S. Richardson and James M. Robins. Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Series, Working Paper 128, 2013.

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. The Journal of Machine Learning Research, 19(1):1309–1342, 2018.

Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann, and Jonas Peters. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society: Series B, 83(2):215–246, 2021.

Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

Sorawit Saengkyongam, Nikolaj Thams, Jonas Peters, and Niklas Pfister. Invariant policy learning: A causal perspective. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

Cyrill Scheidegger, Julia Hörrmann, and Peter Bühlmann. The weighted generalised covariance measure. Journal of Machine Learning Research, 23(273):1–68, 2022.

B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. M. Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML). Omnipress, 2012.

R. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. Annals of Statistics, 48(3):1514–1538, 2020.

Anoopkumar Sonar, Vincent Pacelli, and Anirudha Majumdar. Invariant policy optimization: Towards stronger generalization in reinforcement learning. In Learning for Dynamics and Control, pages 21–33. PMLR, 2021.

Adarsh Subbaswamy, Peter Schulam, and Suchi Saria. Preventing failures due to dataset shift: Learning predictive models that transport. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3118–3127. PMLR, 2019.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip Yu. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 2022.

Halbert White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, pages 817–838, 1980.

Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.

Amy Zhang, Clare Lyle, Shagun Sodhani, Angelos Filos, Marta Kwiatkowska, Joelle Pineau, Yarin Gal, and Doina Precup. Invariant causal prediction for block MDPs. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 11214–11224. PMLR, 2020.

Kelly Zhang, Lucas Janson, and Susan Murphy. Statistical inference with M-estimators on adaptively collected data. Advances in Neural Information Processing Systems, 34:7460–7471, 2021a.

Weijia Zhang, Jiuyong Li, and Lin Liu. A unified survey of treatment effect heterogeneity modelling and uplift modelling. ACM Computing Surveys (CSUR), 54(8):1–36, 2021b.

Appendix A. Proofs

A.1 Proof of Proposition 4

Proof We split the proof into three parts. First (Part 1), we show that the expected outcome function can be decomposed into an effect-modification term that depends on the treatment and a main-effect term that does not depend on the treatment. We then proceed and prove the 'only if' part of the main result in Part 2 and the 'if' part in Part 3.

Part 1: We show the following lemma.

Lemma 12 Assume Setting 1. Let S ⊆ {1, . . . , d} be an arbitrary subset and t_0 ∈ T be the baseline treatment. Then, there exists a pair of functions κ^S : X^S × T × E → R and ν^S : X^S × E → R such that

∀ e ∈ E, x ∈ X^S, t ∈ T : E_{e,π_t}[Y | X^S = x] = 1(t ≠ t_0) κ^S(x, t, e) + ν^S(x, e). (38)

Proof Fix e ∈ E and t ∈ T, and define δ^S(·, t, e) := E_{e,π_t}[Y | X^S = ·] and ν^S(·, e) := E_{e,π_{t_0}}[Y | X^S = ·]. It then holds for all x ∈ X^S that

E_{e,π_t}[Y | X^S = x] = 1(t ≠ t_0)(δ^S(x, t, e) − ν^S(x, e)) + ν^S(x, e).

We then define κ^S(·, t, e) := δ^S(·, t, e) − ν^S(·, e), which concludes the proof.

Part 2: Assume a subset S ⊆ {1, . . . , d} is e-invariant w.r.t. E′. Fix e_0 ∈ E′ as a reference environment. By Lemma 12, there exists a pair of functions κ^S : X^S × T × E → R and ν^S : X^S × E → R such that for all e ∈ E′, x ∈ X^S and t ∈ T,

τ^S_e(x, t) = (1(t ≠ t_0) κ^S(x, t, e) + ν^S(x, e)) − ν^S(x, e) = 1(t ≠ t_0) κ^S(x, t, e).

Next, we define the function ψ^S : X^S × T → R for all x ∈ X^S and t ∈ T by ψ^S(x, t) := κ^S(x, t, e_0). Now, since S is e-invariant w.r.t. E′, it holds for all e ∈ E′, x ∈ X^S and t ∈ T that

1(t ≠ t_0) ψ^S(x, t) = 1(t ≠ t_0) κ^S(x, t, e). (39)

Then, combining (38) and (39) implies that (6) is true.

Part 3: Assume (6) holds for a subset S ⊆ {1, . . . , d}. It then holds for all e, h ∈ E′, x ∈ X^S and t ∈ T that

τ^S_e(x, t) − τ^S_h(x, t) = (ψ^S(x, t) + ν^S(x, e) − ψ^S(x, t_0) − ν^S(x, e)) − (ψ^S(x, t) + ν^S(x, h) − ψ^S(x, t_0) − ν^S(x, h)) = 0,

which proves that S is e-invariant w.r.t. E′.

A.2 Proof of Proposition 6

From the SCM (7), we have for all x ∈ X^{PA_{f,X}} that

E_{e,π_t}[Y | X^{PA_{f,X}} = x] = E_e[f(X^{PA_{f,X}}, U^{PA_{f,U}}, t) | X^{PA_{f,X}} = x] + E_e[g_e(X, U, ε_Y) | X^{PA_{f,X}} = x].

Using the assumption (i) that U^{PA_{f,U}} ⊥⊥ X^{PA_{f,X}} in P^e_{X,U} for all e ∈ E′, we have

E_{e,π_t}[Y | X^{PA_{f,X}} = x] = E_e[f(x, U^{PA_{f,U}}, t)] + E_e[g_e(X, U, ε_Y) | X^{PA_{f,X}} = x], (40)

where a formal proof for the equality (40) is given, for example, in (Durrett, 2019, Example 4.1.7).
Next, using the assumption (ii) that P^e_{U^{PA_{f,U}}} are identical across e ∈ E′, we can drop the dependency on e from the component E_e[f(x, U^{PA_{f,U}}, t)] in (40) and have that

E_{e,π_t}[Y | X^{PA_{f,X}} = x] = E[f(x, U^{PA_{f,U}}, t)] + E_e[g_e(X, U, ε_Y) | X^{PA_{f,X}} = x]
= 1(t ≠ t_0) (E[f(x, U^{PA_{f,U}}, t)] − E[f(x, U^{PA_{f,U}}, t_0)]) + E[f(x, U^{PA_{f,U}}, t_0)] + E_e[g_e(X, U, ε_Y) | X^{PA_{f,X}} = x].

Defining ψ^S : (x, t) ↦ 1(t ≠ t_0)(E[f(x, U^{PA_{f,U}}, t)] − E[f(x, U^{PA_{f,U}}, t_0)]) and ν^S : (x, e) ↦ E[f(x, U^{PA_{f,U}}, t_0)] + E_e[g_e(X, U, ε_Y) | X^{PA_{f,X}} = x], we have that ψ^S(·, t_0) ≡ 0 and

E_{e,π_t}[Y | X^{PA_{f,X}} = x] = ψ^S(x, t) + ν^S(x, e).

Thus, by Proposition 4, PA_{f,X} is e-invariant w.r.t. E′.

A.3 Proof of Proposition 8

Proof We begin with the proof of the first statement (i) of Proposition 8. First, we show that the collections of policies (Π^S_opt)_{S ∈ S^{e-inv}_{E^tr}} are identifiable from Q^tr_i (for an arbitrary 1 ≤ i ≤ n). Fix (an arbitrary) f ∈ E^tr and S ∈ S^{e-inv}_{E^tr}. Let π^S ∈ Π^S_opt. Then, π^S satisfies, for all e ∈ E^tr,

π^S(t | x) > 0 ⟹ t ∈ argmax_{t′ ∈ T} τ^S_e(x^S, t′) = argmax_{t′ ∈ T} τ^S_f(x^S, t′). (41)

Thus, the identifiability of Π^S_opt depends on the identifiability of τ^S_f. Fix i ∈ {1, . . . , n}. Recall that in Setting 1 we assume ∀ x ∈ X, t ∈ T : π_i(t | x) > 0. It then holds for all x ∈ X^S and t ∈ T that

E_{e_i,π_t}[Y_i | X^S_i = x] = E_{e_i}[E_{e_i,π_t}[Y_i | X_i] | X^S_i = x]
(∗)= E_{e_i}[E_{e_i,π_i}[(1(T_i = t)/π_i(T_i | X_i)) Y_i | X_i] | X^S_i = x]
= E_{e_i,π_i}[(1(T_i = t)/π_i(t | X_i)) Y_i | X^S_i = x], (42)

where the equality (∗) holds by definition of π_t. Since the right-hand side of (42) is the expectation w.r.t. Q^tr_i, the quantity E_{e_i,π_t}[Y_i | X^S_i = x] is identifiable from Q^tr_i. Next, we have

E_{e_i,π_t}[Y_i | X^S_i = x] − E_{e_i,π_{t_0}}[Y_i | X^S_i = x] = E_{e_i,π_t}[Y | X^S = x] − E_{e_i,π_{t_0}}[Y | X^S = x] = τ^S_{e_i}(x, t) = τ^S_f(x, t), since S ∈ S^{e-inv}_{E^tr}. (43)

From (43), we then have that τ^S_f is identifiable from Q^tr_i and therefore Π^S_opt is identifiable from Q^tr_i. Consequently, the collection

A := argmax_{S ∈ S^{e-inv}_{E^tr}} E_{e^tst}[Σ_{t∈T} τ^S_f(X^S, t) π^S(t | X^S)]

is identifiable from Q^tr_i and Q^tst.

Next, we show the proof of the second statement (ii) of Proposition 8. Fix e^tst ∈ E and (an arbitrary) f ∈ E^tr, and let S′ ⊆ {1, . . . , d} be a subset that satisfies

S′ ∈ argmax_{S ∈ S^{e-inv}_{E^tr}} E_{e^tst}[Σ_{t∈T} τ^S_f(X^S, t) π^S(t | X^S)].

Next, we recall the definition Π^{e-inv}_opt = {π ∈ Π | ∃ S ∈ S^{e-inv}_{E^tr} s.t. π ∈ Π^S_opt} and let π*_{e^tst} ∈ argmax_{π ∈ Π^{e-inv}_opt} E_{e^tst,π}[Y]. Then, using that π*_{e^tst} ∈ Π^{e-inv}_opt, choose S̃ ∈ S^{e-inv}_{E^tr} such that π*_{e^tst} ∈ Π^{S̃}_opt. We have

E_{e^tst,π*_{e^tst}}[Y] − E_{e^tst,π_{t_0}}[Y]
= E_{e^tst}[Σ_{t∈T} (E_{e^tst,π_t}[Y | X^{S̃}] − E_{e^tst,π_{t_0}}[Y | X^{S̃}]) π*_{e^tst}(t | X^{S̃})]
= E_{e^tst}[Σ_{t∈T} τ^{S̃}_{e^tst}(X^{S̃}, t) π*_{e^tst}(t | X^{S̃})]
= E_{e^tst}[Σ_{t∈T} τ^{S̃}_f(X^{S̃}, t) π*_{e^tst}(t | X^{S̃})]   by Assumption 2
= E_{e^tst}[Σ_{t∈T} τ^{S̃}_f(X^{S̃}, t) π^{S̃}(t | X^{S̃})]   since π*_{e^tst} ∈ Π^{S̃}_opt
≤ max_{S ∈ S^{e-inv}_{E^tr}} E_{e^tst}[Σ_{t∈T} τ^S_f(X^S, t) π^S(t | X^S)]
= E_{e^tst}[Σ_{t∈T} τ^{S′}_f(X^{S′}, t) π^{S′}(t | X^{S′})]
= E_{e^tst}[Σ_{t∈T} τ^{S′}_{e^tst}(X^{S′}, t) π^{S′}(t | X^{S′})]   by Assumption 2
= E_{e^tst}[Σ_{t∈T} (E_{e^tst,π_t}[Y | X^{S′}] − E_{e^tst,π_{t_0}}[Y | X^{S′}]) π^{S′}(t | X^{S′})]
= E_{e^tst,π^{S′}}[Y] − E_{e^tst,π_{t_0}}[Y].

We therefore have that E_{e^tst,π*_{e^tst}}[Y] ≤ E_{e^tst,π^{S′}}[Y], which concludes the proof of the second statement (ii) of Proposition 8.

A.4 Proof of Theorem 9

Proof Let e^tst ∈ E be a test environment, π^{e-inv} be a policy satisfying (11) and, for all S ⊆ {1, . . . , d}, Π^S_opt be the set of policies satisfying (10). We now prove the first statement, see Theorem 9(i). By definition, there exists S* ∈ S^{e-inv}_{E^tr} such that π^{e-inv} ∈ Π^{S*}_opt. It then holds that

E_{e^tst,π^{e-inv}}[Y] − E_{e^tst,π_{t_0}}[Y] = E_{e^tst}[Σ_{t∈T} (E_{e^tst,π_t}[Y | X^{S*}] − E_{e^tst,π_{t_0}}[Y | X^{S*}]) π^{e-inv}(t | X^{S*})] = E_{e^tst}[Σ_{t∈T} τ^{S*}_{e^tst}(X^{S*}, t) π^{e-inv}(t | X^{S*})].

Fix e^tr ∈ E^tr.
We have

E_{e^tst,π^{e-inv}}[Y] − E_{e^tst,π_{t_0}}[Y]
= E_{e^tst}[Σ_{t∈T} τ^{S*}_{e^tst}(X^{S*}, t) π^{e-inv}(t | X^{S*})]
= E_{e^tst}[Σ_{t∈T} τ^{S*}_{e^tr}(X^{S*}, t) π^{e-inv}(t | X^{S*})]   by Assumption 2
= E_{e^tst}[max_{t∈T} τ^{S*}_{e^tr}(X^{S*}, t)]   by the definition of Π^{S*}_opt
= E_{e^tst}[max_{t∈T} τ^{S*}_{e^tst}(X^{S*}, t)]   by Assumption 2
≥ max_{t∈T} E_{e^tst}[τ^{S*}_{e^tst}(X^{S*}, t)]
= max_{t∈T} (E_{e^tst,π_t}[Y] − E_{e^tst,π_{t_0}}[Y])   by the tower property.

This implies

E_{e^tst,π^{e-inv}}[Y] ≥ max_{t∈T} E_{e^tst,π_t}[Y], (44)

which concludes the proof of Theorem 9(i).

Next, we prove the second statement, see Theorem 9(ii). Recall that [e^tst] := {e ∈ E | P^e_X = Q^tst_X}. From Assumption 3, there exists an environment f ∈ [e^tst] and S′ ∈ S^{e-inv}_E such that

∀ x ∈ X : max_{t∈T} τ_f(x, t) = max_{t∈T} τ^{S′}_f(x^{S′}, t). (45)

We have for all S ∈ S^{e-inv}_{E^tr} that

E_f[max_{t∈T} τ^{S′}_f(X^{S′}, t)] = E_f[max_{t∈T} τ_f(X, t)]   from (45)
= E_f[E_f[max_{t∈T} τ_f(X, t) | X^S]]
≥ E_f[max_{t∈T} E_f[τ_f(X, t) | X^S]]
= E_f[max_{t∈T} τ^S_f(X^S, t)]. (46)

Now, we have for all e ∈ [e^tst] and for all S ∈ S^{e-inv}_{E^tr},

E_e[max_{t∈T} τ^{S′}_e(X^{S′}, t)] = E_f[max_{t∈T} τ^{S′}_f(X^{S′}, t)]   by Assumption 2 and f ∈ [e^tst]
≥ E_f[max_{t∈T} τ^S_f(X^S, t)]   from (46)
= E_e[max_{t∈T} τ^S_e(X^S, t)]   by Assumption 2 and f ∈ [e^tst]. (47)

Next, we recall the definition Π^{e-inv}_opt = {π ∈ Π | ∃ S ∈ S^{e-inv}_{E^tr} s.t. π ∈ Π^S_opt} and that π^{e-inv} ∈ argmax_{π ∈ Π^{e-inv}_opt} E_{e^tst,π}[Y]. Then there exists S* ∈ S^{e-inv}_{E^tr} such that π^{e-inv} ∈ Π^{S*}_opt. We therefore have for all e ∈ [e^tst] that

E_{e,π^{e-inv}}[Y] − E_{e,π_{t_0}}[Y] = E_e[Σ_{t∈T} (E_{e,π_t}[Y | X^{S*}] − E_{e,π_{t_0}}[Y | X^{S*}]) π^{e-inv}(t | X^{S*})] = E_e[Σ_{t∈T} τ^{S*}_e(X^{S*}, t) π^{e-inv}(t | X^{S*})]. (48)

Fix e^tr ∈ E^tr; we have for all e ∈ [e^tst]

E_{e,π^{e-inv}}[Y] − E_{e,π_{t_0}}[Y] = E_e[Σ_{t∈T} τ^{S*}_{e^tr}(X^{S*}, t) π^{e-inv}(t | X^{S*})]   by (48) and Assumption 2
= E_e[max_{t∈T} τ^{S*}_{e^tr}(X^{S*}, t)]   by the definition of Π^{S*}_opt
= E_e[max_{t∈T} τ^{S*}_e(X^{S*}, t)]   by Assumption 2. (49)

Let π^{S′} ∈ Π^{S′}_opt; we then have that

E_{e^tst}[max_{t∈T} τ^{S*}_{e^tst}(X^{S*}, t)] = E_{e^tst,π^{e-inv}}[Y] − E_{e^tst,π_{t_0}}[Y]   from (49)
≥ E_{e^tst,π^{S′}}[Y] − E_{e^tst,π_{t_0}}[Y]
= E_{e^tst}[Σ_{t∈T} (E_{e^tst,π_t}[Y | X^{S′}] − E_{e^tst,π_{t_0}}[Y | X^{S′}]) π^{S′}(t | X^{S′})]
= E_{e^tst}[Σ_{t∈T} τ^{S′}_{e^tst}(X^{S′}, t) π^{S′}(t | X^{S′})]
= E_{e^tst}[Σ_{t∈T} τ^{S′}_{e^tr}(X^{S′}, t) π^{S′}(t | X^{S′})]   by Assumption 2
= E_{e^tst}[max_{t∈T} τ^{S′}_{e^tr}(X^{S′}, t)]   by the definition of Π^{S′}_opt
= E_{e^tst}[max_{t∈T} τ^{S′}_{e^tst}(X^{S′}, t)],   by Assumption 2 (50)

where the above inequality holds because π^{e-inv} ∈ argmax_{π ∈ Π^{e-inv}_opt} E_{e^tst,π}[Y] = argmax_{π ∈ Π^{e-inv}_opt} (E_{e^tst,π}[Y] − E_{e^tst,π_{t_0}}[Y]).

Combining the two inequalities (50) and (47), we then have that

E_{e^tst}[max_{t∈T} τ^{S*}_{e^tst}(X^{S*}, t)] = E_{e^tst}[max_{t∈T} τ^{S′}_{e^tst}(X^{S′}, t)]. (51)

Then, from (51) and since e^tst ∈ [e^tst] and S* and S′ are e-invariant w.r.t. E^tr ∪ [e^tst] (by Assumption 2), we have for all e ∈ [e^tst]

E_e[max_{t∈T} τ^{S*}_e(X^{S*}, t)] = E_e[max_{t∈T} τ^{S′}_e(X^{S′}, t)]. (52)

We are now ready to prove the main statement of Theorem 9(ii). We have

V_{[e^tst]}(π^{e-inv}) = inf_{e ∈ [e^tst]} (E_{e,π^{e-inv}}[Y] − E_{e,π_{t_0}}[Y]) = inf_{e ∈ [e^tst]} E_e[max_{t∈T} τ^{S*}_e(X^{S*}, t)]   from (49).

By the definition of [e^tst] and Assumption 2, we then have

V_{[e^tst]}(π^{e-inv}) = E_f[max_{t∈T} τ^{S*}_f(X^{S*}, t)] = E_f[max_{t∈T} τ^{S′}_f(X^{S′}, t)]   from (52). (53)

Finally, we show that for all π ∈ Π, V_{[e^tst]}(π) is bounded above by V_{[e^tst]}(π^{e-inv}):

V_{[e^tst]}(π) = inf_{e ∈ [e^tst]} (E_{e,π}[Y] − E_{e,π_{t_0}}[Y])
≤ E_{f,π}[Y] − E_{f,π_{t_0}}[Y]   since f ∈ [e^tst]
= E_f[Σ_{t∈T} (E_{f,π_t}[Y | X] − E_{f,π_{t_0}}[Y | X]) π(t | X)]
= E_f[Σ_{t∈T} τ_f(X, t) π(t | X)]
≤ E_f[max_{t∈T} τ_f(X, t)]
= E_f[max_{t∈T} τ^{S′}_f(X^{S′}, t)]   from (45)
= V_{[e^tst]}(π^{e-inv})   from (53),

which concludes the proof.

A.5 Proof of Proposition 10

Proof Let S ⊆ {1, . . . , d}, α ∈ (0, 1), X̃ := [1 X], X̃^S := [1 X^S], and let B̂ be the estimator solving the equation Σ_{i=1}^n G_i(α, A, β, B) = 0.
Assume Setting 1 and assume the following regularity conditions (these are similar to the ones required by Boruvka et al. (2018), with the difference that we require them to hold for all e ∈ E^tr).

Assumption 6 (Regularity conditions) For all e ∈ E^tr it holds that
(i) E_{e,π^tr}[Y^4] < ∞ and max_{j ∈ {1,...,d+1}} E_e[(X̃_j)^4] < ∞,
(ii) the matrix E_e[X̃^S (X̃^S)^⊤] and the weighted second-moment matrix appearing in the estimating equation (a block matrix built from Σ_{t∈T} π(t | X^S), the centered treatment indicator v_t − π(1 | X^S), the weights u_e and X̃) are invertible.

From Proposition 3.1 in Boruvka et al. (2018), we have that √n (B̂ − B) →_d N(0, V[B]), where V̂ defined in (23) is a consistent estimator of V[B]. Next, from Theorem 8.3 in Boos et al. (2013), we have that

T_n := n B̂^⊤ V̂^{−1} B̂ →_d χ²_{|vec(B)|}.

Let q_α be the (1 − α)-quantile of χ²_{|vec(B)|} and ψ^{Wd}_n(D^tr, S, α) := 1(T_n > q_α). We can then conclude that

sup_{P ∈ H^tr_{0,S}} lim sup_{n→∞} P(ψ^{Wd}_n(D^tr, S, α) = 1) ≤ α.

A.6 Proof of Proposition 11

Proof For all x ∈ X, t, w ∈ T = {0, 1}, y ∈ Y and e ∈ E, let

Z^e_w(x, t, y) := μ_e(x, w) + 1(t = w)(y − μ_e(x, w)) / π(w | x).

(This implies that O^e(X, T, Y) = Z^e_1(X, T, Y) − Z^e_0(X, T, Y).) We now show that, for all w ∈ T = {0, 1} and S ⊆ {1, . . . , d},

E_{e,π^tr}[Z^e_w(X, T, Y) | X^S] = E_{e,π_w}[Y | X^S]

if one of the models is correct.

(i) Assume that μ_e is correct, i.e., μ_e(x, t) = E_e[Y | X = x, T = t] for all x ∈ X, t ∈ T and e ∈ E^tr. We then have for all w ∈ T

E_{e,π^tr}[1(T = w)(Y − μ_e(X, w)) / π(w | X) | X^S]
= E_e[E_{e,π^tr}[1(T = w)(Y − μ_e(X, w)) / π(w | X) | X] | X^S]
= E_e[(π^tr(w | X) / π(w | X)) E_e[Y − μ_e(X, w) | X, T = w] | X^S]
= 0. (54)

Next, we have

E_{e,π^tr}[μ_e(X, w) | X^S] = E_e[E_e[Y | X, T = w] | X^S] = E_e[E_{e,π_w}[Y | X] | X^S] = E_{e,π_w}[Y | X^S]. (55)

Then, from (54) and (55), we have that E_{e,π^tr}[Z^e_w(X, T, Y) | X^S] = E_{e,π_w}[Y | X^S] and it thus holds for all x ∈ X^S that

E_{e,π^tr}[O^e(X, T, Y) | X^S = x] = E_{e,π^tr}[Z^e_1(X, T, Y) − Z^e_0(X, T, Y) | X^S = x] = τ^S_e(x, 1).

(ii) Assume that π is correct, i.e., π(t | x) = π^tr(t | x) for all x ∈ X, t ∈ T. We then have for all w ∈ T

E_{e,π^tr}[1(T = w)(Y − μ_e(X, w)) / π(w | X) | X^S]
= E_e[E_{e,π^tr}[1(T = w)(Y − μ_e(X, w)) / π^tr(w | X) | X] | X^S]
= E_e[E_e[Y − μ_e(X, w) | X, T = w] | X^S]
= E_e[E_{e,π_w}[Y − μ_e(X, w) | X] | X^S]
= E_{e,π_w}[Y − μ_e(X, w) | X^S]. (56)

Next, we have

E_{e,π^tr}[μ_e(X, w) | X^S] = E_{e,π_w}[μ_e(X, w) | X^S], (57)

since the expectation is only over X and does not depend on the treatment T. Then, from (56) and (57), we have that E_{e,π^tr}[Z^e_w(X, T, Y) | X^S] = E_{e,π_w}[Y | X^S] and it thus holds for all x ∈ X^S that

E_{e,π^tr}[O^e(X, T, Y) | X^S = x] = E_{e,π^tr}[Z^e_1(X, T, Y) − Z^e_0(X, T, Y) | X^S = x] = τ^S_e(x, 1).

Appendix B. Few-shot policy generalization in linear models

Example 2 (Few-shot policy generalization for linear CATE functions) Let S ∈ S^{e-inv}_{E^tr} and N := {1, . . . , d} \ S, and recall that T = {1, . . . , k}. We assume that H is a class of linear functions parameterized by Θ ⊆ R^{k×d}, i.e., H := {τ | ∃ θ ∈ Θ s.t. ∀ t ∈ T : τ(·, t) ≡ θ_t^⊤ (·)}. By the linearity of H, Assumption 5 is satisfied; that is, there exist θ^S ∈ Θ^S ⊆ R^{k×|S|} and θ^N ∈ Θ^N ⊆ R^{k×|N|} such that

∀ x ∈ X, t ∈ T : τ_{e^tst}(x, t) = (θ^S_t)^⊤ x^S + (θ^N_t)^⊤ x^N. (58)

Under Assumption 2, we then have

∀ x ∈ X^S, t ∈ T : (θ^S_t)^⊤ x = τ^S_tr(x, t) − (θ^N_t)^⊤ E_{e^tst}[X^N | X^S = x], (59)
∀ x ∈ X, t ∈ T : τ_{e^tst}(x, t) = τ^S_tr(x^S, t) + (θ^N_t)^⊤ (x^N − E_{e^tst}[X^N | X^S = x^S]). (60)

Next, let q̂ be an estimator of E_{e^tst}[X^N | X^S = ·] and, for all θ ∈ Θ^N, define τ^S_θ : (x, t) ↦ τ^S_tr(x^S, t) + θ_t^⊤ (x^N − q̂(x^S)). Importantly, the estimand we are estimating by q̂ does not change with θ^N. We then consider the unconstrained optimization

θ̂ ∈ argmin_{θ ∈ Θ^N} Σ_{i=1}^{n^tst} ℓ(Y^tst_i, X^tst_i, T^tst_i, π^tst_i, τ^S_θ). (61)
Here, by utilizing the e-invariance information τ^S_tr along with Assumption 2, we now optimize over the restricted function class Θ^N ⊆ Θ.
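As a numerical illustration of Example 2, the sketch below carries out the few-shot step for a binary treatment under simplifying assumptions that are not part of the example: the loss ℓ in (61) is replaced by a squared-error loss on an inverse-probability-weighted pseudo-outcome (whose conditional mean is the CATE), q̂ is estimated by linear regression, and the e-invariant effect τ^S_tr and the test-environment propensities are assumed to be given.

```python
from sklearn.linear_model import LinearRegression

def few_shot_cate(X_S, X_N, T, Y, prop, tau_S_tr):
    """Few-shot adjustment for a binary treatment (simplified Example 2).

    X_S, X_N : arrays holding the covariates in S and in N (the complement of S)
    T, Y     : treatments in {0, 1} and outcomes from the small test sample
    prop     : propensities pi^tst(1 | X) under which the sample was collected
    tau_S_tr : callable x_S -> tau^S_tr(x_S, 1), the e-invariant effect
               identified from the training environments
    Returns a callable (x_S, x_N) -> estimated CATE in the test environment.
    """
    # Step 1: estimate q(x^S) = E_etst[X^N | X^S = x^S] by regression.
    q_hat = LinearRegression().fit(X_S, X_N)

    # Step 2: IPW pseudo-outcome whose conditional mean is the test-environment
    # CATE (an assumption standing in for the paper's loss ell in (61)).
    pseudo = (T / prop - (1 - T) / (1 - prop)) * Y

    # Step 3: fit theta^N in (60) with tau^S_tr held fixed, by regressing the
    # residual pseudo-outcome on the centered complement covariates.
    resid = pseudo - tau_S_tr(X_S)
    centered = X_N - q_hat.predict(X_S)
    theta_N = LinearRegression(fit_intercept=False).fit(centered, resid)

    # Plug-in CATE estimate as in (60).
    return lambda x_S, x_N: tau_S_tr(x_S) + theta_N.predict(x_N - q_hat.predict(x_S))
```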