# Partial Transportability for Domain Generalization

Kasra Jalaldoust*, Alexis Bellot*†, Elias Bareinboim
Causal Artificial Intelligence Lab, Columbia University
{kasra, eb}@cs.columbia.edu, abellot95@gmail.com

*Equal contribution. †Now at Google DeepMind.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

**Abstract.** A fundamental task in AI is providing performance guarantees for predictions made in unseen domains. In practice, there can be substantial uncertainty about the distribution of new data, and corresponding variability in the performance of existing predictors. Building on the theory of partial identification and transportability, this paper introduces new results for bounding the value of a functional of the target distribution, such as the generalization error of a classifier, given data from source domains and assumptions about the data-generating mechanisms, encoded in causal diagrams. Our contribution is to provide the first general estimation technique for transportability problems, adapting existing parameterization schemes such as Neural Causal Models to encode the structural constraints necessary for cross-population inference. We demonstrate the expressiveness and consistency of this procedure, and further propose a gradient-based optimization scheme for making scalable inferences in practice. Our results are corroborated with experiments.

## 1 Introduction

In the empirical sciences, the value of scientific theories arguably depends on their ability to make predictions in a domain different from where the theory was initially learned. Understanding when and how a conclusion in one domain, such as a statistical association, can be generalized to a novel, unseen domain has taken a fundamental role in the philosophy of the biological and social sciences in the early 21st century. As Campbell and Stanley [8, p. 17] observed in an early discussion on the interpretation of statistical inferences, "generalization always turns out to involve generalization into a realm not represented in one's sample," where, in particular, statistical associations and distributions might differ, presenting a fundamental challenge.

As society transitions to become more AI-centric, many everyday tasks based on predictions are increasingly delegated to automated systems. Such developments make various parts of society more efficient, but they also require a notion of performance guarantee that is critical for the safety of AI, in which the problem of generalization appears in different forms. For instance, one critical task in the field is domain generalization, where one tries to learn a model (e.g., a classifier or regressor) on data sampled from a distribution that differs in several respects from the one expected when deploying the model in practice. In this context, generalization guarantees must build on knowledge or assumptions about the relatedness of the training and testing domains; for instance, if training and testing domains are arbitrarily different, no generalization guarantees can be expected from any predictor [12, 40]. The question becomes how to link the domains of data that are used to train a model (a.k.a. the source domains) to the domain where this model is deployed in practice (a.k.a. the target domain).

Figure 1: Illustration of the task of evaluating the generalization error of a model $h$. The mechanisms for $C$ and $W$ vary across domains.
To begin to answer this question, a popular type of assumption relating source and target domains is statistical in nature: invariances in the marginal or conditional distributions of some variables across the source and target distributions. Examples include the covariate shift and label shift assumptions, among others [35, 34]. Notably, generalization is justified by the stability and invariance of the causal mechanisms shared across the domains [14, 21], since the distributional/statistical invariances across the domains are consequences of the mechanistic/structural invariances governing the underlying data-generating process; the induced statistical invariances, once exploited correctly, can serve as a basis for generalizability. Broadly, invariance-based approaches to domain generalization [27, 29, 2, 40, 24, 20, 7, 6, 13] search for predictors that not only achieve small error on the source data but also maintain certain notions of distributional invariance across the source domains. Since these statistical invariances can be viewed as proxies for structural invariances, in certain instances generalization guarantees can be provided through causal reasoning [17, 31, 39].

This idea can be illustrated with Fig. 1. The value of each of the variables $\{C, Y, W, Z\}$ is determined as a stochastic function of the variables pointing to it, and these functions may differ across domains. The challenge is to evaluate the generalization risk of a model, e.g., $R_{P^*}(h) := \mathbb{E}_{P^*}[(Y - h)^2]$ for $h := h(C, W, Z) = \mathbb{E}_{P^1}[Y \mid C, W, Z]$, without observations from the target distribution $P^*$. General instances of this challenge have been studied under the rubric of the theory of causal transportability, where qualitative assumptions regarding the underlying structural causal models are encoded in a graphical object, and algorithms are designed to leverage these assumptions and compute certain statistical queries in the target domain in terms of the existing source data [26, 4, 5, 19, 11, 17]. Despite these advances, in practice the combination of source data and graphical assumptions is not always sufficient to identify (uniquely evaluate) the desired statistical query, e.g., the average loss of a given predictor in the target domain. In this case, the query is said to be non-transportable.³ For example, given Fig. 1, $R_{P^*}(h)$ is non-transportable for the classifier $h := h(C, W, Z)$.

In this paper, we study the fundamental task of computing tight upper bounds for statistical queries in a new, unseen domain. This allows us to assess the worst-case performance of prediction models for the domain generalization task. Our contributions are as follows:

- Sections 2 & 3. We develop the first general estimation technique for bounding the value of queries across multiple domains (e.g., the generalization risk) in non-transportable settings (Def. 4). Specifically, we extend the formulation of canonical models [3, 42] to encode the constraints necessary for solving the transportability task, and demonstrate their expressiveness for generating distributions entailed by the underlying Structural Causal Models (SCMs) (Thm. 1).
- Section 4. We adapt Neural Causal Models (NCMs) [41] for the transportability task via a parameter-sharing scheme (Thm. 2), similarly demonstrating their expressiveness and consistency for solving the partial transportability task. We then leverage the theoretical findings in Sections 2 & 3 to implement a gradient-based optimization algorithm for making scalable inferences (Alg. 1), as well as a Bayesian inference procedure.
- Finally, we introduce Causal Robust Optimization (CRO) (Alg. 2), an iterative method to find a predictor with the best worst-case risk.

**Preliminaries.** We use capital letters to denote variables ($X$), small letters for their values ($x$), bold letters for sets of variables ($\mathbf{X}$) and their values ($\mathbf{x}$), and $\mathrm{supp}$ to denote their domains of definition ($x \in \mathrm{supp}\,X$). A conditional independence statement in distribution $P$ is written as $(X \perp\!\!\!\perp Y \mid Z)_P$. A d-separation statement in a graph $\mathcal{G}$ is written as $(X \perp_d Y \mid Z)$. To denote $P(Y = y \mid X = x)$, we use the shorthand $P(y \mid x)$. The basic semantic framework of our analysis relies on Structural Causal Models (SCMs) [25, Definition 7.1.1], defined below.

³The notion of non-transportability formalizes a type of aleatoric uncertainty [16] arising from the inherent variability within compatible data-generating systems for the target domain. In particular, it cannot be explained away by increasing the sample size from the source domains.

**Definition 1.** An SCM $M$ is a tuple $M = \langle \mathbf{V}, \mathbf{U}, \mathcal{F}, P \rangle$ where each observed variable $V \in \mathbf{V}$ is a deterministic function of a subset of variables $\mathit{Pa}_V \subseteq \mathbf{V}$ and latent variables $\mathbf{U}_V \subseteq \mathbf{U}$, i.e., $v := f_V(\mathit{pa}_V, \mathbf{u}_V)$, $f_V \in \mathcal{F}$. Each latent variable $U \in \mathbf{U}$ is distributed according to a probability measure $P(u)$. We assume the model to be recursive, i.e., that there are no cyclic dependencies among the variables.

An SCM $M$ entails a probability distribution $P^M(\mathbf{v})$ over the set of observed variables $\mathbf{V}$ such that

$$P^M(\mathbf{v}) = \int \prod_{V \in \mathbf{V}} P^M(v \mid \mathit{pa}_V, \mathbf{u}_V)\, P(\mathbf{u})\, d\mathbf{u}, \tag{1}$$

where the term $P(v \mid \mathit{pa}_V, \mathbf{u}_V)$ corresponds to the function $f_V \in \mathcal{F}$ in the underlying structural causal model $M$. It also induces a causal diagram $\mathcal{G}^M$ in which each $V \in \mathbf{V}$ is associated with a vertex, and we draw a directed edge $V_i \to V_j$ if $V_i$ appears as an argument of $f_{V_j}$ in the SCM, and a bidirected edge $V_i \leftrightarrow V_j$ if $\mathbf{U}_{V_i} \cap \mathbf{U}_{V_j} \neq \emptyset$, that is, if $V_i$ and $V_j$ share an unobserved confounder. Throughout this paper, we assume the observational distributions entailed by the SCMs satisfy the positivity assumption, i.e., $P^M(\mathbf{v}) > 0$ for every $\mathbf{v}$. We also operate non-parametrically, i.e., making no assumption about the particular functional form or the distribution of the unobserved variables.

## 2 Risk Evaluation through Partial Transportability

In this section, we examine the challenges of the domain generalization problem through a causal lens, in particular the assessment of the average loss of a given classifier in the target domain. We study a system of variables $\mathbf{V}$ where $Y \in \mathbf{V}$ is a categorical outcome variable, and consider a classifier $h : \mathrm{supp}\,\mathbf{X} \to \mathrm{supp}\,Y$ mapping a set of covariates $\mathbf{X} \subseteq \mathbf{V}$ to the domain of the outcome. The setting of domain generalization is characterized by multiple domains, defined by SCMs $\mathcal{M} := \{M^1, \ldots, M^K, M^*\}$ that entail distributions $\mathcal{P} = \{P^{M^1}, \ldots, P^{M^K}\}$ and $P^{M^*}$ over $\mathbf{V}$. We are given a classifier $h : \mathrm{supp}\,\mathbf{X} \to \mathrm{supp}\,Y$, and the objective is to evaluate its risk in the target domain $M^*$, defined as

$$R_{P^*}(h) := \mathbb{E}_{P^*}[L(Y, h(\mathbf{X}))], \tag{2}$$

where $L : \mathrm{supp}\,Y \times \mathrm{supp}\,Y \to \mathbb{R}$ is a loss function. The following example illustrates these notions.

**Example 1 (Covariate shift).** A common instance of the domain generalization problem considers source and target domains $\mathcal{M} := \{M^1, M^*\}$ over $\mathbf{V} = \{X, Y\}$ and $\mathbf{U} = \{U_X, U_Y\}$, defined by

$$M^1 : \begin{cases} \mathcal{F}^1 : X \leftarrow f^1_X(U_X),\ \ Y \leftarrow f_Y(X, U_Y) \\ P^1(\mathbf{U}) = P^1(U_X)\, P(U_Y) \end{cases} \qquad M^* : \begin{cases} \mathcal{F}^* : X \leftarrow f^*_X(U_X),\ \ Y \leftarrow f_Y(X, U_Y) \\ P^*(\mathbf{U}) = P^*(U_X)\, P(U_Y) \end{cases}$$

Here, because $P^1(U_X) \neq P^*(U_X)$, Eq. 1 implies that the covariate distributions differ, i.e., $P^1(X) \neq P^*(X)$. Still, the label distribution conditional on covariates is invariant, i.e., $P^1(Y \mid X) = P^*(Y \mid X)$, which is known as the covariate shift setting. Accordingly, the risk of a classifier $h := h(x)$ can be written as

$$R_{P^*}(h) = \int_{\mathrm{supp}\,Y \times \mathrm{supp}\,X} L(y, h(x))\, P^*(y, x)\, dy\, dx = \int_{\mathrm{supp}\,Y \times \mathrm{supp}\,X} L(y, h(x))\, P^1(y \mid x)\, P^*(x)\, dy\, dx. \tag{3}$$
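To make Eq. (3) concrete, the following sketch (ours, not from the paper) estimates a target risk under covariate shift by importance-weighting source samples with the density ratio $P^*(x)/P^1(x)$. The particular distributions, rates, and classifier are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariate-shift instance: P(y | x) is shared across domains,
# while the covariate distribution P(x) changes (cf. Example 1).
def sample_domain(p_x1, n):
    x = rng.binomial(1, p_x1, size=n)            # X ~ Bern(p_x1), domain-specific
    p_y1 = np.where(x == 1, 0.9, 0.3)            # shared mechanism P(Y=1 | x)
    y = rng.binomial(1, p_y1)
    return x, y

x_src, y_src = sample_domain(p_x1=0.2, n=100_000)   # source data from P^1
h = lambda x: x                                     # classifier under evaluation

# Importance weights w(x) = P*(x) / P^1(x) for a hypothetical target P*(X=1)=0.7.
p1, ps = 0.2, 0.7
w = np.where(x_src == 1, ps / p1, (1 - ps) / (1 - p1))

risk_src = np.mean(y_src != h(x_src))               # risk under P^1
risk_tgt = np.mean(w * (y_src != h(x_src)))         # reweighted estimate of R_{P*}(h)
print(f"source risk ~ {risk_src:.3f}, estimated target risk ~ {risk_tgt:.3f}")
```

With these rates the source risk is about 0.26 while the estimated target risk is about 0.16, illustrating that the same classifier can carry different risks under a covariate shift even when $P(y \mid x)$ is invariant.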
We will consider the problem of quantifying the variation in $R_{P^*}(h)$ subject to variation in $P^*(x)$ that would be consistent with partial observations from these domains, e.g., samples from $P^1(X, Y)$, and with assumptions about the commonalities and discrepancies across the domains. To describe more general discrepancies in the mechanisms of the SCMs, we adapt the notions of domain discrepancy and selection diagram introduced in [19].

**Definition 2 (Domain discrepancy).** For SCMs $M^i, M^j$ ($i, j \in \{*, 1, 2, \ldots, K\}$) defined over $\mathbf{V}$, the domain discrepancy set $\Delta_{ij} \subseteq \mathbf{V}$ is defined such that for every $V \in \Delta_{ij}$ there might exist a discrepancy $f^{M^i}_V \neq f^{M^j}_V$ or $P^{M^i}(\mathbf{u}_V) \neq P^{M^j}(\mathbf{u}_V)$. For brevity, we denote $\Delta_{i*}$ by $\Delta_i$.

In words, if an endogenous variable $V$ is not in $\Delta_{ij}$, the mechanism for $V$ (i.e., the function $f_V$ and the distribution of exogenous variables $P(\mathbf{u}_V)$) is structurally invariant across $M^i, M^j$. What follows integrates the domain discrepancy sets into a generalization of causal diagrams that expresses qualitative assumptions about multiple SCMs [26, 11].

**Definition 3 (Selection diagram).** The selection diagram $\mathcal{G}^{\Delta_i}$ is constructed from $\mathcal{G}^i$ ($i \in \{1, 2, \ldots, K\}$) by adding the selection node $S_i$ to the vertex set, and adding the edge $S_i \to V$ for every $V \in \Delta_i$. The collection $\mathcal{G}^\Delta = \{\mathcal{G}^*\} \cup \{\mathcal{G}^{\Delta_i}\}_{i \in \{1, 2, \ldots, K\}}$ encodes the graphical assumptions. Whenever the causal diagram is shared across the domains, a single diagram can depict $\mathcal{G}^\Delta$.

Selection diagrams extend causal diagrams and provide a parsimonious graphical representation of the commonalities and disparities across a collection of SCMs. The following example illustrates these notions and highlights various subtleties in the generalization error of different predictors.

**Example 2 (Generalization performance of classifiers).** Consider the SCMs $M^i$ ($i \in \{1, 2, *\}$) over the binary variables $\mathbf{X} = \{C_1, C_2, \ldots, C_{10}\} \cup \{W, Z\}$ and $Y$, with exogenous distributions

$$\forall\, 1 \le j \le 10:\ U_{C_j} \sim \begin{cases} \mathrm{Bern}(0.1) & \text{if } i = 1 \\ \mathrm{Bern}(0.5) & \text{if } i = 2 \\ \mathrm{Bern}(0.7) & \text{if } i = * \end{cases} \qquad U_W \sim \begin{cases} \mathrm{Bern}(0.01) & \text{if } i = 1 \\ \mathrm{Bern}(0.02) & \text{if } i = 2 \\ \mathrm{Bern}(0.5) & \text{if } i = * \end{cases} \qquad U_{YW} \sim \mathrm{Bern}(0.2), \quad U_Z \sim \mathrm{Bern}(0.9),$$

and mechanisms

$$\mathcal{F}^i :\quad \mathbf{C} \leftarrow \mathbf{U_C}, \qquad Y \leftarrow U_{YW} \oplus \bigoplus_{C \in \mathbf{C}} C, \qquad W \leftarrow U_{YW} \oplus U_W, \qquad Z \leftarrow Y \cdot U_Z + W \cdot (1 - U_Z),$$

where $\oplus$ denotes the xor operator, i.e., $A \oplus B$ evaluates to 1 if $A \neq B$ and to 0 if $A = B$. Notice that the distributions of the exogenous noises associated with $C_{1:10}$ and $W$ differ across the domains. Consider three baseline classifiers $h_1(\mathbf{c}, w) := w \oplus \bigoplus_{c \in \mathbf{c}} c$, $h_2(\mathbf{c}) := \bigoplus_{c \in \mathbf{c}} c$, and $h_3(z) := z$, evaluated on data from $P^1, P^2, P^*$ with the symmetric loss function $L(Y, h(\mathbf{X})) = \mathbb{1}\{Y \neq h(\mathbf{X})\}$. Their errors are given in Table 1. Notice that $h_1$ has almost perfect accuracy on both source distributions, but does not generalize to $M^*$, as it uses the unstable feature $W$ and incurs roughly 50% loss. This observation indicates that mere minimization of the empirical risk might yield arbitrarily large risk in the unseen target domain. $h_2$ uses the features $\mathbf{C}$ that are the direct causes of $Y$, also known as the causal predictor [27, 2], and yields a stable loss of 20% across all domains. On the other hand, $h_3$ uses only $Z$, a descendant of $Y$, yet achieves a small loss across all domains since the mechanism of $Z$ is assumed invariant.

| Classifier | $R_{P^{M^1}}$ | $R_{P^{M^2}}$ | $R_{P^{M^*}}$ |
|---|---|---|---|
| $h_1(\mathbf{c}, w)$ | 1% | 4% | 49% |
| $h_2(\mathbf{c})$ | 20% | 20% | 20% |
| $h_3(z)$ | 3% | 5% | 4% |

Table 1: Classifiers in Example 2.
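As a sanity check on Example 2, the following self-contained simulation (our sketch, using the mechanisms as reconstructed above) estimates the three classifiers' risks in each domain by forward-sampling the SCMs. Small deviations from Table 1 are expected from Monte Carlo error and from any residual ambiguity in the recovered mechanisms.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Exogenous Bernoulli rates per domain (Example 2, as reconstructed above).
domains = {"M1": dict(c=0.1, w=0.01), "M2": dict(c=0.5, w=0.02), "M*": dict(c=0.7, w=0.5)}

def sample(p, n=N):
    C = rng.binomial(1, p["c"], size=(n, 10))      # C_j <- U_{C_j}
    u_yw = rng.binomial(1, 0.2, size=n)
    u_w = rng.binomial(1, p["w"], size=n)
    u_z = rng.binomial(1, 0.9, size=n)
    xor_c = C.sum(axis=1) % 2                      # xor of the ten causes
    Y = (u_yw + xor_c) % 2                         # Y <- U_YW xor (xor of C)
    W = (u_yw + u_w) % 2                           # W <- U_YW xor U_W
    Z = np.where(u_z == 1, Y, W)                   # Z <- Y*U_Z + W*(1 - U_Z)
    return W, Z, Y, xor_c

for name, p in domains.items():
    W, Z, Y, xor_c = sample(p)
    h1 = (W + xor_c) % 2                           # h1(c, w) = w xor (xor of c)
    h2 = xor_c                                     # h2(c) = xor of c
    h3 = Z                                         # h3(z) = z
    errs = [np.mean(h != Y) for h in (h1, h2, h3)]
    print(name, " ".join(f"h{k+1}: {e:.2%}" for k, e in enumerate(errs)))
```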
This observation is surprising, because $h_3$ is neither a causal predictor nor the minimizer of the empirical risk, yet it performs nearly optimally in all domains.

Example 2 illustrates potential challenges of the domain generalization problem, particularly the variation of the risk of classifiers across the source and target domains. The following definition introduces the problem of partial transportability, the main conceptual contribution of our paper. The objective is to bound a statistic of the target distribution using the data and assumptions available about related domains.

**Definition 4 (Partial transportability).** Consider a system of SCMs $\mathcal{M} := \{M^1, M^2, \ldots, M^K, M^*\}$ that induces the selection diagram $\mathcal{G}^\Delta$ over the variables $\mathbf{V}$ and entails the distributions $\mathcal{P} := \{P^1(\mathbf{v}), P^2(\mathbf{v}), \ldots, P^K(\mathbf{v})\}$ and $P^*(\mathbf{v})$. A functional $\psi : \Omega_{\mathbf{V}} \to \mathbb{R}$ is partially transportable from $\mathcal{P}$ given $\mathcal{G}^\Delta$ if

$$\mathbb{E}_{P^{M^0}}[\psi(\mathbf{V})] \le q^{\max} \quad \text{for all SCMs } M^0 \text{ that entail } \mathcal{P} \text{ and induce } \mathcal{G}^\Delta, \tag{4}$$

where $q^{\max} \in \mathbb{R}$ is a constant that can be obtained from $\mathcal{P}$ given $\mathcal{G}^\Delta$.

For instance, finding the worst-case performance of a classifier based on the source distributions given the selection diagram is a special case of partial transportability with $\psi(x, y) := L(y, h(x))$. In principle, this task is challenging, since the exogenous distribution $P^*(\mathbf{U}_V)$ and structural assignments $f_V$ of variables $V \in \mathbf{V}$ that do not match any of the source domains could be arbitrary. In the following section, we define a tractable parameterization of $\{P(\mathbf{U}), \mathcal{F}\}$ to derive a systematic approach to solving partial transportability tasks.

## 3 Canonical Models for Partial Transportability

We begin with an example to illustrate how one might parameterize a query such as $\mathbb{E}_{P^{M^*}}[\psi(\mathbf{V})]$, e.g., the generalization error, to consistently solve the partial transportability task.

**Example 3 (The bow model).** Let $\mathbf{X} := \{X\}$ be a single binary variable, and let $Y$ be a binary label. Consider two source domains defined by the following SCMs:

$$M^1 : \begin{cases} U_X \sim \mathrm{Bern}(0.2),\ U_Y \sim \mathrm{Bern}(0.05),\ U_{XY} \sim \mathrm{Bern}(0.95) \\ \mathcal{F}^1 : X \leftarrow U_X \oplus U_{XY},\quad Y \leftarrow (X \wedge U_{XY}) \oplus U_Y \end{cases}$$

$$M^2 : \begin{cases} U_X \sim \mathrm{Bern}(0.9),\ U_Y \sim \mathrm{Bern}(0.05),\ U_{XY} \sim \mathrm{Bern}(0.95) \\ \mathcal{F}^2 : X \leftarrow U_X \oplus U_{XY},\quad Y \leftarrow (X \wedge U_{XY}) \vee U_Y \end{cases}$$

The task is to evaluate the generalization error of the classifier $h(x) = x$. $h$ can be shown to be optimal in both source domains, achieving $R_{P^1}(h) = 0.11$ and $R_{P^2}(h) = 0.06$. However, it is unclear whether it generalizes well to a target domain $M^*$, given the domain discrepancy sets $\Delta_1 = \{X\}$, $\Delta_2 = \{Y\}$.

Figure 2: Selection diagram & canonical parameterization.

Balke and Pearl [3] derived a canonical parameterization of SCMs such as $\{M^1, M^2, M^*\}$ in Example 3. They showed that it is sufficient to parameterize $P(\mathbf{U})$ with correlated discrete latent variables $R_X, R_Y$, where $R_X$ determines the value of $X$, and $R_Y$ determines the function that decides $Y$ based on $X$. The causal diagrams are shown in Figure 2. Canonical SCMs entail the same set of distributions as the true underlying SCMs, i.e., they are equally expressive. In particular, Zhang et al. [42] showed that for every SCM $M$ there exists an SCM of the described form specified by only a distribution $P(r_X, r_Y)$, where $\mathrm{supp}\,R_X = \{0, 1\}$ and $\mathrm{supp}\,R_Y = \{y_0, y_1, y_x, y_{\neg x}\}$. The joint distribution $P(r_X, r_Y)$ can be parameterized by a vector in the 8-dimensional simplex, and entails all observational, interventional, and counterfactual distributions generated by the original SCM. The following definition by Zhang et al.
[42] provides a general formulation of canonical models.

**Definition 5 (Canonical SCM).** A canonical SCM is an SCM $N = \langle \mathbf{U}, \mathbf{V}, \mathcal{F}, P(\mathbf{U}) \rangle$ defined as follows. The set of endogenous variables $\mathbf{V}$ is discrete. The set of exogenous variables is $\mathbf{U} = \{R_V : V \in \mathbf{V}\}$, where $\mathrm{supp}\,R_V = \{1, \ldots, m_V\}$ and $m_V = |\{h_V : \mathrm{supp}\,\mathit{pa}_V \to \mathrm{supp}\,V\}|$. For each $V \in \mathbf{V}$, $f_V \in \mathcal{F}$ is defined as $f_V(\mathit{pa}_V, r_V) = h^{(r_V)}_V(\mathit{pa}_V)$.

**Example 3 (continued).** Consider extending the canonical parameterization to solve the partial transportability task by optimization. Each SCM $M^1, M^2, M^*$ is associated with a canonical SCM $N^1, N^2, N^*$ with exogenous variables $\{R_X, R_Y\}$ as above. The domain discrepancy sets indicate that certain causal mechanisms must match across pairs of the SCMs. For example, $\Delta_1 = \{X\}$ does not contain $Y$, which means that (1) the function $f_Y$ is the same across $M^1, M^*$, and (2) the distribution of the unobserved variables that are arguments of $f_Y$, namely $U_{XY}, U_Y$, remains the same across $M^1, M^*$. Imposing these equalities on the canonical parameterization is straightforward, as (1) the function $f_V$ is the same across all canonical SCMs by construction, and (2) the only unobserved variable pointing to variable $V$ is $R_V$ (for $V \in \{X, Y\}$). Following the selection diagram shown in Fig. 2a, $M^1, M^*$ agree on the mechanism of $Y$, which translates to the constraint $P^{N^1}(r_Y) = P^{N^*}(r_Y)$. Similarly, $M^2, M^*$ agree on the mechanism of $X$, which translates to the constraint $P^{N^2}(r_X) = P^{N^*}(r_X)$. Putting these together, the optimization problem below finds the upper bound on the risk $R_{P^*}(h)$ for the classifier $h(x) = x$:

$$\begin{aligned} \max_{N^1, N^2, N^*} \quad & P^{N^*}(Y \neq X) & (5) \\ \text{s.t.} \quad & P^{N^1}(r_Y) = P^{N^*}(r_Y),\ \ P^{N^2}(r_X) = P^{N^*}(r_X) & (Y \notin \Delta_1 \text{ and } X \notin \Delta_2) \\ & P^{N^1}(x, y) = P^1(x, y),\ \ P^{N^2}(x, y) = P^2(x, y) & \text{(matching source distributions)} \end{aligned}$$

Notably, the above optimization has a linear objective with linear equality constraints.

This example illustrates a more general strategy, in which probabilities induced by an SCM over discrete endogenous variables $\mathbf{V}$ may be generated by a canonical model. What follows is the main result of this section, providing a systematic approach to partial transportability using canonical models.

**Theorem 1 (Partial-TR with canonical models).** Consider the tuple of SCMs $\mathcal{M}$ that induces the selection diagram $\mathcal{G}^\Delta$ over the variables $\mathbf{V}$, entails the source distributions $\mathcal{P}$, and entails the target distribution $P^*$. Let $\psi : \Omega_{\mathbf{V}} \to \mathbb{R}$ be a functional of interest. Consider the following optimization scheme:

$$\begin{aligned} \max_{N^1, N^2, \ldots, N^*} \quad & \mathbb{E}_{P^{N^*}}[\psi(\mathbf{V})] \\ \text{s.t.} \quad & P^{N^i}(\mathbf{v}) = P^i(\mathbf{v}) \quad \forall i \in \{1, 2, \ldots, K\}, & (6) \\ & P^{N^i}(r_V) = P^{N^j}(r_V) \quad \forall i, j \in \{1, 2, \ldots, K, *\},\ \forall V \notin \Delta_{i,j}, \end{aligned}$$

where each $N^i$ is a canonical model characterized by a joint distribution over $\{R_V\}_{V \in \mathbf{V}}$. The value of the above optimization is a tight upper bound on the quantity $\mathbb{E}_{P^*}[\psi(\mathbf{V})]$ among all tuples of SCMs that induce $\mathcal{G}^\Delta$ and entail $\mathcal{P}$.

In words, this theorem states that one may tightly bound the value of a target quantity $\mathbb{E}_{P^*}[\psi(\mathbf{V})]$ by optimizing over the space of canonical models subject to the proposed constraints, without any loss of information. An implementation of Thm. 1 that approximates the worst-case error by making inferences on the posterior distribution of the target quantity is provided in Appendix A.
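For intuition, the program in Eq. (5) can be solved as a small linear program over the three 8-dimensional canonical parameter vectors. The sketch below (ours, not from the paper) does so with `scipy.optimize.linprog`; the source distributions $P^1(x, y), P^2(x, y)$ are hypothetical placeholders standing in for empirical frequencies estimated from the source datasets.

```python
import numpy as np
from scipy.optimize import linprog

# Canonical parameterization for the bow model (Example 3): r_X in {0, 1}
# selects the value of X; r_Y selects the function mapping X to Y. Each
# domain is a point in the 8-dim simplex. Variables: p = [p1 | p2 | p*].
R_X, R_Y = [0, 1], ["y0", "y1", "yx", "ynotx"]
f_Y = {"y0": lambda x: 0, "y1": lambda x: 1,
       "yx": lambda x: x, "ynotx": lambda x: 1 - x}
cells = [(rx, ry) for rx in R_X for ry in R_Y]           # 8 cells per domain
idx = {cell: k for k, cell in enumerate(cells)}
block = {"1": 0, "2": 8, "*": 16}                        # column offsets

# Hypothetical source distributions P^1(x,y), P^2(x,y), indexed by (x, y).
P1 = {(0, 0): 0.05, (0, 1): 0.18, (1, 0): 0.06, (1, 1): 0.71}
P2 = {(0, 0): 0.60, (0, 1): 0.03, (1, 0): 0.03, (1, 1): 0.34}

A_eq, b_eq = [], []
def add_constraint(coeffs, rhs):                         # coeffs: {column: coef}
    row = np.zeros(24)
    for col, coef in coeffs.items():
        row[col] = coef
    A_eq.append(row); b_eq.append(rhs)

for i, P in [("1", P1), ("2", P2)]:                      # match source dists
    for (x, y), prob in P.items():
        cols = {block[i] + idx[(rx, ry)]: 1.0
                for (rx, ry) in cells if rx == x and f_Y[ry](x) == y}
        add_constraint(cols, prob)
add_constraint({block["*"] + k: 1.0 for k in range(8)}, 1.0)  # simplex for N*

for ry in R_Y:                                           # P^{N1}(r_Y) = P^{N*}(r_Y)
    cols = {block["1"] + idx[(rx, ry)]: 1.0 for rx in R_X}
    cols.update({block["*"] + idx[(rx, ry)]: -1.0 for rx in R_X})
    add_constraint(cols, 0.0)
for rx in R_X:                                           # P^{N2}(r_X) = P^{N*}(r_X)
    cols = {block["2"] + idx[(rx, ry)]: 1.0 for ry in R_Y}
    cols.update({block["*"] + idx[(rx, ry)]: -1.0 for ry in R_Y})
    add_constraint(cols, 0.0)

# Objective: maximize P^{N*}(Y != X) for h(x) = x, i.e., total mass on cells
# where f_Y[r_Y](r_X) != r_X; linprog minimizes, so negate the coefficients.
c = np.zeros(24)
for (rx, ry) in cells:
    if f_Y[ry](rx) != rx:
        c[block["*"] + idx[(rx, ry)]] = -1.0

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=[(0, 1)] * 24)
print("worst-case risk upper bound:", -res.fun)
```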
## 4 Neural Causal Models for Partial Transportability

Figure 3: Selection diagram for Example 4.

In this section we consider inferences in more general settings by using neural networks as generative models, acting as proxies for the underlying SCMs $\mathcal{M}$, with the potential to scale to real-world, high-dimensional settings while preserving the validity and tightness of bounds. For this purpose, we consider Neural Causal Models [41] and adapt them for the partial transportability task. What follows is an instantiation of [41, Definition 7].

**Definition 6 (Neural Causal Model).** A Neural Causal Model (NCM) corresponding to the causal diagram $\mathcal{G}$ over the discrete variables $\mathbf{V}$ is an SCM defined by the exogenous variables

$$\mathbf{U} = \{U_{\mathbf{W}} \sim \mathrm{unif}(0, 1) : \mathbf{W} \subseteq \mathbf{V} \text{ s.t. } A \leftrightarrow B \in \mathcal{G},\ \forall A, B \in \mathbf{W}\}, \tag{7}$$

and the functional assignments $V \leftarrow f_{\theta_V}(\mathit{Pa}_V, \mathbf{U}_V)$, where $\mathbf{U}_V = \{U_{\mathbf{W}} \in \mathbf{U} : V \in \mathbf{W}\}$. The function $f_{\theta_V}$ is a feed-forward neural network parameterized by $\theta_V$ that outputs values in $\mathrm{supp}\,V$. The distribution entailed by an NCM is denoted by $P(\mathbf{v}; \theta)$, where $\theta = \{\theta_V\}_{V \in \mathbf{V}}$.

To illustrate how one might leverage this parameterization to define an instance of the partial transportability task, consider the following example.

**Example 4.** Let SCMs $M^1, M^2, M^*$ induce the selection diagram $\mathcal{G}^\Delta$ shown in Fig. 3 over the binary variables $\mathbf{X}, Y$, where $\mathbf{X} = \{C_1, C_2, Z_1, Z_2, W_1, \ldots, W_5\}$. Let $\theta^1, \theta^2, \theta^*$ be the parameters of NCMs constructed from the causal diagram in Fig. 3 (without the S-nodes). The objective is to constrain these parameters to simulate a compatible tuple of NCMs $M_{\theta^1}, M_{\theta^2}, M_{\theta^*}$ that equivalently entail $P^1(x, y), P^2(x, y)$ and induce $\mathcal{G}^\Delta$. For instance, the fact that $S_2$ does not point to $Y$ implies the invariances $f^*_Y = f^2_Y$ and $P^*(\mathbf{u}_Y) = P^2(\mathbf{u}_Y)$ for the true underlying SCMs. The same invariance may be enforced in the corresponding NCMs by relating the parameterizations of $M_{\theta^2}, M_{\theta^*}$, i.e., imposing $\theta^*_Y = \theta^2_Y$ for the neural network generating $Y$. Similarly, the observed data $D_1, D_2$ from the source distributions $P^1(x, y), P^2(x, y)$, respectively, impose constraints on the parameterization of the NCMs, as plausible models must satisfy $P(x, y; \theta^1) = P^1(x, y)$ and $P(x, y; \theta^2) = P^2(x, y)$. This may be enforced, for instance, by maximizing the likelihood of the data w.r.t. the NCM parameters: $\theta^i \in \arg\max_\theta \sum_{(x, y) \in D_i} \log P(x, y; \theta)$, for $i \in \{1, 2\}$. By extending this intuition to all constraints imposed by the selection diagram and data, we narrow the set of NCMs $M_{\theta^1}, M_{\theta^2}, M_{\theta^*}$ to a set that is compatible with our assumptions and data. Maximizing the risk $R_{P^*}(h)$ of a given prediction function in this class of constrained NCMs may then yield an informative upper bound.

Motivated by the observation in Example 4, we now show a more formal result (analogous to Thm. 1) guaranteeing that the solution to the partial transportability task in the space of constrained NCMs achieves a tight bound on a given target quantity $\mathbb{E}_{P^*}[\psi(\mathbf{V})]$.

**Theorem 2 (Partial-TR with NCMs).** Consider a tuple of SCMs $\mathcal{M}$ that induces $\mathcal{G}^\Delta$ and entails $\mathcal{P}$ and $P^*$ over the variables $\mathbf{V}$. Let $D_i \sim P^i(x, y)$ denote the samples drawn from the $i$-th source domain, and let $\theta^i$ denote the parameters of the NCM corresponding to $M^i \in \mathcal{M}$. Let $\mathbb{E}_{P^*}[\psi(\mathbf{V})]$ be the target quantity. The solution to the optimization problem

$$\begin{aligned} \hat{\Theta} \in \arg\max_{\Theta = \langle \theta^1, \theta^2, \ldots, \theta^K, \theta^* \rangle} \quad & \mathbb{E}_{P(\mathbf{v}; \theta^*)}[\psi(\mathbf{V})] & (8) \\ \text{s.t.} \quad & \theta^i_V = \theta^j_V \quad \forall i, j \in \{1, 2, \ldots, K, *\},\ \forall V \notin \Delta_{i,j}, \\ & \theta^i \in \arg\max_\theta \sum_{\mathbf{v} \in D_i} \log P(\mathbf{v}; \theta) \quad \forall i \in \{1, 2, \ldots, K\}, \end{aligned}$$

is a tuple of NCMs that induces $\mathcal{G}^\Delta$ and entails $\mathcal{P}$. In the large-sample limit, the solution yields a tight upper bound for $\mathbb{E}_{P^*}[\psi(\mathbf{V})]$.

Theorem 2 establishes the expressive power of NCMs for solving partial transportability tasks. This formulation is powerful because it enables the use of gradient-based optimization of neural networks for learning and, in principle, can scale to a large number of variables.
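As a concrete illustration of the parameter-sharing constraint $\theta^i_V = \theta^j_V$, the following PyTorch sketch (ours; module names, sizes, and the simplification of folding each variable's own exogenous noise into Bernoulli sampling are assumptions) instantiates three NCMs over a two-variable system $X \to Y$ with $X \leftrightarrow Y$, tying the $Y$-mechanism between domain 1 and the target and the $X$-mechanism between domain 2 and the target, as dictated by $\Delta_1 = \{X\}$ and $\Delta_2 = \{Y\}$ in Example 3. Training would fit the source modules by likelihood and adversarially maximize the target risk over the remaining free parameters.

```python
import torch
import torch.nn as nn

class Mechanism(nn.Module):
    """f_{theta_V}: maps (parents, shared exogenous noise) to P(V = 1 | .)."""
    def __init__(self, n_in):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())
    def forward(self, pa, u):
        return self.net(torch.cat([pa, u], dim=-1)).squeeze(-1)

# Parameter tying = module reuse: Delta_1 = {X} means f_Y is shared by
# domains 1 and *, and Delta_2 = {Y} means f_X is shared by domains 2 and *.
f_X = {"1": Mechanism(1), "2": Mechanism(1)}
f_Y = {"1": Mechanism(2), "2": Mechanism(2)}
f_X["*"], f_Y["*"] = f_X["2"], f_Y["1"]      # tied parameters

def sample(domain, n=512):
    """Forward-sample one NCM; u_xy is the noise shared via the X <-> Y edge.
    For simplicity, V's own exogenous noise is folded into Bernoulli sampling."""
    u_xy = torch.rand(n, 1)
    x = torch.bernoulli(f_X[domain](torch.zeros(n, 0), u_xy))
    y = torch.bernoulli(f_Y[domain](x.unsqueeze(-1), u_xy))
    return x, y

x, y = sample("*")   # draws from the (constrained) target-domain NCM
```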
### 4.1 Neural-TR: An Efficient Implementation

We can further improve the optimization of parameters by exploiting the separation between variables in the selection diagram. Rahman et al. [28], for instance, show that the NCM parameterization is modular w.r.t. the c-components of the causal diagram. We elaborate on this property and leverage it for more efficient partial transportability. In the following, we build towards an efficient algorithm for partial transportability using NCMs, first showing by example how a given target quantity $\mathbb{E}_{P^*}[\psi(\mathbf{V})]$ might be decomposed for more efficient learning.

**Example 4 (continued).** $P(x, y; \theta^*)$ in the objective of Eq. (8) may be decomposed as

$$P(x, y; \theta^*) = P(\underbrace{c_1, c_2, w_1, w_2}_{\mathbf{a}_1}; \theta^*_{\mathbf{A}_1})\ P(\underbrace{y, w_3}_{\mathbf{a}_2} \mid \underbrace{c_2, w_2}_{\mathbf{b}_2}; \theta^*_{\mathbf{A}_2})\ P(\underbrace{z_1, z_2, w_4, w_5}_{\mathbf{a}_3} \mid \underbrace{y, w_3}_{\mathbf{b}_3}; \theta^*_{\mathbf{A}_3}),$$

where the subsets $\mathbf{A}_1, \mathbf{A}_2, \mathbf{A}_3$ are the c-components of $\mathcal{G}$. Notice that $S_2$ does not point to any of the variables in $\mathbf{A}_2$, which means that their mechanism is shared across $M^2, M^*$, and therefore

$$P(\mathbf{a}_2 \mid \mathbf{b}_2; \theta^*_{\mathbf{A}_2}) = P(\mathbf{a}_2 \mid \mathbf{b}_2; \theta^2_{\mathbf{A}_2}) = P^2(\mathbf{a}_2 \mid \mathbf{b}_2). \tag{9}$$

This property is the basis of transportability algorithms [4, 10] and is known as the s-admissibility criterion [26], which allows us to deduce distributional invariances from structural invariances. By Eq. (9), we can replace the term $P(\mathbf{a}_2 \mid \mathbf{b}_2; \theta^*_{\mathbf{A}_2})$ in the objective with a probabilistic model $P(\mathbf{a}_2 \mid \mathbf{b}_2; \eta_2)$ trained on $D_2$ to approximate $P^2(\mathbf{a}_2 \mid \mathbf{b}_2)$, and plug it into the objective of Eq. (8) as a constant. As a consequence, the parameters $\theta^1_{\mathbf{A}_2}, \theta^2_{\mathbf{A}_2}, \theta^*_{\mathbf{A}_2}$ drop out of the partial transportability optimization problem. Similarly, since $S_1$ does not point to $\mathbf{A}_1$, we can substitute $P(\mathbf{a}_1; \theta^*)$ with $P(\mathbf{a}_1; \eta_1)$, pre-trained on data $D_1$. In the context of Example 4 and the evaluation of $R_{P^*}(h)$, the objective in Eq. (8) then simplifies to the substantially lighter optimization task

$$\begin{aligned} \max_{\theta^1_{\mathbf{A}_3}, \theta^2_{\mathbf{A}_3}, \theta^*_{\mathbf{A}_3}} \quad & \mathbb{E}_{\mathbf{A}_1 \sim P(\mathbf{a}_1; \eta_1)}\ \mathbb{E}_{\mathbf{A}_2 \sim P(\mathbf{a}_2 \mid \mathbf{b}_2; \eta_2)} \Big[ \sum_{\mathbf{a}_3} P(\mathbf{a}_3 \mid \mathbf{b}_3; \theta^*_{\mathbf{A}_3})\ \mathbb{1}\{h(\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3 \setminus \{y\}) \neq y\} \Big] & (10) \\ \text{s.t.} \quad & \theta^i_{\mathbf{A}_3} \in \arg\max_{\theta_{\mathbf{A}_3}} \sum_{(\mathbf{a}_3, \mathbf{b}_3) \in D_i} \log P(\mathbf{a}_3 \mid \mathbf{b}_3; \theta_{\mathbf{A}_3}), \quad i \in \{1, 2\}. \end{aligned}$$

In general, the parameter space of NCMs can be cleverly decoupled, and the computational cost of the optimization problem can be significantly reduced, since only a subset of the conditional distributions needs to be parameterized and optimized. This observation motivates Alg. 1, designed to exploit these insights. It proceeds by first decomposing the query, second, computing the identifiable components, and third, parameterizing the components that are not point-identifiable and running the NCM optimization routine. The following proposition demonstrates the correctness of this procedure.
**Algorithm 1** Neural-TR
Require: Source data $D_1, D_2, \ldots, D_K$; selection diagram $\mathcal{G}^\Delta$; functional $\psi : \Omega_{\mathbf{W}} \to [0, 1]$.
Ensure: Upper bound for $\mathbb{E}_{P^*}[\psi(\mathbf{W})]$.
1: $\{\mathbf{A}_j\}_{j=1}^m \leftarrow$ c-components of $\mathbf{A} := \mathit{An}_{\mathcal{G}}(\mathbf{W})$ in causal diagram $\mathcal{G}$.
2: $\Theta \leftarrow \emptyset$, $C_{\text{expert}} \leftarrow \emptyset$, $L_{\text{likelihood}} \leftarrow 0$
3: $P^*(\mathbf{w}) := \sum_{\mathbf{a} \setminus \mathbf{w}} \prod_{j=1}^m P^*(\mathbf{a}_j \mid do(\mathit{pa}_{\mathbf{A}_j}))$
4: for $j = 1$ to $m$ do
5: &nbsp;&nbsp;if $\exists i \in \{1, 2, \ldots, K\}$ such that $\mathbf{A}_j \cap \Delta_i = \emptyset$ then
6: &nbsp;&nbsp;&nbsp;&nbsp;$\eta^i_{\mathbf{A}_j} \leftarrow \arg\max_{\eta_{\mathbf{A}_j}} \sum_{(\mathbf{a}_j, \mathit{pa}_{\mathbf{A}_j}) \in D_i} \log P(\mathbf{a}_j \mid do(\mathit{pa}_{\mathbf{A}_j}); \eta_{\mathbf{A}_j})$
7: &nbsp;&nbsp;&nbsp;&nbsp;In $P^*(\mathbf{w})$, replace $P^*(\mathbf{a}_j \mid do(\mathit{pa}_{\mathbf{A}_j}))$ with $P(\mathbf{a}_j \mid do(\mathit{pa}_{\mathbf{A}_j}); \eta^i_{\mathbf{A}_j})$.
8: &nbsp;&nbsp;else
9: &nbsp;&nbsp;&nbsp;&nbsp;$\Theta \leftarrow \Theta \cup \{\theta^i_{\mathbf{A}_j}\}_{i \in \{1, 2, \ldots, K, *\}}$
10: &nbsp;&nbsp;&nbsp;&nbsp;In $P^*(\mathbf{w})$, replace $P^*(\mathbf{a}_j \mid do(\mathit{pa}_{\mathbf{A}_j}))$ with $P(\mathbf{a}_j \mid do(\mathit{pa}_{\mathbf{A}_j}); \theta^*_{\mathbf{A}_j})$.
11: &nbsp;&nbsp;&nbsp;&nbsp;$C_{\text{expert}} \leftarrow C_{\text{expert}} \cup \{\{\theta^i_V = \theta^*_V\}_{V \in \mathbf{A}_j \setminus \Delta_i}\}_{i=1}^K$
12: &nbsp;&nbsp;&nbsp;&nbsp;$L_{\text{likelihood}} \leftarrow L_{\text{likelihood}} + \sum_{(\mathbf{a}_j, \mathit{pa}_{\mathbf{A}_j}) \in D_i} \log P(\mathbf{a}_j \mid do(\mathit{pa}_{\mathbf{A}_j}); \theta^i_{\mathbf{A}_j})$
13: &nbsp;&nbsp;end if
14: end for
15: Return $\hat{\Theta} \leftarrow \arg\max_\Theta \sum_{\mathbf{w}} P^*(\mathbf{w}; \Theta)\, \psi(\mathbf{w}) + \Lambda \cdot L_{\text{likelihood}}(\Theta)$ subject to $C_{\text{expert}}$

**Proposition 1.** Neural-TR (Algorithm 1) computes a tuple of NCMs compatible with the source data and graphical assumptions that yields the upper bound for $\mathbb{E}_{P^*}[\psi(\mathbf{W})]$ in the large-sample limit.

This result may be understood as an enhancement of Thm. 2 in which the factors that are readily transportable from source data are handled in a pre-processing step. This hybrid approach is especially useful when researchers have pre-trained probabilistic models with arbitrary architectures that they can use off the shelf, avoiding unnecessary computation.

### 4.2 Neural-TR for the Optimization of Classifiers

The Neural-TR algorithm can be viewed as an adversarial domain generator: it takes a classifier $h(\mathbf{z})$ as input and parameterizes a collection of SCMs to find a plausible target domain, namely $\hat{\theta}^*$, that yields the worst-case risk for the given classifier. By flipping $h(\mathbf{z})$ for some $\mathbf{z} \in \Omega_{\mathbf{Z}}$, we can reduce the risk of $h$ under $\hat{\theta}^*$. Interestingly, we can exploit Neural-TR to generate adversarial data for a given classifier and introduce an iterative procedure that progressively trains classifiers with minimum risk upper bound. Algorithm 2 describes this approach.

**Algorithm 2** CRO (Causal Robust Optimization)
Require: $\mathcal{D} := \langle D_1, D_2, \ldots, D_K \rangle$; $\mathcal{G}^\Delta$; $\delta > 0$.
Ensure: $h(\mathbf{X})$ with the best worst-case risk.
1: Initialize $h$ randomly and $\mathcal{D}^* \leftarrow \emptyset$
2: $\hat{\Theta} \leftarrow$ Neural-TR$(\mathcal{D}, \mathcal{G}^\Delta, \psi := L(h(\mathbf{x}), y))$
3: while $R_{P(\mathbf{x}, y; \hat{\theta}^*)}(h) - \max_{D \in \mathcal{D}^*} R_D(h) > \delta$ do
4: &nbsp;&nbsp;$\mathcal{D}^* \leftarrow \mathcal{D}^* \cup \{D^* \sim P(\mathbf{x}, y; \hat{\theta}^*)\}$
5: &nbsp;&nbsp;$h \leftarrow \arg\min_h \max_{D \in \mathcal{D}^*} R_D(h)$
6: &nbsp;&nbsp;$\hat{\Theta} \leftarrow$ Neural-TR$(\mathcal{D}, \mathcal{G}^\Delta, \psi := L(h(\mathbf{x}), y))$
7: end while
8: Return $h$

At each iteration, CRO uses Neural-TR as a subroutine to obtain an adversarially designed NCM $\hat{\theta}^*$ that yields the worst-case risk for the classifier at hand. Next, it draws data $D^*$ from this NCM and adds it to a collection of datasets $\mathcal{D}^*$. Finally, it updates the classifier to be robust to the collection $\mathcal{D}^*$ by minimizing the maximum of the empirical risk $R_D(h) := \frac{1}{|D|}\sum_{(\mathbf{x}, y) \in D} L(y, h(\mathbf{x}))$ across all $D \in \mathcal{D}^*$. We repeat this process until the upper bound on the risk converges. The following result establishes the optimality of CRO for domain generalization; further discussion is provided in Appendix C.2.

**Theorem 3 (Domain generalization with CRO).** Algorithm 2 returns a worst-case optimal solution:

$$\text{CRO}(\mathcal{D}, \mathcal{G}^\Delta) \in \arg\min_{h : \Omega_{\mathbf{X}} \to \Omega_Y}\ \max_{\substack{\text{SCMs } M^0 \text{ that entail } \mathcal{P} \\ \text{and induce } \mathcal{G}^\Delta}} R_{P^{M^0}}(h). \tag{11}$$

In words, Thm. 3 states that the classifier returned by CRO, in the large-sample limit, minimizes the worst-case risk in the target domain subject to the constraints entailed by the available data and induced by the structural assumptions.
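The following self-contained toy (ours; the interval-valued constraint set, rates, and helper names are hypothetical stand-ins, with `worst_case_model` playing the role of Neural-TR on a fully discrete problem) makes the control flow of Alg. 2 explicit: adversarial model, data generation, minimax refit, and the convergence test of line 3.

```python
import numpy as np

rng = np.random.default_rng(0)
P_X = np.array([0.5, 0.5])                 # target covariate distribution (fixed)
BOUNDS = {0: (0.1, 0.3), 1: (0.55, 0.95)}  # hypothetical bounds on P*(Y=1 | x)

def worst_case_model(h):
    """Neural-TR stand-in: choose P*(Y=1|x) within the allowed interval to
    maximize the 0-1 risk of h (the maximum sits at an interval endpoint)."""
    return {x: (hi if h[x] == 0 else lo) for x, (lo, hi) in BOUNDS.items()}

def risk(h, p):
    return sum(P_X[x] * (p[x] if h[x] == 0 else 1 - p[x]) for x in (0, 1))

def sample_dataset(p, n=2000):
    x = rng.integers(0, 2, size=n)
    y = (rng.random(n) < np.array([p[0], p[1]])[x]).astype(int)
    return x, y

def fit_minimax(datasets):
    """Line 5: pick the labeling minimizing the max empirical risk."""
    scores = {}
    for cand in [(a, b) for a in (0, 1) for b in (0, 1)]:
        scores[cand] = max(np.mean(y != np.array(cand)[x]) for x, y in datasets)
    return dict(zip((0, 1), min(scores, key=scores.get)))

h = {0: int(rng.integers(0, 2)), 1: int(rng.integers(0, 2))}   # line 1
datasets, delta = [], 0.01
for _ in range(10):
    p_adv = worst_case_model(h)                                # lines 2/6
    seen = max((np.mean(y != np.array([h[0], h[1]])[x]) for x, y in datasets),
               default=0.0)
    if risk(h, p_adv) - seen <= delta:                         # line 3
        break
    datasets.append(sample_dataset(p_adv))                     # line 4
    h = fit_minimax(datasets)                                  # line 5
print("CRO classifier:", h, "worst-case risk:", risk(h, worst_case_model(h)))
```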
Figure 4: (a-c): Worst-case risk evaluation results as a function of Neural-TR (Alg. 1) training iterations. (d, e): Worst-case risk evaluation of CRO, for Examples 2 and 3 respectively.

## 5 Experiments

This section illustrates Algs. 1 and 2 for the evaluation and optimization of the generalization error on several tasks, ranging from simulated examples to semi-synthetic image datasets. The details of the experimental set-up, examples not fully described below, and additional experiments can be found in the Appendix.

### 5.1 Simulations

**Worst-case risk evaluation.** Our first experiment revisits Examples 2 and 3 for the evaluation of the worst-case risk $R_{P^*}$ of various classifiers with Neural-TR (Alg. 1). In Example 2 we had made (anecdotal) performance observations for the classifiers $h_1(\mathbf{c}, w) := w \oplus \bigoplus_{c \in \mathbf{c}} c$, $h_2(\mathbf{c}) := \bigoplus_{c \in \mathbf{c}} c$, $h_3(z) := z$ in a selected target domain $M^*$. We now consider providing a worst-case risk guarantee with Neural-TR for any (compatible) target domain. The main panel in Fig. 4a shows the convergence of the worst-case risk evaluator over successive training iterations (line 15, Alg. 1), repeated 10 times with different model seeds, with solid lines denoting the mean worst-case risk. The source performances $R_{P^1}, R_{P^2}$ are given in the two right-most panels for reference. We observe that the good source performance of $h_2(\mathbf{c})$ and $h_3(z)$ generalizes to all possible target domains consistent with our assumptions, while the classifier $h_1(\mathbf{c}, w)$ diverges, with an error of 90% in the worst target domain. In Example 3, we consider the evaluation of the binary classifiers $h \in \{h_1(x) := x,\ h_2(x) := 1 - x,\ h_3(x) := 0,\ h_4(x) := 1\}$. Our results are given in Fig. 4b, highlighting the extent to which source performance need not be indicative of target performance. With these results, we are now in a position to confirm the desirable performance profile of $h_2$, even in the worst case, as hypothesized in Example 3.

**Worst-case risk optimization.** For each of the examples above, we implement CRO (Alg. 2) to uncover the theoretically optimal classifier in the worst case. The worst-case risks of the classifiers learned by CRO, denoted $h_{\text{CRO}}$, are 0.05 for Example 2 and 0.18 for Example 3. The worst-case risk evaluation results (with Neural-TR, as above) are given in Figs. 4d and 4e. It is interesting to note that these errors coincide with those of the best-performing classifiers considered in the previous experiment, i.e., $h_3(z) := z$ for Example 2 and $h_2(x) := 1 - x$ for Example 3. In fact, by comparing the outputs of CRO with these classifiers, we can verify that the classifiers learned by CRO in these examples are precisely the mappings $h_{\text{CRO}}(z) := z$ and $h_{\text{CRO}}(x) := 1 - x$, which is remarkable. By Thm. 3, $h_3(z) := z$ and $h_2(x) := 1 - x$ are the theoretically best worst-case classifiers among all possible functions given the data and assumptions.

### 5.2 Colored MNIST

Our second experiment considers the colored MNIST (CMNIST) dataset, used in the literature to examine the robustness of classifiers to spurious correlations, e.g., see [2]. The goal of the classifier is to predict a binary label $Y \in \{0, 1\}$ assigned to an image $Z \in \mathbb{R}^{28 \times 28 \times 3}$ based on whether the digit in the image is greater than or equal to five. MNIST images $W \in \mathbb{R}^{28 \times 28}$ are grayscale (and latent), and the color $C \in \{\text{red}, \text{green}\}$ correlates with the class label $Y$.

Figure 5: Selection diagram $\mathcal{G}^\Delta$ for CMNIST.

Following standard implementations, we construct datasets from three domains with varying correlation strength between the color and image label: 90% agreement between the color $C = \text{red}$ and label $Y = 1$ in source domain $M^1$, and 80% in source domain $M^2$. We consider performance evaluation and optimization in a target domain $M^*$ with potential discrepancies in the mechanism for $C$, rendering the correlation between color and label unstable. The selection diagram is given in Figure 5.

Figure 6: Illustration of the CRO training process (Alg. 2) on the colored MNIST task: (a) Iteration 1, (b) Iteration 2, (c) Iteration 3.

**Worst-case risk evaluation.** Consider a setting in which we are given a classifier $h : \Omega_Z \to \Omega_Y$, and the task is to assess its generalizability with a symmetric 0-1 loss function.
We use data drawn from $P^{1,2}(z, y)$ to train predictors using Empirical Risk Minimization (ERM) [38], Invariant Risk Minimization (IRM) [2], and group Distributionally Robust Optimization (group DRO) [32], denoted $h_{\text{ERM}}(z)$, $h_{\text{IRM}}(z)$, and $h_{\text{DRO}}(z)$, respectively; a more detailed discussion of the role of invariance and robustness in domain generalization is available in Appendix D. Using Neural-TR, we observe in Fig. 4c that the worst-case risk of $h_{\text{ERM}}$ in a target domain with a discrepancy in the color assignment is approximately 0.95, $h_{\text{DRO}}$ achieves 0.90 worst-case risk, and $h_{\text{IRM}}$ achieves 0.65 worst-case risk. Every method performs worse than the baseline of random classification, which has risk 0.5. On this task, a classifier trained on grayscale images $W$ achieves a worst-case error of 0.25.

**Worst-case risk optimization.** We now ask whether we can learn a theoretically optimal classifier in the worst case with CRO (Alg. 2). Fig. 6 illustrates the training process over several iterations. Specifically, given a randomly initialized $h$, we infer the NCM $\hat{\theta}^*$ that entails worst-case performance of $h$ (in this case, chance performance $R_{P^*}(h) = 0.5$) and generate data $D^*$ from $\hat{\theta}^*$, shown in Fig. 6a. In a second iteration, a new candidate $h$ is trained to minimize the worst-case risk on $\mathcal{D}^* = \{D^*\}$. Note that in $D^*$ we observe an almost perfect association between the color $C = \text{green}$ and label $Y = 1$: $h$ is therefore encouraged to exploit color for prediction. Its worst-case error (inferred with Neural-TR) is accordingly close to 1, and the corresponding worst-case NCM $\hat{\theta}^*$ entails a distribution of data in which the correlation between color and label is flipped, with a strong association between the color $C = \text{red}$ and label $Y = 1$, as shown in Fig. 6b. In a third iteration, a new candidate $h$ is trained to minimize the worst-case risk on the updated $\mathcal{D}^*$, now containing data samples from the previous two iterations (exhibiting opposite color-label correlations). By construction, this classifier is trained to ignore the spurious association between color and label, classifying images based on the inferred digit, which leads to better behavior in the worst case: achieving a final error of approximately 0.25, as shown in Fig. 6c, which is theoretically optimal. Note, however, that the poor performance of the baseline algorithms is not directly comparable to that of CRO, since CRO has access to background information (selection diagrams) that cannot be communicated to the baseline algorithms. CRO may thus be interpreted as a meta-algorithm that operates with a broader range of assumptions, encoded in a certain format (i.e., the selection diagram), which enables it to find the theoretically optimal classifier for domain generalization, in contrast to the baseline algorithms.

## 6 Conclusion

Guaranteeing the performance of ML algorithms deployed in the wild is a critical ingredient for improving the safety of AI. In practice, evaluating the performance of a given algorithm is non-trivial: often the performance may vary as a consequence of our uncertainty about the possible target domain, a situation called a non-transportable setting. In this paper, we provide the first general estimation technique for bounding an arbitrary statistic, such as the classification risk, across multiple domains. More specifically, we extend the formulation of canonical models and neural causal models for the transportability task, demonstrating that tight bounds may be estimated with both approaches.
Building on these theoretical findings, we introduce a Bayesian inference procedure as well as a gradient-based optimization algorithm for scalable inferences in practice. Moreover, we introduce Causal Robust Optimization (CRO), an iterative learning scheme that uses partial transportability as a subroutine to find a predictor with the best worst-case risk given the data and graphical assumptions.

## 7 Acknowledgements

This research was supported in part by the NSF, ONR, AFOSR, DARPA, DoE, Amazon, JP Morgan, and The Alfred P. Sloan Foundation.

## References

[1] Isabela Albuquerque, João Monteiro, Tiago H. Falk, and Ioannis Mitliagkas. Adversarial target-invariant representation learning for domain generalization. arXiv preprint arXiv:1911.00804, 2019.
[2] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
[3] Alexander Balke and Judea Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171-1176, 1997.
[4] Elias Bareinboim, Sanghack Lee, Vasant Honavar, and Judea Pearl. Transportability from multiple environments with limited experiments. Advances in Neural Information Processing Systems, 26, 2013.
[5] Elias Bareinboim and Judea Pearl. Transportability from multiple environments with limited experiments: Completeness results. Advances in Neural Information Processing Systems, 27, 2014.
[6] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79:151-175, 2010.
[7] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006.
[8] Donald T. Campbell and Julian C. Stanley. Experimental and Quasi-Experimental Designs for Research. Ravenio Books, 2015.
[9] Juan D. Correa and Elias Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020.
[10] Juan D. Correa and Elias Bareinboim. General transportability of soft interventions: Completeness results. In Advances in Neural Information Processing Systems, volume 33, pages 10902-10912, 2020.
[11] Juan D. Correa and Elias Bareinboim. From statistical transportability to estimating the effect of stochastic interventions. In IJCAI, pages 1661-1667, 2019.
[12] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 129-136, 2010.
[13] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1):2096-2030, 2016.
[14] Stuart S. Glennan. Mechanisms and the nature of causation. Erkenntnis, 44(1):49-71, 1996.
[15] Ishaan Gulrajani and David Lopez-Paz. In search of lost domain generalization. arXiv preprint arXiv:2007.01434, 2020.
[16] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110(3):457-506, 2021.
[17] Kasra Jalaldoust and Elias Bareinboim. Transportable representations for domain generalization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(11):12790-12800, 2024.
[18] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (REx). In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5815-5826. PMLR, 2021.
[19] Sanghack Lee, Juan D. Correa, and Elias Bareinboim. Generalized transportability: Synthesis of experiments from heterogeneous domains. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020.
[20] Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Domain generalization via conditional invariant representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[21] Peter Machamer, Lindley Darden, and Carl F. Craver. Thinking about mechanisms. Philosophy of Science, 67(1):1-25, 2000.
[22] Sara Magliacane, Thijs Van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M. Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. Advances in Neural Information Processing Systems, 31, 2018.
[23] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[24] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10-18. PMLR, 2013.
[25] Judea Pearl. Causality. Cambridge University Press, 2009.
[26] Judea Pearl and Elias Bareinboim. Transportability of causal and statistical relations: A formal approach. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[27] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947-1012, 2016.
[28] Md Musfiqur Rahman and Murat Kocaoglu. Modular learning of deep causal generative models for high-dimensional causal inference. In Forty-First International Conference on Machine Learning, 2024.
[29] Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(1):1309-1342, 2018.
[30] Elan Rosenfeld, Pradeep Kumar Ravikumar, and Andrej Risteski. The risks of invariant risk minimization. In International Conference on Learning Representations, 2021.
[31] Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann, and Jonas Peters. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2):215-246, 2021.
[32] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
[33] Xinwei Shen, Peter Bühlmann, and Armeen Taeb. Causality-oriented robustness: exploiting general additive interventions. arXiv preprint arXiv:2307.10299, 2023.
[34] Amos Storkey. When training and test sets are different: characterizing learning transfer. 2008.
[35] Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(5), 2007.
[36] Jin Tian and Judea Pearl. A general identification condition for causal effects. eScholarship, University of California, 2002.
[37] US Department of Health and Human Services. The health consequences of smoking: 50 years of progress: a report of the Surgeon General, 2014.
[38] Vladimir Vapnik. Principles of risk minimization for learning theory. Advances in Neural Information Processing Systems, 4, 1991.
[39] Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. In Advances in Neural Information Processing Systems, 2021.
[40] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen, Wenjun Zeng, and Philip Yu. Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering, 2022.
[41] Kevin Xia, Kai-Zhan Lee, Yoshua Bengio, and Elias Bareinboim. The causal-neural connection: Expressiveness, learnability, and inference. Advances in Neural Information Processing Systems, 34:10823-10836, 2021.
[42] Junzhe Zhang, Jin Tian, and Elias Bareinboim. Partial counterfactual identification from observational and experimental data. arXiv preprint arXiv:2110.05690, 2021.

## Table of Contents (Appendices)

- A Partial Transportability as a Bayesian Inference Task
  - A.1 Gibbs Sampling
  - A.2 Implementing Constraints
- B Additional Experiments and Details
  - B.1 Additional Examples
  - B.2 More on Colored MNIST
  - B.3 Reproducibility
- C Extended Discussion on Algorithms
  - C.1 Examples of Neural-TR (Algorithm 1)
  - C.2 Illustration of CRO (Algorithm 2)
- D Extended Related Work
  - D.1 Invariant Learning for Domain Generalization
  - D.2 Group Robustness for Domain Generalization
- E Proofs
  - E.1 Proof of Theorem 2
  - E.2 Proof of Proposition 1
  - E.3 Proof of Theorem 3
- F Broader Impact and Limitations

## A Partial Transportability as a Bayesian Inference Task

Consider a system of multiple SCMs $M^1, M^2, \ldots, M^K, M^*$ that induces the selection diagram $\mathcal{G}^\Delta$, and entails the source distributions $P^1, P^2, \ldots, P^K$ and the target distribution $P^*$ over the variables $\mathbf{V}$. Let $\psi : \Omega_{\mathbf{X}} \to \mathbb{R}$ be a functional of interest. Consider the following optimization scheme:

$$\begin{aligned} \hat{q}^{\max} = \max_{N^1, N^2, \ldots, N^*} \quad & \mathbb{E}_{P^{N^*}}[\psi(\mathbf{X})] & (12) \\ \text{s.t.} \quad & P^{N^i}(r_V) = P^{N^j}(r_V) \quad \forall i, j \in \{1, 2, \ldots, K, *\},\ \forall V \notin \Delta_{i,j}, \\ & P^{N^i}(\mathbf{v}) = P^i(\mathbf{v}) \quad \forall i \in \{1, 2, \ldots, K\}, \end{aligned}$$

where each $N^i$ is a canonical model characterized by a joint distribution over $\{R_V\}_{V \in \mathbf{V}}$.
This section describes a Markov Chain Monte Carlo (MCMC) algorithm to approximate the optimal scalar $\hat{q}^{\max}$ upper-bounding the query $\phi_N := \mathbb{E}_{P^{N^*}}[\psi(\mathbf{X})]$ above from finite samples drawn from the input distributions $P^1, P^2, \ldots, P^K$. Formally, we aim to infer the value

$$\hat{q}^{\max} : P(\phi_N < \hat{q}^{\max} \mid \mathbf{v}) = 1, \tag{13}$$

where $\mathbf{v} := (\mathbf{v}_{P^1}, \ldots, \mathbf{v}_{P^K})$ and $\mathbf{v}_{P^i} = \{\mathbf{v}^{(j)}_{P^i} : j = 1, \ldots, n_i\}$ denotes $n_i$ independent samples drawn from $P^i$. We consider a setting in which we are provided with prior distributions (possibly uninformative) over the parameters of the family of compatible canonical models (CMs) $N^1, N^2, \ldots, N^*$. In particular, we assume that for each CM the probabilities of $P(U)$, $U \in \mathbf{U}$, are drawn from uninformative Dirichlet priors, and the functions $\mathcal{F}$ are drawn uniformly from the finite class of possible structural functions. That is, for every $U \in \mathbf{U}$ and every $V \in \mathbf{V}$,

$$P(U) \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_{d_U}), \qquad f_V \sim \mathrm{Uniform}(\Omega_{\mathit{PA}_V} \times \Omega_{\mathbf{U}_V} \mapsto \Omega_V), \tag{14}$$

where $d_U = \prod_{V \in \mathit{Pa}(C_U)} |\Omega_V|$ and $\alpha_1 = \ldots = \alpha_{d_U} = 1$. The total collection of parameters is given by the set $\{(\theta_{N^1}, \xi_{N^1}), \ldots, (\theta_{N^*}, \xi_{N^*})\}$. Among them, $\theta = \{\theta_U \in [0, 1]^{d_U} : U \in \mathbf{U}\}$ defines the parameterization of the exogenous probabilities, while $\xi = \{\xi^{(\mathit{pa}_V, \mathbf{u}_V)}_V \in \mathrm{supp}\,V : \mathit{PA}_V \subseteq \mathbf{V}, \mathbf{U}_V \subseteq \mathbf{U}\}$ defines the structural functions, one set for each CM separately. We design a Gibbs sampler to evaluate posterior distributions over these parameters. For simplicity, we describe each step of the Gibbs sampler for a single domain and input dataset, and consider the implementation of constraints below.

### A.1 Gibbs Sampling

The Gibbs sampler iterates over the following steps, sampling each parameter conditioned on the current values of the remaining terms in the parameter vector.

1. Sample $\mathbf{u}$. Let $u \in \Omega_U$, $U \in \mathbf{U}$. For each observed data example across all domains, $\mathbf{v}^{(n)} \in \mathbf{v}$, $n = 1, \ldots, \sum_i n_i$, we sample the corresponding exogenous variables $U \in \mathbf{U}$ from the conditional distribution

$$P(u^{(n)} \mid \mathbf{v}^{(n)}, \xi, \theta) \propto P(u^{(n)}, \mathbf{v}^{(n)} \mid \xi, \theta) = \prod_{V \in \mathbf{V}} \mathbb{1}\{\xi^{(\mathit{pa}^{(n)}_V, \mathbf{u}^{(n)}_V)}_V = v^{(n)}\} \prod_{U \in \mathbf{U}} \theta_u. \tag{15}$$

2. Sample $\xi$. The parameters $\xi$ define deterministic causal mechanisms. For a given parameter $\xi^{(\mathit{pa}_V, \mathbf{u}_V)}_V \in \xi$, its conditional distribution is given by $P(\xi^{(\mathit{pa}_V, \mathbf{u}_V)}_V = v \mid \mathbf{v}, \mathbf{u}) = 1$ if there exists a sample $(v^{(n)}, \mathit{pa}^{(n)}_V, \mathbf{u}^{(n)})$ for some $n$ such that $\xi^{(\mathit{pa}^{(n)}_V, \mathbf{u}^{(n)}_V)}_V = v^{(n)}$, where $n$ ranges over the samples of $\mathbf{u}$ from step 1 and over the entries of $\mathbf{v}$ associated with the subset of domains in which the exogenous probabilities match the target domain. Otherwise, $P(\xi^{(\mathit{pa}_V, \mathbf{u}_V)}_V = v \mid \mathbf{v}, \mathbf{u})$ is given by a uniform discrete distribution over its support $\mathrm{supp}\,V$.

3. Sample $\theta$. Let $\theta_U = (\theta_1, \ldots, \theta_{d_U}) \in \theta$ be the parameters defining the probability vector over the possible values of the variable $U \in \mathbf{U}$. Its conditional distribution is given by

$$\theta_U \mid \mathbf{v}, \mathbf{u} \sim \mathrm{Dirichlet}(\alpha_1 + \beta_1, \ldots, \alpha_{d_U} + \beta_{d_U}),$$

where $\beta_i := \sum_n \mathbb{1}\{u^{(n)} = u_i\}$. Similarly, $n$ ranges over the samples of $\mathbf{u}$ from step 1 associated with the subset of domains in which the exogenous probabilities match the target domain.
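To ground steps 1-3, here is a minimal sketch (ours) of the Gibbs sweep for a single-domain canonical model of the bow pair $(X, Y)$, with one exogenous variable $U := (R_X, R_Y)$ of cardinality 8. Because the canonical mechanisms are fixed tables, step 2 is vacuous here, so only the latent-assignment step and the Dirichlet update of step 3 appear; the dataset is a synthetic stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Canonical bow model: U = (r_X, r_Y) with 8 cells; mechanisms fixed by the
# canonical form: X = r_X and Y = f_{r_Y}(X) with the four functions below.
f_ry = [lambda x: 0, lambda x: 1, lambda x: x, lambda x: 1 - x]
cells = [(rx, ry) for rx in (0, 1) for ry in range(4)]
obs = np.array([(rx, f_ry[ry](rx)) for rx, ry in cells])   # (x, y) per cell

# Synthetic source dataset (stand-in for v_{P^1}).
data = np.stack([rng.integers(0, 2, 500), rng.integers(0, 2, 500)], axis=1)

alpha = np.ones(8)                   # uninformative Dirichlet prior, Eq. (14)
theta = rng.dirichlet(alpha)         # initial exogenous probabilities

samples = []
for sweep in range(2000):
    # Step 1: sample u^{(n)} | v^{(n)}, theta for every data point.
    compat = (obs[None, :, :] == data[:, None, :]).all(axis=2)  # (n, 8) mask
    probs = compat * theta
    probs /= probs.sum(axis=1, keepdims=True)
    u = np.array([rng.choice(8, p=p) for p in probs])
    # Step 3: theta | u ~ Dirichlet(alpha + counts).
    counts = np.bincount(u, minlength=8)
    theta = rng.dirichlet(alpha + counts)
    samples.append(theta)

theta_post = np.mean(samples[500:], axis=0)   # posterior mean after burn-in
print(theta_post.round(3))
```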
### A.2 Implementing Constraints

Iterating this procedure forms a Markov chain with invariant distribution $P(\mathbf{u}, \xi, \theta \mid \mathbf{v})$. This naturally enforces the soft constraint $P^{N^i}(\mathbf{v}) = P^i(\mathbf{v})$, $i \in \{1, 2, \ldots, K\}$, for the CMs defined by the sampled parameters. The posterior distributions of the subset of $(\theta_{N^*}, \xi_{N^*})$ for which invariances across domains are assumed are then matched with the posterior distribution inferred from the source data. The constraint $P^{N^i}(r_V) = P^{N^*}(r_V)$, $i \in \{1, 2, \ldots, K\}$, $V \notin \Delta_i$, is enforced by generating $\theta_{N^*_U}$ from the prior such that $P^{N^*}(r_V) := \sum_{u \in \Omega} \theta_{N^i_u} =: P^{N^i}(r_V)$, $V \notin \Delta_i$, where $\Omega$ denotes the element of the partition of $\mathrm{supp}\,U$ that is expressed by $R_V$. The query is then approximated by plugging the $T$ MCMC samples into the query $\phi_N$ to obtain $\phi^{(1)}_N, \ldots, \phi^{(T)}_N$ and

$$\hat{q}^{\max} := \sup\Big\{x : \frac{1}{T}\sum_t \mathbb{1}\{\phi^{(t)}_N \le x\} \le \alpha\Big\} \tag{16}$$

for a chosen confidence level $\alpha$.

**Example 5 (Example 3 continued).** Consider again the evaluation of the risk $R_{P^*}(h) := P^{N^*}(Y \neq h(X))$ given the classifier $h(x) = x$. We are given data sampled from $P^1(x, y), P^2(x, y)$. For every SCM $M$, there exists an SCM of the described form specified by only a distribution $P(r_X, r_Y)$, where

$$\mathrm{supp}\,R_X = \{0, 1\}, \qquad \mathrm{supp}\,R_Y = \{y_0, y_1, y_x, y_{\neg x}\}. \tag{17}$$

Thus, the joint distribution $P(u_{XY}) = P(r_X, r_Y)$ can be parameterized by a vector in the 8-dimensional simplex. The canonical SCMs associated with each of the SCMs $M^1, M^2, M^*$ are denoted $N^1, N^2, N^*$, for which $\mathbf{V} = \{X, Y\}$, $\mathbf{U} = \{U_{XY}\}$, and $\mathrm{supp}\,U_{XY} = \{1, \ldots, 8\}$. The partial transportability task can be translated into an optimization problem aiming to find the upper bound on the risk $R_{P^*}(h)$ for the classifier $h(x) = x$:

$$\begin{aligned} \max_{N^1, N^2, N^*} \quad & P^{N^*}(Y \neq X) & (18) \\ \text{s.t.} \quad & P^{N^1}(r_Y) = P^{N^*}(r_Y),\ \ P^{N^2}(r_X) = P^{N^*}(r_X) & (Y \notin \Delta_1 \text{ and } X \notin \Delta_2) \\ & P^{N^1}(x, y) = P^1(x, y),\ \ P^{N^2}(x, y) = P^2(x, y) & \text{(matching source distributions)} \end{aligned}$$

With the Gibbs sampler outlined above, we obtain samples from the posterior distribution $P(\theta_{N^1}, \theta_{N^2}, \xi_{N^1}, \xi_{N^2} \mid \mathbf{v})$. $\theta_{N^1}, \theta_{N^2}$ encode the probabilities $P^{N^1}(U_{XY} = u), P^{N^2}(U_{XY} = u)$ and are instantiated as two-dimensional arrays of shape $(2, 4)$, such that, e.g., $P^{N^1}(r_Y)$ is obtained by summing over the $r_X$ dimension (with $r_Y \in \{1, 2, 3, 4\}$), and similarly $P^{N^1}(r_X)$ by summing over the $r_Y$ dimension (with $r_X \in \{1, 2\}$). To enforce the constraints $P^{N^1}(r_Y) = P^{N^*}(r_Y)$, $P^{N^2}(r_X) = P^{N^*}(r_X)$, it thus suffices to sample $\theta_{N^*}$ from the prior Dirichlet distribution (as it has not been updated with data) and re-scale the outcomes such that the partial row and column sums match the corresponding partial row and column sums computed from the MCMC samples of $P(\theta_{N^1}, \theta_{N^2} \mid \mathbf{v})$. The resulting MCMC parameters $(\theta_{N^*}, \xi_{N^*})$ are then valid samples from the posterior distribution $P(\theta_{N^*}, \xi_{N^*} \mid \mathbf{v})$ subject to the assumed constraints, and the risk can be computed by plugging those samples into $R_{P^*}(h) := P^{N^*}(Y \neq h(X))$ to obtain $R_{P^*}(h)^{(1)}, \ldots, R_{P^*}(h)^{(T)}$ and evaluating

$$\hat{q}^{\max} := \sup\Big\{x : \frac{1}{T}\sum_t \mathbb{1}\{R_{P^*}(h)^{(t)} \le x\} \le \alpha\Big\} \tag{19}$$

for a chosen confidence level $\alpha$.
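A small sketch (ours) of the final aggregation step of Eq. (19): given MCMC draws of the risk, the reported bound is the empirical $\alpha$-quantile, which is the maximum of the draws when $\alpha = 1$. The Beta-distributed draws below are a hypothetical stand-in for posterior samples of $R_{P^*}(h)$.

```python
import numpy as np

def q_max(risk_samples, alpha=1.0):
    """Empirical upper bound from posterior risk samples, as in Eq. (19):
    the largest x such that the fraction of samples below x does not exceed
    alpha, i.e., the alpha-quantile (the sample maximum when alpha = 1)."""
    return np.quantile(risk_samples, alpha)

draws = np.random.default_rng(0).beta(2, 8, size=10_000)
print(q_max(draws, alpha=0.95), q_max(draws, alpha=1.0))
```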
Figure 7: Violin plots describing MCMC samples of $R_{P^*}(h)$ for Example 2. The upper end-point is an estimate of $\max R_{P^*}(h)$; $n$ stands for the number of source-domain samples used as a conditioning set in the posterior evaluation.

The following theorem shows that $\hat{q}^{\max}$ converges to the true (tight) bound $q^{\max}$ for the unknown query $R_{P^*}(h)$.

**Theorem 4.** $\hat{q}^{\max}$ defined in Eq. (19) is a valid upper bound on $q^{\max}$ for any sample size, and coincides with $q^{\max}$ as the sample size increases to infinity.

*Proof.* Let $\Theta$ denote the collection of parameters $\xi, \theta$ of discrete SCMs that generate the observed data from $P^1, P^2, \ldots$. We assume that the prior distribution on $\xi, \theta$ has positive support over the domain of $\Theta$; that is, the probability density function satisfies $\rho(\xi) > 0$ and $\rho(\theta) > 0$ for every possible realization of $\xi, \theta$. By the definition of $\Theta$, every pair of parameters $(\xi, \theta) \in \Theta$ must be compatible with the dataset $\mathbf{v}$, i.e., $P(\mathbf{v} \mid \xi, \theta) > 0$. Similarly, given that the prior has positive support on $\Theta$, $P(\xi, \theta \mid \mathbf{v}) > 0$. Note that parameters $(\xi, \theta) \in \Theta$ fully determine the optimal upper bound $q^{\max}$ for $R_{P^*}(h)$. This implies that $P(R_{P^*}(h) < q^{\max} \mid \mathbf{v}) > 0$, which by the definition of a 100% credible interval means that $R_{P^*}(h) < \hat{q}^{\max}$.

Next, we show convergence of the posterior by way of convergence of the likelihood of the data given one SCM $M$. For increasing sample size, the posterior will, with increasing probability, be low for any incompatible parameter configuration, i.e., for any $(\xi, \theta) \notin \Theta$. By the definition of the optimal upper bound $q^{\max}$, given by the solution to the partial identification task,

$$P(\mathbf{v} \mid R_{P^*}(h) < q^{\max}) \to_p 1. \tag{20}$$

Therefore, if the prior on parameters $(\xi, \theta)$ defining SCMs is non-zero for every $M$ compatible with the data and assumptions, the posterior also converges:

$$P(R_{P^*}(h) < q^{\max} \mid \mathbf{v}) \to_p 1, \tag{21}$$

which is the definition of the credible value $\hat{q}^{\max}$ as the 100th quantile of the posterior distribution, which coincides with $q^{\max}$ asymptotically. ∎

## B Additional Experiments and Details

This section includes experimental details not covered in the main body of this paper, as well as additional examples to illustrate our methods, including the Bayesian inference approach. For the approximation of credible intervals and expectations required for the Bayesian inference approach, we draw 10,000 samples from the posterior distributions $P(\cdot \mid \mathbf{v})$ after discarding 2,000 samples as burn-in. The results are given as violin plots that encode the full posterior distribution of the query of interest, here the target error $R_{P^*}(h)$ of a classifier $h$. The worst-case target error can then be read as the upper end-point of the posterior distribution. For completeness, we provide MCMC results for Examples 2 and 3, analyzed in the main body of this paper, in Figs. 7 and 8, respectively. One can check that the upper bounds match the analysis in the main body of this paper.

### B.1 Additional Examples

This section adds synthetic examples to illustrate our methods.

Figure 8: Violin plots describing MCMC samples of $R_{P^*}(h)$ for Example 3. The upper end-point is an estimate of $\max R_{P^*}(h)$; $n$ stands for the number of source-domain samples used as a conditioning set in the posterior evaluation. Recall that $h_1(x) := x$, $h_2(x) := 1 - x$, $h_3(x) := 0$, $h_4(x) := 1$.

Figure 9: NCM experimental results on Examples 6 and 7: (a) Example 6, (b) Example 7.

**Example 6.** This experiment is inspired by the debate around the relationship between smoking and lung cancer in the 1950s [37]; the corresponding selection diagram is shown in Figure 12a. We consider $\mathcal{M} : \{M^1, M^*\}$ describing the effect of an individual's smoking status $S$ on lung cancer $C$, including related measured variables such as the presence of tar in the lungs, $T$, and demographic factors $\mathbf{W}$. The data-generating mechanism is given by

$$\mathbf{V} = \{\mathbf{W}, S, T, C\}, \qquad \mathbf{U} = \{\mathbf{U_W}, U_S, U_T, U_{SC}\},$$
$$f_W(u_W) = u_W, \quad \text{for } W \in \mathbf{W},\ U_W \in \mathbf{U_W},$$
$$f_S(\mathbf{w}, u_S, u_{SC}) = \begin{cases} 1, & \text{if } \textstyle\sum_d w_d + u_{SC} + 1.5\,u_S - 1 > 0 \text{ and } i = 1 \\ 1, & \text{if } \textstyle\sum_d w_d + u_{SC} + u_S - 2 > 0 \text{ and } i = * \\ 0, & \text{otherwise} \end{cases}$$
$$f_T(s, u_T) = \begin{cases} 1, & \text{if } s + 0.5\,u_T - 1 > 0 \\ 0, & \text{otherwise} \end{cases} \qquad f_C(\mathbf{w}, t, u_{SC}) = \begin{cases} 1, & \text{if } t + \textstyle\sum_d w_d + u_{SC} - 1 > 0 \\ 0, & \text{otherwise} \end{cases}$$

with $P(\mathbf{U})$ defined such that $U_S, U_T, U_{SC} \sim \mathrm{Bern}(0.5)$ and $U_W \sim N(0, 1)$ for each $W \in \mathbf{W}$. Note that $\Delta = \{S\}$, as the mechanism for $S$ differs across domains while the mechanisms for all other variables are assumed invariant. The quantity to upper-bound is the target mean squared error $R_{P^*}(h) := \mathbb{E}_{P^*}[(C - h)^2]$ of the cancer prediction algorithms $h \in \{h_1(\mathbf{w}, s, t) = \mathbb{E}_{P^1}[C \mid \mathbf{w}, s, t],\ h_2(\mathbf{w}, t) = \mathbb{E}_{P^1}[C \mid \mathbf{w}, t],\ h_3(\mathbf{w}) = \mathbb{E}_{P^1}[C \mid \mathbf{w}]\}$, given data from $P^1$ and $\mathcal{G}^\Delta$. The results for the NCM approach are given in Fig. 9a. We observe that despite the discrepancy in $S$, all methods maintain an error close to 0.4.
The results for the Gibbs sampling approach are given in Fig. 10. The violin plots encode the full posterior distribution of the query of interest, here the target error $R_{P^*}(h)$ of a classifier $h$; the worst-case target error can be read off as the upper end-point of the posterior distribution. We observe that the upper bounds from the NCM and MCMC approaches approximately match.

Example 7. This experiment considers the design of prediction rules for the development of Alzheimer's disease in a target hospital $\mathcal{M}^*$ in which no data could be recorded; the corresponding selection diagram is shown in Figure 12b. The observed variables are given by $V = \{X_1, X_2, W, Y, Z\}$. Among those, $X_1$ and $X_2$ are treatments for hypertension and clinical depression, respectively, both known to influence Alzheimer's disease $Y$, as well as blood pressure $W$; $Z$ is a symptom of Alzheimer's. Their biological mechanisms are somewhat understood, e.g., the effect of hypertension is mediated by blood pressure $W$, although several unobserved factors, such as physical activity levels and diet patterns, are expected to simultaneously affect both conditions. We assume that hypertension and clinical depression are not known to affect each other, although it is common for patients with clinical depression to simultaneously be at risk of hypertension (expressed through the presence of an unobserved common cause). More specifically, investigators have access to data from a related study conducted in domain $\mathcal{M}^1$. The SCMs $\mathcal{M} := \{\mathcal{M}^1, \mathcal{M}^*\}$ are given as follows:
$$V = \{X_1, X_2, W, Y, Z\}, \qquad U = \{U_{WY}, U_{X_2}, U_W, U_{X_1 X_2}, U_Z\},$$
$$f_{X_1}(U_{X_1 X_2}) = \begin{cases} 1, & \text{if } U_{X_1 X_2} > 0 \\ 0, & \text{otherwise} \end{cases} \qquad f_{X_2}(U_{X_1 X_2}, U_{X_2}) = \begin{cases} 1, & \text{if } U_{X_1 X_2} + U_{X_2} > 0 \\ 0, & \text{otherwise} \end{cases}$$
$$f_W(X_1, U_{WY}, U_W) = \begin{cases} 1, & \text{if } X_1 + U_{WY} + 1.5\, U_W - 1 > 0 \text{ and } i = * \\ 1, & \text{if } X_1 + U_{WY} + U_W - 1 > 0 \text{ and } i = 1 \\ 0, & \text{otherwise} \end{cases}$$
$$f_Y(W, X_1, U_{WY}) = \begin{cases} 1, & \text{if } W + U_{WY} + 0.1\, X_1 - 1 > 0 \\ 0, & \text{otherwise} \end{cases} \qquad f_Z(Y, U_Z) = \begin{cases} 1, & \text{if } Y + U_Z > 0.5 \\ 0, & \text{otherwise} \end{cases}$$
with $P(U)$ defined such that $U_{WY}, U_{X_2}, U_W, U_{X_1 X_2}, U_Z \sim N(0, 1)$. Note that $\Delta_* = \{W\}$, as the mechanism for $W$ differs across domains while the mechanisms for all other variables are assumed invariant. In this example, we aim at upper-bounding the target mean squared error $R_{P^*}(h) := \mathbb{E}_{P^*}[(Y - h)^2]$ of the Alzheimer's prediction algorithms $h \in \{h_1(x_1, x_2, w) = \mathbb{E}_{P^1}[Y \mid x_1, x_2, w],\; h_2(w) = \mathbb{E}_{P^1}[Y \mid w],\; h_3(z) = \mathbb{E}_{P^1}[Y \mid z]\}$, given data from $P^1$ and $\mathcal{G}^\Delta$. The results for the NCM approach are given in Fig. 9b. We observe that the discrepancy in $W$ leads to poor performance for all methods (chance level), except for $h_3$, which outperforms the rest. The results for the Gibbs sampling approach are given in Fig. 11, and we again observe that the upper bounds from the NCM and MCMC approaches approximately match.

Figure 10: Violin plots that describe MCMC samples of $R_{P^*}(h)$ for Example 6. The upper end-point is an estimate of $\max R_{P^*}(h)$; $n$ stands for the number of source domain samples used as a conditioning set in the posterior evaluation.

Figure 11: Violin plots with MCMC samples for Example 7. $n$ stands for the number of source domain samples used as a conditioning set in the posterior evaluation.

Figure 12: Selection diagrams for the additional experiments. (a) Example 6. (b) Example 7.

B.2 More on Colored MNIST

Consider handwritten grayscale digits $W \in [0, 1]^{28 \times 28}$ that are annotated with $Y \in \{0, 1, \dots, 9\}$ and colored with $C \in \{\text{red}, \text{green}\}$, resulting in colored images $Z \in [0, 1]^{28 \times 28 \times 3}$.
What follows describes the underlying SCM for domain $i \in \{1, 2, *\}$:
$$W, U_Y, U_C, U_Z \sim P(w)\, P(u_Y)\, P(u_C)\, P(u_Z)$$
$$Y \leftarrow f_Y(W, U_Y) \qquad \text{(the annotation mechanism)}$$
$$C \leftarrow f^i_C(Y, U_C) \qquad \text{(the choice of color based on the digit)}$$
$$Z \leftarrow W \otimes C^\top + U_Z \qquad \text{(coloring image } W \text{ with color } C\text{)}$$
In words, the grayscale image of handwritten digits $W$ is generated according to a distribution $P(w)$ shared across all domains. The label $Y$ is the annotation of the image with the corresponding digit through a mechanism $f_Y$ shared across all domains; the variable $U_Y$ accounts for possible errors in annotation. Next, the color is chosen based on the digit $Y$ following some stochastic policy $f^i_C(\cdot, U_C)$ that changes across the source and target domains. Finally, the colored image $Z$ is produced as the product of the grayscale image $W$ and the color $C$; the exogenous variable $U_Z$ accounts for possible noise in coloring. We have a classifier $h : \Omega_Z \to \Omega_Y$ at hand, and the task is to assess its generalizability. Consider the following derivation:
$$P^*(y, z) = \sum_c P^*(y, c, z) \quad (22)$$
$$= \sum_c P^*(y)\, P^*(c \mid y)\, P^*(z \mid c, y) \quad (23)$$
$$= \sum_c P^*(y)\, P^*(c \mid y)\, P^{1,2}(z \mid c, y) \qquad (S_1, S_2 \perp\!\!\!\perp Z \mid C, Y) \quad (24)$$
$$= \sum_c P^{1,2}(y)\, P^*(c \mid y)\, P^{1,2}(z \mid c, y) \qquad (S_1, S_2 \perp\!\!\!\perp Y) \quad (25)$$
Motivated by the above derivation, we use the source data drawn from $P^1, P^2$ to train the generative models $P(y; \eta_Y)$ and $P(z \mid y, c; \eta_Z)$ to approximate sampling from the distributions $P^{1,2}(y)$ and $P^{1,2}(z \mid y, c)$, respectively. The former generates a random digit $Y$ according to the distribution of labels in the source domains, and the latter generates a colored picture $Z$ taking the color $C$ and digit $Y$ as input. Also, we use an NCM with parameter $\theta^*_C$ to model the c-factor $P^*(c \mid \mathrm{do}(y)) = P^*(c \mid y)$. We can now rewrite the risk as follows:
$$R_{P^*}(h) = \sum_{z, y} |y - h(z)|\, P^{1,2}(y) \sum_c P^*(c \mid y)\, P^{1,2}(z \mid c, y) \quad (26)$$
$$\approx \mathbb{E}_{Y \sim P(y; \eta_Y)} \Big[\sum_c P(c \mid y; \theta^*_C)\, \mathbb{E}_{Z \sim P(z \mid c, y; \eta_Z)}\big[|Y - h(Z)|\big]\Big]. \quad (27)$$
By maximizing the above w.r.t. the free parameter $\theta^*_C$, we obtain the worst-case risk of the classifier.
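A minimal sketch of this maximization is given below, written against PyTorch. The callables `sample_digit`, `sample_image`, and `classifier` stand in for the trained models $P(y; \eta_Y)$, $P(z \mid y, c; \eta_Z)$ and the classifier $h$, respectively, and are assumptions of this illustration rather than the paper's implementation.

```python
import torch

n_digits, n_colors = 10, 2
# logits of the free c-factor P(c | y; theta_C), one row per digit
theta_C = torch.zeros(n_digits, n_colors, requires_grad=True)
opt = torch.optim.Adam([theta_C], lr=1e-2)

def worst_case_risk(sample_digit, sample_image, classifier, batch=256):
    """Monte-Carlo estimate of the objective in Eq. (27)."""
    y = sample_digit(batch)                     # Y ~ P(y; eta_Y), LongTensor (batch,)
    p_c = torch.softmax(theta_C[y], dim=-1)     # P(c | y; theta_C), shape (batch, 2)
    risk = 0.0
    for c in range(n_colors):
        z = sample_image(y, c)                  # Z ~ P(z | c, y; eta_Z)
        loss = (y != classifier(z)).float()     # 0-1 surrogate for |Y - h(Z)|
        risk = risk + (p_c[:, c] * loss).mean()
    return risk

# gradient *ascent* on the risk w.r.t. theta_C:
# for step in range(n_steps):
#     opt.zero_grad(); (-worst_case_risk(...)).backward(); opt.step()
```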
B.3 Reproducibility

For the synthetic experiments, we used feed-forward neural networks with 7 layers and 128 neurons in each layer. The activation for all layers is ReLU, except for the last layer, which uses a sigmoid since $f_{\theta_V}$ outputs the probability of $V = 1$. For evaluation, at each epoch, we used 1000 samples from the joint distribution. The data-generating process for all experiments is provided in the corresponding example. We used the Adam optimizer for training the neural networks. In the CMNIST example, we used a standard implementation of a conditional GAN [23] trained over 200 epochs with a batch size of 64. The learning rate of Adam was set to 0.0002. The architecture of the generator is given by a 5-layer feed-forward neural network with batch normalization and Leaky-ReLU activations.

C Extended Discussion on Algorithms

In this section, we elaborate on the algorithms presented in the paper.

C.1 Examples of Neural-TR (Algorithm 1)

In the next examples, we follow Algorithm 1 to compute the worst-case risk of a classifier.

Figure 13: Selection diagram of Example 8.

Example 8 (Simplify). Consider a system of SCMs $\mathcal{M}^1, \mathcal{M}^2, \mathcal{M}^*$ over $X = \{C, Z, W\}$ and $Y$ that induces the selection diagram shown in Figure 13. Suppose we would like to assess the risk of a classifier $h(z)$. Following Theorem 2, the naive approach requires us to parameterize three NCMs $\theta^1, \theta^2, \theta^*$ over the variables $X, Y$, and then proceed with the maximization of the target quantity $R_{P^{\mathcal{N}^*}}(h) = \mathbb{E}_{P^*}[\mathbb{1}\{Y \neq h(Z)\}]$. Notably, the latter depends only on $P^*(y, z)$. We can rewrite the risk of $h$ as follows:
$$\mathbb{E}_{Y, Z \sim P^*(y, z)}[\mathbb{1}\{Y \neq h(Z)\}] = \sum_{z, y} \mathbb{1}\{y \neq h(z)\}\, P^*(y, z) \quad (28)$$
$$= \sum_{z, y} \mathbb{1}\{y \neq h(z)\}\, P^*(y)\, P^*(z \mid y) \qquad \text{(factorization)} \quad (29)$$
$$= \sum_{z, y} \mathbb{1}\{y \neq h(z)\}\, P^*(y)\, P^2(z \mid y) \qquad (S_2 \perp\!\!\!\perp Z \mid Y) \quad (30)$$
$$= \sum_{z, y} \mathbb{1}\{y \neq h(z)\}\, P^2(z \mid y) \sum_c P^*(y, c). \quad (31)$$
This new expression for the objective function depends only on the unknown $P^*(y, c)$, a so-called ancestral c-factor, which can generally be expressed as $P^*(a \mid \mathrm{do}(pa_A))$ for $A = \{C, Y\}$. In the following, we argue that to partially transport the risk we only need to parameterize the SCMs over the ancestral c-factors that are not transportable. Specifically, the partial transportation problem can be restated as follows:
$$\max_{\theta^1, \theta^2, \theta^*} \; \mathbb{E}_{U_{CY}}\Big[\sum_{y, c, z} P(y, c \mid U_{CY}; \theta^*)\, P(z \mid y; \eta^2)\, \mathbb{1}\{y \neq h(z)\}\Big] \quad (32)$$
$$+ \sum_{(y, c) \in D_1} \mathbb{E}_{U_{CY}}[\log P(y, c \mid U_{CY}; \theta^1)] + \sum_{(y, c) \in D_2} \mathbb{E}_{U_{CY}}[\log P(y, c \mid U_{CY}; \theta^2)]$$
$$\text{s.t.} \quad \theta^*[C] = \theta^2[C], \quad \theta^*[Y] = \theta^1[Y].$$
In the above, $D_i \sim P^i(c, y, z, w)$ denotes the source data, and $P(z \mid y; \eta^2)$ is a probabilistic model of $P^2(z \mid y)$ learned from the data $D_2$.

Example 9 (Partial-TR illustrated). Consider a system of SCMs $\mathcal{M}^1, \mathcal{M}^2, \mathcal{M}^*$ over the binary variables $X = \{X_1, X_2, \dots, X_9\}$ and $Y$ that induces the selection diagram shown in Figure 14. Consider the classifier $h(x_1, x_4) = x_1 \lor x_4$. The objective is the partial transportation of the risk of $h$, expressed as follows:
$$R_{P^*}(h) = P^*(Y \neq h(X_1, X_4)) \quad (33)$$
$$= \mathbb{E}_{X_1, X_4, Y \sim P^*}[\mathbb{1}\{Y \neq X_1 \lor X_4\}]. \quad (34)$$

Figure 14: Selection diagram of Example 9.

The latter indicates that $\psi(X_1, X_4, Y) := \mathbb{1}\{Y \neq X_1 \lor X_4\}$ must be passed to the algorithm. The objective function is then expressed as:
$$R_{P^*}(h) = \mathbb{E}_{P^*}[\psi(X_1, X_4, Y)] = \sum_{x_1, x_4, y} \mathbb{1}\{y \neq x_1 \lor x_4\}\, P^*(x_1, x_4, y). \quad (35)$$
Next, we focus on transporting $P^*(x_1, x_4, y)$. First, we compute the ancestral set using the selection diagram,
$$A = \mathrm{An}_{\mathcal{G}^\Delta}(X_1, X_4, Y) = \{X_1, X_2, X_4, Y, X_5, X_6, X_7, X_8, X_9\}, \quad (36)$$
and we decompose this set into c-components:
$$A_1 = \{X_1, X_2\}, \quad A_2 = \{X_4, Y, X_5\}, \quad A_3 = \{X_6, X_7, X_8, X_9\}. \quad (37)$$
Next, we form the expression below:
$$P^*(x_1, x_4, y) = \sum_{x_2, x_5, \dots, x_9} P^*(x_1, x_2 \mid \mathrm{do}(y))\, P^*(y, x_4, x_5 \mid \mathrm{do}(x_6))\, P^*(x_6, \dots, x_9), \quad (38)$$
$$P^*(a_2 \mid \mathrm{do}(x_6)) \overset{\text{rule 2 do-calc.}}{=} P^*(a_2 \mid x_6) \overset{S_1 \perp\!\!\!\perp \{Y, X_4, X_5\} \mid X_6}{=} P^1(a_2 \mid x_6), \quad (39)$$
$$P^*(a_3) \overset{S_2 \perp\!\!\!\perp A_3}{=} P^2(a_3). \quad (40)$$
Thus, we use the source data $D_1, D_2$ to learn generative models $P(a_2 \mid x_6; \eta^1_{A_2})$ and $P(a_3; \eta^2_{A_3})$ to approximate sampling from $P^1(a_2 \mid x_6)$ and $P^2(a_3)$, respectively, and plug these models as constants into Eq. 35. Since $S_1, S_2$ point to the variables $X_2, X_1$, respectively, the first term $P^*(x_1, x_2 \mid \mathrm{do}(y))$ in Eq. 38 cannot be directly transported from either of the source domains. Thus, we need to parameterize this c-factor using NCMs across all domains. We require the following properties:

1. Parameter sharing: Since $X_4, Y, X_5, X_7, X_8, X_9$ are not pointed to by $S_1$, we share their mechanisms across all domains. Also, since $X_2, X_6$ are not pointed to by $S_2$, we set $\theta^*_{\{X_2, X_6\}} = \theta^2_{\{X_2, X_6\}}$. These constraints are stored in $C_{\text{expert}}$ in the algorithm.

2. Source data: To enforce compatibility of $\theta^1, \theta^2$ with the source data $D_1, D_2$, we compute the likelihood of the data w.r.t. the parameters as follows:
$$L_{\text{likelihood}} := \sum_{\langle x_1, x_2, y \rangle \in D_i} \mathbb{E}_{U_{X_1 X_2}}[\log P(x_1, x_2 \mid y, U_{X_1 X_2}; \theta^i_{A_1})]. \quad (41)$$

We plug $P(x_1, x_2 \mid \mathrm{do}(y); \theta^*_{A_1})$ into Eq. 38. Finally, we use stochastic gradient ascent to maximize the objective function in Eq. 35, regularized by an additive term $\Lambda \cdot L_{\text{likelihood}}$ that encourages the likelihood of the data w.r.t. the parameters of the source NCMs.
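The graph-side computations in Eqs. (36)-(37), taking the ancestral set and splitting it into c-components, can be sketched with networkx as follows. The edge lists below are hypothetical stand-ins chosen to reproduce the sets of this example, since the exact diagram lives in Figure 14.

```python
import networkx as nx

def ancestors_with_self(directed, targets):
    """Ancestral set An(targets), including the targets themselves."""
    anc = set(targets)
    for t in targets:
        anc |= nx.ancestors(directed, t)
    return anc

def c_components(nodes, bidirected_edges):
    """Connected components of the bidirected (latent-confounder) skeleton."""
    g = nx.Graph()
    g.add_nodes_from(nodes)
    g.add_edges_from((a, b) for a, b in bidirected_edges
                     if a in nodes and b in nodes)
    return [sorted(comp) for comp in nx.connected_components(g)]

# hypothetical stand-in for the diagram of Example 9 (cf. Fig. 14)
directed = nx.DiGraph([("X2", "X1"), ("Y", "X4"), ("X5", "Y"), ("X6", "X5"),
                       ("X7", "X6"), ("X8", "X7"), ("X9", "X8")])
directed.add_nodes_from(["X3"])  # X3 plays no role in the query
bidirected = [("X1", "X2"), ("X4", "Y"), ("Y", "X5"),
              ("X6", "X7"), ("X7", "X8"), ("X8", "X9")]

A = ancestors_with_self(directed, ["X1", "X4", "Y"])   # Eq. (36)
print(c_components(A, bidirected))                     # Eq. (37): A1, A2, A3
```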
C.2 Illustration of CRO (Algorithm 2)

First, we initialize with a random classifier. One may also warm-start with a reasonable guess, such as the empirical risk minimizer defined as
$$h_{\text{ERM}} \in \arg\min_{h : \Omega_X \to \Omega_Y} \sum_i \sum_{(x, y) \in D_i} L(y, h(x)). \quad (42)$$
Throughout the runtime of the algorithm, we accumulate instances of distributions obtained via Neural-TR (Alg. 1). At each step, these distributions are constructed to maximize the risk of the classifier at hand. In this sense, Neural-TR can be viewed as an adversary, and CRO can be viewed as a game between two players:

1. Neural-TR searches over the space of plausible target domains, characterized by the source data and the domain relatedness encoded in the selection diagram, to find a distribution that is hard to generalize to using the classifier at hand.

2. Group DRO [32] updates the classifier at hand by minimizing the maximum risk over the distributions produced by Neural-TR so far, that is,
$$\min_{h : \Omega_X \to \Omega_Y} \max_{D \in \mathcal{D}^*} \frac{1}{|D|} \sum_{(x, y) \in D} L(y, h(x)). \quad (43)$$
For more information about group DRO, see Appendix D.2.

The equilibrium of this game is reached when the worst-case risk obtained by Neural-TR almost coincides with the risk obtained by group DRO, i.e.,
$$R_{P(x, y; \hat{\theta}^*)}(h) - \max_{D \in \mathcal{D}^*} \frac{1}{|D|} \sum_{(x, y) \in D} L(y, h(x)) < \delta. \quad (44)$$
Once this is achieved, we stop the search and return the classifier at hand. When the game is not at equilibrium, the difference is larger than $\delta$, meaning that the new target domain $\hat{\theta}^*$ has enough novelty to force the classifier at hand to perform at least $\delta$ worse than what it achieves over the existing distributions in $\mathcal{D}^*$. Therefore, we draw samples $D \sim P(x, y; \hat{\theta}^*)$ and add them to our collection $\mathcal{D}^*$. As shown in Theorem 3, this game reaches equilibrium in finitely many steps, and the returned classifier has the best worst-case risk w.r.t. the selection diagram $\mathcal{G}^\Delta$ and the source distributions $\mathbf{P}$. The conceptual Figure 15 shows the process of convergence of CRO. It is important to note that although we employ group DRO as a subroutine in our CRO algorithm, we do not use the source distributions directly; instead, we apply group DRO to the distributions obtained from Neural-TR. Note that, under the assumptions encoded in the selection diagram, the target distribution may be geometrically unrelated to the source distributions; the reason is that the mechanistic relatedness of the target domain to the source domains (as indicated by the graph) does not translate directly into closeness of the entailed distributions under common distributional distance measures.
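A compact sketch of this interaction is given below, under the assumption of two callables: `neural_tr` (returning an adversarial target model with `risk_of` and `sample` methods) and `group_dro` (fitting Eq. (43) over a list of datasets). Both callables are placeholders of this illustration, not the paper's implementation.

```python
def empirical_risk(h, dataset):
    """Average 0-1 loss of classifier h on a dataset of (x, y) pairs."""
    xs, ys = dataset
    return sum(int(h(x) != y) for x, y in zip(xs, ys)) / len(ys)

def cro(neural_tr, group_dro, h_init, delta):
    """Sketch of the CRO game loop (Alg. 2)."""
    h, datasets = h_init, []          # `datasets` is the collection D*
    while True:
        adversary = neural_tr(h)      # worst plausible target domain for h
        worst_risk = adversary.risk_of(h)
        dro_risk = max((empirical_risk(h, d) for d in datasets), default=0.0)
        if datasets and worst_risk - dro_risk < delta:
            return h                  # equilibrium reached, cf. Eq. (44)
        datasets.append(adversary.sample())  # register D ~ P(x, y; theta*)
        h = group_dro(datasets)       # min_h max_{D in D*} R_D(h), Eq. (43)
```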
Figure 15: Conceptual illustration of CRO. (a) Iteration 1. (b) Iteration 2. (c) Iteration 3. (d) Equilibrium. The rectangle represents the space of all distributions over $X, Y$, and the circle inside it represents the subset of distributions that are plausible target distributions, as characterized by the source distributions and the selection diagram. Iteration 1: we start with some classifier that may or may not perform well for all distributions in the plausible subset; the darker spots indicate distributions that yield higher risk for the classifier at hand. Neural-TR uses gradient ascent steps to find an SCM entailing the distribution that yields the highest risk for the classifier at hand, i.e., the darkest spot within the plausible subset (likely at its boundary), shown by the blue star in panel (a). We register this distribution by taking samples from it and adding them to the collection $\mathcal{D}^*$. Iteration 2: we update the classifier at hand to be group-robust w.r.t. the collection of distributions $\mathcal{D}^*$; in this case, this amounts to ordinary risk minimization, since there is only one distribution in the collection. The distributions close to the registered distribution now entail small risk, so the region around the first star is brighter. Once again, Neural-TR finds a distribution that yields high risk for the classifier at hand. Iteration 3: we update the classifier, this time to minimize the risk on both registered distributions (yellow stars) using group DRO. The risk is now smaller in most parts of the plausible set, though Neural-TR still finds another distribution at the boundary with high risk. Equilibrium: we update the classifier using group DRO over the three registered distributions. This time, the registered distributions adequately represent the plausible set, meaning that the maximum risk inside the plausible set is not significantly larger than what is achieved at the registered points through group DRO.

D Extended Related Work

In this section, we discuss learning schemes based on invariant and robust learning that have been proposed for domain generalization, including IRM and group DRO, which are discussed in the experiments.

D.1 Invariant Learning for Domain Generalization

Several common invariance criteria have been extensively studied in the literature and proposed for the domain generalization task. A prominent idea is label-conditional distribution invariance, which seeks a representation $\phi$ such that $P^i(Y \mid \phi(X))$ is equal across the source domains [29, 1, 13, 22]. These notions do not explicitly rely on an underlying structural causal model (SCM), although the invariances are often justified by an underlying causal model [27, 2, 39, 31, 33]. Jalaldoust & Bareinboim [17] studied the implicit assumptions that license the generalizability of representations satisfying the invariance of $P^i(Y \mid \phi(X))$. Searching for such a representation is practically challenging, however, and in some cases theoretically intractable; thus, one may resort to an approximate notion that serves as a proxy for the invariance of $P^i(Y \mid \phi(X))$. A well-known instance of such an effort is invariant risk minimization [2], discussed below.

The paper [2] studies a constrained optimization problem called invariant risk minimization (IRM) in the context of domain generalization. In the notation of our paper, the IRM problem can be written as follows:
$$\min_{\phi, h} \sum_{i=1}^{K} \mathbb{E}_{P^i}[Y \neq h \circ \phi(X)] \quad \text{s.t.} \quad h \in \arg\min_{\bar{h} : \Omega_R \to \{0, 1\}} \mathbb{E}_{P^i}[Y \neq \bar{h} \circ \phi(X)] \;\; \forall i, \quad (45)$$
where $\phi : \Omega_X \to \Omega_R$ is a representation, and $h : \Omega_R \to \{0, 1\}$ is a classifier defined on top of it. In words, a pair $(h, \phi)$ satisfies the invariant risk minimization property if $h \circ \phi$ attains the minimum risk among all classifiers defined based on $\phi$, across all source domains. The search procedure suggests choosing the classifier that satisfies the mentioned constraint and achieves minimum risk on the pooled source data. The constrained optimization program above is highly non-convex and hard to solve in practice. To approximate the solution, the paper considers the Lagrangian form below:
$$h_{\text{IRM}} \in \arg\min_{h_\theta : \Omega_X \to \{0, 1\}} \sum_{i=1}^{K} \mathbb{E}_{P^i}[Y \neq h_\theta(X)] + \lambda\, \big\|\nabla_\theta\, \mathbb{E}_{P^i}[Y \neq h_\theta(X)]\big\|^2. \quad (46)$$
In this program, $\theta$ parametrizes the classifier $h$, and the penalty coefficient $\lambda$ controls how strictly one enforces the IRM constraint.
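For reference, the widely used IRMv1 approximation of Eq. (46) can be sketched in PyTorch as below; `model` maps a batch of inputs to logits and `envs` is a list of per-domain batches, both assumptions of this illustration. The cross-entropy loss stands in as a differentiable surrogate for the 0-1 risk appearing in Eq. (46).

```python
import torch
import torch.nn.functional as F

def irm_objective(model, envs, lam):
    """Lagrangian form of IRM, Eq. (46), via the IRMv1 penalty of [2]:
    pooled risk plus the squared gradient of the per-environment risk
    w.r.t. a dummy classifier multiplier."""
    total = 0.0
    for x, y in envs:                              # one (x, y) batch per domain
        scale = torch.ones(1, requires_grad=True)  # dummy multiplier on logits
        risk = F.binary_cross_entropy_with_logits(model(x) * scale, y)
        (grad,) = torch.autograd.grad(risk, scale, create_graph=True)
        total = total + risk + lam * grad.pow(2).sum()
    return total  # y is assumed to be a float tensor with values in {0, 1}
```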
In the extreme $\lambda = 0$, the objective equates to vanilla ERM with all data pooled; at the other extreme, $\lambda \to \infty$ ascertains that the solution satisfies the IRM constraint. Consider a representation that satisfies the original IRM constraint in Eq. 45. The optimal classifier defined over this representation is the Bayes classifier, which uses the $\tfrac{1}{2}$ level set of $P^i(Y = 1 \mid \phi(X))$ as the decision boundary. This means that satisfying the IRM constraint implies a match between the $\tfrac{1}{2}$ level sets of $P^i(Y = 1 \mid \phi(X))$ across all source domains. On the other hand, invariance of $P^i(Y \mid \phi(X))$ requires the coincidence of every level set across the source domains; in this sense, the IRM constraint can be viewed as a proxy for the invariance property of $P^i(Y \mid \phi(X))$. One might speculate that since IRM yields a proxy to the invariance of $P^i(Y \mid \phi(X))$, it could still exhibit generalization, though slightly weaker than what is derived from full invariance. However, IRM is shown to have poor domain generalizability, both theoretically (e.g., [30]) and empirically (e.g., [15]). Still, due to the popularity of this method in the literature, we find it insightful to use the Neural-TR algorithm to find the worst-case risk of IRM. As shown in Fig. 4c, the worst-case performance of IRM is much worse than what is reported by [2] and [15]; the reason is that Neural-TR does not commit to one held-out domain, and instead constructs an SCM that is tailored to yield the poorest performance subject to the graph and source distributions.

D.2 Group Robustness for Domain Generalization

Group distributionally robust optimization (group DRO) [32] has been employed in the broad context of learning under uncertainty. In group DRO, one seeks a single classifier that minimizes the risk on multiple distributions simultaneously. More specifically, the objective is to minimize the maximum risk among the source distributions, i.e.,
$$h_{\text{DRO}} \in \arg\min_{h} \max_{i \in \{1, 2, \dots, K\}} R_{P^i}(h). \quad (47)$$
This approach ensures that the learned classifier is optimal w.r.t. an unknown target domain that lies in the convex hull of the source distributions. In this sense, the group DRO objective interpolates the perturbations represented in the source data to define an uncertainty set for the target distribution. In invariant learning, by contrast, the objective is to extrapolate the perturbations observed among the source domains by learning a representation that shields the label from these changes. In particular, [18, 31, 33] highlight the invariant-robust spectrum and propose methods with a free parameter that allows interpolating between the two. In our experiments, we considered group DRO as a representative of methods in this category and evaluated its worst-case performance on the Colored MNIST task, as shown in Figure 4c. Once again, we emphasize that this worst-case risk is much larger than what is shown in the benchmarks, e.g., by [15]. The reason is that the worst-case performance is obtained by Neural-TR, which operates as an adversary, seeking a plausible target domain that is hardest to generalize to, subject to the assumptions encoded in the graph and the source data.
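A simple max-risk variant of one update for Eq. (47) can be sketched as follows. Note that the algorithm of [32] maintains exponentiated-gradient weights over groups rather than taking a hard max, so this is an illustrative simplification.

```python
import torch

def group_dro_step(model, optimizer, domain_batches, loss_fn):
    """One gradient step on the currently worst source domain, a hard-max
    stand-in for the group DRO objective in Eq. (47)."""
    risks = torch.stack([loss_fn(model(x), y) for x, y in domain_batches])
    worst = risks.max()            # max_i R_{P^i}(h), estimated on the batches
    optimizer.zero_grad()
    worst.backward()
    optimizer.step()
    return worst.item()
```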
E Proof of Theorem 1

Our results rely on the expressiveness of discrete SCMs, i.e., SCMs defined over variables $\{V, U\}$ with finite cardinalities. Discrete SCMs, introduced first in [3] and then in [42], have been shown to be canonical in the sense that they can represent all counterfactual distributions entailed by any SCM with the same induced causal diagram defined over finite $V$. The following example illustrates this observation.

Example 10 (The double bow). Let $\{X, Y, Z\}$ be binary variables. Consider two domains defined based on the following SCMs:
$$\mathcal{M}^1: \; U_X, U_{XY}, U_{ZY} \sim \mathrm{Normal}(0, 1), \quad X \leftarrow \mathbb{1}\{U_X + U_{XY} > 0\}, \quad Y \leftarrow \mathbb{1}\{X + U_{XY} > 0\}, \quad Z \leftarrow \mathbb{1}\{Y + U_{ZY} > 0\};$$
$$\mathcal{M}^*: \; U_X, U_{XY}, U_{ZY} \sim \mathrm{Normal}(0, 1), \quad X \leftarrow \mathbb{1}\{U_X + U_{XY} > 0\}, \quad Y \leftarrow \mathbb{1}\{U_{XY} - 0.5 > 0\}, \quad Z \leftarrow \mathbb{1}\{Y + U_{ZY} > 0\}.$$
The SCM $\mathcal{M}^1$ induces counterfactual probabilities, e.g., $P^{\mathcal{M}^1}(x, y_x, z_y)$ for outcomes $x, y_x, z_y \in \{0, 1\}$. [3] observed that such probabilities, defined over a finite set of events, may be generated by an equivalent model with a potentially large but finite set of discrete exogenous variables. [3] derived a canonical parameterization for the SCMs that induces the same graph but instead involves possibly correlated discrete latent variables $R_X, R_Y, R_Z$, where $R_X$ determines the functional that decides $X$, $R_Y$ determines the functional that decides $Y$ based on $X$, and $R_Z$ determines the functional that decides $Z$ based on $Y$. [3] showed that for every SCM $\mathcal{M}$ with the same induced graph as $\mathcal{M}^1$ there exists an SCM of the described format specified by a single distribution $P(r_X, r_Y, r_Z)$, where
$$\mathrm{supp}\,R_X = \{0, 1\}, \quad \mathrm{supp}\,R_Y = \{y \equiv 0,\; y \equiv 1,\; y = x,\; y = 1 - x\}, \quad \mathrm{supp}\,R_Z = \{z \equiv 0,\; z \equiv 1,\; z = y,\; z = 1 - y\}.$$
Thus, the joint distribution $P(r_X, r_Y, r_Z)$ can be parameterized by a 32-dimensional vector.

This example illustrates a more general procedure, in which the probabilities induced by an SCM over discrete endogenous variables $V$ may be generated by a canonical model. This is formalized in the following definition and lemma.

Definition 7 (Canonical SCM). A canonical SCM is an SCM $\mathcal{N} = \langle U, V, F, P(U) \rangle$ defined as follows. The set of endogenous variables $V$ is discrete. The set of exogenous variables is $U = \{R_V : V \in V\}$, where $\mathrm{supp}\,R_V = \{1, \dots, m_V\}$ with $m_V = |\{h_V : \mathrm{supp}\,pa_V \to \mathrm{supp}\,V\}|$ for each $V \in V$. For each $V \in V$, $f_V \in F$ is defined as $f_V(pa_V, r_V) = h^{(r_V)}_V(pa_V)$.

The following lemma establishes the expressiveness of canonical SCMs.

Lemma 1 (Thm. 2.4 [42]). For an arbitrary SCM $\mathcal{M} = \langle U, V, F, P(U) \rangle$, there exists a canonical SCM $\mathcal{N}$ such that (1) $\mathcal{M}$ and $\mathcal{N}$ are associated with the same causal diagram, i.e., $\mathcal{G}^{\mathcal{M}} = \mathcal{G}^{\mathcal{N}}$; and (2) for any set of counterfactual variables $Y_x, \dots, Z_w$, $P^{\mathcal{M}}(Y_x, \dots, Z_w) = P^{\mathcal{N}}(Y_x, \dots, Z_w)$.

In words, finite exogenous domains in canonical SCMs are sufficient for capturing all the uncertainty and randomness introduced by the (potentially) continuous latent variables in SCMs.
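The finite count $m_V$ in Def. 7 can be made concrete by brute-force enumeration. Below is a minimal sketch, assuming small discrete cardinalities; the counts it produces match the supports listed in Example 10.

```python
from itertools import product

def canonical_functions(parent_cards, child_card):
    """Enumerate all functions h_V : supp(pa_V) -> supp(V); their number m_V
    is the cardinality of supp(R_V) in Def. 7."""
    domain = list(product(*[range(c) for c in parent_cards]))
    return [dict(zip(domain, values))
            for values in product(range(child_card), repeat=len(domain))]

# binary Y with a single binary parent X: m_Y = 2^2 = 4 functions,
# matching supp R_Y = {y=0, y=1, y=x, y=1-x} in Example 10
fns_Y = canonical_functions(parent_cards=[2], child_card=2)
assert len(fns_Y) == 4
# root X with no parents: m_X = 2, matching supp R_X = {0, 1}
fns_X = canonical_functions(parent_cards=[], child_card=2)
assert len(fns_X) == 2
```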
Our goal is to adapt the canonical parameterization of SCMs so that they entail the equality constraints specified by $\mathcal{G}^\Delta$. The next example illustrates the implications of the constraints induced by $\mathcal{G}^\Delta$ for the construction of canonical SCMs.

Example 11 (Example 10 continued). Consider $\mathcal{M}^1$ and $\mathcal{M}^*$ given in Example 10. The domain discrepancy set indicates that certain causal mechanisms must match across pairs of the SCMs. For example, $\Delta_{1*} = \{Y\}$, which does not contain $\{X, Z\}$; this implies that the functions $f_X, f_Z$ are invariant across $\mathcal{M}^1, \mathcal{M}^*$, and that the distributions of the unobserved variables that are arguments of $f_X, f_Z$, namely $\{U_X, U_{XY}, U_{ZY}\}$, are invariant across $\mathcal{M}^1, \mathcal{M}^*$. The canonical parameterization of $\mathcal{M}^1$ is given by
$$V = \{X, Y, Z\}, \quad U = \{R_X, R_Y, R_Z\},$$
$$f^1_X : \mathrm{supp}\,R_X \to \mathrm{supp}\,X, \quad f^1_Y : \mathrm{supp}\,R_Y \times \mathrm{supp}\,X \to \mathrm{supp}\,Y, \quad f^1_Z : \mathrm{supp}\,R_Z \times \mathrm{supp}\,Y \to \mathrm{supp}\,Z,$$
$$P^1(U) = P^1(R_X, R_Y, R_Z).$$
Analogously, the canonical parameterization of $\mathcal{M}^*$ is given by
$$V = \{X, Y, Z\}, \quad U = \{R_X, R_Y, R_Z\},$$
$$f^*_X : \mathrm{supp}\,R_X \to \mathrm{supp}\,X, \quad f^*_Y : \mathrm{supp}\,R_Y \times \mathrm{supp}\,X \to \mathrm{supp}\,Y, \quad f^*_Z : \mathrm{supp}\,R_Z \times \mathrm{supp}\,Y \to \mathrm{supp}\,Z,$$
$$P^*(U) = P^*(R_X, R_Y, R_Z).$$
With these definitions, the restrictions encoded in $\Delta_{1*}$ impose straightforward constraints on the parameterization of the canonical models, given directly by the definition of the discrepancy set:
$$f^1_X(r_X) = f^*_X(r_X), \quad P^1(r_X) = P^*(r_X), \qquad (X \notin \Delta_{1*})$$
$$f^1_Z(y, r_Z) = f^*_Z(y, r_Z), \quad P^1(r_Z) = P^*(r_Z), \qquad (Z \notin \Delta_{1*})$$
for any input $x, y, r_X, r_Y, r_Z$.

The next lemma formalizes the observation made in the example above, showing that if a pair of SCMs and a pair of associated canonical models induce the same distributions and causal diagram, their discrepancies must also agree.

Lemma 2. For a pair of SCMs $\mathcal{M}^i, \mathcal{M}^j$ ($i, j \in \{*, 1, 2, \dots, T\}$) defined over $V$ with discrepancy set $\Delta_{ij} \subseteq V$, let $\mathcal{N}^i, \mathcal{N}^j$ be associated canonical SCMs that induce the same causal graphs and entail the same distributions over $V$. Then the discrepancy sets of the pair of SCMs and the pair of canonical SCMs must agree, i.e., $V \in \Delta_{ij}$ if and only if either $f^{\mathcal{N}^i}_V \neq f^{\mathcal{N}^j}_V$ or $P^{\mathcal{N}^i}(u_V) \neq P^{\mathcal{N}^j}(u_V)$.

Proof. Let $V \in \Delta_{ij}$, and fix $\mathcal{M}^i, \mathcal{M}^j$ such that $P^{\mathcal{M}^i}(v \mid \mathrm{do}(pa_V)) \neq P^{\mathcal{M}^j}(v \mid \mathrm{do}(pa_V))$. This is possible since these interventional probabilities are parameterized by the mechanism of $V$, which may vary across $\mathcal{M}^i, \mathcal{M}^j$. Assume for a contradiction that $f^{\mathcal{N}^i}_V = f^{\mathcal{N}^j}_V$ and $P^{\mathcal{N}^i}(u_V) = P^{\mathcal{N}^j}(u_V)$ for two canonical models $\mathcal{N}^i, \mathcal{N}^j$ constructed to match all $\mathcal{L}_3$ statements induced by $\mathcal{M}^i, \mathcal{M}^j$. This implies in particular that $P^{\mathcal{N}^i}(v \mid \mathrm{do}(pa_V)) = P^{\mathcal{N}^j}(v \mid \mathrm{do}(pa_V))$, and therefore $\mathcal{N}^i, \mathcal{N}^j$ do not induce the same probabilities as $\mathcal{M}^i, \mathcal{M}^j$. This contradicts the assumption that the pair of canonical SCMs matches the pair of SCMs in all $\mathcal{L}_3$ statements. For the converse, we proceed similarly. For fixed $\mathcal{M}^i, \mathcal{M}^j$, assume for a contradiction that $f^{\mathcal{N}^i}_V \neq f^{\mathcal{N}^j}_V$ or $P^{\mathcal{N}^i}(u_V) \neq P^{\mathcal{N}^j}(u_V)$ such that $P^{\mathcal{N}^i}(v \mid \mathrm{do}(pa_V)) \neq P^{\mathcal{N}^j}(v \mid \mathrm{do}(pa_V))$ for two canonical models $\mathcal{N}^i, \mathcal{N}^j$ constructed to match all $\mathcal{L}_3$ statements induced by $\mathcal{M}^i, \mathcal{M}^j$, but nevertheless $V \notin \Delta_{ij}$. The discrepancy set ensures that $P^{\mathcal{M}^i}(v \mid \mathrm{do}(pa_V)) = P^{\mathcal{M}^j}(v \mid \mathrm{do}(pa_V))$, but the same relation does not hold for $\mathcal{N}^i, \mathcal{N}^j$, since $P^{\mathcal{N}^i}(v \mid \mathrm{do}(pa_V)) \neq P^{\mathcal{N}^j}(v \mid \mathrm{do}(pa_V))$ by assumption; therefore $\mathcal{N}^i, \mathcal{N}^j$ do not induce the same probabilities as $\mathcal{M}^i, \mathcal{M}^j$. This contradicts the assumption that the pair of canonical SCMs matches the pair of SCMs in all $\mathcal{L}_3$ statements. □

Lemma 3. Consider a system of multiple SCMs $\mathcal{M} := \{\mathcal{M}^1, \mathcal{M}^2, \dots, \mathcal{M}^K, \mathcal{M}^*\}$ that induces a selection diagram and entails the source distributions $\mathbf{P} := \{P^1, P^2, \dots, P^K, P^*\}$ over the variables $V$. Then there exists a system of canonical SCMs $\mathcal{N} := \{\mathcal{N}^1, \mathcal{N}^2, \dots, \mathcal{N}^K, \mathcal{N}^*\}$ such that (1) $\mathcal{M}$ and $\mathcal{N}$ are associated with the same set of causal diagrams and selection diagrams; and (2) for any set of counterfactual variables $Y_x, \dots, Z_w$, $P^{\mathcal{M}^*}(Y_x, \dots, Z_w) = P^{\mathcal{N}^*}(Y_x, \dots, Z_w)$.

Proof. For (1), Thm. 2.4 of [42] gives that the SCMs $\mathcal{M}$ and the canonical SCMs $\mathcal{N}$ induce the same causal diagrams, and Lem. 2 gives that for every pair of SCMs $\mathcal{M}^i, \mathcal{M}^j$ ($i, j \in \{*, 1, 2, \dots, T\}$), their discrepancy set is the same as that of $\mathcal{N}^i, \mathcal{N}^j$.
As selection diagrams are constructed deterministically from causal diagrams and discrepancy sets, $\mathcal{M}$ and $\mathcal{N}$ must share the same set of selection diagrams. (2) is given by Thm. 2.4 of [42]. □

Theorem 1 (restated). Consider a system of multiple SCMs $\mathcal{M}^1, \mathcal{M}^2, \dots, \mathcal{M}^K, \mathcal{M}^*$ that induces the selection diagram $\mathcal{G}^\Delta$ and entails the source distributions $P^1, P^2, \dots, P^K$ and the target distribution $P^*$ over the variables $V$. Let $\psi(P^*) \in [0, 1]$ be the target quantity. Consider the following optimization scheme:
$$\hat{q}_{\max} = \max_{\mathcal{N}^1, \mathcal{N}^2, \dots, \mathcal{N}^*} \psi(P^{\mathcal{N}^*}) \quad (48)$$
$$\text{s.t.} \quad P^{\mathcal{N}^i}(r_V) = P^{\mathcal{N}^j}(r_V), \quad \forall i, j \in \{1, 2, \dots, K, *\}, \;\; \forall V \notin \Delta_{i,j},$$
$$\phantom{\text{s.t.}} \quad P^{\mathcal{N}^i}(v) = P^i(v), \quad \forall i \in \{1, 2, \dots, K\},$$
where each $\mathcal{N}^i$ is a canonical model characterized by a joint distribution over $\{R_V\}_{V \in V}$. The value of the above optimization, namely $\hat{q}_{\max}$, is a tight upper bound for the quantity $\psi(P^*)$ among all tuples of SCMs that induce the selection diagram and entail the source distributions at hand.

Proof. Note that
$$\hat{q}_{\max} = \max_{\mathcal{M}^1, \mathcal{M}^2, \dots, \mathcal{M}^*} \psi(P^{\mathcal{M}^*}) \quad (49)$$
$$\text{s.t.} \quad P^{\mathcal{M}^i}(u_V) = P^{\mathcal{M}^j}(u_V), \;\; f^{\mathcal{M}^i}_V = f^{\mathcal{M}^j}_V, \quad \forall i, j \in \{1, 2, \dots, K, *\}, \;\; \forall V \notin \Delta_{i,j},$$
$$\phantom{\text{s.t.}} \quad P^{\mathcal{M}^i}(v) = P^i(v), \quad \forall i \in \{1, 2, \dots, K\},$$
is a tight upper bound on the target $\psi(P^{\mathcal{M}^*})$ among all tuples of SCMs that induce the selection diagram and entail the source distributions at hand, by construction. It follows from Lem. 3 that for any tuple of SCMs $\{\mathcal{M}^1, \mathcal{M}^2, \dots, \mathcal{M}^K, \mathcal{M}^*\}$ that induces the selection diagram and entails the source distributions, there exists a tuple of canonical SCMs $\mathcal{N}^1, \mathcal{N}^2, \dots, \mathcal{N}^*$ that induces the selection diagram and entails the source distributions such that $P^{\mathcal{M}^*}(Y_x, \dots, Z_w) = P^{\mathcal{N}^*}(Y_x, \dots, Z_w)$. The reverse direction also holds, since a family of canonical SCMs is an instance of a family of SCMs. This means that the solutions of the optimization problems in Eq. (48) and Eq. (49) must coincide. □
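For the two-variable setting of Example 5, the optimization in Eq. (48) reduces to a linear program over the joint $P(r_X, r_Y)$ once the source-identified marginals are fixed. The sketch below uses scipy with placeholder marginals standing in for the source-fitted canonical models, so it illustrates the reduction rather than reproducing the paper's numbers.

```python
import numpy as np
from scipy.optimize import linprog

# Worst-case P^{N*}(Y != X) for h(x) = x: maximize over the 2x4 joint
# P(r_X, r_Y) subject to shared marginals P(r_X), P(r_Y). The marginals
# below are placeholders; in practice they come from N^2 and N^1.
p_rx = np.array([0.5, 0.5])              # P^{N*}(r_X) = P^{N^2}(r_X)
p_ry = np.array([0.1, 0.2, 0.4, 0.3])    # P^{N*}(r_Y) = P^{N^1}(r_Y)
                                         # column order: y=0, y=1, y=x, y=1-x

# cost[r_X, r_Y] = 1{Y != X} when X = r_X and Y is decided by r_Y
cost = np.array([[0, 1, 0, 1],           # X = 0: error for y=1 and y=1-x
                 [1, 0, 0, 1]], float)   # X = 1: error for y=0 and y=1-x

# equality constraints: row sums = p_rx, column sums = p_ry
A_eq = np.zeros((6, 8))
for i in range(2):
    A_eq[i, i * 4:(i + 1) * 4] = 1.0
for j in range(4):
    A_eq[2 + j, j::4] = 1.0
b_eq = np.concatenate([p_rx, p_ry])

res = linprog(-cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print("worst-case risk:", -res.fun)
```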
E.1 Proof of Theorem 2

To prove this result, we need to show the following:

1. Necessity. Every tuple of NCMs $\Theta$ constrained by the conditions in Eq. 8 represents a tuple of SCMs that entails $\mathbf{P}$ and induces $\mathcal{G}^\Delta$.

2. Sufficiency. For every tuple of SCMs $\mathcal{M}$ that entails $\mathbf{P}$ and induces $\mathcal{G}^\Delta$, there exists a tuple of NCMs $\Theta$ that admits the constraints in Eq. 8 such that, for every $i \in \{*, 1, 2, \dots, K\}$, we have $P(y_x, z_w; \theta^i) = P^{\mathcal{M}^i}(y_x, z_w)$ for all counterfactuals $y_x, z_w$.

Necessity. Consider a tuple of NCMs $\Theta$ constrained by the conditions in Eq. 8.

$\mathcal{G}$-consistency. Since these NCMs are constructed based on the common causal diagram $\mathcal{G}$, they all induce $\mathcal{G}$ (Theorem 2 by Xia et al. [41]). Moreover, the parameter-sharing constraint states that $V \notin \Delta_{ij}$ if and only if $\theta^i_V = \theta^j_V$. This implies that the NCMs parameterized by $\Theta$ induce the same domain discrepancy sets as $\mathcal{G}^\Delta$. Thus, the selection diagram induced by the NCMs parameterized by $\Theta$ is exactly $\mathcal{G}^\Delta$.

$\mathbf{P}$-expressivity. The data likelihood condition for the source distribution $P^i(v)$ states the following:
$$\theta^i \in \arg\max_{\mathcal{G}\text{-constrained } \theta} \sum_{v \in D_i} \log P(v; \theta). \quad (50)$$
For a large enough sample size $|D_i|$, with $D_i \sim P^i(v)$, and enough model complexity in $\theta$, Theorem 1 by Xia et al. [41] shows that there exists a $\mathcal{G}$-constrained NCM $\theta$ that induces the distribution entailed by the true SCM $\mathcal{M}^i$. Thus, by imposing Eq. (50) we ensure that $P(v; \theta^i) = P^i(v)$. By imposing all data likelihood conditions, in the limit of sample size and model complexity, we ensure that the source NCMs induce the source distributions. In conclusion, the tuple of NCMs necessarily represents a plausible target domain, since (1) they induce $\mathcal{G}^\Delta$ and (2) they entail $\mathbf{P}$.

Sufficiency. Consider a tuple of SCMs $\mathcal{M} = \langle \mathcal{M}^1, \mathcal{M}^2, \dots, \mathcal{M}^K, \mathcal{M}^* \rangle$ that induces $\mathcal{G}^\Delta$ and entails $\mathbf{P}$. Theorem 1 by Xia et al. [41] shows that for every SCM $\mathcal{M}$ that induces $\mathcal{G}$, there exists a $\mathcal{G}$-constrained NCM parameterized by $\theta$ such that $P^{\mathcal{M}}(v) = P(v; \theta)$ (as a consequence of $\mathcal{L}_3$-consistency). The proof is constructive, and for every $V \in V$ the construction of the neural network $\theta_V$ depends on (1) the function $f_V$ and (2) the distribution $P^{\mathcal{M}}(u_V)$. Consider two SCMs $\mathcal{M}^i, \mathcal{M}^j$ ($i, j \in \{*, 1, 2, \dots, K\}$) that induce the domain discrepancy set $\Delta_{ij}$. Follow the construction by Xia et al. [41] to obtain the corresponding NCMs parameterized by $\theta^i, \theta^j$. For every $V \notin \Delta_{i,j}$ we have $\theta^i_V = \theta^j_V$, since the construction depends on $f^i_V = f^j_V$ and $P^{\mathcal{M}^i}(u_V) = P^{\mathcal{M}^j}(u_V)$. Thus, the domain discrepancy set induced by $\theta^i, \theta^j$ matches $\Delta_{i,j}$ induced by the SCMs $\mathcal{M}^i, \mathcal{M}^j$. Therefore, by constructing the NCM $\theta^i$ from $\mathcal{M}^i$ ($i \in \{*, 1, 2, \dots, K\}$), we are guaranteed to obtain a tuple of NCMs that (1) induces $\mathcal{G}^\Delta$ and (2) entails $\mathbf{P}$.

Partial-TR via NCMs. Due to the necessity and sufficiency arguments above, we conclude that a tuple of NCMs satisfies the parameter-sharing and data likelihood conditions stated in Eq. 8 if and only if there exists a tuple of SCMs $\mathcal{M}$ that induces $\mathcal{G}^\Delta$ and entails $\mathbf{P}$ such that $P(v; \theta^i) = P^{\mathcal{M}^i}(v)$ for all $i \in \{*, 1, 2, \dots, K\}$. Therefore, by solving the following optimization problem,
$$\hat{\Theta} \in \arg\max_{\Theta = \langle \theta^1, \theta^2, \dots, \theta^K, \theta^* \rangle} \sum_{v} \psi(w)\, P(v; \theta^*) \quad (51)$$
$$\text{s.t.} \quad \theta^i_V = \theta^j_V, \quad \forall i, j \in \{1, 2, \dots, K, *\}, \;\; \forall V \notin \Delta_{i,j},$$
$$\phantom{\text{s.t.}} \quad \theta^i \in \arg\max_\theta \sum_{v \in D_i} \log P(v; \theta), \quad \forall i \in \{1, 2, \dots, K\},$$
we achieve a tight upper bound for the query $\mathbb{E}_{P^*}[\psi(W)]$ w.r.t. $\mathcal{G}^\Delta, \mathbf{P}$. □

E.2 Proof of Proposition 1

Consider the objective of Theorem 2:
$$\hat{\Theta} \in \arg\max_{\Theta = \langle \theta^1, \theta^2, \dots, \theta^K, \theta^* \rangle} \sum_{v} \psi(w)\, P(v; \theta^*) \quad (52)$$
$$\text{s.t.} \quad \theta^i_V = \theta^j_V, \quad \forall i, j \in \{1, 2, \dots, K, *\}, \;\; \forall V \notin \Delta_{i,j},$$
$$\phantom{\text{s.t.}} \quad \theta^i \in \arg\max_\theta \sum_{v \in D_i} \log P(v; \theta), \quad \forall i \in \{1, 2, \dots, K\}.$$

No need to parameterize non-ancestors of $W$. Let $T = V \setminus \mathrm{An}_{\mathcal{G}^\Delta}(W)$. By applying Rule 3 of $\sigma$-calculus [9], we see that
$$P(w; \theta^*_{V \setminus T}, \theta^*_T) = P(w; \theta^*_{V \setminus T}, \theta'_T). \quad (53)$$
The latter indicates that the parameters $\{\theta_T\}_{T \in T}$ are irrelevant to the joint distribution $P(w)$ and can therefore be dropped from the NCMs used for the partial transportability of $\mathbb{E}_{P^*}[\psi(W)] = \sum_w P^*(w)\, \psi(w)$. Let $A = \mathrm{An}_{\mathcal{G}^\Delta}(W)$. We drop the non-ancestors and rewrite the objective as follows:
$$\hat{\Theta}_A \in \arg\max_{\Theta_A = \langle \theta^1_A, \theta^2_A, \dots, \theta^K_A, \theta^*_A \rangle} \sum_{a} \psi(w)\, P(a; \theta^*_A) \quad (54)$$
$$\text{s.t.} \quad \theta^i_V = \theta^j_V, \quad \forall i, j \in \{1, 2, \dots, K, *\}, \;\; \forall V \notin \Delta_{i,j},$$
$$\phantom{\text{s.t.}} \quad \theta^i_A \in \arg\max_{\theta_A} \sum_{a \in D_i} \log P(a; \theta_A), \quad \forall i \in \{1, 2, \dots, K\}.$$
Next, we add the likelihood terms to the main objective, regularized by a coefficient $\Lambda$, to obtain a single-objective optimization:
$$\hat{\Theta}_A \in \arg\max_{\Theta_A} \sum_{a} \psi(w)\, P(a; \theta^*_A) + \Lambda \sum_i \sum_{a \in D_i} \log P(a; \theta^i_A) \quad (55)$$
$$\text{s.t.} \quad \theta^i_V = \theta^j_V, \quad \forall i, j \in \{1, 2, \dots, K, *\}, \;\; \forall V \notin \Delta_{i,j}.$$
For $\Lambda \to \infty$, the new optimization problem matches that of Thm. 2. Now, we focus on the likelihood expression and rewrite it following a causal order of $\mathcal{G}^\Delta$, namely $A_1 \prec A_2 \prec \dots \prec A_N$:
$$\log P(a; \theta^i_A) = \sum_{l=1}^{N} \log P(a_l \mid a_{l-1}, \dots, a_1; \theta^i_A) \qquad \text{(factorization)} \quad (56)$$
$$= \sum_{l=1}^{N} \log \mathbb{E}_{U_A}[P(a_l \mid a_{l-1}, \dots, a_1, U; \theta^i_A)] \qquad \text{(conditioning on } U\text{)} \quad (57)$$
$$= \sum_{l=1}^{N} \log \mathbb{E}_{U_{A_l}}[P(a_l \mid pa_{A_l}, U_{A_l}; \theta^i_A)] \qquad \text{(Rule 1 of do-calc.)} \quad (58)$$
$$= \sum_{l=1}^{N} \log \mathbb{E}_{U_{A_l}}[P(a_l \mid \mathrm{do}(pa_{A_l}), U_{A_l}; \theta^i_A)] \qquad \text{(Rule 3 of do-calc.)} \quad (59)$$
Let $\{A^j\}_{j=1}^m$ be the c-components of $\mathcal{G}^\Delta[A]$, the graph induced by the nodes $A$. We rewrite the above objective in terms of the c-factors:
$$\log P(a; \theta^i_A) = \sum_{j=1}^{m} \sum_{A \in A^j} \log \mathbb{E}_{U_A}[P(a \mid pa_A, U_A; \theta^i_A)] \qquad \text{(c-factor decomp.)} \quad (60)$$
$$= \sum_{j=1}^{m} \log \prod_{A \in A^j} \mathbb{E}_{U_A}[P(a \mid pa_A, U_A; \theta^i_A)] \qquad \text{(sum-of-log to log-of-prod.)} \quad (61)$$
$$= \sum_{j=1}^{m} \log \mathbb{E}_{U_{A^j}}\Big[\prod_{A \in A^j} P(a \mid pa_A, U_A; \theta^i_A)\Big] \qquad \text{(mutually indep. } U_A\text{)} \quad (62)$$
$$= \sum_{j=1}^{m} \log P(a^j \mid \mathrm{do}(pa_{A^j}); \theta^i_{A^j}) \qquad \text{(trunc. fact. prod.)} \quad (63)$$
From the last expression, we observe that the NCM parameterization is modular w.r.t. the c-components, as also discussed by Rahman et al. [28]. We rewrite the full optimization program again:
$$\hat{\Theta}_A \in \arg\max_{\Theta_A} \sum_{a} \psi(a) \exp\Big\{\sum_{j=1}^{m} \log P(a^j \mid \mathrm{do}(pa_{A^j}); \theta^*_{A^j})\Big\} \quad (64)$$
$$+ \Lambda \sum_i \sum_{a^j \in D_i} \log P(a^j \mid \mathrm{do}(pa_{A^j}); \theta^i_{A^j}) \quad (65)$$
$$\text{s.t.} \quad \theta^i_V = \theta^j_V, \quad \forall i, j \in \{1, 2, \dots, K, *\}, \;\; \forall V \notin \Delta_{i,j}.$$
Let $A^j$ be a c-component that $S_i$ does not point to in $\mathcal{G}^\Delta$, i.e., $A^j \cap \Delta_i = \emptyset$. The latter means that the parameter sharing $\theta^*_V = \theta^i_V$ is enforced for all $V \in A^j$; we call these parameters $\theta^{i,*}_{A^j}$. We notice that $\theta^{i,*}_{A^j}$ only appears through the term $\log P(a^j \mid \mathrm{do}(pa_{A^j}); \theta_{A^j})$ in the score function: once in the main objective as $\theta^*_{A^j}$ and once in the regularizer as $\theta^i_{A^j}$. For $\Lambda \to \infty$, the regularizer enforces $\theta^{i,*}_{A^j}$ to satisfy the following criterion:
$$\theta^{i,*}_{A^j} \in \arg\max_{\theta_{A^j}} \sum_{a^j \in D_i} \log P(a^j \mid \mathrm{do}(pa_{A^j}); \theta_{A^j}). \quad (66)$$
This criterion is in fact an interventional ($\mathcal{L}_2$) constraint [41] that requires $\theta^{i,*}_{A^j}$ to approximate $P^i(a^j \mid \mathrm{do}(pa_{A^j}))$ using the observational data $D_i$. Since $P^i(a^j \mid \mathrm{do}(pa_{A^j}))$ is a complete c-factor, it is identifiable from $P^i(a^j, pa_{A^j})$ [36]. Therefore, by increasing the sample size $|D_i| \to \infty$ and the model complexity of $\theta^{i,*}_{A^j}$, satisfying the criterion in Eq. 66 guarantees an arbitrarily accurate approximation of the interventional quantities $P^i(a^j \mid \mathrm{do}(pa_{A^j}))$ [41]. This implies that we can replace the terms involving the parameters $\theta^{i,*}_{A^j}$ with any consistent approximation of $P^i(a^j \mid \mathrm{do}(pa_{A^j}))$, treated as constants. To obtain this approximation, we are free to use any probabilistic model and architecture depending on the context; this includes the option to train the NCM parameters $\theta^{i,*}_{A^j}$ during pre-training. This adjustment recovers the exact procedure pursued in Algorithm 1, which proves its consistency with what we would achieve via Theorem 2. □

E.3 Proof of Theorem 3

For this proof, it is useful to define the worst-case risk w.r.t. the selection diagram and the source distributions.

Definition 8 (Worst-case risk). For a selection diagram $\mathcal{G}^\Delta$ and source distributions $\mathbf{P}$, the worst-case risk of a classifier $h : \Omega_X \to \Omega_Y$ is denoted by $R_{\mathcal{G}^\Delta, \mathbf{P}}(h)$ and defined as the solution of the partial transportation task for the query $\mathbb{E}_{P^*}[L(Y, h(X))]$, where $L(y, \hat{y})$ is a loss function. Formally,
$$R_{\mathcal{G}^\Delta, \mathbf{P}}(h) := \max_{\substack{\text{tuple of SCMs } \mathcal{M} \text{ that entails } \mathbf{P} \\ \text{and induces } \mathcal{G}^\Delta}} R_{P^{\mathcal{M}^*}}(h). \quad (67)$$

Theorem 3 (restated). For discrete $X, Y$, CRO terminates. Furthermore, for large enough data across all source domains, the worst-case risk of CRO is at most $\epsilon$ away from that of the worst-case optimal classifier w.r.t. the selection diagram $\mathcal{G}^\Delta$ and source data $\mathbf{P}$. Formally,
$$\lim_{n \to \infty} P\big(R_{\mathcal{G}^\Delta, \mathbf{P}}(h^{\text{CRO}}_n) - \min_{h : \Omega_X \to \Omega_Y} R_{\mathcal{G}^\Delta, \mathbf{P}}(h) > \epsilon\big) = 0, \quad (68)$$
where $h^{\text{CRO}}_n := \text{CRO}(\mathcal{D}_n, \mathcal{G}^\Delta)$, and $\mathcal{D}_n = \langle D_1, D_2, \dots, D_K \rangle$ is a collection of datasets that each contain at least $n$ data points.

Proof. Soundness of CRO relies on the consistency of Neural-TR (Alg. 1) as a subroutine; we pick the data size large enough to satisfy this condition according to Theorem 2.

Termination. Let $\hat{\theta}^*_1, \hat{\theta}^*_2, \dots$ be the sequence of target NCMs produced during the runtime of CRO, and let $h_1, h_2, \dots$ be the sequence of classifiers obtained after each iteration. Let $\Pi$ denote the space of all distributions over $X, Y$. For discrete $X, Y$, the space $\Pi$ is a compact subspace of some Euclidean space. Thus, every sequence in $\Pi$ has a convergent subsequence, in particular the sequence $\{P(x, y; \hat{\theta}^*_m)\}_m \subset \Pi$; let $\{P_l\}_l$ be this convergent subsequence.
Every convergent subsequence is Cauchy, which means
$$\forall \tau > 0 \;\; \exists n > 0 \;\; \forall l, l' > n : \; d(P_l, P_{l'}) < \tau, \quad (69)$$
where $d$ is an appropriate metric on the probability space. Choose $\tau$ small enough w.r.t. the convergence tolerance $\delta > 0$ to ensure
$$\forall P, P' \text{ with } d(P, P') < \tau : \quad \forall h : \Omega_X \to \Omega_Y, \;\; |R_P(h) - R_{P'}(h)| \le \delta. \quad (70)$$
This is possible since the mapping $P \mapsto R_P(h)$ is bounded and continuous on the space $\Pi$. Now, we are guaranteed to find an index $l$ such that
$$|R_{P(x, y; \hat{\theta}^*_{l+1})}(h_l) - R_{P(x, y; \hat{\theta}^*_l)}(h_l)| < \delta. \quad (71)$$
Notice that, by definition, $\hat{\theta}^*_{l+1}$ is obtained by Neural-TR (Alg. 1) to attain the worst-case risk of $h_l$, i.e.,
$$R_{P(x, y; \hat{\theta}^*_{l+1})}(h_l) = R_{\mathcal{G}^\Delta, \mathbf{P}}(h_l). \quad (72)$$
Moreover,
$$R_{P(x, y; \hat{\theta}^*_l)}(h_l) \le \max_{i \in \{1, 2, \dots, l\}} R_{D_i}(h_l). \quad (73)$$
Putting the last three equations together, we have
$$R_{\mathcal{G}^\Delta, \mathbf{P}}(h_l) \le \max_{i \in \{1, 2, \dots, l\}} R_{D_i}(h_l) + \delta, \quad (74)$$
which invokes termination.

Worst-case optimality. Suppose $h^{\text{CRO}}$ is returned by CRO, and let $h^*$ be the true worst-case optimal classifier, defined as
$$h^* \in \arg\min_{h : \Omega_X \to \Omega_Y} R_{\mathcal{G}^\Delta, \mathbf{P}}(h). \quad (75)$$
Let $\mathcal{D}^*$ denote the collection of datasets collected by the algorithm before termination. We know that $h^{\text{CRO}}$ is robust w.r.t. $\mathcal{D}^*$, i.e.,
$$h^{\text{CRO}} \in \arg\min_{h : \Omega_X \to \Omega_Y} \max_{D \in \mathcal{D}^*} R_D(h) \implies \max_{D \in \mathcal{D}^*} R_D(h^{\text{CRO}}) \le \max_{D \in \mathcal{D}^*} R_D(h^*). \quad (76)$$
Moreover, every distribution in $\mathcal{D}^*$ is entailed by an NCM that represents a possible target domain. Therefore, the worst-case risk is at least as large as the worst-case empirical risk on the set of distributions $\mathcal{D}^*$, i.e.,
$$\max_{D \in \mathcal{D}^*} R_D(h^*) \le R_{\mathcal{G}^\Delta, \mathbf{P}}(h^*). \quad (77)$$
Since the algorithm has terminated, we have
$$R_{\mathcal{G}^\Delta, \mathbf{P}}(h^{\text{CRO}}) < \max_{D \in \mathcal{D}^*} R_D(h^{\text{CRO}}) + \delta, \quad (78)$$
where $\delta > 0$ is the tolerance for the convergence condition in the algorithm. Putting all inequalities together, we have
$$R_{\mathcal{G}^\Delta, \mathbf{P}}(h^{\text{CRO}}) - \delta \le \max_{D \in \mathcal{D}^*} R_D(h^{\text{CRO}}) \le \max_{D \in \mathcal{D}^*} R_D(h^*) \le R_{\mathcal{G}^\Delta, \mathbf{P}}(h^*), \quad (79\text{-}81)$$
which indicates that the worst-case risk of $h^{\text{CRO}}$ is at most $\delta$ larger than the optimal worst-case risk. □

F Broader Impact and Limitations

Our work investigates the design of algorithms and conditions under which knowledge acquired in one domain (e.g., a particular setting, experimental condition, or scenario) can be generalized to a different one that may be related, but is unlikely to be the same. As alluded to in this paper, under-identifiability issues and the difficulty of stating realistic assumptions that are conducive to extrapolation guarantees are pervasive throughout the data sciences. Our hope is that our analysis, with a more surgical encoding of structural differences between domains that allows the empirical investigator to determine whether (and how) her/his understanding of the underlying system is sufficient to support the generalization of a prediction algorithm, is an important addition towards safe and reliable AI. This approach is not without limitations, however. We have shown that selection diagrams are sufficient to ensure consistent domain generalization (through bounds instead of point estimates), but they arguably restrict the analysis to a narrower class of problems, as graphs or super-structures need to be defined. This stands in contrast with representation learning methods that operate on higher-dimensional spaces, e.g., text and images, which are difficult to reason about in a causal framework; the trade-off there is that guarantees for consistent extrapolation are difficult to define and that one-size-fits-all assumptions are difficult to justify in practice.
Partial transportability may be understood as a complementary viewpoint on this problem, applicable in a different class of problems in which structural knowledge is available, implying that non-trivial guarantees for extrapolation can be established. Pushing the boundaries of methods based on causal graphs to reach compelling real-world applications is arguably one of the most important frontiers for the causal community as a whole. In this work, there is scope for improving posterior estimation and for introducing assumptions on the class of SCMs that are modelled (e.g., linear Gaussian models) that could lead to efficient predictors in higher-dimensional spaces. Similarly, relaxations of selection diagrams, e.g., in the form of equivalence classes or partially-known graphs, could be developed for applications in domains where knowledge of the graph structure is unrealistic.

NeurIPS Paper Checklist

1. Claims. Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes]. Justification: please see the contribution bullets.

2. Limitations. Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes]. Justification: see Appendix F.
3. Theory Assumptions and Proofs. For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes]. Justification: all proofs are provided in the appendix with a more detailed restatement of the theorems and the needed assumptions.

4. Experimental Result Reproducibility. Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [No]. Justification: the code will be accessible through GitHub after publication. We ensured that the results are reproducible; please see Appendix B.3 for details on the architectures used.
5. Open access to data and code. Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in the supplemental material? Answer: [Yes]. Justification: code is provided.

6. Experimental Setting/Details. Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes]. Justification: see the corresponding examples for each experiment, and the reproducibility note in B.3.

7. Experiment Statistical Significance. Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA]. Justification: all the examples are small, so it is not applicable.
8. Experiments Compute Resources. For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes]. Justification: all experiments were executed on a MacBook Pro M2 with 32 GB RAM.

9. Code of Ethics. Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)? Answer: [Yes]. Justification: we respect the NeurIPS Code of Ethics.

10. Broader Impacts. Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes]. Justification: see Appendix F.
11. Safeguards. Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA]. Justification: our work is theoretical and bears no such risk.

12. Licenses for existing assets. Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes]. Justification: all related and used results are cited properly.
13. New Assets. Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets? Answer: [NA]. Justification: the code is annotated and provided.

14. Crowdsourcing and Research with Human Subjects. For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA]. Justification: the paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects. Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA]. Justification: the paper does not involve crowdsourcing nor research with human subjects.