Causal Imitability Under Context-Specific Independence Relations

Fateme Jamshidi, EPFL, Switzerland (fateme.jamshidi@epfl.ch)
Sina Akbari, EPFL, Switzerland (sina.akbari@epfl.ch)
Negar Kiyavash, EPFL, Switzerland (negar.kiyavash@epfl.ch)

Abstract

Drawbacks of ignoring the causal mechanisms when performing imitation learning have recently been acknowledged. Several approaches, both to assess the feasibility of imitation and to circumvent causal confounding and causal misspecification, have been proposed in the literature. However, the potential benefits of incorporating additional information about the underlying causal structure remain unexplored. An example of such overlooked information is context-specific independence (CSI), i.e., independence that holds only in certain contexts. We consider the problem of causal imitation learning when CSI relations are known. We prove that the decision problem pertaining to the feasibility of imitation in this setting is NP-hard. Further, we provide a necessary graphical criterion for imitation learning under CSI and show that under a structural assumption, this criterion is also sufficient. Finally, we propose a sound algorithmic approach for causal imitation learning which takes both CSI relations and data into account.

1 Introduction

Imitation learning has been shown to significantly improve performance in learning complex tasks in a variety of applications, such as autonomous driving [27], electronic games [17, 41], and navigation [34]. Moreover, imitation learning allows for learning merely from observing expert demonstrations, thereby circumventing the need for designing reward functions or interacting with the environment. Instead, imitation learning identifies a policy that mimics the demonstrator's behavior, which is assumed to be generated by an expert with near-optimal performance in terms of the reward. Imitation learning techniques come in two main flavors: behavioral cloning (BC) [42, 30, 24, 25] and inverse reinforcement learning (IRL; also referred to as inverse optimal control in the literature) [26, 1, 37, 44]. BC approaches often require extensive data to succeed. IRL methods have proved more successful in practice, albeit at the cost of an extremely high computational load. The celebrated generative adversarial imitation learning (GAIL) framework and its variants bypass the IRL step by occupancy measure matching to learn an optimal policy [15]. Despite recent achievements, applying imitation learning techniques in practice can still result in learning policies that are markedly different from that of the expert [9, 7, 22]. This phenomenon is for the most part the result of a distributional shift between the demonstrator and imitator environments [11, 31, 12]. All aforementioned imitation learning approaches rely on the assumption that the imitator has access to observations that match those of the expert. This assumption is clearly violated when unobserved confounding effects are present, e.g., because the imitator has access only to partial observations of the system. For instance, consider the task of training an imitator to drive a car, the causal diagram of which is depicted in Figure 1a. X and Y in this graph represent the action taken by the driver and the latent reward, respectively. The expert driver controls her speed (X) based on the speed limit on the highway (denoted by S), along with other covariates such as weather conditions,
brake indicator of the car in front, traffic load, etc. We use two such covariates in our example: Z and T. An optimal imitator should mimic the expert by taking actions according to the expert policy P(X|S, Z, T). However, if the collected demonstration data does not include the speed limit (S), the imitator would tend to learn a policy that averages over the expert's speed, taking only the other covariates (Z and T) into account. Such an imitator could end up crashing on serpentine roads or causing traffic jams on highways. As shown in Figure 1a, the speed limit acts as a latent confounder (if, on the other hand, the car had automatic cruise control, the expert policy would be independent of the speed limit, resolving the imitation issue; refer to Figure 1b).

[Figure 1: Example of learning to drive a car. (a) Not imitable: imitation is impossible, as the imitator has no access to the speed limit. (b) Imitable: the car has automatic cruise control, which makes the expert actions independent of the speed limit. (c) Contextually imitable: driver actions are independent of the speed limit only in the context of heavy traffic. (d) Wage rate example: a simple example of a CSI relation in economics.]

One could argue that providing more complete context data to the imitator, e.g., including the speed limit in the driving example, would resolve such issues. While this solution is not applicable in most scenarios, the fundamental problem in imitation learning is not limited to causal confounding. As demonstrated by [12] and [43], having access to more data not only does not always result in improvement but can even contribute to further deterioration in performance. In other words, important issues can stem not merely from a lack of observations but from ignoring the underlying causal mechanisms. The drawbacks of applying imitation learning without considering the causal structure have recently been acknowledged under terms including causal confusion [12], sensor-shift [13], imitation learning with unobserved confounders [43], and temporally correlated noise [36]. Incorporating causal structure when designing imitation learning approaches has been studied in recent work. For instance, [12] proposed performing targeted interventions to avoid causal misspecification. [43] characterized a necessary and sufficient graphical criterion for deciding the feasibility of an imitation task (a.k.a. imitability); it also proposed a practical causal imitation framework that allows for imitation in specific instances even when the general graphical criterion (the existence of a π-backdoor admissible set) does not hold. [36] proposed a method based on instrumental variable regression to circumvent causal confounding when a valid instrument is available.

Despite recent progress, the potential benefits of further information pertaining to the causal mechanisms remain unexplored. An important type of such information is context-specific independence (CSI) relations, which generalize conditional independence. Take, for instance, the labor market example shown in Figure 1d, where W, E, and U represent the wage rate, education level, and unemployment rate, respectively. The wage rate W is a function of E and U. However, when unemployment is greater than 10%, the education level does not have a significant effect on the wage rate, as there is more demand for the jobs than there are openings. That is, W is independent of E given U, only when U > 10%. This independence is shown by the label U > 0.1 on the edge E → W.
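To make the CSI of Figure 1d concrete, here is a tiny numeric illustration; all numbers below are our own toy choices, not taken from the paper. The conditional probability table is constructed so that the wage rate is insensitive to education precisely in the high-unemployment context:

```python
import numpy as np

# Toy CPT for the wage-rate example: P(W = 1 | E, U-regime),
# rows indexed by E in {0, 1}, columns by the regime 1{U > 0.1}.
p_w_given_e_u = np.array([
    [0.30, 0.20],   # E = 0
    [0.70, 0.20],   # E = 1
])

# CSI check: W is independent of E in the context U > 0.1 (the column for
# the high-unemployment regime is constant in E), but not unconditionally.
print(np.allclose(p_w_given_e_u[0, 1], p_w_given_e_u[1, 1]))  # True: CSI holds
print(np.allclose(p_w_given_e_u[0, 0], p_w_given_e_u[1, 0]))  # False: E matters when U <= 0.1
```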
Analogously, in our driving example, the imitator would still match the expert's policy if there were heavy traffic on the route. This is because, in the context of heavy traffic, the policy of the expert is independent of the speed limit. This context-specific independence between the speed limit S and the action X given T = 1 (heavy traffic) is indicated by the label (T = 1) on the edge S → X in Figure 1c. This label indicates that the edge S → X is absent when T = 1. The variables T in Figure 1c and U in Figure 1d are called context variables, i.e., variables which induce conditional independence relations in the system based on their realizations.

CSIs can be incorporated through a refined representation of Bayesian networks to increase the power of inference algorithms [8]. The incorporation of CSI also has ramifications for causal inference. For instance, [39] showed that when accounting for CSIs, Pearl's do-calculus is no longer complete for causal identification; they proposed a sound algorithm for identification under CSIs. It is noteworthy that unlike conditional independence (CI) relations, which can be learned from data and are widely used to infer the graphical representation of causal mechanisms [35, 6, 10, 4, 19, 18], not all CSI relations can be learned from mere observational data. However, several approaches exist for learning certain CSIs from data [21, 16, 32, 23].

In this paper, we investigate how CSIs allow for imitability in previously non-imitable instances. After presenting the notation (Section 2), we first prove a hardness result, namely, that deciding the feasibility of imitation learning while accounting for CSIs is NP-hard. This is in contrast to the classic imitability problem [43], where deciding imitability is equivalent to a single d-separation test. We characterize a necessary and sufficient graphical criterion for imitability under CSIs under a structural assumption (Section 3) by leveraging a connection we establish between our problem and classic imitability. Next, we show that in certain instances, the dataset might allow for imitation even though the graphical criterion is not satisfied (Section 4). Given the constructive nature of our achievability results, we propose algorithmic approaches for designing optimal imitation policies (Sections 3 and 4) and evaluate their performance in Section 5. The proofs of all our results appear in Appendix A.

2 Preliminaries

Throughout this work, we denote random variables and their realizations by capital and small letters, respectively. Likewise, we use boldface capital and small letters to represent sets of random variables and their realizations, respectively. For a variable X, D_X denotes the domain of X and P_X the space of probability distributions over D_X. Given two subsets of variables T and S such that T ⊆ S, and a realization s ∈ D_S, we use (s)_T to denote the restriction of s to the variables in T. We use structural causal models (SCMs) as the semantic framework of our work [28]. An SCM M is a tuple ⟨U, V, P^M(U), F⟩, where U and V are the sets of exogenous and endogenous variables, respectively. Values of the variables in U are determined by an exogenous distribution P^M(U), whereas the variables in V take values defined by the set of functions F = {f^M_V}_{V ∈ V}; that is, V ← f^M_V(Pa(V) ∪ U_V), where Pa(V) ⊆ V and U_V ⊆ U.
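The SCM semantics above translate directly into code. The following minimal Python sketch (a toy model of our own; none of its mechanisms come from the paper) represents the structural functions F and implements the intervention do(X = x) by replacing X's mechanism:

```python
import random

def sample(do_x=None):
    """Draw one sample from a toy SCM; do_x, if given, implements do(X = x)."""
    u = random.random()                  # exogenous noise U
    s = int(random.random() < 0.5)       # S: e.g., the speed limit
    # X's structural function depends on S and U unless intervened upon.
    x = do_x if do_x is not None else int(u + 0.5 * s > 0.7)
    y = int(u + 0.3 * x > 0.6)           # Y is confounded with X through U
    return {"S": s, "X": x, "Y": y}

obs = [sample() for _ in range(10_000)]           # observational regime P(O)
intv = [sample(do_x=1) for _ in range(10_000)]    # post-interventional regime P(.|do(X=1))
print(sum(d["Y"] for d in obs) / len(obs),
      sum(d["Y"] for d in intv) / len(intv))      # the two generally differ
```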
We also partition V into the observable and latent variables, denoted by O and L, respectively, such that V = O ∪ L. The SCM induces a probability distribution over V whose marginal distribution over the observable variables, denoted by P^M(O), is called the observational distribution. Moreover, we use do(X = x) to denote an intervention on X ⊆ V where the values of X are set to a constant x in lieu of the functions {f^M_X : X ∈ X}. We use P^M_x(y) := P^M(Y = y | do(X = x)) as a shorthand for the post-interventional distribution of Y after the intervention do(X = x).

Let X, Y, W, and Z be pairwise disjoint subsets of variables. X and Y are called contextually independent given W in the context z ∈ D_Z if P(X|Y, W, z) = P(X|W, z) whenever P(Y, W, z) > 0. We denote this context-specific independence (CSI) relation by X ⊥ Y | W, z. Moreover, a CSI is called local if it is of the form X ⊥ Y | z, where {Y} ∪ Z ⊆ Pa(X).

A directed acyclic graph (DAG) is defined as a pair G = (V, E), where V is the set of vertices and E ⊆ V × V denotes the set of directed edges among the vertices. SCMs are associated with DAGs, where each variable is represented by a vertex of the DAG, and there is a directed edge from V_i to V_j if V_i ∈ Pa(V_j) [28]. Whenever a local CSI of the form V_i ⊥ V_j | ℓ holds, we say ℓ is a label for the edge (V_i, V_j) and denote this by ℓ ∈ L_(V_i,V_j). Recalling the example of Figure 1c, the realization T = 1 is a label for the edge (S, X), which indicates that this edge is absent when T is equal to 1. Analogously to [29], we define a labeled DAG (LDAG), denoted by G^L, as a tuple G^L = (V, E, L), where L denotes the labels representing local independence relations. More precisely, L = {L_(V_i,V_j) : L_(V_i,V_j) ≠ ∅, (V_i, V_j) ∈ E}, where L_(V_i,V_j) = {ℓ ∈ D_V : V ⊆ Pa(V_j) \ {V_i}, V_i ⊥ V_j | ℓ}. Note that when L = ∅, G^L reduces to a DAG; that is, every DAG is a special LDAG with no labels. For ease of notation, we drop the superscript L when L = ∅. Given a label set L, we define the context variables of L, denoted by C(L), as the subset of variables of which at least one realization appears in the edge labels. More precisely,

$$\mathbf{C}(\mathcal{L}) = \left\{\, V_i \;\middle|\; \exists\, V_j, V_k \neq V_i,\ \exists\, \mathbf{V}',\ \exists\, \ell :\ (V_k, V_j) \in E,\ V_i \in \mathbf{V}',\ \ell \in D_{\mathbf{V}'} \cap L_{(V_k, V_j)} \,\right\}, \qquad (1)$$

where V' is an arbitrary subset of nodes containing V_i. The idea is that if there exists some subset V' of nodes containing V_i such that a realization ℓ of this subset results in an independence (e.g., of some V_j and V_k), then V_i is considered a context variable. We mainly focus on settings where the context variables are discrete or categorical. This assumption can be relaxed under certain considerations (see Remark 3.11). We let M⟨G^L⟩ denote the class of SCMs compatible with the causal graph G^L.

For a DAG G, we use $G_{\overline{X}}$ and $G_{\underline{X}}$ to represent the subgraphs of G obtained by removing the edges incoming to and outgoing from the vertices of X, respectively. We also use standard kinship abbreviations to represent graphical relationships: the sets of parents, children, ancestors, and descendants of X in G are denoted by Pa(X), Ch(X), An(X), and De(X), respectively. For disjoint subsets of variables X, Y, and Z in G, X and Y are said to be d-separated by Z in G, denoted by X ⊥ Y | Z, if every path between vertices in X and Y is blocked by Z (see Definition 1.2.3 in [28]). Finally, solid and dashed vertices in the figures represent the observable and latent variables, respectively.

3 Imitability

In this section, we address the decision problem, i.e., whether imitation learning is feasible given a causal mechanism.
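As a concrete reading of Equation (1), the sketch below uses a hypothetical encoding of the label set L (a map from edges to (scope, realization) pairs; the paper prescribes no particular data structure) and extracts C(L) as the set of variables appearing in some edge label:

```python
from typing import Dict, List, Tuple

# An edge label is a (scope, realization) pair: the edge (u, v) is absent
# whenever the variables in `scope` take the values in `realization`.
Label = Tuple[Tuple[str, ...], Tuple[int, ...]]

def context_variables(labels: Dict[Tuple[str, str], List[Label]]) -> set:
    """Equation (1): every variable appearing in some edge label is a context variable."""
    return {v for edge_labels in labels.values()
              for scope, _ in edge_labels
              for v in scope}

# LDAG of Figure 1c: the edge S -> X is absent in the context T = 1.
labels = {("S", "X"): [(("T",), (1,))]}
print(context_variables(labels))  # {'T'}
```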
We first review the imitation learning problem from a causal perspective, analogous to the framework developed by [43]. We will use this framework to formalize the causal imitability problem in the presence of CSIs. Recall that O and L represent the observed and unobserved variables, respectively. We denote the action and reward variables by X ∈ O and Y ∈ L, respectively; the reward variable is commonly assumed to be unobserved in imitation learning. Given a set of observable variables Pa_Π ⊆ O \ De(X), a policy π is defined as a stochastic mapping, denoted by π(X|Pa_Π), mapping the values of Pa_Π to a probability distribution over the action X. Given a policy π, we use do(π) to denote the intervention following the policy π, i.e., replacing the original function f_X in the SCM by the stochastic mapping π. The distribution of the variables under the policy intervention do(π) can be expressed in terms of post-interventional distributions P(·|do(x)) as follows:

$$P(\mathbf{v}\mid do(\pi)) \;=\; \sum_{x\in D_X,\; pa_\Pi \in D_{Pa_\Pi}} P(\mathbf{v}\mid do(x), pa_\Pi)\,\pi(x\mid pa_\Pi)\,P(pa_\Pi), \qquad (2)$$

where v is a realization of an arbitrary subset V ⊆ V. We refer to the collection of all possible policies as the policy space, denoted by Π = {π : D_{Pa_Π} → P_X}. Imitation learning is concerned with learning an optimal policy π* ∈ Π such that the reward distribution under policy π* matches that of the expert policy, that is, P(y|do(π*)) = P(y) [43]. Given a DAG G and the policy space Π, if such a policy exists, the instance is said to be imitable w.r.t. ⟨G, Π⟩. The causal imitability problem is formally defined below.

Definition 3.1 (Classic imitability w.r.t. ⟨G, Π⟩, [43]). Given a latent DAG G and a policy space Π, let Y be an arbitrary variable in G. P(y) is said to be imitable w.r.t. ⟨G, Π⟩ if for any M ∈ M⟨G⟩, there exists a policy π ∈ Π uniquely computable from P(O) such that P^M(y|do(π)) = P^M(y).

Note that if Y ∉ De(X), the third rule of Pearl's do-calculus implies P(y|do(x), pa_Π) = P(y|pa_Π), and from Equation (2), P(y|do(π)) = P(y) for any arbitrary policy π. Intuitively, in such a case, the action X has no effect on the reward Y, and regardless of the chosen policy, imitation is guaranteed. Therefore, throughout this work, we assume that X affects Y, i.e., Y ∈ De(X) ∩ L. Under this assumption, [43] proved that P(y) is imitable w.r.t. ⟨G, Π⟩ if and only if there exists a π-backdoor admissible set Z w.r.t. ⟨G, Π⟩.

Definition 3.2 (π-backdoor, [43]). Given a DAG G and a policy space Π, a set Z is called a π-backdoor admissible set w.r.t. ⟨G, Π⟩ if and only if Z ⊆ Pa_Π and Y ⊥ X | Z in $G_{\underline{X}}$.

The following lemma further reduces the search over π-backdoor admissible sets to a single candidate set.

Lemma 3.3. Given a latent DAG G and a policy space Π, if there exists a π-backdoor admissible set w.r.t. ⟨G, Π⟩, then Z = An({X, Y}) ∩ Pa_Π is a π-backdoor admissible set w.r.t. ⟨G, Π⟩.

As a result, deciding imitability reduces to testing a single d-separation, i.e., whether the set Z defined in Lemma 3.3 d-separates X and Y in $G_{\underline{X}}$, for which efficient algorithms exist [14, 38]. If this d-separation holds, then π(X|Z) = P(X|Z) is an optimal imitating policy. Otherwise, the instance is not imitable.

[Figure 2: Two examples of context-induced subgraphs (Definition 3.5). (b) $G^{L_w}_w$ with W = {Z}, w = {1}; (c) $G^{L_w}_w$ with W = {T, Z}, w = {0, 1}.]

It is noteworthy that in practice, it is desirable to choose π-backdoor admissible sets with small cardinality for statistical efficiency. Polynomial-time algorithms for finding minimum(-cost) d-separators exist [2, 38].
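The decision procedure suggested by Lemma 3.3 is a single ancestor computation followed by one d-separation test. A sketch with networkx is given below; `is_d_separator` is the name used by networkx ≥ 3.3 (older releases expose the same test as `d_separated`), and the toy graph at the bottom imitates Figure 1a with the latent speed limit S modeled as an explicit node that is excluded from Pa_Π:

```python
import networkx as nx

def pi_backdoor_admissible(G: nx.DiGraph, X, Y, pa_pi):
    """Test the pi-backdoor criterion (Def. 3.2) via Lemma 3.3.

    Returns the candidate set Z = An({X, Y}) & Pa_Pi if admissible, else None.
    Latent variables must be represented as explicit nodes absent from pa_pi.
    """
    ancestors = nx.ancestors(G, X) | nx.ancestors(G, Y) | {X, Y}
    Z = (ancestors & set(pa_pi)) - {X, Y}
    # G_underline(X): remove the edges going out of X.
    G_under = G.copy()
    G_under.remove_edges_from(list(G.out_edges(X)))
    return Z if nx.is_d_separator(G_under, {X}, {Y}, Z) else None

# Toy version of Figure 1a: S confounds X and Y but is unobserved,
# so it cannot be part of Pa_Pi.
G = nx.DiGraph([("S", "X"), ("S", "Y"), ("Z", "X"), ("T", "X"), ("X", "Y")])
print(pi_backdoor_admissible(G, "X", "Y", pa_pi={"Z", "T"}))  # None: not imitable
```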
3.1 Imitability with CSIs

Deciding imitability when accounting for CSIs is not as straightforward as in the classic case discussed earlier. In particular, as we shall see, the existence of π-backdoor admissible sets is no longer necessary for the imitability of P(y) in the presence of CSIs. In this section, we establish a connection between the classic imitability problem and imitability under CSIs. We begin with a formal definition of imitability in our setting.

Definition 3.4 (Imitability w.r.t. ⟨G^L, Π⟩). Given an LDAG G^L and a policy space Π, let Y be an arbitrary variable in G^L. P(y) is called imitable w.r.t. ⟨G^L, Π⟩ if for any M ∈ M⟨G^L⟩, there exists a policy π ∈ Π uniquely computable from P(O) such that P^M(y|do(π)) = P^M(y).

For an LDAG G^L, recall that we defined the set of context variables C(L) in Equation (1). The following definition is central in linking imitability under CSIs to the classic case.

Definition 3.5 (Context-induced subgraph). Given an LDAG G^L = (V, E, L), for a subset W ⊆ C(L) and a realization w ∈ D_W, we define the context-induced subgraph of G^L w.r.t. w, denoted by $G^{L_w}_w$, as the LDAG obtained from G^L by keeping only the labels that are compatible with w (formally, a label ℓ is compatible with a realization w if they are consistent, in the sense that the variables shared by the label and the realization take the same values under both assignments), and deleting the edges that are absent given W = w, along with the edges incident to W.

Consider the example of Figure 2 for visualization. In the context Z = 1, the label Z = 0 on the edge U → X is discarded, as Z = 0 is not compatible with the context (see Figure 2b); note that the edges incident to the context variable Z are also omitted. On the other hand, in the context Z = 1, T = 0, the edge X → Y is absent and can be deleted from the corresponding graph (refer to Figure 2c); edges incident to both T and Z are removed in this case.

Equipped with this definition, the following result, the proof of which appears in Appendix A, characterizes a necessary condition for imitability under CSIs.

Lemma 3.6. Given an LDAG G^L and a policy space Π, let Y be an arbitrary variable in G^L. P(y) is imitable w.r.t. ⟨G^L, Π⟩ only if P(y) is imitable w.r.t. ⟨$G^{L_w}_w$, Π⟩ for every realization w ∈ D_W of every subset of variables W ⊆ C(L).

For instance, a necessary condition for the imitability of P(y) in the graph of Figure 2a is that P(y) is imitable in both 2b and 2c. Consider the following special case of Lemma 3.6: if W = C(L), then $G^{L_w}_w = G_w$ is a DAG, since L_w = ∅ for every w ∈ D_W. In essence, a necessary condition for imitability under CSIs can be expressed in terms of several classic imitability instances:

Corollary 3.7. Given an LDAG G^L and a policy space Π, let Y be an arbitrary variable in G^L. P(y) is imitable w.r.t. ⟨G^L, Π⟩ only if P(y) is imitable w.r.t. ⟨G_c, Π⟩, i.e., there exists a π-backdoor admissible set w.r.t. ⟨G_c, Π⟩, for every c ∈ D_{C(L)}.

It is noteworthy that although the subgraphs G_c in Corollary 3.7 are defined in terms of the realizations of C(L), the number of such subgraphs does not exceed 2^{|E|}. This is due to the fact that the graphs G_c share the same set of vertices, and their edge sets are subsets of the edges of G^L.

Although deciding classic imitability is straightforward, the number of instances in Corollary 3.7 can grow exponentially in the worst case. However, in view of the following hardness result, a more efficient criterion in terms of computational complexity cannot be expected.
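Enumerating the graphs G_c of Corollary 3.7 only requires the mechanical construction of Definition 3.5. The sketch below (continuing the hypothetical label encoding from the earlier snippet) discards labels incompatible with w, deletes edges whose absence is triggered by w, and removes edges incident to W:

```python
import networkx as nx

def context_induced_subgraph(G: nx.DiGraph, labels, w: dict):
    """Definition 3.5: build G^{L_w}_w for the context W = w.

    `labels` maps an edge (u, v) to a list of (scope, realization) pairs,
    as in the earlier sketch. Returns the subgraph and the surviving labels.
    """
    H = G.copy()
    new_labels = {}
    for (u, v), edge_labels in labels.items():
        kept = []
        for scope, realization in edge_labels:
            a = dict(zip(scope, realization))
            if any(a[k] != w[k] for k in set(a) & set(w)):
                continue                         # label incompatible with w: discard it
            if set(a) <= set(w):                 # label fully triggered by w:
                H.remove_edge(u, v)              # the edge is absent in this context
                kept = []
                break
            kept.append((scope, realization))
        if kept:
            new_labels[(u, v)] = kept
    for node in w:                               # drop edges incident to W
        H.remove_edges_from(list(H.in_edges(node)) + list(H.out_edges(node)))
    return H, new_labels
```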
Theorem 3.8. Given an LDAG G^L and a policy space Π, deciding the imitability of P(y) w.r.t. ⟨G^L, Π⟩ is NP-hard.

This result places the problem of causal imitability under CSI relations among the class of NP-hard problems in the field of causality, alongside other challenges such as devising minimum-cost interventions for query identification [3] and discovering the most plausible graph for causal effect identifiability [5]. Although Theorem 3.8 indicates that determining imitability under CSIs might be intractable in general, as we shall see in the next section, taking only a handful of CSI relations into account can render previously non-imitable instances imitable.

Before concluding this section, we consider a special yet important case of the general problem. Specifically, for the remainder of this section, we assume that Pa(C(L)) ⊆ C(L); that is, the context variables have parents only among the context variables. Under this assumption, the necessary criterion of Corollary 3.7 turns out to be sufficient for imitability as well. More precisely, we have the following characterization.

Proposition 3.9. Given an LDAG G^L where Pa(C(L)) ⊆ C(L) and a policy space Π, let Y be an arbitrary variable in G^L. P(y) is imitable w.r.t. ⟨G^L, Π⟩ if and only if P(y) is imitable w.r.t. ⟨G_c, Π⟩ for every c ∈ D_{C(L)}.

Algorithm 1 Imitation w.r.t. ⟨G^L, Π⟩
1: function IMITATE1(G^L, Π, X, Y)
2:   compute C := C(L) using Equation (1)
3:   for c ∈ D_C do
4:     construct the DAG G_c using Definition 3.5
5:     if FINDSEP(G_c, Π, X, Y) fails then
6:       return FAIL
7:     else
8:       Z_c ← FINDSEP(G_c, Π, X, Y)
9:       π_c(X|pa_Π) ← P(X|(pa_Π)_{Z_c})
10:  π*(X|pa_Π) ← Σ_{c∈D_C} 1{(pa_Π)_C = c}·π_c(X|pa_Π)
11:  return π*

The proof of sufficiency, which is constructive, appears in Appendix A. The key insight is that an optimal imitation policy can be assembled from the imitation policies corresponding to the instances ⟨G_c, Π⟩. In view of Proposition 3.9, we provide an algorithmic approach for finding an optimal imitation policy under CSIs, described in Algorithm 1. The algorithm takes as input an LDAG G^L, a policy space Π, an action variable X, and a latent reward Y. It begins by identifying the context variables C (= C(L)), defined by Equation (1) (line 2). Next, for each realization c ∈ D_C, the corresponding context-induced subgraph (Definition 3.5), which is a DAG, is built. If P(y) is not imitable in any one of these DAGs, the algorithm fails, i.e., it declares that P(y) is not imitable in G^L. Imitability in each DAG is checked through a d-separation test based on Lemma 3.3 (for further details of the function FINDSEP, see Algorithm 3 in Appendix B). Otherwise, for each realization c ∈ D_C, an optimal policy π_c is learned through the application of the π-backdoor admissible set criterion (line 9). If such a policy exists for every realization of C, then P(y) is imitable w.r.t. ⟨G^L, Π⟩ by Proposition 3.9, and an optimal imitating policy π* is computed based on the previously identified policies π_c. Specifically, π*, the output of the algorithm, is defined as

$$\pi^*(X\mid pa_\Pi) = \sum_{c\in D_{\mathbf{C}}} \mathbb{1}\{(pa_\Pi)_{\mathbf{C}} = c\}\,\pi_c(X\mid pa_\Pi).$$

Theorem 3.10. Under the assumption that Pa(C(L)) ⊆ C(L), Algorithm 1 is sound and complete for determining the imitability of P(y) w.r.t. ⟨G^L, Π⟩ and for finding an optimal imitating policy in the imitable case.
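A compact Python rendering of Algorithm 1 is sketched below. It reuses `context_variables` and `context_induced_subgraph` from the earlier snippets, inlines the Lemma 3.3 test as `find_sep`, and leaves the estimation of P(X | (pa_Π)_{Z_c}, C = c) to a user-supplied callable; the dictionary encoding of the piecewise policy π* is our own illustrative choice:

```python
import itertools
import networkx as nx

def find_sep(G: nx.DiGraph, X, Y, pa_pi):
    """Lemma 3.3: return Z = An({X, Y}) & Pa_Pi if it d-separates X and Y
    in G with X's outgoing edges removed, else None (FAIL)."""
    Z = ((nx.ancestors(G, X) | nx.ancestors(G, Y)) & set(pa_pi)) - {X, Y}
    G_under = G.copy()
    G_under.remove_edges_from(list(G.out_edges(X)))
    return Z if nx.is_d_separator(G_under, {X}, {Y}, Z) else None

def imitate1(G, labels, domains, X, Y, pa_pi, estimate_policy):
    """Algorithm 1. `domains` maps each context variable to its domain;
    `estimate_policy(Z_c, c)` should return P(X | (pa_Pi)_{Z_c}, C = c)
    estimated from observational data. Returns a dict c -> policy, or None."""
    C = sorted(context_variables(labels))                 # Equation (1), line 2
    pieces = {}
    for vals in itertools.product(*(domains[v] for v in C)):
        c = dict(zip(C, vals))
        G_c, _ = context_induced_subgraph(G, labels, c)   # Definition 3.5, line 4
        Z_c = find_sep(G_c, X, Y, pa_pi)
        if Z_c is None:
            return None                                   # FAIL: not imitable w.r.t. <G^L, Pi>
        pieces[vals] = estimate_policy(Z_c, c)            # pi_c(X | pa_Pi)
    # pi* dispatches on the realization of C observed in pa_Pi (line 10).
    return pieces
```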
Remark 3.11. Algorithm 1 requires assessing the d-separation in line 5 in the context-induced subgraphs G_c. Even when the context variables C are continuous, the domain of these variables can be partitioned into at most 2^m equivalence classes in terms of their corresponding context-induced subgraph, where m denotes the number of labeled edges. This holds since the number of context-induced subgraphs cannot exceed 2^m. It is noteworthy, however, that solving the system of equations referred to in line 10 of Algorithm 2 for continuous variables may bring additional computational challenges.

Remark 3.12. Under certain assumptions, polynomial-time algorithms can be devised for deciding imitability. One such instance is when |D_{C(L)}| = O(log(|V|)). Another analogous case is when the number of context-specific edges is O(log(|V|)). Both of these lead to polynomially many context-induced subgraphs in terms of |V|, which in turn implies that Algorithm 1 runs in polynomial time.

4 Leveraging causal effect identifiability for causal imitation learning

Arguably, the main challenge in imitation learning stems from the latent reward. However, in certain cases, there exist observable variables S ⊆ O such that P(S|do(π)) = P(S) implies P(y|do(π)) = P(y) for any policy π ∈ Π. Such an S is said to be an imitation surrogate for Y [43].

[Figure 3: Two examples of leveraging CSI relations to achieve imitability. (a) LDAG with a non-identifiable surrogate; (b) LDAG with no surrogates; (c) G^L(C = 0); (d) G^L(C = 1).]

Consider, for instance, the graph of Figure 3a, where X represents the pricing strategy of a company, C is a binary variable indicating a recession (C = 0) or expansion (C = 1) period, U denotes factors such as demand and competition in the market, S represents the sales, and Y is the overall profit of the company. By Proposition 3.9, P(y) is not imitable in this graph. On the other hand, the sales figure (S) is an imitation surrogate for the profit (Y), as it can be shown that whenever P(S|do(π)) = P(S) for a given policy π, P(y|do(π)) = P(y) holds for the same policy. Yet, according to do-calculus, P(S|do(π)) itself is not identifiable due to the common confounder U. On the other hand, we note that the company's pricing strategy (X) becomes independent of demand (U) during a recession (C = 0), as the company may not have enough customers regardless of the price it sets. This CSI relation yields the following identification formula for the effect of an arbitrary policy π on the sales figures:

$$P(s\mid do(\pi)) \;=\; \sum_{x,c} P(s\mid x, C=0)\,\pi(x\mid c)\,P(c), \qquad (3)$$

where all the terms on the right-hand side are known given the observations. Note that even though S is a surrogate in Figure 3a, without the CSI C = 0 we could not have written Equation (3). Given the identification result of this equation, if the set of equations P(s|do(π)) = P(s) has a solution π*, then π* is an imitation policy for P(y) as well [43]. It is noteworthy that solving the aforementioned linear system of equations for π* is straightforward, for it boils down to a matrix inversion in the discrete case (and a kernel inversion in the continuous case). In the example discussed, although the graphical criterion of Proposition 3.9 does not hold, the data-specific parameters could yield imitability in the case that these equations are solvable. We therefore say P(y) is imitable w.r.t. ⟨G^L, Π, P(O)⟩, as opposed to w.r.t. ⟨G^L, Π⟩. Precisely, we say P(y) is imitable w.r.t. ⟨G^L, Π, P(O)⟩ if for every M ∈ M⟨G^L⟩ such that P^M(O) = P(O), there exists a policy π such that P^M(y|do(π)) = P^M(y). The idea of surrogates can thus be useful for circumventing the imitability problem when the graphical criterion does not hold.
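Since X and C are discrete, Equation (3) makes P(s|do(π)) linear in the policy parameters, so imitation reduces to solving a small linear system subject to π being a valid conditional distribution. A sketch with toy numbers (binary X, C, S; none of these values come from the paper):

```python
import numpy as np

# Toy observational quantities (estimated from data in practice):
# p_s_x[s, x] = P(s | x, C = 0), p_c[c] = P(c), p_s[s] = P(s).
p_s_x = np.array([[0.8, 0.3], [0.2, 0.7]])
p_c = np.array([0.6, 0.4])
p_s = np.array([0.55, 0.45])

# Unknowns: pi[x, c] = pi(x | c). Equation (3):
#   P(s | do(pi)) = sum_{x,c} P(s | x, C=0) pi(x | c) P(c).
A = np.zeros((2, 4))
for s in range(2):
    for x in range(2):
        for c in range(2):
            A[s, 2 * x + c] = p_s_x[s, x] * p_c[c]

# Normalization constraints: sum_x pi(x | c) = 1 for each c.
N = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])
lhs = np.vstack([A, N])
rhs = np.concatenate([p_s, np.ones(2)])
pi_vec, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
pi = pi_vec.reshape(2, 2)

# A solution is admissible only if it is a proper conditional distribution.
feasible = np.all(pi >= -1e-9) and np.allclose(pi.sum(axis=0), 1)
print(pi, feasible)
```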
In Figure 3b, however, neither does the graphical criterion yield imitability, nor does any imitation surrogate exist. In what follows, we discuss how CSIs can help circumvent the problem of imitability even in such instances. Given an LDAG G^L and a context C = c, we denote the context-specific DAG w.r.t. ⟨G^L, c⟩ by G^L(c), where G^L(c) is the DAG obtained from G^L by deleting all the spurious edges, i.e., the edges that are absent given the context C = c.

Algorithm 2 Imitation w.r.t. ⟨G^L, Π, P(O)⟩
1: function IMITATE2(G^L, Π, X, Y, P(O))
2:   if SUBIMITATE(G^L, Π, X, Y, ∅, ∅, P(O)) then
3:     π* ← SUBIMITATE(G^L, Π, X, Y, ∅, ∅, P(O))
4:     return π*
5:   else
6:     return FAIL

1: function SUBIMITATE(G^L, Π, X, Y, C, c, P(O))
2:   construct the context-specific DAG G^L(c)
3:   if FINDSEP(G^L(c), Π, X, Y) then
4:     Z_c ← FINDSEP(G^L(c), Π, X, Y)
5:     π_c(X|Pa_Π) ← P(X|Z_c, C = c)
6:     return π_c
7:   else if FINDMINSEP(G^L(c)_Π, X, Y, C) then
8:     S_c ← FINDMINSEP(G^L(c)_Π, X, Y, C) \ C
9:     if CSI-ID(P(S_c|do(x), C = c)) then
10:      solve P(S_c|do(π_c), c) = P(S_c|c) for π_c ∈ Π
11:      if such a π_c exists then
12:        return π_c
13:  else
14:    if C = C(L) ∩ Pa_Π then
15:      return FAIL
16:    else
17:      choose V ∈ (C(L) ∩ Pa_Π) \ C arbitrarily
18:      for v ∈ D_V do
19:        C' ← C ∪ {V}, c' ← c ∪ {v}
20:        if SI(G^L, Π, X, Y, C', c', P(O)) then
21:          π_{c,v} ← SI(G^L, Π, X, Y, C', c', P(O))
22:        else
23:          return FAIL
24:      π_c(X|Pa_Π) ← Σ_{v∈D_V} 1{(Pa_Π)_V = v}·π_{c,v}
25:      return π_c

Definition 4.1 (Context-specific surrogate). Given an LDAG G^L, a policy space Π, and a context c, a set of variables S ⊆ O is called a context-specific surrogate w.r.t. ⟨G^L(c), Π⟩ if X ⊥ Y | {S, C} in G^L(c)_Π, where G^L(c)_Π is the supergraph of G^L(c) obtained by adding edges from Pa_Π to X.

Consider the LDAG of Figure 3b for visualization. The context-specific DAGs corresponding to the contexts C = 0 and C = 1 are shown in Figures 3c and 3d, respectively. S is a context-specific surrogate with respect to ⟨G^L(C = 0), Π⟩, while no context-specific surrogate exists with respect to ⟨G^L(C = 1), Π⟩. The following result indicates that despite the absence of a general imitation surrogate, the existence of context-specific surrogates can suffice for imitation.

Proposition 4.2. Given an LDAG G^L and a policy space Π, let C be a subset of Pa_Π ∩ C(L). If for every realization c ∈ D_C at least one of the following holds, then P(y) is imitable w.r.t. ⟨G^L, Π, P(O)⟩:
- There exists a subset S_c such that X ⊥ Y | {S_c, C} in G^L(c)_Π, and P(S_c|do(π), C = c) = P(S_c|C = c) has a solution π_c ∈ Π.
- There exists Z_c ⊆ Pa_Π such that X ⊥ Y | {Z_c, C} in $G^L(c)_{\underline{X}}$.

As a concrete example, with Z_1 = ∅, X ⊥ Y | {Z_1, C} holds in the graph of Figure 3d. Moreover, S_0 = {S} is a context-specific surrogate w.r.t. ⟨G^L(C = 0), Π⟩. As a consequence of Proposition 4.2, P(y) is imitable w.r.t. ⟨G^L, Π, P(O)⟩ if a solution π_0 exists to the linear set of equations P(S|do(π), C = 0) = P(S|C = 0), where

$$P(s\mid do(\pi), C=0) \;=\; \sum_{x,t} P(s\mid x, T=0, C=0)\,\pi(x\mid t, C=0)\,P(t\mid C=0),$$

analogous to Equation (3).

To sum up, accounting for CSIs has a two-fold benefit: (a) context-specific surrogates can be leveraged to render previously non-imitable instances imitable, and (b) identification results can be derived for imitation surrogates that were previously non-identifiable.
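Before walking through Algorithm 2 in detail, the following Python skeleton mirrors the structure of SUBIMITATE. The separator searches, the CSI-based identification check, and the linear solve are abstracted as user-supplied callables, and `context_specific_dag` is an assumed helper implementing G^L(c); this is a structural sketch, not a complete implementation:

```python
def sub_imitate(G_L, X, Y, C, c, P_O, *, find_sep, find_min_sep, csi_id,
                solve_policy, domains, context_vars):
    """One call of SUBIMITATE for the context C = c (c is a dict)."""
    G_c = context_specific_dag(G_L, c)        # assumed helper: drop edges absent given C = c
    Z_c = find_sep(G_c, X, Y)                 # lines 3-6: second condition of Prop. 4.2
    if Z_c is not None:
        return ("backdoor", Z_c, dict(c))
    S = find_min_sep(G_c, X, Y, must_contain=set(C))   # lines 7-8
    if S is not None:
        S_c = S - set(C)
        if csi_id(G_L, S_c, X, c):            # line 9: is P(S_c | do(x), C = c) identifiable?
            pi_c = solve_policy(S_c, c, P_O)  # line 10: solve P(S_c|do(pi),c) = P(S_c|c)
            if pi_c is not None:
                return ("surrogate", pi_c, dict(c))
    remaining = [V for V in context_vars if V not in C]
    if not remaining:
        return None                           # lines 14-15: FAIL
    V = remaining[0]                          # line 17: pick a fresh context variable
    pieces = {}
    for v in domains[V]:                      # lines 18-24
        res = sub_imitate(G_L, X, Y, C + [V], {**c, V: v}, P_O,
                          find_sep=find_sep, find_min_sep=find_min_sep,
                          csi_id=csi_id, solve_policy=solve_policy,
                          domains=domains, context_vars=context_vars)
        if res is None:
            return None
        pieces[v] = res
    return ("dispatch", V, pieces)            # pi_c = sum_v 1{(Pa_Pi)_V = v} pi_{c,v}
```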
In light of Proposition 4.2, we propose an algorithmic approach for causal imitation learning, summarized as Algorithm 2. This algorithm calls a recursive subroutine, SUBIMITATE, abbreviated SI within the pseudo-code. It is noteworthy that Proposition 4.2 guarantees imitability if its conditions are met for any arbitrary subset of C(L) ∩ Pa_Π. As we shall see, Algorithm 2 utilizes a recursive approach for building such a subset so as to circumvent the need to test all of the possibly exponentially many subsets. The subroutine SUBIMITATE is initiated with an empty set (C = ∅) of considered context variables at the first iteration. At each iteration, the realizations of C are treated separately. For each such realization c, if the second condition of Proposition 4.2 is met through a set Z_c, then P(X|Z_c, C = c) is returned as the context-specific imitating policy (lines 3-6). Otherwise, the search for a context-specific surrogate begins. We utilize the FINDMINSEP algorithm of [40] to identify a minimal separating set for X and Y among those that necessarily include C; S_c is this set minus C (lines 7-8). We then use the identification algorithm of [39] under CSI relations to identify the effect of an arbitrary policy on S_c, conditioned on the context c. This algorithm is built upon CSI-calculus, which subsumes do-calculus (Figure 3a is an example where do-calculus fails to identify P(s|do(x)), whereas CSI-calculus provides an identification formula, Equation (3)). Next, if the linear system of equations P(S_c|do(π_c), c) = P(S_c|c) has a solution, then this solution is returned as the optimal policy (lines 9-12). Otherwise, an arbitrary variable V ∈ (C(L) ∩ Pa_Π) \ C is added to the considered context variables, and the search for context-specific policies proceeds while taking the realizations of V into account (lines 17-24). If no variables are left to add to the set of context variables (i.e., C = C(L) ∩ Pa_Π) and neither of the conditions of Proposition 4.2 is met for a realization of C, then the algorithm stops with a failure (lines 14-15). Otherwise, an imitating policy π* is returned. We finally note that if computational costs matter, the CSI-ID function of line 9 can be replaced by the ID algorithm of [33]. Further, the minimal separating set of line 8 might not be unique, in which case all such sets can be tried.

Theorem 4.3. Given an LDAG G^L, a policy space Π, and an observational distribution P(O), if Algorithm 2 returns a policy π*, then π* is an optimal imitating policy for P(y) w.r.t. ⟨G^L, Π, P(O)⟩. That is, Algorithm 2 is sound (see Section C for an example where Algorithm 2 is not complete).

5 Experiments

Our experimental evaluation is organized into two parts. In the first part, we address the decision problem pertaining to imitability: we evaluate the gain resulting from accounting for CSIs in rendering previously non-imitable instances imitable. In particular, we assess classic imitability vs. imitability under CSIs for randomly generated graphs. In the second part, we compare the performance of Algorithm 2 against baseline algorithms on synthetic datasets (see Section D for further details of our experimental setup). A Python implementation is accessible at https://github.com/SinaAkbarii/causal-imitation-learning/.

5.1 Evaluating imitability

We sampled random graphs with n vertices and maximum degree n/10 uniformly at random. Each variable was assumed to be latent with probability 1/6. We chose 3 random context variables
such that the graphical constraint Pa(C(L)) ⊆ C(L) is satisfied. Labels on the edges were sampled with probability 0.5. We then evaluated the graphical criterion of classic imitability (the existence of a π-backdoor admissible set) and of imitability with CSIs (Corollary 3.7). The results are depicted in Figure 4, where each point in the plot is an average over 100 sampled graphs. As seen in this figure, taking into account only a handful of CSI relations (in particular, 3 context variables among hundreds of variables) can significantly increase the fraction of imitable instances.

[Figure 4: The fraction of imitable instances (in the classical sense) vs. those that are imitable considering CSIs, plotted against the number of vertices (20 to 160).]

5.2 Performance evaluation

In this section, we considered the graph of Figure 3b as a generalization of the economic model of Figure 3a, where the reward variable Y can be a more complex function. As discussed earlier, this graph has neither a π-backdoor admissible set nor an imitation surrogate. However, given the identifiability of P(S|do(π), C = 0), Algorithm 2 can achieve an optimal policy. We compared the performance of the policy returned by Algorithm 2 against two baseline algorithms: Naive algorithm 1, which mimics only the observed distribution of the action variable (by choosing π(X) = P(X)), and Naive algorithm 2, which takes the causal ancestors of X into account, designing the policy π(X|T) = P(X|T). Naive algorithm 2 can be thought of as feature selection followed by a behavioral cloning approach. The goal of this experiment was to demonstrate the infeasibility of imitation learning without taking CSIs into account. A model with binary observable variables and a ternary reward was generated. Let π_ALG represent the policy that an algorithm would have learned with an infinite number of samples, and π̂_ALG the policy it learns with the given finite sample size.

Table 1: Results pertaining to the model of Figure 3a.

| Metric | Expert | Naive 1 | Naive 2 | Algorithm 2 |
|---|---|---|---|---|
| E[Y] | 1.367 | 1.194 | 1.193 | 1.358 |
| D_KL(P(Y) ‖ P(Y|do(π_ALG))) | 0 | 0.0217 | 0.0219 | 0.0007 |
| D_KL(π_ALG(X|T=0) ‖ π̂_ALG(X|T=0)) | NA | 2.3×10⁻⁵ | 4.4×10⁻⁶ | 1.3×10⁻³ |
| D_KL(π_ALG(X|T=1) ‖ π̂_ALG(X|T=1)) | NA | 2.3×10⁻⁵ | 4.8×10⁻⁴ | 1.3×10⁻³ |

As can be seen in Table 1, Algorithm 2 was able to match the expert policy both in expected reward and in the KL divergence of the reward distribution. The naive algorithms, on the other hand, failed to get close to the reward distribution of the expert. Since the algorithms were fed with finite observational samples, the KL divergence of the estimated policies from the nominal policies is also reported. Notably, based on the reported measures, the undesirable performance of the naive algorithms does not stem from estimation errors.

6 Concluding remarks

We considered the causal imitation learning problem when accounting for context-specific independence relations. We proved that, in contrast to the classic problem, which is equivalent to a single d-separation test, the decision problem of imitability under CSIs is NP-hard. We established a link between these two problems: in particular, we proved that imitability under CSIs is equivalent to several instances of the classic imitability problem for a certain class of context variables. We showed that the often-overlooked notion of CSI can be a worthwhile tool in causal imitation learning, an example of approaching a fundamental AI problem from a causal perspective. We note that while taking even a few CSI relations into account can yield significant gains, the theory of CSIs is not yet well developed. In particular, there exists no complete algorithm for causal identification under CSIs.
Further research on the theory of CSI relations could yield considerable benefits in various domains where such relations are present.

Acknowledgements

This research was in part supported by the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40_180545, and Swiss SNF project 200021_204355/1.

References

[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.
[2] Silvia Acid and Luis M. De Campos. An algorithm for finding minimum d-separating sets in belief networks. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, UAI'96, pages 3–10, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[3] Sina Akbari, Jalal Etesami, and Negar Kiyavash. Minimum cost intervention design for causal effect identification. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 258–289. PMLR, 2022.
[4] Sina Akbari, Luca Ganassali, and Negar Kiyavash. Causal discovery via monotone triangular transport maps. In NeurIPS 2023 Workshop Optimal Transport and Machine Learning, 2023.
[5] Sina Akbari, Fateme Jamshidi, Ehsan Mokhtarian, Matthew James Vowels, Jalal Etesami, and Negar Kiyavash. Causal effect identification in uncertain causal networks. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[6] Sina Akbari, Ehsan Mokhtarian, Amir Emad Ghassami, and Negar Kiyavash. Recursive causal structure learning in the presence of latent variables and selection bias. Advances in Neural Information Processing Systems, 34:10119–10130, 2021.
[7] Mayank Bansal, Alex Krizhevsky, and Abhijit Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018.
[8] Craig Boutilier, Nir Friedman, Moises Goldszmidt, and Daphne Koller. Context-specific independence in Bayesian networks. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, UAI'96, pages 115–123, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[9] Felipe Codevilla, Eder Santana, Antonio M. López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9329–9338, 2019.
[10] Diego Colombo, Marloes H. Maathuis, Markus Kalisch, and Thomas S. Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, pages 294–321, 2012.
[11] Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.
[12] Pim De Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. Advances in Neural Information Processing Systems, 32, 2019.
[13] Jalal Etesami and Philipp Geiger. Causal transfer for imitation learning and decision making under sensor-shift. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10118–10125, 2020.
[14] Dan Geiger, Thomas Verma, and Judea Pearl. d-separation: From theorems to algorithms. In Machine Intelligence and Pattern Recognition, volume 10, pages 139–148. Elsevier, 1990.
[15] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29, 2016.
[16] Antti Hyttinen, Johan Pensar, Juha Kontinen, and Jukka Corander. Structure learning for Bayesian networks over labeled DAGs. In International Conference on Probabilistic Graphical Models, pages 133–144. PMLR, 2018.
[17] Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in Atari. Advances in Neural Information Processing Systems, 31, 2018.
[18] Fateme Jamshidi, Jalal Etesami, and Negar Kiyavash. Confounded budgeted causal bandits. In 3rd Conference on Causal Learning and Reasoning, 2024.
[19] Fateme Jamshidi, Luca Ganassali, and Negar Kiyavash. On sample complexity of conditional independence testing with von Mises estimator with application to causal discovery. arXiv preprint arXiv:2310.13553, 2023.
[20] Richard M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103. Springer, 1972.
[21] Mikko Koivisto and Kismat Sood. Computational aspects of Bayesian partition models. In Proceedings of the 22nd International Conference on Machine Learning, pages 433–440, 2005.
[22] Alex Kuefler, Jeremy Morton, Tim Wheeler, and Mykel Kochenderfer. Imitating driver behavior with generative adversarial networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 204–211. IEEE, 2017.
[23] Ehsan Mokhtarian, Fateme Jamshidi, Jalal Etesami, and Negar Kiyavash. Causal effect identification with context-specific independence relations of control variables. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pages 11237–11246, 2022.
[24] Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, and Yann Cun. Off-road obstacle avoidance through end-to-end learning. Advances in Neural Information Processing Systems, 18, 2005.
[25] Katharina Mülling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.
[26] Andrew Y. Ng, Stuart Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000.
[27] Yunpeng Pan, Ching-An Cheng, Kamil Saigol, Keuntaek Lee, Xinyan Yan, Evangelos A. Theodorou, and Byron Boots. Imitation learning for agile autonomous driving. The International Journal of Robotics Research, 39(2-3):286–302, 2020.
[28] Judea Pearl. Causality. Cambridge University Press, 2009.
[29] Johan Pensar, Henrik Nyman, Timo Koski, and Jukka Corander. Labeled directed acyclic graphs: a generalization of context-specific independence in directed graphical models. Data Mining and Knowledge Discovery, 29(2):503–533, 2015.
[30] Dean A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. Advances in Neural Information Processing Systems, 1, 1988.
[31] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668. JMLR Workshop and Conference Proceedings, 2010.
[32] Yujia Shen, Arthur Choi, and Adnan Darwiche. A new perspective on learning context-specific independence. In International Conference on Probabilistic Graphical Models, pages 425–436. PMLR, 2020.
[33] Ilya Shpitser and Judea Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. eScholarship, University of California, 2006.
[34] David Silver, James Bagnell, and Anthony Stentz.
High performance outdoor navigation from overhead data using imitation learning. Robotics: Science and Systems IV, Zurich, Switzerland, 1, 2008.
[35] Peter Spirtes, Clark N. Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.
[36] Gokul Swamy, Sanjiban Choudhury, Drew Bagnell, and Steven Wu. Causal imitation learning under temporally correlated noise. In International Conference on Machine Learning, pages 20877–20890. PMLR, 2022.
[37] Umar Syed and Robert E. Schapire. A game-theoretic approach to apprenticeship learning. Advances in Neural Information Processing Systems, 20, 2007.
[38] Jin Tian, Azaria Paz, and Judea Pearl. Finding minimal d-separators, 1998.
[39] Santtu Tikka, Antti Hyttinen, and Juha Karvanen. Identifying causal effects via context-specific independence relations. Advances in Neural Information Processing Systems, 32, 2019.
[40] Benito van der Zander, Maciej Liskiewicz, and Johannes Textor. Constructing separators and adjustment sets in ancestral graphs. In CI@UAI, pages 11–24, 2014.
[41] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[42] Bernard Widrow. Pattern-recognizing control systems. Computer and Information Sciences, 1964.
[43] Junzhe Zhang, Daniel Kumor, and Elias Bareinboim. Causal imitation learning with unobserved confounders. Advances in Neural Information Processing Systems, 33:12263–12274, 2020.
[44] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI'08, pages 1433–1438. AAAI Press, 2008.

Proofs of the main results are given in Section A. The details of the FINDSEP subroutine are presented in Section B. Section C provides an example where Algorithm 2 is not complete, and Section D includes the details of our experimental setup.

A Technical proofs

Lemma 3.3. Given a latent DAG G and a policy space Π, if there exists a π-backdoor admissible set w.r.t. ⟨G, Π⟩, then Z = An({X, Y}) ∩ Pa_Π is a π-backdoor admissible set w.r.t. ⟨G, Π⟩.

Proof. We first claim that for any set W ⊆ {X, Y}, the sets of ancestors of W in G and in $G_{\underline{X}}$ coincide, i.e., An(W)_G = An(W)_{G_{\underline{X}}}. First, note that since $G_{\underline{X}}$ is a subgraph of G, An(W)_{G_{\underline{X}}} ⊆ An(W)_G. For the other direction, let T ∈ An(W)_G be an arbitrary vertex. If T ∈ An(X), then T ∈ An(W)_{G_{\underline{X}}}. Otherwise, there exists a directed path from T to Y that does not pass through X. The same directed path exists in $G_{\underline{X}}$, and as a result, T ∈ An(W)_{G_{\underline{X}}}. Therefore, An(W)_G ⊆ An(W)_{G_{\underline{X}}}. The rest of the proof follows from an application of Lemma 3.4 of [40] in $G_{\underline{X}}$, with I = ∅ and R = Pa_Π.

Lemma 3.6. Given an LDAG G^L and a policy space Π, let Y be an arbitrary variable in G^L. P(y) is imitable w.r.t. ⟨G^L, Π⟩ only if P(y) is imitable w.r.t. ⟨$G^{L_w}_w$, Π⟩ for every realization w ∈ D_W of every subset of variables W ⊆ C(L).

Proof. Assume the contrary, that there exists a subset of variables W ⊆ V \ {Y} such that for an assignment w_0 ∈ D_W, P(y) is not imitable w.r.t. ⟨$G^{L_{w_0}}_{w_0}$, Π⟩. It suffices to show that P(y) is not imitable w.r.t. ⟨G^L, Π⟩. Since P(y) is not imitable w.r.t. ⟨$G^{L_{w_0}}_{w_0}$, Π⟩, there exists an SCM M' ∈ M⟨$G^{L_{w_0}}_{w_0}$⟩ such that P^{M'}(y|do(π)) ≠ P^{M'}(y) for every π ∈ Π. Define

$$\epsilon := \min_{\pi\in\Pi} \left| P^{M'}(y) - P^{M'}(y\mid do(\pi)) \right|. \qquad (4)$$

Also, define δ := min{ε/4, 1/2}. Note that ε > 0, and 0 < δ < 1.
Next, construct an SCM M over V compatible with $G^{L_{w_0}}_{w_0}$ as follows. For the variables W, P^M(W = w_0) = 1 − δ, and the rest of the probability mass is uniformly distributed over the other realizations (summing up to δ), so that P^M(W ≠ w_0) = δ. For the variables V_i ∈ V \ W, P^M(V_i|Pa(V_i)) = P^{M'}(V_i|Pa(V_i)), i.e., they follow the same law as in the model M'. Note that by construction, the variables W are isolated vertices in $G^{L_{w_0}}_{w_0}$, and therefore independent of every other variable in both M' and M. Therefore,

$$P^M(y) = \sum_{w\in D_W} P^M(y\mid w)\,P^M(w) = \sum_{w\in D_W} P^{M'}(y\mid w)\,P^M(w) = \sum_{w\in D_W} P^{M'}(y)\,P^M(w) = P^{M'}(y). \qquad (6)$$

Moreover, for any policy π_1 ∈ Π depending on the values of Z ⊆ Pa_Π, define a policy π_2 ∈ Π depending on Z' = Z \ W as π_2(X|Z') = (1 − δ)·π_1(X|Z', W = w_0) + δ·π_1(X|Z', W ≠ w_0). Note that Z' ∩ W = ∅; that is, π_2 is independent of the values of W. We can write

$$\begin{aligned} P^M(y\mid do(\pi_1)) &= \sum_{w\in D_W} P^M(y\mid do(\pi_1), w)\,P^M(w\mid do(\pi_1))\\ &= (1-\delta)\,P^M(y\mid do(\pi_1), W=w_0) + \delta\,P^M(y\mid do(\pi_1), W\neq w_0)\\ &= \sum_{x,z'} (1-\delta)\,P^M(y\mid do(x), z', W=w_0)\,\pi_1(x\mid z', W=w_0)\,P^M(z'\mid W=w_0)\\ &\qquad + \sum_{x,z'} \delta\,P^M(y\mid do(x), z', W\neq w_0)\,\pi_1(x\mid z', W\neq w_0)\,P^M(z'\mid W\neq w_0)\\ &= \sum_{x,z'} (1-\delta)\,P^{M'}(y\mid do(x), z')\,\pi_1(x\mid z', W=w_0)\,P^{M'}(z')\\ &\qquad + \sum_{x,z'} \delta\,P^{M'}(y\mid do(x), z')\,\pi_1(x\mid z', W\neq w_0)\,P^{M'}(z')\\ &= \sum_{x,z'} P^{M'}(y\mid do(x), z')\,\pi_2(x\mid z')\,P^{M'}(z') = P^{M'}(y\mid do(\pi_2)). \qquad (7) \end{aligned}$$

That is, for any policy π_1 under the model M, there is a policy π_2 that results in the same reward distribution under the model M'. Therefore, combining Equations (6) and (7),

$$\min_{\pi\in\Pi} \left| P^M(y) - P^M(y\mid do(\pi)) \right| = \min_{\pi\in\Pi} \left| P^{M'}(y) - P^{M'}(y\mid do(\pi)) \right|,$$

although the minimum might be attained under different policies. Recalling Equation (4), we get that under the model M and for any policy π ∈ Π,

$$\epsilon \le \left| P^M(y) - P^M(y\mid do(\pi)) \right|. \qquad (8)$$

Now we construct yet another SCM M_δ over V as follows. The variables W are distributed as in the model M, i.e., P^{M_δ}(W) = P^M(W). Moreover, for each variable V ∈ V \ W, set V = V^M whenever the realization of Pa(V) ∩ W equals (w_0)_{Pa(V)∩W}, where V^M denotes the same variable V under the SCM M; otherwise, the distribution of V is uniform over D_V. Note that by the definition of M_δ, the values of W are assigned independently, and the values of all other variables (V \ W) depend only on their parents, maintaining all the CSI relations. As a result, M_δ is compatible with G^L. Also, by construction,

$$P^{M_\delta}(\mathbf{O}\setminus\mathbf{W} \mid \mathbf{W}=w_0) = P^M(\mathbf{O}\setminus\mathbf{W}), \qquad P^{M_\delta}(\mathbf{O}\setminus\mathbf{W} \mid do(x), \mathbf{W}=w_0) = P^M(\mathbf{O}\setminus\mathbf{W} \mid do(x)). \qquad (9)$$

Next, we write

$$P^{M_\delta}(y) = \sum_{w\in D_W} P^{M_\delta}(y\mid w)\,P^{M_\delta}(w) = (1-\delta)\,P^{M_\delta}(y\mid W=w_0) + \delta\,P^{M_\delta}(y\mid W\neq w_0).$$

The second term on the right-hand side above is a positive number not larger than δ. Moreover, by the construction of M_δ, we have P^{M_δ}(y|W = w_0) = P^M(y). Therefore, we get

$$(1-\delta)\,P^M(y) \;\le\; P^{M_\delta}(y) \;\le\; (1-\delta)\,P^M(y) + \delta. \qquad (10)$$

On the other hand, for an arbitrary policy π ∈ Π depending on the values of Z ⊆ Pa_Π,

$$\begin{aligned} P^{M_\delta}(y\mid do(\pi)) &= P^{M_\delta}(y\mid do(\pi), W=w_0)\,P^{M_\delta}(W=w_0\mid do(\pi)) + P^{M_\delta}(y\mid do(\pi), W\neq w_0)\,P^{M_\delta}(W\neq w_0\mid do(\pi))\\ &= P^{M_\delta}(y\mid do(\pi), W=w_0)\,P^M(W=w_0) + P^{M_\delta}(y\mid do(\pi), W\neq w_0)\,P^M(W\neq w_0)\\ &= \sum_{x,z} P^{M_\delta}(y\mid do(x), z, W=w_0)\,\pi(x\mid z, w_0)\,P^{M_\delta}(z\mid W=w_0)\,P^M(W=w_0) + \delta\,P^{M_\delta}(y\mid do(\pi), W\neq w_0)\\ &= \sum_{x,z} P^M(y\mid do(x), z, W=w_0)\,\pi(x\mid z, w_0)\,P^M(z\mid W=w_0)\,P^M(W=w_0) + \delta\,P^{M_\delta}(y\mid do(\pi), W\neq w_0)\\ &= P^M(y\mid do(\pi), W=w_0)\,P^M(W=w_0) + \delta\,P^{M_\delta}(y\mid do(\pi), W\neq w_0)\\ &= P^M(y\mid do(\pi)) - \delta\left(P^M(y\mid do(\pi), W\neq w_0) - P^{M_\delta}(y\mid do(\pi), W\neq w_0)\right), \end{aligned}$$

and as a result,

$$P^M(y\mid do(\pi)) - \delta \;\le\; P^{M_\delta}(y\mid do(\pi)) \;\le\; P^M(y\mid do(\pi)) + \delta. \qquad (11)$$
Combining Equations (10) and (11), we get

$$P^M(y) - P^M(y\mid do(\pi)) - \delta P^M(y) - \delta \;\le\; P^{M_\delta}(y) - P^{M_\delta}(y\mid do(\pi)) \;\le\; P^M(y) - P^M(y\mid do(\pi)) - \delta P^M(y) + \delta,$$

and consequently,

$$P^M(y) - P^M(y\mid do(\pi)) - 2\delta \;\le\; P^{M_\delta}(y) - P^{M_\delta}(y\mid do(\pi)) \;\le\; P^M(y) - P^M(y\mid do(\pi)) + 2\delta,$$

which combined with Equation (8) results in the following inequality:

$$\left| P^{M_\delta}(y) - P^{M_\delta}(y\mid do(\pi)) \right| \;\ge\; \epsilon - 2\delta \;\ge\; \frac{\epsilon}{2} \;>\; 0.$$

To sum up, we showed that there exists a model M_δ ∈ M⟨G^L⟩ such that for every π ∈ Π, P^{M_δ}(y|do(π)) ≠ P^{M_δ}(y), which is in contradiction with the imitability assumption. Hence, P(y) is imitable w.r.t. ⟨G^L, Π⟩ only if P(y) is imitable w.r.t. ⟨$G^{L_w}_w$, Π⟩ for every w ∈ D_W of any arbitrary subset of variables W ⊆ C(L).

Theorem 3.8. Given an LDAG G^L and a policy space Π, deciding the imitability of P(y) w.r.t. ⟨G^L, Π⟩ is NP-hard.

[Figure 5: Reduction from 3-SAT to imitability (the clause chain S_0 → S_1 → ⋯ → S_m).]

Proof. Our proof exploits ideas similar to the proof of hardness of causal identification under CSIs by [39]. We show a polynomial-time reduction from the 3-SAT problem, a well-known NP-hard problem [20], to the imitability problem. Let an instance of 3-SAT be defined as follows. A formula F in conjunctive normal form with clauses S_1, ..., S_m is given, where each clause S_i has 3 literals from the set of binary variables W_1, ..., W_k and their negations ¬W_1, ..., ¬W_k. The decision problem of 3-SAT is to determine whether there is a realization of the binary variables W_1, ..., W_k that satisfies the given formula F, i.e., makes F true. We reduce this problem to an instance of the imitability problem with CSIs as follows. Define an LDAG with one vertex corresponding to each clause S_i for 1 ≤ i ≤ m, one vertex W which corresponds to the set of binary variables {W_1, ..., W_k}, and three auxiliary vertices S_0, X, and Y. The vertex W is a parent of every S_i for 1 ≤ i ≤ m. There is an edge from S_{i−1} to S_i for 1 ≤ i ≤ m. Also, S_0 is a parent of X, and X and S_m are parents of Y. As for the labels, the label on each edge S_{i−1} → S_i states that this edge is absent if the assignment of W does not satisfy the clause S_i. Constructing this LDAG clearly takes polynomial time, as it only requires time in the order of the number of variables {W_1, ..., W_k} and the number of clauses {S_1, ..., S_m}. We claim that P(y) is imitable in this LDAG if and only if the 3-SAT instance is unsatisfiable.

Only if part. Suppose that the 3-SAT instance is satisfiable. Then there exists a realization w* of the binary variables {W_1, ..., W_k} that satisfies the formula F. Define a model M over the variables of the LDAG as follows. W takes the value w* with probability 0.6, i.e., P^M(W = w*) = 0.6, and the rest of the probability mass is uniformly distributed over the other realizations of W (each with probability 0.4/(2^k − 1)). S_0 is a Bernoulli random variable with parameter 1/2. For 1 ≤ i ≤ m, S_i is equal to S_{i−1} if the edge S_{i−1} → S_i is present (i.e., the clause S_i is satisfied by the realization of W), and S_i = 0 otherwise. X is equal to S_0, and Y is defined as the XNOR of X and S_m, i.e., Y = 1{X = S_m}. Note that the model M is compatible with the LDAG defined above. We claim that under this model, for every policy π ∈ Π, P^M(y|do(π)) ≠ P^M(y); that is, P(y) is not imitable. Let D^s_W and D^u_W be the partition of the domain of W into the realizations that satisfy and do not satisfy the formula F, respectively. Note that w* ∈ D^s_W. For any realization w ∈ D^s_W, the observed value of Y in M is always equal to 1 (because in that case Y = 1{S_0 = S_0} = 1), i.e., P^M(Y = 1|w) = 1. On the other hand, for any realization w ∈ D^u_W, there exists a clause S_i which is not satisfied, and therefore S_m = 0 and Y = 1 − X. Therefore, P^M(Y = 1|w) = P^M(X = 0) = 1/2.
From the total probability law,

$$P^M(Y=1) = \sum_{w\in D^s_W} P(Y=1\mid w)P(w) + \sum_{w\in D^u_W} P(Y=1\mid w)P(w) = \sum_{w\in D^s_W} P(w) + \frac{1}{2}\sum_{w\in D^u_W} P(w) = \frac{1}{2}\Big(\sum_{w\in D^s_W} P(w) + \sum_{w\in D^u_W} P(w)\Big) + \frac{1}{2}\sum_{w\in D^s_W} P(w) \;\ge\; \frac{1}{2} + \frac{1}{2}P(w^*) = 0.8, \qquad (12)$$

where we dropped the superscript M for better readability. Now consider an arbitrary policy π ∈ Π. For any realization w ∈ D^s_W, Y = 1{X = S_0}, where S_0 is a Bernoulli variable with parameter 1/2 independent of X. As a result, P^M(Y = 1|do(π), w) = P^M(S_0 = X) = 1/2. On the other hand, for any realization w ∈ D^u_W, there exists a clause S_i which is not satisfied, and therefore S_m = 0 and Y = 1 − X. Therefore, P^M(Y = 1|do(π), w) = P^M(X = 0|do(π)) = π(X = 0). Using the total probability law again, and noting that P^M(w|do(π)) = P^M(w),

$$P^M(Y=1\mid do(\pi)) = \sum_{w\in D^s_W} P(Y=1\mid do(\pi), w)P(w) + \sum_{w\in D^u_W} P(Y=1\mid do(\pi), w)P(w) = \frac{1}{2}\sum_{w\in D^s_W} P(w) + \pi(X=0)\sum_{w\in D^u_W} P(w),$$

where we again dropped the superscript M. Now define q = Σ_{w∈D^s_W} P^M(w) = 1 − Σ_{w∈D^u_W} P^M(w), where we know q ≥ 0.6 since w* ∈ D^s_W. From the equation above we have

$$P^M(Y=1\mid do(\pi)) = \frac{1}{2}q + \pi(X=0)(1-q), \qquad (13)$$

where 0.6 ≤ q ≤ 1 and 0 ≤ π(X = 0) ≤ 1. We claim that P(Y = 1|do(π)) ≤ 0.7. Consider two cases. If π(X = 0) ≤ 1/2, then from Equation (13), P(Y = 1|do(π)) ≤ (1/2)q + (1/2)(1 − q) = 0.5 ≤ 0.7. Otherwise, assume π(X = 0) > 1/2. Rewriting Equation (13), P(Y = 1|do(π)) = q(1/2 − π(X = 0)) + π(X = 0) ≤ 0.6(1/2 − π(X = 0)) + π(X = 0) = 0.3 + 0.4·π(X = 0) ≤ 0.7. We showed that for any arbitrary policy π ∈ Π,

$$P^M(Y=1\mid do(\pi)) \le 0.7. \qquad (14)$$

Comparing this to Equation (12) completes the proof of this part.

If part. Let I_X be the intervention vertex corresponding to X, i.e., I_X = 0 and I_X = 1 indicate that X is passively observed (determined by its parents) and actively intervened upon (independent of its parents), respectively (refer to Figure 5). Assume that the 3-SAT instance is unsatisfiable. We claim that π(x) = P(x) is an imitating policy for P(y). Since the 3-SAT instance is unsatisfiable, for every context w ∈ D_W there exists some 1 ≤ i ≤ m such that the edge S_{i−1} → S_i is absent. Hence we have I_X ⊥ Y | X, W = w in G for every w ∈ D_W, and therefore I_X ⊥ Y | X, W in G. Moreover, since the d-separation I_X ⊥_G W holds, by contraction we have I_X ⊥ {Y, W} | X in G. As a result, I_X ⊥ Y | X in G. Therefore,

$$P(y\mid do(\pi)) = \sum_{x\in D_X} P(y\mid do(x))\,\pi(x) = \sum_{x\in D_X} P(y\mid x, I_X=1)\,P(x) = \sum_{x\in D_X} P(y\mid x, I_X=0)\,P(x) = \sum_{x\in D_X} P(y\mid x)\,P(x) = P(y).$$

Proposition 3.9. Given an LDAG G^L where Pa(C(L)) ⊆ C(L) and a policy space Π, let Y be an arbitrary variable in G^L. P(y) is imitable w.r.t. ⟨G^L, Π⟩ if and only if P(y) is imitable w.r.t. ⟨G_c, Π⟩ for every c ∈ D_{C(L)}.

Proof. It suffices to show that for an arbitrary M ∈ M⟨G^L⟩, there exists a policy π ∈ Π uniquely computable from P(O) such that P^M(y|do(π)) = P^M(y). For ease of notation, let C := C(L). For every c ∈ D_C, construct an SCM M_c over V by setting C = c and replacing every occurrence of the variables C by the constant value c in the equations of M. Then M_c is compatible with G_c, and P^{M_c}(O) = P^M(O|do(c)). By assumption, we know that for every c ∈ D_C there exists a policy π_c ∈ Π uniquely computable from P(O) such that

$$P^{M_c}(y\mid do(\pi_c)) = P^{M_c}(y). \qquad (15)$$

Suppose π_c relies on the values of Z_c ⊆ Pa_Π. We can write

$$\begin{aligned} P^{M_c}(y\mid do(\pi_c)) &= P^M(y\mid do(\pi_c), do(c))\\ &= \sum_{x\in D_X,\, z_c\in D_{Z_c}} P^M(y\mid do(x), do(c), z_c)\,P^M(x\mid do(\pi_c), do(c), z_c)\,P^M(z_c\mid do(c))\\ &\overset{(a)}{=} \sum_{x\in D_X,\, z_c\in D_{Z_c}} P^M(y\mid do(x), c, z_c)\,P^M(x\mid do(\pi_c), c, z_c)\,P^M(z_c\mid c)\\ &= P^M(y\mid do(\pi_c), c), \qquad (16) \end{aligned}$$

where the first and third parts of (a) hold since Y ⊥ C | X, Z_c in $G_{\overline{X}\overline{C}}$ and Pa(C) ⊆ C. For the second part, let M_{π_c} be the model obtained by substituting the mechanism of X given its parents with π_c in the equations of M.
Proposition 3.9. Given an LDAG $\mathcal{G}^L$ where $Pa(C(L)) \subseteq C(L)$ and a policy space $\Pi$, let $Y$ be an arbitrary variable in $\mathcal{G}^L$. $P(y)$ is imitable w.r.t. $\langle \mathcal{G}^L, \Pi \rangle$ if and only if $P(y)$ is imitable w.r.t. $\langle \mathcal{G}_c, \Pi \rangle$ for every $c \in D_{C(L)}$.

Proof. It suffices to show that for an arbitrary $M \in \mathcal{M}\langle \mathcal{G}^L \rangle$, there exists a policy $\pi \in \Pi$, uniquely computable from $P(O)$, such that $P^M(y \mid do(\pi)) = P^M(y)$. For ease of notation, let $C := C(L)$. For every $c \in D_C$, construct an SCM $M_c$ over $V$ by setting $C = c$ and replacing every occurrence of the variables $C$ by the constant value $c$ in the equations of $M$. Then $M_c$ is compatible with $\mathcal{G}_c$ and $P^{M_c}(O) = P^M(O \mid do(c))$. By assumption, we know that for every $c \in D_C$, there exists a policy $\pi_c \in \Pi$, uniquely computable from $P(O)$, such that
$$P^{M_c}(y \mid do(\pi_c)) = P^{M_c}(y). \qquad (15)$$
Suppose $\pi_c$ relies on the values of $Z_c \subseteq Pa_\Pi$. We can write
$$P^{M_c}(y \mid do(\pi_c)) = P^M(y \mid do(\pi_c), do(c)) = \sum_{x \in D_X, z_c \in D_{Z_c}} P^M(y \mid do(x), do(c), z_c)\, P^M(x \mid do(\pi_c), do(c), z_c)\, P^M(z_c \mid do(c))$$
$$\overset{(a)}{=} \sum_{x \in D_X, z_c \in D_{Z_c}} P^M(y \mid do(x), c, z_c)\, P^M(x \mid do(\pi_c), c, z_c)\, P^M(z_c \mid c) = P^M(y \mid do(\pi_c), c), \qquad (16)$$
where the first and third parts of $(a)$ hold since $Y \perp C \mid X, Z_c$ in $\mathcal{G}_{\overline{X}\,\overline{C}}$ and $Pa(C) \subseteq C$. For the second part, let $M_{\pi_c}$ be the model obtained by substituting the mechanism of $X$ given its parents with $\pi_c$ in the equations of $M$. Then, we get
$$P^M(x \mid do(\pi_c), do(c), z_c) = P^{M_{\pi_c}}(x \mid do(c), z_c) = P^{M_{\pi_c}}(x \mid c, z_c) = P^M(x \mid do(\pi_c), c, z_c),$$
where the middle equality, $P^{M_{\pi_c}}(x \mid do(c), z_c) = P^{M_{\pi_c}}(x \mid c, z_c)$, holds since $Pa(C) \subseteq C$. Furthermore,
$$P^{M_c}(y) = P^M(y \mid do(c)) = P^M(y \mid c). \qquad (17)$$
Combining Equations (15), (16), and (17), we get the following for every $c \in D_C$:
$$P^M(y \mid do(\pi_c), c) = P^M(y \mid c). \qquad (18)$$
Next, define the policy $\pi^*(x \mid Pa_\Pi) := \sum_{c \in D_C} \mathbb{1}\{(Pa_\Pi)_C = c\}\, \pi_c$, where $\mathbb{1}\{(Pa_\Pi)_C = c\}$ is the indicator function that equals $1$ when $C = c$ and $0$ otherwise. Now, we can write
$$P^M(y \mid do(\pi^*)) = \sum_{c \in D_C} P^M(y \mid do(\pi^*), c)\, P^M(c \mid do(\pi^*)) = \sum_{c \in D_C} P^M(y \mid do(\pi_c), c)\, P^M(c) = \sum_{c \in D_C} P^M(y \mid c)\, P^M(c) = P^M(y),$$
where the last line holds due to Equation (18) and the fact that $Pa(C) \subseteq C$. We proved that there exists a policy $\pi^* \in \Pi$, uniquely computable from $P(O)$, such that $P^M(y \mid do(\pi^*)) = P^M(y)$. Hence, $P(y)$ is imitable w.r.t. $\langle \mathcal{G}^L, \Pi \rangle$.

Proposition 4.2. Given an LDAG $\mathcal{G}^L$ and a policy space $\Pi$, let $C$ be a subset of $Pa_\Pi \cap C(L)$. If for every realization $c \in D_C$ at least one of the following holds, then $P(y)$ is imitable w.r.t. $\langle \mathcal{G}^L, \Pi, P(O) \rangle$:
• There exists a subset $S_c$ such that $X \perp Y \mid \{S_c, C\}$ in $\mathcal{G}^{L(c)}_\Pi$, and $P(S_c \mid do(\pi), C = c) = P(S_c \mid C = c)$ has a solution $\pi_c \in \Pi$.
• There exists $Z_c \subseteq Pa_\Pi$ such that $X \perp Y \mid \{Z_c, C\}$ in $\mathcal{G}^{L(c)}_{\underline{X}}$.

Proof. Let $M \in \mathcal{M}\langle \mathcal{G}^L \rangle$ with $P^M(O) = P(O)$ (i.e., $M$ induces the observational distribution $P(O)$) be an arbitrary model. It suffices to show that there is a policy $\pi \in \Pi$, uniquely computable from $P(O)$, such that $P^M(y \mid do(\pi)) = P^M(y)$. To do so, first partition $D_C$ into two subsets $D^1_C$ and $D^2_C$ such that $D_C = D^1_C \cup D^2_C$, where for every $c \in D^1_C$ the first condition of the proposition holds, whereas for every $c \in D^2_C$ the second condition does. For each $c \in D^1_C$, construct the model $M_{\pi_c}$ by substituting the mechanism of $X$ given its parents with $\pi_c$ in $M$. Now, we get the following for each $c \in D^1_C$ (and an arbitrary fixed $x \in D_X$):
$$P^M(y \mid do(\pi_c), c) = P^{M_{\pi_c}}(y \mid c) = \sum_{s_c \in D_{S_c}} P^{M_{\pi_c}}(y \mid c, s_c)\, P^{M_{\pi_c}}(s_c \mid c) \overset{(a)}{=} \sum_{s_c \in D_{S_c}} P^{M_{\pi_c}}(y \mid do(x), c, s_c)\, P^{M_{\pi_c}}(s_c \mid c)$$
$$\overset{(b)}{=} \sum_{s_c \in D_{S_c}} P^M(y \mid do(x), c, s_c)\, P^{M_{\pi_c}}(s_c \mid c) = \sum_{s_c \in D_{S_c}} P^M(y \mid c, s_c)\, P^{M_{\pi_c}}(s_c \mid c) \overset{(c)}{=} \sum_{s_c \in D_{S_c}} P^M(y \mid c, s_c)\, P^M(s_c \mid c) = P^M(y \mid c),$$
where $(a)$ and $(b)$ hold since $Y \perp X \mid S_c, C$ in $\mathcal{G}^{L(c)}_\Pi$ and in $\mathcal{G}^{L(c)}$, respectively (equivalently, $Y \perp X \mid S_c, c$ in $M_{\pi_c}$ and in $M$). Moreover, $(c)$ holds because $P^M(S_c \mid do(\pi_c), c) = P^M(S_c \mid c)$. Next, for each $c \in D^2_C$, let $\pi_c(x \mid Pa_\Pi) = P^M(x \mid Z_c, c)$; then we get
$$P^M(y \mid do(\pi_c), c) = \sum_{x \in D_X, z_c \in D_{Z_c}} P^M(y \mid do(x), z_c, c)\, P^M(x \mid z_c, c)\, P^M(z_c \mid c) = \sum_{x \in D_X, z_c \in D_{Z_c}} P^M(y \mid x, z_c, c)\, P^M(x \mid z_c, c)\, P^M(z_c \mid c) = P^M(y \mid c),$$
where the second equality holds since $X \perp Y \mid Z_c, C$ in $\mathcal{G}^{L(c)}_{\underline{X}}$. In both cases, we have shown that for every $c \in D_C$,
$$P^M(y \mid do(\pi_c), c) = P^M(y \mid c). \qquad (19)$$
Now, define the policy $\pi^*(x \mid Pa_\Pi) := \sum_{c \in D_C} \mathbb{1}\{(Pa_\Pi)_C = c\}\, \pi_c$. We get the following, where the second line follows from Equation (19) and the fact that $C \cap De(X) = \emptyset$:
$$P^M(y \mid do(\pi^*)) = \sum_{c \in D_C} P^M(y \mid do(\pi^*), c)\, P^M(c \mid do(\pi^*)) = \sum_{c \in D_C} P^M(y \mid do(\pi_c), c)\, P^M(c) = \sum_{c \in D_C} P^M(y \mid c)\, P^M(c) = P^M(y).$$
The above equation implies that there is a policy $\pi^* \in \Pi$ such that $P^M(y \mid do(\pi^*)) = P^M(y)$, where $M$ is an arbitrary model compatible with $\mathcal{G}^L$ and $P(O)$. Therefore, $P(y)$ is imitable w.r.t. $\langle \mathcal{G}^L, \Pi, P(O) \rangle$, and $\pi^*$ is an imitating policy for it.

B FindSep Algorithm

FindSep was used as a subroutine in Alg. 1; it is summarized here as Algorithm 3. It takes a DAG $\mathcal{G}$, variables $X$ and $Y$, and a subset of variables $C$ as inputs, and finds a separating set containing $C$ between $X$ and $Y$ in $\mathcal{G}_{\underline{X}}$, if one exists. To do so, it constructs $Z$ in line 4. According to Lemma 3.4 of [40], with $I = C$ and $R = Pa_\Pi$, if such a separating set exists, then $Z$ is a separating set.

Algorithm 3 Find a possible π-backdoor admissible set
1: function FINDSEP($\mathcal{G}$, $\Pi$, $X$, $Y$, $C$)
2:   if $C$ is not specified then
3:     $C \leftarrow \emptyset$
4:   Set $Z = An(\{X, Y\} \cup C) \cap Pa_\Pi$
5:   if TESTSEP($\mathcal{G}_{\underline{X}}$, $X$, $Y$, $Z$) then
6:     return $Z$
7:   else
8:     return FAIL
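For concreteness, the following is a sketch of FINDSEP on top of networkx, assuming a recent version in which the d-separation test is exposed as nx.is_d_separator (older releases call it nx.d_separated). The function name find_sep and the argument pa_pi (the set $Pa_\Pi$ of variables the policy may observe) are our naming, not the paper's code; we also take ancestors in the full graph $\mathcal{G}$, which is one reading of line 4.

```python
# Sketch (ours) of Algorithm 3 using networkx; TESTSEP is realized by
# networkx's built-in d-separation test.
import networkx as nx

def find_sep(G, pa_pi, X, Y, C=None):
    """Find a separating set containing C between X and Y in the graph G
    with the outgoing edges of X removed, restricted to pa_pi.
    Returns None on FAIL."""
    C = set() if C is None else set(C)          # lines 2-3
    # Backdoor graph G_{\underline{X}}: drop the edges out of X.
    G_x = G.copy()
    G_x.remove_edges_from(list(G.out_edges(X)))
    # Line 4: Z = An({X, Y} U C) ∩ Pa_Π (excluding X and Y themselves).
    anc = set()
    for v in {X, Y} | C:
        anc |= nx.ancestors(G, v) | {v}
    Z = (anc & set(pa_pi)) - {X, Y}
    # Lines 5-8: TESTSEP via d-separation in G_{\underline{X}}.
    return Z if nx.is_d_separator(G_x, {X}, {Y}, Z) else None
```

Restricting $Z$ to $Pa_\Pi$ reflects the requirement that an imitating policy may only condition on the covariates it is allowed to observe.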
C Example for Algorithm 2

We present a simple counterexample illustrating that Algorithm 2 is not complete, in the sense that an instance may be imitable w.r.t. $\langle \mathcal{G}^L, \Pi, P(O) \rangle$ while Algorithm 2 fails to find an imitating policy. To that end, consider the instrumental-variable graph depicted in Figure 6. The set of equations
$$P(y \mid do(\pi)) = P(y \mid X{=}0, C{=}0)\, \pi(X{=}0) + P(y \mid X{=}1, C{=}0)\, \pi(X{=}1) = P(y \mid X{=}0)\, P(X{=}0) + P(y \mid X{=}1)\, P(X{=}1) = P(y),$$
for both values $y = 0$ and $y = 1$, is a system of two linear equations in two parameters, namely $\pi(X{=}0)$ and $\pi(X{=}1)$, which always has a solution. This indicates that the instance is imitable, i.e., there always exists at least one imitating policy. However, Algorithm 2 is not able to find any such policy.

Figure 6: An example where Algorithm 2 fails to identify an imitating policy, despite the fact that such a policy exists.

D Experimental setup

For the experiments of Section 5.2, we worked with an SCM in which $C \sim Be(0.05)$, $P(Y{=}0 \mid C{=}1) = 0.2$, $P(Y{=}1 \mid C{=}1) = 0.5$, $P(Y{=}2 \mid C{=}1) = 0.3$, $T \sim Be(0.4)$, $U_1 \sim Be(0.8)$, $X \mid T{=}0, U_1{=}0 \sim Be(0.7)$, $X \mid T{=}0, U_1{=}1 \sim Be(0.7)$, $X \mid T{=}1, U_1{=}0 \sim Be(0)$, and $X \mid T{=}1, U_1{=}1 \sim Be(1)$. Moreover, we had $S \mid C{=}0, X{=}0, U_1{=}0 \sim Be(1)$, $S \mid C{=}0, X{=}0, U_1{=}1 \sim Be(0)$, $S \mid C{=}0, X{=}1, U_1{=}0 \sim Be(0)$, and $S \mid C{=}0, X{=}1, U_1{=}1 \sim Be(1)$. Finally, $P(Y{=}0 \mid C{=}0, S{=}0) = 0.8$, $P(Y{=}1 \mid C{=}0, S{=}0) = 0.1$, $P(Y{=}2 \mid C{=}0, S{=}0) = 0.1$, $P(Y{=}0 \mid C{=}0, S{=}1) = 0.05$, $P(Y{=}1 \mid C{=}0, S{=}1) = 0.2$, and $P(Y{=}2 \mid C{=}0, S{=}1) = 0.75$.
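For concreteness, here is a direct sampler for this SCM, written by us as an illustrative sketch; we read $Be(p)$ as a Bernoulli($p$) distribution. The setup above leaves $S$ unspecified when $C = 1$ ($Y$ does not depend on $S$ in that regime), so the sketch arbitrarily sets $S = 0$ there.

```python
# Illustrative sampler (ours) for the SCM of Appendix D.
import numpy as np

rng = np.random.default_rng(0)
bern = lambda p: int(rng.random() < p)

def sample():
    C = bern(0.05)
    T = bern(0.4)
    U1 = bern(0.8)
    # X | T, U1 (X is degenerate when T = 1)
    p_x = {(0, 0): 0.7, (0, 1): 0.7, (1, 0): 0.0, (1, 1): 1.0}[(T, U1)]
    X = bern(p_x)
    # S | C = 0, X, U1; S is unspecified for C = 1, so we set it to 0 there.
    p_s = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}[(X, U1)]
    S = bern(p_s) if C == 0 else 0
    # Y takes values in {0, 1, 2}; its distribution depends on C (and on S
    # only when C = 0).
    if C == 1:
        probs = [0.2, 0.5, 0.3]
    elif S == 0:
        probs = [0.8, 0.1, 0.1]
    else:
        probs = [0.05, 0.2, 0.75]
    Y = int(rng.choice(3, p=probs))
    return dict(C=C, T=T, U1=U1, X=X, S=S, Y=Y)

data = [sample() for _ in range(10_000)]
```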