# Sequential Causal Imitation Learning with Unobserved Confounders

Daniel Kumor, Purdue University (dkumor@purdue.edu) · Junzhe Zhang, Columbia University (junzhez@cs.columbia.edu) · Elias Bareinboim, Columbia University (eb@cs.columbia.edu)

"Monkey see, monkey do" is an age-old adage, referring to naïve imitation without a deep understanding of a system's underlying mechanics. Indeed, if a demonstrator has access to information unavailable to the imitator (monkey), such as a different set of sensors, then no matter how perfectly the imitator models its perceived environment (SEE), attempting to reproduce the demonstrator's behavior (DO) can lead to poor outcomes. Imitation learning in the presence of a mismatch between demonstrator and imitator has been studied in the literature under the rubric of causal imitation learning (Zhang et al., 2020), but existing solutions are limited to single-stage decision-making. This paper investigates the problem of causal imitation learning in sequential settings, where the imitator must make multiple decisions per episode. We develop a graphical criterion that is necessary and sufficient for determining the feasibility of causal imitation, providing conditions under which an imitator can match a demonstrator's performance despite differing capabilities. Finally, we provide an efficient algorithm for determining imitability and corroborate our theory with simulations.

## 1 Introduction

Without access to observational data, an agent must learn how to operate at a suitable level of performance through trial and error (Sutton et al., 1998; Mnih et al., 2013). This from-scratch approach is often impractical in environments with the potential for extreme negative - and final - outcomes (driving off a cliff). While both Nature and machine learning researchers have approached the problem from a wide variety of perspectives, a particularly potent method, used with great success in many learning machines, including humans, is exploiting observations of other agents in the environment (Rizzolatti & Craighero, 2004; Hussein et al., 2017). Learning to act by observing other agents offers a data multiplier, allowing agents to take into account others' experiences. Even when the precise loss function is unknown (what exactly goes into being a good driver?), the agent can attempt to learn from "experts", namely agents which are known to gain an acceptable reward at the target task. This approach has been studied under the umbrella of imitation learning (Argall et al., 2009; Billard et al., 2008; Hussein et al., 2017; Osa et al., 2018). Several methods have been proposed, including inverse reinforcement learning (Ng et al., 2000; Abbeel & Ng, 2004; Syed & Schapire, 2008; Ziebart et al., 2008) and behavior cloning (Widrow, 1964; Pomerleau, 1989; Muller et al., 2006; Mülling et al., 2013; Mahler & Goldberg, 2017). The former attempts to reconstruct the loss/reward function that the experts minimize and then uses it for optimization; the latter directly copies the expert's actions. Despite the power entailed by this approach, it relies on a somewhat stringent condition: the expert's and imitator's sensory capabilities need to be perfectly matched.
As an example, self-driving cars rely solely on cameras or lidar, completely ignoring the auditory dimension - and yet most human demonstrators are able to exploit this data, especially in dangerous situations (car horns, screeching tires). Perhaps without a microphone, the self-driving car would incorrectly attribute certain behaviors to visual stimuli, leading to a poor policy?

Figure 1: (a, b) represent a simplified view of a driver X and surrounding cars F, B, S. (c) is imitable with policies π1(X1) = P(X1) and π2(X2|Z) = P(X2|Z), but in (d) X1, X2 is not imitable, despite there being a valid sequential backdoor.

For concreteness, consider the scenario shown in Fig. 1a, where the human driver (X, i.e., the demonstrator, in blue) is looking forward (F), and can hear car horns (H) from cars behind (B) and to the side (S). The driver's performance is represented by a variable Y (red), which is unobserved (dashed node). Since our dataset only contains visual data, car horns H remain unobserved to the learning agent (i.e., the imitator). Despite not being able to hear car horns, the learner in Fig. 1a has a full view of the car's surroundings, including cars behind and to the side, which turns out to be sufficient to perform imitation in this example. Consider an instance where F, B, S are drawn uniformly over {0, 1}. The reward is decided by Y ← ¬(X ⊕ F ⊕ B ⊕ S), where ⊕ represents the exclusive-or operator. The human driver decides the action X ← H, where the value of the horn H is given by H ← F ⊕ B ⊕ S. Preliminary analysis reveals that the learner could perfectly mimic the demonstrator's decision-making process using an imitating policy X ← F ⊕ B ⊕ S. On the other hand, if the driving system does not have side cameras, the side view S becomes latent; see Fig. 1b. The learner's reward E[Y | do(π)] is equal to 0.5 for any policy π(x | f, b), which is far from the optimal demonstrator's performance, E[Y] = 1.

Based on these examples, there arises the question of determining precise conditions under which an agent can account for the lack of knowledge or observations available to the expert, and how this knowledge should be combined to generate an optimal imitating policy, giving performance identical to the expert's on measure Y. These questions have been recently investigated in the context of causal imitation learning (Zhang et al., 2020), where a complete graphical condition and algorithm were developed for determining imitability in the single-stage decision-making setting with partially observable models (i.e., in non-Markovian settings). Other structural assumptions, such as linearity (Etesami & Geiger, 2020), were also explored in the literature, but were still limited to a single action. Finally, de Haan et al. (2019) explore the case where expert and imitator can observe the same contexts, but the causal diagram is not available. Despite this progress, it is still unclear how to systematically imitate, or even whether imitation is possible, when a learner must make several actions in sequence and expert and imitator observe differing sets of variables (e.g., Figs. 1c and 1d). The goal of this paper is to fill this gap in understanding.
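As a concrete check of the highway example above (Figs. 1a and 1b), the following minimal simulation sketch reproduces the stated values E[Y] = 1 and 0.5. It assumes the exclusive-or mechanics given in the text, with the reward written as the complement of X ⊕ F ⊕ B ⊕ S (so that the expert scores 1); the variable names and the particular no-side-view policy are illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# F, B, S drawn uniformly over {0, 1}; the horn H = F xor B xor S is heard
# only by the human expert, never by the learning agent.
F, B, S = rng.integers(0, 2, size=(3, n))
H = F ^ B ^ S

def reward(X):
    # Y = NOT(X xor F xor B xor S): the expert's choice X = H scores 1 on every episode.
    return 1 - (X ^ F ^ B ^ S)

print("expert (X = H)          E[Y] =", reward(H).mean())          # 1.0
print("imitator with {F, B, S} E[Y] =", reward(F ^ B ^ S).mean())  # 1.0  (Fig. 1a)
print("imitator with {F, B}    E[Y] =", reward(F ^ B).mean())      # ~0.5 (Fig. 1b)
```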
More specifically, our contributions are as follows. (1) We provide a graphical criterion for determining whether imitation is feasible in sequential settings, based on a causal graph encoding the domain's causal structure. (2) We propose an efficient algorithm to determine imitability and to find the policy for each action that leads to proper imitation. (3) We prove that the proposed criterion is complete (i.e., both necessary and sufficient). Finally, we verify through simulations that our approach compares favorably with existing methods in contexts where a demonstrator has access to latent variables. Due to space constraints, proofs are provided in the complete technical report (Kumor et al., 2021).

### 1.1 Preliminaries

We start by introducing the notation and definitions used throughout the paper. In particular, we use capital letters for random variables (Z), and small letters for their values (z). Bolded letters represent sets of random variables and their samples (Z = {Z1, ..., Zn}, z = {z1, ..., zn}). |Z| represents a set's cardinality. The joint distribution over variables Z is denoted by P(Z). To simplify notation, we consistently use the shorthand P(zi) to represent the probability P(Zi = zi).

Figure 2: Despite there being no latent path between Y and any X, the query in (a) is not imitable, but the query in (b) is imitable. While (c) is imitable if Z comes before X2 in temporal order, the query in (d) is imitable only if Z comes before X1.

The basic semantic framework of our analysis rests on structural causal models (SCMs) (Pearl, 2000, Ch. 7). An SCM M is a tuple ⟨U, V, F, P(u)⟩, with V the set of endogenous and U the set of exogenous variables. F is a set of structural functions such that for each V ∈ V, V ← f_V(pa_V, u_V), with PA_V ⊆ V and U_V ⊆ U. Values of U are drawn from an exogenous distribution P(u), inducing a distribution P(V) over the endogenous variables V. Since the learner can observe only a subset of the endogenous variables, we split V into O ⊆ V (observed) and L = V \ O (latent) sets of variables. The marginal P(O) is thus referred to as the observational distribution. Each SCM M is associated with a causal diagram G where (e.g., see Fig. 2d) solid nodes represent observed variables O, dashed nodes represent latent variables L, and arrows represent the arguments pa(V) of each functional relationship f_V. Exogenous variables U are not explicitly shown; a bidirected arrow between nodes Vi and Vj indicates the presence of an unobserved confounder (UC) affecting both Vi and Vj. We will use standard conventions to represent graphical relationships such as parents, children, descendants, and ancestors. For example, the set of parent nodes of a set X in G is denoted by pa(X)_G = ∪_{X ∈ X} pa(X)_G; ch, de, and an are defined similarly. Capitalized versions Pa, Ch, De, An include the argument as well, e.g., De(X)_G = de(X)_G ∪ X. An observed variable Vi ∈ O is an effective parent of Vj ∈ V if there is a directed path from Vi to Vj in G such that every internal node on the path is in L. We define pa+(S) as the set of effective parents of variables in S, excluding S itself, and Pa+(S) as S ∪ pa+(S). Other relations, like ch+(S), are defined similarly. A path from a node X to a node Y in G is said to be "active" conditioned on a (possibly empty) set W if every collider A along the path (→ A ←) satisfies A ∈ An(W), and the path does not otherwise contain vertices from W (d-separation, Koller & Friedman (2009)). X and Y are independent conditioned on W, written (X ⊥ Y | W)_G, if there are no active paths between any X ∈ X and Y ∈ Y. For a subset X ⊆ V, the subgraph obtained from G by removing edges outgoing from X / incoming into X is written $G_{\underline{X}}$ / $G_{\overline{X}}$, respectively.
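To make the pa+ notation concrete, the following sketch computes effective parents on a networkx DiGraph in which latent variables appear as ordinary nodes; the function name and the toy graph are ours (for illustration only), not constructs from the paper.

```python
import networkx as nx

def effective_parents(G: nx.DiGraph, latent: set, S: set) -> set:
    """pa+(S): observed nodes with a directed path into S whose internal nodes are all latent."""
    observed = set(G.nodes) - set(latent)
    found, frontier, seen = set(), set(S), set(S)
    while frontier:
        nxt = set()
        for v in frontier:
            for p in G.predecessors(v):
                if p in seen:
                    continue
                seen.add(p)
                if p in observed:
                    found.add(p)   # observed effective parent; do not walk past it
                else:
                    nxt.add(p)     # latent internal node; keep walking upwards
        frontier = nxt
    return found - set(S)

# Toy chain V1 -> L -> V2 with L latent, so pa+({V2}) = {V1}.
G = nx.DiGraph([("V1", "L"), ("L", "V2")])
print(effective_parents(G, latent={"L"}, S={"V2"}))  # {'V1'}
```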
Finally, we utilize a grouping of observed nodes, called confounded components (c-components, Tian & Pearl (2002); Tian (2002)).

Definition 1.1. For a causal diagram G, let N be a set of unobserved variables in L ∪ U. A set C ⊆ Ch(N) ∩ O is a c-component if for any pair Ui, Uj ∈ N, there exists a path between Ui and Uj in G such that every observed node Vk ∈ O on the path is a collider (i.e., → Vk ←).

C-components correspond to observed variables whose values are affected by related sets of unobserved common causes, such that if A, B ∈ C, then A and B are not independent given O \ {A, B}. In particular, we focus on maximal c-components C, for which there does not exist a c-component C' such that C ⊊ C'. The collection of maximal c-components forms a partition C1, ..., Cm over the observed variables O. For any set S ⊆ O, let C(S) be the union of the c-components Ci that contain variables in S. For instance, for variable Z in Fig. 1d, the c-component C({Z}) = {Z, X1}.

## 2 Causal Sequential Imitation Learning

We are interested in learning a policy over a series of actions X ⊆ O such that the imitator obtains an average reward Y ∈ V identical to that of an expert demonstrator. More specifically, let the variables in X be ordered as X1, ..., Xn, n = |X|. Actions are taken sequentially by the imitator, where only information available at the time of the action can be used to inform a policy for Xi ∈ X. To encode the ordering of observations and actions in time, we fix a topological ordering on the variables of G, which we call the "temporal ordering". We define functions before(Xi) and after(Xi) to represent the nodes that come before/after an action Xi ∈ X following this ordering, excluding Xi itself. A policy π on actions X is a sequence of decision rules {π1, ..., πn}, where each πi(Xi | Zi) is a function mapping from the domains of covariates Zi ⊆ before(Xi) to the domain of action Xi. The imitator following a policy π in place of the demonstrator in the environment is encoded by replacing the expert's original policy in the SCM M with π, which gives the results of the imitator's actions as P(V | do(π)). Our goal is to learn an imitating policy π such that the induced distribution P(Y | do(π)) perfectly matches the original expert's performance P(Y). Formally:

Definition 2.1. (Zhang et al., 2020) Given a causal diagram G, Y ∈ V is said to be imitable with respect to actions X ⊆ O in G if there exists π ∈ Π, uniquely discernible from the observational distribution P(O), such that for all SCMs M compatible with G, P(Y)_M = P(Y | do(π))_M.

In other words, the expert's performance on reward Y is imitable if any set of SCMs must share the same imitating policy π ∈ Π whenever they generate the same causal diagram G and the same observational distribution P(O). Henceforth, we will consistently refer to Def. 2.1 as the fundamental problem of causal imitation learning. For single-stage decision-making problems (X = {X}), Zhang et al. (2020) demonstrated that the reward Y is imitable if and only if there exists a set Z ⊆ before(X) such that $(Y \perp X \mid Z)_{G_{\underline{X}}}$, called a backdoor admissible set (Pearl, 2000, Def. 3.3.1) (Z = {F, B, S} in Fig. 1a). It is verifiable that an imitating policy is given by π(X | F, B, S) = P(X | F, B, S). Since the backdoor criterion is complete for the single-stage problem, one may be tempted to surmise that a version of the criterion generalized to multiple interventions might likewise solve the imitability problem in the general case (|X| > 1).
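The single-stage condition $(Y \perp X \mid Z)_{G_{\underline{X}}}$ can be tested mechanically. Below is a small sketch using the d-separation test available in recent networkx releases (is_d_separator, networkx ≥ 3.3; older versions expose the same check as d_separated). Latent variables are kept as explicit nodes that are simply never conditioned on, and the edge list for Fig. 1a is our reading of that diagram, not an official artifact of the paper.

```python
import networkx as nx

def backdoor_admissible(G: nx.DiGraph, X, Y, Z) -> bool:
    """Check (Y ⊥ X | Z) in G with edges outgoing from X removed."""
    G_under = G.copy()
    G_under.remove_edges_from(list(G.out_edges(X)))
    return nx.is_d_separator(G_under, {X}, {Y}, set(Z))

# Fig. 1a (assumed edges): F, B, S -> H (latent horn), H -> X, and F, B, S, X -> Y.
G = nx.DiGraph([("F", "H"), ("B", "H"), ("S", "H"), ("H", "X"),
                ("X", "Y"), ("F", "Y"), ("B", "Y"), ("S", "Y")])
print(backdoor_admissible(G, "X", "Y", {"F", "B", "S"}))  # True: Z = {F, B, S} works (Fig. 1a)
print(backdoor_admissible(G, "X", "Y", {"F", "B"}))       # False: side view missing (Fig. 1b)
```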
Next, we show that this is not the case. Let X1:i stand for the sequence of variables {X1, ..., Xi}, with X1:i = ∅ if i < 1. Pearl & Robins (1995) generalized the backdoor criterion to the sequential decision-making setting as follows:

Definition 2.2. (Pearl & Robins, 1995) Given a causal diagram G, a set of action variables X, and target node Y, sets Z1 ⊆ before(X1), ..., Zn ⊆ before(Xn) satisfy the sequential backdoor for (G, X, Y) if for each Xi ∈ X, $(Y \perp X_i \mid X_{1:i-1}, Z_{1:i})$ holds in $G_{\underline{X_i}\,\overline{X_{i+1:n}}}$, i.e., in G with edges outgoing from Xi and edges incoming into Xi+1:n removed.

While the sequential backdoor is an extension of the backdoor to multi-stage decisions, its existence does not always guarantee the imitability of the latent reward Y. As an example, consider the causal diagram G described in Fig. 1d. In this case, Z1 = {}, Z2 = {Z} (i.e., the assignment {(X1, Z1), (X2, Z2)}) is a sequential backdoor set for (G, {X1, X2}, Y), but there are distributions for which no agent can imitate the demonstrator's performance (Y) without knowledge of either the latent U1 or U2. To witness, suppose that the adversary sets up an SCM with binary variables as follows: U1, U2 ∼ Bern(0.5), with X1 := U1, Z := U1 ⊕ U2, X2 := Z, and Y := ¬(X1 ⊕ X2 ⊕ U2), with ⊕ as binary XOR. The fact that U ⊕ U = 0 is exploited to generate a chain where each latent variable appears exactly twice in Y, making Y = ¬(U1 ⊕ (U1 ⊕ U2) ⊕ U2) = 1. On the other hand, when imitating, X1 can no longer base its value on U1, making the imitated Ŷ = ¬(X̂1 ⊕ X̂2 ⊕ U2). The imitator can do no better than E[Ŷ] = 0.5! We refer readers to (Kumor et al., 2021, Proposition C.1) for a more detailed explanation.

### 2.1 Sequential Backdoor for Causal Imitation

We now introduce the main result of this paper: a generalized backdoor criterion that allows one to learn imitating policies in the sequential setting. For a sequence of covariate sets Z1 ⊆ before(X1), ..., Zn ⊆ before(Xn), let G^i, i = 1, ..., n, be the manipulated graph obtained from a causal diagram G by (1) removing all arrows coming into nodes in Xi+1:n, and (2) adding arrows Zi+1 → Xi+1, ..., Zn → Xn. We can then define a sequential backdoor criterion for causal imitation as follows:

Definition 2.3. Given a causal diagram G, a set of action variables X, and target node Y, sets Z1 ⊆ before(X1), ..., Zn ⊆ before(Xn) satisfy the "sequential π-backdoor"¹ for (G, X, Y) if at each Xi ∈ X, either (1) $(X_i \perp Y \mid Z_i)$ holds in $(G^i)_{\underline{X_i}}$, or (2) Xi ∉ An(Y) in G^i.

¹The π in "π-backdoor" is part of the name, and does not refer to any specific policy.

Figure 3: Examples of G^1. In Fig. 1c, we can have Z1 = ∅, Z2 = {Z}, so X2 has its parents cut, and a new arrow added from Z to X2 (blue). The independence check (X1 ⊥ Y | ∅) is done in graph (a) with edges outgoing from X1 removed (orange). In Fig. 2b, using Z1 = ∅, Z2 = {X1}, we first replace the parents of X2 with just X1 (b), and then remove both resulting outgoing edges from X1 to check whether (X1 ⊥ Y). On the other hand, in Fig. 2c, if Z2 = {Z}, we get (c), which means X1 ∉ An(Y), passing condition 2 of Def. 2.3. Finally, in Fig. 2d, with Z2 = {W}, X1 must condition on either Z or W to be independent of Y in (d) once the edge X1 → Y is removed.
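Def. 2.3 can be verified directly for a candidate choice of Z1, ..., Zn by constructing the manipulated graphs G^i. The sketch below does so with networkx (is_d_separator requires networkx ≥ 3.3); the helper names are ours, and the edge list for Fig. 1d is an assumption taken from the adversarial SCM above (U1 → {X1, Z}, U2 → {Z, Y}, Z → X2, X1 → Y, X2 → Y), with the latents kept as explicit nodes.

```python
import networkx as nx

def manipulated_graph(G, actions, Zs, i):
    """Build G^i: cut all edges into X_{i+1:n}, then add Z_j -> X_j for each j > i."""
    Gi = G.copy()
    for j in range(i + 1, len(actions)):
        Xj = actions[j]
        Gi.remove_edges_from(list(Gi.in_edges(Xj)))
        Gi.add_edges_from((z, Xj) for z in Zs[j])
    return Gi

def sequential_pi_backdoor(G, actions, Zs, Y):
    """Def. 2.3: every X_i must satisfy condition (1) or condition (2)."""
    for i, Xi in enumerate(actions):
        Gi = manipulated_graph(G, actions, Zs, i)
        if Xi not in nx.ancestors(Gi, Y):                            # condition (2)
            continue
        Gi_under = Gi.copy()
        Gi_under.remove_edges_from(list(Gi.out_edges(Xi)))
        if not nx.is_d_separator(Gi_under, {Xi}, {Y}, set(Zs[i])):   # condition (1)
            return False
    return True

# Fig. 1d (assumed edges; U1, U2 are latent and never appear in any Z_i):
G = nx.DiGraph([("U1", "X1"), ("U1", "Z"), ("U2", "Z"), ("U2", "Y"),
                ("Z", "X2"), ("X1", "Y"), ("X2", "Y")])
# Z1 = {}, Z2 = {Z} is a valid *sequential backdoor* (Def. 2.2) for this graph,
# yet it fails the sequential π-backdoor, consistent with the counterexample above.
print(sequential_pi_backdoor(G, ["X1", "X2"], [set(), {"Z"}], "Y"))  # False
```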
The first condition of Def. 2.3 is similar to the backdoor criterion, where Zi is a set of variables that effectively encodes all information relevant to imitating Xi with respect to Y. In other words, if the joint P(Zi ∪ {Xi}) matches when both expert and imitator are acting, then an adversarial Y cannot distinguish between the two. The critical modification of the original π-backdoor for the sequential setting comes from the causal graph in which this check happens: G^i can be seen as G with all future actions of the imitator already encoded in the graph. That is, when performing a check for Xi, it is done with all actions after i being performed by the imitator rather than the expert, with the parents of each future Xj, j > i, replaced with the corresponding imitator's conditioning set. Several examples of G^i are shown in Fig. 3.

The second condition allows for the case where an action at Xi has no effect on the value of Y once future actions are taken. Since G^i has modified parents for each future Xj, j > i, the value of Xi might no longer be relevant at all to Y, i.e., Y would get the same input distribution no matter what policy is chosen for Xi. This allows Xi to fail condition (1), meaning that it is not imitable by itself, but still be part of an imitable set X, because future actions can shield Y from errors made at Xi. The distinction between conditions 1 and 2 is shown in Fig. 3c: in the original graph G described in Fig. 2c, if Z comes after X1, then there is no valid adjustment set that can d-separate X1 from Y. However, if the imitating policy for X2 uses Z instead of W or X1 (i.e., π_{X2} = P(X2|Z)), X1 will no longer be an ancestor of Y in G^1. In effect, the action made at X2 ignores the inevitable mistakes made at X1 due to not having access to the confounder U1 when taking that action.

Indeed, the sequential π-backdoor criterion can be seen as recursively applying the single-action π-backdoor. Starting from the last action Xk in temporal order, one can directly show that Y is imitable using a backdoor admissible set Zk (or that Xk doesn't affect Y by any causal path). Replacing Xk in the SCM with this new imitating policy, the resulting SCM with graph G^{k-1} has an identical distribution over Y as G. The procedure can then be repeated for Xk-1 using G^{k-1} as the starting graph, and continued recursively, showing imitability for the full set:

Theorem 2.1. Given a causal diagram G, a set of action variables X, and target node Y, if there exist sets Z1, Z2, ..., Zn that satisfy the sequential π-backdoor criterion with respect to (G, X, Y), then Y is imitable with respect to X in G with policy π(Xi|Zi) = P(Xi|Zi) for each Xi ∈ X.

Thm. 2.1 establishes the sufficiency of the sequential π-backdoor for imitation learning. Consider again the diagram in Fig. 2c. It is verifiable that the covariate sets Z1 = {}, Z2 = {Z} are sequential π-backdoor admissible. Thm. 2.1 implies that the imitating policy is given by π1(X1) = P(X1) and π2(X2|Z) = P(X2|Z). Once π-backdoor admissible sets are obtained, the imitating policy can be learned from the observational data through standard density estimation methods for stochastic policies, and supervised learning methods for deterministic policies. This means that the sequential π-backdoor is a method for choosing a set of covariates to use when performing imitation learning, which can be used instead of Pa(Xi) for each Xi ∈ X in the case where the imitator does not observe certain elements of Pa(Xi). With the covariates chosen using the sequential π-backdoor, one can use domain-specific algorithms for computing an imitating policy based on the observational data.
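For discrete observational data, the policies prescribed by Thm. 2.1 can be estimated with simple conditional frequency counts. The sketch below is one minimal way to do this; the function names are ours, and any density-estimation or supervised-learning method over the chosen covariates Zi can be substituted in practice.

```python
import numpy as np
from collections import Counter, defaultdict

def fit_policy(x_samples, z_samples):
    """Estimate pi_i(X_i | Z_i) = P(X_i | Z_i) from expert demonstrations (discrete data)."""
    counts = defaultdict(Counter)
    for x, z in zip(x_samples, map(tuple, z_samples)):
        counts[z][x] += 1

    def policy(z, rng=np.random.default_rng()):
        c = counts.get(tuple(z))
        if c is None:                            # unseen context: fall back to the marginal of X_i
            c = Counter(np.asarray(x_samples).tolist())
        values, freqs = zip(*c.items())
        return rng.choice(values, p=np.array(freqs) / sum(freqs))

    return policy

# Example: fit pi_2(X2 | Z) = P(X2 | Z) from demonstrations in which X2 := Z.
rng = np.random.default_rng(0)
Z_obs = rng.integers(0, 2, 10_000)
X2_obs = Z_obs.copy()
pi2 = fit_policy(X2_obs, Z_obs.reshape(-1, 1))
print(pi2([1], rng))   # returns 1 (the estimated conditional puts all mass on X2 = 1)
```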
## 3 Finding Sequential π-Backdoor Admissible Sets

At each Xi, the sequential π-backdoor criterion requires that Zi is a backdoor adjustment set in the manipulated graph G^i. There already exist efficient methods for finding adjustment sets in the literature (Tian & Paz, 1998; van der Zander & Liśkiewicz, 2020), so if the adjustment were with reference to G, one could run these algorithms on each Xi independently to find each backdoor admissible set Zi. However, each action Xi has its Zi defined in the manipulated graph G^i, which depends on the adjustment sets used for future actions Xi+1:n. This means that certain adjustment sets Zj for Xj, j > i, will make it so that no Zi for Xi satisfies the criterion in G^i! As an example, in Fig. 2c, X2 can use any combination of Z, X1, W as a valid adjustment set Z2. However, if Z comes after X1 in temporal order, only Z2 = {Z} leads to valid imitation over X = {X1, X2}. The direct approach towards solving this problem would involve enumerating all possible backdoor admissible sets Zi for each Xi, but there are both exponentially many backdoor admissible sets Zi and exponentially many combinations of sets over multiple actions Xi:n. Such a direct exponential enumeration is not feasible in practical settings. To address these issues, this section develops Alg. 1, which finds a sequential π-backdoor admissible set Z1:n with regard to actions X in a causal diagram G in polynomial time, if such a set exists.

Before delving into the details of Alg. 1, we describe a method, intuitively motivated by the "Nearest Separator" of Van der Zander et al. (2015), that can generate a backdoor admissible set Zi for a single independent action Xi in the presence of unobserved variables. While it does not solve the problem of multiple actions due to the issues listed above, it is a building block for Alg. 1. Consider the Markov Boundary (minimal Markov Blanket, Pearl (1988)) for a set of nodes OX ⊆ O, which is defined as the minimal set Z ⊆ O \ OX such that (OX ⊥ O \ OX | Z). This definition can be applied to graphs with latent variables, where it can be constructed in terms of c-components:

Lemma 3.1. Given OX ⊆ O, the Markov Boundary of OX in G is Pa+(C(Ch+(OX))) \ OX.

If there is a set Z ⊆ before(Xi) that satisfies the backdoor criterion for Xi with respect to Y, then, taking G^Y as the ancestral graph of Y, the Markov Boundary Z′ of Xi in $(G^Y)_{\underline{X_i}}$ satisfies Z′ ⊆ before(Xi), and also satisfies the backdoor criterion in G (Lem. C.1). The Markov Boundary can therefore be used to generate a backdoor adjustment set wherever one exists. A naïve algorithm for the sequential π-backdoor that uses the Markov Boundary of Xi ∈ X in $((G^i)^Y)_{\underline{X_i}}$ as the corresponding Zi, and returns a failure whenever Zi ⊈ before(Xi), is equivalent to the existing literature on finding backdoor-admissible sets. It cannot create a valid sequential π-backdoor for Fig. 2c, since X2 would have Z2 = {W}, but no adjustment set exists for X1 that d-separates it from Y in the resulting G^1. We must take into account the interactions between actions encoded in G^i. We notice that an Xi does not require a valid adjustment set if it is not an ancestor of Y in G^i (i.e., Xi does not need to satisfy (1) of Def. 2.3 if it can satisfy (2)). Furthermore, even if Xi is an ancestor of Y in G^i, and therefore must satisfy condition (1) of Def. 2.3, any elements of its c-component that are not ancestors of Y in G^i won't be part of (G^i)^Y, and therefore don't need to be conditioned on.
It is therefore beneficial for an action Xj to have a backdoor adjustment set that maximizes the number of nodes that are not ancestors of Y in G^{j-1}, so that earlier actions Xi, i < j, have a better chance of finding valid adjustment sets of their own. The main loop of Alg. 1 (FINDOX(G, X, Y)) maintains a mapping OX from observed nodes to actions and updates it until it stabilizes; its core update reads:

    10: if |ch+(Oi)| > 0 and ch+(Oi) ⊆ keys(OX) then
    11:     Xi ← earliest element of OX[ch+(Oi)] in temporal order
    12:     if HASVALIDADJUSTMENT(G, keys(OX), Oi, Xi) then
    13:         OX[Oi] ← Xi
    14: else if Oi ∈ X and HASVALIDADJUSTMENT(G, keys(OX), Oi, Oi) then
    15:     OX[Oi] ← Oi
    16: while |OX| changed in most recent pass
    17: return keys(OX)

As an example, consider running FINDOX on Fig. 2c. The algorithm starts from the last node in temporal order, Y, which has no children and is not an element of X, so it is not added to OX. It then carries on to X2, which is checked for the existence of a valid backdoor adjustment set. Here, the subgraph of the c-component of X2 and its parents (HASVALIDADJUSTMENT) is simply W → X2, meaning that we can condition on W to make X2 independent of all other observed variables, including Y, in $G_{\underline{X_2}}$ (W is a Markov Boundary for X2 in $G_{\underline{X_2}}$). The algorithm therefore sets OX = {X2 : X2}, because X2 is an action with a valid adjustment set. Notice that if the algorithm returned at this point with OX = {X2}, the Markov Boundary of OX in $G_{\underline{X_2}}$ is {W}, and corresponds to a sequential π-backdoor for the single action X2 (ignoring X1), with policy π(X2|W) = P(X2|W). Next, W has X2 as its only child, which itself maps to X2 in OX. The subgraph of W's c-component and its parents is X1 → W, giving (W ⊥ O | X1) in $G_{\underline{W}}$, and {X1} ⊆ before(X2), allowing us to conclude that there is a backdoor admissible set for X2 under which W is no longer an ancestor of Y. We set OX = {X2 : X2, W : X2}, and indeed, with keys(OX) = {X2, W}, the Markov Boundary of OX in $G_{\underline{X_2}}$ is {X1}, which is once again a valid sequential π-backdoor for the single action X2 (ignoring X1), with policy π(X2|X1) = P(X2|X1). The W in OX was correctly labeled as not being an ancestor of Y after action X2 is taken. Since Z does not have its children in the keys of OX, and is not an element of X, it is skipped, leaving only X1. Since X1's children ({W}) are in OX, we check conditioning using X2 instead of X1 (i.e., we check whether X1 can satisfy (2) of Def. 2.3, and thus not be an ancestor of Y in G^1). This time, we have X1 ↔ Z as the c-component subgraph, and Z comes before X2, satisfying the check (X1 ⊥ O | Z) in HASVALIDADJUSTMENT, resulting in OX = {X2 : X2, W : X2, X1 : X2}, i.e., keys(OX) = {X2, W, X1}. Indeed, the Markov Boundary of OX in $G_{\underline{X_2}}$ is {Z}, and we can construct a valid sequential π-backdoor by using Z1 = {} and Z2 = {Z}, where X1 is no longer an ancestor of Y in G^1! In this case, we call X2 a "boundary action", because it is an ancestor of Y in G^2. On the other hand, X1 is not a boundary action, because it is not an ancestor of Y in G^1.

Definition 3.1. The set XB ⊆ X of "boundary actions" for OX := FINDOX(G, X, Y) consists of all elements Xi ∈ X ∩ OX where ch+(Xi) ⊈ OX.

Alg. 1 is general: the set OX returned by FINDOX can always be used, in conjunction with its Markov Boundary, to construct a sequential π-backdoor if one exists:

Lemma 3.2. Let OX := FINDOX(G, X, Y), and X′ := OX ∩ X. Taking Z as the Markov Boundary of OX in $(G^Y)_{\underline{X}}$ and XB as the boundary actions of OX, the sets Zi = (Z ∪ XB) ∩ before(X′i) for each X′i ∈ X′ are a valid sequential π-backdoor for (G, X′, Y).

Lemma 3.3. Let OX := FINDOX(G, X, Y). Suppose that there exists a sequential π-backdoor for X′ ⊆ X. Then X′ ⊆ OX.

Combined together, Lems. 3.2 and 3.3 show that FINDOX finds the maximal subset of X for which a sequential π-backdoor exists, and the adjustment sets Z1:n can be constructed using the subset of a Markov Boundary over OX that comes before each corresponding action Xi (Lem. 3.2).
FINDOX is therefore both necessary and sufficient for generating a sequential π-backdoor:

Theorem 3.1. Let OX be the output of FINDOX(G, X, Y). A sequential π-backdoor exists for (G, X, Y) if and only if X ⊆ OX.

## 4 Necessity of Sequential π-Backdoor for Imitation

In this section, we show that the sequential π-backdoor is necessary for imitability, meaning that the sequential π-backdoor is complete. A given imitation problem can have multiple possible conditioning sets satisfying the sequential π-backdoor, and a violation of the criterion for one set does not preclude the existence of another that satisfies the criterion. To avoid this issue, we will use the output of Algorithm FINDOX, which returns a unique set OX for each problem:

Lemma 4.1. Let OX := FINDOX(G, X, Y). Suppose there exists Xi ∈ X such that Xi ∈ X \ OX. Then X is not imitable with respect to Y in G.

Our next proposition establishes the necessity of the sequential π-backdoor criterion for the imitability of the expert's performance (Def. 2.1), which follows immediately from Lem. 4.1 and Thm. 3.1.

Theorem 4.1. If there do not exist adjustment sets satisfying the sequential π-backdoor criterion for (G, X, Y), then X is not imitable with respect to Y in G.

The proof of Lem. 4.1 relies on the construction of an adversarial SCM for which Y can detect the imitator's lack of access to the latent variables. For example, in Fig. 2a, Z can carry information about the latent variable U to Y, and is only determined after the decision for the value of X is made. Setting U ∼ Bern(0.5), X := U, Z := U, Y := ¬(X ⊕ Z) leaves the imitator with a performance of E[Ŷ] = 0.5, while the expert can get perfect performance (E[Y] = 1). Another example with similar mechanics can be seen in Fig. 2c. If the variables are determined in the order (X1, W, X2, Z, Y), then the sequence of actions is not imitable, since Z can transfer information about the latent variable U to Y, while X2 has no way of gaining information about U, because the action at X1 needed to be taken without that context. Finally, observe Fig. 2d. If Z is determined after X1, the imitator must guess a value for X1 without this side information, which is then combined with U2 at W. An adversary can exploit this to construct a distribution where guessing wrong can be detected at Y as follows: U1 ∼ Bern(0.5), Z := U1, X1 := U1, and U2 ∼ (Bern(0.5), Bern(0.5)) (that is, U2 is a tuple of two binary variables, or equivalently a single variable uniform over {0, 1, 2, 3}). Then setting W := U2[Z] (where [·] represents array access, i.e., the first element of the tuple if Z = 0 and the second if Z = 1), X2 := W, and Y := (U2[X1] == X2) gives E[Y] = 1 only if π1 guesses the value of U1, meaning that the imitator can never achieve the expert's performance. This construction also demonstrates non-imitability when X1 and Z are switched (i.e., Fig. 2c with W → Y added, and X1 coming before Z in temporal order).
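Both constructions are straightforward to simulate. The sketch below follows the text, writing the Fig. 2a reward as the complement of X ⊕ Z so that the expert scores 1, and representing U2 in the Fig. 2d construction as a pair of independent bits indexed by array access; the sample size and the imitator's uniform guessing policy are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Fig. 2a: U ~ Bern(0.5), X := U, Z := U, Y := NOT(X xor Z).
U = rng.integers(0, 2, n)
print("2a expert  :", (1 - (U ^ U)).mean())                   # 1.0
X_hat = rng.integers(0, 2, n)                                 # imitator acts without seeing U
print("2a imitator:", (1 - (X_hat ^ U)).mean())               # ~0.5

# Fig. 2d: U1 ~ Bern(0.5); Z := U1, X1 := U1; U2 is a pair of independent bits;
# W := U2[Z], X2 := W, Y := 1{U2[X1] == X2}.
U1 = rng.integers(0, 2, n)
U2 = rng.integers(0, 2, (n, 2))
Z, X1 = U1, U1
W = U2[np.arange(n), Z]
print("2d expert  :", (U2[np.arange(n), X1] == W).mean())     # 1.0
X1_hat = rng.integers(0, 2, n)                                # imitator can only guess U1
print("2d imitator:", (U2[np.arange(n), X1_hat] == W).mean()) # < 1 (about 0.75 here)
```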
Due to these results, after running Alg. 1 on the domain's causal structure, the imitator gets two pieces of information:

1. Is the problem imitable? In other words, is it possible to use only observable context variables and still get provably optimal imitation, despite the expert and imitator having different information?
2. If so, what context should be included in each action? Including/removing certain observed covariates in an estimation procedure can lead to different conclusions/actions, only one of which is correct (known as Simpson's Paradox in the statistics literature (Pearl, 2000)). Furthermore, as demonstrated in Fig. 2c, when performing actions sequentially, some actions might not be imitable by themselves (X1 if Z comes after X1), which leads to bias in observed descendants (W) - the correct context takes this into account, using only covariates known not to be affected by incorrectly guessed actions.

Finally, the obtained context Zi for every action Xi can be used as input to existing algorithms for behavioral cloning, giving an imitating policy with an unbiased result.

## 5 Simulations

We performed two experiments (for full details, refer to Kumor et al. (2021, Appendix B)), comparing the performance of 4 separate approaches to determining which variables to include in an imitating policy:

1. All Observed (AO): take into account all variables available to the imitator at the time of each action. This is the approach most commonly used in the literature.
2. Observed Parents (OP): the expert used a set of variables to take each action; use the subset of these that are available to the imitator.
3. π-Backdoor: in certain cases, each individual action can be imitated independently, so the individual single-action covariate sets are used.
4. Sequential π-Backdoor (ours): the method developed in this paper, which takes into account multiple actions in sequence.

Figure 4: Results of applying supervised learning techniques to continuous data with different sets of variables as input at each action. OPT is the ground truth expert's performance, SπBD represents our method, AO is all observed, and OP represents observed parents.

The first simulation consists of running behavioral cloning on randomly sampled distributions consistent with a series of causal graphs designed to showcase aspects of our method. For each causal graph, 10,000 random discrete causal models were sampled, representing the environment as well as expert performance, and then the expert's policy over X was replaced with imitating policies approximating π(Xi) = P(Xi | ctx(Xi)), with the context ctx determined by each of the 4 tested methods in turn. Our results are shown in Table 1, with the causal graphs in the first column, the temporal ordering of variables in the second column, and the absolute difference between expert and imitator performance for the 4 methods in the remaining columns.

Table 1: Values of |E[Y] − E[Ŷ]| from behavioral cloning using different contexts in randomly sampled models consistent with each causal graph. Cases with incorrect imitation are shown in bold.

| # | Structure | Order | Seq. π-Backdoor | π-Backdoor | Observed Parents | All Observed |
|---|-----------|-------|-----------------|------------|------------------|--------------|
| 1 | (diagram) | Z, X1, X2, Y | 0.04 ± 0.04% | 0.04 ± 0.03% | 0.05 ± 0.04% | **0.13 ± 0.18%** |
| 2 | (diagram) | Z, X1, X2, Y | 0.05 ± 0.03% | 0.05 ± 0.03% | **0.20 ± 0.25%** | 0.05 ± 0.03% |
| 3 | (diagram) | X1, Z, X2, Y | 0.04 ± 0.03% | Not Imitable | **0.27 ± 0.40%** | **0.26 ± 0.39%** |
| 4 | (diagram with U1, U2) | X1, Z, X2, Y | Not Imitable | Not Imitable | **0.19 ± 0.29%** | **0.19 ± 0.29%** |

In the first row, including Z when developing a policy for X leads to a biased answer, which makes the average error of using all observed covariates (in bold) larger than just the sampling fluctuations present in the other columns. Similarly, Z needs to be taken into account in row 2, but it is not explicitly used by X, so a method relying only on observed parents leads to bias there. In the next row, Z is not observed at the time of action X1, making the π-backdoor incorrectly claim non-imitability. Our method recognizes that X2's policy can fix the error made at X1, and is the only method that leads to an unbiased result. Finally, in the 4th row, the non-causal approaches have no way to determine non-imitability, and return biased results in all such cases.
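The graphs used in Table 1 appear only as diagrams in the original, so as a stand-in the sketch below runs the same kind of comparison on the fully specified highway environment of Fig. 1a: behavioral cloning from expert data using either all observed variables or only the observed parents of X (the expert's sole parent, the horn H, is latent here, so the observed-parent context is empty). Names, sample sizes, and the choice of stand-in environment are ours.

```python
import numpy as np
from collections import Counter, defaultdict

rng = np.random.default_rng(0)
n = 50_000

# Expert demonstrations from the Fig. 1a environment (the horn H is latent to the imitator).
F, B, S = rng.integers(0, 2, size=(3, n))
H = F ^ B ^ S
X = H                          # expert action
Y = 1 - (X ^ F ^ B ^ S)        # expert reward, identically 1

def clone(context_cols):
    """Behavioral cloning: estimate P(X | context) by counting, then roll the policy out."""
    ctx = np.stack(context_cols, axis=1) if context_cols else np.zeros((n, 0), int)
    counts = defaultdict(Counter)
    for c, x in zip(map(tuple, ctx), X):
        counts[c][x] += 1
    def act(c):
        values, freqs = zip(*counts[tuple(c)].items())
        return rng.choice(values, p=np.array(freqs) / sum(freqs))
    X_hat = np.array([act(c) for c in ctx])
    return (1 - (X_hat ^ F ^ B ^ S)).mean()

print("All observed {F, B, S}  :", clone([F, B, S]))  # ~1.0, matches the expert
print("Observed parents (empty):", clone([]))         # ~0.5, biased: the horn is latent
```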
The second simulation used a synthetic, adversarial causal model, enriched with continuous data from the highD dataset (Krajewski et al., 2018) altered to conform to the causal model, to demonstrate that different covariate sets can lead to significantly different imitation performance. A neural network was trained for each action-policy pair using standard supervised learning approaches, leading to the results shown in Fig. 4. The causal structure was not imitable in the single-action setting, so the remaining 3 methods were compared to the optimal reward, showing that our method approaches the performance of the expert, whereas the non-causal methods lead to biased results. Full details of the model construction, including the full causal graph, are given in (Kumor et al., 2021, Appendix B).

## 6 Limitations & Societal Impact

There are two main limitations to our approach: (1) Our method focuses on the causal diagram, requiring the imitator to provide the causal structure of its environment. This is a fundamental requirement: any agent wishing to operate in environments with latent variables must somehow encode the additional knowledge required to make such inferences from observations. (2) Our criterion only takes into consideration the causal structure, and not the associated data P(o). Data-dependent methods can be computationally intensive, often requiring density estimation. If our approach returns "imitable", then the resulting policies are guaranteed to give perfect imitation, without needing to process large datasets to determine imitability. Finally, advances in technology towards improving imitation can easily be transferred to methods used for impersonation - our method provides conditions under which an imposter (imitator) can fool a target (Y) into believing they are interacting with a known party (expert). Our method shows when it is provably impossible to detect an impersonation attack. On the other hand, our results can be used to ensure that the causal structure of a domain cannot be imitated, helping mitigate such issues.

## 7 Conclusion

Great care needs to be taken in choosing which covariates to include when determining a policy for imitating an expert demonstrator, especially when expert and imitator have different views of the world. The wrong set of variables can lead to biased, or even outright incorrect, predictions. Our work provides general and complete results for the graphical conditions under which behavioral cloning is possible, and provides an agent with the tools needed to determine the variables relevant to its policy.

## Funding Transparency

The authors were partially supported by grants from NSF IIS-1704352 and IIS-1750807 (CAREER). Daniel Kumor also acknowledges additional revenue from an internship at Amazon.

## References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1, 2004.

Argall, B. D., Chernova, S., Veloso, M., and Browning, B. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

Billard, A., Calinon, S., Dillmann, R., and Schaal, S. Survey: Robot programming by demonstration. Handbook of Robotics, 2008.

de Haan, P., Jayaraman, D., and Levine, S. Causal confusion in imitation learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/947018640bf36a2bb609d3557a285329-Paper.pdf.
Etesami, J. and Geiger, P. Causal transfer for imitation learning and decision making under sensor-shift. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.

Hussein, A., Gaber, M. M., Elyan, E., and Jayne, C. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Krajewski, R., Bock, J., Kloeker, L., and Eckstein, L. The highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2118–2125, 2018. doi: 10.1109/ITSC.2018.8569552.

Kumor, D., Zhang, J., and Bareinboim, E. Sequential causal imitation learning with unobserved confounders. Technical Report R-76, Causal AI Lab, Columbia University, 2021.

Mahler, J. and Goldberg, K. Learning deep policies for robot bin picking by simulating robust grasping sequences. In Conference on Robot Learning, pp. 515–524, 2017.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv:1312.5602 [cs], December 2013.

Muller, U., Ben, J., Cosatto, E., Flepp, B., and Cun, Y. L. Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems, pp. 739–746, 2006.

Mülling, K., Kober, J., Kroemer, O., and Peters, J. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.

Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, pp. 663–670, 2000.

Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., Peters, J., et al. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

Pearl, J. and Robins, J. M. Probabilistic evaluation of sequential plans from causal models with hidden variables. arXiv:1302.4977 [cs], 1995.

Pomerleau, D. A. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pp. 305–313, 1989.

Rizzolatti, G. and Craighero, L. The mirror-neuron system. Annu. Rev. Neurosci., 27:169–192, 2004.

Sutton, R. S., Barto, A. G., et al. Reinforcement Learning: An Introduction. MIT Press, 1998.

Syed, U. and Schapire, R. E. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems, pp. 1449–1456, 2008.

Tian, J. Studies in Causal Reasoning and Learning. PhD thesis, Computer Science Department, University of California, Los Angeles, CA, November 2002.

Tian, J. and Paz, A. Finding minimal d-separators. 1998.

Tian, J. and Pearl, J. A general identification condition for causal effects. 2002.

van der Zander, B. and Liśkiewicz, M. Finding minimal d-separators in linear time and applications. In Uncertainty in Artificial Intelligence, pp. 637–647. PMLR, 2020.

Van der Zander, B., Textor, J., and Liśkiewicz, M. Efficiently finding conditional instruments for causal inference. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
Widrow, B. Pattern-recognizing control systems. Computer and Information Sciences, 1964.

Zhang, J., Kumor, D., and Bareinboim, E. Causal imitation learning with unobserved confounders. In Advances in Neural Information Processing Systems, 2020.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.