# Causal Imitation Learning with Unobserved Confounders

Junzhe Zhang (Columbia University, junzhez@cs.columbia.edu), Daniel Kumor (Purdue University, dkumor@purdue.edu), Elias Bareinboim (Columbia University, eb@cs.columbia.edu)

One of the common ways children learn is by mimicking adults. Imitation learning focuses on learning policies with suitable performance from demonstrations generated by an expert, with an unspecified performance measure and an unobserved reward signal. Popular methods for imitation learning start by either directly mimicking the behavior policy of an expert (behavior cloning) or by learning a reward function that prioritizes observed expert trajectories (inverse reinforcement learning). However, these methods rely on the assumption that the covariates used by the expert to determine her/his actions are fully observed. In this paper, we relax this assumption and study imitation learning when the sensory inputs of the learner and the expert differ. First, we provide a non-parametric, graphical criterion that is complete (both necessary and sufficient) for determining the feasibility of imitation from the combination of demonstration data and qualitative assumptions about the underlying environment, represented in the form of a causal model. We then show that when such a criterion does not hold, imitation could still be feasible by exploiting quantitative knowledge of the expert trajectories. Finally, we develop an efficient procedure for learning the imitating policy from the expert's trajectories.

1 Introduction

A unifying theme of Artificial Intelligence is to learn a policy from observations in an unknown environment such that a suitable level of performance is achieved [33, Ch. 1.1]. Operationally, a policy is a decision rule that determines an action based on a certain set of covariates; observations are possibly generated by a human demonstrator following a different behavior policy. The task of evaluating policies from a combination of observational data and assumptions about the underlying environment has been studied in the literature of causal inference [29] and reinforcement learning [37]. Several criteria, algorithms, and estimation methods have been developed to solve this problem [29, 36, 3, 6, 35, 32, 44].

In many applications, it is not clear which performance measure the demonstrator is (possibly subconsciously) optimizing. That is, the reward signal is not labeled and accessible in the observed expert's trajectories. In such settings, the performance of candidate policies is not uniquely discernible from the observational data due to latent outcomes, even when infinitely many samples are gathered, complicating efforts to learn a policy with satisfactory performance. An alternative approach used to circumvent this issue is to find a policy that mimics the demonstrator's behavior, which leads to the imitation learning paradigm [2, 4, 14, 28]. The expectation (or rather hope) is that if the demonstrations are generated by an expert with near-optimal reward, the performance of the imitator would also be satisfactory. Current methods of imitation learning can be categorized into behavior cloning [45, 31, 23, 24, 22] and inverse reinforcement learning [25, 1, 38, 46]. The former focuses on learning a nominal expert policy that approximates the conditional distribution mapping observed input covariates of the behavior policy to the action domain.
The latter attempts to learn a reward function that prioritizes observed behaviors of the expert; reinforcement learning methods are then applied using the learned reward function to obtain a nominal policy. However, both families of methods rely on the assumption that the expert's input observations match those available to the imitator. When unobserved covariates exist, however, naively imitating the nominal expert policy does not necessarily lead to satisfactory performance, even when the expert him or herself behaves optimally.

Figure 1: Causal diagrams where X represents an action (shaded red) and Y represents a latent reward (shaded blue). Input covariates of the policy space are shaded in light red and minimal imitation surrogates relative to action X and reward Y are shaded in light blue.

Figure 2: The tail light of the front car is unobserved in highway (aerial) drone data.

For concreteness, consider the learning scenario depicted in Fig. 2, describing trajectories of human-driven cars collected by drones flying over highways [18, 8]. Using such data, we want to learn a policy π(x|z) deciding on the acceleration (action) X of the demonstrator car based on the velocity and locations of both the demonstrator and front cars, summarized as covariates Z. In reality, the human demonstrator also uses the tail light L of the front car to coordinate his/her actions. The demonstrator's performance is evaluated with a latent reward function Y taking X, Z, L as input. However, only observations of X, Z are collected by the drone, summarized as probabilities P(x, z). Fig. 1a describes the graphical representation of this environment. A naïve approach would estimate the conditional distribution P(x|z) and use it as the policy π. A preliminary analysis reveals that this naive cloning approach leads to sub-optimal performance. Consider an instance where variables X, Y, Z, L, U ∈ {0, 1}; their values are decided by the functions L ← Z ⊕ U, X ← Z ⊕ L, Y ← ¬(X ⊕ Z ⊕ L), i.e., Y = 1 if and only if X = Z ⊕ L; Z, U are independent variables drawn uniformly over {0, 1}; ⊕ represents the exclusive-or operator. The expected reward E[Y|do(π)] induced by π(x|z) = P(x|z) is equal to 0.5, which is quite far from the optimal demonstrator's performance, E[Y] = 1 (a small simulation reproducing this gap is sketched after the list of contributions below). This example shows that even when one is able to perfectly mimic an optimal demonstrator, the learned policy can still be suboptimal.

In this paper, we try to explicate this phenomenon and, more broadly, understand imitability through a causal lens¹. Our task is to learn an imitating policy that achieves the expert's performance from demonstration data in a structural causal model [29, Ch. 7], allowing for unobserved confounders (UCs) affecting both action and outcome variables. Specifically, our contributions are summarized as follows. (1) We introduce a complete graphical criterion for determining the feasibility of imitation from demonstration data and qualitative knowledge about the data-generating process represented as a causal graph. (2) We develop a sufficient algorithm for identifying an imitating policy when the given criterion does not hold, by leveraging the quantitative knowledge in the observational distribution. (3) We provide an efficient and practical procedure for finding an imitating policy through explicit parametrization of the causal model, and use it to validate our results on high-dimensional, synthetic datasets.
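Before formalizing the problem, the gap described above can be reproduced with a short simulation. This is a minimal sketch, assuming the reconstruction Y = 1 iff X = Z ⊕ L; the names are illustrative only. The expert attains E[Y] = 1, while deploying the estimated P(x|z) attains only about 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Exogenous/observed inputs: Z (front-car state) and U (noise), both uniform over {0, 1}.
Z = rng.integers(0, 2, n)
U = rng.integers(0, 2, n)
L = Z ^ U              # tail light (latent to the imitator)
X_expert = Z ^ L       # expert's action uses both Z and L
Y_expert = (X_expert == (Z ^ L)).astype(int)   # reward: 1 iff the action matches Z XOR L
print("expert  E[Y]        =", Y_expert.mean())           # -> 1.0

# Behavior cloning: estimate P(X = 1 | Z = z) from the drone data (X, Z only) ...
p_x1_given_z = np.array([X_expert[Z == z].mean() for z in (0, 1)])
# ... and deploy it, i.e., draw X from P(x|z) while ignoring the unobserved L.
X_bc = (rng.random(n) < p_x1_given_z[Z]).astype(int)
Y_bc = (X_bc == (Z ^ L)).astype(int)
print("cloning E[Y|do(pi)] =", Y_bc.mean())                # -> ~0.5
```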
For the sake of space constraints, we provide all proofs in the complete technical report [15, Appendix A].

1.1 Preliminaries

In this section, we introduce the basic notation and definitions used throughout the paper. We use capital letters to denote random variables (X) and small letters for their values (x). D_X represents the domain of X and P_X the space of probability distributions over D_X. For a set X, |X| denotes its dimension. We consistently use the abbreviation P(x) to represent the probability P(X = x). Finally, I_{Z=z} is an indicator function that returns 1 if Z = z holds true, and 0 otherwise.

¹ Some recent progress in the field of causal imitation has been reported, albeit oblivious to the phenomenon described above and our contributions. Some work considered settings in which the input to the expert policy is fully observed [7], while another assumed that the primary outcome is observed (e.g., Y in Fig. 1a) [8].

Calligraphic letters, e.g., G, will be used to represent directed acyclic graphs (DAGs) (e.g., Fig. 1). We denote by G_{X̄} the subgraph obtained from G by removing arrows coming into nodes in X; G_{X̲} is the subgraph of G obtained by removing arrows going out of X. We will use standard family conventions for graphical relationships such as parents, children, descendants, and ancestors. For example, the set of parents of X in G is denoted by pa(X)_G = ∪_{X∈X} pa(X)_G; ch, de and an are similarly defined. We write Pa, Ch, De, An if the arguments are included as well, e.g., De(X)_G = de(X)_G ∪ X. A path from a node X to a node Y in G is a sequence of edges which does not include a particular node more than once. Two sets of nodes X, Y are said to be d-separated by a third set Z in a DAG G, denoted by (X ⊥⊥ Y | Z)_G, if every path from nodes in one set to nodes in the other is "blocked". The criterion of blockage follows [29, Def. 1.2.3].

The basic semantic framework of our analysis rests on structural causal models (SCMs) [29, Ch. 7]. An SCM M is a tuple ⟨U, V, F, P(u)⟩ where V is a set of endogenous variables and U is a set of exogenous variables. F is a set of structural functions where f_V ∈ F decides the values of an endogenous variable V ∈ V, taking as argument a combination of other variables. That is, V ← f_V(Pa_V, U_V), with Pa_V ⊆ V and U_V ⊆ U. Values of U are drawn from an exogenous distribution P(u). Each SCM M induces a distribution P(v) over the endogenous variables V. An intervention on a subset X ⊆ V, denoted by do(x), is an operation where the values of X are set to constants x, replacing the functions {f_X : ∀X ∈ X} that would normally determine their values. For an SCM M, let M_x be the submodel of M induced by the intervention do(x). For a set S ⊆ V, the interventional distribution P(s|do(x)) induced by do(x) is defined as the distribution over S in the submodel M_x, i.e., P(s|do(x); M) ≜ P(s; M_x). We leave M implicit when it is obvious from the context. For a detailed survey on SCMs, we refer readers to [29, Ch. 7].
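To make the SCM semantics concrete before moving on, here is a minimal sketch (a toy model of our own, not one from the paper) of an SCM as a set of structural functions together with the do() operation, which replaces the mechanism of the intervened variable and leaves the rest untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n, do_x=None):
    """Draw n samples from a toy SCM with mechanisms U -> X, X -> W, (W, U) -> Y.

    do_x: if given, the function f_X is replaced by the constant do_x (the submodel M_x);
    otherwise X follows its natural mechanism.
    """
    U = rng.integers(0, 2, n)                              # exogenous P(u)
    X = U.copy() if do_x is None else np.full(n, do_x)     # f_X(U), or replaced by do(x)
    W = X ^ (rng.random(n) < 0.1).astype(int)              # f_W(X, U_W): copy X, flip w.p. 0.1
    Y = W ^ U                                              # f_Y(W, U): confounded with X via U
    return X, W, Y

# Observational distribution P(v) vs. interventional distribution P(v | do(X = 1)).
_, _, Y_obs = sample_scm(100_000)
_, _, Y_do1 = sample_scm(100_000, do_x=1)
print("P(Y = 1)           ≈", Y_obs.mean())    # ~0.1
print("P(Y = 1 | do(X=1)) ≈", Y_do1.mean())    # ~0.5: intervening breaks the X-U dependence
```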
2 Imitation Learning in Structural Causal Models

In this section, we formalize and study the imitation learning problem in causal language. We first define a special type of SCM that explicitly allows one to model the unobserved nature of some endogenous variables, which is called the partially observable structural causal model (POSCM).²

Definition 1 (Partially Observable SCM). A POSCM is a tuple ⟨M, O, L⟩, where M is an SCM ⟨U, V, F, P(u)⟩ and ⟨O, L⟩ is a pair of subsets forming a partition over V (i.e., V = O ∪ L and O ∩ L = ∅); O and L are called observed and latent endogenous variables, respectively.

Each POSCM M induces a probability distribution over V, of which one can measure the observed variables O. P(o) is usually called the observational distribution. M is associated with a causal diagram G (e.g., see Fig. 1) where solid nodes represent observed variables O, dashed nodes represent latent variables L, and arrows represent the arguments Pa_V of each functional relationship f_V. Exogenous variables U are not explicitly shown; a bi-directed arrow between nodes V_i and V_j indicates the presence of an unobserved confounder (UC) affecting both V_i and V_j, i.e., U_{V_i} ∩ U_{V_j} ≠ ∅.

² This definition facilitates a more explicit articulation of which endogenous variables are available to the demonstrator and the corresponding policy at each point in time.

Consider a POSCM ⟨M, O, L⟩ with M = ⟨U, V, F, P(u)⟩. Our goal is to learn an efficient policy to decide the value of an action variable X ∈ O. The performance of the policy is evaluated using the expected value of a reward variable Y. Throughout this paper, we assume that the reward Y is latent and X affects Y (i.e., Y ∈ L ∩ De(X)_G). A policy π is a function mapping from values of covariates Pa_Π ⊆ O \ De(X)_{G_{X̄}} (observed non-descendants of X)³ to a probability distribution over X, which we denote by π(x|pa_Π). An intervention following a policy π, denoted by do(π), is an operation that draws values of X independently following π, regardless of its original (natural) function f_X. Let M_π denote the manipulated SCM of M induced by do(π). Similar to atomic settings, the interventional distribution P(v|do(π)) is defined as the distribution over V in the manipulated model M_π, given by

P(v|do(π)) = Σ_u [ ∏_{V ∈ V∖{X}} P(v|pa_V, u_V) ] π(x|pa_Π) P(u).   (1)

³ G_{X̄} is the causal diagram associated with the submodel M_x induced by the intervention do(x).

The expected reward of a policy π is thus given by the causal effect E[Y|do(π)]. The collection of all possible policies defines a policy space, denoted by Π = {π : D_{Pa_Π} ↦ P_X} (if Pa_Π = ∅, Π = {π : P_X}). For convenience, we define the function Pa(Π) = Pa_Π. A policy space Π′ is a subspace of Π if Pa(Π′) ⊆ Pa(Π). We will consistently highlight the action X in dark red, the reward Y in dark blue and the covariates Pa(Π) in light red. For instance, in Fig. 1a, the policy space over action X is given by Π = {π : D_Z ↦ P_X}; Y represents the reward; Π′ = {π : P_X} is a subspace of Π.

Our goal is to learn an efficient policy π ∈ Π that achieves satisfactory performance, e.g., larger than a certain threshold τ, E[Y|do(π)] ≥ τ, without knowledge of the underlying system dynamics, i.e., the actual, true POSCM M. A possible approach is to identify the expected reward E[Y|do(π)] for each policy π ∈ Π from the combination of the observed data P(o) and the causal diagram G. Optimization procedures are then applicable to find a satisfactory policy π. Let M⟨G⟩ denote the hypothesis class of POSCMs that are compatible with a causal diagram G. We define the non-parametric notion of identifiability in the context of POSCMs and conditional policies, adapted from [29, Def. 3.2.4].

Definition 2 (Identifiability). Given a causal diagram G and a policy space Π, let Y be an arbitrary subset of V. P(y|do(π)) is said to be identifiable w.r.t. ⟨G, Π⟩ if P(y|do(π); M) is uniquely computable from P(o; M) and G for any POSCM M ∈ M⟨G⟩ and any π ∈ Π.
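To ground Eq. (1), the following sketch evaluates E[Y|do(π)] exactly by enumerating the exogenous values of a small discrete model. The model is our own reconstruction of the Fig. 1a example from Section 1, used purely for illustration; it is not taken from the paper.

```python
from itertools import product

# Exogenous/root distributions of the toy model: Z and U are independent and uniform over {0, 1}.
P_z = {0: 0.5, 1: 0.5}
P_u = {0: 0.5, 1: 0.5}

def expected_reward(pi):
    """E[Y | do(pi)] via the truncated factorization of Eq. (1): every mechanism except f_X
    is kept, f_X is replaced by the policy pi(x, z), and we sum over the exogenous values."""
    total = 0.0
    for z, u, x in product((0, 1), repeat=3):
        l = z ^ u                   # f_L(Z, U): tail light
        y = int(x == (z ^ l))       # f_Y(X, Z, L): reward 1 iff the action matches Z XOR L
        total += P_z[z] * P_u[u] * pi(x, z) * y
    return total

behavior_cloning = lambda x, z: 0.5        # P(x|z) is uniform in this model
match_z = lambda x, z: float(x == z)       # an arbitrary deterministic policy over D_Z

print(expected_reward(behavior_cloning))   # 0.5
print(expected_reward(match_z))            # 0.5: no policy with input Z alone attains E[Y] = 1
```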
In imitation learning settings, however, the reward Y is often not specified and remains latent, which precludes approaches that attempt to identify E[Y|do(π)]:

Corollary 1. Given a causal diagram G and a policy space Π, let Y be an arbitrary subset of V. If not all variables in Y are observed (i.e., Y ∩ L ≠ ∅), P(y|do(π)) is not identifiable.

In other words, Corol. 1 shows that when the reward Y is latent, it is infeasible to uniquely determine the value of E[Y|do(π)] from P(o). A similar observation has been noted in [20, Prop. 1]. This suggests that we need to explore learning through other modalities.

2.1 Causal Imitation Learning

To circumvent issues of non-identifiability, a common solution is to assume that the observed trajectories are generated by an expert demonstrator with satisfactory performance E[Y], e.g., no less than a certain threshold (E[Y] ≥ τ). If we could find a policy π that perfectly "imitates" the expert with respect to the reward Y, E[Y|do(π)] = E[Y], the performance of the learner is also guaranteed to be satisfactory. Formally,

Definition 3 (Imitability). Given a causal diagram G and a policy space Π, let Y be an arbitrary subset of V. P(y) is said to be imitable w.r.t. ⟨G, Π⟩ if there exists a policy π ∈ Π uniquely computable from P(o) such that P(y|do(π); M) = P(y; M) for any POSCM M ∈ M⟨G⟩.

Our task is to determine the imitability of the expert's performance. More specifically, we want to learn an imitating policy π ∈ Π from P(o) such that P(y|do(π)) = P(y)⁴ in any POSCM M associated with the causal diagram G. Consider Fig. 3a as an example. P(y) is imitable with the policy π(x) = P(x) since, by Eq. (1) and marginalization, P(y|do(π)) = Σ_{x,w} P(y|w)P(w|x)π(x) = Σ_{x,w} P(y|w)P(w|x)P(x) = P(y).

⁴ The imitation is trivial if Y ∉ De(X)_G: by Rule 3 of [29, Thm. 3.4.1] (or [6, Thm. 1]), P(y|do(π)) = P(y) for any policy π. This paper aims to find a specific π satisfying P(y|do(π)) = P(y) even when Y ∈ De(X)_G.

In practice, unfortunately, the expert's performance cannot always be imitated. To understand this setting, we first write, more explicitly, the conditions under which this is not the case:

Lemma 1. Given a causal diagram G and a policy space Π, let Y be an arbitrary subset of V. P(y) is not imitable w.r.t. ⟨G, Π⟩ if there exist two POSCMs M₁, M₂ ∈ M⟨G⟩ satisfying P(o; M₁) = P(o; M₂) while there exists no policy π ∈ Π such that, for i = 1, 2, P(y|do(π); M_i) = P(y; M_i).

Figure 3: Imitability vs. Identifiability.

It follows as a corollary that P(y) is not imitable if there exists a POSCM M compatible with G such that no policy π ∈ Π could ensure P(y|do(π); M) = P(y; M). For instance, consider the causal diagram G and policy space Π in Fig. 3b. Here, the expert's reward P(y) is not imitable: consider a POSCM with functions X ← U, W ← X, Y ← ¬(U ⊕ W), where the values of U are drawn uniformly over {0, 1}. In this model, P(Y = 1|do(π)) = 0.5 for any policy π, which is far from the optimal expert reward, P(Y = 1) = 1. An interesting observation from the above example of Fig. 3b is that the effect P(y|do(π)) is identifiable, following the front-door criterion in [29, Thm. 3.3.4], yet no policy imitates the corresponding P(y). However, in some settings, the expert's reward P(y) is imitable but the imitator's reward P(y|do(π)) cannot be uniquely determined. To witness, consider again the example in Fig. 3a. The imitability of P(y) has been shown above, while P(y|do(π)) is not identifiable due to the latent reward Y (Corol. 1).
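A numerical check of the Fig. 3b counterexample is sketched below. It assumes our reconstruction Y ← ¬(U ⊕ W), chosen to match the stated values P(Y = 1) = 1 and P(Y = 1|do(π)) = 0.5; the mechanisms are otherwise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
U = rng.integers(0, 2, n)

# Natural (expert) model: X <- U, W <- X, Y <- NOT(U XOR W).
X = U
W = X
Y = 1 - (U ^ W)
print("expert P(Y=1) =", Y.mean())             # -> 1.0

# Manipulated model under do(pi): X is drawn from pi, independently of U.
for p in (0.0, 0.5, 1.0):                      # pi(X = 1) = p
    X_pi = (rng.random(n) < p).astype(int)
    W_pi = X_pi
    Y_pi = 1 - (U ^ W_pi)
    print(f"pi(X=1)={p}: P(Y=1|do(pi)) ≈", round(Y_pi.mean(), 3))   # ~0.5 for every p
```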
In general, the problem of imitability is orthogonal to identifiability and, therefore, requires separate consideration. Since imitability does not always hold, we introduce a useful graphical criterion for determining whether imitating an expert's performance is feasible, and if so, how.

Theorem 1 (Imitation by Direct Parents). Given a causal diagram G and a policy space Π, P(y) is imitable w.r.t. ⟨G, Π⟩ if pa(X)_G ⊆ Pa(Π) and there is no bi-directed arrow pointing to X in G. Moreover, the imitating policy π ∈ Π is given by π(x|pa(Π)) = P(x|pa(X)_G).

In words, Thm. 1 says that if the expert and the learner share the same policy space, the expert's performance is always imitable. In fact, this result can be seen as a causal justification for when the method of "behavior cloning", widely used in practice, is valid, leading to proper imitation. When the original behavior policy f_X is contained in the policy space Π, the learner could imitate the expert's reward P(y) by learning a policy π ∈ Π that matches the distribution P(x|pa(Π)) [45, 31]. Next, we consider the more challenging setting where the policy spaces of the expert and the learner disagree (i.e., the learner and the expert have different views of the world, f_X ∉ Π). We will leverage a graphical condition adopted from the celebrated backdoor criterion [29, Def. 3.3.1].

Definition 4 (π-Backdoor). Given a causal diagram G and a policy space Π, a set Z is said to satisfy the π-backdoor criterion w.r.t. ⟨G, Π⟩ if and only if Z ⊆ Pa(Π) and (Y ⊥⊥ X | Z)_{G_{X̲}}; such a Z is called a π-backdoor admissible set w.r.t. ⟨G, Π⟩.

For concreteness, consider again the highway driving example in Fig. 1a. There exists no π-backdoor admissible set due to the path X ← L → Y. Now consider the modified graph in Fig. 1b where the edge L → Y is removed. {Z} is π-backdoor admissible since Z ∈ Pa(Π) and (Y ⊥⊥ X | Z)_{G_{X̲}}. Leveraging the imitation backdoor condition, our next theorem provides a full characterization of when imitating the expert's performance is achievable, despite the fact that the reward Y is latent.

Theorem 2 (Imitation by π-Backdoor). Given a causal diagram G and a policy space Π, P(y) is imitable w.r.t. ⟨G, Π⟩ if and only if there exists a π-backdoor admissible set Z w.r.t. ⟨G, Π⟩. Moreover, the imitating policy π ∈ Π is given by π(x|pa(Π)) = P(x|z).

That is, one can learn an imitating policy from a policy space Π′ = {π : D_Z ↦ P_X} that mimics the conditional probabilities P(x|z) if and only if Z is π-backdoor admissible. If that is the case, such a policy can be learned from data through standard density estimation methods. For instance, Thm. 2 ascertains that P(y) in Fig. 1a is indeed non-imitable. On the other hand, P(y) in Fig. 1b is imitable, guaranteed by the π-backdoor admissible set {Z}; the imitating policy is given by π(x|z) = P(x|z).
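As a sketch of how the π-backdoor condition of Def. 4 can be checked mechanically, the snippet below runs a d-separation test on our reconstruction of the Fig. 1a and Fig. 1b diagrams. The edge lists are inferred from the description in Section 1 and are assumptions, not copied from the paper's figures; `nx.d_separated` is available in networkx 2.8+ (newer releases expose the same test as `nx.is_d_separator`).

```python
import networkx as nx

def pi_backdoor_admissible(edges, Z, X="X", Y="Y"):
    """Def. 4 check: assuming Z ⊆ Pa(Π), test (Y ⊥⊥ X | Z) in G with X's outgoing edges removed."""
    G = nx.DiGraph(edges)
    G_under_X = G.copy()
    G_under_X.remove_edges_from(list(G.out_edges(X)))
    return nx.d_separated(G_under_X, {X}, {Y}, set(Z))

# Reconstruction of Fig. 1a: Z -> L, and both X and Y take Z and L as inputs (L is latent).
fig_1a = [("Z", "L"), ("Z", "X"), ("L", "X"), ("Z", "Y"), ("L", "Y"), ("X", "Y")]
# Fig. 1b: the same graph with the edge L -> Y removed.
fig_1b = [e for e in fig_1a if e != ("L", "Y")]

print(pi_backdoor_admissible(fig_1a, {"Z"}))   # False: the path X <- L -> Y stays open
print(pi_backdoor_admissible(fig_1b, {"Z"}))   # True: {Z} is pi-backdoor admissible
```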
3 Causal Imitation Learning with Data Dependency

One may surmise that the imitation boundary established by Thm. 2 suggests that when there exists no π-backdoor admissible set, it is infeasible to imitate the expert's performance from observed trajectories of demonstrations. In this section, we circumvent this issue by exploiting the actual parameters of the observational distribution P(o). In particular, we denote by M⟨G,P⟩ the subfamily of candidate models in M⟨G⟩ that induce both the causal diagram G and the observational distribution P(o), i.e., M⟨G,P⟩ = {M ∈ M⟨G⟩ : P(o; M) = P(o)}. We introduce a refined notion of imitability that explores the quantitative knowledge of the observations P(o) (to be exemplified). Formally,

Definition 5 (Practical Imitability). Given a causal diagram G, a policy space Π, and an observational distribution P(o), let Y be an arbitrary subset of V. P(y) is said to be practically imitable (for short, p-imitable) w.r.t. ⟨G, Π, P(o)⟩ if there exists a policy π ∈ Π uniquely computable from P(o) such that P(y|do(π); M) = P(y; M) for any POSCM M ∈ M⟨G,P⟩.

The following corollary can be derived based on the definition of practical imitability.

Corollary 2. Given a causal diagram G, a policy space Π and an observational distribution P(o), let Y ⊆ V be a subset. If P(y) is imitable w.r.t. ⟨G, Π⟩, then P(y) is p-imitable w.r.t. ⟨G, Π, P(o)⟩.

Compared to Def. 3, the practical imitability of Def. 5 aims to find an imitating policy for a subset of candidate POSCMs M⟨G,P⟩ restricted to match a specific observational distribution P(o). Def. 3, on the other hand, requires only the causal diagram G. In other words, an expert's performance P(y) that is non-imitable w.r.t. ⟨G, Π⟩ could still be p-imitable after analyzing the actual probabilities of the observational distribution P(o).

For concreteness, consider again P(y) in Fig. 3b, which is not imitable due to the bi-directed arrow X ↔ Y. However, new imitation opportunities arise when the actual parameters of the observational distribution P(x, w, y) are provided. Suppose the underlying POSCM is given by X ← U_X ⊕ U_Y, W ← X ⊕ U_W, Y ← W ⊕ U_Y, where U_X, U_Y, U_W are independent binary variables drawn from P(U_X = 1) = P(U_Y = 1) = P(U_W = 0) = 0.9. Here, the causal effect P(y|do(x)) is identifiable from P(x, w, y) following the front-door formula P(y|do(x)) = Σ_w P(w|x) Σ_{x′} P(y|w, x′)P(x′) [29, Thm. 3.3.4]. We thus have P(Y = 1|do(X = 0)) = 0.82, which coincides with P(Y = 1) = 0.82, i.e., P(y) is p-imitable with the atomic intervention do(X = 0) (a numerical check of this example appears in the code sketch below). In most practical settings, the expert reward P(y) rarely equates to P(y|do(x)); stochastic policies π(x) are then applicable to imitate P(y) by re-weighting the P(y|do(x)) induced by the corresponding atomic interventions⁵.

⁵ Consider a variation of the model where P(U_W = 1) = 0.7. P(y) is p-imitable with π(X = 0) = 0.75.

To tackle p-imitability in a general way, we proceed by defining a set of observed variables that serves as a surrogate of the unobserved Y with respect to interventions on X. Formally,

Definition 6 (Imitation Surrogate). Given a causal diagram G and a policy space Π, let S be an arbitrary subset of O. S is an imitation surrogate (for short, surrogate) w.r.t. ⟨G, Π⟩ if (Y ⊥⊥ X̂ | S)_{G∪Π}, where G ∪ Π is the supergraph of G obtained by adding arrows from Pa(Π) to X, and X̂ is a newly added parent of X. A surrogate S is said to be minimal if there exists no subset S′ ⊂ S such that S′ is also a surrogate w.r.t. ⟨G, Π⟩.

Consider as an example Fig. 1c, where the supergraph G ∪ Π coincides with the causal diagram G. By Def. 6, both {W, S} and {S} are valid surrogates relative to ⟨X, Y⟩, with {S} being the minimal one. By conditioning on S, the decomposition of Eq. (1) implies P(y|do(π)) = Σ_{s,w,x,u} P(y|s)P(s|w, u)P(w|x)π(x)P(u) = Σ_s P(y|s)P(s|do(π)). That is, the surrogate S mediates all influence of interventions on the action X on the reward Y. It is thus sufficient to find an imitating policy π such that P(s|do(π)) = P(s) for any POSCM M associated with Fig. 1c. The resultant policy π is guaranteed to imitate the expert's reward P(y). When a surrogate S is found and P(s|do(π)) is identifiable, one could compute P(s|do(π)) for each policy π and check whether it matches P(s). In many settings, however, P(s|do(π)) is not identifiable w.r.t. ⟨G, Π⟩. For example, in Fig. 1d, S is a surrogate w.r.t. ⟨G, Π⟩, but P(s|do(π)) is not identifiable due to the collider Z (π uses non-descendants as input by default).
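The p-imitability example above can be checked numerically. The sketch below assumes the reconstructed mechanisms X = U_X ⊕ U_Y, W = X ⊕ U_W, Y = W ⊕ U_Y and compares the front-door estimate of P(Y = 1|do(X = 0)), computed from observational samples only, against the expert's P(Y = 1) and the ground-truth interventional value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
UX = (rng.random(n) < 0.9).astype(int)
UY = (rng.random(n) < 0.9).astype(int)
UW = (rng.random(n) < 0.1).astype(int)   # P(U_W = 0) = 0.9

# Observational (expert) regime.
X = UX ^ UY
W = X ^ UW
Y = W ^ UY
print("P(Y=1)                    ≈", Y.mean())              # ~0.82

# Front-door estimate of P(Y=1 | do(X=0)) from the observational samples:
#   sum_w P(w | X=0) * sum_x' P(Y=1 | w, x') P(x')
p_x = np.array([np.mean(X == x) for x in (0, 1)])
est = 0.0
for w in (0, 1):
    p_w_given_x0 = np.mean(W[X == 0] == w)
    p_y1_given_wx = np.array([Y[(W == w) & (X == x)].mean() for x in (0, 1)])
    est += p_w_given_x0 * np.dot(p_y1_given_wx, p_x)
print("front-door P(Y=1|do(X=0)) ≈", est)                   # ~0.82

# Ground truth obtained by actually intervening in the model.
Y_do0 = (0 ^ UW) ^ UY
print("true P(Y=1|do(X=0))       ≈", Y_do0.mean())          # ~0.82
```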
Fortunately, identifying P(s|do(π)) may still be feasible in some subspaces of Π:

Definition 7 (Identifiable Subspace). Given a causal diagram G, a policy space Π, and a subset S ⊆ O, let Π′ be a policy subspace of Π. Π′ is said to be an identifiable subspace (for short, id-subspace) w.r.t. ⟨G, Π, S⟩ if P(s|do(π)) is identifiable w.r.t. ⟨G, Π′⟩.

Consider the policy subspace Π′ = {π : P_X} described in Fig. 1d (i.e., one that does not exploit information from the covariates Z). P(s|do(π)) is identifiable w.r.t. ⟨G, Π′⟩ following the front-door adjustment on W [29, Thm. 3.3.4]. We could then evaluate the interventional probabilities P(s|do(π)) for each policy π ∈ Π′ from the observational distribution P(x, w, s, z); the imitating policy is obtainable by solving the equation P(s|do(π)) = P(s). In other words, {S} and Π′ form an instrument that allows one to solve the imitation learning problem in Fig. 1d.

Definition 8 (Imitation Instrument). Given a causal diagram G and a policy space Π, let S be a subset of O and Π′ be a subspace of Π. ⟨S, Π′⟩ is said to be an imitation instrument (for short, instrument) if S is a surrogate w.r.t. ⟨G, Π′⟩ and Π′ is an id-subspace w.r.t. ⟨G, Π, S⟩.

Lemma 2. Given a causal diagram G, a policy space Π, and an observational distribution P(o), let ⟨S, Π′⟩ be an instrument w.r.t. ⟨G, Π⟩. If P(s) is p-imitable w.r.t. ⟨G, Π′, P(o)⟩, then P(y) is p-imitable w.r.t. ⟨G, Π, P(o)⟩. Moreover, an imitating policy for P(s) w.r.t. ⟨G, Π′, P(o)⟩ is also an imitating policy for P(y) w.r.t. ⟨G, Π, P(o)⟩.

In words, Lem. 2 shows that when an imitation instrument ⟨S, Π′⟩ is present, we could reduce the original imitation learning problem on a latent reward Y to a p-imitability problem over observed surrogate variables S, using policies in an identifiable subspace Π′. The imitating policy is obtainable by solving the equation P(s|do(π)) = P(s).

3.1 Confounding Robust Imitation

Our task in this section is to introduce a general algorithm that finds instruments and learns a p-imitating policy given ⟨G, Π, P(o)⟩. A naïve approach is to enumerate all pairs of a subset S and a subspace Π′ and check whether they form an instrument; if so, we can compute an imitating policy for P(s) w.r.t. ⟨G, Π′, P(o)⟩. However, the challenge is that the number of all possible subspaces Π′ (or subsets S) can be exponentially large. Fortunately, we can greatly restrict this search space. Let G ∪ {Y} denote the causal diagram obtained from G by making the reward Y observed. The following proposition suggests that it suffices to consider only identifiable subspaces w.r.t. ⟨G ∪ {Y}, Π, Y⟩.

Lemma 3. Given a causal diagram G and a policy space Π, let Π′ ⊆ Π be a subspace. If there exists S ⊆ O such that ⟨S, Π′⟩ is an instrument w.r.t. ⟨G, Π⟩, then Π′ is an id-subspace w.r.t. ⟨G ∪ {Y}, Π, Y⟩.

Algorithm 1: IMITATE
  1: Input: G, Π, P(o).
  2: while LISTIDSPACE(G ∪ {Y}, Π, Y) outputs a policy subspace Π′ do
  3:   while LISTMINSEP(G ∪ Π′, X̂, Y, ∅, O) outputs a surrogate set S do
  4:     if IDENTIFY(G, Π′, S) = YES then
  5:       Solve for a policy π ∈ Π′ such that P(s|do(π); M) = P(s) for any POSCM M ∈ M⟨G,P⟩.
  6:       Return π if it exists; continue otherwise.
  7:     end if
  8:   end while
  9: end while
  10: Return FAIL.

Our algorithm IMITATE is described in Alg. 1. We assume access to an IDENTIFY oracle [41, 34, 6] that takes as input a causal diagram G, a policy space Π and a set of observed variables S. If P(s|do(π)) is identifiable w.r.t. ⟨G, Π⟩, IDENTIFY returns YES; otherwise, it returns NO.
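The control flow of Alg. 1 can be summarized in a short skeleton. Every helper below is a hypothetical placeholder standing in for the paper's subroutines (LISTIDSPACE, LISTMINSEP, the IDENTIFY oracle, and the Step-5 solver); this is a structural sketch, not a working implementation of those oracles.

```python
from typing import Iterable, Optional

def list_id_subspaces(G, Pi) -> Iterable:
    """Placeholder for LISTIDSPACE(G ∪ {Y}, Π, Y): yield candidate id-subspaces Π′."""
    return []

def list_minimal_surrogates(G, Pi_prime) -> Iterable:
    """Placeholder for LISTMINSEP(G ∪ Π′, X̂, Y, ∅, O): yield minimal surrogate sets S."""
    return []

def identify(G, Pi_prime, S) -> bool:
    """Placeholder for the IDENTIFY oracle: is P(s|do(π)) identifiable w.r.t. ⟨G, Π′⟩?"""
    return False

def solve_imitating_policy(G, Pi_prime, S, P_obs) -> Optional[object]:
    """Placeholder for Step 5: solve P(s|do(π)) = P(s) within Π′, or return None."""
    return None

def imitate(G, Pi, P_obs) -> Optional[object]:
    for Pi_prime in list_id_subspaces(G, Pi):                   # Step 2
        for S in list_minimal_surrogates(G, Pi_prime):          # Step 3
            if identify(G, Pi_prime, S):                        # Step 4: ⟨S, Π′⟩ is an instrument
                pi = solve_imitating_policy(G, Pi_prime, S, P_obs)  # Step 5
                if pi is not None:
                    return pi                                   # imitating policy for P(y) (Lem. 2)
    return None                                                 # FAIL

print(imitate(G=None, Pi=None, P_obs=None))   # None until the subroutines are supplied
```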
For details about the IDENTIFY oracle, we refer readers to [15, Appendix B]. More specifically, IMITATE takes as input a causal diagram G, a policy space Π and an observational distribution P(o). At Step 2, IMITATE applies a subroutine LISTIDSPACE to list identifiable subspaces Π′ w.r.t. ⟨G ∪ {Y}, Π, Y⟩, following the observation made in Lem. 3. The implementation details of LISTIDSPACE are provided in [15, Appendix C]. When an identifiable subspace Π′ is found, IMITATE tries to obtain a surrogate S w.r.t. the diagram G and the subspace Π′. While there could exist multiple such surrogates, the following proposition shows that it is sufficient to consider only minimal ones.

Lemma 4. Given a causal diagram G, a policy space Π, an observational distribution P(o) and a subset S ⊆ O, P(s) is p-imitable w.r.t. ⟨G, Π, P(o)⟩ only if, for any S′ ⊆ S, P(s′) is p-imitable w.r.t. ⟨G, Π, P(o)⟩.

We apply the subroutine LISTMINSEP of [43] to enumerate minimal surrogates in O that d-separate X̂ and Y in the supergraph G ∪ Π′. When a minimal surrogate S is found, IMITATE uses the IDENTIFY oracle to validate whether P(s|do(π)) is identifiable w.r.t. ⟨G, Π′⟩, i.e., whether ⟨S, Π′⟩ forms an instrument. Consider Fig. 1d as an example. While P(y|do(π)) is not identifiable for every policy in Π, even had Y been observed, Π contains an id-subspace {π : P_X} w.r.t. ⟨G ∪ {Y}, Π, Y⟩, which is associated with a minimal surrogate {S}. Applying IDENTIFY confirms that ⟨{S}, {π : P_X}⟩ is an instrument. At Step 5, IMITATE solves for a policy π in the subspace Π′ that imitates P(s) for all instances in the hypothesis class M⟨G,P⟩. If such a policy exists, IMITATE returns π; otherwise, the algorithm continues. Since ⟨S, Π′⟩ is an instrument, Lem. 2 implies that the learned policy π, if it exists, is ensured to imitate the expert's reward P(y) for any POSCM M ∈ M⟨G,P⟩.

Theorem 3. Given a causal diagram G, a policy space Π, and an observational distribution P(o), if IMITATE returns a policy π ∈ Π, then P(y) is p-imitable w.r.t. ⟨G, Π, P(o)⟩. Moreover, π is an imitating policy for P(y) w.r.t. ⟨G, Π, P(o)⟩.

3.2 Optimizing Imitating Policies

We now introduce optimization procedures to solve for an imitating policy at Step 5 of the IMITATE algorithm. Since the pair ⟨S, Π′⟩ forms a valid instrument (ensured by Step 4), the interventional distribution P(s|do(π); M) remains invariant among all models in M⟨G⟩, i.e., P(s|do(π)) is identifiable w.r.t. ⟨G, Π′⟩. We could thus express P(s|do(π); M) for any M ∈ M⟨G,P⟩ as a function of the observational distribution P(o); for simplicity, we write P(s|do(π)) = P(s|do(π); M). The imitating policy is obtainable by solving the equation P(s|do(π)) = P(s). We could derive a closed-form formula for P(s|do(π)) following standard causal identification algorithms in [41, 34, 6]. As an example, consider again the setting of Fig. 1c with binary X, W, S, Z; the parameters of P(x, w, s, z) could be summarized using an 8-entry probability table. The imitating policy π(x) is thus the solution of the system of linear equations Σ_x π(x)P(s|do(x)) = P(s) and Σ_x π(x) = 1, given by:

π(x₀) = (P(s₁|do(x₁)) − P(s₁)) / (P(s₁|do(x₁)) − P(s₁|do(x₀))),
π(x₁) = (P(s₁) − P(s₁|do(x₀))) / (P(s₁|do(x₁)) − P(s₁|do(x₀))).

Among the quantities in the above equations, x_i, s_j represent the assignments X = i, S = j for i, j ∈ {0, 1}.
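As a quick numerical check of the closed form above, the snippet below solves the same two linear equations with illustrative (made-up) values for P(s₁|do(x₀)), P(s₁|do(x₁)) and P(s₁).

```python
import numpy as np

# Illustrative quantities: P(S=1|do(X=0)), P(S=1|do(X=1)), and the expert's P(S=1).
p_s1_do_x0, p_s1_do_x1, p_s1 = 0.30, 0.70, 0.55

# Solve  pi(x0) + pi(x1) = 1  and  pi(x0) P(s1|do(x0)) + pi(x1) P(s1|do(x1)) = P(s1).
A = np.array([[1.0, 1.0],
              [p_s1_do_x0, p_s1_do_x1]])
b = np.array([1.0, p_s1])
pi_x0, pi_x1 = np.linalg.solve(A, b)
print(pi_x0, pi_x1)   # 0.375 0.625

# The closed form from the text gives the same values. It is valid when the two
# do-probabilities differ and the solution lands in [0, 1]; otherwise P(s) is not
# p-imitable within this one-dimensional subspace.
denom = p_s1_do_x1 - p_s1_do_x0
print((p_s1_do_x1 - p_s1) / denom, (p_s1 - p_s1_do_x0) / denom)   # 0.375 0.625
```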
The interventional distribution P(s|do(x)) could be identified from P(x, w, s, z) using the front-door adjustment formula P(s|do(x)) = Σ_w P(w|x) Σ_{x′} P(s|x′, w)P(x′) [29, Thm. 3.3.4]. However, evaluating the interventional probabilities P(s|do(π)) from the observational distribution P(o) could be computationally challenging if some variables in O are high-dimensional (e.g., W in Fig. 1d). Properties of the imitation instrument ⟨S, Π′⟩ suggest a practical approach to address this issue. Since P(s|do(π)) is identifiable w.r.t. ⟨G, Π′⟩, by Def. 2, it remains invariant over the models in the hypothesis class M⟨G⟩. This means that we could compute the interventional probabilities P(s|do(π); M) in an arbitrary model M ∈ M⟨G,P⟩; such an evaluation will always coincide with the actual, true causal effect P(s|do(π)) in the underlying model. This observation allows one to obtain an imitating policy through the direct parametrization of POSCMs [21]. Let N⟨G⟩ be a parametrized subfamily of POSCMs in M⟨G⟩. We could obtain a POSCM M ∈ N⟨G⟩ such that its observational distribution satisfies P(o; M) = P(o); an imitating policy is then computed in the parametrized model M. Corol. 3 shows that such a policy is an imitating policy for the expert's reward P(y).

Corollary 3. Given a causal diagram G, a policy space Π, and an observational distribution P(o), let ⟨S, Π′⟩ be an instrument w.r.t. ⟨G, Π⟩. If there exists a POSCM M ∈ M⟨G,P⟩ and a policy π ∈ Π′ such that P(s|do(π); M) = P(s), then P(y) is p-imitable w.r.t. ⟨G, Π, P(o)⟩. Moreover, π is an imitating policy for P(y) w.r.t. ⟨G, Π, P(o)⟩.

In practical experiments, we consider a parametrized family of POSCMs N⟨G⟩ where the functions associated with each observed variable in O are parametrized by a family of neural networks, similar to [21]. Using the computational framework of Generative Adversarial Networks (GANs) [9, 27], we obtain a model M ∈ N⟨G⟩ satisfying the observational constraint P(o; M) = P(o). The imitating policy is trained through explicit interventions in the learned model M; a different GAN is then deployed to optimize the policy so that it imitates the observed trajectories drawn from P(s).
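The GAN-based procedure above is beyond a short snippet. As a simplified stand-in (our own construction, not the authors' pipeline), the PyTorch sketch below optimizes a one-parameter policy by gradient descent so that the surrogate distribution P(s|do(π)), computed in an already-fitted proxy model, matches the observed P(s); the numbers are illustrative and tie back to the closed-form example above.

```python
import torch

# Assumed to come from a proxy POSCM M̂ already fit to P(o): the interventional
# probabilities P(S=1 | do(X=x); M̂) for x = 0, 1 (illustrative numbers).
q = torch.tensor([0.30, 0.70])
p_s1_observed = torch.tensor(0.55)      # expert's P(S=1) estimated from trajectories

theta = torch.zeros(1, requires_grad=True)           # logit of pi(X = 1)
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(2000):
    pi_x1 = torch.sigmoid(theta)
    p_s1_do_pi = (1 - pi_x1) * q[0] + pi_x1 * q[1]   # P(S=1 | do(pi); M̂)
    loss = (p_s1_do_pi - p_s1_observed) ** 2         # distance between surrogate distributions
    opt.zero_grad(); loss.backward(); opt.step()

print(float(torch.sigmoid(theta)))   # -> ~0.625, matching the closed form above
```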
4 Experiments

We demonstrate our algorithms on several synthetic datasets, including highD [18], which consists of natural trajectories of human-driven vehicles, and on MNIST digits. In all experiments, we test our causal imitation method (ci): we apply Thm. 2 when there exists a π-backdoor admissible set; otherwise, Alg. 1 is used to leverage the observational distribution. As a baseline, we also include naïve behavior cloning (bc), which mimics the observed conditional distribution P(x|pa(Π)), as well as the actual reward distribution generated by the expert (opt). We found that our algorithms consistently imitate the distributions over the expert's reward in imitable (p-imitable) cases, and that p-imitable instances commonly exist. We refer readers to [15, Appendix D] for more experiments, details, and analysis.

Figure 4: (a) Causal diagram for the highway driving example where a left-side car exists; (b, c) P(y|do(π)) induced by the causal imitation method (ci) and naive behavior cloning (bc), compared with the actual distribution P(y) over the expert's reward (opt). Panels: (a) Highway Driving, (b) Highway Driving, (c) MNIST Digits.

Highway Driving. We consider a modified example of the drone recordings of human-driven cars in Sec. 1 where the braking action W of the left-side car's driver is also observed. Fig. 4a shows the causal diagram of this environment; Z represents the velocity of the front car; the action X represents the velocity of the driving car; W and the reward signal Y are both affected by an unobserved confounder U, representing the weather condition. In Fig. 4a, {Z} is π-backdoor admissible while {Z, W} is not, due to the active path X ← L → W ↔ Y. We obtain policies for the causal and naive imitators by training two separate GANs. The distributions P(y|do(π)) induced by all algorithms are reported in Fig. 4b. We also measure the L1 distance between P(y|do(π)) and the expert's reward distribution P(y). We find that the causal approach (ci), using the input set {Z}, successfully imitates P(y) (L1 = 0.0018). As expected, the naive approach (bc) utilizing all covariates {Z, W} is unable to imitate the expert (L1 = 0.2937).

MNIST Digits. We consider an instance of Fig. 1c where X, S, Y are binary variables; the binary values of W are replaced with corresponding images of MNIST digits (pictures of 1 or 0), determined based on the action X. For the causal imitator (ci), we learn a POSCM M̂ such that P(x, w, s; M̂) = P(x, w, s). To obtain M̂, we train a GAN to imitate the observational distribution P(x, w, s), with a separate generator for each of X, W, S. We then train a separate discriminator measuring the distance between the observed trajectories P(s) and the interventional distribution P(s|do(π); M̂) over the surrogate {S}. The imitating policy is obtained by minimizing this distance. The distributions P(y|do(π)) induced by all algorithms are reported in Fig. 4c. We find that the causal approach (ci) successfully imitates P(y) (L1 = 0.0634). As expected, the naive approach (bc), mimicking the distribution P(x), is unable to imitate the expert (L1 = 0.1900).

5 Conclusion

We investigate imitation learning in the semantics of structural causal models. The goal is to find an imitating policy that mimics the expert's behavior from combinations of demonstration data and qualitative knowledge about the data-generating process, represented as a causal diagram. We provide a graphical criterion that is complete (i.e., sufficient and necessary) for determining the feasibility of learning an imitating policy that mimics the expert's performance. We also study a data-dependent notion of imitability that depends on the observational distribution. An efficient algorithm is introduced which finds an imitating policy by exploiting quantitative knowledge contained in the observational data and the presence of surrogate endpoints. Finally, we propose a practical procedure for estimating such an imitating policy from observed trajectories of the expert's demonstrations.

Broader Impact

This paper investigates the theoretical framework of learning a policy that imitates the distribution over a primary outcome from natural trajectories of an expert demonstrator, even when the primary outcome itself is unobserved and the input covariates used by the expert to determine the original values of the action are unknown. Since, in practice, the actual reward is often unspecified and the learner and the demonstrator rarely observe the environment in the same fashion, our methods are likely to advance the progress of automated decision systems. Such systems may be applicable to various fields, including the development of autonomous vehicles, industrial automation and the management of chronic disease. These applications may have a broad spectrum of societal implications. The adoption of autonomous driving and industrial automation systems could save costs and reduce risks such as occupational injuries, while it could also create unemployment. Treatment recommendations in clinical decision support systems could alleviate the stress on healthcare workers.
However, this also raises questions concerning accountability in cases of medical malpractice; the collection of private personal information could also make hospital databases valuable targets for malicious hackers. Overall, we would encourage research to understand the risks arising from automated decision systems and mitigations for their negative impacts.

Recently, a growing number of datasets of natural vehicle trajectories, like highD [18], are being licensed for commercial use. An immediate positive impact of this work is that we discuss the potential risk of training decision-making policies from observational data due to the presence of unobserved confounding, as shown in Sec. 1 and 4. More broadly, since our method is based on the semantics of structural causal models [29, Ch. 7], its adoption could cultivate machine learning practitioners with proper training in causal reasoning. A favorable characteristic of causal inference methods is that they are inherently robust: for example, the definition of imitability (Def. 3) requires the imitating policy to perfectly mimic the expert's performance in any model compatible with the causal diagram. Automated decision systems using causal inference methods prioritize safety and robustness in decision-making, which is increasingly essential since the use of black-box AI systems is prevalent and our understanding of their potential implications is still limited.

Acknowledgments

The authors were partially supported by grants from NSF IIS-1704352 and IIS-1750807 (CAREER).

References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.
[2] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469-483, 2009.
[3] E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113:7345-7352, 2016.
[4] A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Survey: Robot programming by demonstration. In Handbook of Robotics, chapter 59, 2008.
[5] J. Correa and E. Bareinboim. From statistical transportability to estimating the effect of stochastic interventions. In S. Kraus, editor, Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 1661-1667, Macao, China, 2019. International Joint Conferences on Artificial Intelligence Organization.
[6] J. Correa and E. Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
[7] P. de Haan, D. Jayaraman, and S. Levine. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pages 11693-11704, 2019.
[8] J. Etesami and P. Geiger. Causal transfer for imitation learning and decision making under sensor-shift. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[10] O. Goudet, D. Kalainathan, P. Caillou, I. Guyon, D. Lopez-Paz, and M. Sebag. Causal Generative Neural Networks. arXiv:1711.08936 [stat], Nov. 2017.
[11] O. Goudet, D. Kalainathan, P. Caillou, D. Lopez-Paz, I. Guyon, M. Sebag, A. Tritas, and P. Tubaro. Learning Functional Causal Models with Generative Neural Networks. arXiv:1709.05321 [stat], Sept. 2017.
[12] R. D. Hjelm, A. P. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio. Boundary-Seeking Generative Adversarial Networks. arXiv:1702.08431 [cs, stat], Feb. 2018.
[13] Y. Huang and M. Valtorta. Pearl's calculus of intervention is complete. In R. Dechter and T. Richardson, editors, Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 217-224. AUAI Press, Corvallis, OR, 2006.
[14] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1-35, 2017.
[15] J. Zhang, D. Kumor, and E. Bareinboim. Causal imitation learning with unobserved confounders. Technical Report R-66, Causal AI Lab, Columbia University, 2020.
[16] M. Kocaoglu, C. Snyder, A. G. Dimakis, and S. Vishwanath. CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training. arXiv:1709.02023 [cs, math, stat], Sept. 2017.
[17] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[18] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein. The highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2118-2125, 2018.
[19] S. Lee and E. Bareinboim. Structural causal bandits with non-manipulable variables. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 4164-4172, Honolulu, Hawaii, 2019. AAAI Press.
[20] S. Lee and E. Bareinboim. Causal effect identifiability under partial-observability. In Proceedings of the 37th International Conference on Machine Learning (ICML-20), 2020.
[21] C. Louizos, U. Shalit, J. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal Effect Inference with Deep Latent-Variable Models. arXiv:1705.08821 [cs, stat], May 2017.
[22] J. Mahler and K. Goldberg. Learning deep policies for robot bin picking by simulating robust grasping sequences. In Conference on Robot Learning, pages 515-524, 2017.
[23] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun. Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems, pages 739-746, 2006.
[24] K. Mülling, J. Kober, O. Kroemer, and J. Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263-279, 2013.
[25] A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, pages 663-670, 2000.
[26] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, Nov. 2010.
[27] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv:1606.00709 [cs, stat], June 2016.
[28] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, J. Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1-179, 2018.
[29] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000; 2nd edition, 2009.
[30] J. Pearl. Remarks on the method of propensity scores. Statistics in Medicine, 28:1415-1416, 2009.
[31] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305-313, 1989.
[32] P. Rosenbaum and D. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41-55, 1983.
[33] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 2002.
[34] I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 1219-1226, 2006.
[35] I. Shpitser and E. Sherman. Identification of personalized effects associated with causal pathways. In UAI, 2018.
[36] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.
[37] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[38] U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems, pages 1449-1456, 2008.
[39] J. Tian. Studies in Causal Reasoning and Learning. PhD thesis, Computer Science Department, University of California, Los Angeles, CA, November 2002.
[40] J. Tian. Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 554-561, 2008.
[41] J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 567-573, Menlo Park, CA, 2002. AAAI Press/The MIT Press.
[42] J. Tian and J. Pearl. A general identification condition for causal effects. Technical Report R-290-A, Department of Computer Science, University of California, Los Angeles, CA, 2003.
[43] B. van der Zander, M. Liśkiewicz, and J. Textor. Constructing separators and adjustment sets in ancestral graphs. In Proceedings of the UAI 2014 Conference on Causal Inference: Learning and Prediction - Volume 1274, pages 11-24, 2014.
[44] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.
[45] B. Widrow. Pattern-recognizing control systems. Computer and Information Sciences, 1964.
[46] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433-1438, Chicago, IL, USA, 2008.