# Causal Imitation Learning with Unobserved Confounders

Junzhe Zhang (Columbia University, junzhez@cs.columbia.edu), Daniel Kumor (Purdue University, dkumor@purdue.edu), Elias Bareinboim (Columbia University, eb@cs.columbia.edu)

One of the common ways children learn is by mimicking adults. Imitation learning focuses on learning policies with suitable performance from demonstrations generated by an expert, with an unspecified performance measure and an unobserved reward signal. Popular methods for imitation learning start by either directly mimicking the behavior policy of an expert (behavior cloning) or by learning a reward function that prioritizes observed expert trajectories (inverse reinforcement learning). However, these methods rely on the assumption that the covariates used by the expert to determine her/his actions are fully observed. In this paper, we relax this assumption and study imitation learning when the sensory inputs of the learner and the expert differ. First, we provide a non-parametric, graphical criterion that is complete (both necessary and sufficient) for determining the feasibility of imitation from the combination of demonstration data and qualitative assumptions about the underlying environment, represented in the form of a causal model. We then show that when such a criterion does not hold, imitation could still be feasible by exploiting quantitative knowledge of the expert trajectories. Finally, we develop an efficient procedure for learning the imitating policy from the expert's trajectories.

1 Introduction

A unifying theme of Artificial Intelligence is to learn a policy from observations in an unknown environment such that a suitable level of performance is achieved [33, Ch. 1.1]. Operationally, a policy is a decision rule that determines an action based on a certain set of covariates; observations are possibly generated by a human demonstrator following a different behavior policy. The task of evaluating policies from a combination of observational data and assumptions about the underlying environment has been studied in the literature of causal inference [29] and reinforcement learning [37]. Several criteria, algorithms, and estimation methods have been developed to solve this problem [29, 36, 3, 6, 35, 32, 44].

In many applications, it is not clear which performance measure the demonstrator is (possibly subconsciously) optimizing. That is, the reward signal is not labeled and accessible in the observed expert's trajectories. In such settings, the performance of candidate policies is not uniquely discernible from the observational data due to latent outcomes, even when infinitely many samples are gathered, complicating efforts to learn a policy with satisfactory performance. An alternative approach used to circumvent this issue is to find a policy that mimics the demonstrator's behavior, which leads to the imitation learning paradigm [2, 4, 14, 28]. The expectation (or rather hope) is that if the demonstrations are generated by an expert with near-optimal reward, the performance of the imitator would also be satisfactory. Current methods of imitation learning can be categorized into behavior cloning [45, 31, 23, 24, 22] and inverse reinforcement learning [25, 1, 38, 46]. The former focuses on learning a nominal expert policy that approximates the conditional distribution mapping observed input covariates of the behavior policy to the action domain.
The latter attempts to learn a reward function that prioritizes observed behaviors of the expert; reinforcement learning methods are then applied using the learned reward function to obtain a nominal policy. However, both families of methods rely on the assumption that the expert's input observations match those available to the imitator. When unobserved covariates exist, however, naively imitating the nominal expert policy does not necessarily lead to satisfactory performance, even when the expert him or herself behaves optimally.

Figure 1: Causal diagrams where X represents an action (shaded red) and Y represents a latent reward (shaded blue). Input covariates of the policy space are shaded in light red and minimal imitation surrogates relative to action X and reward Y are shaded in light blue.

Figure 2: The tail light of the front car is unobserved in highway (aerial) drone data.

For concreteness, consider the learning scenario depicted in Fig. 2, describing trajectories of human-driven cars collected by drones flying over highways [18, 8]. Using such data, we want to learn a policy π(x|z) deciding on the acceleration (action) X of the demonstrator car based on the velocity and locations of both the demonstrator and front cars, summarized as covariates Z. In reality, the human demonstrator also uses the tail light L of the front car to coordinate his/her actions. The demonstrator's performance is evaluated with a latent reward function Y taking X, Z, L as input. However, only observations of X, Z are collected by the drone, summarized as probabilities P(x, z). Fig. 1a describes the graphical representation of this environment. A naïve approach would estimate the conditional distribution P(x|z) and use it as the policy π. A preliminary analysis reveals that this naive cloning approach leads to sub-optimal performance. Consider an instance where variables X, Y, Z, L, U ∈ {0, 1}; their values are decided by the functions L ← Z ⊕ U, X ← Z ⊕ L, Y ← ¬(X ⊕ Z ⊕ L), i.e., Y = 1 if and only if X = Z ⊕ L; Z, U are independent variables drawn uniformly over {0, 1}; ⊕ represents the exclusive-or operator. The expected reward E[Y|do(π)] induced by π(x|z) = P(x|z) is equal to 0.5, which is quite far from the optimal demonstrator's performance, E[Y] = 1 (a small simulation reproducing this gap is sketched after the list of contributions below). This example shows that even when one is able to perfectly mimic an optimal demonstrator, the learned policy can still be suboptimal.

In this paper, we try to explicate this phenomenon and, more broadly, understand imitability through a causal lens¹. Our task is to learn an imitating policy that achieves the expert's performance from demonstration data in a structural causal model [29, Ch. 7], allowing for unobserved confounders (UCs) affecting both action and outcome variables. Specifically, our contributions are summarized as follows. (1) We introduce a complete graphical criterion for determining the feasibility of imitation from demonstration data and qualitative knowledge about the data-generating process represented as a causal graph. (2) We develop a sufficient algorithm for identifying an imitating policy when the given criterion does not hold, by leveraging the quantitative knowledge in the observational distribution. (3) We provide an efficient and practical procedure for finding an imitating policy through explicit parametrization of the causal model, and use it to validate our results on high-dimensional, synthetic datasets.
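Before formalizing the problem, the gap described above can be reproduced with a short simulation. This is a minimal sketch, assuming the reconstruction Y = 1 iff X = Z ⊕ L; the names are illustrative only. The expert attains E[Y] = 1, while deploying the estimated P(x|z) attains only about 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Exogenous/observed inputs: Z (front-car state) and U (noise), both uniform over {0, 1}.
Z = rng.integers(0, 2, n)
U = rng.integers(0, 2, n)
L = Z ^ U              # tail light (latent to the imitator)
X_expert = Z ^ L       # expert's action uses both Z and L
Y_expert = (X_expert == (Z ^ L)).astype(int)   # reward: 1 iff the action matches Z XOR L
print("expert  E[Y]        =", Y_expert.mean())           # -> 1.0

# Behavior cloning: estimate P(X = 1 | Z = z) from the drone data (X, Z only) ...
p_x1_given_z = np.array([X_expert[Z == z].mean() for z in (0, 1)])
# ... and deploy it, i.e., draw X from P(x|z) while ignoring the unobserved L.
X_bc = (rng.random(n) < p_x1_given_z[Z]).astype(int)
Y_bc = (X_bc == (Z ^ L)).astype(int)
print("cloning E[Y|do(pi)] =", Y_bc.mean())                # -> ~0.5
```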
For the sake of space constraints, we provide all proofs in the complete technical report [15, Appendix A].

1.1 Preliminaries

In this section, we introduce the basic notation and definitions used throughout the paper. We use capital letters to denote random variables (X) and small letters for their values (x). D_X represents the domain of X and P_X the space of probability distributions over D_X. For a set X, |X| denotes its dimension. We consistently use the abbreviation P(x) to represent the probability P(X = x). Finally, I_{Z=z} is an indicator function that returns 1 if Z = z holds true, and 0 otherwise.

¹ Some recent progress in the field of causal imitation has been reported, albeit oblivious to the phenomenon described above and our contributions. Some work considered settings in which the input to the expert policy is fully observed [7], while another assumed that the primary outcome is observed (e.g., Y in Fig. 1a) [8].

Calligraphic letters, e.g., G, will be used to represent directed acyclic graphs (DAGs) (e.g., Fig. 1). We denote by G_{X̄} the subgraph obtained from G by removing arrows coming into nodes in X; G_{X̲} is the subgraph of G obtained by removing arrows going out of X. We will use standard family conventions for graphical relationships such as parents, children, descendants, and ancestors. For example, the set of parents of X in G is denoted by pa(X)_G = ∪_{X∈X} pa(X)_G; ch, de and an are similarly defined. We write Pa, Ch, De, An if the arguments are included as well, e.g., De(X)_G = de(X)_G ∪ X. A path from a node X to a node Y in G is a sequence of edges which does not include a particular node more than once. Two sets of nodes X, Y are said to be d-separated by a third set Z in a DAG G, denoted by (X ⊥⊥ Y | Z)_G, if every path from nodes in one set to nodes in the other is "blocked". The criterion of blockage follows [29, Def. 1.2.3].

The basic semantic framework of our analysis rests on structural causal models (SCMs) [29, Ch. 7]. An SCM M is a tuple ⟨U, V, F, P(u)⟩ where V is a set of endogenous variables and U is a set of exogenous variables. F is a set of structural functions where f_V ∈ F decides the values of an endogenous variable V ∈ V, taking as argument a combination of other variables. That is, V ← f_V(Pa_V, U_V), with Pa_V ⊆ V and U_V ⊆ U. Values of U are drawn from an exogenous distribution P(u). Each SCM M induces a distribution P(v) over the endogenous variables V. An intervention on a subset X ⊆ V, denoted by do(x), is an operation where the values of X are set to constants x, replacing the functions {f_X : ∀X ∈ X} that would normally determine their values. For an SCM M, let M_x be the submodel of M induced by the intervention do(x). For a set S ⊆ V, the interventional distribution P(s|do(x)) induced by do(x) is defined as the distribution over S in the submodel M_x, i.e., P(s|do(x); M) ≜ P(s; M_x). We leave M implicit when it is obvious from the context. For a detailed survey on SCMs, we refer readers to [29, Ch. 7].
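To make the SCM semantics concrete before moving on, here is a minimal sketch (a toy model of our own, not one from the paper) of an SCM as a set of structural functions together with the do() operation, which replaces the mechanism of the intervened variable and leaves the rest untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n, do_x=None):
    """Draw n samples from a toy SCM with mechanisms U -> X, X -> W, (W, U) -> Y.

    do_x: if given, the function f_X is replaced by the constant do_x (the submodel M_x);
    otherwise X follows its natural mechanism.
    """
    U = rng.integers(0, 2, n)                              # exogenous P(u)
    X = U.copy() if do_x is None else np.full(n, do_x)     # f_X(U), or replaced by do(x)
    W = X ^ (rng.random(n) < 0.1).astype(int)              # f_W(X, U_W): copy X, flip w.p. 0.1
    Y = W ^ U                                              # f_Y(W, U): confounded with X via U
    return X, W, Y

# Observational distribution P(v) vs. interventional distribution P(v | do(X = 1)).
_, _, Y_obs = sample_scm(100_000)
_, _, Y_do1 = sample_scm(100_000, do_x=1)
print("P(Y = 1)           ≈", Y_obs.mean())    # ~0.1
print("P(Y = 1 | do(X=1)) ≈", Y_do1.mean())    # ~0.5: intervening breaks the X-U dependence
```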
2 Imitation Learning in Structural Causal Models

In this section, we formalize and study the imitation learning problem in causal language. We first define a special type of SCM that explicitly allows one to model the unobserved nature of some endogenous variables, which is called the partially observable structural causal model (POSCM).²

Definition 1 (Partially Observable SCM). A POSCM is a tuple ⟨M, O, L⟩, where M is an SCM ⟨U, V, F, P(u)⟩ and ⟨O, L⟩ is a pair of subsets forming a partition over V (i.e., V = O ∪ L and O ∩ L = ∅); O and L are called observed and latent endogenous variables, respectively.

Each POSCM M induces a probability distribution over V, of which one can measure the observed variables O. P(o) is usually called the observational distribution. M is associated with a causal diagram G (e.g., see Fig. 1) where solid nodes represent observed variables O, dashed nodes represent latent variables L, and arrows represent the arguments Pa_V of each functional relationship f_V. Exogenous variables U are not explicitly shown; a bi-directed arrow between nodes V_i and V_j indicates the presence of an unobserved confounder (UC) affecting both V_i and V_j, i.e., U_{V_i} ∩ U_{V_j} ≠ ∅.

² This definition facilitates a more explicit articulation of which endogenous variables are available to the demonstrator and the corresponding policy at each point in time.

Consider a POSCM ⟨M, O, L⟩ with M = ⟨U, V, F, P(u)⟩. Our goal is to learn an efficient policy to decide the value of an action variable X ∈ O. The performance of the policy is evaluated using the expected value of a reward variable Y. Throughout this paper, we assume that the reward Y is latent and X affects Y (i.e., Y ∈ L ∩ De(X)_G). A policy π is a function mapping from values of covariates Pa_Π ⊆ O \ De(X)_{G_{X̄}} (observed non-descendants of X)³ to a probability distribution over X, which we denote by π(x|pa_Π). An intervention following a policy π, denoted by do(π), is an operation that draws values of X independently following π, regardless of its original (natural) function f_X. Let M_π denote the manipulated SCM of M induced by do(π). Similar to atomic settings, the interventional distribution P(v|do(π)) is defined as the distribution over V in the manipulated model M_π, given by

P(v|do(π)) = Σ_u [ ∏_{V ∈ V∖{X}} P(v|pa_V, u_V) ] π(x|pa_Π) P(u).   (1)

³ G_{X̄} is the causal diagram associated with the submodel M_x induced by the intervention do(x).

The expected reward of a policy π is thus given by the causal effect E[Y|do(π)]. The collection of all possible policies defines a policy space, denoted by Π = {π : D_{Pa_Π} ↦ P_X} (if Pa_Π = ∅, Π = {π : P_X}). For convenience, we define the function Pa(Π) = Pa_Π. A policy space Π′ is a subspace of Π if Pa(Π′) ⊆ Pa(Π). We will consistently highlight the action X in dark red, the reward Y in dark blue and the covariates Pa(Π) in light red. For instance, in Fig. 1a, the policy space over action X is given by Π = {π : D_Z ↦ P_X}; Y represents the reward; Π′ = {π : P_X} is a subspace of Π.

Our goal is to learn an efficient policy π ∈ Π that achieves satisfactory performance, e.g., larger than a certain threshold τ, E[Y|do(π)] ≥ τ, without knowledge of the underlying system dynamics, i.e., the actual, true POSCM M. A possible approach is to identify the expected reward E[Y|do(π)] for each policy π ∈ Π from the combination of the observed data P(o) and the causal diagram G. Optimization procedures are then applicable to find a satisfactory policy π. Let M⟨G⟩ denote the hypothesis class of POSCMs that are compatible with a causal diagram G. We define the non-parametric notion of identifiability in the context of POSCMs and conditional policies, adapted from [29, Def. 3.2.4].

Definition 2 (Identifiability). Given a causal diagram G and a policy space Π, let Y be an arbitrary subset of V. P(y|do(π)) is said to be identifiable w.r.t. ⟨G, Π⟩ if P(y|do(π); M) is uniquely computable from P(o; M) and G for any POSCM M ∈ M⟨G⟩ and any π ∈ Π.
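To ground Eq. (1), the following sketch evaluates E[Y|do(π)] exactly by enumerating the exogenous values of a small discrete model. The model is our own reconstruction of the Fig. 1a example from Section 1, used purely for illustration; it is not taken from the paper.

```python
from itertools import product

# Exogenous/root distributions of the toy model: Z and U are independent and uniform over {0, 1}.
P_z = {0: 0.5, 1: 0.5}
P_u = {0: 0.5, 1: 0.5}

def expected_reward(pi):
    """E[Y | do(pi)] via the truncated factorization of Eq. (1): every mechanism except f_X
    is kept, f_X is replaced by the policy pi(x, z), and we sum over the exogenous values."""
    total = 0.0
    for z, u, x in product((0, 1), repeat=3):
        l = z ^ u                   # f_L(Z, U): tail light
        y = int(x == (z ^ l))       # f_Y(X, Z, L): reward 1 iff the action matches Z XOR L
        total += P_z[z] * P_u[u] * pi(x, z) * y
    return total

behavior_cloning = lambda x, z: 0.5        # P(x|z) is uniform in this model
match_z = lambda x, z: float(x == z)       # an arbitrary deterministic policy over D_Z

print(expected_reward(behavior_cloning))   # 0.5
print(expected_reward(match_z))            # 0.5: no policy with input Z alone attains E[Y] = 1
```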
In imitation learning settings, however, the reward Y is often not specified and remains latent, which precludes approaches that attempt to identify E[Y|do(π)]:

Corollary 1. Given a causal diagram G and a policy space Π, let Y be an arbitrary subset of V. If not all variables in Y are observed (i.e., Y ∩ L ≠ ∅), P(y|do(π)) is not identifiable.

In other words, Corol. 1 shows that when the reward Y is latent, it is infeasible to uniquely determine the value of E[Y|do(π)] from P(o). A similar observation has been noted in [20, Prop. 1]. This suggests that we need to explore learning through other modalities.

2.1 Causal Imitation Learning

To circumvent issues of non-identifiability, a common solution is to assume that the observed trajectories are generated by an expert demonstrator with satisfactory performance E[Y], e.g., no less than a certain threshold (E[Y] ≥ τ). If we could find a policy π that perfectly "imitates" the expert with respect to the reward Y, E[Y|do(π)] = E[Y], the performance of the learner is also guaranteed to be satisfactory. Formally,

Definition 3 (Imitability). Given a causal diagram G and a policy space Π, let Y be an arbitrary subset of V. P(y) is said to be imitable w.r.t. ⟨G, Π⟩ if there exists a policy π ∈ Π uniquely computable from P(o) such that P(y|do(π); M) = P(y; M) for any POSCM M ∈ M⟨G⟩.

Our task is to determine the imitability of the expert's performance. More specifically, we want to learn an imitating policy π ∈ Π from P(o) such that P(y|do(π)) = P(y)⁴ in any POSCM M associated with the causal diagram G. Consider Fig. 3a as an example. P(y) is imitable with the policy π(x) = P(x) since, by Eq. (1) and marginalization, P(y|do(π)) = Σ_{x,w} P(y|w)P(w|x)π(x) = Σ_{x,w} P(y|w)P(w|x)P(x) = P(y).

⁴ The imitation is trivial if Y ∉ De(X)_G: by Rule 3 of [29, Thm. 3.4.1] (or [6, Thm. 1]), P(y|do(π)) = P(y) for any policy π. This paper aims to find a specific π satisfying P(y|do(π)) = P(y) even when Y ∈ De(X)_G.

In practice, unfortunately, the expert's performance cannot always be imitated. To understand this setting, we first write, more explicitly, the conditions under which this is not the case:

Lemma 1. Given a causal diagram G and a policy space Π, let Y be an arbitrary subset of V. P(y) is not imitable w.r.t. ⟨G, Π⟩ if there exist two POSCMs M₁, M₂ ∈ M⟨G⟩ satisfying P(o; M₁) = P(o; M₂) while there exists no policy π ∈ Π such that, for i = 1, 2, P(y|do(π); M_i) = P(y; M_i).

Figure 3: Imitability vs. Identifiability.

It follows as a corollary that P(y) is not imitable if there exists a POSCM M compatible with G such that no policy π ∈ Π could ensure P(y|do(π); M) = P(y; M). For instance, consider the causal diagram G and policy space Π in Fig. 3b. Here, the expert's reward P(y) is not imitable: consider a POSCM with functions X ← U, W ← X, Y ← ¬(U ⊕ W), where the values of U are drawn uniformly over {0, 1}. In this model, P(Y = 1|do(π)) = 0.5 for any policy π, which is far from the optimal expert reward, P(Y = 1) = 1. An interesting observation from the above example of Fig. 3b is that the effect P(y|do(π)) is identifiable, following the front-door criterion in [29, Thm. 3.3.4], yet no policy imitates the corresponding P(y). However, in some settings, the expert's reward P(y) is imitable but the imitator's reward P(y|do(π)) cannot be uniquely determined. To witness, consider again the example in Fig. 3a. The imitability of P(y) has been shown above, while P(y|do(π)) is not identifiable due to the latent reward Y (Corol. 1).
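A numerical check of the Fig. 3b counterexample is sketched below. It assumes our reconstruction Y ← ¬(U ⊕ W), chosen to match the stated values P(Y = 1) = 1 and P(Y = 1|do(π)) = 0.5; the mechanisms are otherwise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
U = rng.integers(0, 2, n)

# Natural (expert) model: X <- U, W <- X, Y <- NOT(U XOR W).
X = U
W = X
Y = 1 - (U ^ W)
print("expert P(Y=1) =", Y.mean())             # -> 1.0

# Manipulated model under do(pi): X is drawn from pi, independently of U.
for p in (0.0, 0.5, 1.0):                      # pi(X = 1) = p
    X_pi = (rng.random(n) < p).astype(int)
    W_pi = X_pi
    Y_pi = 1 - (U ^ W_pi)
    print(f"pi(X=1)={p}: P(Y=1|do(pi)) ≈", round(Y_pi.mean(), 3))   # ~0.5 for every p
```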
In general, the problem of imitability is orthogonal to identifiability and, therefore, requires separate consideration. Since imitability does not always hold, we introduce a useful graphical criterion for determining whether imitating an expert's performance is feasible, and if so, how.

Theorem 1 (Imitation by Direct Parents). Given a causal diagram G and a policy space Π, P(y) is imitable w.r.t. ⟨G, Π⟩ if pa(X)_G ⊆ Pa(Π) and there is no bi-directed arrow pointing to X in G. Moreover, the imitating policy π ∈ Π is given by π(x|pa(Π)) = P(x|pa(X)_G).

In words, Thm. 1 says that if the expert and the learner share the same policy space, the expert's performance is always imitable. In fact, this result can be seen as a causal justification for when the method of "behavior cloning", widely used in practice, is valid, leading to proper imitation. When the original behavior policy f_X is contained in the policy space Π, the learner could imitate the expert's reward P(y) by learning a policy π ∈ Π that matches the distribution P(x|pa(Π)) [45, 31]. Next, we consider the more challenging setting where the policy spaces of the expert and the learner disagree (i.e., the learner and the expert have different views of the world, f_X ∉ Π). We will leverage a graphical condition adopted from the celebrated backdoor criterion [29, Def. 3.3.1].

Definition 4 (π-Backdoor). Given a causal diagram G and a policy space Π, a set Z is said to satisfy the π-backdoor criterion w.r.t. ⟨G, Π⟩ if and only if Z ⊆ Pa(Π) and (Y ⊥⊥ X | Z)_{G_{X̲}}; such a Z is called a π-backdoor admissible set w.r.t. ⟨G, Π⟩.

For concreteness, consider again the highway driving example in Fig. 1a. There exists no π-backdoor admissible set due to the path X ← L → Y. Now consider the modified graph in Fig. 1b where the edge L → Y is removed. {Z} is π-backdoor admissible since Z ∈ Pa(Π) and (Y ⊥⊥ X | Z)_{G_{X̲}}. Leveraging the imitation backdoor condition, our next theorem provides a full characterization of when imitating the expert's performance is achievable, despite the fact that the reward Y is latent.

Theorem 2 (Imitation by π-Backdoor). Given a causal diagram G and a policy space Π, P(y) is imitable w.r.t. ⟨G, Π⟩ if and only if there exists a π-backdoor admissible set Z w.r.t. ⟨G, Π⟩. Moreover, the imitating policy π ∈ Π is given by π(x|pa(Π)) = P(x|z).

That is, one can learn an imitating policy from a policy space Π′ = {π : D_Z ↦ P_X} that mimics the conditional probabilities P(x|z) if and only if Z is π-backdoor admissible. If that is the case, such a policy can be learned from data through standard density estimation methods. For instance, Thm. 2 ascertains that P(y) in Fig. 1a is indeed non-imitable. On the other hand, P(y) in Fig. 1b is imitable, guaranteed by the π-backdoor admissible set {Z}; the imitating policy is given by π(x|z) = P(x|z).
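As a sketch of how the π-backdoor condition of Def. 4 can be checked mechanically, the snippet below runs a d-separation test on our reconstruction of the Fig. 1a and Fig. 1b diagrams. The edge lists are inferred from the description in Section 1 and are assumptions, not copied from the paper's figures; `nx.d_separated` is available in networkx 2.8+ (newer releases expose the same test as `nx.is_d_separator`).

```python
import networkx as nx

def pi_backdoor_admissible(edges, Z, X="X", Y="Y"):
    """Def. 4 check: assuming Z ⊆ Pa(Π), test (Y ⊥⊥ X | Z) in G with X's outgoing edges removed."""
    G = nx.DiGraph(edges)
    G_under_X = G.copy()
    G_under_X.remove_edges_from(list(G.out_edges(X)))
    return nx.d_separated(G_under_X, {X}, {Y}, set(Z))

# Reconstruction of Fig. 1a: Z -> L, and both X and Y take Z and L as inputs (L is latent).
fig_1a = [("Z", "L"), ("Z", "X"), ("L", "X"), ("Z", "Y"), ("L", "Y"), ("X", "Y")]
# Fig. 1b: the same graph with the edge L -> Y removed.
fig_1b = [e for e in fig_1a if e != ("L", "Y")]

print(pi_backdoor_admissible(fig_1a, {"Z"}))   # False: the path X <- L -> Y stays open
print(pi_backdoor_admissible(fig_1b, {"Z"}))   # True: {Z} is pi-backdoor admissible
```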
3 Causal Imitation Learning with Data Dependency

One may surmise that the imitation boundary established by Thm. 2 suggests that when there exists no π-backdoor admissible set, it is infeasible to imitate the expert's performance from observed trajectories of demonstrations. In this section, we circumvent this issue by exploiting the actual parameters of the observational distribution P(o). In particular, we denote by M⟨G,P⟩ the subfamily of candidate models in M⟨G⟩ that induce both the causal diagram G and the observational distribution P(o), i.e., M⟨G,P⟩ = {M ∈ M⟨G⟩ : P(o; M) = P(o)}. We introduce a refined notion of imitability that explores the quantitative knowledge of the observations P(o) (to be exemplified). Formally,

Definition 5 (Practical Imitability). Given a causal diagram G, a policy space Π, and an observational distribution P(o), let Y be an arbitrary subset of V. P(y) is said to be practically imitable (for short, p-imitable) w.r.t. ⟨G, Π, P(o)⟩ if there exists a policy π ∈ Π uniquely computable from P(o) such that P(y|do(π); M) = P(y; M) for any POSCM M ∈ M⟨G,P⟩.

The following corollary can be derived based on the definition of practical imitability.

Corollary 2. Given a causal diagram G, a policy space Π and an observational distribution P(o), let Y ⊆ V be a subset. If P(y) is imitable w.r.t. ⟨G, Π⟩, then P(y) is p-imitable w.r.t. ⟨G, Π, P(o)⟩.

Compared to Def. 3, the practical imitability of Def. 5 aims to find an imitating policy for a subset of candidate POSCMs M⟨G,P⟩ restricted to match a specific observational distribution P(o). Def. 3, on the other hand, requires only the causal diagram G. In other words, an expert's performance P(y) that is non-imitable w.r.t. ⟨G, Π⟩ could still be p-imitable after analyzing the actual probabilities of the observational distribution P(o).

For concreteness, consider again P(y) in Fig. 3b, which is not imitable due to the bi-directed arrow X ↔ Y. However, new imitation opportunities arise when the actual parameters of the observational distribution P(x, w, y) are provided. Suppose the underlying POSCM is given by X ← U_X ⊕ U_Y, W ← X ⊕ U_W, Y ← W ⊕ U_Y, where U_X, U_Y, U_W are independent binary variables drawn from P(U_X = 1) = P(U_Y = 1) = P(U_W = 0) = 0.9. Here, the causal effect P(y|do(x)) is identifiable from P(x, w, y) following the front-door formula P(y|do(x)) = Σ_w P(w|x) Σ_{x′} P(y|w, x′)P(x′) [29, Thm. 3.3.4]. We thus have P(Y = 1|do(X = 0)) = 0.82, which coincides with P(Y = 1) = 0.82, i.e., P(y) is p-imitable with the atomic intervention do(X = 0) (a numerical check of this example appears in the code sketch below). In most practical settings, the expert reward P(y) rarely equates to P(y|do(x)); stochastic policies π(x) are then applicable to imitate P(y) by re-weighting the P(y|do(x)) induced by the corresponding atomic interventions⁵.

⁵ Consider a variation of the model where P(U_W = 1) = 0.7. P(y) is p-imitable with π(X = 0) = 0.75.

To tackle p-imitability in a general way, we proceed by defining a set of observed variables that serves as a surrogate of the unobserved Y with respect to interventions on X. Formally,

Definition 6 (Imitation Surrogate). Given a causal diagram G and a policy space Π, let S be an arbitrary subset of O. S is an imitation surrogate (for short, surrogate) w.r.t. ⟨G, Π⟩ if (Y ⊥⊥ X̂ | S)_{G∪Π}, where G ∪ Π is the supergraph of G obtained by adding arrows from Pa(Π) to X, and X̂ is a newly added parent of X. A surrogate S is said to be minimal if there exists no subset S′ ⊂ S such that S′ is also a surrogate w.r.t. ⟨G, Π⟩.

Consider as an example Fig. 1c, where the supergraph G ∪ Π coincides with the causal diagram G. By Def. 6, both {W, S} and {S} are valid surrogates relative to ⟨X, Y⟩, with {S} being the minimal one. By conditioning on S, the decomposition of Eq. (1) implies P(y|do(π)) = Σ_{s,w,x,u} P(y|s)P(s|w, u)P(w|x)π(x)P(u) = Σ_s P(y|s)P(s|do(π)). That is, the surrogate S mediates all influence of interventions on the action X on the reward Y. It is thus sufficient to find an imitating policy π such that P(s|do(π)) = P(s) for any POSCM M associated with Fig. 1c. The resultant policy π is guaranteed to imitate the expert's reward P(y). When a surrogate S is found and P(s|do(π)) is identifiable, one could compute P(s|do(π)) for each policy π and check whether it matches P(s). In many settings, however, P(s|do(π)) is not identifiable w.r.t. ⟨G, Π⟩. For example, in Fig. 1d, S is a surrogate w.r.t. ⟨G, Π⟩, but P(s|do(π)) is not identifiable due to the collider Z (π uses non-descendants as input by default).
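The p-imitability example above can be checked numerically. The sketch below assumes the reconstructed mechanisms X = U_X ⊕ U_Y, W = X ⊕ U_W, Y = W ⊕ U_Y and compares the front-door estimate of P(Y = 1|do(X = 0)), computed from observational samples only, against the expert's P(Y = 1) and the ground-truth interventional value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
UX = (rng.random(n) < 0.9).astype(int)
UY = (rng.random(n) < 0.9).astype(int)
UW = (rng.random(n) < 0.1).astype(int)   # P(U_W = 0) = 0.9

# Observational (expert) regime.
X = UX ^ UY
W = X ^ UW
Y = W ^ UY
print("P(Y=1)                    ≈", Y.mean())              # ~0.82

# Front-door estimate of P(Y=1 | do(X=0)) from the observational samples:
#   sum_w P(w | X=0) * sum_x' P(Y=1 | w, x') P(x')
p_x = np.array([np.mean(X == x) for x in (0, 1)])
est = 0.0
for w in (0, 1):
    p_w_given_x0 = np.mean(W[X == 0] == w)
    p_y1_given_wx = np.array([Y[(W == w) & (X == x)].mean() for x in (0, 1)])
    est += p_w_given_x0 * np.dot(p_y1_given_wx, p_x)
print("front-door P(Y=1|do(X=0)) ≈", est)                   # ~0.82

# Ground truth obtained by actually intervening in the model.
Y_do0 = (0 ^ UW) ^ UY
print("true P(Y=1|do(X=0))       ≈", Y_do0.mean())          # ~0.82
```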
Fortunately, identifying P(s|do(π)) may still be feasible in some subspaces of Π:

Definition 7 (Identifiable Subspace). Given a causal diagram G, a policy space Π, and a subset S ⊆ O, let Π′ be a policy subspace of Π. Π′ is said to be an identifiable subspace (for short, id-subspace) w.r.t. ⟨G, Π, S⟩ if P(s|do(π)) is identifiable w.r.t. ⟨G, Π′⟩.

Consider the policy subspace Π′ = {π : P_X} described in Fig. 1d (i.e., one that does not exploit information from the covariates Z). P(s|do(π)) is identifiable w.r.t. ⟨G, Π′⟩ following the front-door adjustment on W [29, Thm. 3.3.4]. We could then evaluate the interventional probabilities P(s|do(π)) for each policy π ∈ Π′ from the observational distribution P(x, w, s, z); the imitating policy is obtainable by solving the equation P(s|do(π)) = P(s). In other words, {S} and Π′ form an instrument that allows one to solve the imitation learning problem in Fig. 1d.

Definition 8 (Imitation Instrument). Given a causal diagram G and a policy space Π, let S be a subset of O and Π′ be a subspace of Π. ⟨S, Π′⟩ is said to be an imitation instrument (for short, instrument) if S is a surrogate w.r.t. ⟨G, Π′⟩ and Π′ is an id-subspace w.r.t. ⟨G, Π, S⟩.

Lemma 2. Given a causal diagram G, a policy space Π, and an observational distribution P(o), let ⟨S, Π′⟩ be an instrument w.r.t. ⟨G, Π⟩. If P(s) is p-imitable w.r.t. ⟨G, Π′, P(o)⟩, then P(y) is p-imitable w.r.t. ⟨G, Π, P(o)⟩. Moreover, an imitating policy for P(s) w.r.t. ⟨G, Π′, P(o)⟩ is also an imitating policy for P(y) w.r.t. ⟨G, Π, P(o)⟩.

In words, Lem. 2 shows that when an imitation instrument ⟨S, Π′⟩ is present, we could reduce the original imitation learning problem on a latent reward Y to a p-imitability problem over observed surrogate variables S, using policies in an identifiable subspace Π′. The imitating policy is obtainable by solving the equation P(s|do(π)) = P(s).

3.1 Confounding Robust Imitation

Our task in this section is to introduce a general algorithm that finds instruments and learns a p-imitating policy given ⟨G, Π, P(o)⟩. A naïve approach is to enumerate all pairs of a subset S and a subspace Π′ and check whether they form an instrument; if so, we can compute an imitating policy for P(s) w.r.t. ⟨G, Π′, P(o)⟩. However, the challenge is that the number of all possible subspaces Π′ (or subsets S) can be exponentially large. Fortunately, we can greatly restrict this search space. Let G ∪ {Y} denote the causal diagram obtained from G by making the reward Y observed. The following proposition suggests that it suffices to consider only identifiable subspaces w.r.t. ⟨G ∪ {Y}, Π, Y⟩.

Lemma 3. Given a causal diagram G and a policy space Π, let Π′ ⊆ Π be a subspace. If there exists S ⊆ O such that ⟨S, Π′⟩ is an instrument w.r.t. ⟨G, Π⟩, then Π′ is an id-subspace w.r.t. ⟨G ∪ {Y}, Π, Y⟩.

Algorithm 1: IMITATE
  1: Input: G, Π, P(o).
  2: while LISTIDSPACE(G ∪ {Y}, Π, Y) outputs a policy subspace Π′ do
  3:   while LISTMINSEP(G ∪ Π′, X̂, Y, ∅, O) outputs a surrogate set S do
  4:     if IDENTIFY(G, Π′, S) = YES then
  5:       Solve for a policy π ∈ Π′ such that P(s|do(π); M) = P(s) for any POSCM M ∈ M⟨G,P⟩.
  6:       Return π if it exists; continue otherwise.
  7:     end if
  8:   end while
  9: end while
  10: Return FAIL.

Our algorithm IMITATE is described in Alg. 1. We assume access to an IDENTIFY oracle [41, 34, 6] that takes as input a causal diagram G, a policy space Π and a set of observed variables S. If P(s|do(π)) is identifiable w.r.t. ⟨G, Π⟩, IDENTIFY returns YES; otherwise, it returns NO.
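The control flow of Alg. 1 can be summarized in a short skeleton. Every helper below is a hypothetical placeholder standing in for the paper's subroutines (LISTIDSPACE, LISTMINSEP, the IDENTIFY oracle, and the Step-5 solver); this is a structural sketch, not a working implementation of those oracles.

```python
from typing import Iterable, Optional

def list_id_subspaces(G, Pi) -> Iterable:
    """Placeholder for LISTIDSPACE(G ∪ {Y}, Π, Y): yield candidate id-subspaces Π′."""
    return []

def list_minimal_surrogates(G, Pi_prime) -> Iterable:
    """Placeholder for LISTMINSEP(G ∪ Π′, X̂, Y, ∅, O): yield minimal surrogate sets S."""
    return []

def identify(G, Pi_prime, S) -> bool:
    """Placeholder for the IDENTIFY oracle: is P(s|do(π)) identifiable w.r.t. ⟨G, Π′⟩?"""
    return False

def solve_imitating_policy(G, Pi_prime, S, P_obs) -> Optional[object]:
    """Placeholder for Step 5: solve P(s|do(π)) = P(s) within Π′, or return None."""
    return None

def imitate(G, Pi, P_obs) -> Optional[object]:
    for Pi_prime in list_id_subspaces(G, Pi):                   # Step 2
        for S in list_minimal_surrogates(G, Pi_prime):          # Step 3
            if identify(G, Pi_prime, S):                        # Step 4: ⟨S, Π′⟩ is an instrument
                pi = solve_imitating_policy(G, Pi_prime, S, P_obs)  # Step 5
                if pi is not None:
                    return pi                                   # imitating policy for P(y) (Lem. 2)
    return None                                                 # FAIL

print(imitate(G=None, Pi=None, P_obs=None))   # None until the subroutines are supplied
```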
For details about the IDENTIFY oracle, we refer readers to [15, Appendix B]. More specifically, IMITATE takes as input a causal diagram G, a policy space Π and an observational distribution P(o). At Step 2, IMITATE applies a subroutine LISTIDSPACE to list identifiable subspaces Π′ w.r.t. ⟨G ∪ {Y}, Π, Y⟩, following the observation made in Lem. 3. The implementation details of LISTIDSPACE are provided in [15, Appendix C]. When an identifiable subspace Π′ is found, IMITATE tries to obtain a surrogate S w.r.t. the diagram G and the subspace Π′. While there could exist multiple such surrogates, the following proposition shows that it is sufficient to consider only minimal ones.

Lemma 4. Given a causal diagram G, a policy space Π, an observational distribution P(o) and a subset S ⊆ O, P(s) is p-imitable w.r.t. ⟨G, Π, P(o)⟩ only if, for any S′ ⊆ S, P(s′) is p-imitable w.r.t. ⟨G, Π, P(o)⟩.

We apply the subroutine LISTMINSEP of [43] to enumerate minimal surrogates in O that d-separate X̂ and Y in the supergraph G ∪ Π′. When a minimal surrogate S is found, IMITATE uses the IDENTIFY oracle to validate whether P(s|do(π)) is identifiable w.r.t. ⟨G, Π′⟩, i.e., whether ⟨S, Π′⟩ forms an instrument. Consider Fig. 1d as an example. While P(y|do(π)) is not identifiable for every policy in Π, even had Y been observed, Π contains an id-subspace {π : P_X} w.r.t. ⟨G ∪ {Y}, Π, Y⟩, which is associated with a minimal surrogate {S}. Applying IDENTIFY confirms that ⟨{S}, {π : P_X}⟩ is an instrument. At Step 5, IMITATE solves for a policy π in the subspace Π′ that imitates P(s) for all instances in the hypothesis class M⟨G,P⟩. If such a policy exists, IMITATE returns π; otherwise, the algorithm continues. Since ⟨S, Π′⟩ is an instrument, Lem. 2 implies that the learned policy π, if it exists, is ensured to imitate the expert's reward P(y) for any POSCM M ∈ M⟨G,P⟩.

Theorem 3. Given a causal diagram G, a policy space Π, and an observational distribution P(o), if IMITATE returns a policy π ∈ Π, then P(y) is p-imitable w.r.t. ⟨G, Π, P(o)⟩. Moreover, π is an imitating policy for P(y) w.r.t. ⟨G, Π, P(o)⟩.

3.2 Optimizing Imitating Policies

We now introduce optimization procedures to solve for an imitating policy at Step 5 of the IMITATE algorithm. Since the pair ⟨S, Π′⟩ forms a valid instrument (ensured by Step 4), the interventional distribution P(s|do(π); M) remains invariant among all models in M⟨G⟩, i.e., P(s|do(π)) is identifiable w.r.t. ⟨G, Π′⟩. We could thus express P(s|do(π); M) for any M ∈ M⟨G,P⟩ as a function of the observational distribution P(o); for simplicity, we write P(s|do(π)) = P(s|do(π); M). The imitating policy is obtainable by solving the equation P(s|do(π)) = P(s). We could derive a closed-form formula for P(s|do(π)) following standard causal identification algorithms in [41, 34, 6]. As an example, consider again the setting of Fig. 1c with binary X, W, S, Z; the parameters of P(x, w, s, z) could be summarized using an 8-entry probability table. The imitating policy π(x) is thus the solution of the system of linear equations Σ_x π(x)P(s|do(x)) = P(s) and Σ_x π(x) = 1, given by:

π(x₀) = (P(s₁|do(x₁)) − P(s₁)) / (P(s₁|do(x₁)) − P(s₁|do(x₀))),
π(x₁) = (P(s₁) − P(s₁|do(x₀))) / (P(s₁|do(x₁)) − P(s₁|do(x₀))).

Among the quantities in the above equations, x_i, s_j represent the assignments X = i, S = j for i, j ∈ {0, 1}.
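As a quick numerical check of the closed form above, the snippet below solves the same two linear equations with illustrative (made-up) values for P(s₁|do(x₀)), P(s₁|do(x₁)) and P(s₁).

```python
import numpy as np

# Illustrative quantities: P(S=1|do(X=0)), P(S=1|do(X=1)), and the expert's P(S=1).
p_s1_do_x0, p_s1_do_x1, p_s1 = 0.30, 0.70, 0.55

# Solve  pi(x0) + pi(x1) = 1  and  pi(x0) P(s1|do(x0)) + pi(x1) P(s1|do(x1)) = P(s1).
A = np.array([[1.0, 1.0],
              [p_s1_do_x0, p_s1_do_x1]])
b = np.array([1.0, p_s1])
pi_x0, pi_x1 = np.linalg.solve(A, b)
print(pi_x0, pi_x1)   # 0.375 0.625

# The closed form from the text gives the same values. It is valid when the two
# do-probabilities differ and the solution lands in [0, 1]; otherwise P(s) is not
# p-imitable within this one-dimensional subspace.
denom = p_s1_do_x1 - p_s1_do_x0
print((p_s1_do_x1 - p_s1) / denom, (p_s1 - p_s1_do_x0) / denom)   # 0.375 0.625
```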
The interventional distribution P(s|do(x)) could be identified from P(x, w, s, z) using the front-door adjustment formula P(s|do(x)) = Σ_w P(w|x) Σ_{x′} P(s|x′, w)P(x′) [29, Thm. 3.3.4]. However, evaluating the interventional probabilities P(s|do(π)) from the observational distribution P(o) could be computationally challenging if some variables in O are high-dimensional (e.g., W in Fig. 1d). Properties of the imitation instrument ⟨S, Π′⟩ suggest a practical approach to address this issue. Since P(s|do(π)) is identifiable w.r.t. ⟨G, Π′⟩, by Def. 2, it remains invariant over the models in the hypothesis class M⟨G⟩. This means that we could compute the interventional probabilities P(s|do(π); M) in an arbitrary model M ∈ M⟨G,P⟩; such an evaluation will always coincide with the actual, true causal effect P(s|do(π)) in the underlying model. This observation allows one to obtain an imitating policy through the direct parametrization of POSCMs [21]. Let N⟨G⟩ be a parametrized subfamily of POSCMs in M⟨G⟩. We could obtain a POSCM M ∈ N⟨G⟩ such that its observational distribution satisfies P(o; M) = P(o); an imitating policy is then computed in the parametrized model M. Corol. 3 shows that such a policy is an imitating policy for the expert's reward P(y).

Corollary 3. Given a causal diagram G, a policy space Π, and an observational distribution P(o), let ⟨S, Π′⟩ be an instrument w.r.t. ⟨G, Π⟩. If there exists a POSCM M ∈ M⟨G,P⟩ and a policy π ∈ Π′ such that P(s|do(π); M) = P(s), then P(y) is p-imitable w.r.t. ⟨G, Π, P(o)⟩. Moreover, π is an imitating policy for P(y) w.r.t. ⟨G, Π, P(o)⟩.

In practical experiments, we consider a parametrized family of POSCMs N⟨G⟩ where the functions associated with each observed variable in O are parametrized by a family of neural networks, similar to [21]. Using the computational framework of Generative Adversarial Networks (GANs) [9, 27], we obtain a model M ∈ N⟨G⟩ satisfying the observational constraint P(o; M) = P(o). The imitating policy is trained through explicit interventions in the learned model M; a different GAN is then deployed to optimize the policy so that it imitates the observed trajectories drawn from P(s).
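The GAN-based procedure above is beyond a short snippet. As a simplified stand-in (our own construction, not the authors' pipeline), the PyTorch sketch below optimizes a one-parameter policy by gradient descent so that the surrogate distribution P(s|do(π)), computed in an already-fitted proxy model, matches the observed P(s); the numbers are illustrative and tie back to the closed-form example above.

```python
import torch

# Assumed to come from a proxy POSCM M̂ already fit to P(o): the interventional
# probabilities P(S=1 | do(X=x); M̂) for x = 0, 1 (illustrative numbers).
q = torch.tensor([0.30, 0.70])
p_s1_observed = torch.tensor(0.55)      # expert's P(S=1) estimated from trajectories

theta = torch.zeros(1, requires_grad=True)           # logit of pi(X = 1)
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(2000):
    pi_x1 = torch.sigmoid(theta)
    p_s1_do_pi = (1 - pi_x1) * q[0] + pi_x1 * q[1]   # P(S=1 | do(pi); M̂)
    loss = (p_s1_do_pi - p_s1_observed) ** 2         # distance between surrogate distributions
    opt.zero_grad(); loss.backward(); opt.step()

print(float(torch.sigmoid(theta)))   # -> ~0.625, matching the closed form above
```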
4 Experiments

We demonstrate our algorithms on several synthetic datasets, including highD [18], which consists of natural trajectories of human-driven vehicles, and on MNIST digits. In all experiments, we test our causal imitation method (ci): we apply Thm. 2 when there exists a π-backdoor admissible set; otherwise, Alg. 1 is used to leverage the observational distribution. As a baseline, we also include naïve behavior cloning (bc), which mimics the observed conditional distribution P(x|pa(Π)), as well as the actual reward distribution generated by the expert (opt). We found that our algorithms consistently imitate the distributions over the expert's reward in imitable (p-imitable) cases, and that p-imitable instances commonly exist. We refer readers to [15, Appendix D] for more experiments, details, and analysis.

Figure 4: (a) Causal diagram for the highway driving example where a left-side car exists; (b, c) P(y|do(π)) induced by the causal imitation method (ci) and naive behavior cloning (bc), compared with the actual distribution P(y) over the expert's reward (opt). Panels: (a) Highway Driving, (b) Highway Driving, (c) MNIST Digits.

Highway Driving. We consider a modified example of the drone recordings of human-driven cars in Sec. 1 where the braking action W of the left-side car's driver is also observed. Fig. 4a shows the causal diagram of this environment; Z represents the velocity of the front car; the action X represents the velocity of the driving car; W and the reward signal Y are both affected by an unobserved confounder U, representing the weather condition. In Fig. 4a, {Z} is π-backdoor admissible while {Z, W} is not, due to the active path X ← L → W ↔ Y. We obtain policies for the causal and naive imitators by training two separate GANs. The distributions P(y|do(π)) induced by all algorithms are reported in Fig. 4b. We also measure the L1 distance between P(y|do(π)) and the expert's reward distribution P(y). We find that the causal approach (ci), using the input set {Z}, successfully imitates P(y) (L1 = 0.0018). As expected, the naive approach (bc) utilizing all covariates {Z, W} is unable to imitate the expert (L1 = 0.2937).

MNIST Digits. We consider an instance of Fig. 1c where X, S, Y are binary variables; the binary values of W are replaced with corresponding images of MNIST digits (pictures of 1 or 0), determined based on the action X. For the causal imitator (ci), we learn a POSCM M̂ such that P(x, w, s; M̂) = P(x, w, s). To obtain M̂, we train a GAN to imitate the observational distribution P(x, w, s), with a separate generator for each of X, W, S. We then train a separate discriminator measuring the distance between the observed trajectories P(s) and the interventional distribution P(s|do(π); M̂) over the surrogate {S}. The imitating policy is obtained by minimizing this distance. The distributions P(y|do(π)) induced by all algorithms are reported in Fig. 4c. We find that the causal approach (ci) successfully imitates P(y) (L1 = 0.0634). As expected, the naive approach (bc), mimicking the distribution P(x), is unable to imitate the expert (L1 = 0.1900).

5 Conclusion

We investigate imitation learning in the semantics of structural causal models. The goal is to find an imitating policy that mimics the expert's behavior from combinations of demonstration data and qualitative knowledge about the data-generating process, represented as a causal diagram. We provide a graphical criterion that is complete (i.e., sufficient and necessary) for determining the feasibility of learning an imitating policy that mimics the expert's performance. We also study a data-dependent notion of imitability that depends on the observational distribution. An efficient algorithm is introduced which finds an imitating policy by exploiting quantitative knowledge contained in the observational data and the presence of surrogate endpoints. Finally, we propose a practical procedure for estimating such an imitating policy from observed trajectories of the expert's demonstrations.

Broader Impact

This paper investigates the theoretical framework of learning a policy that imitates the distribution over a primary outcome from natural trajectories of an expert demonstrator, even when the primary outcome itself is unobserved and the input covariates used by the expert to determine the original values of the action are unknown. Since, in practice, the actual reward is often unspecified and the learner and the demonstrator rarely observe the environment in the same fashion, our methods are likely to advance the progress of automated decision systems. Such systems may be applicable to various fields, including the development of autonomous vehicles, industrial automation and the management of chronic disease. These applications may have a broad spectrum of societal implications. The adoption of autonomous driving and industrial automation systems could save costs and reduce risks such as occupational injuries, while it could also create unemployment. Treatment recommendations in clinical decision support systems could alleviate the stress on healthcare workers.
However, this also raises questions concerning accountability in cases of medical malpractice; the collection of private personal information could also make hospital databases valuable targets for malicious hackers. Overall, we would encourage research to understand the risks arising from automated decision systems and mitigations for their negative impacts.

Recently, a growing number of datasets of natural vehicle trajectories, like highD [18], are being licensed for commercial use. An immediate positive impact of this work is that we discuss the potential risk of training decision-making policies from observational data due to the presence of unobserved confounding, as shown in Sec. 1 and 4. More broadly, since our method is based on the semantics of structural causal models [29, Ch. 7], its adoption could cultivate machine learning practitioners with proper training in causal reasoning. A favorable characteristic of causal inference methods is that they are inherently robust: for example, the definition of imitability (Def. 3) requires the imitating policy to perfectly mimic the expert's performance in any model compatible with the causal diagram. Automated decision systems using causal inference methods prioritize safety and robustness in decision-making, which is increasingly essential since the use of black-box AI systems is prevalent and our understanding of their potential implications is still limited.

Acknowledgments

The authors were partially supported by grants from NSF IIS-1704352 and IIS-1750807 (CAREER).

References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.
[2] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469-483, 2009.
[3] E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113:7345-7352, 2016.
[4] A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Survey: Robot programming by demonstration. In Handbook of Robotics, chapter 59, 2008.
[5] J. Correa and E. Bareinboim. From statistical transportability to estimating the effect of stochastic interventions. In S. Kraus, editor, Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 1661-1667, Macao, China, 2019. International Joint Conferences on Artificial Intelligence Organization.
[6] J. Correa and E. Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
[7] P. de Haan, D. Jayaraman, and S. Levine. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pages 11693-11704, 2019.
[8] J. Etesami and P. Geiger. Causal transfer for imitation learning and decision making under sensor-shift. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[10] O. Goudet, D. Kalainathan, P. Caillou, I. Guyon, D. Lopez-Paz, and M. Sebag. Causal Generative Neural Networks. arXiv:1711.08936 [stat], Nov. 2017.
[11] O. Goudet, D. Kalainathan, P. Caillou, D. Lopez-Paz, I. Guyon, M. Sebag, A. Tritas, and P. Tubaro. Learning Functional Causal Models with Generative Neural Networks. arXiv:1709.05321 [stat], Sept. 2017.
[12] R. D. Hjelm, A. P. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio. Boundary-Seeking Generative Adversarial Networks. arXiv:1702.08431 [cs, stat], Feb. 2018.
[13] Y. Huang and M. Valtorta. Pearl's calculus of intervention is complete. In R. Dechter and T. Richardson, editors, Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 217-224. AUAI Press, Corvallis, OR, 2006.
[14] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1-35, 2017.
[15] J. Zhang, D. Kumor, and E. Bareinboim. Causal imitation learning with unobserved confounders. Technical Report R-66, Causal AI Lab, Columbia University, 2020.
[16] M. Kocaoglu, C. Snyder, A. G. Dimakis, and S. Vishwanath. CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training. arXiv:1709.02023 [cs, math, stat], Sept. 2017.
[17] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[18] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein. The highD dataset: A drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2118-2125, 2018.
[19] S. Lee and E. Bareinboim. Structural causal bandits with non-manipulable variables. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 4164-4172, Honolulu, Hawaii, 2019. AAAI Press.
[20] S. Lee and E. Bareinboim. Causal effect identifiability under partial-observability. In Proceedings of the 37th International Conference on Machine Learning (ICML-20), 2020.
[21] C. Louizos, U. Shalit, J. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal Effect Inference with Deep Latent-Variable Models. arXiv:1705.08821 [cs, stat], May 2017.
[22] J. Mahler and K. Goldberg. Learning deep policies for robot bin picking by simulating robust grasping sequences. In Conference on Robot Learning, pages 515-524, 2017.
[23] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun. Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems, pages 739-746, 2006.
[24] K. Mülling, J. Kober, O. Kroemer, and J. Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263-279, 2013.
[25] A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, pages 663-670, 2000.
[26] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, Nov. 2010.
[27] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv:1606.00709 [cs, stat], June 2016.
[28] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, J. Peters, et al. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1-179, 2018.
[29] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000; 2nd edition, 2009.
[30] J. Pearl. Remarks on the method of propensity scores. Statistics in Medicine, 28:1415-1416, 2009.
[31] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pages 305-313, 1989.
[32] P. Rosenbaum and D. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41-55, 1983.
[33] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 2002.
[34] I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 1219-1226, 2006.
[35] I. Shpitser and E. Sherman. Identification of personalized effects associated with causal pathways. In UAI, 2018.
[36] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.
[37] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[38] U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems, pages 1449-1456, 2008.
[39] J. Tian. Studies in Causal Reasoning and Learning. PhD thesis, Computer Science Department, University of California, Los Angeles, CA, November 2002.
[40] J. Tian. Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 554-561, 2008.
[41] J. Tian and J. Pearl. A general identification condition for causal effects. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 567-573, Menlo Park, CA, 2002. AAAI Press/The MIT Press.
[42] J. Tian and J. Pearl. A general identification condition for causal effects. Technical Report R-290-A, Department of Computer Science, University of California, Los Angeles, CA, 2003.
[43] B. van der Zander, M. Liśkiewicz, and J. Textor. Constructing separators and adjustment sets in ancestral graphs. In Proceedings of the UAI 2014 Conference on Causal Inference: Learning and Prediction - Volume 1274, pages 11-24, 2014.
[44] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.
[45] B. Widrow. Pattern-recognizing control systems. Computer and Information Sciences, 1964.
[46] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433-1438, Chicago, IL, USA, 2008.