# Deconfounding Imitation Learning with Variational Inference

Published in Transactions on Machine Learning Research (09/2024)

Risto Vuorio^{1,2}, Pim de Haan^{2,3}, Johann Brehmer^2, Hanno Ackermann^2, Daniel Dijkman^2, and Taco Cohen^2

^1 University of Oxford. ^2 Qualcomm AI Research; Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc. ^3 QUVA Lab, University of Amsterdam.

Reviewed on OpenReview: https://openreview.net/forum?id=3FsVtsISHW

Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent. This is because partial observability gives rise to hidden confounders in the causal graph. Previously, to work around the confounding problem, policies have been trained by accessing the expert's policy or using inverse reinforcement learning (IRL). However, both approaches have drawbacks, as the expert's policy may not be available and IRL can be unstable in practice. Instead, we propose to train a variational inference model to infer the expert's latent information and use it to train a latent-conditional policy. We prove that using this method, under strong assumptions, the identification of the correct imitation learning policy is theoretically possible from expert demonstrations alone. In practice, we focus on a setting with less strong assumptions where we use exploration data for learning the inference model. We show in theory and practice that this algorithm converges to the correct interventional policy, solves the confounding issue, and can under certain assumptions achieve asymptotically optimal imitation performance.

## 1 Introduction

(a) Expert. (b) Ignorant policy.
Figure 1: Bayes nets for (a) an expert trajectory and (b) an imitator trajectory. The expert action depends on the latent variable θ (red arrows), whereas the imitator action does not.
Successful training of policies via behavioral cloning (BC) requires a high-quality expert dataset. The conditions under which the dataset is collected have to exactly match those encountered by the imitator. Sometimes such data collection may not be feasible. For example, imagine collecting data from human drivers for training a driving policy for a self-driving car. The drivers are aware of the weather forecast and lower their speed in icy conditions, even when the ice is not visible on the road. However, if the driving policy in the self-driving car does not have access to the same forecast, it is unaware of the ice on the road and may thus be unable to adapt to the conditions. In this paper, we focus on such imitation learning in settings where the expert knows more about the world than the imitator.

*Equal contribution. Corresponding author: risto.vuorio@gmail.com. Work done during internship at Qualcomm AI Research.

The information the expert observes about the world but the imitator does not can be modeled as a latent variable that affects the dynamics of the environment. Imitating an expert who observes the latent variable, with a policy that does not, results in the learned policy marginalizing over its uncertainty about the latent variable. This can result in poor performance, as the agent always acts randomly in states where the expert acts on its knowledge about the latent. To fix this, we let the imitator policy depend on the entire history of interaction with the environment instead of just the current observation. This enables the agent to make an inference about the value of the latent variable based on everything it has observed so far, which eventually allows it to break ties over the actions in states where the latent variable matters.
However, this introduces another problem: the latent variable of the environment acts as a causal confounder in the graph modeling the interaction of the agent with its environment, as illustrated in Figure 1. In such a situation, an imitating agent may take its own past actions as evidence for the value of the confounder. The self-driving car, for instance, could conclude that since it is driving fast, there can be no ice on the road. This issue of causal delusion was first pointed out by Ortega & Braun (2010a;b) and studied in more depth by Ortega et al. (2021). It can be a source of delusions in generative models, including large language models (Ortega et al., 2021). Instead of general generative modeling, we focus on a control setting where we have access to the true environment in which the agent is going to be deployed.

Ortega et al. (2021) show that the classic dataset aggregation (DAgger) algorithm (Ross et al., 2011) solves this problem by querying the expert policy in the new situations the agent encounters, providing new supervision for learning the correct behavior. However, this kind of query access to the expert may not be available in practice. As another solution to the confounded imitation learning problem, Swamy et al. (2022b) propose to use inverse reinforcement learning (IRL) (Russell, 1998; Ng et al., 2000). However, IRL typically requires adversarial methods (Ho & Ermon, 2016; Fu et al., 2017), which are not as well behaved and scalable as supervised learning.

To get around the confounding problem without query access to the expert or IRL, we propose a practical algorithm based on variational inference. In our approach, an inference model for the latent variable is learned from exploration data collected using the imitator's policy.
This inference model is used for inferring the latent variable on the expert trajectories for training the imitator, and for inferring the latent variable online when the imitator is deployed in the environment. We show that, in theory, this algorithm converges to the expert's policy; in other words, we show that the expert's policy can be identified. Furthermore, we show that under strong assumptions this identification can be carried out purely from offline data. Finally, we validate its performance empirically on a set of confounded imitation learning problems with high-dimensional observation and action spaces. These contributions can be summarized as follows:

- We propose a practical method based on variational inference to address confounded IL without expert queries.
- We verify its performance empirically in various confounded control problems, outperforming naive BC and IRL.
- We propose a theoretical method for confounded IL, purely from offline data, without expert queries or exploration.
- We provide theoretical insight into the proposed methods by proving, under strong assumptions, that the expert's policy can be identified.

## 2 Related work

**Imitation learning** Learning from demonstration has a long history, with applications in autonomous driving (Pomerleau, 1988; Lu et al., 2022) and robotics (Schaal, 1999; Padalkar et al., 2023). Standard algorithms include BC, IRL (Russell, 1998; Ng et al., 2000; Ziebart et al., 2008), and adversarial methods (Ho & Ermon, 2016; Fu et al., 2017). Imitation learning can suffer from a mismatch between the distributions faced by the expert and the imitator due to the accumulation of errors when rolling out the imitator policy. This is commonly addressed by querying experts during training (Ross et al., 2011) or by noise insertion in the expert actions (Laskey et al., 2017).
Note that this issue is qualitatively different from the one we discuss in this paper: it is a consequence of the limited support of the expert actions and occurs even in the absence of latent confounders. Kumar et al. (2021) and Peng et al. (2020) consider a setting similar to ours, where the environment has a latent variable that explains the dynamics. They use privileged information to learn the dynamics encoder via supervised learning. In contrast, in our setting only the expert has access to privileged information.

**Causality-aware imitation learning** Ortega & Braun (2010a;b) and Ortega et al. (2021) pointed out the issue of latent confounding and causal delusions that we discuss in this paper. In particular, Ortega et al. (2021) propose a training algorithm that learns the correct interventional policy. Unlike our algorithm, their approach requires querying experts during training. However, as we discuss in section 4, their solution has weaker assumptions and also applies to non-Markovian dynamics. Most similar to our work is Swamy et al. (2022b), which finds theoretical bounds on the gap between expert behavior and an imitator in the confounded setting when imitating via BC, DAgger (Ross et al., 2011), which requires expert queries, or inverse RL. Inverse RL suffers from two key challenges: (1) it requires reinforcement learning in an inner loop with a non-stationary reward, and (2) the reward is typically learned via potentially unstable adversarial methods (Ho & Ermon, 2016; Fu et al., 2017). In contrast, our method trains the behavior policy using a well-behaved BC objective, which often enjoys better robustness and scalability than inverse RL.

There are various other works at the intersection of causality and IL which differ in setup. De Haan et al.
(2019) consider the confusion of an imitator when the expert's decisions can be explained by causal and non-causal features of the state. This differs from our work, as they assume the state to be fully observed, so their setting does not include a latent confounder that needs to be inferred. This problem has also been discussed in Codevilla et al. (2019), Wen et al. (2020; 2022), and Spencer et al. (2021) under various names. Rezende et al. (2020) point out that the same problem appears in partial models that use only a subset of the state, and find a minimal set of variables that avoids confounding. Swamy et al. (2022a) consider imitation learning with latent variables that affect the expert policy but not the state dynamics, which differs from our case, in which the latent affects both the state and the expert's actions. Kumor et al. (2021) study the case in which a graphical model of the partially observed state is known and determine which variables can be adjusted for so that conditional BC is optimal. An extension, Ruan et al. (2022), also handles sub-optimal experts.

**Meta-learning behaviors** Our problem is related to meta-IL (Duan et al., 2017; Beck et al., 2023), where the aim is to train an imitation learning agent that can adapt to new demonstration data efficiently. Differently from our problem, the tasks in meta-IL can vary in the reward function. For imitation learning to work with the new reward functions, demonstrations of policies maximizing the new rewards are required. While meta-IL also considers a distribution of MDPs, the motivations and methods are different from ours: our work does not consider adapting to new demonstrations, whereas meta-IL does not consider the confounding problem in the demonstrations. Furthermore, our problem is related to meta-reinforcement learning (RL) (Duan et al., 2016; Wang et al., 2016; Beck et al., 2023), where an adaptive agent is trained to learn new tasks quickly via RL. Rakelly et al.
(2019) and Zintgraf et al. (2020) propose meta-RL algorithms that consist of a task encoder and a task-conditional policy, similar to our inference model and latent-conditional policy. Zhou et al. (2019) propose agents that learn to probe the environment to determine the latent variables explaining the dynamics. Differently from our problem, the true reward function of the task is known.

## 3 Background

We begin by introducing confounded imitation learning. Following Ortega et al. (2021), we discuss how BC fails in the presence of latent confounders. We then define the interventional policy, the ideal (but a priori intractable) solution to the confounding problem.

### 3.1 Imitation learning

Imitation learning learns a policy from a dataset of expert demonstrations via supervised learning. The expert is a policy that acts in a (reward-free) Markov decision process (MDP) defined by a tuple $M = (S, A, P(s' \mid s, a), P(s_0))$, where $S$ is the set of states, $A$ is the set of actions, $P(s' \mid s, a)$ is the transition probability, and $P(s_0)$ is a distribution over initial states. The expert's interaction with the environment produces a trajectory $\tau = (s_0, a_0, \ldots, a_{T-1}, s_T)$. The expert may maximize the expectation of some reward function, but this is not necessary (and some tasks cannot be expressed through Markov rewards (Abel et al., 2021)). In the simplest form of imitation learning, a BC policy $\pi_\eta(a \mid s)$ parametrized by $\eta$ is learned by maximizing the likelihood of the expert data, i.e., minimizing the loss $-\sum_{(s,a) \in D} \log \pi_\eta(a \mid s)$, where $D$ is the dataset of state-action pairs collected by the expert's policy.

### 3.2 Confounded imitation learning

We now extend the imitation learning setup to allow for latent variables $\theta \in \Theta$ that are observed by the expert, but not by the imitator.
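As a minimal concrete illustration of the BC objective from section 3.1, the tabular maximum-likelihood solution can be computed in closed form from empirical counts. This is only a sketch with a toy hand-made dataset; in practice $\pi_\eta$ is typically a neural network trained by stochastic gradient descent on the same negative log-likelihood loss.

```python
from collections import Counter, defaultdict

def behavioral_cloning_tabular(dataset):
    """Maximum-likelihood BC for discrete states and actions.

    Minimizing -sum_{(s,a) in D} log pi(a|s) over all tabular policies
    yields the empirical conditional pi(a|s) = count(s, a) / count(s).
    """
    pair_counts = Counter(dataset)                 # (s, a) -> count
    state_counts = Counter(s for s, _ in dataset)  # s -> count
    policy = defaultdict(dict)
    for (s, a), c in pair_counts.items():
        policy[s][a] = c / state_counts[s]
    return policy

# Toy expert data: in state 0 the expert picks action 1 two times out of three.
demos = [(0, 1), (0, 1), (0, 0), (1, 0)]
pi = behavioral_cloning_tabular(demos)
# pi[0] == {1: 2/3, 0: 1/3}, pi[1] == {0: 1.0}
```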
We define a family of Markov decision processes by a latent space $\Theta$, a distribution $P(\theta)$, and, for each $\theta \in \Theta$, a reward-free MDP $M_\theta = (S, A, P(s' \mid s, a, \theta), P(s_0 \mid \theta))$. Here, we make the crucial assumption that the latent variable is constant within each trajectory. We assume there exists an expert policy $\pi_{\text{exp}}(a \mid s, \theta)$ for each MDP. When it interacts with the environment, it generates the following distribution over trajectories $\tau$:

$$P_{\text{exp}}(\tau \mid \theta) = P(s_0 \mid \theta) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t; \theta)\, \pi_{\text{exp}}(a_t \mid s_t; \theta).$$

In this setting, the trajectories from the expert distributions are called confounded, because the states and actions have a common ancestor, the latent variable $\theta$. The imitator does not observe this latent variable. It may thus need to infer it implicitly from the past transitions, so we take it to be a non-Markovian policy $\pi_\eta(a_t \mid s_0, a_0, \ldots, s_t)$, parameterized by $\eta$. The imitator generates the following distribution over trajectories:

$$P_\eta(\tau \mid \theta) = P(s_0 \mid \theta) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t; \theta)\, \pi_\eta(a_t \mid s_0, a_0, \ldots, s_t).$$

The Bayesian networks associated with these distributions are shown in Figure 1. The goal of imitation learning in this setting is to learn imitator parameters $\eta$ such that, when the imitator is executed in the environment, it agrees with the expert's decisions. This means we wish to maximize

$$\mathbb{E}_{\theta \sim P(\theta)}\, \mathbb{E}_{\tau \sim P_\eta(\tau; \theta)}\Big[ \textstyle\sum_{(s_t, a_t) \in \tau} \log \pi_{\text{exp}}(a_t \mid s_t, \theta) \Big].$$

If the expert solves some task (e.g., maximizes some reward function), this amounts to solving the same task. The latent variable $\theta$ stays fixed during the entire interaction of the agent with the environment.

### 3.3 Naive behavioral cloning

If we have access to a dataset of expert demonstrations, we can use BC on the demonstrations to learn the maximum likelihood estimate of the expert's policy. At optimality, this learns the conditional policy:

$$\pi_{\text{cond}}(a_t \mid s_1, a_1, \ldots, s_t) = \mathbb{E}_{\theta \sim p_{\text{cond}}(\theta \mid \tau)}\big[ \pi_{\text{exp}}(a_t \mid s_t, \theta) \big], \qquad (1)$$

$$p_{\text{cond}}(\theta \mid \tau) \propto p(\theta) \prod_t p(s_{t+1} \mid s_t, a_t, \theta)\, \pi_{\text{exp}}(a_t \mid s_t, \theta).$$

Following Ortega et al.
(2021), consider the following example of a confounded multi-armed bandit with $A = \Theta = \{1, \ldots, 5\}$, $S = \{0, 1\}$, and a uniform prior $P(\theta) = 1/5$:

$$\pi_{\text{exp}}(a_t \mid s_t, \theta) = \begin{cases} \tfrac{6}{10} & \text{if } a_t = \theta \\ \tfrac{1}{10} & \text{if } a_t \neq \theta \end{cases}, \qquad P(s_{t+1} = 1 \mid s_t, a_t, \theta) = \begin{cases} \tfrac{3}{4} & \text{if } a_t = \theta \\ \tfrac{1}{4} & \text{if } a_t \neq \theta. \end{cases} \qquad (2)$$

Figure 2 (panels: Conditional, Interventional): Actions from rollouts in the bandit environment defined by Equation 2. The x-axis is episode time. On the y-axis, five rollouts are shown from the expert and from the policies of Equation 1 and Equation 3. Colors denote actions, with the correct arm labeled green. The interventional imitator tends to the expert policy, while the conditional policy tends to repeat itself.

The expert knows which bandit arm is special (labeled by $\theta$) and pulls it with high probability, while the imitating agent does not have access to this information. We define the reward in this bandit environment as $r_t = s_{t+1}$. Note, however, that we compare imitation learning algorithms that do not access the reward of the environment; the rewards are only used for comparing the learned behaviors at evaluation time.

If we roll out the naive BC policy in this environment, as shown in Figure 2, we see the causal delusion at work. At time $t$, inferring the latent by $p_{\text{cond}}$ takes past actions as evidence for the latent variable. This makes sense on the expert demonstrations, as the expert knows the latent variable. However, during an imitator rollout, the past actions are not evidence of the latent, as the imitator is blind to it. Concretely, the imitator takes its first action uniformly and later tends to repeat that action, as it mistakenly takes the first action to be evidence for the latent.

### 3.4 Interventional policy

A solution to this issue is to take as evidence only the data that was actually informed by the latent, which is the dynamics distribution $p(s_{t+1} \mid s_t, a_t, \theta)$. This defines the following imitator policy:

$$\pi_{\text{int}}(a_t \mid s_1, a_1, \ldots, s_t) = \mathbb{E}_{\theta \sim p_{\text{int}}(\theta \mid \tau)}\big[ \pi_{\text{exp}}(a_t \mid s_t, \theta) \big], \qquad p_{\text{int}}(\theta \mid \tau) \propto p(\theta) \prod_t p(s_{t+1} \mid s_t, a_t, \theta). \qquad (3)$$

In a causal framework, this corresponds to treating the choice of past actions as interventions. To see this, consider the distribution $p(a_t, s_1, \ldots, s_t \mid \text{do}(a_1, \ldots, a_{t-1}))$. By the classic rules of the do-calculus (Pearl, 2009), this is equal to $\int_\Theta p(\theta)\, p(a_t \mid s_t, \theta) \prod_{t' < t} p(s_{t'+1} \mid s_{t'}, a_{t'}, \theta)\, d\theta$.
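The delusion in the bandit of Equation 2 can be checked numerically by computing both posteriors on an imitator rollout. The sketch below is ours, not from the paper (arms are 0-indexed and all names are illustrative); it compares $p_{\text{cond}}$ from Equation 1, which multiplies in the expert-policy likelihood, against $p_{\text{int}}$ from Equation 3, which uses only the dynamics likelihood.

```python
import numpy as np

N_ARMS = 5  # theta and actions both range over five values (0-indexed here)

def dynamics_lik(s_next, a, theta):
    # P(s_{t+1} = 1 | s_t, a_t, theta) = 3/4 if a_t == theta else 1/4 (Equation 2)
    p_one = 0.75 if a == theta else 0.25
    return p_one if s_next == 1 else 1.0 - p_one

def expert_lik(a, theta):
    # pi_exp(a_t | s_t, theta) = 6/10 if a_t == theta else 1/10 (Equation 2)
    return 0.6 if a == theta else 0.1

def posterior(transitions, include_policy_term):
    """Posterior over theta given (action, next-state) pairs, uniform prior.

    include_policy_term=True multiplies in the expert-policy likelihood,
    giving p_cond (Equation 1); False uses only the dynamics likelihood,
    giving p_int (Equation 3).
    """
    w = np.full(N_ARMS, 1.0 / N_ARMS)  # prior p(theta) = 1/5
    for a, s_next in transitions:
        for theta in range(N_ARMS):
            w[theta] *= dynamics_lik(s_next, a, theta)
            if include_policy_term:
                w[theta] *= expert_lik(a, theta)
    return w / w.sum()

# An imitator that pulled arm 2 five times and never received reward (s_next = 0):
rollout = [(2, 0)] * 5
p_cond = posterior(rollout, include_policy_term=True)
p_int = posterior(rollout, include_policy_term=False)
# p_cond[2] is large (~0.89): past actions are mistaken for evidence of theta = 2.
# p_int[2] is tiny (~0.001): the failed pulls are dynamics evidence against theta = 2.
```

This mirrors Figure 2: the conditional posterior treats the imitator's own past actions as evidence and locks onto the first arm pulled, while the interventional posterior listens only to the environment dynamics.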