# A Coupled Flow Approach to Imitation Learning

Gideon Freund 1, Elad Sarafian 1, Sarit Kraus 1

1 Department of Computer Science, Bar-Ilan University, Israel. Correspondence to: Gideon Freund.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

In reinforcement learning and imitation learning, an object of central importance is the state distribution induced by the policy. It plays a crucial role in the policy gradient theorem, and references to it, along with the related state-action distribution, can be found all across the literature. Despite its importance, the state distribution is mostly discussed indirectly and theoretically rather than modeled explicitly, largely due to an absence of appropriate density estimation tools. In this work, we investigate applications of a normalizing flow-based model for the aforementioned distributions. In particular, we use a pair of flows coupled through the optimality point of the Donsker-Varadhan representation of the Kullback-Leibler (KL) divergence, for distribution matching based imitation learning. Our algorithm, Coupled Flow Imitation Learning (CFIL), achieves state-of-the-art performance on benchmark tasks with a single expert trajectory and extends naturally to a variety of other settings, including the subsampled and state-only regimes.

1. Introduction

Reinforcement learning (RL) (Sutton & Barto, 2018) concerns the optimization of an agent's behavior in an environment. Its characteristic difficulties, exploration versus exploitation and credit assignment, stem from its typical incarnations, in which the agent must learn from sparse feedback. To provoke desired behavior, one may need to craft a sophisticated reward function or provide demonstrations for the agent to imitate. Imitation learning (IL) deals precisely with the latter: learning from expert demonstrations. Although RL and IL share the same ultimate goal of producing a good policy, they differ fundamentally in that RL is guided by the environment's feedback, while IL is guided only by the ability of the agent to reproduce the expert's behavior.

Central to both RL and IL are the state and state-action distributions induced by the policy. Their importance cannot be overstated: the state distribution forms the basis of policy gradient methods through the policy gradient theorem (Sutton et al., 2000), and the state-action distribution is core to the common distribution matching formulation of IL (Ke et al., 2020; Ghasemipour et al., 2020). They are also foundational to other applications, like curiosity-based exploration (Pathak et al., 2017), constrained RL (Qin et al., 2021) and batch RL (Fujimoto et al., 2019), some of which have recently been unified under the umbrella of convex RL (Zhang et al., 2020; Zahavy et al., 2021; Mutti et al., 2022), which studies objectives that are convex functions of the state distribution.

Despite their ubiquity across the literature, explicit modeling of these distributions is scarce. Instead, they mostly serve as a theoretical tool for derivations, barring some approaches that do attempt to model them (Hazan et al., 2019; Qin et al., 2021; Lee et al., 2019; Kim et al., 2021b) or their ratios (Nachum et al., 2019; Liu et al., 2018; Gangwani et al., 2020), further discussion of which is delegated to the related work.
The reason for this lack of modeling is due to the difficulty of density estimation, especially in the case of complex and high-dimensional distributions. This is where normalizing flows (NF) (Dinh et al., 2014; 2016; Papamakarios et al., 2017), a recent approach to density estimation, will be of service. We believe there is no shortage of applications for such modeling, but we focus on imitation learning, a quite natural and suitable place, given its modern formulation as stateaction distribution matching (Ghasemipour et al., 2020). Many approaches to distribution matching based imitation have been presented (Ho & Ermon, 2016; Fu et al., 2017; Kostrikov et al., 2018; Ke et al., 2020; Ghasemipour et al., 2020; Kostrikov et al., 2019; Sun et al., 2021; Kim et al., 2021b; Dadashi et al., 2020; Schroecker, 2020). The common theme for such methods begins with the selection of a divergence, followed by the development of a unique approach. This may involve a direct attack on the objective by reformulating it (Kostrikov et al., 2019), or by derivation of a surrogate objective (Zhu et al., 2020; Dadashi et al., A Coupled Flow Approach to Imitation Learning 2020; Kim et al., 2021b), with some utilizing mechanisms such as an inverse action model (Zhu et al., 2020) and focusing on learning from states alone (a setting our approach naturally lends to). Other methods first derive estimates for the gradient of the state distribution with respect to the policy s parameters (Schroecker, 2020), while some devise unifying algorithms and frameworks encompassing previous approaches (Ke et al., 2020; Ghasemipour et al., 2020). The most popular divergence of choice is the reverse KL, which some favor due to its mode-seeking behavior (Ghasemipour et al., 2020). Others attempt to get the best of both worlds, combining both mode-seeking and modecovering elements (Zhu et al., 2020). A priori, it is difficult to say which choice of divergence is advantageous, it s more about the ensuing approach to its minimization. In this work, we propose a unique approach to distribution matching based imitation, by coupling a pair of flows through the optimality point of the Donsker-Varadhan (Donsker & Varadhan, 1976) representation of the KL. More specifically, by noting this point occurs at the log distribution ratio, while the IL objective with the reverse KL can be seen as an RL problem with the ratio inverted. We propose setting the point of optimality as the difference of two normalizing flows, then training in an alternating fashion akin to other adversarial IL methods (Ho & Ermon, 2016; Kostrikov et al., 2018; Ghasemipour et al., 2020; Kostrikov et al., 2019; Sun et al., 2021). This method proves far more accurate than estimating the log distribution ratio by naively training a pair of flows independently. We show this in part by analyzing their respective BC graphs: a simple tool we present for gauging how well a proposed estimator captures the expert s behavior. While most IL works neglect analysis of their learned reward function, we think this can be a potential guiding tool for future IL researchers. Our resulting algorithm, Coupled Flow Imitation Learning (CFIL) shows strong performance on standard benchmark tasks, while extending naturally to the subsampled and stateonly regimes. In the state-only regime in particular, CFIL exhibits significant advantage over prior state-of-the-art work, despite the competition being specifically designed for that domain. 
This work also aims to inspire more research incorporating explicit modeling of the state-action distribution.

2. Background

2.1. Markov Decision Process

Environments in RL are typically expressed as a Markov decision process (MDP). An MDP is a tuple $(S, A, P, R, p_0, \gamma)$, where $S$ is a set of states termed the state space, $A$ is a set of actions termed the action space, $P: S \times A \times S \to [0, \infty)$ is a transition density function describing the environment's Markovian dynamics, $R: S \times A \times S \to \mathbb{R}$ is a reward function, $p_0$ is an initial state distribution, and $\gamma \in [0, 1)$ is a discount factor. A policy $\pi: S \to A$ dictates an agent's actions when roaming the environment, and the goal in an MDP is to find the optimal one, denoted $\pi^*$. The standard optimality criterion for a policy is the expected discounted reward $J(\pi, r) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\tau \sim \pi$ symbolizes the trajectory distribution induced by the policy $\pi$, and $r_t$ is the reward at time $t$ along a trajectory.

Each policy has an associated value function and Q-function, $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$ and $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$, where the value function represents the expected discounted reward of following $\pi$ from state $s$, while the Q-function represents the expected discounted reward of following $\pi$ after taking action $a$ from state $s$.

Another object induced by the policy is the improper discounted state distribution $d^\gamma_\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid s_0 \sim p_0)$, as well as its undiscounted counterpart $d_\pi(s) = \sum_{t=0}^{\infty} \Pr(s_t = s \mid s_0 \sim p_0)$, in which we take greater interest for reasons described in Appendix C. $d_\pi(s)$ is called the state distribution and is of central interest to this work. Closely related to $d_\pi(s)$ is the state-action distribution $p_\pi(s, a) = \pi(a \mid s)\, d_\pi(s)$, which is also of interest to this work.

2.2. Normalizing Flows

Normalizing flows (NF) (Dinh et al., 2014; 2016) are exact-likelihood generative models that use a simple base distribution $p_Z(z)$ and a sequence of invertible and differentiable transformations $f_\theta = f_k \circ f_{k-1} \circ \cdots \circ f_1$ to model an arbitrary target distribution $p_X(x)$ as $z = f_\theta(x)$.¹ This means the target log-likelihood can be written using the change of variables formula:

$$\log p_X(x) = \log p_Z(z) + \sum_{i=1}^{k} \log \left| \det \frac{\partial f_i}{\partial h_{i-1}} \right|, \tag{1}$$

where $h_0 = x$, $h_i = f_i(h_{i-1})$ and $z = h_k$, allowing the parameters $\theta$ to be trained using maximum likelihood. With a trained flow at hand, generation is performed by sampling from the base distribution and applying $f_\theta^{-1}$, and density estimation can be done by evaluating the RHS of (1).

¹Flows are more often presented as $x = g(z)$ (Papamakarios et al., 2021; Kobyzev et al., 2020), which is equivalent with $f = g^{-1}$ due to their invertibility. Under that convention, Equation (1) would look similar with only the $+$ substituted by a $-$.

Real NVP (Dinh et al., 2016), which is of particular interest to us, uses coupling layers (not to be confused with our coupled approach) to construct $f_\theta$. Masked Autoregressive Flow (MAF) (Papamakarios et al., 2017) is a generalization of Real NVP which substitutes its coupling layers with autoregressive ones. While improving expressiveness, this impedes single-pass sampling, but still allows efficient density evaluation, our main concern, using masked autoencoders (Germain et al., 2015). Real NVP and MAF are reviewed more thoroughly in Appendix D.
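To ground Equation (1) and the coupling-layer idea, here is a minimal, illustrative PyTorch sketch of a Real NVP-style coupling layer and the resulting log-density computation. This is our own toy example rather than the implementation used in the paper (which uses a single-layered MAF); all names, sizes and network choices are arbitrary.

```python
import torch
import torch.nn as nn

class Coupling(nn.Module):
    """Minimal Real NVP-style affine coupling layer (illustrative sketch): the first
    d inputs pass through unchanged and parameterize an element-wise affine
    transform of the remaining D - d inputs."""
    def __init__(self, D, d, hidden=64):
        super().__init__()
        self.d = d
        self.s = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, D - d))
        self.t = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, D - d))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.s(x1), self.t(x1)
        y = torch.cat([x1, x2 * torch.exp(s) + t], dim=1)  # affine transform of the second block
        log_det = s.sum(dim=1)                             # log|det| of the triangular Jacobian
        return y, log_det

def log_prob(layers, base, x):
    """Change of variables, Eq. (1): push data through the stacked layers
    (alternating which block passes through), accumulate log-determinants and
    add the base log-density of the final output."""
    log_det_total = torch.zeros(x.shape[0])
    for layer in layers:
        x, log_det = layer(x)
        log_det_total = log_det_total + log_det
    return base.log_prob(x).sum(dim=1) + log_det_total

D = 6
layers = [Coupling(D, D // 2)]
base = torch.distributions.Normal(torch.zeros(D), torch.ones(D))
x = torch.randn(8, D)
print(log_prob(layers, base, x).shape)  # torch.Size([8])
```

Training such a model amounts to maximizing this log-probability over a dataset, exactly as described above; MAF replaces the coupling layer with an autoregressive one but keeps the same change-of-variables computation.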
Although much research (Kingma & Dhariwal, 2018; Huang et al., 2018; Durkan et al., 2019; Ho et al., 2019) has gone into flows of greater capacity, MAF's simplicity, efficiency and adequacy for our task make it the architecture used in this work to model the state and state-action distributions $d_\pi(s)$ and $p_\pi(s, a)$.

2.3. Imitation Learning

Imitation learning (IL) concerns the optimization of an agent's behavior in an environment, given expert demonstrations. Perhaps the simplest approach to imitation, behavioral cloning (BC), performs supervised regression or maximum likelihood on given expert state-action pairs $\{(s_e^t, a_e^t)\}_{t=1}^{N}$:

$$\min_\pi \sum_{t=1}^{N} \text{Loss}(\pi(s_t), a_t) \quad \text{or} \quad \max_\pi \sum_{t=1}^{N} \log \pi(a_t \mid s_t) \tag{2}$$

for deterministic and stochastic policies, respectively. BC suffers from compounding errors and distributional shift, wherein unfamiliar states cause the policy to misstep, leading to even less familiar states and an eventual complete departure from the expert trajectory (Ross et al., 2011).

Recent distribution matching (DM) approaches (Ho & Ermon, 2016; Kostrikov et al., 2018; 2019; Kim et al., 2021b) successfully overcome these issues. One common formulation (Ke et al., 2020; Ghasemipour et al., 2020) encompassing most of these methods views DM as an attempt to match the agent's state-action distribution $p_\pi$ with the expert's $p_e$ by minimizing some f-divergence $D_f$ (other distances, such as the Wasserstein metric, have also been used (Sun et al., 2021; Dadashi et al., 2020)):

$$\arg\min_\pi D_f(p_\pi \| p_e). \tag{3}$$

These methods hinge on the one-to-one relationship between a policy and its state-action distribution and have shown significant improvement over BC, particularly when few expert trajectories are available (Ghasemipour et al., 2020) or expert trajectories are subsampled (Li et al., 2022).

3. Related Work

Attempts at modeling the state distribution have been made for various applications, and countless distribution matching approaches to imitation learning exist, with varying degrees of relevancy to our own. The most pertinent works, those intersecting both the former and the latter, are reviewed in most detail towards the end of this section.

Some general attempts to model the state distribution include Hazan et al. (2019), who use discretization and kernel density estimates (KDEs) for curiosity-based exploration, while Lee et al. (2019) use variational autoencoders (VAEs) in the same context. VAEs have also been used by Islam et al. (2019) to model $d_\pi(s)$ for constraining the distribution shift in off-policy RL algorithms, and KDEs have also been used by Qin et al. (2021) for density constrained reinforcement learning.

Approaches to distribution matching based imitation include GAIL (Ho & Ermon, 2016), which uses a GAN-like (Goodfellow et al., 2014) objective to minimize the JS divergence. Originally born out of max-entropy inverse reinforcement learning (Ziebart et al., 2008), GAIL paved the way for later works such as AIRL (Fu et al., 2017), which modified GAIL's objective to allow reward recovery; DAC (Kostrikov et al., 2018), an off-policy extension of GAIL also addressing reward bias; and general f-divergence approaches like (Ke et al., 2020; Ghasemipour et al., 2020). Other works include the DICE (Nachum et al., 2019) family, comprised of Value DICE (Kostrikov et al., 2019) and its successors Soft DICE (Sun et al., 2021), Sparse DICE (Camacho et al., 2021) and Demo DICE (Kim et al., 2021a).
Using the Donsker Varadhan representation along with a change of variables, Value DICE derives a fully off-policy objective from the reverse KL, completely drubbing BC in the offline regime. Its successors are of tangential relevancy here, each augmenting it for a different domain: Soft DICE makes various amendments notably using the Wasserstein metric (also used by PWIL (Dadashi et al., 2020)), while Sparse Dice adds a host of regularizers to extend Value DICE to subsampled trajectories and Demo DICE focuses on imitation with supplementary imperfect demonstrations. We note the apparent advantages of the DICE family over BC in the fully offline regime have recently been questioned (Li et al., 2022). Our exploitation of Donsker-Varadhan was certainly inspired by Value DICE and is somewhat reminiscent of MINE (Belghazi et al., 2018) who utilize it to estimate mutual information with a regular neural estimator. Their context and precise method however, differ substantially. Another work of relation is the often overlooked (Schroecker, 2020) which includes three approaches: SAIL, GPRIL and VDI (Schroecker & Isbell, 2017; Schroecker et al., 2019; Schroecker & Isbell, 2020). All three attack the IL problem via the forward KL. Essentially, the forward KL is broken into a behavioral cloning term as well as a state distribution term, and optimizing the objective then requires an estimate of θ log dπθ(s). SAIL, GPRIL and VDI each propose a unique way of estimating this gradient, with GPRIL and VDI incorporating flows. (Chang et al., 2022) also utilize flows, proposing a state-only IL approach that models the state-next-state transition using conditional flows, but require dozens of expert trajectories to find success. A Coupled Flow Approach to Imitation Learning Finally, the most similar work is NDI (Kim et al., 2021b) who rewrite the reverse KL as DKL(pπ||pe) = Epπ [log pe log pπ] = J(π, r= log pe) + H(pπ). That is, RL with rewards log pe, along with a state-action entropy term. They continue by deriving a lower bound on H(pπ), termed the SAELBO. NDI+MADE is then proposed, where in a first phase log pe is estimated using flows followed by a second phase of optimization. NDI has many limitations and differs dramatically from our approach. NDI claims to be non-adversarial, while true, it is only due to their loose bound. Moreover, this bound was proven superfluous in their ablation where setting λf=0 showed no reduction in performance. On top of that, we far outperform them, our evaluation is far more comprehensive and although we both use flows, the method and employment differ significantly in CFIL given the coupling of a pair with Donsker-Varadhan. Pausing the deturpation, each of these methods above has strong merits and heavily inspired our own: We adopted the use of MAF over Real NVP from NDI+MADE (our work began in ignorance of NDI); and it was studying Value DICE that inspired our direction to exploit the Donsker-Varadhan representation of the KL. The way in which we do however, is unique, and all the approaches above are distinct from our own. 4. 
Our Approach

We begin with the reverse KL as our divergence of choice since, as others have noted (Kostrikov et al., 2019; Hazan et al., 2019; Kim et al., 2021b; Camacho et al., 2021), its minimization may be viewed as an RL problem whose rewards are the log distribution ratios:

$$\arg\min_\pi D_{KL}(p_\pi \| p_e) = \arg\max_\pi \mathbb{E}_{p_\pi(s,a)}\left[\log \frac{p_e(s,a)}{p_\pi(s,a)}\right] = \arg\max_\pi J\!\left(\pi, \; r = \log \frac{p_e}{p_\pi}\right), \tag{4}$$

thus permitting the use of any RL algorithm for solving the IL objective, provided one has an appropriate estimate of the ratio. Before continuing, however, we first motivate our coupled approach to such estimation by illustrating the failure of what is perhaps a more natural next step: using two independent density estimators, say flows, for each of the densities $p_e$ and $p_\pi$ directly. Practically, this would mean alternating between learning the flows and using their log-ratio as reward in an RL algorithm. The table in Section 6 concisely showcases a clear failure of this approach on all the standard benchmarks we later evaluate with.

Figure 1. Left: The BC graph of an uncoupled flow for the HalfCheetah-v2 environment. Right: The BC graph of a coupled flow for the HalfCheetah-v2 environment. BC graphs for an estimator are generated by updating the estimator analogously to an RL run using N saved BC rollouts. This yields N estimators corresponding to N intermediate BC agents. The BC graph is then, for all i, the scatter of the i-th estimator's evaluation of the i-th BC agent's trajectory against its true environment reward.

The failure can be further understood by analyzing what we term the BC graph corresponding to an estimator of the log distribution ratio. That is, we first train a behavioral cloning agent once on sufficient expert trajectories, rolling it out every few iterations. We then train the estimator analogously to an RL run, using a single expert trajectory along with the N saved BC rollouts. This quick process yields N estimators corresponding to N intermediate BC agents. The BC graph is then, for all i, the i-th estimator's evaluation of the i-th BC agent's trajectory, scattered against the true environment reward of that same trajectory. (The meaningfulness of a BC graph depends on how erratic the training was and on how iteration number, loss and true environment reward line up during training; we chose HalfCheetah for illustration because these measures mostly align.)

Intuitively, a non-increasing graph means a potential RL learner with an analogously trained reward may struggle to overcome those misleading behaviors ranked high by the synthetic reward, preventing it from reaching expert level. In practice, of course, one would want some form of approximate monotonicity or upward trend. Importantly though, a BC graph's monotonicity by no means implies the success of an RL learner with the correspondingly constructed reward. This is more than a theoretical idiosyncrasy: many estimators will emit a perfect BC graph while completely failing all attempts in RL (see Appendix F.1). Only the reverse holds: a complete lack of an upward trend will usually imply an agent's failure.

Loosely formalized, given BC's objective in Equation 2 and assuming a stationary (time-independent) reward function $r$, a monotonically increasing BC graph essentially means (again assuming the environment reward aligns with the BC loss) that for all policies $\pi_1, \pi_2$: $J(\pi_1, r) > J(\pi_2, r)$ if $\mathbb{E}_{p_e}[\log \pi_1(a \mid s)] > \mathbb{E}_{p_e}[\log \pi_2(a \mid s)]$. Thus, further assuming continuity, a graph that is not monotonically increasing implies it either monotonically decreases or has a local maximum. In both cases, an RL learner with objective $\arg\max_\pi J(\pi, r)$ may converge on a policy with suboptimal BC loss. Since BC recovers the expert policy only at optimality, this would guarantee the agent will not meet its truly intended goal of expert mimicry. Of course, in reality, RL can overcome a certain lack of an upward trend. Moreover, the rewards are neither stationary nor identical between the BC and RL runs, only analogously constructed, so such graphs are only loosely representative. Nonetheless, we find they can be highly insightful.
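As a concrete illustration of the procedure just described, the following sketch generates a BC graph. It is not the paper's code: the estimator interface (update/score), the expert batch sampler and the saved (trajectory, return) pairs are hypothetical stand-ins.

```python
import matplotlib.pyplot as plt

def bc_graph(estimator, sample_expert_batch, bc_rollouts, updates_per_rollout=10):
    """Hypothetical sketch. `estimator` exposes .update(expert_batch, agent_batch)
    and .score(states_actions) (mean synthetic reward over a trajectory);
    `bc_rollouts` is a list of (states_actions, true_return) pairs saved while
    training a single BC agent."""
    true_returns, synthetic_scores = [], []
    for states_actions, true_return in bc_rollouts:               # i-th intermediate BC agent
        for _ in range(updates_per_rollout):                      # update analogously to an RL run
            estimator.update(sample_expert_batch(), states_actions)
        true_returns.append(true_return)                          # x-axis: environment reward
        synthetic_scores.append(estimator.score(states_actions))  # y-axis: i-th estimator on i-th trajectory
    plt.scatter(true_returns, synthetic_scores)
    plt.xlabel("true environment reward of BC trajectory")
    plt.ylabel("synthetic reward assigned by matching estimator")
    plt.show()
```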
As Figure 1 suggests, the independently trained flows' BC graph is quite lacking. An agent would have no incentive, according to the synthetic reward, to make any progress, which is precisely what occurs, as the table in Section 6 demonstrates. This poor BC graph is due in part to each flow being evaluated on data completely out of its distribution (OOD), which flows are known to struggle with (Kirichenko et al., 2020). Since the two flows' estimates lack true meaning when evaluated on each other's data, we need to tie them together somehow: they must be coupled.

To perform our coupling, we employ the Donsker-Varadhan (Donsker & Varadhan, 1976) form of the KL divergence:

$$D_{KL}(p_\pi \| p_e) = \sup_{x: S \times A \to \mathbb{R}} \mathbb{E}_{p_\pi(s,a)}\left[x(s,a)\right] - \log \mathbb{E}_{p_e(s,a)}\left[e^{x(s,a)}\right]. \tag{5}$$

In the above, optimality occurs at $x^* = \log \frac{p_\pi}{p_e} + C$ for any $C \in \mathbb{R}$ (Kostrikov et al., 2019; Gangwani et al., 2020; Belghazi et al., 2018). Thus, after computing $x^*$, one recovers the log distribution ratio of Equation (4) by simple negation, enabling the use of $-x^*$ as reward in an RL algorithm to optimize our IL objective.

This leads directly to our proposed approach for estimating the log distribution ratio: coupling two flows through $x(s, a)$. That is, instead of training two flows independently, we propose to do so through maximization of Equation 5. More specifically, we inject the following inductive bias, modeling $x$ as $x_{\psi,\phi}(s, a) = \log p_\psi(s, a) - \log q_\phi(s, a)$, where $p_\psi$ and $q_\phi$ are normalizing flows. This coupling guarantees more meaningful values when the flows are evaluated on each other's data, since such evaluation has already occurred during the maximization phase, hence sidestepping the OOD issue described earlier.

Figure 1 illustrates this advantage. The right panel shows the coupled flows' BC graph, clearly remedying the issue with their uncoupled counterparts: a potential learner will now have the proper incentive to reproduce the expert's behavior. The drop in synthetic reward (i.e., $-x$) towards the end of the BC graph may seem daunting, but it actually expresses how well our estimator captures the expert's behavior: the drop occurs precisely beyond expert level, where the agent, while good in the environment, diverges from the true expert's behavior. (This both illustrates the tendency of BC to overfit (Li et al., 2022) and raises concerns about the tendency of IL papers to report a table showing a slight advantage in asymptotic reward as meaningful: once at expert level, an advantage should not be claimed for slightly higher performance, unless the stated goal of the work is to do so, but that would no longer be imitation.)

Given this improved estimator, our full IL objective can be written as:

$$\arg\max_\pi \; \min_{p_\psi, q_\phi} \; \log \mathbb{E}_{p_e(s,a)}\!\left[\frac{p_\psi(s,a)}{q_\phi(s,a)}\right] - \mathbb{E}_{p_\pi(s,a)}\!\left[\log \frac{p_\psi(s,a)}{q_\phi(s,a)}\right]. \tag{6}$$

As is commonplace in adversarial-like approaches (Goodfellow et al., 2014; Ho & Ermon, 2016; Kostrikov et al., 2019), the max-min objective above is trained in an alternating fashion, switching between learning $x$ and using $-x$ as reward in an RL algorithm. Moreover, we find it useful to apply a squashing function to $x$, avoiding a potentially exploding KL due to a lack of common support (an issue subtly different from the earlier OOD one).
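To make the coupling concrete, here is a minimal PyTorch sketch of the inner minimization of Equation (6) and the resulting synthetic reward. It is a sketch under assumptions rather than the paper's implementation: flow_p and flow_q stand for any density models exposing a log_prob method (MAFs in the paper), and the squashing function shown is the one reported in the experiments section.

```python
import math
import torch

def squash(x):
    # Squashing reported in the experiments: sigma(x) = 6 * tanh(x / 15).
    return 6.0 * torch.tanh(x / 15.0)

def coupled_dv_loss(flow_p, flow_q, expert_sa, agent_sa):
    """Inner minimization of Eq. (6): J = log E_pe[e^x] - E_ppi[x],
    with x = squash(log p_psi - log q_phi) tying the two flows together."""
    x_expert = squash(flow_p.log_prob(expert_sa) - flow_q.log_prob(expert_sa))
    x_agent = squash(flow_p.log_prob(agent_sa) - flow_q.log_prob(agent_sa))
    log_mean_exp = torch.logsumexp(x_expert, dim=0) - math.log(x_expert.shape[0])
    return log_mean_exp - x_agent.mean()

def synthetic_reward(flow_p, flow_q, sa):
    # Reward handed to the RL learner: the negated (squashed) log-ratio estimate.
    with torch.no_grad():
        return -squash(flow_p.log_prob(sa) - flow_q.log_prob(sa))
```

In an actual run, this loss would be minimized over the flow parameters every k environment steps, while the RL algorithm treats synthetic_reward as a black-box reward, as in Algorithm 1 below.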
Our approach additionally allows training each flow independently along the way. We call this flow regularization, and importantly, our method succeeds without it. More specifically, for expert and agent batches of size $M$, this regularization incorporates the following additional loss into the minimization step of Equation 6:

$$L = -\frac{1}{M} \sum_{i=1}^{M} \left[ \log q_\phi(s_e^i, a_e^i) + \log p_\psi(s^i, a^i) \right], \tag{7}$$

with $L$ weighted by a coefficient $\alpha$.

Noting that training flows is a delicate process, our approach further benefits from, though again does not require, a smoothing akin to the dequantization used when training normalizing flows on discrete data (Papamakarios et al., 2017). More specifically, since our input is unnormalized, we smooth each dimension with uniform noise scaled to its value. That is, if $(s, a)$ is the vector to be smoothed, we sample uniform noise of dimension $\dim((s, a))$, multiply the two element-wise and add the result to the original vector:

$$(s, a) \mathrel{+}= \beta \cdot (s, a) \odot u, \quad u \sim \text{Uniform}\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)^{\dim((s, a))}, \tag{8}$$

with the weight $\beta$ controlling the smoothing level. Note that if regularization is also present, smoothing still applies within the additional loss $L$.

Finally, combining all the above yields our resulting algorithm, Coupled Flow Imitation Learning (CFIL). It is summarized in Algorithm 1, with only the number of batches per density update omitted.

Algorithm 1 CFIL
Input: Expert demos $R_E = \{(s_e, a_e)\}_{t=1}^{N}$; parameterized flow pair $p_\psi, q_\phi$; off-policy RL algorithm $\mathcal{A}$; density update rate $k$; squashing function $\sigma$; regularization and smoothing coefficients $\alpha, \beta$.
Define: $x_{\psi,\phi} = \sigma(\log p_\psi - \log q_\phi)$
1: for timestep $t = 0, 1, \ldots$ do
2:   Take a step in $\mathcal{A}$ with reward $r = -x_{\psi,\phi}$, while filling agent buffer $R_A$ and potentially updating the policy and value networks according to $\mathcal{A}$'s settings.
3:   if $t \bmod k = 0$ then
4:     Sample expert and agent batches:
5:     $\{(s_e^t, a_e^t)\}_{t=1}^{M} \sim R_E$ and $\{(s^t, a^t)\}_{t=1}^{M} \sim R_A$
6:     if smooth then
7:       $(s, a) \mathrel{+}= \beta \cdot (s, a) \odot u, \quad u \sim U(-\tfrac{1}{2}, \tfrac{1}{2})^{\dim((s,a))}$
8:     end if
9:     Compute loss:
10:    $J = \log \frac{1}{M} \sum_{i=1}^{M} e^{x(s_e^i, a_e^i)} - \frac{1}{M} \sum_{i=1}^{M} x(s^i, a^i)$
11:    if flow reg then
12:      Compute regularization loss:
13:      $L = -\frac{1}{M} \sum_{i=1}^{M} \left[\log q_\phi(s_e^i, a_e^i) + \log p_\psi(s^i, a^i)\right]$
14:      $J = J + \alpha L$
15:    end if
16:    Update $\psi \leftarrow \psi - \eta \nabla_\psi J$
17:    Update $\phi \leftarrow \phi - \eta \nabla_\phi J$
18:  end if
19: end for

As in Value DICE (Kostrikov et al., 2019), we found the bias due to the log-exp over the mini-batches did not hurt performance and therefore left it unhandled.

Another setting of interest is learning from observations (LFO) alone (Zhu et al., 2020; Torabi et al., 2018; 2021), that is, attempting to minimize:

$$\arg\min_\pi D_{KL}(d_\pi(s, s') \| d_e(s, s')). \tag{9}$$

While this objective is clearly underspecified in a non-injective and deterministic MDP, in practice, recovering the expert's behavior is highly feasible (Zhu et al., 2020). Seeing as none of our description above is specific to the domain of states and actions, CFIL naturally extends to LFO with no need of modification. This is in stark contrast to previous works in the LFO setting, which have been highly tailored (Zhu et al., 2020).
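As a small illustration of Equations (7) and (8), the helpers below sketch the smoothing and flow-regularization pieces that plug into the density update of Algorithm 1. The flow objects and their log_prob interface are assumptions, and the same functions apply unchanged in the LFO setting, where the batches contain states or state-next-state pairs rather than state-action pairs.

```python
import torch

def smooth(batch, beta=0.5):
    """Input smoothing, Eq. (8): add uniform noise scaled element-wise to the
    (unnormalized) input, akin to dequantization."""
    u = torch.rand_like(batch) - 0.5  # u ~ Uniform(-1/2, 1/2)^dim
    return batch + beta * batch * u

def flow_regularization(flow_p, flow_q, expert_batch, agent_batch):
    """Flow regularization, Eq. (7): negative log-likelihood of each flow on its
    'own' data (q_phi on expert samples, p_psi on agent samples), added to the
    coupled loss with weight alpha."""
    return -(flow_q.log_prob(expert_batch).mean() +
             flow_p.log_prob(agent_batch).mean())
```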
We shall demonstrate CFIL s utility in the LFO setting in the following section, where remarkably, we even find success when the flows model the single state distribution d(s). 5. Experiments We evaluate CFIL on the standard Mujoco benchmarks (Todorov et al., 2012), first comparing it to state-of-theart imitation methods, including Value DICE (Kostrikov et al., 2019) and their optimized implementation of DAC (Kostrikov et al., 2018), along with a customary behavioral cloning (BC) baseline. We then move to evaluation on a variety of other settings described below. We use Value DICE s original expert demonstrations, with exception to the Humanoid environment, for which we train our own expert, since they did not originally evaluate on it. We use Value DICE s open-source implementation to comfortably run all three baselines. NDI (Kim et al., 2021b) would be the ideal candidate for comparison, given the similarities, however no code was available. Still, we reference some relevant results described in their paper. For CFIL, our RL algorithm of choice is Spinning Ups s (Achiam, 2018) SAC (Haarnoja et al., 2018). We leave all hyper-parameters unchanged, only reducing the start-steps down to 2000, matching that of the baselines above. Our choice for both flows is a single layered MAF (Papamakarios et al., 2017), amending no hyper-parameters from the following open-source implementation (Bliznashki, 2019). This lack of tuning highlights the robustness of our method. Our density update rate is 10 batches of 100, every 1000 timesteps. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001. For squashing we use σ = 6tanh( x 15), while the smoothing and regularization coefficients are 0.5 and 1 respectively. For all algorithms, we run 80 epochs, each consisting of 4000 timesteps, evaluating over 10 episodes after each. We do this across 5 random seeds and plot means and standard deviations. The plots are smoothed by a rolling mean with window length 5. All results are using a single expert trajectory. Appendix E shows CFIL s performance with more expert trajectories. The top of Figure 2 shows a comparison to the baselines in the standard state-action setting. Clearly, CFIL achieves expert level performance on all tasks with merely a single expert trajectory something the baselines fail to do. CFIL either outperforms or performs similarly to the competition in terms of asymptotic performance (though slight outperformace is not meaningful once all are at expert level), with a massive advantage in the Ant and Humanoid environments. In terms of environment interactions required until reaching expert level: Our approach slightly lags behind the optimized DAC and largely lags behind Value DICE. That is of course, only for the tasks they succeed on. Value DICE where it succeeds clearly wins in terms of interactions needed to reach expert level, though its performance then occasionally degrades as seen in Figure 2. Being semioffline (i.e. it can be instantiated fully offline), Value DICE is difficult to compete with in terms of environment interactions. However, minimizing interactions was not the objective of this work, yet still we are on par. A Coupled Flow Approach to Imitation Learning Figure 2. Top: A comparison of CFIL to Value DICE and DAC on a single expert trajectory in the standard state-action setting. Bottom: A comparison of two versions of CFIL to OPOLO on a single expert trajectory in the LFO setting, with one version limiting itself only to single states. 
CFIL uses identical hyperparameters on all environments in all three incarnations, showing outstanding results, particularly in the LFO setting where it far outperforms the highly tailored competitor OPOLO. Despite not being our aim, we do note that CFIL far outperforms NDI which requires an order of magnitude more interactions (see their appendix), who in turn outperform GAIL (Ho & Ermon, 2016) by another order of magnitude. Clearly, CFIL is still very impressive in terms of environment interactions and the lag between us and DAC may simply be due to a lack of tuning of CFIL s RL component. We deliberately avoid this tuning, preferring to keep the RL algorithm a black box, since too much tuning may jeopardize robustness and extendability to other settings: We aspire for robust competitiveness rather than a tailored, minor and frail improvement. We now turn to the state-only and subsampled regimes. Settings in which Value DICE finds no dice: By its very nature, Value DICE is inapplicable to the state-only scenario (see Appendix 9.8 of (Zhu et al., 2020)) and extensive experiments already showed its failure when demonstrations are subsampled (Li et al., 2022). Sparse DICE (Camacho et al., 2021) attempted to remedy this by adding a series of regularizers. However, they require at least 10 demonstrations, each subsampled at a rate of 0.1, not nearly the sparsity of the settings we next describe. First in the state-only regime, we compare against OPOLO (Zhu et al., 2020), a state-of-the-art LFO method. We avoid comparison with on-policy LFO methods, like GAIf O (Torabi et al., 2018), due to the sheer number of interactions they require. We use OPOLO s open-source implementation, with the same setup described previously, once again using only a single expert trajectory. Importantly, CFIL uses identical hyperparameters to the previous setting. The bottom of Figure 2 shows two versions of CFIL compared with OPOLO. One estimating the state-next-state distribution ratio, while the other limiting itself to the single state distribution. As can be seen, both incarnations of CFIL once again achieve expert level performance on all environments. This is in sharp contrast to OPOLO which, despite being highly tailored towards the LFO setting, employing inverse action models, discriminators and a host of practical considerations, struggles immensely when provided with a single expert trajectory (they originally report success with 4 trajectories). All the while, CFIL requires no amendment whatsoever, extending simply and elegantly, while distinctly outdoing OPOLO on every imitation learning metric. CFIL s results in the LFO setting are nothing short of remarkable. We effectively set a new state-of-theart for learning from observations alone, while our method was not originally designed for it. We finally turn to the subsampled regime. Comparing again to DAC with a similar setup as before. Our comparison here involves four different subsampling rates, sampling every 10, 20, 50 and 100 th transition. Since we are still working with a single expert trajectory, the final rate implies learning from 10 state-action pairs alone. For this setting, we found the hyperparameters used above did not extend well enough. Specifically, the use of flow regularization seemed to hurt performance in some environments. We therefore drop the regularization to 0, leaving the smoothing at 0.5 and amending the squasher to be σ = 3tanh( x 10). 
We use these parameters for all environments in all four new settings, with exception to cheetah who, when subsampled, A Coupled Flow Approach to Imitation Learning Figure 3. A comparison of CFIL and DAC on four subsampling rates with a single expert trajectory. Subsample N refers to sampling every N th transition in the trajectory. actually struggles without the flow regularization. Figure 3 shows the comparison in the subsampled setting, with CFIL once again demonstrating stellar performance. While DAC has the advantage in the Cheetah environment, where single crashing seeds significantly reduced the apparent performance of CFIL, CFIL still generally outperforms it, remarkably able to recover the expert behavior when only provided with a measly 10 state-action pairs. 6. Ablation We now concisely ablate various aspects of our approach. We first put into question the need for our squasher, our coupling and our inductive bias, by comparing to No Squash; Ind Flow and Ind Flow NS; and Regular Net. No Squash is CFIL stripped of the squasher. Ind Flow and Ind Flow NS refer to learning the log distribution ratio directly using independent flows (as in section 4), with and without a squasher, respectively. Regular Net alters CFIL s inductive bias by setting x to a regular MLP, completely avoiding the use of flows (reminiscent of MINE (Belghazi et al., 2018)). Along with these we also run a numerator only approach termed Numerator, which simply involves direct learning of the expert s distribution with a single flow, then using it alone as reward in an RL algorithm: J(π, r= log pe) (akin to NDI (Kim et al., 2021b) with λf = 0). We run these with our usual setup of a single expert trajectory and compute their overall score as the average normalized asymptotic reward over all 25 seeds (5 for each environment). Table 1 summarizes these results, showing all the CFIL alternatives fail, demonstrating the necessity of its components. Next we vary CFIL s smoothing and regularization coef- Table 1. A comparison of CFIL to some alternatives, ablating its squasher, coupling and inductive bias. The score is the average normalized asymptotic reward over 25 seeds (5 for each environment). Note: rows with two values indicate runs with smoothing of 0 (left) and 0.5 (right). CFIL 1.012 NOSQUASH 0.091 REGULARNET 0.196 | 0.190 INDFLOW 0.158 | 0.127 INDFLOWNS 0.090 | 0.072 NUMERATOR 0.051 | 0.001 Figure 4. Averages (as color) and standard deviations (as size) of normalized asymptotic rewards for CFIL with varied levels of smoothing and regularization. Each point summarizes 25 seeds (5 per environment). The utility of the smoothing and regularization is apparent as well as CFIL s lack of sensitivity to them. ficients to test its sensitivity. Specifically, we run CFIL (that of the three settings in Figure 2), but with pairs α, β {0, 0.25, 0.5, 0.75, 1}, along with select others, and compute their averages and standard deviations of normalized asymptotic rewards over all 25 seeds. Figure 4 illustrates the results for these runs, showcasing both the utility of the smoothing and regularization as well as CFIL s robustness to them. 7. Conclusion We presented CFIL, a unique approach to imitation learning based on the coupling of a pair of flows. CFIL introduced many novelties including its estimator for the log ratio, its smoothing and regularization and more generally its employment of flows, while we also performed a unique analysis using BC graphs and raised concerns about trends in relevant literature. 
Crucially, CFIL was empirically shown to outperform state-of-the-art baselines in a variety of settings with only a single expert trajectory. A future work could include coupled flows for general ratio estimation. A Coupled Flow Approach to Imitation Learning Achiam, J. Spinning Up in Deep Reinforcement Learning. 2018. Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, R. D. Mine: mutual information neural estimation. ar Xiv preprint ar Xiv:1801.04062, 2018. Bliznashki, K. https://github.com/ kamenbliznashki/normalizing_flows, 2019. Camacho, A., Gur, I., Moczulski, M. L., Nachum, O., and Faust, A. Sparsedice: Imitation learning for temporally sparse data via regularization. In ICML 2021 Workshop on Unsupervised Reinforcement Learning, 2021. Chang, W.-D., Higuera, J. C. G., Fujimoto, S., Meger, D., and Dudek, G. Il-flow: Imitation learning from observation using normalizing flows. ar Xiv preprint ar Xiv:2205.09251, 2022. Dadashi, R., Hussenot, L., Geist, M., and Pietquin, O. Primal wasserstein imitation learning. ar Xiv preprint ar Xiv:2006.04678, 2020. Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear independent components estimation. ar Xiv preprint ar Xiv:1410.8516, 2014. Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. ar Xiv preprint ar Xiv:1605.08803, 2016. Donsker, M. D. and Varadhan, S. S. Asymptotic evaluation of certain markov process expectations for large time iii. Communications on pure and applied Mathematics, 29 (4):389 461, 1976. Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. In Advances in Neural Information Processing Systems, pp. 7511 7522, 2019. Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. ar Xiv preprint ar Xiv:1710.11248, 2017. Fujimoto, S., Van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. ar Xiv preprint ar Xiv:1802.09477, 2018. Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052 2062. PMLR, 2019. Gangwani, T., Peng, J., and Zhou, Y. Harnessing distribution ratio estimators for learning agents with quality and diversity. ar Xiv preprint ar Xiv:2011.02614, 2020. Germain, M., Gregor, K., Murray, I., and Larochelle, H. Made: Masked autoencoder for distribution estimation. In International conference on machine learning, pp. 881 889. PMLR, 2015. Ghasemipour, S. K. S., Zemel, R., and Gu, S. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pp. 1259 1277. PMLR, 2020. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ar Xiv preprint ar Xiv:1801.01290, 2018. Hazan, E., Kakade, S., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp. 2681 2691. PMLR, 2019. Ho, J. and Ermon, S. Generative adversarial imitation learning. ar Xiv preprint ar Xiv:1606.03476, 2016. Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. 
In International Conference on Machine Learning, pp. 2722 2730. PMLR, 2019. Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. In International Conference on Machine Learning, pp. 2078 2087. PMLR, 2018. Islam, R., Teru, K. K., Sharma, D., and Pineau, J. Offpolicy policy gradient algorithms by constraining the state distribution shift. ar Xiv preprint ar FXiv:1911.06970, 2019. Ke, L., Choudhury, S., Barnes, M., Sun, W., Lee, G., and Srinivasa, S. Imitation learning as f-divergence minimization. In International Workshop on the Algorithmic Foundations of Robotics, pp. 313 329. Springer, 2020. Kim, G.-H., Seo, S., Lee, J., Jeon, W., Hwang, H., Yang, H., and Kim, K.-E. Demodice: Offline imitation learning with supplementary imperfect demonstrations. In International Conference on Learning Representations, 2021a. A Coupled Flow Approach to Imitation Learning Kim, K., Jindal, A., Song, Y., Song, J., Sui, Y., and Ermon, S. Imitation with neural density models. Advances in Neural Information Processing Systems, 34:5360 5372, 2021b. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in neural information processing systems, pp. 10215 10224, 2018. Kirichenko, P., Izmailov, P., and Wilson, A. G. Why normalizing flows fail to detect out-of-distribution data. ar Xiv preprint ar Xiv:2006.08545, 2020. Kobyzev, I., Prince, S. J., and Brubaker, M. A. Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence, 43(11):3964 3979, 2020. Kostrikov, I., Agrawal, K. K., Dwibedi, D., Levine, S., and Tompson, J. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. ar Xiv preprint ar Xiv:1809.02925, 2018. Kostrikov, I., Nachum, O., and Tompson, J. Imitation learning via off-policy distribution matching. ar Xiv preprint ar Xiv:1912.05032, 2019. Lee, L., Eysenbach, B., Parisotto, E., Xing, E., Levine, S., and Salakhutdinov, R. Efficient exploration via state marginal matching. ar Xiv preprint ar Xiv:1906.05274, 2019. Li, Z., Xu, T., Yu, Y., and Luo, Z.-Q. Rethinking valuedice: Does it really improve performance? ar Xiv preprint ar Xiv:2202.02468, 2022. Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 31, 2018. Mutti, M., De Santi, R., De Bartolomeis, P., and Restelli, M. Challenging common assumptions in convex reinforcement learning. ar Xiv preprint ar Xiv:2202.01511, 2022. Nachum, O., Chow, Y., Dai, B., and Li, L. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019. Nota, C. and Thomas, P. S. Is the policy gradient a gradient? ar Xiv preprint ar Xiv:1906.07073, 2019. Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338 2347, 2017. Papamakarios, G., Nalisnick, E. T., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. J. Mach. Learn. Res., 22(57):1 64, 2021. Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16 17, 2017. Qin, Z., Chen, Y., and Fan, C. Density constrained reinforcement learning. In International Conference on Machine Learning, pp. 8682 8692. PMLR, 2021. Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627 635. JMLR Workshop and Conference Proceedings, 2011. Schroecker, Y. and Isbell, C. Universal value density estimation for imitation learning and goal-conditioned reinforcement learning. ar Xiv preprint ar Xiv:2002.06473, 2020. Schroecker, Y. and Isbell, C. L. State aware imitation learning. In Advances in Neural Information Processing Systems, pp. 2911 2920, 2017. Schroecker, Y., Vecerik, M., and Scholz, J. Generative predecessor models for sample-efficient imitation learning. ar Xiv preprint ar Xiv:1904.01139, 2019. Schroecker, Y. K. D. Manipulating State Space Distributions for Sample-Efficient Imitation-Learning. Ph D thesis, Georgia Institute of Technology, 2020. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Sun, M., Mahajan, A., Hofmann, K., and Whiteson, S. Softdice for imitation learning: Rethinking off-policy distribution matching. ar Xiv preprint ar Xiv:2106.03155, 2021. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Sutton, R. S., Mc Allester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057 1063, 2000. Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033. IEEE, 2012. A Coupled Flow Approach to Imitation Learning Torabi, F., Warnell, G., and Stone, P. Generative adversarial imitation from observation. ar Xiv preprint ar Xiv:1807.06158, 2018. Torabi, F., Warnell, G., and Stone, P. Dealio: Data-efficient adversarial learning for imitation from observation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2391 2397. IEEE, 2021. Zahavy, T., O Donoghue, B., Desjardins, G., and Singh, S. Reward is enough for convex mdps. Advances in Neural Information Processing Systems, 34:25746 25759, 2021. Zhang, J., Koppel, A., Bedi, A. S., Szepesvari, C., and Wang, M. Variational policy gradient method for reinforcement learning with general utilities. Advances in Neural Information Processing Systems, 33:4572 4583, 2020. Zhu, Z., Lin, K., Dai, B., and Zhou, J. Off-policy imitation learning from observations. Advances in Neural Information Processing Systems, 33:12402 12413, 2020. Ziebart, B. D., Maas, A. L., Bagnell, J. A., Dey, A. K., et al. Maximum entropy inverse reinforcement learning. 2008. A Coupled Flow Approach to Imitation Learning A. Reproducibility Code for reproducibility of CFIL, including a detailed description for reproducing our environment, is available at https: //github.com/gfreund123/cfil. B. Ablation Results per Environment Table 2. Extended version of Table 1, showing the ablation results per environment. The table contains averages and standard deviations of normalized asymptotic rewards over 5 seeds. 
|  | HALFCHEETAH-V2 | WALKER2D-V2 | ANT-V2 | HOPPER-V2 | HUMANOID-V2 |
| --- | --- | --- | --- | --- | --- |
| CFIL | 1.122 ± 0.015 | 0.8542 ± 0.063 | 1.095 ± 0.022 | 0.986 ± 0.011 | 1.004 ± 0.015 |
| NOSQUASH | -0.056 ± 0.074 | -0.002 ± 0.001 | -0.411 ± 0.377 | 0.001 ± 0.000 | 0.010 ± 0.006 |
| REGULARNET (SMOOTH=0) | -0.000 ± 0.000 | 0.149 ± 0.002 | 0.185 ± 0.016 | 0.291 ± 0.009 | 0.357 ± 0.312 |
| REGULARNET (SMOOTH=0.5) | -0.000 ± 0.000 | 0.152 ± 0.001 | 0.169 ± 0.017 | 0.289 ± 0.002 | 0.340 ± 0.155 |
| INDFLOW (SMOOTH=0) | 0.642 ± 0.094 | 0.000 ± 0.002 | 0.000 ± 0.001 | 0.138 ± 0.306 | 0.012 ± 0.000 |
| INDFLOW (SMOOTH=0.5) | 0.611 ± 0.068 | 0.011 ± 0.018 | 0.000 ± 0.001 | 0.001 ± 0.000 | 0.012 ± 0.000 |
| INDFLOWNS (SMOOTH=0) | 0.203 ± 0.183 | 0.082 ± 0.058 | 0.126 ± 0.176 | 0.026 ± 0.031 | 0.017 ± 0.001 |
| INDFLOWNS (SMOOTH=0.5) | 0.290 ± 0.255 | 0.014 ± 0.025 | 0.030 ± 0.029 | 0.010 ± 0.017 | 0.017 ± 0.001 |
| NUMERATOR (SMOOTH=0) | -0.047 ± 0.050 | 0.026 ± 0.034 | -0.270 ± 0.230 | 0.011 ± 0.009 | 0.023 ± 0.015 |
| NUMERATOR (SMOOTH=0.5) | -0.006 ± 0.006 | 0.007 ± 0.015 | -0.037 ± 0.035 | 0.011 ± 0.000 | 0.015 ± 0.001 |

Figure 5. Extended version of Figure 4, showing per environment the average (as color) and standard deviation (as size) of normalized asymptotic rewards for CFIL with varied levels of smoothing and regularization. Each point summarizes 5 seeds. Once again, the importance of the smoothing and regularization is apparent, as well as CFIL's lack of sensitivity to them.

C. Justification of Modeling the Undiscounted State Distribution

With the goal in an MDP being policy optimization, standard RL taxonomy divides between value-based and policy-based methods. The policy gradient theorem (Sutton et al., 2000), the foundation of policy-based methods, provides the following expression for the gradient of $J$ with respect to a parameterized policy $\pi_\theta$:

$$\nabla_\theta J(\pi_\theta) \propto \sum_s d^\gamma_\pi(s) \sum_a \frac{d \pi_\theta(a \mid s)}{d\theta} Q^\pi(s, a), \tag{10}$$

where $d^\gamma_\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid s_0 \sim p_0)$ is the improper discounted state distribution. Equation (10) enables the construction of policy optimization algorithms such as REINFORCE (Sutton & Barto, 2018), which are based on stochastic gradient descent (SGD). However, garnering some criticism (Nota & Thomas, 2019), most modern policy gradient methods (Schulman et al., 2017; Fujimoto et al., 2018; Haarnoja et al., 2018) opt to ignore the discount factor in $d^\gamma_\pi(s)$, which disqualifies them as SGD methods and hence loses the guarantees SGD provides. Instead, they update the policy with an expression similar to (10) in which $d^\gamma_\pi(s)$ is exchanged for the undiscounted version $d_\pi(s) = \sum_{t=0}^{\infty} \Pr(s_t = s \mid s_0 \sim p_0)$, the reason being stability and reduction of variance (see Appendix A of (Haarnoja et al., 2018)). We follow suit in this work, modeling the undiscounted $d_\pi(s)$.
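To make the distinction concrete, the following small sketch (ours, not the paper's code) shows the per-timestep weights one would attach to rollout states when fitting a density model under each convention: uniform weights correspond to the undiscounted $d_\pi$, while $\gamma^t$ weights correspond to the improper discounted $d^\gamma_\pi$.

```python
import numpy as np

def state_visitation_weights(trajectory_lengths, gamma=0.99, discounted=False):
    """Per-state weights for estimating the state distribution from rollouts.
    Undiscounted d_pi weights every visited state equally; the improper
    discounted d^gamma_pi downweights the state at time t by gamma**t."""
    weights = []
    for T in trajectory_lengths:
        t = np.arange(T)
        weights.append(gamma ** t if discounted else np.ones(T))
    return np.concatenate(weights)

# Example: three rollouts of different lengths.
w = state_visitation_weights([1000, 850, 1000], discounted=False)
print(w.shape, w.sum())  # (2850,) 2850.0
```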
D. More on Real NVP and MAF

In pursuit of inspiring more research incorporating flow-based models of the state distribution, and since flows are presumably the topic least familiar to an IL researcher reading the paper, we now provide a brief but thorough review of Real NVP and MAF, along with some additional discussion.

Following the description in Section 2.2: for both training and generation to be efficient, $f_\theta$ must be easy to evaluate, easy to invert, and the determinant of its Jacobian must be easy to compute. Moreover, for training to be plausible, $f_\theta$ must be expressive enough to capture complex high-dimensional distributions. Real NVP (Dinh et al., 2016) uses coupling layers (distinct from our coupled approach) to achieve these requirements. As described in the paper (Dinh et al., 2016), a coupling layer receives a $D$-dimensional input vector $x$ and outputs:

$$y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = x_{d+1:D} \odot \exp(s(x_{1:d})) + t(x_{1:d}), \tag{11}$$

where $d < D$, $s$ and $t$ are functions from $\mathbb{R}^d \to \mathbb{R}^{D-d}$, and $\odot$ represents an element-wise product. Since coupling layers leave certain variables unchanged, they are composed in an alternating pattern to allow all variables to be transformed. Note that the Jacobian of this transformation is triangular, and its determinant is simply $\exp\left(\sum_j s(x_{1:d})_j\right)$. Note further that the inverse of this transformation is also a simple computation:

$$x_{1:d} = y_{1:d}, \qquad x_{d+1:D} = \left(y_{d+1:D} - t(x_{1:d})\right) \odot \exp(-s(x_{1:d})). \tag{12}$$

Finally, note that $s$ and $t$ are not required to be invertible, so they can be arbitrary neural networks.

Masked Autoregressive Flow (MAF) (Papamakarios et al., 2017) is a generalization of Real NVP which substitutes (11) with the autoregressive

$$y_i = x_i \exp(s_i(x_{1:i-1})) + t_i(x_{1:i-1}). \tag{13}$$

As stated in 2.2, this improves expressiveness but prevents single-pass sampling. However, density evaluation, our main concern, can still be performed efficiently using masked autoencoders (Germain et al., 2015).

Generally, normalizing flows are sensitive objects; after all, density estimation is a difficult task, particularly with such limited data of such high dimension. Given the delicacy involved in their training, much deliberation was done over the potential influence of various choices, in particular the need for normalization, something common in other IL works (Kostrikov et al., 2018; 2019; Schroecker & Isbell, 2020). Although we finally settled on the more robust approach of using no normalization or any other special pre-processing, normalization could still be useful, and even necessary, for other applications incorporating density models of the state distribution. Moreover, despite the existence of higher-capacity flows (Kingma & Dhariwal, 2018; Huang et al., 2018; Durkan et al., 2019; Ho et al., 2019), when learning from so few demonstrations, seemingly more powerful flows may be far more prone to overfitting than their less expressive counterparts. We aim to inspire others to take advantage of modern powerful density estimators for explicit modeling of the state and state-action distributions, and to apply them all around the reinforcement learning literature.

E. CFIL With More Trajectories

Figure 6. CFIL's performance with varying numbers of expert trajectories (1, 5, and 10), demonstrating its consistency. Note, the single-trajectory run is identical to that of Figure 2.

F. BC Analysis Extras

F.1. Regular Net's BC Graph

Figure 7 shows the BC graph of Regular Net (see ablation) for the HalfCheetah-v2 environment. This serves as an example estimator which fails in RL despite a good BC graph (its failure in RL is illustrated in Table 1). As described in Section 4, a BC graph's utility is mainly one-way: when poor, it is indicative of failure in RL, but the reverse is not necessarily the case. Beyond that, however, it also provides insight into what is captured by an estimator, as well as being valuable as a quick sanity test of an estimator's quality (it can be generated quickly on saved BC trajectories).

Figure 7. The BC graph of Regular Net (see ablation) for the HalfCheetah-v2 environment.

F.2. Two-Dimensional BC Analysis

What we have termed the BC graph is the diagonal slice of a broader two-dimensional picture.
Recall the generation process described in Section 4: BC graphs show the i th estimator s evaluation of the i th BC trajectory against the true environment reward of that same trajectory. However, one may take interest in evaluating all N estimators on all N trajectories. To that end, out of the N = 375 estimator updates, we show every 25 th (along with the first), evaluated on all the BC trajectories, scattered against their true environment rewards. This is illustrated in Figures 8, 9 and 10 corresponding to coupled flows, uncoupled (independent) flows and Regular Net, respectively. The figures provide insight into what individual estimators captured as well as how they morphed over time. For example, for the coupled flows, we can see how the later estimators (synthetic rewards) incentivize the agent not to go beyond expert level, while the early ones do not. This is reasonable, since estimator i is only encountered by policy πi (hence why the BC graph is only the diagonal cut). For this reason, evaluations of early estimators on much later trajectories aren t particularly insightful with regards to what an RL agent with the analogously constructed reward will face. Still, they re interesting in their own right, showing what has been captured by the estimator at that time. A Coupled Flow Approach to Imitation Learning Figure 8. Two-dimensional BC analysis of coupled flows for the Half Cheetah-v2 environment. In contrast to the BC graph which is the i th estimator s evaluation of the i th BC trajectory vs the true environment reward of that same trajectory, here we see scatters for individual estimators along the way. That is, their evaluations of all BC trajectories scattered against the true environment rewards of the trajectories. Figure 9. Two-dimensional BC analysis of uncoupled flows for the Half Cheetah-v2 environment. See Figure 8 s caption. A Coupled Flow Approach to Imitation Learning Figure 10. Two-dimensional BC analysis of Regular Net for the Half Cheetah-v2 environment. See Figure 8 s caption.
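For completeness, the two-dimensional analysis above could be produced with a sketch like the following, reusing the hypothetical estimator/rollout interfaces from the BC-graph sketch in Section 4; the BC graph itself is the diagonal of this estimator-by-trajectory grid.

```python
import numpy as np
import matplotlib.pyplot as plt

def two_dimensional_bc_analysis(estimator_checkpoints, bc_rollouts, every=25):
    """Hypothetical sketch: evaluate a subset of the saved estimator checkpoints
    on all saved BC trajectories, scattering synthetic scores against the
    trajectories' true environment rewards (one panel per checkpoint)."""
    true_returns = np.array([ret for _, ret in bc_rollouts])
    for idx in range(0, len(estimator_checkpoints), every):
        estimator = estimator_checkpoints[idx]
        scores = [estimator.score(traj) for traj, _ in bc_rollouts]
        plt.figure()
        plt.scatter(true_returns, scores)
        plt.xlabel("true environment reward of BC trajectory")
        plt.ylabel(f"synthetic reward of estimator {idx}")
    plt.show()
```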