Learning Control by Iterative Inversion

Gal Leibovich¹  Guy Jacob¹  Or Avner²  Gal Novik¹  Aviv Tamar²

Equal contribution; author order determined by coin toss. ¹Intel Labs, Haifa, Israel. ²Department of Electrical Engineering, Technion, Haifa, Israel. Correspondence to: Gal Leibovich, Guy Jacob, Or Avner, Aviv Tamar.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We propose iterative inversion, an algorithm for learning an inverse function without input-output pairs, but only with samples from the desired output distribution and access to the forward function. The key challenge is a distribution shift between the desired outputs and the outputs of an initial random guess, and we prove that iterative inversion can steer the learning correctly, under rather strict conditions on the function. We apply iterative inversion to learn control. Our input is a set of demonstrations of desired behavior, given as video embeddings of trajectories (without actions), and our method iteratively learns to imitate trajectories generated by the current policy, perturbed by random exploration noise. Our approach does not require rewards, and only employs supervised learning, which can be easily scaled to use state-of-the-art trajectory embedding techniques and policy representations. Indeed, with a VQ-VAE embedding and a transformer-based policy, we demonstrate non-trivial continuous control on several tasks (videos available at https://sites.google.com/view/iter-inver). Further, we report improved performance on imitating diverse behaviors compared to reward-based methods.

1. Introduction

The control of dynamical systems is fundamental to various disciplines, such as robotics and automation. Consider the following trajectory tracking problem. Given some deterministic but unknown actuated dynamical system,

    $s_{t+1} = f(s_t, a_t)$,        (1)

where $s$ is the state and $a$ is an actuation, and some reference trajectory $s_0, \ldots, s_T$, we seek actions that drive the system along a trajectory similar to the reference. For systems that are simple enough, e.g., linear or low dimensional, classical control theory (Bertsekas, 1995) offers principled and well-established system identification and control solutions. However, for several decades this problem has captured the interest of the machine learning community, where the prospect is scaling up to high-dimensional systems with complex dynamics by exploiting patterns in the system (Mnih et al., 2015; Lillicrap et al., 2015).

In reinforcement learning (RL), learning is driven by a manually specified reward signal $r(s, a)$. While this paradigm has recently yielded impressive results, defining a reward signal can be difficult for certain tasks, especially when high-dimensional observations such as images are involved. An alternative to RL is inverse RL (IRL), where a reward is not manually specified. Instead, IRL algorithms learn an implicit reward function that, when plugged into an RL algorithm in an inner loop, yields a trajectory similar to the reference. The signal driving IRL algorithms is a similarity metric between trajectories, which can be manually defined or learned (Ho & Ermon, 2016). We propose a different approach to learning control, which requires neither explicit nor implicit reward functions, and also does not require access to a similarity metric between trajectories.
Our main idea is that Equation (1) prescribes a mapping $F$ from an action sequence to a state sequence,

    $s_0, \ldots, s_T = F(a_0, \ldots, a_{T-1})$.        (2)

The control learning problem can therefore be framed as finding the inverse function $F^{-1}$, without knowing $F$, but with the possibility of evaluating $F$ on particular action sequences (a.k.a. roll-outs). Learning the inverse function $F^{-1}$ using regression can be easy if one has samples of action sequences and corresponding state sequences, and a distance measure over actions. However, in our setting, we do not know the action sequences that correspond to the desired reference trajectories, a problem that we term inversion distribution shift.

Interestingly, for some mappings $F$, an iterative regression technique can be used to find $F^{-1}$. In this scheme, which we term Iterative Inversion (IT-IN), we start from arbitrary action sequences, collect their corresponding state trajectories, and regress to learn an inverse. We then apply this inverse to the reference trajectories to obtain new action sequences, and repeat. We show that with linear regression, iterative inversion converges under quite restrictive criteria on $F$, such as being strictly monotone with a bounded ratio of derivatives, by establishing a connection between iterative inversion and Newton's method. Nevertheless, our result shows that for some systems, a controller can be found without a reward function or access to a distance measure on states (only actions), and that iterative inversion effectively steers the learning to overcome distribution shift.

We then apply iterative inversion to several continuous control problems. In our setting, the desired behavior is expressed through a video embedding of a desired trajectory (without actions), using a VQ-VAE (Van Den Oord et al., 2017), and a deep network policy maps this embedding and a state history to the next action. The agent generates trajectories from the system using its current policy, given the desired embeddings as input, and subsequently learns to imitate its own trajectories, conditioned on their own embeddings. We find that when iterating this procedure, the input of the desired trajectory embeddings steers the learning towards the desired behavior, as in iterative inversion. Given the strict conditions for convergence of iterative inversion, there is no a-priori reason to expect that our method will work for complex non-linear systems and expressive policies. Curiously, however, we report convergence on all the scenarios we tested, and furthermore, the resulting policy generalized well to imitating trajectories that were not seen in its steering training set. This surprising observation suggests that IT-IN may offer a simple supervised-learning-based alternative to methods such as RL and IRL, with several potential benefits, such as a reward-less formulation and the simplicity and stability of the (iterated) supervised learning loss function. Furthermore, in experiments where the desired behaviors are abundant and diverse, we report that IT-IN outperforms reward-based methods, even with an accurate state-based reward.
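To make the viewpoint of Equation (2) concrete, the short sketch below composes a one-step dynamics function into the trajectory-level map $F$ whose inverse we seek. The function names and the toy dynamics are illustrative placeholders, not part of the paper's implementation.

```python
# A minimal rendering of Eq. (2): F composes the one-step dynamics f over an action
# sequence to produce the state sequence. The names below (forward_map, the toy f)
# are illustrative placeholders, not the paper's code.
def forward_map(f, s0, actions):
    """F(a_0, ..., a_{T-1}) -> (s_0, ..., s_T), using s_{t+1} = f(s_t, a_t)."""
    states = [s0]
    for a in actions:
        states.append(f(states[-1], a))
    return states

# toy usage: a scalar double-integrator-like system, s = (position, velocity)
f = lambda s, a: (s[0] + 0.1 * s[1], s[1] + 0.1 * a)
print(forward_map(f, (0.0, 0.0), [1.0, 1.0, -1.0]))
```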
2. Iterative Inversion

We describe a general problem of learning an inverse function under a distribution shift, and present the iterative inversion algorithm. We then analyse the convergence of iterative inversion in several simplified settings. In the sequel, we apply iterative inversion to learning control.

Let $F : \mathcal{X} \to \mathcal{Y}$ be a bijective function. We are given a set of $M$ desired outputs $y_1, \ldots, y_M \in \mathcal{Y}$, and an arbitrary set of $M$ initial inputs $x_1, \ldots, x_M \in \mathcal{X}$. We assume that $F$ is not known, but we are allowed to observe $F(x)$ for any $x \in \mathcal{X}$ that we choose during our calculations. Our goal is to find a function $G : \mathcal{Y} \to \mathcal{X}$ such that for any desired output $y_i$, we have $G(y_i) = F^{-1}(y_i)$. More specifically, we will adopt a parametric setting, and search for a parametric function $G_\theta$, where $\theta \in \Theta$ is a parameter vector, that minimizes the average loss:

    $\min_{\theta \in \Theta} \; \frac{1}{M} \sum_{i=1}^{M} L\left(G_\theta(y_i), F^{-1}(y_i)\right)$.        (3)

Figure 1: Learning an inverse function under a distribution shift. We wish to learn the inverse function over outputs $y_1, \ldots, y_M$, using linear least squares, having matching inputs-outputs for $x_1, \ldots, x_M$.

For example, $G_\theta$ could represent the space of linear functions $G_\theta(y) = \theta^T y + \theta_{\mathrm{bias}}$, and $L$ could be the squared error between inputs, $L(x, x') = (x - x')^2$. This example, which is depicted in Figure 1 for the 1-dimensional case $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, corresponds to a linear least squares fit of the inverse function. As can be seen, the challenge arises from the inversion distribution shift: the mismatch between the distributions of the desired outputs and the initial inputs.

Definition 1 (Inversion Distribution Shift). The difference between the outputs of the initial distribution, $F(x_1), \ldots, F(x_M)$, and the desired outputs $y_1, \ldots, y_M$.

The iterative inversion algorithm, proposed in Algorithm 1, seeks to solve problem (3) iteratively. In the algorithm, and in the following analysis, we define the initial inputs $x_1, \ldots, x_M$ implicitly, as the inverse using an initial parameter $\theta_0$, i.e., $x_i = G_{\theta_0}(y_i)$.

Algorithm 1 Iterative Inversion
Require: Desired outputs $y_1, \ldots, y_M \in \mathcal{Y}$, loss function $L : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, initial parameter $\theta_0$.
1: for $n = 0, 1, 2, \ldots$ do
2:   Calculate current inputs: $x^n_1, \ldots, x^n_M = G_{\theta_n}(y_1), \ldots, G_{\theta_n}(y_M)$
3:   Calculate current outputs: $y^n_1, \ldots, y^n_M = F(x^n_1), \ldots, F(x^n_M)$
4:   Regression: $\theta_{n+1} = \arg\min_{\theta \in \Theta} \frac{1}{M} \sum_{i=1}^{M} L\left(G_\theta(y^n_i), x^n_i\right)$
5: end for

We next investigate when, and why, iterative inversion should produce an effective solution for (3). For our analysis, we restrict ourselves to the following setting:

Assumption 1. The function class $G_\theta$ is linear, and the loss $L$ is the squared error.

We analyze convergence for different classes of functions $F$. Denote by $X_n \doteq (x^n_1, \ldots, x^n_M)^T \in \mathbb{R}^{M \times \dim(\mathcal{X})}$ and $F(X_n) \doteq (F(x^n_1), \ldots, F(x^n_M))^T \in \mathbb{R}^{M \times \dim(\mathcal{Y})}$ the input and output matrices; by $\bar{X}_n \doteq \sum_{i=1}^{M} x^n_i / M \in \mathbb{R}^{\dim(\mathcal{X})}$, $\bar{Y} \doteq \sum_{i=1}^{M} y_i / M$, and $\bar{F}(X_n) \doteq \sum_{i=1}^{M} F(x^n_i) / M \in \mathbb{R}^{\dim(\mathcal{Y})}$ the current-input, desired-output, and current-output means; by $(\cdot)^{\dagger}$ the Moore-Penrose pseudoinverse operator; and by $F^{-1}$ the ground-truth inverse function.

We start with the simple case of a linear $F$. As is clear from Figure 1, inversion distribution shift is not a problem in this case, as the inverse function is the same for any $x$, and iterative inversion converges in a single iteration.

Theorem 1. If Assumption 1 holds, $F$ is a linear function, and $\mathrm{rank}\left(F(X_0) - \bar{F}(X_0)\right) = \dim(\mathcal{Y})$, then Algorithm 1 converges in one iteration, i.e., $y^1_1, \ldots, y^1_M = y_1, \ldots, y_M$.
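As a minimal illustration of Algorithm 1 with a linear $G_\theta$ (a sketch under assumptions, not the paper's code), the snippet below runs iterative inversion on a 1-dimensional, close-to-linear $F$; the specific forward function, data ranges, and iteration count are arbitrary choices. The error typically drops quickly and then plateaus at the residual of the best linear fit of $F^{-1}$ over the desired outputs.

```python
# Minimal 1-D sketch of Algorithm 1 (iterative inversion with linear regression).
# F(x) = x + 0.3*sin(x) is an illustrative choice with max|F'|/min|F'| = 1.3/0.7 < 2.
import numpy as np

def F(x):
    return x + 0.3 * np.sin(x)

rng = np.random.default_rng(0)
ys = rng.uniform(5.0, 8.0, size=50)          # desired outputs y_1, ..., y_M
xs = rng.uniform(-1.0, 1.0, size=ys.size)    # arbitrary initial inputs, far from F^{-1}(y_i)

for n in range(15):
    outs = F(xs)                                        # observe F on the current inputs
    A = np.stack([outs, np.ones_like(outs)], axis=1)    # design matrix for G_theta(y) = a*y + b
    (a, b), *_ = np.linalg.lstsq(A, xs, rcond=None)     # regression from outputs back to inputs
    xs = a * ys + b                                     # apply the fitted inverse to the desired outputs
    print(f"iter {n:2d}   mean |F(x_i) - y_i| = {np.abs(F(xs) - ys).mean():.2e}")
```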
We next analyse a non-linear $F$. Our insight is that iterative inversion can be interpreted as a variant of the classic Newton's method (Ortega & Rheinboldt, 2000), where we replace the unknown Jacobian $J$ of $F$ with a linear approximation using the current input-output pairs, and the evaluation of $F$ with the mean of the current outputs.

Recall that Newton's method seeks to find the root $x$ of a function $r(x) = F(x) - y$ using the iterative update rule $x_{n+1} = x_n + (y - F(x_n))[J(x_n)]^{-1}$, where $[J(x_n)]^{-1}$ is the inverse of the Jacobian of $F$ at $x_n$. Iterative inversion, similarly, applies the following update rule, as proved in Appendix A.1,

    $\bar{X}_{n+1} = \bar{X}_n + \left(\bar{Y} - \bar{F}(X_n)\right) \hat{J}^{-1}_n$,        (4)

where $\hat{J}^{-1}_n \doteq \left(F(X_n) - \bar{F}(X_n)\right)^{\dagger} \left(X_n - \bar{X}_n\right)$ is the Jacobian of $G_{\theta_{n+1}}$, the linear regressor plane from $F(x)$ to $x$ at $x^n_1, \ldots, x^n_M$, which can be considered an approximation of $[J(\bar{X}_n)]^{-1}$. When the approximations $\hat{J}^{-1}_n \approx [J(\bar{X}_n)]^{-1}$ and $\bar{F}(X_n) \approx F(\bar{X}_n)$ are accurate, iterative inversion coincides with Newton's method, and enjoys similar convergence properties, as we establish next.

Assumption 2. $F : \mathbb{R}^K \to \mathbb{R}^K$ is bijective,¹ and $F$ and $F^{-1}$ are both continuously differentiable.

Denote by $J(x)$ the Jacobian matrix of $F$ at $x \in \mathcal{X}$, and by $J^{-1}(x) \doteq [J(x)]^{-1}$ the inverse matrix of $J(x)$, which, under Assumption 2, is the Jacobian of $F^{-1}$ at $F(x) \in \mathcal{Y}$. Also denote by $\|\cdot\|$ any induced matrix norm (Horn & Johnson, 2012). We assume that the derivatives of $F$ and $F^{-1}$ are bounded.

Assumption 3. $\|J(x_1) - J(x_2)\| \le \gamma$, $\|J(x)\| \le \zeta$, and $\|J^{-1}(x)\| \le \beta$ for all $x_1, x_2, x \in \mathbb{R}^K$.

¹The bijection assumption is required for our theoretical analysis, to properly define the inverse function. In our experiments, however, we consider intent and action spaces of different dimensions, which are not bijective. In addition, we experiment with environments with non-smooth dynamics, where the bounded derivative assumption does not hold.

We further assume that at every iteration $n$, the approximations $\hat{J}^{-1}_n$ and $\bar{F}(X_n)$ are accurate enough.

Assumption 4. For all $n$: $\left\|\bar{F}(X_n) - F(\bar{X}_n)\right\| \le \lambda$, and $\hat{J}^{-1}_n = J^{-1}(\bar{X}_n)(I + \Delta_n)$ with $\|\Delta_n\| \le \delta < 1/(\zeta\beta)$.

Assumption 4 may hold, for example, when the inputs $x^n_1, \ldots, x^n_M$ are distributed densely, relative to the curvature of $F$, and evenly, such that the regression problem in Algorithm 1 is well-conditioned. The requirement $\delta < 1/(\zeta\beta)$ is set to ensure that $\hat{J}^{-1}_n$ is non-singular.

Theorem 2. Suppose Assumptions 1, 2, 3 and 4 hold. Let $\mu \doteq \frac{\zeta^2 \beta \delta}{1 - \zeta\beta\delta}$ and assume $\beta(1+\delta)(\gamma+\mu) < 1$. Let $\rho \doteq \frac{2\lambda\beta(1+\delta)(\mu+\zeta)}{1 - \beta(1+\delta)(\mu+\gamma)}$. Then for every $\epsilon > 0$ there exists $k < \infty$ such that $\left\|\bar{F}(X_k) - \bar{Y}\right\| \le \rho + \epsilon$.

Theorem 2 shows that under sufficient conditions, the iterative inversion method is able to steer learning across any distribution shift, bringing the average output $\bar{F}(X_k)$ close to the average desired output $\bar{Y}$, regardless of the initial $X_0$. The term $\rho$ can be interpreted as the radius of the ball centered at $\bar{Y}$ that the sequence converges to. The proof of Theorem 2 builds on the analysis of Newton's method to show that IT-IN is an iterated contraction, and is reported in Section A.3 of the supplementary material. To get some intuition about Theorem 2, consider the following example.

Example 1. Consider the 1-dimensional case $F : \mathbb{R} \to \mathbb{R}$ (cf. Fig. 1), where the second approximation in Assumption 4 is perfect, i.e., $\delta = \mu = 0$. Then, the condition for convergence is $\beta(1+\delta)(\gamma+\mu) = \beta\gamma < 1$, which is equivalent to $\frac{\max_x |F'(x)|}{\min_x |F'(x)|} < 2$, i.e., an $F$ that is "close to linear".

The conditions in Theorem 2 can therefore be intuitively interpreted as $F$ being close to linear globally, and the linear approximation being accurate locally. In Appendix A.4, we provide additional convergence results that use a different analysis technique for the simple case presented in Example 1, where $F : \mathbb{R} \to \mathbb{R}$. These results do not require Assumption 4, but still require a condition similar to $\frac{\max_x |F'(x)|}{\min_x |F'(x)|} < 2$, and show a linear convergence rate. We further remark that a quadratic convergence rate is known for Newton's method when the initial iterate is close to optimal; we believe that similar results can be shown for IT-IN as well. Here, however, we focus on the case of an arbitrary initial iterate, similar to the experiments we shall describe in the sequel.
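The update rule (4) can be checked numerically. The sketch below (not from the paper's code; the forward map and dimensions are arbitrary) fits the least-squares regressor of Algorithm 1 once and verifies that applying it to the desired outputs moves the input mean exactly as Eq. (4) prescribes.

```python
# Numerical check of the mean-update identity in Eq. (4): applying the fitted linear
# inverse to the desired outputs shifts the input mean by (Ybar - Fbar(Xn)) @ Theta.
# The forward map, dimensions, and data below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
M, dx, dy = 64, 2, 2
Xn = rng.normal(size=(M, dx))                         # current inputs
Fmap = lambda X: np.tanh(X) + 0.5 * X                 # arbitrary smooth forward map
Y = rng.normal(loc=2.0, size=(M, dy))                 # desired outputs

FX = Fmap(Xn)
Theta = np.linalg.pinv(FX - FX.mean(0)) @ (Xn - Xn.mean(0))   # least-squares slope (cf. Eq. (6))
bias = Xn.mean(0) - FX.mean(0) @ Theta

lhs = (Y @ Theta + bias).mean(0)                      # mean of X_{n+1} = G_{theta_{n+1}}(Y)
rhs = Xn.mean(0) + (Y.mean(0) - FX.mean(0)) @ Theta   # right-hand side of Eq. (4)
print(np.allclose(lhs, rhs))                          # True
```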
3. Iterative Inversion for Learning Control

In this section, we apply iterative inversion to learning control. We first present our problem formulation, and then propose an IT-IN algorithm.

We follow a standard RL formulation. Let $\mathcal{S}$ denote the state space, $\mathcal{A}$ denote the action space, and consider the dynamical system in Equation (1). We assume, for simplicity, that the initial state $s_0$ is fixed, and that the time horizon is $T$.² Given a state-action trajectory $\tau = s_0, a_0, \ldots, s_{T-1}, a_{T-1}, s_T \in \Omega$, where $\Omega$ denotes the $T$-step trajectory space, we denote by $\tau_s \in \Omega_s$ its state component and by $\tau_a \in \Omega_a$ its action component, i.e., $\tau_s = s_0, \ldots, s_T$, $\tau_a = a_0, \ldots, a_{T-1}$, and $\Omega = \Omega_s \times \Omega_a$. We will henceforth refer to $\tau_s$ as a state trajectory and to $\tau_a$ as an action trajectory. Let $F$ denote the mapping from an action trajectory to the resulting state trajectory, as given by Equation (2). For presenting our control learning problem, we will assume that $F$ is bijective, and therefore $F^{-1}$ is well defined. We emphasize, however, that our algorithm makes no explicit use of $F^{-1}$, and our empirical results are demonstrated on problems where this assumption does not hold.

²A varying horizon can be handled as an additional input to $F$.

We represent a state trajectory using an embedding function $z = Z(\tau_s) \in \mathcal{Z}$, and we term $z$ the intent. Note that $z$, by definition, can contain partial information about $\tau_s$, such as the goal state (Ghosh et al., 2019). In all the experiments reported in the sequel, we generated intents by feeding a rendered video of the state trajectory into a VQ-VAE encoder, which we found to be simple and well performing.

Consider a state-action trajectory $\tau$, with a corresponding intent $Z(\tau_s)$. We would like to learn a policy that reconstructs the intent into its corresponding action trajectory $\tau_a$, and can be used to control the system to produce a similar $\tau$. Let $\mathcal{H}_t$ denote the space of $t$-length state-action histories, and let a policy be $\pi_t : \mathcal{Z} \times \mathcal{H}_t \to \mathcal{A}$. With a slight abuse of notation, we denote by $\pi(z) \in \Omega_a$ the action trajectory that is obtained when applying $\pi_t$ sequentially for $T$ time steps (i.e., a rollout). Similarly to the problem in Section 2, our goal is to learn a policy such that $\pi(Z(\tau_s)) = F^{-1}(\tau_s)$. More specifically, let $L : \Omega_a \times \Omega_a \to \mathbb{R}$ be a loss function between action trajectories, and let $P(\tau_s)$ denote a distribution over desired state trajectories; we seek a policy $\pi_\theta$, parameterized by $\theta \in \Theta$, that minimizes the average loss:

    $\min_{\theta \in \Theta} \; \mathbb{E}_{\tau_s \sim P} \left[ L\left(\pi_\theta(Z(\tau_s)), F^{-1}(\tau_s)\right) \right]$.        (5)

In our approach we assume that $P(\tau_s)$ is not known, but we are given a set $D_{\mathrm{steer}}$ of $M$ intents, $z_1, \ldots, z_M$, where $z_i = Z(\tau^i_s)$, and the $\tau^i_s$ are drawn i.i.d. from $P(\tau_s)$. Henceforth, we will refer to $D_{\mathrm{steer}}$ as the steering dataset, as it should act to steer the learning of the inverse mapping towards the desired trajectory distribution $P(\tau_s)$.
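For concreteness, the rollout operator $\pi(z)$ can be written as follows; `rollout`, `f`, and `policy` are hypothetical stand-ins rather than the paper's API.

```python
# A sketch of the rollout operator pi(z): the intent-conditioned policy pi_t is applied
# sequentially for T steps through the dynamics s_{t+1} = f(s_t, a_t). The names below
# are stand-ins, not the paper's code.
def rollout(f, policy, intent, s0, T):
    states, actions = [s0], []
    for t in range(T):
        a_t = policy(intent, states, actions)   # pi_t : Z x H_t -> A
        states.append(f(states[-1], a_t))       # deterministic dynamics, Eq. (1)
        actions.append(a_t)
    return states, actions                      # tau_s = s_0..s_T, tau_a = a_0..a_{T-1}

# toy usage: scalar dynamics, and a "policy" that ignores its history
tau_s, tau_a = rollout(lambda s, a: s + a, lambda z, S, A: 0.1 * z, intent=1.0, s0=0.0, T=5)
```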
It is worth relating the problem above to the general inverse problem in Section 2, and to the distribution shift in Definition 1. Initially, the policy is not expected to be able to produce state-action trajectories that match the state trajectories in $D_{\mathrm{steer}}$, but only trajectories that are output by the initial (typically random) policy. While these initial trajectories could be used as data for imitation learning, yielding an intent-conditioned policy, there is no reason to expect that this policy will be any good for intents in $D_{\mathrm{steer}}$, which are out-of-distribution with respect to this training data.

We now propose a method for solving Problem (5) based on iterative inversion, as detailed in Algorithm 2. There are four notable differences from the iterative inversion method in Algorithm 1. First, we operate on batches of size $N$ instead of on the whole steering data (of size $M$), for computational efficiency. Second, we sample a batch of intents from a mixture of the steering dataset and the intents calculated for rollouts in the previous iteration; we found that this helps stabilize the algorithm. Third, we add random exploration noise to the policy when performing the rollouts, which we found to be necessary (see Sec. 5). Fourth, we use a replay buffer for the supervised learning part of the algorithm, also for improved stability. For $L$, we use the MSE between action trajectories, and for the optimization in line 7 we perform several epochs of gradient-based optimization using Adam (Kingma & Ba, 2014), keeping the state history input to $\pi_\theta(\hat{z})$ fixed as $\tau_s$ when computing the gradient. The size of the replay buffer was set to $K \cdot N$.

Algorithm 2 Iterative Inversion for Learning Control
Require: Steering data $D_{\mathrm{steer}}$, exploration noise parameter $\eta$, steering ratio $\alpha \in [0, 1]$, batch size $N$
1: Initialize $D_{\mathrm{prev}} = D_{\mathrm{steer}}$, $\theta_0$ arbitrary
2: for $n = 0, 1, 2, \ldots$ do
3:   Sample $\alpha N$ intents from $D_{\mathrm{steer}}$ and $(1-\alpha)N$ intents from $D_{\mathrm{prev}}$, yielding $z_1, \ldots, z_N$
4:   Perform $N$ rollouts $\tau^1, \ldots, \tau^N$ using policy $\pi_{\theta_n}$ with input intents $z_1, \ldots, z_N$, adding exploration noise $\eta$
5:   Compute intents for the rollouts: $\hat{z}_i = Z(\tau^i_s)$, $i \in 1, \ldots, N$
6:   Add intents and trajectories $\{\hat{z}_i, \tau^i\}$ to the replay buffer
7:   Train $\pi_{\theta_{n+1}}$ by supervised learning: $\theta_{n+1} = \arg\min_{\theta \in \Theta} \sum_{\{\hat{z}, \tau\} \in \text{replay buffer}} L\left(\pi_\theta(\hat{z}), \tau_a\right)$
8:   Set $D_{\mathrm{prev}} = \{\hat{z}_i\}_{i=1}^{N}$
9: end for

The astute reader may notice that while the desired intents $D_{\mathrm{steer}}$ are used as input to the data-collection policy, they are actually detached from the training loss throughout learning, by the relabelling in line 5. What, then, drives the learning closer to $D_{\mathrm{steer}}$? Our analysis in Section 2 explains how, under suitable conditions on $F$, the policy can in fact be steered appropriately by such an algorithm.

Remark 1. Note that a small loss over the actions, per Eq. (5), does not necessarily imply a trajectory similar to the reference. This is because (1) errors accumulate over time (generally, the error in the states can grow linearly with $T$; cf. Assumption 3), and (2) there is a possible distribution shift, which grows with $T$, between the states used to train the policy (line 7) and the states that would be encountered during a rollout from the trained policy when given the same intent; this can lead to state errors that grow quadratically with $T$ (Ross et al., 2011). This is a different distribution shift from our Definition 1, which concerns the difference between the rollouts of the initial policy and the desired outcomes, and holds also for $T = 1$. While steering is not expected to fix error accumulation, in practice we found that Algorithm 2 did lead to accurate trajectories, even for systems where a small mistake can be dramatic. We attribute this to the training of a policy to imitate many noisy trajectories, which stabilizes the learned control.
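The control flow of Algorithm 2 is summarized in the runnable skeleton below. Only the flow itself (mixed intent batch, noisy rollouts, hindsight relabelling, supervised imitation from a replay buffer) is taken from the algorithm box; every component name, the list-based buffer, and sampling with replacement are placeholder assumptions.

```python
# One iteration of Algorithm 2 (IT-IN), written as a runnable skeleton.
import random

def itin_iteration(policy, D_steer, D_prev, replay_buffer, *,
                   Z, rollout_fn, train_fn, alpha=0.3, N=200, eta=1.0):
    n_steer = int(alpha * N)                                           # line 3: mix steering
    intents = random.choices(D_steer, k=n_steer) + \
              random.choices(D_prev, k=N - n_steer)                    #         and previous intents
    trajs = [rollout_fn(policy, z, noise_scale=eta) for z in intents]  # line 4: noisy rollouts
    hindsight = [Z(tau_s) for (tau_s, tau_a) in trajs]                 # line 5: relabel intents
    replay_buffer.extend(zip(hindsight, trajs))                        # line 6
    train_fn(policy, replay_buffer)                                    # line 7: supervised imitation
    return hindsight                                                   # line 8: becomes D_prev
```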
Note the simplicity of the IT-IN algorithm: it only involves exploration and supervised learning; there are no rewards, and the loss function is routine. In Section 5, we provide empirical evidence that, perhaps surprisingly given the strict conditions for convergence of iterative inversion, IT-IN yields well-performing policies on nontrivial tasks.

4. Related Work

Iterative inversion is similar to the breeder algorithm (Nair et al., 2008) for inverting black-box functions, but with one crucial difference: breeder starts from an in-distribution input-output pair, termed a prototype code vector, and grows input-output pairs around it, to avoid distribution shift. Our method does not require any in-distribution samples, and we show how distribution shift can be overcome by steering.

Goal-conditioned supervised learning (GCSL, Ghosh et al. 2019) is essentially a special case of iterative inversion where intents are chosen to be goal observations. Our contribution, however, is investigating the steering component of the method, an idea that originated in the algorithm of Ghosh et al. (2019), but was not investigated at all to our knowledge. The theory of Ghosh et al. (2019) assumes no distribution shift (coverage of all goals in the data), and the idea that steering can overcome distribution shift is, to the best of our knowledge, novel. In addition, we show that our approach can learn to track trajectories accurately, which is important for tasks where the whole trajectory is important, and not just the goal.

Learning inverse dynamics models is popular in robotics (Nguyen-Tuong & Peters, 2010; Calandra et al., 2015; Meier et al., 2016; Christiano et al., 2016), and typically requires the full state trajectory for predicting corresponding actions, and training data from the desired state-action distribution; our approach requires only an embedding of the desired trajectory, and our focus is on the setting of inversion distribution shift. The only study we are aware of in this direction is Hong et al. (2020), where an RL agent is trained to steer data collection to areas where the inverse model errs. However, Hong et al. (2020) require the full state trajectory as policy input, require RL training as part of their method, and report difficulty in scaling to high-dimensional action spaces, where their curiosity-based exploration is not effective enough to steer learning towards desired behavior. Recently, Baker et al. (2022) used a transformer to learn an inverse model conditioned on video, but they collected human-labelled data to train their model on desired behavior trajectories. Our approach is self-supervised.

In learning from demonstrations (Argall et al., 2009), the data typically contains both states and actions, enabling direct supervised learning, either by behavioral cloning (Pomerleau, 1988) or interactive methods such as DAgger (Ross et al., 2011). Inverse RL (IRL) can learn from demonstrations without actions, and methods such as apprenticeship learning (Abbeel & Ng, 2004) or generative adversarial imitation learning (Ho & Ermon, 2016; Peng et al., 2022) simultaneously train a critic that discriminates between data trajectories and policy rollouts (a classification problem), and a policy that confuses the critic as best as possible (an RL problem). Our emphasis is on learning a policy that can reconstruct diverse behaviors, conditioned on an intent, different from most IRL studies that consider a single task (Ho & Ermon, 2016; Edwards et al., 2019) or unconditional generation (Peng et al., 2022).
While Fu et al. (2019) and Ding et al. (2019) considered a goal-conditioned IRL setting, we are not aware of IRL methods that can be conditioned on a more expressive description than a target goal state, such as a complete trajectory embedding, as we explore here. Our approach also avoids the need for training a critic or an RL agent in an inner loop.

In self-supervised RL, the agent does not receive a reward and uses its own experience to explore the environment by training a goal-conditioned policy and proposing novel goals (Pathak et al., 2018; Ecoffet et al., 2019; Hazan et al., 2019; Sekar et al., 2020; Mendonca et al., 2021). The space of all trajectories is much larger than the space of all states, and we are not aware of methods that demonstrably explore such a space. For this reason, in our approach we steer the exploration towards a set of desired trajectories.

5. Experiments

In this section, we evaluate IT-IN on several domains. Our investigation is aimed at studying the unique features of IT-IN, and especially the steering behavior that we expect to observe. We start by describing our evaluation domains and implementation details that are common to all our experiments. We then present a series of experiments aimed at answering specific questions about IT-IN. To appreciate the learned behavior, we encourage the reader to view our supporting video results at the project website.³

³https://sites.google.com/view/iter-inver

Common Settings:

VQ-VAE Intents: For all our experiments, we generate intents using a VQ-VAE embedding of a rendered video of the trajectory. Rendering settings are provided below for each environment. We use VideoGPT's VQ-VAE implementation (Yan et al., 2021). An input video of size $64 \times 64 \times T$ (width, height, time) is encoded into a $16 \times 16 \times T/4$ integer intent $z_i$, given a codebook of size 50. Each integer represents a float vector of length 4. The training of the VQ-VAE is not the focus of this work, and we detail the training data for each VQ-VAE separately for each domain in the supplementary material. We remark that by visually inspecting the reconstruction quality, we found that our VQ-VAEs generalized well to the trajectories seen during learning.

GPT-based policies and exploration noise: The policy architecture is adapted from VideoGPT (Yan et al., 2021), and consists of 8 layers, 4 heads, and a hidden dimension of size 64. The model is conditioned on the intent via cross-attention. In the supplementary material, we report similar results with a GRU-based policy. Our exploration noise adds Gaussian noise of scale $\eta$ to the action output.

Evaluation Protocol: While our algorithm only uses a loss on actions, a loss on the resulting trajectories is often easier to interpret for measuring performance. We measure the sum of Euclidean distances between agent state variables, accumulated over time, as a proxy for trajectory similarity; in our results, this measure is denoted as MSE. Except when explicitly noted otherwise, all our results are evaluated on test trajectories (and corresponding intents) that were not in the steering data, but were generated from the same trajectory distribution. None of the trajectories we plot or our video results are cherry-picked.
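The intent dimensions quoted above imply the following bookkeeping (a sketch, not VideoGPT's code). Note that for $T = 16$ the flattened size comes out to 4096 floats, which appears consistent with the context length mentioned in the baseline comparison later in this section.

```python
# Shape bookkeeping for the VQ-VAE intents: a (T, 64, 64) video is downsampled to a
# (T/4, 16, 16) grid of codebook indices in [0, 50); each index maps to a 4-dim vector.
# This is an illustrative helper, not the actual encoder.
def intent_shape(T, spatial=64, downsample=(4, 4, 4), embed_dim=4):
    t, h, w = T // downsample[0], spatial // downsample[1], spatial // downsample[2]
    n_codes = t * h * w
    return (t, h, w), n_codes, n_codes * embed_dim   # latent grid, #tokens, flattened floats

print(intent_shape(16))   # ((4, 16, 16), 1024, 4096)
print(intent_shape(64))   # ((16, 16, 16), 4096, 16384)
```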
2D Particle: A particle robot is moved on a frictionless 2D plane by applying a force $F = [F_X, F_Y]$ for a duration of $\Delta t$. The observation space includes the particle positions and velocities, $S = [X, Y, V_X, V_Y]$, and motion videos are rendered using Matplotlib Animation (Hunter, 2007). While relatively simple for control, this environment allows for distinct and diverse behaviors that are easy to visualize. We experiment with 2 behavior classes, for which we procedurally created training trajectories: (1) Spline motion, and (2) Deceleration motion. Both require highly coordinated actions, and are very different from the motion that a randomly initialized policy induces. Full details about the datasets are described in Appendix B.1.1.

Reacher: A 2-DoF robotic arm from OpenAI Gym's Mujoco Reacher-v2 environment (Brockman et al., 2016). While usually in Reacher-v2 the agent is rewarded for reaching a randomly generated target, the goal in our setting is for the policy to reconstruct the whole arm motion, as given by the intent. The intent is encoded from a video of the motion rendered using Mujoco (Todorov et al., 2012). We handcrafted a trajectory dataset, termed Fixed Joint, which is fully described in Appendix B.2.1.

Hopper: From OpenAI Gym's Mujoco Hopper-v2 environment (Brockman et al., 2016). The dataset is from D4RL's hopper-medium-v2 (Fu et al., 2020), and consists of mostly forward hopping behaviors (see Appendix B.3.1). There are several challenges in this domain: (1) the dynamics are non-linear, and include a non-smooth contact with the ground; (2) the desired behavior (hopping) is very different from the behavior of an untrained policy (falling), and requires applying a very specific force exactly when making contact with the ground (a bottleneck in state space); and (3) the camera is fixed on the agent, and forward movement can only be inferred from the movement of the background.
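As a concrete reference for the environments and the evaluation protocol above, here is a hedged sketch of a particle transition and of the trajectory-similarity proxy; the environment's exact integrator is not specified in this section, so a semi-implicit Euler step with unit mass is assumed, with $\Delta t = 0.1$ taken from Appendix B.1.1.

```python
# Hedged sketch of the 2D particle dynamics (unit mass, frictionless, force held for a
# step of duration dt) and the trajectory-similarity score (sum of per-step Euclidean
# distances between states, denoted "MSE" in the tables). The integrator is an assumption.
import numpy as np

DT = 0.1  # from Appendix B.1.1

def particle_step(state, force, dt=DT):
    x, y, vx, vy = state
    fx, fy = force
    vx, vy = vx + fx * dt, vy + fy * dt          # unit mass: dv = F * dt
    return np.array([x + vx * dt, y + vy * dt, vx, vy])

def trajectory_distance(states_a, states_b):
    """Sum over time of Euclidean distances between corresponding state vectors."""
    return float(sum(np.linalg.norm(np.asarray(a) - np.asarray(b))
                     for a, b in zip(states_a, states_b)))
```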
Steering Evaluation

The first question we investigate is whether IT-IN indeed steers learning towards the desired behavior. To answer this, we consider domains where the desired behavior is very different from the behavior of the initial random policy: the Spline and Deceleration motions for the particle, and the hopping behavior for Hopper-v2. As we show in Figure 2 (for the particle) and Figure 3 (for Hopper-v2), IT-IN produces a policy that can track the desired behavior with high accuracy. Videos of rollouts in the Hopper-v2 environment³ demonstrate that the policy is able to accurately reconstruct different hopping motions, based only on their encoded intents. To the best of our knowledge, such accurate motion control, with only an action-less video encoding as input, on a non-trivial dynamical system has not been demonstrated before. We further show, in Figures 8 and 9 in the supplementary material, that IT-IN works well for different trajectory lengths $T$.

Figure 2: Particle results on Splines (top) and Deceleration (bottom). Here $T = 64$ and $|D_{\mathrm{steer}}| = 500$. All trajectories start at (0,0), marked by a blue circle. In Deceleration, the particle quickly decelerates to a stop at $t = 32$; note the small overshoot at the end of each reconstructed trajectory, due to imperfect reconstruction of stopping in place.

Figure 3: Trajectory reconstructions in Hopper-v2, with $T = 128$ and $|D_{\mathrm{steer}}| = 500$. Additional rollouts are presented in Appendix C.6 and in the supporting video results.

Another question is whether IT-IN really steers the policy towards the desired trajectories, or improves general properties of the policy, allowing a generally better reconstruction. We explore this question by a cross-evaluation: evaluating the performance of a policy trained with steering intents from Particle:Splines on test intents from Particle:Deceleration, which we will refer to as out-of-distribution intents, and vice versa. Interestingly, as Table 1 shows, performance on out-of-distribution intents is significantly worse than the performance that would have been obtained by training the policy with these intents as the steering dataset, and is even worse than or comparable to training with no steering at all (cf. Table 2). Example rollouts from this experiment are shown in Appendix C.8.

Table 1: Steering cross-evaluation. See Appendix C.8 for corresponding trajectory visualizations. In all cases $|D_{\mathrm{steer}}| = 500$.

Test Dataset     Steering Dataset     MSE
Splines          Splines               69.2
Splines          Deceleration         210.9
Deceleration     Splines               28.1
Deceleration     Deceleration          18.5

Evaluation of Exploration Noise

We also evaluated the importance of the exploration noise. We tested Splines with $T = 64$ and Hopper-v2 with $T = 128$, with and without exploration noise, and with a large $D_{\mathrm{steer}}$ (1740 for Hopper-v2, 500 for Particle). As the results in Table 8 in the supplementary material show, the exploration noise $\eta$ is crucial for the training procedure to converge towards the desired behavior. We believe that exploration improves the conditioning of the supervised learning problem, and helps produce a policy that stabilizes control along the desired trajectory.

Steering Dataset Size and Generalization

We next evaluate the generalization performance of IT-IN to intents that were not seen in the data, but correspond to state trajectories drawn from $P(\tau_s)$. To investigate this, we consider a domain where the desired behavior is very diverse: the Spline motions for the particle. We also report results on domains where the behavior is less diverse, such as Hopper-v2 and the Deceleration motions for the particle. Naturally, we expect generalization to correlate with $M$, the size of $D_{\mathrm{steer}}$. As our results in Table 2 show, additional steering data indeed improves generalization to unseen trajectories, albeit with diminishing returns as the amount of steering data is increased. As expected, in the more diverse distribution there was more gain to reap from additional data (significant improvement up to $|D_{\mathrm{steer}}| = 50$), compared with the less diverse domains (most of the improvement is achieved already with $|D_{\mathrm{steer}}| = 10$). Trajectory visualizations for Splines with different sizes of $D_{\mathrm{steer}}$ are shown in Appendix C.3.

Table 2: Steering Dataset Size and Generalization. Here $T = 64$, and we show MSE averaged over 3 random seeds. Note that $|D_{\mathrm{steer}}| = 0$ represents the case where no steering is used at all; in this case, we use trajectories sampled from a random policy to initialize $D_{\mathrm{prev}}$ (see Algorithm 2). (*) For Hopper-v2, the maximal $|D_{\mathrm{steer}}|$ is 1740, due to a limited amount of data in D4RL.

                         |Dsteer|=0    10      50      100     500     2000
Particle:Splines          199.7       105.4    75.8    72.7    69.2    66.9
Particle:Deceleration      30.0        20.5    21.1    20.2    17.9    18.6
Hopper-v2 (*)             173.3        68.2    67.2    64.9    67.0    63.0

Comparison with RL and IRL Baselines

In essence, IT-IN performs imitation learning from observations. Thus, a natural comparison is with methods such as GAIL from observations (Torabi et al., 2018). Previous IRL literature focused on learning only a single task, and measured accuracy as task success (e.g., the success of the hopper hopping).
However, the setting where IT-IN shines, and which we focus our evaluation on, is one where the policy must be able to reconstruct a diverse set of behaviors (from their corresponding intents). Thus, our evaluation is not whether a single task succeeds, but how accurately any desired trajectory is tracked (e.g., hopping in a very specific motion). The multi-task setting, where the agent needs to perform many different behaviors specified by different intents, has not been studied before, to the best of our knowledge.

We compare IT-IN with GAIL from observations (Torabi et al., 2018) in the Particle:Splines environment. As we consider the multi-task setting, we also add a comparison with RL baselines where we manually set a relevant reward function. These RL baselines serve to show that any advantage of IT-IN over IRL is not due to some implementation detail (both use the same model architecture), but to the difficulty its RL component has in learning in this multi-intent setting. We consider two reward functions tailored to the Particle:Splines environment: (1) STATE-MSE: the MSE between the desired position and the current position, and (2) INTENT-MSE: a sparse reward, given at the end of the episode, that is the MSE between the intents of the desired trajectory and the executed trajectory. STATE-MSE is privileged compared to IT-IN and is arguably stronger than any IRL method in this task, as the reward is dense and exactly captures the desired behavior; any IRL method will run RL in an inner loop with a reward that is less precise. INTENT-MSE is motivated by the fact that IT-IN effectively learns some similarity measure in intent space, and this reward captures this idea explicitly.

We used exactly the same policy architecture for all comparisons. We found that both RL and IRL methods did not train well with the GPT-based policy architecture,⁴ therefore we report results for the GRU policy (also for IT-IN), which is described in detail in Appendix B.5. We used PPO (Schulman et al., 2017) for RL training, based on the implementation of Kostrikov (2018). Our GAIL implementation is also based on Kostrikov (2018), with modifications to follow the setup of Torabi et al. (2018) and to add the intent as a context input. Additional details are in Appendices B.6 and B.7.

⁴Difficulty of RL with transformers was discussed in (Parisotto et al., 2020; Hausknecht & Wagener, 2022).

Figure 4: Comparison of IT-IN with PPO and GAIL baselines, in the Particle:Splines environment with $T = 16$. All results are MSE (lower is better), each represented with a mean and standard deviation over 3 random seeds. Note that IT-IN outperforms the baselines on test trajectories (graph on the right) for all $D_{\mathrm{steer}}$ sizes. For GAIL, $D_{\mathrm{steer}}$ is also used as the expert data for the discriminator. Additional results are shown in Figure 6.

In Figure 4, we report results both on a held-out test set of trajectories and on the training trajectories. As expected, for a small $D_{\mathrm{steer}}$, STATE-MSE obtains near-perfect reconstruction of training trajectories, yet high error on test trajectories, as the precise reward makes it easy for PPO to overfit. Interestingly, however, when increasing the size of $D_{\mathrm{steer}}$, it becomes more difficult to overfit with PPO, even with the STATE-MSE reward. This highlights a difficulty of RL in the multi-task setting: note that for $|D_{\mathrm{steer}}| = 2000$, the performance of STATE-MSE on training is worse than the performance of IT-IN on test!
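For reference, the two hand-designed baseline rewards described above can be sketched as follows; the sign and scaling conventions are assumptions (negated errors, so that higher is better), not the paper's exact definitions.

```python
# Hedged sketches of the STATE-MSE and INTENT-MSE baseline rewards for Particle:Splines.
import numpy as np

def state_mse_reward(desired_pos, current_pos):
    """Dense reward: negative squared distance between desired and current particle position."""
    return -float(np.sum((np.asarray(desired_pos) - np.asarray(current_pos)) ** 2))

def intent_mse_reward(desired_intent, rollout_intent, t, T):
    """Sparse reward: negative MSE between intents, given only at the final step."""
    if t < T - 1:
        return 0.0
    d, r = np.asarray(desired_intent, float), np.asarray(rollout_intent, float)
    return -float(np.mean((d - r) ** 2))
```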
Our results suggest that vanilla PPO is not well suited to training policies conditioned on very diverse contexts (Our context here is a vector of length 4096). We mention that the recent related work of Peng et al. (2022) trained context embeddings together with Learning Control by Iterative Inversion RL, which may explain their success in learning diverse skills. Importantly, on test data, IT-IN significantly outperforms both RL methods for all Dsteer sizes, even though IT-IN does not use the privileged information in the reward. We attribute this finding to the combination of stable supervised learning updates, and not relying on a reward. Our results for INTENT-MSE do not come close to IT-IN, which we attribute to the more difficult learning from sparse reward (see Appendix C.1). Finally, GAIL was outperformed by both STATE-MSE RL and IT-IN (except for the single demonstration case the standard "single task" GAIL setup). As expected, GAIL was not able to find a better reward than STATE-MSE here. Additional baseline-comparison results are presented in Appendix C.1. 6. Discussion We presented a new formulation for learning control, based on an inverse problem approach, and demonstrated its application to learning deep neural network policies that can reconstruct diverse behaviors, given an embedding of the desired trajectory. We developed the fundamental theory underlying iterative inversion, and demonstrated promising results on several simple tasks. We also found that for very diverse behaviors, our formulation learns and generalizes more effectively than RL or IRL approaches, which we attribute to the stable supervised-learning method at its core. We only considered a particular trajectory embedding based on an off-the-shelf VQ-VAE, which we found to be general and practical. Important questions for future work include characterizing the effect of the embedding on performance, and training an embedding jointly with the policy. Additionally, the exploration noise, which we found to be important, can potentially be replaced with more advanced exploration strategies from the RL literature. Another question is how to generate intents from a partial description of a trajectory, such as a natural language description. Diffusion models, which have recently gained popularity for learning distributions over latent variables (Rombach et al., 2021), are one potential approach for this. Remaining open questions include the gap between the strict conditions for convergence under a linear approximation in our theory and the stable performance we observed in practice with expressive policies and non-linear dynamics, and whether iterative inversion can be extended to nondeterministic systems. Our work provides the fundamentals for further investigating these important questions. Acknowledgements This work received funding from the European Union (ERC, Bayes-RL, Project Number 101041250). Views and opin- ions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1, 2004. Argall, B. D., Chernova, S., Veloso, M., and Browning, B. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469 483, 2009. 
Baker, B., Akkaya, I., Zhokhov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., and Clune, J. Video pretraining (vpt): Learning to act by watching unlabeled online videos. ar Xiv preprint ar Xiv:2206.11795, 2022. Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena Scientific, 1995. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016. Calandra, R., Ivaldi, S., Deisenroth, M. P., Rueckert, E., and Peters, J. Learning inverse dynamics models with contacts. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 3186 3191. IEEE, 2015. Christiano, P., Shah, Z., Mordatch, I., Schneider, J., Blackwell, T., Tobin, J., Abbeel, P., and Zaremba, W. Transfer from simulation to real world through learning deep inverse dynamics model. ar Xiv preprint ar Xiv:1610.03518, 2016. Ding, Y., Florensa, C., Abbeel, P., and Phielipp, M. Goalconditioned imitation learning. Advances in neural information processing systems, 32, 2019. Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. Go-explore: a new approach for hardexploration problems. ar Xiv preprint ar Xiv:1901.10995, 2019. Edwards, A., Sahni, H., Schroecker, Y., and Isbell, C. Imitating latent policies from observation. In International conference on machine learning, pp. 1755 1763. PMLR, 2019. Learning Control by Iterative Inversion Fu, J., Korattikara, A., Levine, S., and Guadarrama, S. From language to goals: Inverse reinforcement learning for vision-based instruction following. ar Xiv preprint ar Xiv:1902.07742, 2019. Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning, 2020. Ghosh, D., Gupta, A., Reddy, A., Fu, J., Devin, C., Eysenbach, B., and Levine, S. Learning to reach goals via iterated supervised learning. ar Xiv preprint ar Xiv:1912.06088, 2019. Hausknecht, M. and Wagener, N. Consistent dropout for policy gradient reinforcement learning. ar Xiv preprint ar Xiv:2202.11818, 2022. Hazan, E., Kakade, S., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp. 2681 2691. PMLR, 2019. Ho, J. and Ermon, S. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016. Hong, Z.-W., Fu, T.-J., Shann, T.-Y., and Lee, C.-Y. Adversarial active exploration for inverse dynamics model learning. In Conference on Robot Learning, pp. 552 565. PMLR, 2020. Horn, R. A. and Johnson, C. R. Matrix analysis. Cambridge university press, 2012. Hunter, J. D. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90 95, 2007. doi: 10.1109/MCSE.2007.55. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Kostrikov, I. Pytorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/ pytorch-a2c-ppo-acktr-gail, 2018. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2015. Meier, F., Kappler, D., Ratliff, N., and Schaal, S. Towards robust online inverse dynamics learning. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4034 4039. IEEE, 2016. Mendonca, R., Rybkin, O., Daniilidis, K., Hafner, D., and Pathak, D. Discovering and achieving goals via world models. 
ar Xiv preprint ar Xiv:2110.09514, 2021. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529 533, 2015. Nair, V., Susskind, J., and Hinton, G. E. Analysis-bysynthesis by learning to invert generative black boxes. In International conference on artificial neural networks, pp. 971 981. Springer, 2008. Nguyen-Tuong, D. and Peters, J. Using model knowledge for learning inverse dynamics. In 2010 IEEE international conference on robotics and automation, pp. 2677 2682. IEEE, 2010. Ortega, J. M. and Rheinboldt, W. C. Iterative Solution of Nonlinear Equations in Several Variables. Society for Industrial and Applied Mathematics, 2000. doi: 10.1137/ 1.9780898719468. Parisotto, E., Song, F., Rae, J., Pascanu, R., Gulcehre, C., Jayakumar, S., Jaderberg, M., Kaufman, R. L., Clark, A., Noury, S., et al. Stabilizing transformers for reinforcement learning. In International conference on machine learning, pp. 7487 7498. PMLR, 2020. Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P., Chen, D., Shentu, Y., Shelhamer, E., Malik, J., Efros, A. A., and Darrell, T. Zero-shot visual imitation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 2050 2053, 2018. Peng, X. B., Guo, Y., Halper, L., Levine, S., and Fidler, S. Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Trans. Graph., 41(4), July 2022. Pomerleau, D. A. Alvinn: An autonomous land vehicle in a neural network. Advances in neural information processing systems, 1, 1988. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models, 2021. Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627 635. JMLR Workshop and Conference Proceedings, 2011. Learning Control by Iterative Inversion Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Seber, G. A. F. G. A. F. Linear regression analysis George A.F. Seber, Alan J. Lee. Wiley series in probability and statistics. Wiley-Interscience, Hoboken, N.J, 2nd ed. edition, 2003. ISBN 1-280-58916-7. Sekar, R., Rybkin, O., Daniilidis, K., Abbeel, P., Hafner, D., and Pathak, D. Planning to explore via self-supervised world models. In International Conference on Machine Learning, pp. 8583 8592. PMLR, 2020. Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033. IEEE, 2012. Torabi, F., Warnell, G., and Stone, P. Generative adversarial imitation from observation. ar Xiv preprint ar Xiv:1807.06158, 2018. Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017. Yan, W., Zhang, Y., Abbeel, P., and Srinivas, A. Videogpt: Video generation using vq-vae and transformers, 2021. Learning Control by Iterative Inversion A.1. 
Proof of Equation 4 Throughout this and the rest of the theoretical proofs, with a slight abuse of notation, when a vector u RN is added to a matrix A RM N, the addition is row-wise: A + u A + 1u where 1u = (u, . . . , u)T RM N. We remind the reader of the notations defined in Section 2. Denote Xn (xn 1, . . . , xn M)T RM dim(X), F(Xn) (F(xn 1), . . . , F(xn M))T RM dim(Y) as the input and output matrices, Xn PM i=1 xn i /M Rdim(X) and Y PM i=1 yi/M, F(Xn) PM i=1 F(xn i )/M Rdim(Y) as the current inputs, desired outputs, and current outputs means, and ( ) the Moore-Penrose pseudoinverse operator. We also define Y = (y1, . . . , y M)T RM dim(Y). The approximated linear function is GΘ,b(Y ) = Y Θ + b where Θ Rdim(Y) dim(X) and b R1 dim(X) (note that we explicitly add the bias parameter b). At iteration n: Θn+1, bn+1 = arg min Θ,b F(Xn)Θ + b Xn 2. This is an ordinary linear least squares problem with the solution (Seber, 2003, Section 3.11.1): Θn+1 = (F(Xn) F(Xn)) (Xn Xn), bn+1 = Xn F(Xn)Θn+1. (6) Then, Xn+1 = GΘn+1,bn+1(Y ) = Y Θn+1 + bn+1 = Xn + (Y F(Xn))Θn+1, and averaging over the points yields the result: Xn+1 = Xn + (Y F(Xn))Θn+1. A.2. Proof of Theorem 1 Using the notation defined in Appendix A.1. Assuming F(X) = XF + h is a linear function with F Rdim(X) dim(Y) and h R1 dim(Y). Assuming rank(F(X0) F(X0)) = dim(Y) then (F(X0) F(X0))T (F(X0) F(X0)) is invertible and Θ1 defined on Equation 6 is well defined. Then using the fact that F(Xn) F(Xn) = (Xn Xn)F: Θ1 = (F(X0) F(X0)) (X0 X0) = (X0 X0)F (X0 X0), and satisfies where I is the identity matrix. The bias term, according to Equation 6, is b1 = X0 F(Xn)Θ1 = X0 (X0F + h)Θ1. At the end of iteration 1, X1 = Y Θ1 + b1 and its matching outputs equal to the desired outputs: F(X1) = X1F + h = Y Θ1F + b1F + h = Y + (X0 (X0F + h)Θ1)F + h = Y + (X0F X0F h) + h Learning Control by Iterative Inversion A.3. Proof of Theorem 2 For clarity in our presentation, we will use the following notation: J 1 n J 1(Xn), Jn J(Xn), Fn F(Xn) and Fn F(Xn). We also define Hn Fn Y and Hn Fn Y First we show that J 1 n is non-singular. Since δ < 1 ζβ then ρ(Jn n J 1 n ) Jn n J 1 n δζβ < 1 where ρ(A) denotes the spectral radius of A. The first inequality is proven in Horn & Johnson (2012, Thm. 5.6.9) and the second inequality is a result of the sub-multiplicative property of the induced norm, Jn n J 1 n Jn n J 1 n . Therefore (I +Jn n J 1 n ) is non-singular, and J 1 n = J 1 n (I + Jn n J 1 n ) is non-singular as a multiplication of non-singular matrices. We denote by Jn J 1 n 1 its matrix inverse, and obtain the following bounds: Hn Hn = Fn Fn λ (7) J 1 n = (I + n)J 1 n ( ) J 1 n (1 + n ) β(1 + δ) (8) Jn Jn ( ) Jn 2 n J 1 n 1 Jn n J 1 n Jn 2 n J 1 n 1 Jn n J 1 n ζ2δβ 1 ζδβ µ (9) Jn Jn Jn + Jn µ + ζ (10) Fn Y = Hn = (Xn+1 Xn) Jn Jn Xn+1 Xn (µ + ζ) Xn+1 Xn (11) Inequality ( ) is due to the sub-multiplicative and sub-additive properties of the induced norm, (I + n)J 1 n J 1 n ( I + n ) with I = 1. Inequality ( ) is developed in Horn & Johnson (2012, p. 381), in the context of bounding the error in the inverse of an error-perturbed matrix. Also note that the rest of the inequalities in (9) are well defined since δ < 1/ζβ. The proof now continues similarly to the proof of Ortega & Rheinboldt (2000, 12.3.3). 
We set Gn = Xn Hn J 1 n = Xn+1, and show that Gn is an Iterated Contraction: Xn+2 Xn+1 = Xn+1 Hn+1 J 1 n+1 Xn+1 = Hn+1 J 1 n+1 (1) β(1 + δ) Hn+1 β(1 + δ) Hn+1 Hn (Xn+1 Xn) Jn (2) β(1 + δ) Hn+1 Hn (Xn+1 Xn)Jn + β(1 + δ) Jn Jn Xn+1 Xn (3) β(1 + δ) 2λ + Hn+1 Hn (Xn+1 Xn)Jn + β(1 + δ)µ Xn+1 Xn = β(1 + δ) 2λ + F(Xn+1) F(Xn) (Xn+1 Xn)J(Xn) + µ Xn+1 Xn (4) β(1 + δ) 2λ + γ Xn+1 Xn + µ Xn+1 Xn β(1 + δ) 2λ Xn+1 Xn + γ + µ Xn+1 Xn (5) β(1 + δ) 2λ(µ + ζ) Fn Y + γ + µ Xn+1 Xn where inequality (1) holds because of Bound 8, (2) is the triangle inequality, (3) is due to the Bounds 7 and 9 and the triangle inequality. Inequality (4) is proven in Ortega & Rheinboldt (2000, 3.2.12), using the assumption x1, x2 : J(x1) J(x2) γ, and inequality (5) is from Bound 11. Denote the function g : R R, g(x) β(1 + δ) (2λ(µ + ζ)/x + γ + µ). Then Xn+2 Xn+1 g( Fn Y ) Xn+1 Xn Learning Control by Iterative Inversion Assuming β(1 + δ)(γ + µ) < 1: g( Fn Y ) = 1 Fn Y = 2λβ(1 + δ)(µ + ζ) 1 β(1 + δ)(µ + γ) ρ g is strictly-decreasing function, thus if Fn Y ρ + ϵ for some ϵ > 0 then g( Fn Y ) α < 1, where α is independent of Fn Y . Then, as long as Fn 1 Y ρ + ϵ: Fn Y (µ + ζ) Xn+1 Xn (µ + ζ)αn X1 X0 where the first inequality holds due to 11. Then, for every ϵ > 0 there exists k < such that one of the following holds: 1. There exists n < k where Fn Y < ρ + ϵ and the proof is done. 2. For all n < k: Fn Y ρ + ϵ and αk ρ+ϵ (µ+ζ) X1 X0 . Then Fk Y (µ + ζ)αk X1 X0 ρ + ϵ and the proof is done. A.4. Convergence Results for 1-Dimensional F We restrict ourselves to the 1-dimensional case, where X = Y = R, and assume the function F is strictly monotone and its maximum and minimum slopes are not too different, thus the function is "close to" linear. We then show convergence at a linear rate. Let SF(x1, x2) (F(x1) F(x2))/(x1 x2) denote the slope of F between x1 and x2, and max |SF| maxx1,x2 X |SF(x1, x2)| denote the maximum absolute slope of F and similarly min |SF| minx1,x2 X |SF(x1, x2)| the minimum absolute slope. Assumption 5. F is continuous and strictly monotone, and max |SF| min |SF| 2 ϵ for some 0 < ϵ 1. Theorem 3. Assume X = Y = R, that Assumption 5 holds, and that there are only two desired outputs M = 2. Then for any i {1, 2} and any iteration n: |F(xn+1 i ) yi| (1 ϵ)|F(xn i ) yi|. When the number of desired outputs is greater than 2, then convergence for each output is generally not guaranteed. Theorem 4. Assume X = Y = R, that Assumption 5 holds and that at iteration n, i xn i < F 1(Y ) or i xn i > F 1(Y ) . Then Xn+1 F 1(Y ) (1 ϵ) Xn F 1(Y ) . Theorem 4 guarantees that after a finite number of iterations, the output segment intersects with the desired output segment. Note that Theorems 3 and 4 do not require any kind of approximations as in Assumption 4, nor for F to be differentiable. A.4.1. PROOF OF THEOREM 3 Denote SF max maxx1,x2 SF(x1, x2) and similarly SF min minx1,x2 SF(x1, x2). Assuming X = Y = R. Then the approximated linear function is Gθ,b(y) = yθ + b where θ, b R are scalars. At iteration n + 1 and for i [1, M]: xn+1 i = Gθn+1,bn+1(yi) = yiθn+1 + bn+1, (12) θn+1, bn+1 = arg min θ,b i=1 (θF(xn i ) + b xn i )2 . Lemma 5. if X = Y = R then n: 1 SF max θn+1 1 SF min if F is strictly increasing and 1 SF min θn+1 1 SF max if F is strictly decreasing. Learning Control by Iterative Inversion Proof. We will prove for strictly increasing F. The proof for strictly decreasing F is symmetrical. Without loss of generality, we assume that Xn is sorted: i: xn i xn i+1. 
Let k > i then: F(xn i ) + SF min(xn k xn i ) F(xn k) F(xn i ) + SF max(xn k xn i ), 1 SF max (F(xn k) F(xn i )) xn k xn i 1 SF min (F(xn k) F(xn i )), 1 M PM i=1(xn i Xn) F(xn i ) F(Xn) 1 M PM i=1 F(xn i ) F(Xn) 2 = 1 M 2 PM 1 i=1 PM k=i+1(xn k xn i ) (F(xn k) F(xn i )) 1 M 2 PM 1 i=1 PM k=i+1 (F(xn k) F(xn i ))2 1 M 2 PM 1 i=1 PM k=i+1 1 SF min (F(xn k) F(xn i ))2 1 M 2 PM 1 i=1 PM k=i+1 (F(xn k) F(xn i ))2 = 1 SF min , 1 M 2 PM 1 i=1 PM k=i+1 1 SF max (F(xn k) F(xn i ))2 1 M 2 PM 1 i=1 PM k=i+1 (F(xn k) F(xn i ))2 = 1 SF max . When M = 2, the regression line passes exactly at the points (F(xn 1), xn 1) and (F(xn 2), xn 2), and bn+1 also takes the following forms: bn+1 = xn 1 θn+1F(xn 1) = xn 2 θn+1F(xn 2). Then, plugging bn+1 in Equation 12 we get for every i [1, 2]: xn+1 i = xn i + θn+1(yi F(xn i )). Denote the slope of F between xn+1 i and xn i : SF(xn+1 i , xn i ) F(xn+1 i ) F(xn i ) xn+1 i xn i = F(xn+1 i ) F(xn i ) θn+1(yi F(xn i )). Then the following equations hold: F(xn+1 i ) = F(xn i ) + θn+1SF(xn+1 i , xn i )(yi f(xn i )), yi F(xn+1 i ) = 1 θn+1SF(xn+1 i , xn i ) (yi F(xn i )). (13) Using Lemma 5, and since F is always increasing or always decreasing, then θn+1SF(xn+1 i , xn i )) > 0 and 1 2 ϵ min |SF| max |SF| θn+1SF(xn+1 i , xn i )) max |SF| min |SF| 2 ϵ, 1 θn+1SF(xn+1 i , xn i )) max |1 1 2 ϵ|, |1 ϵ| = 1 ϵ. (14) Then, plugging into Equation 13, yi F(xn+1 i ) = 1 θn+1SF(xn+1 i , xn i ) |yi F(xn i )| (1 ϵ) |yi F(xn i )| . Note the convergence in one iteration for the linear case when ϵ = 1. Learning Control by Iterative Inversion A.4.2. PROOF OF THEOREM 4 Ln Y F(Xn) F 1(Y ) Xn = 1 M PN i=1 Y f(xn i ) 1 M PN k=1 F 1(Y ) xn k = F 1(Y ) xn i PM k=1 F 1(Y ) xn k Y f(xn i ) F 1(Y ) xn i Y f(xn i ) F 1(Y ) xn i = i=1 wn,i SF(F 1(Y ), xn i ). Where wn,i = F 1(Y ) xn i PM k=1 F 1(Y ) xn k , PM i=1 wn,i = 1 and, since we assumed i xn i < F 1(Y ) or that i xn i > F 1(Y ), then i wn,i > 0. Therefore Ln is a weighted-mean of the slopes and SF min Ln SF max. From Equation 4 the following holds: Xn+1 Xn = θn+1(Y F(Xn)) = θn+1Lj(F 1(Y ) Xn), F 1(Y ) Xn+1 = (1 θn+1Lj)(F 1(Y ) Xn). (15) Using Lemma 5 and the inequalities SF min Ln SF max, Inequality (14) from Appendix A.4.1 also applies for Ln, and we obtain: |1 θn+1Ln| 1 ϵ, F 1(Y ) Xn+1 (1 ϵ) F 1(Y ) Xn . A.5. Tightness of the Derivative Ratio Bound for 1-Dimensional F We consider the case where F is a 1-dimensional function and provide a simple negative example to demonstrate the tightness of the derivative ratio bound, max |SF| min |SF| < 2. As described in Example 1, in this case, the second approximation in Assumption 4 is perfect, and using the bounds in Assumption 3, the condition for convergence in Theorem 2 is equivalent to max |SF| min |SF| < 2. We assume this condition is not satisfied, i.e., that the maximum slope is more than twice the minimum slope, and show that convergence does not occur, and that the first initial input guess is closer to the desired inputs than all the following iterations. Example 2. Let ϵ, a, b, δ, > 0 and define the continuous 1-dimensional, increasing and piece-wise linear function F with 5 linear segments: x x a a + (2 + ϵ)(x a) a < x a + (1 + ϵ) + x a + < x a + + b a + b + (2 + ϵ)(x a b) a + b + < x a + b + 2 2(1 + ϵ) + x a + b + 2 < x Notice that the minimum slope of F is 1 and the maximum is (2 + ϵ). We set b/2+3δ ϵ . Let there be two desired inputs x 1 = a + + b/2 δ, x 2 = a + + b/2 + δ, and their outputs y 1 = F(x 1) = a + b/2 δ + (2 + ϵ) = a + b + 2 + 2δ, y 2 = F(x 2) = a + b + 2 + 4δ. 
B. Experimental Details

Table 3 contains the common hyperparameter values used for all the experiments. Table 4 contains Particle- and Reacher-v2-specific hyperparameters, and Table 5 lists Hopper-v2-specific hyperparameters. We note that the minor differences in hyperparameter values between the evaluated domains are intended only to achieve slightly better MSE results per domain; we observed that the steering behavior was relatively robust to the hyperparameter values.

Table 3: Common hyperparameters for all experiments

Hyperparameter                              Value
Learning rate                               5e-4
Sampled rollouts per epoch (N)              200 x minibatch size
Training iterations                         2,000
Training buffer size [rollouts] (K x N)     40 x N
Steering buffer Dsteer size [rollouts]      500
Ratio of steering intents in minibatch (α)  0.3
Gradient norm clipping                      0.5
GPT: # layers                               8
GPT: # heads                                4
GPT: hidden layer size                      64
GPT: dropout                                0.2
GPT: attention dropout                      0.3

Table 4: Particle & Reacher-v2 hyperparameters

Hyperparameter             Value
Minibatch size (rollouts)  8
Noise scale (η)            4.0
Total epochs (n)           160

Table 5: MuJoCo Hopper-v2 hyperparameters

Hyperparameter             Value
Minibatch size (rollouts)  6
Noise scale (η)            1.0
Total epochs (n)           130

B.1. Particle Robot

The 2D plane in which the robot is allowed to move is a finite square, with the maximum coordinates (denoted Cmax) increasing for longer horizons. When rendering the videos we include the entire 2D plane, up to the maximum coordinates. When evaluating policies, a validation set of 2,000 trajectories was used, which were unseen during training of the policies.

B.1.1. DATASETS

Splines: Trajectories follow a B-spline curve (https://en.wikipedia.org/wiki/B-spline). The curves are of degree 2 with 5 control points, which are uniformly sampled in [0, Cmax] in each of the two dimensions.

Deceleration: Random Fx and Fy forces are applied for the first tacc trajectory steps, followed by T - tacc steps of deceleration, where T is the time horizon. Deceleration at step j > tacc is done by setting $F^j_x = -\frac{1}{2}\frac{V^{j-1}_x}{\Delta t}$, $F^j_y = -\frac{1}{2}\frac{V^{j-1}_y}{\Delta t}$ (assuming the mass of the particle is 1). V and Δt are defined in Section 5. We use Δt = 0.1.
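For illustration, a minimal sketch of how a Deceleration trajectory could be generated is given below. The unit mass, the simple Euler integration, and the values of T, t_acc and the random-force range are assumptions made here for concreteness; they are not taken from the paper's environment code.

```python
import numpy as np

def deceleration_trajectory(T=64, t_acc=16, dt=0.1, f_scale=4.0, seed=0):
    """Random forces for the first t_acc steps, then F = -0.5 * V / dt (unit mass, Euler step)."""
    rng = np.random.default_rng(seed)
    pos, vel = np.zeros(2), np.zeros(2)
    positions = [pos.copy()]
    for j in range(T):
        if j < t_acc:
            force = rng.uniform(-f_scale, f_scale, size=2)  # random Fx, Fy (placeholder range)
        else:
            force = -0.5 * vel / dt                         # deceleration rule from Appendix B.1.1
        vel = vel + force * dt                              # unit mass: acceleration equals force
        pos = pos + vel * dt
        positions.append(pos.copy())
    return np.array(positions)

traj = deceleration_trajectory()
```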
B.2. Reacher-v2

B.2.1. DATASETS

Fixed Joint: Trajectories were collected to represent a scenario where one of the two robot arm joints is malfunctioning and is forcibly fixed in place; the policy can only control the other joint. When evaluating policies, a validation set of 2,000 trajectories was used, which were unseen during training of the policies.

B.3. Hopper-v2

B.3.1. DATASETS

Hopping: The datasets of 2,180 trajectories used for sequence lengths 64 and 128 were extracted from D4RL's hopper-medium-v2, and consist of mostly forward hopping behaviors. When evaluating policies, a validation set of 436 trajectories was used, which were unseen during training of the policies. Unlike in the other evaluated domains, where trajectories sampled from a random policy were used to train the VQ-VAE, in Hopper-v2 we used input videos from D4RL's hopper-medium-v2; the reason is that with the initial random policy, the trajectories terminated (the hopper fell down) before reaching the desired T. For IT-IN training, we modified Hopper-v2 slightly so that the episode does not terminate when the hopper falls, thus allowing it to reach T steps.

B.4. GPT-Based Architecture

The model is conditioned on the intent via cross-attention. The actor network consists of 2 hidden Linear layers of size 64, with tanh activations. The GPT model size hyperparameters had an effect on the results, but the results did not strictly improve with a bigger model. We experimented with several settings for the number of heads (1, 2, 4) and the number of layers (2, 4, 8) in the GPT model, and for these parameters the highest values indeed gave better results. However, for the hidden dimension of the model, we tried 2 values (64, 128) and found that the lower one gave better results. As for the context size, we only tried setting it to the entire trajectory length.
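The paper does not provide network code; the following is a minimal PyTorch-style sketch of one way to realize the conditioning described above: a small GPT-like transformer over the state history whose blocks cross-attend to the intent tokens, followed by the 2-layer tanh actor head of width 64. The module names, the intent token shape, and the residual/normalization layout are placeholders of ours, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Causal self-attention over the state history plus cross-attention to the intent tokens."""
    def __init__(self, d_model=64, n_heads=4, dropout=0.2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(d_model), nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h, intent, causal_mask):
        x = self.ln1(h)
        h = h + self.self_attn(x, x, x, attn_mask=causal_mask, need_weights=False)[0]
        x = self.ln2(h)
        h = h + self.cross_attn(x, intent, intent, need_weights=False)[0]
        return h + self.mlp(self.ln3(h))

class IntentConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, intent_dim, d_model=64, n_layers=8, n_heads=4, context=64):
        super().__init__()
        self.obs_embed = nn.Linear(obs_dim, d_model)
        self.intent_embed = nn.Linear(intent_dim, d_model)
        self.pos_embed = nn.Embedding(context, d_model)
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_layers)])
        # 2 hidden Linear layers of size 64 with tanh activations, as described in Appendix B.4.
        self.actor = nn.Sequential(nn.Linear(d_model, 64), nn.Tanh(),
                                   nn.Linear(64, 64), nn.Tanh(),
                                   nn.Linear(64, act_dim))

    def forward(self, obs_seq, intent_tokens):
        # obs_seq: (batch, t, obs_dim); intent_tokens: (batch, n_tokens, intent_dim)
        B, T, _ = obs_seq.shape
        h = self.obs_embed(obs_seq) + self.pos_embed(torch.arange(T, device=obs_seq.device))
        intent = self.intent_embed(intent_tokens)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=obs_seq.device), diagonal=1)
        for block in self.blocks:
            h = block(h, intent, mask)
        return self.actor(h)   # (batch, t, act_dim): an action for each history prefix

policy = IntentConditionedPolicy(obs_dim=4, act_dim=2, intent_dim=64, context=16)
actions = policy(torch.randn(2, 16, 4), torch.randn(2, 8, 64))
```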
B.5. GRU-Based Architecture

The single-layer GRU's hidden state size is set to match the flattened intent size of 4096. As with the GPT-based architecture, the actor network consists of 2 hidden layers of size 64 with tanh activations.

B.6. RL Baseline

For the RL baseline, both the actor and critic networks are 2-layer MLPs (of size 64) with tanh activations. Table 6 summarizes the hyperparameters used for training RL policies with PPO (Schulman et al., 2017) and a GRU policy.

Table 6: RL hyperparameters

Hyperparameter                          Value
PPO clip ratio                          0.2
GAE λ                                   0.95
Discount rate γ                         0.99
Learning rate                           1e-4
Value loss coefficient                  0.5
# epochs                                4
# rollouts sampled per policy update    128
Total iterations                        5000

B.6.1. MLP-BASED ARCHITECTURE FOR RL BASELINE

For the RL baseline only, we also experimented with an MLP-based architecture. In this case, both the intent and the observation go through a single Linear layer followed by Layer Normalization and a ReLU activation. The output sizes of the Linear layers are 256 and 64 for the intents and observations, respectively. All other hyperparameters are identical to those shown in Table 6.

B.7. GAIL Baseline

Our GAIL-from-observations (Torabi et al., 2018) implementation is based on Kostrikov (2018), and uses PPO as the RL algorithm. We note that Kostrikov (2018) implements "vanilla" GAIL (Ho & Ermon, 2016), which uses state + action pairs as input to the discriminator. Therefore, to match the setup of Torabi et al. (2018), we modified the implementation so that the discriminator is fed state + next-state pairs. To add the intent as context to GAIL, it is concatenated to the state transition pair. Due to the large discrepancy in size between the intent (size 4096) and the state transition pair (size 8 in the case of Particle:Splines), before concatenating the intent to the state we downscale it to size 256 using a Linear layer. In addition to the downscaled intent, we also concatenate the timestep of the transition (normalized to [-1, 1]) to the discriminator input. We found that this improved GAIL performance in our experiments on Particle:Splines. The discriminator itself is a 3-layer MLP with a hidden dimension of 100 and tanh activations. GAIL-specific hyperparameters are provided in Table 7; the PPO hyperparameters used in the GAIL experiments are the same as in Table 6.

Table 7: GAIL hyperparameters

Hyperparameter                 Value
Discriminator batch size       128
Discriminator learning rate    1e-3
Gradient penalty λ             10

We experimented with two reward modes involving GAIL: (1) GAIL, the standard formulation, where the reward to the RL algorithm is the log of the discriminator output; and (2) GAIL+STATE-MSE, where we combined (by simple addition) the standard GAIL reward and the STATE-MSE reward defined in Section 5, in an attempt to see whether a combination of the two reward signals (from the environment and from the discriminator) results in an improved overall signal. As can be seen in Figure 6, while (2) indeed improved over (1), neither was able to outperform PPO with STATE-MSE or IT-IN.
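As a concrete illustration of the discriminator input described above, here is a minimal PyTorch-style sketch (class and variable names are ours, not from the released code): the 4096-dimensional intent is downscaled to 256 with a Linear layer and concatenated with the (state, next-state) pair and the normalized timestep, before being fed to a 3-layer tanh MLP with hidden size 100.

```python
import torch
import torch.nn as nn

class IntentConditionedDiscriminator(nn.Module):
    def __init__(self, state_dim=4, intent_dim=4096, intent_proj_dim=256, hidden=100):
        super().__init__()
        self.intent_proj = nn.Linear(intent_dim, intent_proj_dim)      # downscale the intent
        in_dim = 2 * state_dim + intent_proj_dim + 1                   # (s, s') pair + intent + timestep
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, next_state, intent, t, horizon):
        t_norm = 2.0 * t / (horizon - 1) - 1.0                         # normalize the timestep to [-1, 1]
        x = torch.cat([state, next_state, self.intent_proj(intent), t_norm.unsqueeze(-1)], dim=-1)
        return self.mlp(x)                                             # discriminator logit

disc = IntentConditionedDiscriminator()
logit = disc(torch.randn(32, 4), torch.randn(32, 4), torch.randn(32, 4096),
             torch.randint(0, 16, (32,)).float(), horizon=16)
```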
C. Additional Experimental Results

C.1. RL and IRL Baselines

In Figure 6 we present additional results comparing IT-IN to RL and IRL baselines. As described in Section 5, INTENT-MSE is a sparse RL reward, calculated as the MSE between the intents of the desired trajectory and the executed trajectory, given at the end of the episode. In our experiments, INTENT-MSE does not come close to IT-IN, which we attribute to the more difficult learning from a sparse reward. We found that the MLP-based RL policy (see Appendix B.6.1) underperformed the GRU-based RL policy (though not too significantly). This can be attributed to the task being that of tracking a trajectory (specified by the intent), which can be easier with access to the trajectory performed by the agent so far. Finally, Figure 6 shows the results for GAIL+STATE-MSE, the additional reward mode involving GAIL discussed in Appendix B.7. This reward mode improves upon the standard GAIL reward, but does not perform better than either STATE-MSE or IT-IN. In Figure 7 we present example training curves comparing IT-IN and PPO with the STATE-MSE reward.

Figure 6: Comparison of IT-IN with PPO and GAIL baselines. This figure includes the results shown in Figure 4 and additional results discussed in Appendix C.1. All experiments are on the Particle:Splines environment with T = 16. All results are MSE (lower is better), each represented with a mean and standard deviation of 3 random seeds. Note that IT-IN outperforms the baselines on test trajectories (graph on the right) for all Dsteer sizes. For GAIL, Dsteer is also used as the expert data for the discriminator.

Figure 7: Comparison of test MSE convergence of IT-IN vs. PPO with the STATE-MSE reward. All runs are in the Particle:Splines environment, with |Dsteer| = 500 and horizon T = 16. The MSE is calculated on a held-out set of 500 trajectories.

Table 8: Evaluation of policies trained with and without exploration. We show the average MSE of 3 policies; since the domains differ, MSEs are comparable only within each row.

                                             Exploration noise η    No exploration
Particle:Splines, T = 64, |Dsteer| = 500     69.2                   454.7
Hopper-v2, T = 128, |Dsteer| = 1740          483.3                  920.6

Table 9: Evaluation of IT-IN with a GRU policy for varying steering dataset sizes. T = 16. Note that |Dsteer| = 0 represents the case where no steering is used at all; in this case, we use trajectories sampled from a random policy to initialize Dprev (see Algorithm 2). Note: since we do not normalize the MSE w.r.t. T, these results have a different scale than Table 2.

                           |Dsteer| = 0   |Dsteer| = 10   |Dsteer| = 50   |Dsteer| = 100   |Dsteer| = 500   |Dsteer| = 5000
Particle:Splines           5.48           5.18            3.86            3.61             3.02             2.89
Particle:Deceleration      0.85           0.89            0.75            0.73             0.67             0.71
Reacher-v2:Fixed Joint     2.49           2.05            1.68            1.64             1.58             1.61

C.2. Particle:Splines - Effect of Trajectory Length

We tested IT-IN on multiple horizons T in the Splines domain, and found it to work well across horizons of 32, 64 and 128. We present sample visualizations with different T values in Figure 8 (showing the final reconstructed trajectories) and in Figure 9 (showing trajectory progression during an episode).

C.3. Particle:Splines - Effect of Steering Dataset Size

In Figure 10 we present trajectory visualizations showcasing the effect of the size of the steering buffer Dsteer (cf. Table 2).

C.4. Particle:Deceleration - Effect of Steering Dataset Size

Similarly to Section C.3, in Figure 11 we showcase the effect of the size of the steering buffer Dsteer (cf. Table 2) in the Particle:Deceleration domain.

C.5. Reacher-v2

We present sample reconstruction visualizations for random-action trajectories from Reacher-v2 on 16-step sequences in Figure 12. Sample videos for 64-step Fixed Joint sequences (trained with a GPT-based policy) can be found on the project's website: https://sites.google.com/view/iter-inver.

C.6. Hopper-v2

We show additional examples of rollouts for the Hopper-v2 domain on 128-step sequences in Figure 13.

C.7. Exploration

Table 8 shows the Splines and Hopper-v2 reconstruction MSEs when training with and without the exploration noise η.

C.8. Steering Cross-Evaluation

In Figure 14 we show example rollouts from the experiments on steering cross-evaluation (cf. Table 1).

C.9. GRU-Based Policy Experiments

In Table 9 we report results with a GRU-based policy analogous to those shown in Table 2 (analyzing the effect of the steering dataset size), and in Table 10 results analogous to Table 1 (steering cross-evaluation).

Table 10: Steering cross-evaluation for a GRU policy. Horizon T = 16. In all cases |Dsteer| = 500.

Test Dataset    Steering Dataset    MSE
Splines         Splines             2.79
Splines         Deceleration        5.4
Deceleration    Splines             1.46
Deceleration    Deceleration        0.72
C.10. Non-Deterministic Dynamics

In this section we discuss the effect of non-deterministic dynamics on the performance of our method. We consider two sources of non-determinism: non-deterministic starting states and non-deterministic transitions.

For non-deterministic starting states, we can look at our results in the Hopper-v2 environment as an example (see Table 2 and Figures 3 and 13). In this environment the initial state is randomized, albeit from a rather limited distribution (uniform noise of scale 5e-3 is added to the initial position and velocity), and this does not pose a problem for our method. That said, for a very diverse initial-state distribution, it is not clear how the policy could recover the trajectory specified by an intent with a very different starting state. This limitation is therefore inherent to the way we defined the problem (trajectory tracking).

As for non-deterministic transitions, we evaluated a variant of the Particle:Splines environment with noisy transitions: zero-mean Gaussian noise with standard deviation σ is added to the action (see Figure 15a for examples of trajectories with different noise scales added). Figure 15b shows that for moderate σ our method still works well, while for σ ≥ 5 performance starts to deteriorate considerably. Thus, we find empirically that our method can handle some stochasticity in the transitions and in the starting state. We defer a more complete characterization, including the non-trivial theoretical analysis, to future work.
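A minimal sketch of the action perturbation used in this experiment is given below, written as a generic Gym-style action wrapper; the wrapper class and the clipping to the action-space bounds are our illustration, not the paper's code.

```python
import numpy as np
import gym

class GaussianActionNoiseWrapper(gym.ActionWrapper):
    """Adds zero-mean Gaussian noise with standard deviation sigma to every action."""
    def __init__(self, env, sigma=1.0, seed=0):
        super().__init__(env)
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def action(self, action):
        noisy = action + self.rng.normal(0.0, self.sigma, size=np.shape(action))
        # Keep the perturbed action inside the environment's action space (an assumption made here).
        return np.clip(noisy, self.env.action_space.low, self.env.action_space.high)
```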
Figure 8: Example results on the Splines dataset, for different sequence lengths. In all cases shown here |Dsteer| = 500. To the left of each row we state the average MSE on an evaluation set for 3 policies trained with different seeds. Note the increasing scale of the plots as the sequence length increases. Also note that all trajectories start at (0, 0), marked by the blue circle in each plot.

Figure 9: Visualization of trajectory progression in the Splines domain for different horizons T. Note the increasing scale of the plots as the sequence length increases. |Dsteer| = 500.

Figure 10: Example results on the Particle:Splines dataset for policies trained with different sizes of Dsteer. Each row corresponds to a different size. Each column corresponds to a specific reference trajectory from the dataset, the intent of which was used as input to the policies. T = 64 was used in all experiments.

Figure 11: Example results on the Particle:Deceleration dataset for policies trained with different sizes of Dsteer. T = 64 was used in all experiments. The figure structure is the same as in Figure 10.

Figure 12: Examples of trajectory reconstructions in the Reacher-v2 domain. In each plot, the red row is the reference trajectory and the blue row is the policy reconstruction. These are based on a GRU policy. For ease of viewing, we modified the dark colors of the original rendered images.

Figure 13: Examples of trajectory reconstructions in the Hopper-v2 domain, with T = 128 and |Dsteer| = 500.

Figure 14: Examples comparing how policies trained with steering intents from either Particle:Splines or Particle:Deceleration perform when tested on trajectories from either dataset: (a) testing on trajectories from Particle:Splines; (b) testing on trajectories from Particle:Deceleration. We can see that when a policy is trained with steering intents from one dataset, it performs well on that dataset and poorly on the other. In each column the reference trajectory is the same.

Figure 15: Analysis of non-deterministic dynamics (see Appendix C.10). (a) The Particle environment (T = 32) dynamics are altered to include zero-mean Gaussian noise with standard deviation σ added to the action. This panel shows how a fixed-acceleration trajectory [action = (1, 1) for 32 consecutive steps] is affected by different noise scales. (b) Test MSE on a test set of 500 trajectories not included in the steering dataset, in the Particle:Splines environment with noise added to the actions. For all runs T = 32 and |Dsteer| = 500. The x-axis is the number of training iterations. Each σ value was evaluated on 3 seeds. For moderate noise scales (σ = 1 and σ = 2.5) our method still works well (MSE = 18.5 and 20.7, respectively, vs. MSE = 18.0 with no noise added; see the top row in Figure 8), while for larger noise scales (σ = 5.0 and σ = 10.0) the performance starts to deteriorate considerably.