# TaSIL: Taylor Series Imitation Learning

Daniel Pfrommer, Massachusetts Institute of Technology, Cambridge, MA, dpfrom@mit.edu
Thomas T.C.K. Zhang, University of Pennsylvania, Philadelphia, PA, ttz2@seas.upenn.edu
Stephen Tu, Robotics at Google, New York, NY, stephentu@google.com
Nikolai Matni, University of Pennsylvania, Philadelphia, PA, nmatni@seas.upenn.edu

Both authors contributed equally to this work. This work was done while affiliated with the University of Pennsylvania.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

**Abstract.** We propose Taylor Series Imitation Learning (TaSIL), a simple augmentation to standard behavior cloning losses in the context of continuous control. TaSIL penalizes deviations in the higher-order Taylor series terms between the learned and expert policies. We show that experts satisfying a notion of incremental input-to-state stability are easy to learn, in the sense that a small TaSIL-augmented imitation loss over expert trajectories guarantees a small imitation loss over trajectories generated by the learned policy. We provide generalization bounds for TaSIL that scale as $O(1/n)$ in the realizable setting, for $n$ the number of expert demonstrations. Finally, we demonstrate experimentally the relationship between the robustness of the expert policy and the order of Taylor expansion required in TaSIL, and compare standard Behavior Cloning, DART, and DAgger with TaSIL-loss-augmented variants. In all cases, we show significant improvement over baselines across a variety of MuJoCo tasks.

## 1 Introduction

Imitation learning (IL), wherein expert demonstrations are used to train a policy [1, 2], has been successfully applied to a wide range of tasks, including self-driving cars [3, 4], robotics [5], and video game playing [6]. While IL is typically more sample-efficient than reinforcement-learning-based alternatives, it is also known to be sensitive to distribution shift: small errors in the learned policy can lead to compounding errors and, ultimately, system failure [3, 6].

In order to mitigate the effects of distribution shift caused by policy error, two broad approaches have been taken by the community. On-policy approaches such as DAgger [6] augment the dataset with expert-labeled or corrected trajectories generated by learned policies. In contrast, off-policy approaches such as DART [7] and GAIL [8] augment the dataset by perturbing the expert-controlled system. In both cases, the goal is to provide examples to the learned policy of how the expert recovers from errors. While effective, these methods either require an interactive expert (on-policy) or access to a simulator for policy rollouts during training (off-policy), which may not always be practically feasible.

In this work, we take a more direct approach towards mitigating distribution shift. Rather than providing examples of how the expert policy recovers from error, we seek to endow a learned policy with the robustness properties of the expert directly. In particular, we make explicit the underlying assumption in previous work that a good expert is able to recover from perturbations, through the notion of incremental input-to-state stability, a well-studied control-theoretic notion of robust nonlinear stability.
Under this assumption, we show that if the $p$-th order Taylor series approximations of the learned and expert policies approximately match on expert-generated trajectories, then the learned policy induces closed-loop behavior similar to that of the expert closed-loop system. Here the order $p$ is determined by the robustness properties of the expert, and makes quantitative the informal observation that more robust experts are easier to learn. Fundamentally, we seek to characterize settings where offline imitation learning is Probably Approximately Correct (PAC)-learnable [9]; that is, defining notions of expert data and designing algorithms such that low training-time error on offline expert demonstrations implies low test-time error along the learned policy's trajectories, in spite of distribution shift.

**Contributions.** We propose and analyze Taylor Series Imitation Learning (TaSIL), which augments the standard behavior cloning imitation loss to capture errors between the higher-order terms of the Taylor series expansions of the learned and expert policies. We reduce the analysis of TaSIL to the analysis of a supervised learning problem over expert data: in particular, we identify a robustness criterion for the expert such that a small TaSIL-augmented imitation loss over expert trajectories guarantees that the difference between trajectories generated by the learned and expert policies is also small. We also provide a finite-difference-based approximation that is applicable to experts whose higher-order derivatives cannot be queried directly, expanding the practical applicability of TaSIL. We show in the realizable setting that our algorithm achieves generalization bounds scaling as $\tilde{O}(1/n)$, for $n$ the number of expert demonstrations. Finally, we empirically demonstrate (i) that the relationship between the robustness of the expert policy and the order of Taylor expansion required in TaSIL predicted by our theory is observed in practice, and (ii) the benefits of using the TaSIL-augmented imitation loss by comparing the sample-efficiency of standard and TaSIL-loss-augmented Behavior Cloning, DART, and DAgger on a variety of MuJoCo tasks: on hard instances where behavior cloning fails, the TaSIL-augmented variants show significant performance gains.

### 1.1 Related work

**Imitation learning.** Behavior cloning is known to be sensitive to compounding errors induced by small mismatches between the learned and expert policies [3, 6]. On-policy [6] and off-policy [7, 8] approaches exist that seek to prevent this distribution shift by augmenting the dataset created by the expert. While DAgger is known to enjoy $O(T)$ sample-complexity in the task horizon $T$ for loss functions that are strongly convex in the policy parameters (this bound degrades to $O(T^2)$ when the loss function is only convex, and does not hold for the nonconvex loss functions we consider), we are not aware of finite-data guarantees for DART or GAIL. In addition to these seminal papers, there is a body of work that seeks to leverage control-theoretic techniques [10, 11] to ensure (robust) stability of the learned policy.

More closely related to our work are the results by Ren et al. [12] and Tu et al. [13]. In Ren et al. [12], a two-stage pipeline of imitation learning followed by policy optimization via a PAC-Bayes regularization term is used to provide generalization bounds for the learned policy across randomly drawn environments. This work is mostly complementary to ours, as TaSIL could in principle be used to augment the imitation losses used in the first stage of their pipeline (leaving the second stage unmodified). In Tu et al.
[13], sample-complexity bounds for IL are provided under the assumption that the learned policy can be explicitly constrained to match the incremental stability properties of the expert. While conceptually appealing, practically enforcing such stability constraints is difficult, and as such Tu et al. [13] resort to heuristics in their implementation. In contrast, we provide sample-complexity guarantees for a practically implementable algorithm, and under much milder stability assumptions.

**Robust stability and learning for continuous control.** There is a rich body of work applying Lyapunov stability or contraction theory [14] to learning for continuous control. For example, [15–18] use stability-based regularizers to trim the hypothesis space, and empirically show that this leads to more sample-efficient and robust learning methods. Lyapunov stability and contraction theory have also been used to provide finite sample-complexity guarantees for adaptive nonlinear control [19] and learning stability certificates [20].

## 2 Problem formulation

We consider the nonlinear discrete-time dynamical system

$$x_{t+1} = f(x_t, u_t), \quad x_0 = \xi, \tag{1}$$

where $x_t \in \mathbb{R}^d$ is the system state, $u_t \in \mathbb{R}^m$ is the control input, and $f : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ defines the system dynamics. We study system (1) evolving under the control input $u_t = \pi(x_t) + \delta_t$, for $\pi : \mathbb{R}^d \to \mathbb{R}^m$ a suitable control policy, and $\delta_t \in \mathbb{R}^m$ an additive input perturbation (that will be used to capture policy errors). We define the corresponding perturbed closed-loop dynamics by $x_{t+1} = f^{\pi}_{\mathrm{cl}}(x_t, \delta_t) := f(x_t, \pi(x_t) + \delta_t)$, and use $x^{\pi}_t(\xi, \{\delta_s\}_{s=0}^{t-1})$ to denote the value of the state $x_t$ at time $t$ evolving under control input $u_t = \pi(x_t) + \delta_t$ starting from initial condition $x_0 = \xi$. To lighten notation, we use $x^{\pi}_t(\xi)$ to denote $x^{\pi}_t(\xi, \{0\}_{s=0}^{t-1})$, and overload $\|\cdot\|$ to denote the Euclidean norm for vectors and the operator norm for matrices and tensors. Initial conditions $\xi$ are assumed to be sampled randomly from a distribution $D$ with support restricted to a compact set $X$.

We assume access to $n$ rollouts of length $T$ from an expert $\pi_\star$, generated by drawing initial conditions $\{\xi_i\}_{i=1}^{n}$ i.i.d. from $D$. The IL task is to learn a policy $\hat{\pi}$ which leads to a closed-loop system with behavior similar to that induced by the expert policy $\pi_\star$, as measured by the expected imitation gap $\mathbb{E}_\xi \max_{1 \le t \le T} \|x^{\hat{\pi}}_t(\xi) - x^{\pi_\star}_t(\xi)\|$.

The baseline approach to IL, typically referred to as behavior cloning (BC), casts the problem as an instance of supervised learning. Denote the discrepancy between an evaluation policy $\pi$ and the expert policy $\pi_\star$ on a trajectory generated by a rollout policy $\pi_d$ starting at initial condition $\xi$ by $\Delta^{\pi_d}_t(\xi; \pi) := \pi(x^{\pi_d}_t(\xi)) - \pi_\star(x^{\pi_d}_t(\xi))$. BC directly solves the supervised empirical risk minimization (ERM) problem

$$\hat{\pi}_{\mathrm{bc}} \in \operatorname*{argmin}_{\pi \in \Pi}\ \sum_{i=1}^{n} h\big(\{\Delta^{\pi_\star}_t(\xi_i; \pi)\}_{t=0}^{T-1}\big),$$

over a suitable policy class $\Pi$. Here, $h : (\mathbb{R}^m)^T \to \mathbb{R}$ is a loss function which encourages the discrepancy terms $\Delta^{\pi_\star}_t(\xi_i; \pi)$ to be small along the expert trajectories. While behavior cloning is conceptually simple, it can perform poorly in practice due to distribution shifts triggered by errors in the learned policy $\hat{\pi}_{\mathrm{bc}}$.
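As a point of reference for what follows, the BC objective above amounts to ordinary supervised regression on expert state-action pairs. The sketch below is our own illustration in JAX (the framework our experiments in Section 5 also use); the function and argument names are assumptions, and the loss $h$ is instantiated as a mean squared error for concreteness.

```python
import jax
import jax.numpy as jnp

def bc_loss(policy_params, policy_apply, expert_states, expert_actions):
    """Behavior-cloning objective over one expert trajectory.

    expert_states:  (T, d) array of states x_t along an expert rollout
    expert_actions: (T, m) array of the expert's actions at those states
    Here h is taken to be the mean squared discrepancy ||pi(x_t) - pi*(x_t)||^2.
    """
    predicted = jax.vmap(lambda x: policy_apply(policy_params, x))(expert_states)
    discrepancy = predicted - expert_actions      # Delta_t evaluated on expert states
    return jnp.mean(jnp.sum(discrepancy ** 2, axis=-1))

# The full ERM problem averages bc_loss over the n expert trajectories and
# minimizes over policy_params, e.g. with jax.grad and a first-order optimizer.
```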
Specifically, due to the effects of compounding errors, the closed-loop system induced by the behavior cloning policy $\hat{\pi}_{\mathrm{bc}}$ may lead to a dramatically different distribution over system trajectories than that induced by the expert policy $\pi_\star$, even when the population risk on the expert data, $\mathbb{E}_\xi\, h\big(\{\Delta^{\pi_\star}_t(\xi; \hat{\pi}_{\mathrm{bc}})\}_{t=0}^{T-1}\big)$, is small. As described in the introduction, existing approaches to mitigating distribution shift seek to augment the data with examples of the expert recovering from errors. These approaches either require an interactive oracle (e.g., DAgger) or access to a simulator for policy rollouts (e.g., DART, GAIL), and may not always be practically applicable. To address the distribution-shift challenge without resorting to data augmentation, we propose TaSIL, an off-policy IL algorithm which provably leads to learned policies that are robust to distribution shift.

The rest of the paper is organized as follows. In Section 3, we focus on ensuring robustness to policy errors for a single initial condition $\xi$. We show that the imitation gap $\|x^{\pi}_t(\xi) - x^{\pi_\star}_t(\xi)\|$ between a test policy $\pi$ and the expert policy $\pi_\star$ can be controlled by the TaSIL-augmented imitation loss evaluated on the expert trajectory $\{x^{\pi_\star}_t(\xi)\}$, effectively reducing the analysis of the imitation gap to a supervised learning problem over the expert data. In Section 4, we integrate these results with tools from statistical learning theory to show that $n \gtrsim \varepsilon^{-r}/\delta$ trajectories are sufficient to achieve an imitation gap of at most $\varepsilon$ with probability at least $1 - \delta$; here $r > 0$ is a constant determined by the stability properties of the expert policy $\pi_\star$, with more robust experts corresponding to smaller values of $r$. Finally, in Section 5, we validate our analysis empirically, and show that using the TaSIL-augmented loss function in IL algorithms leads to significant gains in performance and sample efficiency.

## 3 Bounding the imitation gap on a single trajectory

In this section, we fix an initial condition $\xi$ and a test policy $\pi$, and seek to control the imitation gap $\Gamma_T(\xi; \pi) := \max_{1 \le t \le T} \|x^{\pi}_t(\xi, \{0\}) - x^{\pi_\star}_t(\xi, \{0\})\|$. A natural way to compare the closed-loop behavior of a test policy $\pi$ to that of the expert policy $\pi_\star$ is to view the discrepancy $\Delta^{\pi}_t(\xi; \pi) := \pi(x^{\pi}_t(\xi)) - \pi_\star(x^{\pi}_t(\xi))$ as an input perturbation to the expert closed-loop system. By writing $x_{t+1} = f^{\pi}_{\mathrm{cl}}(x_t, 0) = f^{\pi_\star}_{\mathrm{cl}}(x_t, \Delta^{\pi}_t(\xi; \pi))$, the imitation gap can be written as $\Gamma_T(\xi; \pi) = \max_t \|x^{\pi_\star}_t(\xi, \{\Delta^{\pi}_s(\xi; \pi)\}_{s=0}^{t-1}) - x^{\pi_\star}_t(\xi, \{0\})\|$, suggesting that closed-loop expert systems that are robust to input perturbations, as measured by the difference between nominal and perturbed trajectories, will lead to learned policies that enjoy smaller imitation gaps.

Stability conditions defined in terms of differences between nominal and perturbed trajectories have been extensively studied in robust nonlinear control theory via the notion of incremental input-to-state stability ($\delta$-ISS) (see, e.g., Angeli [21] and references therein). Before proceeding, we recall definitions of standard comparison functions [22]: a function $\gamma(x)$ is class $\mathcal{K}$ if it is continuous, strictly increasing, and satisfies $\gamma(0) = 0$, and a function $\beta(x, t)$ is class $\mathcal{KL}$ if it is continuous, $\beta(\cdot, t)$ is class $\mathcal{K}$ for each fixed $t$, and $\beta(x, \cdot)$ is decreasing for each fixed $x$.

**Definition 3.1 ($\delta$-ISS system).** Consider the closed-loop system evolving under policy $\pi$ and subject to perturbations $\delta_t$, given by $x_{t+1} = f^{\pi}_{\mathrm{cl}}(x_t, \delta_t)$.
The closed-loop system $f^{\pi}_{\mathrm{cl}}(x_t, \delta_t)$ is incrementally input-to-state stable ($\delta$-ISS) if there exist a class $\mathcal{KL}$ function $\beta$ and a class $\mathcal{K}$ function $\gamma$ such that for all initial conditions $\xi_1, \xi_2 \in X$, perturbation sequences $\{\delta_t\}_{t \ge 0}$, and $t \in \mathbb{N}$:

$$\big\|x^{\pi}_t(\xi_1, \{\delta_s\}_{s=0}^{t-1}) - x^{\pi}_t(\xi_2, \{0\}_{s=0}^{t-1})\big\| \le \beta(\|\xi_1 - \xi_2\|, t) + \gamma\Big(\max_{0 \le k \le t-1} \|\delta_k\|\Big). \tag{2}$$

Definition 3.1 says that: (i) trajectories generated by $\delta$-ISS systems converge towards each other if they begin from different initial conditions, and (ii) the effect of bounded perturbations $\{\delta_t\}$ on trajectories is bounded. Our results will only require the stability conditions of Definition 3.1 to hold for a class of norm-bounded perturbations. In light of this, we say that a system is $\eta$-locally $\delta$-ISS if equation (2) holds for all input perturbations satisfying $\sup_{t \in \mathbb{N}} \|\delta_t\| \le \eta$.

By writing $x_{t+1} = f^{\pi}_{\mathrm{cl}}(x_t, 0) = f^{\pi_\star}_{\mathrm{cl}}(x_t, \Delta^{\pi}_t(\xi; \pi))$, we conclude that if $x_{t+1} = f^{\pi_\star}_{\mathrm{cl}}(x_t, \delta_t)$ is $\eta$-locally $\delta$-ISS and $\sup_{t \in \mathbb{N}} \|\Delta^{\pi}_t(\xi; \pi)\| \le \eta$, then equation (2) yields

$$\|x^{\pi}_t(\xi) - x^{\pi_\star}_t(\xi)\| \le \gamma\Big(\max_{0 \le k \le t-1} \|\Delta^{\pi}_k(\xi; \pi)\|\Big) \ \implies\ \Gamma_T(\xi; \pi) \le \gamma\Big(\max_{0 \le t \le T-1} \|\Delta^{\pi}_t(\xi; \pi)\|\Big). \tag{3}$$

Equation (3) shows that the imitation gap $\|x^{\pi}_t(\xi) - x^{\pi_\star}_t(\xi)\|$ is controlled by the maximum discrepancy $\|\Delta^{\pi}_t(\xi; \pi)\|$ incurred on trajectories generated by the policy $\pi$. A natural way of bounding the test discrepancy $\|\Delta^{\pi}_t(\xi; \pi)\|$, defined over trajectories generated by $\pi$, by the training discrepancy $\|\Delta^{\pi_\star}_t(\xi; \pi)\|$, defined over trajectories generated by $\pi_\star$, is to write out a Taylor series expansion of the former around the latter (we only take a first-order expansion here for illustrative purposes), i.e., to write

$$\|\Delta^{\pi}_t(\xi; \pi)\| \le \|\Delta^{\pi_\star}_t(\xi; \pi)\| + \|\partial_x \Delta^{\pi_\star}_t(\xi; \pi)\|\, \Gamma_T(\xi; \pi) + O\big(\Gamma^2_T(\xi; \pi)\big), \tag{4}$$

where $\partial_x \Delta^{\pi_\star}_t(\xi; \pi)$ denotes the partial derivative of the discrepancy with respect to the argument of the policies $\pi$ and $\pi_\star$. The challenge, however, is that the imitation gap $\Gamma_T(\xi; \pi)$ also appears in the Taylor series expansion (4), leading to an implicit constraint. We show that this can be overcome if the order $p$ of the Taylor series expansion (4) is sufficiently large, as determined by the decay rate of the class $\mathcal{K}$ function $\gamma(\cdot)$ defining the robustness of the expert policy. We begin by identifying a condition, reminiscent of adversarially robust training objectives, that ensures a small imitation gap.

**Proposition 3.1.** Let the expert closed-loop system $f^{\pi_\star}_{\mathrm{cl}}$ be $\eta$-locally $\delta$-ISS for some $\eta > 0$. Fix an imitation gap bound $\varepsilon > 0$, initial condition $\xi$, and policy $\pi$. Then if

$$\max_{0 \le t \le T-1}\ \sup_{\|\delta\| \le \varepsilon}\ \big\|\pi_\star(x^{\pi_\star}_t(\xi) + \delta) - \pi(x^{\pi_\star}_t(\xi) + \delta)\big\| \le \min\{\eta,\ \gamma^{-1}(\varepsilon)\}, \tag{5}$$

we have that the imitation gap satisfies $\Gamma_T(\xi; \pi) \le \varepsilon$.

Proposition 3.1 states that if a policy $\pi$ is sufficiently close to the expert policy $\pi_\star$ in a tube around the expert trajectory, then the imitation gap remains small. How to ensure that inequality (5) holds using only offline data from expert trajectories is not immediately obvious. The Taylor series expansion (4) suggests that a natural approach to satisfying this condition is to match derivatives of the test policy $\pi$ to those of the expert policy $\pi_\star$. We show next that a sufficient order for such a Taylor series expansion is naturally determined by the decay rate of the class $\mathcal{K}$ function $\gamma(x)$ towards $0$. Less robust experts have functions that decay to $0$ more slowly and will lead to stricter sufficient conditions. Conversely, more robust experts have functions that decay to $0$ more quickly, and will lead to more relaxed sufficient conditions. We focus on two disjoint classes of class $\mathcal{K}$ functions: (i) functions that decay in their argument faster than a linear function, i.e., $\gamma(x) \in o(x)$ as $x \to 0^+$, and (ii) functions that decay in their argument no faster than a linear function, i.e., $\gamma(x) \in \Omega(x)$ as $x \to 0^+$.
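Before specializing to these two regimes, the quantities driving the analysis, namely the closed-loop trajectories $x^{\pi}_t(\xi)$, the discrepancy $\Delta$, and the imitation gap $\Gamma_T$, can all be computed directly from rollouts. The following JAX sketch uses our own naming and assumes generic `dynamics(x, u)` and policy callables; it is only meant to make the definitions concrete.

```python
import jax
import jax.numpy as jnp

def closed_loop_states(dynamics, policy, xi, T):
    """Return x_0, ..., x_{T-1} of x_{t+1} = f(x_t, pi(x_t)) started at xi."""
    def step(x, _):
        return dynamics(x, policy(x)), x
    _, states = jax.lax.scan(step, xi, None, length=T)
    return states                                          # shape (T, d)

def discrepancy(policy, expert, states):
    """Delta_t = pi(x_t) - pi*(x_t) along a given state sequence."""
    return jax.vmap(policy)(states) - jax.vmap(expert)(states)

def imitation_gap(dynamics, policy, expert, xi, T):
    """Gamma_T(xi; pi) = max_t ||x_t^pi(xi) - x_t^{pi*}(xi)|| for unperturbed rollouts."""
    xs_pi = closed_loop_states(dynamics, policy, xi, T)
    xs_star = closed_loop_states(dynamics, expert, xi, T)
    return jnp.max(jnp.linalg.norm(xs_pi - xs_star, axis=-1))
```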
**Rapidly decaying class $\mathcal{K}$ functions.** We show that when the class $\mathcal{K}$ function $\gamma(x)$ decays to $0$ faster than $O(x)$ in some neighborhood of $0$, then matching the zeroth-order difference $\max_t \|\Delta^{\pi_\star}_t(\xi; \pi)\|$ on the expert trajectory, as is done in vanilla behavior cloning, suffices to close the imitation gap. We make the following assumption on the test policy $\pi$ and expert policy $\pi_\star$.

**Assumption 3.1.** There exists a non-negative constant $L_\pi$ such that $\|\pi'(x) - \pi'(y)\|_2 \le L_\pi \|x - y\|_2$ for all $x, y$ and $\pi' \in \{\pi, \pi_\star\}$.

Proposition 3.1 then leads to the following guarantee on the imitation gap.

**Theorem 3.1.** Fix a test policy $\pi$ and initial condition $\xi \in X$, and let Assumption 3.1 hold. Let $f^{\pi_\star}_{\mathrm{cl}}$ be $\eta$-locally $\delta$-ISS for some $\eta > 0$, and assume that the class $\mathcal{K}$ function $\gamma(\cdot)$ in (2) satisfies $\gamma(x) \le O(x^{1+r})$ for some $r > 0$. Choose constants $\mu, \alpha > 0$ such that

$$2 L_\pi x + (x/\mu)^{\frac{1}{1+r}} \le \gamma^{-1}(x) \quad \text{for all } 0 \le x \le \alpha. \tag{6}$$

Provided that the imitation error on the expert trajectory incurred by $\pi$ satisfies

$$\max_{0 \le t \le T-1} \mu\, \|\Delta^{\pi_\star}_t(\xi; \pi)\|^{1+r} \le \alpha, \qquad \max_{0 \le t \le T-1} 2 L_\pi \mu\, \|\Delta^{\pi_\star}_t(\xi; \pi)\|^{1+r} + \|\Delta^{\pi_\star}_t(\xi; \pi)\| \le \eta, \tag{7}$$

then for all $1 \le t \le T$ the instantaneous imitation gap is bounded as

$$\|x^{\pi}_t(\xi) - x^{\pi_\star}_t(\xi)\| \le \max_{0 \le k \le t-1} \mu\, \|\Delta^{\pi_\star}_k(\xi; \pi)\|^{1+r}. \tag{8}$$

Theorem 3.1 shows that if a policy $\pi$ is a sufficiently good approximation of the expert policy $\pi_\star$ on an expert trajectory $\{x^{\pi_\star}_t(\xi)\}$, then the imitation gap $\|x^{\pi}_t(\xi) - x^{\pi_\star}_t(\xi)\|$ can be upper bounded in terms of the discrepancy term $\max_{0 \le k \le t-1} \|\Delta^{\pi_\star}_k(\xi; \pi)\|$ evaluated on the expert trajectory. To help illustrate the effect of the decay parameter $r$ on the choices of $\mu$ and $\alpha$, we make condition (6) more explicit by assuming that $\gamma(x) \le C x^{1+r}$ for all $x \in [0, 1]$. Then one can choose $\mu = 2^{1+r} C$ and $\alpha = O(1)\,(L_\pi^{1+r} C)^{-1/r}$. This makes clear that a larger $r$, i.e., a more robust expert, leads to less restrictive conditions (7) on the policy $\pi$ and a tighter upper bound on the imitation gap (8) (assuming $\max_{0 \le k \le t-1} \|\Delta^{\pi_\star}_k(\xi; \pi)\| < 1$). In particular, for such systems, vanilla behavior cloning is sufficient to ensure a bounded imitation gap. This also makes clear how the result breaks down as $r \to 0$, i.e., when the decay rate is nearly linear, as the neighborhood $\alpha$ can become arbitrarily small, such that it may be impossible to learn a policy $\pi$ that satisfies the bound (7) with a practical number of samples $n$. The interplay between the bounds (7) in Theorem 3.1 (and the subsequent theorems of its like) and the sample-complexity of imitation learning will be discussed in further detail in Section 4.

**Slowly decaying class $\mathcal{K}$ functions.** When the class $\mathcal{K}$ function $\gamma(x)$ decays to $0$ slowly, for reasons discussed above, controlling the zeroth-order difference $\|\Delta^{\pi_\star}_t(\xi; \pi)\|$ may not be sufficient to bound the imitation gap. In particular, we consider class $\mathcal{K}$ functions satisfying $\gamma(x) \le O(x^{1/r})$ for some $r \ge 1$. Setting $p = \lceil r \rceil$, we now show that matching up to the $p$-th total derivative of $\pi$ is sufficient to control the imitation gap. Analogously to Assumption 3.1, we make the following regularity assumption on the test policy $\pi$ and expert policy $\pi_\star$.

**Assumption 3.2.** For a given non-negative $p \in \mathbb{N}$, assume that the test policy $\pi$ and expert policy $\pi_\star$ are $p$-times continuously differentiable, and there exists a constant $L_{\partial^p \pi} \ge 0$ such that

$$\big\|\pi'(x) - J^p_{x_0} \pi'(x)\big\| \le \frac{L_{\partial^p \pi}}{(p+1)!}\, \|x - x_0\|^{p+1}, \tag{9}$$

for all $x, x_0$ and $\pi' \in \{\pi, \pi_\star\}$, where $J^p_{x_0} \pi'(x) := \sum_{j=0}^{p} \frac{1}{j!}\, \partial^j_x \pi'(x_0)\, (x - x_0)^{\otimes j}$ is the $p$-th order Taylor polynomial of $\pi'$ evaluated at $x_0$, and $\otimes$ denotes the tensor product.
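To make Assumption 3.2 concrete, the sketch below evaluates the Taylor polynomial $J^p_{x_0}\pi(x)$ for $p \le 2$ using nested forward-mode Jacobians in JAX. The function names are ours, and the snippet is purely illustrative, not part of the TaSIL algorithm itself.

```python
import jax
import jax.numpy as jnp

def taylor_polynomial(policy, x0, x, p):
    """Evaluate J^p_{x0} pi(x) = sum_{j<=p} (1/j!) d^j pi(x0) [x - x0]^{(x)j} for p in {0, 1, 2}."""
    dx = x - x0
    out = policy(x0)                                  # j = 0 term, shape (m,)
    if p >= 1:
        jac = jax.jacfwd(policy)(x0)                  # shape (m, d)
        out = out + jac @ dx
    if p >= 2:
        hess = jax.jacfwd(jax.jacfwd(policy))(x0)     # shape (m, d, d)
        out = out + 0.5 * jnp.einsum("mij,i,j->m", hess, dx, dx)
    return out

# The remainder ||pi(x) - taylor_polynomial(pi, x0, x, p)|| is the quantity that
# inequality (9) bounds by (L / (p+1)!) * ||x - x0||^{p+1}.
```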
With this assumption in hand, we provide the following guarantee on the imitation gap.

**Theorem 3.2.** Let $f^{\pi_\star}_{\mathrm{cl}}$ be $\eta$-locally $\delta$-ISS for some $\eta > 0$, and assume that the class $\mathcal{K}$ function $\gamma(\cdot)$ in (2) satisfies $\gamma(x) \le O(x^{1/r})$ for some $r \ge 1$. Fix a test policy $\pi$ and initial condition $\xi \in X$, and let Assumption 3.2 hold for $p \in \mathbb{N}$ satisfying $p + 1 - r > 0$. Choose $\mu, \alpha > 0$ such that

$$\frac{2 L_{\partial^p \pi}}{(p+1)!}\, x^{p+1} + (x/\mu)^{r} \le \gamma^{-1}(x), \quad \text{for all } 0 \le x \le \alpha \le 1. \tag{10}$$

Provided the $j$-th total derivatives, $j = 0, \ldots, p$, of the imitation error on the expert trajectory incurred by $\pi$ satisfy

$$\max_{0 \le t \le T-1}\ \max_{0 \le j \le p}\ \frac{\mu}{2}\, \big\|\partial^j_x \Delta^{\pi_\star}_t(\xi; \pi)\big\|^{1/r} \le \alpha, \tag{11}$$

$$\max_{0 \le t \le T-1}\ \max_{0 \le j \le p}\ 2 L_{\partial^p \pi}\, \mu^{p+1}\, \big\|\partial^j_x \Delta^{\pi_\star}_t(\xi; \pi)\big\|^{(p+1)/r} + \big\|\partial^j_x \Delta^{\pi_\star}_t(\xi; \pi)\big\| \le \eta, \tag{12}$$

then for all $1 \le t \le T$ the instantaneous imitation gap is bounded by

$$\|x^{\pi}_t(\xi) - x^{\pi_\star}_t(\xi)\| \le \max_{0 \le k \le t-1}\ \max_{0 \le j \le p}\ \frac{\mu}{2}\, \big\|\partial^j_x \Delta^{\pi_\star}_k(\xi; \pi)\big\|^{1/r}. \tag{13}$$

Theorem 3.2 shows that if the $p$-th order Taylor series of the policy $\pi$ approximately matches that of the expert policy $\pi_\star$ when evaluated on an expert trajectory $\{x^{\pi_\star}_t(\xi)\}$, then the imitation gap $\|x^{\pi}_t(\xi) - x^{\pi_\star}_t(\xi)\|$ can be upper bounded in terms of the derivatives of the discrepancy term, i.e., by $\max_{0 \le k \le t-1} \max_{0 \le j \le p} \|\partial^j_x \Delta^{\pi_\star}_k(\xi; \pi)\|$, evaluated on the expert trajectory. To help illustrate the effect of the choice of the order $p$ on the constants $\mu$ and $\alpha$, we make condition (10) more explicit by assuming that $\gamma(x) \le C x^{1/r}$ for all $x \in [0, 1]$. Then one can choose $\mu = 2^{1/r} C$ and $\alpha = O(1)\,(L_{\partial^p \pi} C^{r})^{-1/(p+1-r)}$. This expression highlights a tradeoff: by picking a larger order $p$, the right-hand side $\alpha$ of bound (11) increases, but at the expense of having to match higher-order derivatives. This also highlights that both the order $p$ and the closeness required by equation (11) get increasingly restrictive as $r$ increases, matching our intuition that less robust experts lead to harder imitation learning problems.

**Using estimated derivatives.** We show in Appendix C that the results of Theorems 3.1 and 3.2 extend gracefully to the setting where only approximate derivatives $\widehat{\partial^j_x \pi_\star}(x)$ can be obtained, e.g., through finite-difference methods. In particular, if $\|\widehat{\partial^j_x \pi_\star}(x) - \partial^j_x \pi_\star(x)\| \le \varepsilon$, then it suffices to appropriately tighten the constraints (11) and (12) by $O(\varepsilon^{1/r})$ and $O(\varepsilon)$, respectively. Please refer to Appendix C for more details.

## 4 Algorithms and generalization bounds for TaSIL

The analysis of Section 3 focused on a single test policy $\pi$ and initial condition $\xi$. Theorems 3.1 and 3.2 motivate defining the $p$-TaSIL loss function

$$\ell^{\pi_\star}_p(\xi; \pi) := \frac{1}{p+1} \sum_{j=0}^{p} \max_{0 \le t \le T-1} \big\|\partial^j_x \Delta^{\pi_\star}_t(\xi; \pi)\big\|. \tag{14}$$

The corresponding policy $\hat{\pi}_{\mathrm{TaSIL},p}$ is the solution to the empirical risk minimization (ERM) problem

$$\hat{\pi}_{\mathrm{TaSIL},p} \in \operatorname*{argmin}_{\pi \in \Pi}\ \sum_{i=1}^{n} \ell^{\pi_\star}_p(\xi_i; \pi), \tag{15}$$

which explicitly seeks to learn a policy $\pi \in \Pi$, for $\Pi$ a suitable policy class, that matches the $p$-th order Taylor series expansion of the expert policy. (Although we focus on the supremum loss $\max_{0 \le t \le T-1} \|\partial^j_x \Delta^{\pi_\star}_t(\xi; \pi)\|$ in our analysis, any surrogate loss that upper bounds the supremum loss, e.g., $\sum_{t=0}^{T-1} \|\partial^j_x \Delta^{\pi_\star}_t(\xi; \pi)\|$, can be used.) In this section, we analyze the generalization and sample-complexity properties of the $p$-TaSIL ERM problem (15).

Our analysis in this section focuses on the realizable setting: we assume that for every dataset of expert trajectories $\{\{x^{\pi_\star}_t(\xi_i)\}_{t=0}^{T-1}\}_{i=1}^{n}$, there exists a policy $\pi \in \Pi$ that achieves (near) zero empirical risk. This is true if, for example, $\pi_\star \in \Pi$. In this setting, we demonstrate that we can attain generalization bounds that decay as $\tilde{O}(n^{-1})$, where $\tilde{O}(\cdot)$ hides poly-logarithmic dependencies on $n$.
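As a concrete reference for the objective in (14)–(15), the following is a minimal JAX sketch of the $p$-TaSIL loss with $p = 1$ on a single expert trajectory. The naming is ours (not the released implementation), the operator norm is replaced by a Frobenius norm of the Jacobian difference for simplicity, and the expert is assumed to be differentiable; otherwise the finite-difference variant discussed above applies.

```python
import jax
import jax.numpy as jnp

def tasil_loss_p1(policy_params, policy_apply, expert_apply, expert_states):
    """ell_p(xi; pi) from (14) with p = 1, on one expert trajectory x_0, ..., x_{T-1}."""
    def delta(x):                                    # Delta(x) = pi(x) - pi*(x)
        return policy_apply(policy_params, x) - expert_apply(x)

    d0 = jax.vmap(delta)(expert_states)              # (T, m): zeroth-order discrepancies
    d1 = jax.vmap(jax.jacfwd(delta))(expert_states)  # (T, m, d): Jacobian discrepancies

    term0 = jnp.max(jnp.linalg.norm(d0, axis=-1))                           # max_t ||Delta_t||
    term1 = jnp.max(jnp.linalg.norm(d1.reshape(d1.shape[0], -1), axis=-1))  # max_t ||d/dx Delta_t||_F
    return 0.5 * (term0 + term1)                     # 1/(p+1) times the sum over j = 0, 1

# The ERM problem (15) averages this quantity over the n expert trajectories and
# minimizes it over the policy parameters.
```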
Such $\tilde{O}(1/n)$ rates are referred to as fast rates in statistical learning, since they decay faster than the $n^{-1/2}$ rate prescribed by the central limit theorem. We present analysis for the non-realizable setting in Appendix B: this analysis is standard and yields generalization bounds scaling as $O(n^{-1/2})$.

Let $\mathcal{G} \subseteq [0, 1]^{\mathcal{X}}$ be a set of functions mapping some domain $\mathcal{X}$ to $[0, 1]$ (this is without loss of generality for $[0, B]$-bounded functions by considering the normalized function class $B^{-1}\mathcal{G} := \{B^{-1} g \mid g \in \mathcal{G}\} \subseteq [0, 1]^{\mathcal{X}}$). Let $D$ be a distribution with support restricted to $\mathcal{X}$, and denote the mean of $g \in \mathcal{G}$ with respect to $x \sim D$ by $\mathbb{E}_x[g]$. Similarly, fixing data points $x_1, \ldots, x_n \in \mathcal{X}$, we denote the empirical mean of $g$ by $\mathbb{E}_n[g] := n^{-1} \sum_{i=1}^{n} g(x_i)$. We focus our analysis on the following class of parametric Lipschitz function classes.

**Definition 4.1 (Lipschitz parametric function class).** A parametric function class $\mathcal{G} \subseteq [0, 1]^{\mathcal{X}}$ is called $(B_\theta, L_\theta, q)$-Lipschitz if $\mathcal{G} = \{g_\theta(\cdot) \mid \theta \in \Theta\}$ with $\Theta \subseteq \mathbb{R}^q$, and it satisfies the following boundedness and uniform Lipschitz conditions:

$$\sup_{\theta \in \Theta} \|\theta\| \le B_\theta, \qquad \sup_{x \in \mathcal{X}}\ \sup_{\theta_1, \theta_2 \in \Theta,\ \theta_1 \ne \theta_2} \frac{|g_{\theta_1}(x) - g_{\theta_2}(x)|}{\|\theta_1 - \theta_2\|} \le L_\theta. \tag{16}$$

We assume without loss of generality that $B_\theta L_\theta \ge 1$.

The description (16) is very general and, as we show next, is compatible with feed-forward neural networks with differentiable activation functions. We then have the following generalization bound, which adapts [23, Corollary 3.7] to Lipschitz parametric function classes using the machinery of local Rademacher complexities [23]. Alternatively, the result can also be derived from [24, Theorem 3].

**Theorem 4.1.** Let $\mathcal{G} \subseteq [0, 1]^{\mathcal{X}}$ be a $(B_\theta, L_\theta, q)$-Lipschitz parametric function class. There exists a universal positive constant $K < 10^6$ such that the following holds. Given $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the i.i.d. draws $x_1, \ldots, x_n \sim D$, for all $g \in \mathcal{G}$ the following bound holds:

$$\mathbb{E}_x[g] \le 2\, \mathbb{E}_n[g] + K\, \frac{q \log(B_\theta L_\theta n) + \log(1/\delta)}{n}. \tag{17}$$

We now use Theorem 4.1 to analyze the generalization properties of the $p$-TaSIL ERM problem (15). In what follows, we assume that the expert closed-loop system is stable in the sense of Lyapunov, i.e., that there exists $B_X > 0$ such that $\sup_{t \in \mathbb{N},\, \xi \in X} \|x^{\pi_\star}_t(\xi)\| \le B_X$, and consider the following parametric class of $(p+2)$-times continuously differentiable policies:

$$\Pi_{\theta,p} := \big\{\pi(x, \theta)\ \big|\ \theta \in \mathbb{R}^q,\ \|\theta\| \le B_\theta,\ \pi(0, \theta) = 0\ \forall \theta,\ \pi \text{ is } (p+2)\text{-times continuously differentiable}\big\}. \tag{18}$$

Define the constants

$$B_j := \sup_{\|x\| \le B_X,\, \|\theta\| \le B_\theta} \big\|\partial^j_x \pi(x, \theta)\big\|, \qquad L_j := \sup_{\|x\| \le B_X,\, \|\theta\| \le B_\theta} \big\|\partial^{j+1}_{x\theta} \pi(x, \theta)\big\|, \qquad j = 0, \ldots, p,$$

and note that they are guaranteed to be finite under our regularity assumptions. Finally, define the loss function class

$$\ell^{\pi_\star}_p \circ \Pi_{\theta,p} := \big\{\ell^{\pi_\star}_p(\cdot\,; \pi) \text{ defined in (14)}\ \big|\ \pi \in \Pi_{\theta,p}\big\}. \tag{19}$$

From a repeated application of Taylor's theorem, we show in Lemma B.2 that $B^{-1}_{\ell,p}\,(\ell^{\pi_\star}_p \circ \Pi_{\theta,p})$ is a $(B_\theta,\ B^{-1}_{\ell,p} L_{\ell,p},\ q)$-Lipschitz parametric function class, for $B_{\ell,p} := \frac{2}{p+1} \sum_{j=0}^{p} B_j$ and $L_{\ell,p} := \frac{B_X}{p+1} \sum_{j=0}^{p} L_j$. We now combine this with Theorem 4.1 to bound the population risk achieved by the solution to the TaSIL ERM problem (15).

**Corollary 4.1.** Let the policy class $\Pi_{\theta,p}$ be defined as in (18), and assume that $\pi_\star \in \Pi_{\theta,p}$. Let the function class $\ell^{\pi_\star}_p \circ \Pi_{\theta,p}$ be defined as in (19), and the constants $B_{\ell,p}, L_{\ell,p}$ be defined as above. Let $\hat{\pi}_{\mathrm{TaSIL},p}$ be any empirical risk minimizer (15). Then with probability at least $1 - \delta$ over the initial conditions $\{\xi_i\}_{i=1}^{n}$ drawn i.i.d. from $D$,

$$\mathbb{E}_\xi\, \ell^{\pi_\star}_p(\xi; \hat{\pi}_{\mathrm{TaSIL},p}) \le O(1)\, B_{\ell,p}\, \frac{q \log\big(B_\theta B^{-1}_{\ell,p} L_{\ell,p}\, n\big) + \log(1/\delta)}{n}. \tag{20}$$
We note that since these generalization bounds solely concern the supervised learning problem of matching the expert on the expert trajectories, the constants do not depend on the stability properties of the trajectories generated by the learned policy $\hat{\pi}_{\mathrm{TaSIL},p}$. To convert the generalization bound (20) into a probabilistic bound on the imitation gap $\Gamma_T(\xi; \hat{\pi}_{\mathrm{TaSIL},p})$, we first apply Markov's inequality to bound, in terms of the expected TaSIL loss (20), the probability that the conditions of Theorem 3.1 or 3.2 are violated, and then apply Corollary 4.1 together with Markov's inequality.

**Theorem 4.2 (Rapidly decaying class $\mathcal{K}$ functions).** Assume that $\pi_\star \in \Pi_{\theta,0}$ and let the assumptions of Theorem 3.1 hold for all $\pi \in \Pi_{\theta,0}$. Let equation (6) hold with constants $\mu, \alpha > 0$, and assume without loss of generality that $\alpha/\mu \le 1$ and $L_\pi \mu \ge 1/2$. Let $\hat{\pi}_{\mathrm{TaSIL},0}$ be an empirical risk minimizer of $\ell^{\pi_\star}_0$ over the policy class $\Pi_{\theta,0}$ for initial conditions $\{\xi_i\}_{i=1}^{n}$ drawn i.i.d. from $D$. Fix a failure probability $\delta \in (0, 1)$, and assume that

$$n \ge O(1)\, \max\left\{ \frac{B_{\ell,0}\, \kappa_\alpha}{\delta}\, \log\!\Big(\frac{\kappa_\alpha B_\theta L_{\ell,0}}{\delta}\Big),\ \ \frac{B_{\ell,0}\, \kappa_\eta}{\delta}\, \log\!\Big(\frac{\kappa_\eta B_\theta L_{\ell,0}}{\delta}\Big)\right\},$$

where $\kappa_\alpha := q\,(\mu/\alpha)^{1/(1+r)}$ and $\kappa_\eta := q\, L_\pi \mu/\eta$. Then with probability at least $1 - \delta$, the imitation gap evaluated on $\xi \sim D$ (drawn independently from $\{\xi_i\}_{i=1}^{n}$) satisfies

$$\Gamma_T(\xi; \hat{\pi}_{\mathrm{TaSIL},0}) \le O(1)\, \mu \left(\frac{B_{\ell,0}\, q \log\big(B_\theta B^{-1}_{\ell,0} L_{\ell,0}\, n\big)}{n\, \delta}\right)^{1+r}.$$

**Theorem 4.3 (Slowly decaying class $\mathcal{K}$ functions).** Assume that $\pi_\star \in \Pi_{\theta,p}$, and let the assumptions of Theorem 3.2 hold for all $\pi \in \Pi_{\theta,p}$. Let equation (10) hold with constants $\mu, \alpha > 0$. Let $\hat{\pi}_{\mathrm{TaSIL},p}$ be an empirical risk minimizer of $\ell^{\pi_\star}_p$ over the policy class $\Pi_{\theta,p}$ for initial conditions $\{\xi_i\}_{i=1}^{n}$ drawn i.i.d. from $D$. Fix a failure probability $\delta \in (0, 1)$, and assume

$$n \ge O(1)\, \max_{j \le p}\ \max\left\{ \kappa_{\alpha,j}\, \log\!\Big(\frac{B_\theta B^{-1}_j B_X L_j}{\delta}\Big),\ \ \kappa_{\eta,j}\, \log\!\Big(\frac{B_\theta B^{-1}_j B_X L_j}{\delta}\Big)\right\},$$

where $\kappa_{\alpha,j} := \frac{\mu}{j!}$ and $\kappa_{\eta,j} := \frac{1}{\eta\delta}\Big(\frac{L_{\partial^p \pi}}{(p+1)!}\, \frac{\mu^{p+1}}{(j!)^{(p+1)/r}} + 1\Big)$. Then with probability at least $1 - \delta$, the imitation gap evaluated on $\xi \sim D$ (drawn independently from $\{\xi_i\}_{i=1}^{n}$) satisfies

$$\Gamma_T(\xi; \hat{\pi}_{\mathrm{TaSIL},p}) \le O(1)\, \mu\, \max_{j \le p} \left(\frac{p\, B_j\, q \log\big(B_\theta B^{-1}_j B_X L_j\, n\big)}{j!\, \delta\, n}\right)^{1/r}.$$

In the rapidly decaying setting, corresponding to more robust experts, Theorem 4.2 states that $n \gtrsim \varepsilon^{-\frac{1}{1+r}}/\delta$ expert trajectories are sufficient to ensure that the imitation gap satisfies $\Gamma_T(\xi; \hat{\pi}_{\mathrm{TaSIL},0}) \le \varepsilon$ with probability at least $1 - \delta$. Recall that more robust experts have larger values of $r > 0$, leading to smaller sample-complexity bounds. In contrast, to achieve the same guarantees on the imitation gap in the slowly decaying setting, Theorem 4.3 states that $n \gtrsim \varepsilon^{-r}/\delta$ expert trajectories are sufficient, where we recall that less robust experts have larger values of $r \ge 1$. These theorems quantitatively show how the robustness of the underlying expert affects the sample-complexity of IL, with more robust experts enjoying better dependence on $\varepsilon$ than less robust experts. We note that analogous dependencies on expert stability are reflected in the burn-in requirements of each theorem, i.e., the number of expert trajectories required to ensure that, with high probability, no catastrophic distribution shift occurs.

## 5 Experiments

We compare three standard imitation learning algorithms, Behavior Cloning, DAgger, and DART, to TaSIL-augmented-loss versions.
In the TaSIL-augmented algorithms, we replace the standard imitation loss

$$\ell^{\pi_\star}_{\mathrm{IL}}(\{\xi_i\}_{i=1}^{n}; \pi) := \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \big\|\Delta^{\pi_\star}_t(\xi_i; \pi)\big\|$$

with the $p$-TaSIL-augmented loss

$$\ell^{\pi_\star}_{\mathrm{TaSIL},p}(\{\xi_i\}_{i=1}^{n}; \pi) := \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \sum_{j=0}^{p} \lambda_j\, \big\|\mathrm{vec}\big(\partial^j_x \Delta^{\pi_\star}_t(\xi_i; \pi)\big)\big\|,$$

where $\{\lambda_j\}_{j=0}^{p}$ are positive tunable regularization parameters. We use the Euclidean norm of the vectorized error in the derivative tensors as a more optimizer-amenable surrogate for the operator norm. We experimentally demonstrate (i) the effect of the expert stability properties and the order of the TaSIL loss, and (ii) that TaSIL-loss-based imitation learning algorithms are more sample-efficient than their standard counterparts. All experiments are carried out using the GPU acceleration and automatic differentiation capabilities of JAX [25] to compute the higher-order derivatives, together with the Flax [26] neural network and Optax [27] optimization toolkits. (The code used for these experiments can be found at https://github.com/unstable-zeros/TaSIL.)

**Stability experiments.** To illustrate the effect of the expert closed-loop system stability on sample-complexity, we consider a simple $\delta$-ISS stable dynamical system with a tunable input-to-state gain function $\gamma$. For state and input $x_t, u_t \in \mathbb{R}^{10}$, the dynamics are

$$x_{t+1} = \eta\, x_t + (1 - \eta)\, \frac{\gamma\big(\|h(x_t) + u_t\|\big)}{\|h(x_t) + u_t\|}\, \big(h(x_t) + u_t\big).$$

The perturbation function $h : \mathbb{R}^{10} \to \mathbb{R}^{10}$ is set to a randomly initialized MLP with two hidden layers of width 32 and GELU [28] activations, such that the expert $\pi_\star(x) = -h(x)$ yields a closed-loop system $f^{\pi_\star}_{\mathrm{cl}}(x, \cdot)$ which is $\delta$-ISS stable with the specified class $\mathcal{K}$ function $\gamma$ (see Appendix D). We use $\eta = 0.95$ for all experiments presented here.

We investigate the performance of $p$-TaSIL loss functions for $\delta$-ISS systems with different class $\mathcal{K}$ stability. We sweep class $\mathcal{K}$ functions $\gamma(x) = C x^{\nu}$ for $\nu \in [0.05, 3]$ with $C = 5$, and $p$-TaSIL loss functions for $p \in \{0, 1, 2\}$ (additional details can be found in Appendix D). The results of this sweep are shown in Figure 1. Higher-order $p$-TaSIL losses significantly reduce both the imitation gap and the mean policy discrepancy on test trajectories. Notably, the first- and second-order TaSIL losses maintain their improved performance for more slowly decaying class $\mathcal{K}$ functions. Theorem 3.1 and Theorem 3.2 yield lower bounds of $\nu = 1$, $\nu = 1/2$, and $\nu = 1/3$ for closing the imitation gap using the 0-TaSIL, 1-TaSIL, and 2-TaSIL losses, respectively. Figure 1 shows significant degradation in the policy discrepancy and the imitation gap starting around these threshold values.

Figure 1 (mean policy discrepancy, left, and imitation gap, right, plotted against $\nu$ for TaSIL orders 0, 1, and 2). Left: the average Euclidean norm difference between the expert and learned policies on trajectories rolled out under the learned policy. Right: the maximum deviation between an expert and learned policy trajectory starting from identical initial conditions. All statistics are averaged across 50 test trajectories, and we plot the mean and standard deviation over 10 random seeds.

**MuJoCo experiments.** We evaluate the ability of the TaSIL loss to improve performance on standard imitation learning tasks by modifying Behavior Cloning, DAgger [6], and DART [7] to use the $\ell_{\mathrm{TaSIL},1}$ loss, and testing them in simulation on different OpenAI Gym MuJoCo tasks [29]. The MuJoCo environments we use and their corresponding (state, input) dimensions are: Walker2d-v3 (17, 6), HalfCheetah-v3 (17, 6), Humanoid-v3 (376, 17), and Ant-v3 (111, 8).
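The 1-TaSIL loss requires the expert's input-output Jacobian along expert states. When the expert cannot be differentiated directly, a finite-difference estimate can stand in for $\partial_x \pi_\star(x)$, as discussed in Section 3 and studied further in the appendix. The sketch below is our own illustration: the function name, step size, and forward-difference scheme are untuned choices, not the exact implementation used in our experiments.

```python
import numpy as np

def fd_expert_jacobian(expert_apply, x, eps=1e-4):
    """Forward-difference estimate of d pi*(x)/dx for a black-box expert.

    expert_apply: maps a state of shape (d,) to an action of shape (m,)
    Returns an (m, d) Jacobian estimate, one coordinate perturbation per column.
    """
    x = np.asarray(x, dtype=np.float64)
    f0 = np.asarray(expert_apply(x))
    jac = np.zeros((f0.shape[0], x.shape[0]))
    for i in range(x.shape[0]):
        e_i = np.zeros_like(x)
        e_i[i] = eps
        jac[:, i] = (np.asarray(expert_apply(x + e_i)) - f0) / eps
    return jac
```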
For all environments we use pretrained expert policies obtained using Soft Actor-Critic reinforcement learning by the Stable-Baselines3 [30] project. The experts consist of multi-layer perceptrons with two hidden layers of 256 units each and ReLU activations. For all environments, the learned policies have two hidden layers with 512 units each and GELU activations, in addition to Batch Normalization. The final policy output for both the expert and the learned policy is rescaled to the valid action space after applying a tanh nonlinearity. We used trajectories of length $T = 300$ for all experiments. We refer to Appendix E for additional experiment details.

In Figure 2 we report the mean expert-normalized rewards across 5 seeds for all algorithms and environments as a function of the trajectory budget provided. Algorithms with the 1-TaSIL loss show significant improvements in sample-complexity across all challenging environments. The expert for the Swimmer environment is very robust due to the simplicity of the task, and so, as predicted by Theorems 3.1 and 4.2, all algorithms (with the exception of the vanilla DAgger algorithm, due to it initially selecting poor rollout trajectories) are able to achieve near-expert performance across all trajectory budgets. We are also able to nearly match or exceed the performance of the standard on-policy methods DAgger and DART with our off-policy 1-TaSIL-loss Behavior Cloning in all environments.

Figure 2 (rows: BC, DART, and DAgger, each with and without the 1-TaSIL loss; columns: Ant, Walker2d, HalfCheetah, and Humanoid; normalized reward plotted against the number of training trajectories): cumulative expert-normalized rewards as a function of trajectory budget for policies trained using different algorithms with and without the 1-TaSIL loss.

Additional experimental results can be found in the appendix. These include representative videos of the behavior achieved by the expert, BC, and TaSIL-augmented BC policies in the supplementary material, where once again a striking improvement is observed, especially in low-data regimes for harder environments, as well as a systematic study of finite-difference-based approximations of 1-TaSIL, which achieve comparable performance to Jacobian-based implementations.

## 6 Conclusion

We presented Taylor Series Imitation Learning (TaSIL), a simple augmentation to behavior cloning that penalizes deviations in the higher-order Taylor series terms between the learned and expert policies. We showed that $\delta$-ISS experts are easier to learn, both in terms of the loss function that needs to be optimized and the resulting sample-complexity guarantees. Finally, we showed the benefit of using TaSIL-augmented losses in BC, DAgger, and DART across a variety of MuJoCo tasks. This work opens up many exciting future directions, including extending TaSIL to pixel-based IL, and to (offline/inverse) reinforcement learning settings.

**Acknowledgements.** We thank Vikas Sindhwani, Sumeet Singh, Jean-Jacques E. Slotine, Jake Varley, and Fengjun Yang for helpful feedback. Nikolai Matni is supported by NSF awards CPS-2038873, CAREER award ECCS-2045834, and a Google Research Scholar award.

## References

[1] Ahmed Hussein, Mohamed M. Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.
[2] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters.
An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1–2):1–179, 2018.
[3] Dean A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, volume 1, 1988.
[4] Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 4693–4700, 2018.
[5] Stefan Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.
[6] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15, pages 627–635. PMLR, 2011.
[7] Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. DART: Noise injection for robust imitation learning. In Proceedings of the 1st Annual Conference on Robot Learning, volume 78, pages 143–156. PMLR, 2017.
[8] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, volume 29, 2016.
[9] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[10] Michael Hertneck, Johannes Köhler, Sebastian Trimpe, and Frank Allgöwer. Learning an approximate model predictive controller with guarantees. IEEE Control Systems Letters, 2(3):543–548, 2018.
[11] He Yin, Peter Seiler, Ming Jin, and Murat Arcak. Imitation learning with stability and safety guarantees. IEEE Control Systems Letters, 6:409–414, 2022.
[12] Allen Ren, Sushant Veer, and Anirudha Majumdar. Generalization guarantees for imitation learning. In Proceedings of the 2020 Conference on Robot Learning, volume 155, pages 1426–1442. PMLR, 2021.
[13] Stephen Tu, Alexander Robey, Tingnan Zhang, and Nikolai Matni. On the sample complexity of stability constrained imitation learning. arXiv preprint arXiv:2102.09161, 2021.
[14] Winfried Lohmiller and Jean-Jacques E. Slotine. On contraction analysis for non-linear systems. Automatica, 34(6):683–696, 1998.
[15] Sumeet Singh, Spencer M. Richards, Vikas Sindhwani, Jean-Jacques E. Slotine, and Marco Pavone. Learning stabilizable nonlinear dynamics with contraction-based regularization. The International Journal of Robotics Research, 40:1123–1150, 2020.
[16] Andre Lemme, Klaus Neumann, R. Felix Reinhart, and Jochen J. Steil. Neural learning of vector fields for encoding stable dynamical systems. Neurocomputing, 141:3–14, 2014.
[17] Harish Ravichandar, Iman Salehi, and Ashwin Dani. Learning partially contracting dynamical systems from demonstrations. In Proceedings of the 1st Annual Conference on Robot Learning, volume 78, pages 369–378. PMLR, 2017.
[18] Vikas Sindhwani, Stephen Tu, and Mohi Khansari. Learning contracting vector fields for stable imitation learning. arXiv preprint arXiv:1804.04878, 2018.
[19] Nicholas M. Boffi, Stephen Tu, and Jean-Jacques E. Slotine. Regret bounds for adaptive nonlinear control. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, volume 144, pages 471–483. PMLR, 2021.
[20] Nicholas M. Boffi, Stephen Tu, Nikolai Matni, Jean-Jacques E. Slotine, and Vikas Sindhwani. Learning stability certificates from data. In Proceedings of the 2020 Conference on Robot Learning, volume 155, pages 1341–1350.
PMLR, 2021.
[21] David Angeli. A Lyapunov approach to incremental stability properties. IEEE Transactions on Automatic Control, 47(3):410–421, 2002.
[22] Hassan K. Khalil. Nonlinear Systems. Pearson Education. Prentice Hall, 2002.
[23] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
[24] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
[25] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
[26] Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2020. URL http://github.com/google/flax.
[27] Matteo Hessel, David Budden, Fabio Viola, Mihaela Rosca, Eren Sezener, and Tom Hennigan. Optax: composable gradient transformation and optimisation, in JAX!, 2020. URL http://github.com/deepmind/optax.
[28] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[29] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
[30] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.
[31] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, volume 23, 2010.
[32] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
[33] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[34] Olivier Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, École Polytechnique: Department of Applied Mathematics, Paris, France, 2002.
[35] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.
[36] Max Simchowitz, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 439–473. PMLR, 2018.
[37] Franco Woolfe, Edo Liberty, Vladimir Rokhlin, and Mark Tygert. A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335–366, 2008.
[38] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.