# Apprenticeship Learning via Frank-Wolfe

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Tom Zahavy, Alon Cohen, Haim Kaplan, Yishay Mansour
Google Research, Tel Aviv

We consider the application of the Frank-Wolfe (FW) algorithm to Apprenticeship Learning (AL). In this setting, we are given a Markov Decision Process (MDP) without an explicit reward function. Instead, we observe an expert that acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert policy. We formulate this problem as finding the projection of the feature expectations of the expert onto the feature expectations polytope: the convex hull of the feature expectations of all the deterministic policies in the MDP. We show that this formulation is equivalent to the AL objective and that solving this problem with the FW algorithm is equivalent to the well-known Projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from the convex optimization literature and to derive tighter convergence bounds for AL. Specifically, we show that a variation of the FW method that is based on taking "away steps" achieves a linear rate of convergence when applied to AL, and that a stochastic version of the FW algorithm can be used to avoid precise estimation of feature expectations. We also show experimentally that this version outperforms the FW baseline. To the best of our knowledge, this is the first work that shows linear convergence rates for AL.

## 1 Introduction

We consider sequential decision making in the Markov decision process (MDP) formalism. Given an MDP, the optimal policy and its value function are characterized by the Bellman equations and can be computed via value or policy iteration. This makes the MDP model useful in problems where we can specify the MDP model (states, actions, reward, transitions) appropriately. However, in many real-world problems it is often hard to define a reward function such that the optimal policy with respect to this reward produces the desired behavior.

In Apprenticeship Learning (AL), instead of manually tweaking the reward to produce the desired behavior, the idea is to observe and mimic an expert. The literature on AL is quite vast and dates back to the work of Abbeel and Ng (2004), who proposed a novel framework for AL. In this setting, the reward function (while unknown to the apprentice) equals a linear combination of a set of known features. More specifically, there is a weight vector w, rewards are associated with states, each state s has a feature vector φ(s), and its reward is φ(s)·w. The expected return of a policy π is V^π = Φ(π)·w, where Φ(π) is the feature expectation under policy π. The expert demonstrates a set of trajectories that are used to estimate the feature expectations of its policy π_E, denoted by Φ_E = Φ(π_E). The goal is to find a policy ψ whose feature expectations are close to this estimate, and hence will have a similar return with respect to any weight vector w.

Abbeel and Ng (2004) suggested two algorithms to solve this problem: one that is based on a maximum margin solver, and a simpler projection algorithm. The algorithm starts with an arbitrary policy π_0 and computes its feature expectations Φ(π_0).
At step t, they define a reward function using the weight vector w_t = Φ_E − Φ̄_{t−1} and find the policy π_t that maximizes it, where Φ̄_t is a convex combination of the feature expectations of previous (deterministic) policies, Φ̄_t = Σ_{j=1}^{t} α_j Φ(π_j). They show that in order to get ||Φ̄_T − Φ_E|| ≤ ϵ, it suffices to run the algorithm for T = O((k/((1−γ)²ϵ²)) log(k/((1−γ)ϵ))) iterations.

Another type of algorithm, based on online convex optimization, was proposed by Syed and Schapire (2008). In this approach, AL is posed as a two-player zero-sum game. In each round the reward player plays a no-regret algorithm and the policy player plays the best response, i.e., it plays the policy π_t that maximizes the reward at time t. The algorithm runs for T steps and returns a mixed policy ψ that assigns probability 1/T to each policy π_t, t = 1, ..., T. Syed and Schapire (2008) proved that their scheme is faster by a factor of k and requires only T = O(log(k)/((1−γ)²ϵ²)) iterations. This improvement is closely related to the analysis of the mirror descent algorithm (MDA, Nemirovsky and Yudin (1983)). That is, by choosing the norm of the space (and projecting w.r.t. this norm), a dimension-free rate of convergence (up to a logarithmic factor) is achieved. The results in (Syed and Schapire 2008) use a specific instance of MDA where the optimization set is the simplex and distances are measured w.r.t. the ℓ1 norm. This version of MDA is known as multiplicative weights or Hedge.

In this work, we focus on the computational complexity of the problem as a function of ϵ. We show that a small modification to the algorithm of Abbeel and Ng (2004) can lead to a linear rate of convergence, i.e., T = Õ(log(1/ϵ)), where the Õ notation hides the dependence on k and γ. Methods that are based on online convex optimization (like Syed and Schapire (2008)), on the other hand, cannot achieve rates better than T = Õ(1/ϵ²).

Our result is based on the observation that (a slight modification of) the algorithm of Abbeel and Ng (2004) is, in fact, an instantiation of the Frank-Wolfe (FW) method, a projection-free method for convex optimization. To see this, we formulate the AL problem as finding the projection of the (estimated) feature expectations of the expert onto the feature expectations polytope: the convex hull of the feature expectations of all the deterministic policies in the MDP (Definition 7). To compute this polytope (and to project onto it), one has to compute the feature expectations of the exponentially many (|A|^|S|) deterministic policies in the MDP. The benefit of applying the FW method to this problem is that it avoids projecting onto this polytope (as in projected gradient methods); instead, it minimizes a linear objective function over the polytope, which is equivalent to finding the optimal policy in an MDP.

The observation that the algorithm of Abbeel and Ng (2004) is an instantiation of the Frank-Wolfe (FW) method allows us to derive the convergence result of Abbeel and Ng (2004) immediately (even with a logarithmic-factor improvement) from known analyses of the FW method. Furthermore, this equivalence leads us to propose a modification of this Frank-Wolfe AL algorithm that is based on taking "away steps". These steps remove weight from bad policies: policies that were added to the solution in earlier iterations but whose removal now yields an improvement. This modification gives the first AL algorithm with a linear rate of convergence. We implemented this algorithm and compared it with the method of Abbeel and Ng (2004). Our findings suggest that away steps indeed give better empirical performance.
Finally, in many practical scenarios, an algorithm may only have access to the environment via a simulator. In such cases, the feature expectations of the agent cannot be computed explicitly and must be estimated by rolling out trajectories using the agent's policy. To address this, we design an algorithm that uses unbiased estimates of the feature expectations (instead of the expected feature expectations), based on the stochastic FW algorithm (Hazan and Luo 2016). To the best of our knowledge, this is the first AL algorithm that addresses this issue from a theoretical point of view.

## 2 Preliminaries

In this section, we provide the relevant background on convex optimization, apprenticeship learning, and the Frank-Wolfe algorithm. In convex analysis, we are interested in solving problems of the form

minimize_{x ∈ K} h(x),  (1)

where K is a convex set and h is a convex function. Next, we briefly define important properties of convex functions and convex sets.

Definition 1 (Convex set). A set K is convex if ∀x₁, x₂ ∈ K, ∀λ ∈ [0, 1]: λx₁ + (1 − λ)x₂ ∈ K.

Definition 2 (Diameter of a set). The diameter of a set K is given by D_K = max_{x₁,x₂ ∈ K} ||x₁ − x₂||.

Definition 3 (Convex function). A function h : K → R is convex if K is a convex set and ∀x₁, x₂ ∈ K, ∀λ ∈ [0, 1]: h(λx₁ + (1 − λ)x₂) ≤ λh(x₁) + (1 − λ)h(x₂).

Definition 4 (Properties of convex functions). A differentiable convex function h : K → R over a convex set K is:
1. Strongly convex with strong convexity parameter σ > 0 if ∀x₁, x₂ ∈ K: (∇h(x₁) − ∇h(x₂))·(x₁ − x₂) ≥ σ||x₁ − x₂||².
2. Smooth with parameter β if ∀x₁, x₂ ∈ K: ||∇h(x₁) − ∇h(x₂)|| ≤ β||x₁ − x₂||.
3. Lipschitz continuous with parameter L_h if ∀x ∈ K: ||∇h(x)|| ≤ L_h.

We use the Euclidean norm throughout this paper. We will focus on a specific convex optimization problem: finding a particular Euclidean projection onto a convex set.

Definition 5 (Euclidean projection). The Euclidean projection onto a convex set K is given by Proj_K(x) = argmin_{y ∈ K} ||x − y||.

### 2.1 Inverse reinforcement learning and apprenticeship learning

For consistency with prior work, we consider the discounted infinite horizon scenario. We emphasize that all the results in this paper can be easily extended to the episodic finite horizon and the average reward criteria; we indicate the required changes when appropriate. We are given an MDP\R (an MDP without a reward), denoted

M = {S, A, P, γ, D},  (2)

where S is the set of states, A is the set of actions, P = {P^a | a ∈ A} is the set of transition matrices, γ is the discount factor, and D is the distribution of the initial state. Each state s is represented by an observable low-dimensional vector of features φ(s) ∈ [0, 1]^k, and the reward function, while unknown to the apprentice, is assumed to be equal to a linear combination of the features; i.e., r_w(s) = w·φ(s) for some w ∈ W, where W is a convex set. For example, W can be chosen to be the simplex (Syed and Schapire 2008) or the L2 ball (Abbeel and Ng 2004). We further assume the existence of an expert policy, denoted by π_E, such that we can observe its execution in M. We define the feature expectations of a policy π in M as

Φ(π) = E[ Σ_{t=0}^{∞} γ^t φ(s_t) | π, P, D ].  (3)

(For other RL criteria there exist equivalent definitions of feature expectations; see Zahavy et al. (2019) for the average reward.) With this feature representation, the value of a policy π may be written as V^π = w·Φ(π).
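To make Eq. (3) concrete, in a small tabular MDP the feature expectations of a stationary policy can be computed exactly by solving a linear system. The following is a minimal sketch under our own array conventions (P, phi, D, and pi are assumed to be given as NumPy arrays); it is an illustration, not the paper's implementation.

```python
import numpy as np

def feature_expectations(P, phi, D, pi, gamma):
    """Exact feature expectations Phi(pi) = E[sum_t gamma^t phi(s_t)] of Eq. (3).

    P   : (A, S, S) array, P[a, s, s'] = transition probability
    phi : (S, k) array, per-state features in [0, 1]^k
    D   : (S,) array, initial state distribution
    pi  : (S, A) array, pi[s, a] = probability of taking action a in state s
    """
    S = P.shape[1]
    # Transition matrix induced by pi: P_pi[s, s'] = sum_a pi[s, a] * P[a, s, s']
    P_pi = np.einsum('sa,asp->sp', pi, P)
    # Per-state feature expectations X solve the linear system X = phi + gamma * P_pi @ X
    X = np.linalg.solve(np.eye(S) - gamma * P_pi, phi)
    # Expectation over the initial state distribution D gives a vector in R^k
    return D @ X
```

When the MDP is available only through a simulator, the same quantity is instead estimated by Monte Carlo rollouts, as discussed in Sections 5 and 6.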
In addition, the feature expectations are bounded: each coordinate of Φ(π) is at most 1/(1 − γ) (replace 1/(1 − γ) with H in the finite horizon case and with 1 in the average reward case). Similarly, we define the occupancy measure of π in M as

x_{s,a}(π) = E[ Σ_{t=0}^{∞} γ^t 1{s_t = s, a_t = a} | π, P, D ].  (4)

As in Abbeel and Ng (2004) and Syed and Schapire (2008), the policy that we find is not necessarily deterministic but a mixed policy. A mixed policy ψ is a distribution over Π, the set of all deterministic policies in M. Because Π is finite (though extremely large), we can fix an ordering π₁, π₂, ... of the policies in Π. This allows us to treat ψ as a vector, where ψ(i) is the probability assigned to π_i. A mixed policy ψ is executed by randomly selecting the policy π_i ∈ Π at time 0 with probability ψ(i), and exclusively following π_i thereafter. The definitions of the value function and the feature expectations are naturally extended to mixed policies as follows: V(ψ) = E_{i∼ψ} V^{π_i} and Φ(ψ) = E_{i∼ψ} Φ(π_i). The following theorem shows that any mixed policy can be converted into a stochastic policy with the same value.

Theorem 1 (Theorem 3, Syed, Bowling, and Schapire (2008)). Let ψ be a mixed policy, and let x^j be the occupancy measure (Eq. (4)) of π_j, j ∈ [1, ..., |Π|]. Let π̂ be the stochastic policy π̂(a | s) = Σ_j ψ(j) x^j_{s,a} / Σ_j Σ_{a'} ψ(j) x^j_{s,a'}. Then V(π̂) = V(ψ), and also Φ(π̂) = Φ(ψ).

The objective of AL is to find a policy π that does at least as well as the expert with respect to any reward function of the form r(s) = w·φ(s), w ∈ W. That is, we solve

max_{ψ ∈ Ψ} min_{w ∈ W} [w·Φ(ψ) − w·Φ_E].  (5)

If we denote the value of Eq. (5) by f*, then, due to the von Neumann minimax theorem, we also have that

f* = min_{w ∈ W} max_{ψ ∈ Ψ} [w·Φ(ψ) − w·Φ_E].  (6)

We will refer to approximately solving Eq. (6) as IRL, i.e., finding w ∈ W such that

∀ψ ∈ Ψ: w·Φ_E ≥ w·Φ(ψ) − ϵ − f*;  (7)

and to approximately solving Eq. (5) as AL, i.e., finding ψ such that

∀w ∈ W: w·Φ(ψ) ≥ w·Φ_E − ϵ + f*.  (8)

Notice that due to Theorem 1, it is equivalent to solve AL over Π and over Ψ (the sets of deterministic and mixed policies).

The most famous AL algorithm for solving Eq. (8) is the Projection algorithm of Abbeel and Ng (2004) (Algorithm 1). Notice that we slightly changed the notation and the order of indices in Algorithm 1 w.r.t. Abbeel and Ng (2004); it is immediate to verify that the algorithms are equivalent. In addition, line 8 in Algorithm 1 was not part of the original algorithm. The role of this step is to replace the post-processing procedure in (Abbeel and Ng 2004) by maintaining a policy ψ_t with feature expectations Φ̄_t (in Section 3 we show explicitly that Φ(ψ_t) = Φ̄_t). Somewhat ironically, Abbeel and Ng (2004) termed their algorithm "the projection algorithm", while we will soon see that it is actually a projection-free algorithm (a CG method) with respect to the feature expectations polytope.

Algorithm 1 The projection method (Abbeel and Ng 2004)
1: Input: feature expectations of the expert Φ_E, number of iterations T
2: Initialize: choose any π₀, set ψ₀ = e_{π₀}, Φ̄₀ = Φ(ψ₀)
3: for t = 1, ..., T do
4:   Set w_t = Φ_E − Φ̄_{t−1}
5:   Compute π_t = π*_{w_t}, Φ_t = Φ(π_t)
6:   α_t = (Φ_t − Φ̄_{t−1})·(Φ_E − Φ̄_{t−1}) / (Φ_t − Φ̄_{t−1})·(Φ_t − Φ̄_{t−1})
7:   Φ̄_t = Φ̄_{t−1} + α_t(Φ_t − Φ̄_{t−1})
8:   ψ_t = ψ_{t−1} + α_t(e_{π_t} − ψ_{t−1})
9: end for
10: Return ψ_T

The algorithm begins by estimating the feature expectations Φ̄₀ of some arbitrary policy π₀. Then, for iterations t = 1, ..., T, it finds the optimal policy π_t w.r.t. the reward w_t = Φ_E − Φ̄_{t−1}. The feature expectations Φ_t of the policy π_t are computed and added to the solution Φ̄_t, such that Φ̄_t = Φ̄_{t−1} + α_t(Φ_t − Φ̄_{t−1}).
The parameter α_t is chosen via a line search, i.e., α_t = argmin_α ||Φ̄_{t−1} + α(Φ_t − Φ̄_{t−1}) − Φ_E||². For ψ to be a mixed policy, α_t must be in the range [0, 1]. In the case that Φ_E is given exactly, it is guaranteed that α_t ∈ [0, 1]. When Φ_E is estimated from samples, for ψ to be a mixed policy, α_t must be truncated to [0, 1] (see a more detailed discussion in Section 3). Abbeel and Ng (2004) proved directly that the feature expectations Φ̄_t converge to the feature expectations of the expert.

Another type of AL algorithm was proposed by Syed and Schapire (2008). The idea is to solve Eq. (8) in the following manner. In each round the reward player plays an online convex optimization algorithm on the losses l_t(w_t) = w_t·(Φ_E − Φ(π_t)), and the policy player plays the best response, i.e., the policy π_t that maximizes the return Φ(π_t)·w_t at time t. The algorithm runs for T steps and returns a mixed policy ψ that draws each policy π_t, t = 1, ..., T, with probability 1/T. Thus, we have that

f* ≤ (1/T) Σ_{t=1}^{T} max_{π ∈ Π} [w_t·Φ(π) − w_t·Φ_E] = (1/T) Σ_{t=1}^{T} [w_t·Φ(π_t) − w_t·Φ_E]  (9)
≤ min_{w ∈ W} (1/T) Σ_{t=1}^{T} w·[Φ(π_t) − Φ_E] + O(1/√T)  (10)
= min_{w ∈ W} w·(Φ(ψ) − Φ_E) + O(1/√T),  (11)

where the first inequality holds since for every w_t, max_{π ∈ Π} [w_t·Φ(π) − w_t·Φ_E] ≥ min_{w ∈ W} max_{π ∈ Π} [w·Φ(π) − w·Φ_E] = f*; Eq. (9) follows from the fact that the policy player plays the best response, that is, π_t is the optimal policy w.r.t. the reward w_t; and Eq. (10) follows from the fact that the reward player plays a no-regret algorithm, e.g., online MDA. Thus, we obtain from Eq. (11) that ∀w ∈ W: w·Φ(ψ) ≥ w·Φ_E + f* − O(1/√T) (again, the O notation hides the dependence on k and γ). Since this technique runs a no-regret algorithm, it cannot obtain a convergence rate faster than T = O(1/ϵ²).

Finally, IRL can also be formulated as a convex optimization problem, but the resulting objective is not differentiable (Ratliff, Bagnell, and Zinkevich 2006). IRL is also not strongly convex, as it does not have a unique solution, as was observed in (Ng and Russell 2000). For these reasons, convex optimization methods for IRL did not achieve a linear rate of convergence.

### 2.2 The conditional gradient (CG) method

A common algorithm to minimize a convex function over a convex set K is projected gradient descent. This algorithm takes a step in the reverse gradient direction, z_{t+1} = x_t − α_t∇h(x_t), and then projects z_{t+1} back onto K to obtain x_{t+1}. Computing this projection may be expensive for some convex sets. The CG algorithm of Frank and Wolfe (1956) (Algorithm 2) avoids this projection. It finds a point y_t ∈ K that has the largest correlation with the negative gradient, and updates x_{t+1} = (1 − α_t)x_t + α_t y_t, which by convexity is guaranteed to be in K. To find y_t, the algorithm has to minimize a linear objective function over the feasible set K. We assume that this optimization is performed by an oracle (which we call a linear oracle). If K is a polyhedron (given by its facets), then an oracle call is a linear programming problem. The CG method is useful for problems where implementing such a linear oracle is easier than computing a projection onto K.

Algorithm 2 The CG method (Frank and Wolfe 1956)
1: Input: a convex set K, a convex function h, a learning rate schedule α_t
2: Initialization: let x₀ ∈ K
3: for t = 1, ..., T do
4:   y_t = argmin_{y ∈ K} ∇h(x_{t−1})·y
5:   x_t = (1 − α_t)x_{t−1} + α_t y_t
6: end for

To give some context, in AL the linear oracle will be an algorithm that finds the optimal policy in an MDP with known reward and dynamics, e.g., Policy Iteration (PI). The polyhedral set will be the set of feature expectations of all the deterministic policies in this MDP, which is of size |A|^|S|.
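Anticipating Section 3, where this correspondence is made formal, the following is a minimal sketch of Algorithm 1 read as the CG method with a planning oracle. The function `best_response(w)` is an assumption of this sketch: it stands for the linear oracle (e.g., PI or value iteration) and returns the feature expectations of an optimal deterministic policy for the reward r(s) = w·φ(s).

```python
import numpy as np

def cg_apprenticeship(phi_expert, best_response, T):
    """Sketch of the projection algorithm (Algorithm 1), read as the CG method on the
    feature expectations polytope.

    phi_expert    : (estimated) expert feature expectations, a vector in R^k
    best_response : the linear oracle; maps a reward vector w to the feature
                    expectations of an optimal deterministic policy for r(s) = w . phi(s)
    Returns the final iterate bar_phi and the mixture weights of the policies found
    along the way (line 8 of Algorithm 1).
    """
    bar_phi = best_response(np.zeros_like(phi_expert))   # feature expectations of an arbitrary pi_0
    weights = [1.0]                                      # psi_0 = e_{pi_0}
    for _ in range(T):
        w = phi_expert - bar_phi            # line 4: w_t = Phi_E - bar_Phi_{t-1}
        phi_t = best_response(w)            # line 5: RL step / linear oracle call
        d = phi_t - bar_phi
        denom = d @ d                       # line 6: exact line search, clipped to [0, 1]
        alpha = 0.0 if denom == 0 else float(np.clip((d @ w) / denom, 0.0, 1.0))
        bar_phi = bar_phi + alpha * d       # line 7
        weights = [(1 - alpha) * v for v in weights] + [alpha]   # line 8
    return bar_phi, weights
```

The clipping of alpha to [0, 1] reflects the truncation discussed above for the case in which Φ_E is estimated from samples; when Φ_E ∈ K it is automatically satisfied.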
The computational complexity of computing this set explicitly (and hence projecting onto it) is therefore exponential in the size of the state space. On the other hand, it is known that PI converges to the optimal policy in a finite number of iterations (Puterman 1984, Theorem 8.6.6). A trivial upper bound on the number of iterations is the total number of deterministic policies, which is |A|^|S|. In the discounted and the finite horizon cases, it was shown that PI runs in strongly polynomial time (Ye 2011). Therefore, the CG algorithm has a computational advantage over projection-based algorithms in the discounted and finite horizon settings. For the average-reward criterion, however, there exist MDPs for which Howard's PI requires exponential time (Hansen, Miltersen, and Zwick 2013).

The original paper of Frank and Wolfe contains a proof of an O(1/t) rate of convergence (Theorem 2, extracted from Jaggi (2013)). Canon and Cullum (1968) prove that for functions that are not strongly convex, this rate is tight.

Theorem 2. Let h be a convex and β-smooth function. Let D_K be the diameter of K, and let α_t = 2/(t + 1) for t ≥ 1. Then for any t ≥ 2, Algorithm 2 computes x_t such that h(x_t) − h(x*) ≤ 2βD_K²/(t + 1), where x* is a minimizer of h over K.

Remark: The learning rate α_t in Theorem 2 can also be chosen via a line search procedure; the same theoretical guarantees hold in this setting.

Fast rates. In this paper, we focus on minimizing a strongly convex function. In this case, if the optimal solution is in the interior of the feasible set, then CG converges at a linear rate (Beck and Teboulle 2004; Guélat and Marcotte 1986). Another setting in which a faster rate of convergence can be derived is when the feasible set is strongly convex (a strongly convex set is a set in which each convex combination of two points in the set is in the interior of the set). In this case, the convergence rate is O(1/t²) (Garber and Hazan 2015). Alternatively, if the norm of the gradient of the objective function is bounded away from zero everywhere in K, then the rate of convergence is linear (Levitin and Polyak 1966), even if the objective is only convex and not strongly convex. Unfortunately, for reasons that we will see later on, none of these cases is relevant for AL. A different approach to speed up the convergence is to modify the algorithm, as we describe next.

### 2.3 Frank-Wolfe with away steps (ASCG)

Away steps conditional gradient (ASCG) is a variation of the CG method, proposed by Wolfe (1970) for polyhedral sets. By Carathéodory's theorem, the iterate x_t can always be represented as a sparse convex combination of at most k + 1 vertices of K, i.e., x_t = Σ_{i=1}^{k+1} α_{y_i} y_i. ASCG uses this fact and removes weight from bad elements y_i (not needed to represent the final solution) by taking away steps. These steps decrease the weight of the bad elements faster than they would have decayed via the standard CG iterates.

Explicitly, ASCG (Algorithm 3) maintains the list of vertices S(t) = {y_{i₁}, ..., y_{i_{ℓ_t}}}, where t is the iteration index, ℓ_t = |S(t)|, and i_j ≤ t for every j = 1, ..., ℓ_t, together with a corresponding list of coefficients {α_{y_{i_j}}}_{j=1}^{ℓ_t} such that x_t = Σ_{j=1}^{ℓ_t} α_{y_{i_j}} y_{i_j}. At each iteration, the algorithm computes a regular CG step (d_FW); in addition, it checks the possibility of decreasing the coefficient α_{z_t} of some z_t ∈ S(t) in the representation of x_t as a convex combination of S(t) by taking a so-called away step in the direction d_AS = x_t − z_t.
The z_t that has the largest correlation with the gradient is chosen, and the learning rate is set via a line search procedure; in addition, it is guaranteed that x_{t+1} remains in K. Once the step is taken, the coefficients of the remaining members of S are updated such that their combination remains convex (all coefficients are positive and sum to 1). As a result, members of S that are not part of the solution are removed as their coefficients decrease to 0.

In contrast to CG, ASCG maintains the coefficients of x_t as a convex combination of the vertices in S(t) explicitly. This is required to guarantee that the learning rate of the away step is chosen such that x_{t+1} remains in K. In general, the size of the list S(t) at time t is bounded by t; however, we know that x_{t+1} can be written as a convex combination of at most k + 1 points in K. Beck and Shtern (2017) propose an improved "update representation" procedure, based on the Carathéodory theorem, that guarantees that S(t) is of size at most k + 1 for all t.

Algorithm 3 Frank-Wolfe with away steps (ASCG)
1: Input: a convex set K, and a convex function h
2: Initialization: let x₁ ∈ K, S(1) = {x₁}, α_{x₁} = 1
3: for t = 1, ..., T do
4:   y_t = argmin_{y ∈ K} ∇h(x_t)·y, d_FW = y_t − x_t
5:   z_t = argmax_{z ∈ S(t)} ∇h(x_t)·z, d_AS = x_t − z_t
6:   if ∇h(x_t)·d_FW ≤ ∇h(x_t)·d_AS then
7:     Frank-Wolfe step: d = d_FW, γ_max = 1
8:   else
9:     Away step: d = d_AS, γ_max = α_{z_t}/(1 − α_{z_t})
10:  end if
11:  Line search: γ_t = argmin_{γ ∈ [0, γ_max]} h(x_t + γd)
12:  Update: x_{t+1} = x_t + γ_t d
13:  Update representation:
14:  if Frank-Wolfe step then
15:    if γ_t = 1 then
16:      S(t+1) = {y_t}, α_{y_t} = 1
17:    else
18:      α_{y_t} = (1 − γ_t)α_{y_t} + γ_t
19:      ∀y ∈ S(t) \ {y_t}: α_y = (1 − γ_t)α_y
20:      S(t+1) = S(t) ∪ {y_t}
21:    end if
22:  else if Away step then
23:    ∀y ∈ S(t) \ {z_t}: α_y = (1 + γ_t)α_y
24:    if γ_t = γ_max then
25:      Drop step: S(t+1) = S(t) \ {z_t}
26:    else
27:      α_{z_t} = (1 + γ_t)α_{z_t} − γ_t
28:      S(t+1) = S(t)
29:    end if
30:  end if
31: end for

Guélat and Marcotte (1986) were the first to suggest that ASCG attains a linear rate of convergence when the set is a polytope. Garber and Hazan (2013) provided the first formal proof that a variant of CG (that is similar to ASCG) converges at a linear rate; Jaggi (2013) proved this for ASCG. Theorem 3 below, due to Lacoste-Julien and Jaggi (2014), specifies the convergence rate of ASCG in terms of a constant C(K), called the pyramidal width of K, that depends on the geometry of K. Here we will use a characterization of C(K) (which we find more intuitive) called the facial distance of K (Pena and Rodriguez 2018).

Definition 6 (The facial distance; Pena and Rodriguez (2018), Theorem 1). Let A be a set of points in R^k and let K = conv(A). The facial distance of K is C(K) = min_{F ∈ faces(K), ∅ ⊊ F ⊊ K} min_{u ∈ F, v ∈ conv(A \ F)} ||u − v||₂. By conv(B) we denote the convex hull of the points in B, and by faces(K) we denote the set of faces (convex hulls of sets of pairwise adjacent vertices) of the polytope K.

Theorem 3 (Linear convergence of ASCG; Lacoste-Julien and Jaggi (2014)). Suppose that h is a β-smooth, σ-strongly convex function over a convex set K with diameter D_K. Then the error of ASCG decreases geometrically: h(x_t) − h(x*) ≤ (h(x₁) − h(x*)) exp(−ρt), where ρ = σC(K)²/(8βD_K²).

We remark that Garber and Hazan (2016) also give a variant of Frank-Wolfe that converges linearly on polyhedral sets when the objective function is smooth and strongly convex (as it is in our case).
In their result, the convergence rate is dominated by a constant, different from the facial distance, that equals the minimum distance between a vertex v and a hyperplane supporting a facet that does not contain v. Nonetheless, we conjecture that this constant is strongly related to the facial distance. In this work, we focus on ASCG for two reasons: it incorporates line search, which is important in practice, and it is simpler to implement.

## 3 Convex formulation of AL

In this section, we further assume that W is the L2 ball with a unit radius (Abbeel and Ng 2004). Since scaling the reward by a constant does not affect the resulting policy, this assumption is without loss of generality. Thus, we can rewrite Eq. (5) as follows:

max_{ψ ∈ Ψ} min_{w ∈ W} [w·Φ(ψ) − w·Φ_E] = max_{ψ ∈ Ψ} −||Φ(ψ) − Φ_E||  (12)
= −min_{ψ ∈ Ψ} ||Φ(ψ) − Φ_E||,  (13)

where in Eq. (12) we use the fact that the unit vector in the direction of −(Φ(ψ) − Φ_E) is the minimizer when W is the unit L2 ball. Next, we define the feature expectations polytope K as the convex hull of the feature expectations of all the deterministic policies in M:

Definition 7 (The feature expectations polytope). K = { Σ_{i=1}^{|Π|} a_i Φ(π_i) : a_i ≥ 0, Σ_{i=1}^{|Π|} a_i = 1, π_i ∈ Π }.

It is straightforward to verify that the bounded-features assumption (φ(s) ∈ [0, 1]^k for all s) implies that the diameter of the polytope (Definition 2) is D_K ≤ √k/(1 − γ). Definition 7 also implies the following fact about mixed policies.

Corollary 4. For every ψ ∈ Ψ, we have that Φ(ψ) ∈ K.

Corollary 4 implies that solving Eq. (13) is equivalent to finding the mixed policy ψ whose feature expectations are Proj_K(Φ_E), i.e., the Euclidean projection (Definition 5) of the feature expectations of the expert onto K (if we know Φ_E exactly then it is in K, but typically we only have an estimate). The challenge is that K has |A|^|S| vertices (the feature expectations of the deterministic policies); thus, computing the projection explicitly and then finding ψ whose feature expectations are close to this projection is computationally prohibitive. This makes the CG method appealing for solving this projection problem, in particular since we are able to compute a mixed policy ψ whose feature expectations are equal to those of the projection (via line 8, which we added to Algorithm 1).

We are now ready to define the CG method for AL explicitly and to show that it is indeed equivalent to the projection algorithm of Abbeel and Ng (2004). Consider the square of Eq. (13) as the objective function for the CG method, where we take the feature expectations as the argument rather than the policy ψ. I.e., we define a function h over x ∈ K as

h(x) = ½ ||x − Φ_E||².  (14)

Clearly, ∇h(x) = x − Φ_E, and therefore line 4 in Algorithm 2 is equivalent to finding the feature expectations of the optimal policy in an MDP with reward given by w = −∇h(x_{t−1}) = Φ_E − x_{t−1}. It follows that lines 4-5 in Algorithm 1 are equivalent to line 4 in Algorithm 2, and line 5 in Algorithm 2 is equivalent to line 7 in Algorithm 1, if we substitute Φ̄_t = x_t and Φ_t = y_t. As we already mentioned in Section 2.1, line 6 of Algorithm 1 is equivalent to setting α_t = argmin_α ||Φ̄_{t−1} + α(Φ_t − Φ̄_{t−1}) − Φ_E||² (line search). For the CG method to maintain Φ̄ as a convex combination of feature expectations, α_t must be restricted to [0, 1]. This holds automatically if Φ_E ∈ K. When Φ_E ∉ K (e.g., when it is estimated from samples), we should restrict the line search and set α_t = argmin_{α ∈ [0,1]} ||Φ̄_{t−1} + α(Φ_t − Φ̄_{t−1}) − Φ_E||². We also note that by Theorem 2, we can instead set α_t = 2/(t + 1) and get the same convergence rate; we focus on line search since it is known to work better empirically. Notice that h (as defined in Eq. (14)) is a 1-smooth, 1-strongly convex function; in addition, it has the same unique minimizer over K as the original objective (Eq. (13)), which is not smooth.
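For completeness, the closed-form step size in line 6 of Algorithm 1 is exactly the unconstrained minimizer of this quadratic line search (the clipping to [0, 1] is applied afterwards); a short derivation:

```latex
% Minimize the quadratic line-search objective over alpha:
g(\alpha) = \big\| \bar\Phi_{t-1} + \alpha(\Phi_t - \bar\Phi_{t-1}) - \Phi_E \big\|^2 , \qquad
g'(\alpha) = 2\big(\bar\Phi_{t-1} + \alpha(\Phi_t - \bar\Phi_{t-1}) - \Phi_E\big)\cdot(\Phi_t - \bar\Phi_{t-1}) .
% Setting g'(\alpha) = 0 recovers line 6 of Algorithm 1:
\alpha_t = \frac{(\Phi_t - \bar\Phi_{t-1})\cdot(\Phi_E - \bar\Phi_{t-1})}
                {(\Phi_t - \bar\Phi_{t-1})\cdot(\Phi_t - \bar\Phi_{t-1})} .
```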
For smooth functions, CG converges at a rate of O(D_K²/t) = O(k/(t(1 − γ)²)) (Theorem 2). Thus, after O(k/((1 − γ)²ϵ²)) iterations, the CG method finds an ϵ-optimal solution to Eq. (13). This gives a logarithmic improvement over the result of Abbeel and Ng (2004).

## 4 Linear rate of convergence for AL

In the preliminaries section, we described the conditions under which the CG method achieves a linear rate of convergence. Unfortunately, as we now explain, these conditions do not hold for AL, despite the fact that h (Eq. (14)) is a strongly convex function. First of all, since K is a polytope, it is not a strongly convex set. Secondly, Φ_E cannot be guaranteed to be an interior point of K. If π_E is an optimal policy w.r.t. some reward, and Φ_E is given explicitly (and not via sampled trajectories), then it is located on the boundary of K and is not an interior point. It is perhaps possible to compute a direction into the interior of the set (e.g., by mixing the feature expectations of the expert policy with those of a random policy) and to modify Φ_E by a small ϵ step in this direction so that it becomes an interior point. The problem is that when Φ_E is approximated from samples (which is the case of interest), it is not guaranteed to be located inside K. Since we do not know at what distance and in what direction from K it lies, we cannot guarantee that an ϵ step will take us to the interior. Since CG cannot attain a linear rate of convergence for AL, we now turn to analyze ASCG for AL.

### 4.1 ASCG for AL

Recall that in each iteration, the ASCG algorithm chooses between two alternative steps: an FW step and an away step. The FW step finds the feature expectations of the optimal policy in an MDP whose reward is the negative gradient. This is a standard RL (planning) problem and can be solved, for example, with policy iteration. We also know that there exists at least one optimal deterministic policy for it, and that PI returns a solution that is a deterministic policy. Thus, the list of elements that ASCG maintains (S) is composed of feature expectations of deterministic policies π₁, π₂, .... Since the associated coefficients {α_{π₁}, α_{π₂}, ...} form a convex combination, the mixed policy ψ(π_i) = α_{π_i} is guaranteed to have the same feature expectations as the iterate x_t. We can also compute a stochastic policy with the same feature expectations as ψ using Theorem 1.

The away step in AL checks each one of the deterministic policies in the list and tries to reduce its coefficient. If a policy that is not part of the final solution was added to the solution during the run of the algorithm, then the away step will reduce its coefficient faster than it would have decreased via standard FW steps.

The AL problem satisfies all the requirements of Theorem 3; thus, ASCG attains a linear rate of convergence, and we have that h(x_t) − h(x*) ≤ (h(x₁) − h(x*)) exp(−ρt). Since h is 1-strongly convex and 1-smooth, and D_K ≤ √k/(1 − γ), we have that ρ = σC(K)²/(8D_K²β) ≥ (1 − γ)²C(K)²/(8k). The facial distance C(K) is defined in Definition 6 and depends on the dynamics of the MDP and on the features. Intuitively, ASCG converges faster than CG since it chooses away steps only when they lead to a steeper descent. In the experiments section, we observe that ASCG indeed enjoys faster convergence than CG in practice.
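To make the above concrete, the following is a minimal sketch of ASCG specialized to the AL objective h(x) = ½||x − Φ_E||², mirroring Algorithm 3. As in the earlier sketch, `best_response(w)` is an assumed planner returning the feature expectations of an optimal deterministic policy for reward r(s) = w·φ(s); the bookkeeping is simplified (e.g., duplicate vertices are not merged).

```python
import numpy as np

def ascg_apprenticeship(phi_expert, best_response, T, tol=1e-12):
    """Frank-Wolfe with away steps (cf. Algorithm 3) for h(x) = 0.5 * ||x - phi_expert||^2."""
    x = best_response(np.zeros_like(phi_expert))   # feature expectations of an arbitrary initial policy
    active, alpha = [x.copy()], [1.0]              # active vertices S(t) and their convex weights
    for _ in range(T):
        grad = x - phi_expert                      # grad h(x) = x - Phi_E
        y = best_response(-grad)                   # FW vertex: best response to the reward -grad
        d_fw = y - x
        j = int(np.argmax([grad @ v for v in active]))  # away vertex: largest correlation with grad
        d_as = x - active[j]
        if -(grad @ d_fw) >= -(grad @ d_as):       # choose the steeper descent direction
            d, gamma_max, fw = d_fw, 1.0, True
        else:
            d, gamma_max, fw = d_as, alpha[j] / max(1.0 - alpha[j], tol), False
        denom = d @ d                              # exact line search on [0, gamma_max]
        gamma = 0.0 if denom < tol else float(np.clip(-(grad @ d) / denom, 0.0, gamma_max))
        x = x + gamma * d
        if fw:                                     # FW step: shift weight gamma onto y
            alpha = [(1.0 - gamma) * a for a in alpha] + [gamma]
            active.append(y.copy())
        else:                                      # away step: move weight gamma away from active[j]
            alpha = [(1.0 + gamma) * a for a in alpha]
            alpha[j] -= gamma
        keep = [i for i, a in enumerate(alpha) if a > tol]  # drop vertices whose weight reached zero
        active = [active[i] for i in keep]
        alpha = [alpha[i] for i in keep]
    # alpha defines a mixed policy over the deterministic policies behind the active vertices
    return x, active, alpha
```

The FW and away coefficient updates match the representation update of Algorithm 3: after an FW step all weights are scaled by (1 − γ_t) and γ_t is added to the new vertex; after an away step all weights are scaled by (1 + γ_t) and γ_t is subtracted from the away vertex, which is dropped when its weight reaches zero.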
## 5 Real-world AL

### 5.1 Estimating Φ_E from samples

In most practical cases, it is unrealistic to assume that the feature expectations of the expert are given explicitly. In such cases, AL algorithms estimate the feature expectations by querying the expert for trajectories and then run AL on the estimated feature expectations. Lemma 5 bounds the number of samples needed from the expert to get a good approximation of its feature expectations.

Lemma 5. Given m samples {τ_i}_{i=1}^m whose expectation is Φ_E, define the estimator Φ̂(π_E) = (1/m) Σ_{i=1}^m τ_i. Assume that we performed AL w.r.t. Φ̂(π_E) and found a solution π such that ||Φ(π) − Φ̂(π_E)|| ≤ ϵ. Then, for any ϵ_m, δ, it is enough to have m = 2k ln(2k/δ)/ϵ_m² samples in order to have that ||Φ(π) − Φ_E|| ≤ ϵ + ϵ_m with probability 1 − δ.

The following proof is based on Abbeel and Ng (2004).

Proof. By Hoeffding's inequality we get that ∀i ∈ [1, ..., k]: Pr(|Φ̂_E(i) − Φ_E(i)| ≥ ϵ) ≤ 2 exp(−mϵ²/2). Applying the union bound over the features, we get that Pr(∃i ∈ [1, ..., k] s.t. |Φ̂_E(i) − Φ_E(i)| ≥ ϵ) ≤ 2k exp(−mϵ²/2). This is equivalent to Pr(∀i ∈ [1, ..., k]: |Φ̂_E(i) − Φ_E(i)| ≤ ϵ) ≥ 1 − 2k exp(−mϵ²/2), and to Pr(||Φ̂_E − Φ_E||_∞ ≤ ϵ) ≥ 1 − 2k exp(−mϵ²/2). Thus, after collecting m = 2 ln(2k/δ)/(ϵ_m/√k)² samples of Φ_E, with probability 1 − δ we have that ||Φ̂_E − Φ_E||_∞ ≤ ϵ_m/√k. Therefore, with probability 1 − δ we have that

||Φ̂_E − Φ_E||₂ ≤ √k ||Φ̂_E − Φ_E||_∞ ≤ ϵ_m.  (15)

Now, by the assumption, we have that ||Φ(π) − Φ̂(π_E)||₂ ≤ ϵ and that ||Φ_E − Φ̂(π_E)||₂ ≤ ϵ_m. Thus, with probability 1 − δ,

||Φ(π) − Φ_E||₂ = ||Φ(π) − Φ̂(π_E) + Φ̂(π_E) − Φ_E||₂ ≤ ||Φ(π) − Φ̂(π_E)||₂ + ||Φ_E − Φ̂(π_E)||₂  (16)
≤ ϵ + ϵ_m,  (17)

where Eq. (16) follows from the triangle inequality, and Eq. (17) follows from Eq. (15).

In the proof, we assumed that the samples in Lemma 5 are bounded and unbiased. For the finite horizon, each sample can be obtained by observing the expert executing a trajectory. In the discounted case, we can follow Syed and Schapire (2008) and limit the trajectories to be of length H ≥ (1/(1 − γ)) log(1/(ϵ_H(1 − γ))), and show that this comes at an additional cost of ϵ_H. Alternatively, we can follow Kakade and Langford (2002), execute the expert trajectory online, and terminate it at each time step with probability 1 − γ; this makes the estimate unbiased. Similarly, Zahavy et al. (2019) propose an unbiased sampling mechanism for the average reward criterion based on the Coupling From The Past (CFTP) protocol. In both of these cases the estimates are only bounded with high probability, but it is possible to follow Zahavy et al. (2019) to obtain a concentration result.

### 5.2 Stochastic CG

In the previous subsection, we analyzed a scenario in which the feature expectations of the expert are estimated by querying the expert for trajectories. However, in many real-world applications, it may also not be possible to compute the feature expectations of the agent exactly. This may happen, for example, when the algorithm can only access the environment by querying a simulator. In such cases, the feature expectations must be approximated by executing trajectories of the agent's policy in the environment. To address this issue, we design an AL algorithm that is based on the Stochastic FW (SFW) algorithm (Hazan and Luo 2016). The core idea is to replace the expected gradient ∇h(Φ(π_t)) = Φ(π_t) − Φ_E with an unbiased estimate. Explicitly, given m_t samples {τ_i}_{i=1}^{m_t} whose expectation is Φ(π_t), define the estimator Φ̂(π_t) = (1/m_t) Σ_{i=1}^{m_t} τ_i. We emphasize that these samples are collected at iteration t using the agent's policy π_t and should not be confused with the samples from the expert policy (in the previous subsection), which are collected once and given as an input to the algorithm. Once Φ̂(π_t) is computed, it is used as an unbiased estimator of the gradient, ∇̂h(Φ(π_t)) = Φ̂(π_t) − Φ_E, and can be plugged into Algorithm 1.
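A minimal sketch of the resulting estimator follows. The simulator interface (env.reset, env.step, policy, phi) is an assumption of this sketch and not part of the paper:

```python
import numpy as np

def estimate_feature_expectations(env, policy, phi, gamma, m, horizon):
    """Monte Carlo estimate hat_Phi(pi): the average over m rollouts of the
    discounted feature sums tau_i = sum_t gamma^t phi(s_t).

    Assumed simulator interface (ours): env.reset() -> initial state,
    env.step(a) -> next state, policy(s) -> action, phi(s) -> feature vector.
    Truncating at `horizon` introduces an O(gamma^horizon) bias; the geometric
    termination trick discussed in Section 5.1 yields an unbiased estimate instead.
    """
    total = 0.0
    for _ in range(m):
        s, discount, tau = env.reset(), 1.0, 0.0
        for _ in range(horizon):
            tau = tau + discount * phi(s)   # accumulate gamma^t * phi(s_t)
            s = env.step(policy(s))
            discount *= gamma
        total = total + tau
    return total / m

# At iteration t, hat_phi_t = estimate_feature_expectations(env, pi_t, phi, gamma, m_t, H)
# replaces the exact Phi_t in Algorithm 1, and hat_phi_t - phi_expert estimates the gradient.
```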
The following theorem analyzes the sample complexity of this algorithm.

Theorem 6. Let h be a G-Lipschitz, convex, and β-smooth function. Let D_K be the diameter of K, let α_t = 2/(t + 1), and let m_t = (G(t + 1)/(βD_K²))² for t ≥ 1. Then for any t ≥ 2, Algorithm 2 (with the stochastic gradient estimates) computes x_t such that h(x_t) − h(x*) ≤ 2βD_K²/(t + 1), where x* is a minimizer of h over K.

Notice that SFW has the same iteration complexity as FW, but requires sampling m_t = (G(t + 1)/(βD_K²))² trajectories at iteration t. Finally, we note that Hazan and Luo (2016) also developed more efficient, variance-reduced stochastic FW algorithms. Still, these algorithms involve computations of the (expected) gradient (every few iterations) and therefore do not follow the motivation of this subsection. If it is possible to gain access to the expected gradient, then these more efficient versions may become useful.

## 6 Experiments

In this section, we compare the CG and ASCG methods for AL in two domains: an autonomous driving simulation (Abbeel and Ng 2004; Syed and Schapire 2008) and a grid world domain. The results in each experiment are averaged over 10 runs of each algorithm (random seeds). The mean is presented as a solid line; around it, the colored area shows the mean plus/minus the standard deviation.

Setting. In each domain, there are fixed dynamics and an initial state distribution that are given as a simulator (and not explicitly in matrix form). Computation of feature expectations is done through Monte Carlo simulations: given a policy π (which could be the expert's or the agent's), we execute N_Estimation trajectories of length H from a state drawn from the initial state distribution. The average of the cumulative discounted sum of the features along these trajectories is our estimate of the feature expectations of π. Given a reward function, the optimal policy is computed by running Q-learning (Watkins and Dayan 1992) for N_RL steps. Our implementation of Q-learning is standard and includes ε-greedy exploration with ε = 0.05 and a learning rate of α_t = 0.2/t^0.75 (following Even-Dar and Mansour (2003)). It is important to note that neither of these procedures necessarily returns accurate solutions, and that the solutions become more accurate as we increase the computational resources, specifically N_Estimation, H, and N_RL. Empirically, we found that as we increased these resources (making our implementation closer to the theoretical framework), the differences between the two methods became more significant.

Gridworld domain. In this domain, we place an agent in a 5×5 grid world. The agent can move up, down, left, and right. At each point in the grid there is a non-positive reward, and there is a single point with a positive reward. Once the agent collects the positive reward, it starts again from the initial state. We used N_Estimation = 300, H = 50, N_RL = 300, and ran both CG and ASCG for N_iter = 100 steps. In Figure 1, we can see the error of each algorithm as a function of the iteration number. The error measures the distance (on a logarithmic scale) between the feature expectations of the expert and those of the agent at time t, i.e., ||Φ_E − Φ̄_t||. We can see that ASCG has a clear advantage over CG.

Figure 1: Comparison of CG with ASCG on a 5×5 grid world domain.
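For reference, a minimal sketch of the tabular Q-learning oracle described in the Setting paragraph above (ε = 0.05, α_t = 0.2/t^0.75). The simulator interface is our own assumption, and the learning rate here decays per global step for simplicity; the schedule in the experiments may instead decay per state-action visit.

```python
import numpy as np

def q_learning_oracle(env, w, phi, n_states, n_actions, gamma, n_rl_steps, eps=0.05):
    """Tabular Q-learning for the linear reward r(s) = w . phi(s), used as the RL step
    of CG/ASCG in the experiments. Assumes env.reset() -> s and env.step(a) -> s',
    with states given as integer indices. Returns the greedy policy w.r.t. the learned Q."""
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for t in range(1, n_rl_steps + 1):
        # epsilon-greedy exploration with eps = 0.05
        if np.random.rand() < eps:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next = env.step(a)
        r = w @ phi(s)                      # reward of the current state under weights w
        lr = 0.2 / t ** 0.75                # learning rate alpha_t = 0.2 / t^0.75
        Q[s, a] += lr * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return lambda state: int(np.argmax(Q[state]))
```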
Car simulator. The driving task simulates a three-lane highway with two visible cars: cars A and B. The agent, car A, can drive both on the highway and off-road. Car B drives in a fixed lane, at a slower speed than car A. Upon leaving the frame, car B is replaced by a new car appearing in a random lane at the top of the screen. The reward is a linear combination of driving features: speed, collisions, and off-road driving. The goal of the agent is to find a driving policy that balances these features based on the expert's preferences. We used N_Estimation = 1000, H = 40, N_RL = 1000, and ran both algorithms for N_iter = 50 steps. Similar to the grid world domain, we can see that ASCG has a clear advantage over CG.

Figure 2: Comparison of CG with ASCG on the car driving simulator.

## 7 Discussion

We presented a convex optimization formulation of AL and showed that the CG method is equivalent to the projection algorithm of Abbeel and Ng (2004). This revelation allowed us to leverage known results on the CG method for AL. We showed that a version of the CG method that takes away steps gives improved empirical performance and has a provable linear rate of convergence. We believe that our findings will help to improve AL algorithms further. One direction is to find a relaxation of the problem where, instead of optimizing over the polytope, we optimize over a large strongly convex set that is contained in the polytope. Such a set can be obtained, for example, by mixing each deterministic policy with a random policy. If the distance between the sets is guaranteed to be small, then it should be possible to obtain faster rates. Another direction is to bound the facial distance of the polytope using parameters of the MDP.

## Acknowledgments

We would like to thank Elad Hazan and Dan Garber for their comments on this work.

## References

Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, 1. ACM.

Beck, A., and Shtern, S. 2017. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming 164(1-2):1-27.

Beck, A., and Teboulle, M. 2004. A conditional gradient method with linear rate of convergence for solving convex linear systems. Mathematical Methods of Operations Research 59(2):235-247.

Canon, M. D., and Cullum, C. D. 1968. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM Journal on Control 6(4):509-516.

Even-Dar, E., and Mansour, Y. 2003. Learning rates for Q-learning. Journal of Machine Learning Research 5(Dec):1-25.

Frank, M., and Wolfe, P. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3(1-2):95-110.

Garber, D., and Hazan, E. 2013. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666.

Garber, D., and Hazan, E. 2015. Faster rates for the Frank-Wolfe method over strongly-convex sets. In 32nd International Conference on Machine Learning, ICML 2015.

Garber, D., and Hazan, E. 2016. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization 26(3):1493-1528.

Guélat, J., and Marcotte, P. 1986. Some comments on Wolfe's "away step". Mathematical Programming 35(1):110-119.
Hansen, T. D.; Miltersen, P. B.; and Zwick, U. 2013. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM) 60(1):1.

Hazan, E., and Luo, H. 2016. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, 1263-1271.

Jaggi, M. 2013. Revisiting Frank-Wolfe: Projection-free sparse convex optimization.

Kakade, S., and Langford, J. 2002. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 267-274.

Lacoste-Julien, S., and Jaggi, M. 2014. An affine invariant linear convergence analysis for Frank-Wolfe algorithms. NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends.

Levitin, E. S., and Polyak, B. T. 1966. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics 6(5):1-50.

Nemirovsky, A. S., and Yudin, D. B. 1983. Problem Complexity and Method Efficiency in Optimization. Wiley, New York.

Ng, A. Y., and Russell, S. J. 2000. Algorithms for inverse reinforcement learning. International Conference on Machine Learning (ICML).

Pena, J., and Rodriguez, D. 2018. Polytope conditioning and linear convergence of the Frank-Wolfe algorithm. Mathematics of Operations Research 44(1):1-18.

Puterman, M. L. 1984. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Ratliff, N. D.; Bagnell, J. A.; and Zinkevich, M. A. 2006. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, 729-736. ACM.

Syed, U., and Schapire, R. E. 2008. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems, 1449-1456.

Syed, U.; Bowling, M.; and Schapire, R. E. 2008. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning, 1032-1039. ACM.

Watkins, C. J., and Dayan, P. 1992. Q-learning. Machine Learning 8(3-4):279-292.

Wolfe, P. 1970. Convergence theory in nonlinear programming. Integer and Nonlinear Programming 1-36.

Ye, Y. 2011. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research 36(4):593-603.

Zahavy, T.; Cohen, A.; Kaplan, H.; and Mansour, Y. 2019. Average reward reinforcement learning with unknown mixing times. arXiv preprint arXiv:1905.09704.