# Provably Efficient Learning of Transferable Rewards

Alberto Maria Metelli* 1, Giorgia Ramponi* 1, Alessandro Concetti 1, Marcello Restelli 1

The reward function is widely accepted as a succinct, robust, and transferable representation of a task. Typical approaches, at the basis of Inverse Reinforcement Learning (IRL), leverage expert demonstrations to recover a reward function. In this paper, we study the theoretical properties of the class of reward functions that are compatible with the expert's behavior. We analyze how the limited knowledge of the expert's policy and of the environment affects the reward reconstruction phase. Then, we examine how the error propagates to the learned policy's performance when transferring the reward function to a different environment. We employ these findings to devise a provably efficient active sampling approach, aware of the need for transferring the reward function, that can be paired with a large variety of IRL algorithms. Finally, we provide numerical simulations on benchmark environments.

1. Introduction

Inverse Reinforcement Learning (IRL, Osa et al., 2018) aims at recovering a reward function by observing the behavior of an expert. One of the main challenges of IRL is that the problem itself is ill-posed, as multiple solutions are admissible (Ng & Russell, 2000). Several criteria have been proposed to address this ambiguity issue, based on different principles, including feature-based matching (Abbeel & Ng, 2004), maximum margin planning (Ratliff et al., 2006a), maximum entropy (Ziebart et al., 2008), maximum Hessian eigenvalue (Metelli et al., 2017), and generative adversarial learning (Ho & Ermon, 2016). These algorithms were evaluated either experimentally, in terms of the performance of the policy learned using the recovered reward function (Ratliff et al., 2006a; Ziebart et al., 2008; Silver et al., 2010; Ziebart et al., 2010; Boularias et al., 2011; Ho & Ermon, 2016), or theoretically, under the strong assumption of reward uniqueness (Abbeel & Ng, 2004; Pirotta & Restelli, 2016; Ramponi et al., 2020b). Nevertheless, as noted in Osa et al. (2018), the evaluation of IRL algorithms remains, to a large extent, an open question.

Taking a step back, in the IRL framework the transition model of the underlying Markov Decision Process (MDP, Puterman, 2014) is typically unknown to the algorithm, as is the expert's policy. In general, these elements are estimated by interacting with the environment and by querying the expert. This leads to an unavoidable error on the feasible set of reward functions, i.e., the ones compatible with the expert's demonstrations. Motivated by this, the first question we aim to address is:

(Q1) How does the error on the transition model and on the expert's policy propagate to the recovered reward?

Clearly, any answer to this question will depend on the chosen IRL algorithm, i.e., on the criterion for selecting one reward function within the feasible set. To avoid the dependence on the specific IRL algorithm, we will address (Q1) by studying directly the properties of the feasible set.

*Equal contribution. 1Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy. Correspondence to: Alberto Maria Metelli. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).
From an applicative point of view, the objective of IRL is twofold: explainability and transferability. On the one hand, understanding the expert's intentions is useful for descriptive purposes and can help interpret the expert's decisions (Russell & Santos, 2019; Juozapaitis et al., 2019; Hayat et al., 2019; Likmeta et al., 2021). On the other hand, the recovered reward function can be used to learn the same task in a possibly different environment (Abbeel & Ng, 2004; Levine et al., 2011; Fu et al., 2017). This ability makes the IRL approach more powerful than Behavioral Cloning (BC, Osa et al., 2018). Indeed, the reward function is the most succinct representation of a task (Sutton et al., 1998), and it can be transferred to other domains, unlike a cloned policy, which is tightly connected to the environment in which it is played. These considerations motivate our second question:

(Q2) How does the error on the recovered reward affect the performance of the policy learned in a different environment?

Thus, the performance of the policy learned in the new environment will be our index for evaluating the quality of the recovered reward. As for question (Q1), we will not focus on a particular IRL algorithm but rather on the properties of the recovered feasible set of rewards.

Contributions. The contributions of this paper can be summarized as follows. We study the error on the recovered reward by deriving a bound that highlights the individual contributions due to the estimation of the transition model and of the expert's policy (Section 3). We analyze the problem of transferring the reward to a new environment. We assume a known target environment, while we interact with a different source environment and with the corresponding expert's policy only. We derive a bound on the performance of the learned policy in the target environment when using the reward recovered from the source one (Section 4). Given the previous results, we consider a uniform sampling strategy for the IRL problem given a generative model of the source environment (Section 5). This leads to a sample complexity bound for IRL and takes a first step towards answering one of the open questions of this setting (Osa et al., 2018). Finally, taking inspiration from Zanette et al. (2019), we propose a new algorithm, Transferable Reward ActiVE irL (TRAVEL), that adapts the sampling strategy to the features of the problem. TRAVEL employs an IRL algorithm to choose the reward function from the estimated feasible set. We derive a problem-dependent upper bound on the sample complexity for the IRL setting (Section 6). The proofs of the results presented in the main paper can be found in Appendix B.

2. Preliminaries

In this section, we introduce the background that will be employed throughout the remainder of the paper.

Mathematical Notation. Let $\mathcal X$ be a finite set and $\mathcal Y$ be a space; we denote by $\mathcal Y^{\mathcal X}$ the set of functions $f:\mathcal X \to \mathcal Y$. The simplex over $\mathcal X$ is denoted by $\Delta^{\mathcal X} = \{\nu \in [0,1]^{\mathcal X} : \sum_{x \in \mathcal X} \nu(x) = 1\}$, and we indicate with $\Delta^{\mathcal X}_{\mathcal Y}$ the set of functions $f:\mathcal X \to \Delta^{\mathcal Y}$. Let $\mu \in \Delta^{\mathcal X}$ and $f \in \mathbb R^{\mathcal X}$; we abbreviate $\mu f = \sum_{x \in \mathcal X} \mu(x) f(x)$, i.e., we use $\mu$ as an operator. We define the $L_\infty$-norm of $f$ as $\|f\|_\infty = \max_{x \in \mathcal X} |f(x)|$.

Markov Decision Processes. A discounted Markov Decision Process without Reward function (MDP\R) is defined as a tuple $\mathcal M = (\mathcal S, \mathcal A, p, \gamma)$, where $\mathcal S$ is the state space, $\mathcal A$ is the action space, $p \in \Delta^{\mathcal S \times \mathcal A}_{\mathcal S}$ is the transition model, and $\gamma \in [0,1)$ is the discount factor. An MDP\R is a Markov Decision Process (MDP, Puterman, 2014) in which we remove the reward function. Given an MDP\R $\mathcal M$ and a reward function $r \in \mathbb R^{\mathcal S \times \mathcal A}$, we denote by $\mathcal M \cup r$ the MDP obtained by pairing $\mathcal M$ and $r$. The agent's behavior is modeled by a policy $\pi \in \Delta^{\mathcal S}_{\mathcal A}$. We assume that the state and action spaces are finite with cardinalities $S = |\mathcal S|$ and $A = |\mathcal A|$.

Operators. Let $f \in \mathbb R^{\mathcal S}$ and $g \in \mathbb R^{\mathcal S \times \mathcal A}$. We denote by $P$ and $\pi$ the operators induced by the transition model $p$ and by the policy $\pi$, i.e., $(Pf)(s,a) = \sum_{s' \in \mathcal S} p(s'|s,a) f(s')$ and $(\pi g)(s) = \sum_{a \in \mathcal A} \pi(a|s) g(s,a)$. Moreover, we introduce the operator $(Ef)(s,a) = f(s)$. Given $\pi \in \Delta^{\mathcal S}_{\mathcal A}$, we denote with $(B^{\pi} g)(s,a) = g(s,a)\,\mathbb 1\{\pi(a|s) > 0\}$ and $(\bar B^{\pi} g)(s,a) = g(s,a)\,\mathbb 1\{\pi(a|s) = 0\}$. Finally, we denote the expectation under the discounted occupancy measure with $(I_{\mathcal S} - \gamma \pi P)^{-1} f = \sum_{t \in \mathbb N} (\gamma \pi P)^t f$. See Appendix A for a complete definition of the operators.

Value Functions and Optimality. The Q-function $Q^{\pi}_{\mathcal M \cup r} \in \mathbb R^{\mathcal S \times \mathcal A}$ of a policy $\pi$ in the MDP $\mathcal M \cup r$ is the expected discounted sum of rewards obtained starting from a state-action pair and playing policy $\pi$ thereafter; it is defined via the Bellman equation (Sutton et al., 1998): $Q^{\pi}_{\mathcal M \cup r} = r + \gamma P \pi Q^{\pi}_{\mathcal M \cup r}$. The V-function is the expectation of the Q-function over the action space: $V^{\pi}_{\mathcal M \cup r} = \pi Q^{\pi}_{\mathcal M \cup r}$. The advantage function, defined as $A^{\pi}_{\mathcal M \cup r} = Q^{\pi}_{\mathcal M \cup r} - E V^{\pi}_{\mathcal M \cup r}$, provides the one-step performance gain achieved by playing a specific action in a state rather than following policy $\pi$. A policy $\pi^\star \in \Delta^{\mathcal S}_{\mathcal A}$ is optimal if it yields non-positive advantage, i.e., $A^{\pi^\star}_{\mathcal M \cup r}(s,a) \le 0$ for all $(s,a) \in \mathcal S \times \mathcal A$. We denote with $\Pi^\star_{\mathcal M \cup r} \subseteq \Delta^{\mathcal S}_{\mathcal A}$ the set of optimal policies for the MDP $\mathcal M \cup r$. Under mild conditions (Puterman, 2014), any optimal policy attains the optimal V-function, i.e., $V^\star_{\mathcal M \cup r} = \max_{\pi \in \Delta^{\mathcal S}_{\mathcal A}} V^{\pi}_{\mathcal M \cup r}$.
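To make the notation above concrete, the following is a minimal sketch (ours, not from the paper) of how the tabular objects of this section can be represented with NumPy arrays: the transition model $p$ as an $S \times A \times S$ tensor, a policy as an $S \times A$ matrix, and the Q-, V-, and advantage functions of a policy obtained by solving the Bellman equation exactly. All function names are illustrative.

```python
import numpy as np

def apply_P(p, f):
    """(Pf)(s,a) = sum_s' p(s'|s,a) f(s'), with p of shape (S, A, S) and f of shape (S,)."""
    return p @ f                      # -> shape (S, A)

def apply_pi(pi, g):
    """(pi g)(s) = sum_a pi(a|s) g(s,a), with pi and g of shape (S, A)."""
    return (pi * g).sum(axis=1)       # -> shape (S,)

def policy_evaluation(p, r, pi, gamma):
    """Solve Q = r + gamma * P pi Q for a tabular MDP M u r (exact linear solve)."""
    S, A = r.shape
    # Transition matrix over state-action pairs induced by pi.
    P_pi = (p[..., None] * pi[None, None, :, :]).reshape(S * A, S * A)
    Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A)).reshape(S, A)
    V = apply_pi(pi, Q)               # V^pi = pi Q^pi
    Adv = Q - V[:, None]              # A^pi = Q^pi - E V^pi
    return Q, V, Adv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 4, 3, 0.9
    p = rng.dirichlet(np.ones(S), size=(S, A))     # p(s'|s,a), shape (S, A, S)
    r = rng.uniform(size=(S, A))                   # reward, shape (S, A)
    pi = rng.dirichlet(np.ones(A), size=S)         # policy pi(a|s), shape (S, A)
    Q, V, Adv = policy_evaluation(p, r, pi, gamma)
    print(np.allclose(Q, r + gamma * apply_P(p, V)))   # Bellman equation check
```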
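As an illustration of Definition 3.1, the sketch below (ours, not from the paper) checks whether a candidate reward $r$ is feasible for an IRL problem $(\mathcal M, \pi^E)$ by verifying that $\pi^E$ is optimal for $\mathcal M \cup r$: its advantage must be (numerically) zero on the actions it plays and non-positive elsewhere, which is exactly the characterization given in Lemma 3.1 in the next subsection. All names are illustrative.

```python
import numpy as np

def is_feasible(p, pi_E, r, gamma, tol=1e-6):
    """Check whether pi_E is optimal for M u r: zero advantage on the support of
    pi_E and non-positive advantage on the remaining actions."""
    S, A = r.shape
    # Exact policy evaluation of pi_E under reward r: Q = r + gamma * P pi_E Q.
    P_piE = (p[..., None] * pi_E[None, None, :, :]).reshape(S * A, S * A)
    Q = np.linalg.solve(np.eye(S * A) - gamma * P_piE, r.reshape(S * A)).reshape(S, A)
    V = (pi_E * Q).sum(axis=1)
    Adv = Q - V[:, None]
    played = pi_E > 0
    return bool(np.all(np.abs(Adv[played]) <= tol) and np.all(Adv[~played] <= tol))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A, gamma = 3, 2, 0.9
    p = rng.dirichlet(np.ones(S), size=(S, A))               # transition model p(s'|s,a)
    pi_E = np.zeros((S, A))
    pi_E[np.arange(S), rng.integers(A, size=S)] = 1.0        # a deterministic expert policy
    r_zero = np.zeros((S, A))                                 # the zero reward is always feasible
    print(is_feasible(p, pi_E, r_zero, gamma))                # True: every policy is optimal for r = 0
```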
3.1. Feasible Reward Set

In this section, we study the properties of the feasible reward set. We start from the following result, which provides an implicit characterization of the feasible set.

Lemma 3.1 (Feasible Reward Set Implicit (Ng & Russell, 2000)). Let $\mathcal P = (\mathcal M, \pi^E)$ be an IRL problem and let $r \in \mathbb R^{\mathcal S \times \mathcal A}$. Then, $r$ is a feasible reward, i.e., $r \in \mathcal R_{\mathcal P}$, if and only if for all $(s,a) \in \mathcal S \times \mathcal A$ it holds that: (i) $Q^{\pi^E}_{\mathcal M \cup r}(s,a) - V^{\pi^E}_{\mathcal M \cup r}(s) = 0$ if $\pi^E(a|s) > 0$; (ii) $Q^{\pi^E}_{\mathcal M \cup r}(s,a) - V^{\pi^E}_{\mathcal M \cup r}(s) \le 0$ if $\pi^E(a|s) = 0$. Furthermore, if condition (ii) holds with the strict inequality, $\pi^E$ is the unique optimal policy under $r$, i.e., $\Pi^\star_{\mathcal M \cup r} = \{\pi^E\}$.

Both conditions are expressed in terms of the advantage function $A^{\pi^E}_{\mathcal M \cup r}(s,a) = Q^{\pi^E}_{\mathcal M \cup r}(s,a) - V^{\pi^E}_{\mathcal M \cup r}(s)$. Specifically, condition (i) prescribes that the advantage function of the actions that are played by the expert's policy $\pi^E$ must be null, whereas condition (ii) ensures that the actions that are not played have a non-positive advantage. In other words, Lemma 3.1 requires $\pi^E$ to be an optimal policy under the reward function $r$. From Lemma 3.1, we derive an explicit form of the reward functions belonging to the feasible set.

Lemma 3.2 (Feasible Reward Set Explicit). Let $\mathcal P = (\mathcal M, \pi^E)$ be an IRL problem and let $r \in \mathbb R^{\mathcal S \times \mathcal A}$. Then, $r$ is a feasible reward, i.e., $r \in \mathcal R_{\mathcal P}$, if and only if there exist $\zeta \in \mathbb R^{\mathcal S \times \mathcal A}_{\ge 0}$ and $V \in \mathbb R^{\mathcal S}$ such that: $r = -\bar B^{\pi^E} \zeta + (E - \gamma P) V$.

Thus, the reward function is the sum of two terms. The first term $-\bar B^{\pi^E} \zeta$ depends on the expert's policy $\pi^E$ only, but not on the MDP. It is zero for all actions the expert plays, i.e., those such that $\pi^E(a|s) > 0$, while its value is non-positive for actions that the expert does not play, i.e., for those with $\pi^E(a|s) = 0$. Requiring a strictly positive $\zeta$ allows enforcing $\pi^E$ as the unique optimal policy. The second term $(E - \gamma P) V$ depends on the MDP but not on the expert's policy. It can be interpreted as a reward shaping via the function $V$, which is well known to preserve the optimality of the expert's policy (Ng & Russell, 2000). By applying the Bellman equation, it is easy to see that the Q-function induced by $\pi^E$ under $r$ is given by $Q^{\pi^E}_{\mathcal M \cup r} = -\bar B^{\pi^E} \zeta + E V$, in which $V$ has the role of translating the Q-values by a fixed quantity within the state. Thus, $-\bar B^{\pi^E} \zeta$ represents the advantage function $A^{\pi^E}_{\mathcal M \cup r}$ and $V$ is the V-function $V^{\pi^E}_{\mathcal M \cup r}$.

[Figure 1. Feasible reward sets of two IRL problems: $\mathcal P$ is an IRL problem and $\hat{\mathcal P}$ a version of $\mathcal P$ estimated from samples.]

3.2. Error Propagation in the Feasible Reward Set

We now study the error propagation in the feasible set, addressing question (Q1) of Section 1. Specifically, we consider two IRL problems $\mathcal P = (\mathcal M, \pi^E)$ and $\hat{\mathcal P} = (\hat{\mathcal M}, \hat\pi^E)$. $\hat{\mathcal P}$ can be thought of as an approximate version of $\mathcal P$, where the transition model $\hat p$ and the expert's policy $\hat\pi^E$ are estimated through samples. Intuitively, an error in estimating the transition model $p$ and the expert's policy $\pi^E$ results in an error in the estimation of the feasible set $\mathcal R_{\mathcal P}$. Since the IRL problem is ambiguous (it admits multiple solutions), we cannot expect to recover the expert's reward $r^E$ exactly. Instead, we will be satisfied whenever we recover an accurate approximation of the feasible set, in which $r^E$ also lies. Informally, we will say that the estimated feasible set $\mathcal R_{\hat{\mathcal P}}$ is close to the exact one $\mathcal R_{\mathcal P}$ if for every reward $r \in \mathcal R_{\mathcal P}$ there exists one estimated reward $\hat r \in \mathcal R_{\hat{\mathcal P}}$ that is close to $r$, and vice versa (Figure 1).¹ The following result formalizes this intuition by providing the error propagation result.

Theorem 3.1 (Error Propagation). Let $\mathcal P = (\mathcal M, \pi^E)$ and $\hat{\mathcal P} = (\hat{\mathcal M}, \hat\pi^E)$ be two IRL problems. Then, for any $r \in \mathcal R_{\mathcal P}$ such that $r = -\bar B^{\pi^E} \zeta + (E - \gamma P) V$ and $\|r\|_\infty \le R_{\max}$, there exists $\hat r \in \mathcal R_{\hat{\mathcal P}}$ such that, element-wise, it holds that: $|r - \hat r| \le \bar B^{\pi^E} B^{\hat\pi^E} \zeta + \gamma\,\big|(P - \hat P) V\big|$.
Furthermore, }ζ}8ď Rmax 1 γ and }V }8ď Rmax The result states the existence of a reward pr in the estimated feasible set R p P fulfilling the bound that consists of two components. The first one B πE BpπEζ depends on the policy approximation only. Specifically, this term is non-zero in the state-action pairs such that πEpa|sq 0 and pπEpa|sqą0 only, i.e., for the actions that are not played by the expert but are wrongly believed to be played. Thus, to zero out this 1This notion of closeness between two sets is formalized by the Hausdorff distance (Rockafellar & Wets, 2009). Provably Efficient Learning of Transferable Rewards term it suffices to identify for each state one action played by the expert. The second term |p P p Pq V |, instead, concerns the estimation error of the transition model. Clearly, by reversing the roles of r and pr, we can obtain a symmetric statement which, however, displays some differences when thinking to R p P as the estimated feasible set. Specifically, while the second term related to the transition model does not change (apart from V becoming p V ), the first one related to the policy becomes B pπE BπE pζ. To zero out this term it is required identifying all actions played by the expert (not just one), that is a more demanding task. Clearly, this distinction vanishes for deterministic experts. 4. Transferring Rewards As mentioned in Section 1, one of the advantages of IRL over BC is the possibility of reusing the learned reward function in a different environment. More specifically, we consider the following setting. There is an expert agent playing an optimal policy πE in a source MDPz R M. We want to recover a reward function explaining the expert s policy πE in M, knowing that we will employ it in a different target MDPz R M1 for policy learning. In this section, we discuss the assumptions needed for transferring the reward function (Section 4.1). Then, we analyze how the error on the reward function propagates to the performance of the learned policy in the target environment (Section 4.2). 4.1. Transferable Reward Assumption Transferring the recovered reward function poses new challenges that are quite unexplored in the IRL literature. Indeed, it might happen that different rewards are inducing the same expert s policy πE in the source MDPz R M, while generating different optimal policies in the target MDPz R M1. More formally, let r E be the true (and unknown) reward optimized by the expert s policy πE and let pπ1q E the policy that the expert would play in M1 optimizing the same r E. Suppose we are able to solve the source IRL problem P p M,πEq finding r PRP, possibly different from r E. There is no guarantee that r will make the policy pπ1q E optimal in the target MDPz R M1. In other words, r might not be a solution to the target IRL problem P1 p M1,pπ1q Eq. In order to solve this additional ambiguity issue, we enforce the following assumption. Assumption 4.1. Let P p M,πEq and P1 p M1,pπ1q Eq be the source and target IRL problems. The corresponding feasible sets satisfy RP1 ĚRP. With this assumption, we guarantee that every reward that is feasible for the source MDPz R M is also feasible for the target MDPz R M1. We think that this assumption is unavoidable in our setting since we have no information regarding the optimality of the observed expert policy πE in the target MDPz R M1. 
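Assumption 4.1 cannot be verified in the setting of the paper, since the target expert's policy is never observed. Still, in a fully synthetic experiment where both IRL problems are known, one can spot-check it by generating source-feasible rewards through the explicit form of Lemma 3.2 and testing them with the conditions of Lemma 3.1 in the target problem. The sketch below is ours and only illustrative: the availability of the target expert policy `pi_E_tgt`, and all function names, are assumptions of this sketch.

```python
import numpy as np

def q_pi(p, pi, r, gamma):
    """Exact policy evaluation: Q^pi of the MDP M u r."""
    S, A = r.shape
    P_pi = (p[..., None] * pi[None, None, :, :]).reshape(S * A, S * A)
    return np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A)).reshape(S, A)

def feasible(p, pi, r, gamma, tol=1e-6):
    """Lemma 3.1: pi is optimal for M u r (zero advantage on its support, <= 0 elsewhere)."""
    Q = q_pi(p, pi, r, gamma)
    Adv = Q - (pi * Q).sum(axis=1, keepdims=True)
    return bool(np.all(np.abs(Adv[pi > 0]) <= tol) and np.all(Adv[pi == 0] <= tol))

def sample_feasible_reward(p, pi_E, gamma, rng, r_max=1.0):
    """Lemma 3.2: r = -B̄^{pi_E} zeta + (E - gamma P) V with zeta >= 0 and arbitrary V."""
    S, A = pi_E.shape
    zeta = rng.uniform(0.0, r_max, size=(S, A)) * (pi_E == 0)   # zeta acts only on unplayed actions
    V = rng.uniform(-r_max, r_max, size=S)
    return -zeta + V[:, None] - gamma * (p @ V)                  # (E - gamma P) V in tabular form

def check_assumption_41(p_src, p_tgt, pi_E_src, pi_E_tgt, gamma_src, gamma_tgt,
                        n_samples=100, seed=0):
    """Spot-check R_P subset of R_P' on randomly generated source-feasible rewards."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        r = sample_feasible_reward(p_src, pi_E_src, gamma_src, rng)
        assert feasible(p_src, pi_E_src, r, gamma_src)           # feasible by construction (Lemma 3.2)
        if not feasible(p_tgt, pi_E_tgt, r, gamma_tgt):
            return False                                          # counterexample to Assumption 4.1
    return True
```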
The intuition behind Assumption 4.1 is that, by simply observing the expert playing an action in a state, we can only conclude that the action is optimal, but we are unable to judge how much suboptimal are the actions that the agent does not play. This issue could be overcome in two ways, which anyway require a modification of the setting and, thus, are out of the scope of this work: (i) we observe the optimal behavior of the agent in several different environments (Amin et al., 2017); (ii) we assume the expert plays a stochastic policy defined in terms of its Q-values. In this regard, Fu et al. (2017) proves that there is a one-to-one mapping between Boltzmann policies and Q-functions except for a state-only translation and scaling. 4.2. Error Propagation on the Value Function In this section, we focus on question (Q2) presented in Section 1. We discuss, under Assumption 4.1, how an error on the reward function propagates into an error in estimating the optimal value function, and consequently on the optimal policy, when transferring the recovered reward to a possibly different MDPz R M1 p S,A,p1,γ1q. We start with Lemma 4.1, which provides upper and lower bounds to the difference between the optimal Q-function under the true reward function Q M1Yr and the optimal Q-function under the estimated reward function Q M1Ypr. Lemma 4.1 (Simulation Lemma 1). Let M1 p S,A,p1,γ1q be an MDPz R, let r,pr PRSˆA be two reward functions. Then, for every π PΠ M1Yr and pπ PΠ M1Ypr optimal policies for the MDPs M1Yr and M1Ypr respectively, the following inequalities hold element-wise: Q M1Yr Q M1Yprď ISˆA γ1P 1π 1pr prq, Q M1Yr Q M1Yprě ISˆA γ1P 1pπ 1pr prq. In particular, it holds that: Q M1Yr Q M1Ypr 8ď max πPtπ ,pπ u ISˆA γ1P 1π 1pr prq 8. The result suggests that we need to be accurate in estimating the reward function of the state-action pairs that are highly visited by the discounted occupancy measures of the optimal policy π PΠ M1Yr and of the policy pπ PΠ M1Ypr induced by the estimated reward function pr. It is worth noting that Lemma 4.1 holds for arbitrarily chosen policies π and pπ in the corresponding sets. However, sometimes we are not interested in just estimating the value function accurately, rather in the performance of the policy pπ , computed with the estimated reward function pr, under the true reward function r. The following result shows that we can reduce this problem to the one of estimating an accurate Q-function, as in Lemma 4.1. Lemma 4.2 (Simulation Lemma 2). Let M1 p S,A,p1,γ1q be an MDPz R, let r,pr PRSˆA be two reward functions, and Provably Efficient Learning of Transferable Rewards let pπ PΠ M1Ypr be an optimal policy for M1Ypr. Then, for every π PΠ M1Yr optimal policy for MDP M1Yr, the following inequality holds element-wise: V M1Yr V pπ M1Yrď IS γ1pπ P 1 1pπ pπ q ˆp Q M1Yr Q M1Yprq. (1) In particular, it holds that: V M1Yr V pπ M1Yr 8ď 2 1 γ1 Q M1Yr Q M1Ypr 8. (2) While the element-wise inequality in Equation (1) depends on the specific choice of pπ (indeed while all pπ PΠ M1Ypr induce the same value function under pr they might induce different value functions under r), the L8-norm inequality in Equation (2), although looser, does not depend on the specific policy pπ , but on the estimated reward pr only. 5. Learning Transferable Rewards with a Generative Model In this section, we introduce the problem of learning a transferable reward function in a generative model setting and we introduce the notion of sampling strategy (Section 5.1). 
Then, we provide confidence intervals for the estimated transition model and expert s policy (Section 5.2). Finally, we study the sample complexity of a simple uniform sampling strategy (Section 5.3). 5.1. Problem Setting and Sampling Strategy We start by explicitly stating the assumptions of the setting we consider. Assumption 5.1. The following statements hold: (i) M and M1 have the same state and action spaces; (ii) we have access to the generative model of M; (iii) we can query the expert s policy πE in any state of M; (iv) the expert s policy πE is deterministic; (v) we know the transition model p1 and the discount factor γ1 of M1. The sample collection proceeds at iterations. At each iteration k Pr Ks, we collect at most nmax PN samples. When the generative model is queried about a state-action pair ps,aq PSˆA, it responds with a transition triple ps,a,s1q, where s1 pp |s,aq, and with an expert decision πEpsq. The sampling strategy S decides, at each iteration k, how to allocate the nmax samples over the state-action space SˆA, with the goal of estimating the feasible set accurately. To this purpose, we introduce the following PAC requirement. Definition 5.1. Let S be a sampling strategy. Let RP be the exact feasible set and R p P be the feasible set recovered after observing ně0 samples collected in the source MDPz R M. Let pr,qrq PRPˆR p P be a pair of target rewards, we say that S is pϵ,δ,nq-correct for MDPz R M1 and for the target rewards pr,qrq if with probability at least 1 δ it holds that: inf pr PRx P Q M1Yr Q M1Ypr 8ďϵ Q M1Yqr Q M1Yr 8ďϵ. (3) The first condition guarantees that for a choice of the target reward in the exact feasible set r PRP (for instance, the expert s one r E), there exists a reward in the recovered feasible set pr PR p P, inducing an ϵ-close Q-function. However, the first condition alone allows the presence of reward functions in R p P that do not explain any exact reward function (e.g., we could select R p P RSˆA). Thus, we enforce the second condition, which guarantees that for a target reward function selected in the recovered feasible set qr PR p P (for instance, the one produced by an IRL algorithm A p R p Pq), there always exists a reward function in the exact feasible set r PR p P inducing an ϵ-close Q-function.2 The condition presented in Definition 5.1 refers to the ability to recover a reward function in the feasible set that induces an optimal Q-function Q M1Ypr close to the one induced by a target reward Q M1Yr. As already mentioned in Section 4.2, we might be interested, instead, in evaluating the performance of an optimal policy pπ PΠ M1Ypr, recovered with the estimated reward function pr, under a target reward function r. In such a case, we have to focus on the value function difference V M1Ypr V pπ M1Yr. The following result, which makes use of Lemma 4.2, shows that fulfilling Definition 5.1, allows deriving guarantees for this case too. Lemma 5.1. Let S be a sampling strategy. Let RP be the exact feasible set and R p P be the feasible set recovered after observing ně0 samples collected in the source MDPz R M. Let pr,qrq PRPˆR p P be a pair of target rewards, if S is pϵ,δ,nq-correct for MDPz R M1 and for the target rewards pr,qrq, as in Definition 5.1, then it holds that: inf pr PRx P sup pπ PΠ M1Ypr V M1Yr V pπ M1Yr 8ď 2ϵ 1 γ1 , inf r PRP sup π PΠ M1Yr V M1Yqr V π M1Yqr 8ď 2ϵ 1 γ1 . 
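In small tabular problems, the quantities appearing in Lemmas 4.1 and 4.2 and in Definition 5.1 can be computed exactly, which makes the factor $2/(1-\gamma')$ discussed next easy to check numerically. The following is a minimal sketch of ours: the target instance, the two rewards, and all function names are randomly generated illustrative assumptions, not taken from the paper.

```python
import numpy as np

def value_iteration(p, r, gamma, iters=1000):
    """Optimal Q-function of the tabular MDP M' u r (p has shape (S, A, S))."""
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = r + gamma * (p @ Q.max(axis=1))       # Bellman optimality backup
    return Q

def greedy(Q):
    """A deterministic optimal policy extracted from Q (ties broken arbitrarily)."""
    S, A = Q.shape
    pi = np.zeros((S, A))
    pi[np.arange(S), Q.argmax(axis=1)] = 1.0
    return pi

def policy_value(p, pi, r, gamma):
    """V^pi of M' u r by exact policy evaluation."""
    S, A = r.shape
    P_pi = (p[..., None] * pi[None, None, :, :]).reshape(S * A, S * A)
    Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A)).reshape(S, A)
    return (pi * Q).sum(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    S, A, gamma1 = 5, 3, 0.9
    p1 = rng.dirichlet(np.ones(S), size=(S, A))        # target transition model p'
    r = rng.uniform(size=(S, A))                        # a target reward
    r_hat = r + rng.uniform(-0.05, 0.05, size=(S, A))   # a nearby estimated reward
    Q_r, Q_rhat = value_iteration(p1, r, gamma1), value_iteration(p1, r_hat, gamma1)
    q_gap = np.abs(Q_r - Q_rhat).max()                  # ||Q*_{M' u r} - Q*_{M' u r_hat}||_inf
    v_loss = (policy_value(p1, greedy(Q_r), r, gamma1)
              - policy_value(p1, greedy(Q_rhat), r, gamma1)).max()  # performance loss of pi_hat*
    print(q_gap, v_loss, 2 * q_gap / (1 - gamma1))      # Lemma 4.2: v_loss <= 2 q_gap / (1 - gamma')
```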
Thus, when moving from the approximation of the Qfunctions Q M1Yr Q M1Ypr, to the evaluation of performance of the learned policy pπ under the target reward r, we lose a factor 2{p1 γ1q. It is worth noting that the 2The target reward functions pr,qrq PRPˆR p P are a way of selecting one reward within the feasible set. As we shall see in the next sections, we will be able to provide sample-complexity guarantees for arbitrary choices of pr,qrq. Provably Efficient Learning of Transferable Rewards guarantee of Lemma 5.1 involves a supremum over the set of optimal policies for the candidate reward, i.e., Π M1Ypr, that helps discarding degenerate rewards, like constant ones. 5.2. Transition Model and Policy Estimation For each iteration k Pr Ks, we denote by nkps,a,s1q the number of times the triple ps,a,s1q PSˆAˆS is visited in episode k and nkps,aq ř s1PS nkps,a,s1q. For the transition model estimation, we define the cumulative counts Nkps,a,s1q ř j Prksnjps,a,s1q and Nkps,aq ř j Prksnjps,aq, which lead to the estimate: ppkps1|s,aq Nkps,a,s1q N k ps,aq , where x maxt1,xu. Concerning the estimated expert s policy pπE k , since the expert is deterministic for Assumption 5.1, the first time we sample a state s PS we recover the true policy πEpsq. We now show that, with high probability, we can guarantee an accurate estimate of the expert s policy and of the transition model. Lemma 5.2 (Good Event). Let δPp0,1q, define the good event E as the event such that the following inequalities hold simultaneously for all ps,aq PSˆA and kě1: B πE BpπE k ζ ps,aqď Rmax 1 γ 1t Nkpsq 0u, B pπE k BπE pζk ps,aqď Rmax 1 γ 1t Nkpsq 0u, ˇˇˇ P p Pk V ˇˇˇps,aqď Rmax 2ℓkps,aq N k ps,aq, ˇˇˇ P p Pk p Vk ˇˇˇps,aqď Rmax 2ℓkps,aq N k ps,aq, where ζ, pζk, V , and p Vk are defined in Theorem 3.1 and ℓkps,aq log 12SAp N k ps,aqq2 δ . Then, Prp Eqě1 δ. The terms related to the policy estimation become null as we identify the action played by the expert in each state, being the expert s policy deterministic. Therefore, they are replaced with 1t Nkpsq 0u. The terms related to the transition model, instead, are applications of H oeffding s inequality (Boucheron et al., 2013). By plugging the confidence interval of Lemma 5.2 into Theorem 3.1, we conclude that under the good event E, at iteration k 1, given a target reward r PRP, there exists an estimated reward prk 1PR p Pk 1 such that |r prk 1|ps,aqďCk 1ps,aq where: Ck 1ps,aq Rmax 1t Nk 1psq 0u γ and Nk 1ps,aq Nkps,aq nk 1ps,aq. Moreover, given a target reward qrk 1PR p Pk 1, we can guarantee that under Algorithm 1 Uniform Sampling IRL Input: significance δPp0,1q, ϵ target accuracy, nmax maximum number of samples per iteration kÐ0 ϵ0 1 1 γ1 while ϵkąϵ do Collect r nmax SA s from each ps,aq PSˆA Update accuracy ϵk 1 1 1 γ1 maxps,aq PSˆACk 1ps,aq Update ppk 1 and pπE k 1 kÐk 1 end while the same good event E, there exists an exact reward function r PRP such that |r qrk 1|ps,aqďCk 1ps,aq as well. 5.3. Uniform Sampling Strategy The first algorithm we present employs a uniform sampling strategy to allocate samples over SˆA, until the desired accuracy ϵą0 is reached (Algorithm 1). The stopping condition makes use of the obtained confidence intervals: ϵk 1: 1 1 γ1 max ps,aq PSˆACk 1ps,aqďϵ. Thanks to Lemma 4.1, we are guaranteed that, under E, when the stopping condition is activated, the recovered feasible set fulfills Definition 5.1, as shown below. Theorem 5.1 (Sample Complexity of Uniform Sampling IRL). 
If Algorithm 1 stops at iteration K with accuracy ϵK, then with probability at least 1 δ it fulfills Definition 5.1, for arbitrary target reward functions r and qr, with a number of samples upper bounded by: nď r O ˆ γ2R2 max SA p1 γ1q2p1 γq2ϵ2 K 6. Active Learning of Transferable Rewards In this section, we present a novel algorithm, named Transferable Reward Acti VE ir L (TRAVEL), that adapts the sampling strategy to the structure of the problem. In order to choose which state-action pairs to sample from, we make use of Lemma 4.1. Suppose we are at iteration k Pr Ks and we have to decide how to allocate the nmax samples of the next iteration k 1. We have already observed that, under the good event E, there exists prk 1PR p Pk 1 and r PRP such that |r prk 1|ps,aqďCk 1ps,aq and |qrk 1 r|ps,aqď Ck 1ps,aq. Then, by Lemma 4.1 we have that: Q M1Yr Q M1Yprk 1 8ď max πPtπ ,pπ k 1u ISˆA γ1P 1π 1Ck 1 8, Q M1Yqrk 1 Q M1Yr 8ď max πPtqπ k 1,π u ISˆA γ1P 1π 1Ck 1 8, where the policies are arbitrarily selected in the corresponding sets: π PΠ M1Yr, pπ k 1PΠ M1Yprk 1, qπ k 1PΠ M1Yqrk 1, and π PΠ M1Yr. In principle, we could optimize the right- Provably Efficient Learning of Transferable Rewards Algorithm 2 TRAVEL Input: significance δPp0,1q, ϵ target accuracy, nmax maximum number of samples per iteration, IRL algorithm A kÐ0 ϵ0 1 1 γ1 while ϵkąϵ do Solve optimization problem in Eq (4) for nk 1 and ϵk 1 Collect nk 1ps,aq samples from ps,aq PSˆA Update ppk 1 and pπE k 1 kÐk 1 end while hand side of the previous inequalities over nk 1 to obtain the sample allocation. However, we have no knowledge about all the involved policies. Thus, we resort to a surrogate bound that leads to an allocation better than the uniform one. To this purpose, given an IRL algorithm A , we follow the spirit of (Zanette et al., 2019) extending the maximization over a set of policies ΠA k that, with high probability, contains the needed ones: πP A S : sup µ0P Sµ0 V M1YA p Rx Pk q V π M1YA p Rx Pk q ď4ϵk where the value of ϵk will be defined later. Here is the first point in which we actually make use of an IRL algorithm A , whose goal is to choose a reward in the feasible reward set. The rationale in the definition of ΠA k is to constrain the search for the policy to those yielding a value function at iteration k close to the estimated optimal one. We can now formulate the optimization problem: ϵk 1: min nk 1PNSˆA max µ0P SˆA µ0 ISˆA γ1P 1π 1Ck 1 s.t. µ0 E V M1YA p Rx Pk q V π M1YA p Rx Pk q ď4ϵk ÿ ps,aq PSˆA nk 1ps,aqďnmax. The program is a minimax in which we look for the sample allocation nk 1 that minimizes the bound on the value function difference of Lemma 4.1, under the worst possible policy π in the set ΠA k and initial state-action distribution µ0. Therefore ϵk, used to define ΠA k , is the objective function value of the previous iteration k. It can be proved that under the good event E, ΠA k contains a specimen of all the required optimal policies, i.e., π , pπ k 1, qπ k 1, and π (Corollary B.2). The constant nmax is the maximum number of samples allowed per iteration and it is a user-defined parameter. By choosing the nmax value, the user has to make a trade-off between time and sample efficiency. If the value of nmax is too high, the algorithm achieves the desired ϵ-correctness very quickly but with a possible sample inefficient behavior (close to uniform); if the value of nmax is too low, many iterations are needed to achieve the desired accuracy ϵ, but choosing more carefully where to sample. 
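Complementing the pseudocode of Algorithms 1 and 2, the following is a schematic sketch of ours, heavily simplified: it shows the interaction loop shared by Uniform Sampling IRL and TRAVEL, where the generative model and the expert are queried, $\hat p_k$ and $\hat\pi^E_k$ are updated, and a Hoeffding-style confidence radius $C_k(s,a)$ in the spirit of Lemma 5.2 (constants simplified) drives the stopping rule. The `sample` callback and the heuristic non-uniform allocation stand in for, respectively, the generative model of Assumption 5.1 and the minimax program of Eq. (4); they are assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np

def active_irl_loop(sample, S, A, gamma, gamma1, R_max, eps, n_max, delta,
                    uniform=True, max_iter=50):
    """Schematic generative-model loop shared by Algorithms 1 and 2 (simplified).

    `sample(s, a)` is assumed to return (s_next, expert_action) from the source MDP.
    """
    N = np.zeros((S, A), dtype=int)            # cumulative visit counts N_k(s, a)
    counts = np.zeros((S, A, S), dtype=int)    # transition counts N_k(s, a, s')
    expert = np.full(S, -1, dtype=int)         # estimated deterministic expert policy
    C = np.full((S, A), np.inf)                # confidence radius (updated each iteration)
    eps_k = np.inf
    for _ in range(max_iter):
        # --- sample allocation for the next iteration ---
        if uniform or N.sum() == 0:
            n_next = np.full((S, A), n_max // (S * A))               # Algorithm 1
        else:
            # Stand-in for the minimax program of Eq. (4): put more samples where the
            # current confidence radius is largest (NOT the paper's exact rule).
            n_next = np.floor(n_max * C / C.sum()).astype(int)
        # --- query the generative model and the expert ---
        for s in range(S):
            for a in range(A):
                for _ in range(n_next[s, a]):
                    s_next, a_exp = sample(s, a)
                    counts[s, a, s_next] += 1
                    expert[s] = a_exp
        N = counts.sum(axis=2)
        # --- confidence radius in the spirit of Lemma 5.2 (Hoeffding-style, simplified) ---
        ell = np.log(12 * S * A * np.maximum(N, 1) ** 2 / delta)
        C = (R_max / (1 - gamma)) * ((expert[:, None] < 0)           # state never visited
                                     + gamma * np.sqrt(2 * ell / np.maximum(N, 1)))
        eps_k = C.max() / (1 - gamma1)                               # stopping accuracy
        if eps_k <= eps:
            break
    p_hat = counts / np.maximum(N, 1)[..., None]                     # estimated transition model
    return p_hat, expert, eps_k
```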
The pseudocode of TRAVEL is reported in Algorithm 2. It is worth noting that we have not specified which IRL algorithm should be employed to recover a reward function. Indeed, any IRL algorithm A can be used for this purpose, provided that it selects a reward function within the feasible set R p P. We stress that the main goal of this paper is not to provide a new IRL algorithm for choosing a good reward from the feasible reward set, but to explain how to recover a good approximation of this feasible set. 6.1. Sample Complexity In this section, we prove that TRAVEL fulfills the PACcondition of Definition 5.1. In order to provide this result, we use as suboptimality gaps the negative advantage: A M1Yrps,aq V M1Yrpsq Q M1Yrps,aq for all ps,aq PSˆA, where rr Parginfr PRP }r A p R p PKq}8 is the reward function in the exact feasible set RP closest to the one returned by the IRL algorithm A applied to the estimated feasible set R p PK. Theorem 6.1 (Sample Complexity of TRAVEL). If Algorithm 2 stops at iteration K with accuracy ϵK and accuracy ϵK 1 at the previous iteration, then with probability at least 1 δ it fulfills Definition 5.1, for arbitrary target reward functions r and qr, with a number of samples upper bounded by n ř ps,aq PSˆANKps,aq where: NKps,aqď r O ˆ min " γ2R2 max p1 γ1q2p1 γq2ϵ2 K , γ2R2 maxϵ2 K 1 p1 γq2p A M1Yrps,aqq2ϵ2 K The significance of this result depends on two main components: the ratio between the two objectives ϵK 1 and ϵK and the suboptimality gaps. The latter depends, although indirectly, on the employed IRL algorithm A . The more suboptimal the action is, the less the action will be sampled. Instead, the ϵK 1{ϵK component depends on the choice of the nmax value: if this value is small, then the ratio will also be small. As discussed in the previous section, if the nmax value is too high, the algorithm tends to sample every action-state pair uniformly. 6.2. Discussion The upper bound on the sample complexity that we derive in the problem-independent version is of order r O SA p1 γ1q2p1 γq2ϵ2 . Although the problem we address is intimately different, it is interesting to compare this result with the well-known lower bound for RL with a generative model (Azar et al., 2013): r O SA p1 γq3ϵ2 , matched by several algorithms (Azar et al., 2012; Sidford et al., 2018; Zanette et al., 2019). Thus, when γ1 γ, we have an exponent 4 for the term 1{p1 γq. One might be tempted to think that this is a consequence of using H oeffding s inequality Provably Efficient Learning of Transferable Rewards instead of the tighter Bernstein s inequality.3 We think that this might not be the case as the crucial property that allows achieving power 3 is a careful bound of the variance of p PV (Azar et al., 2013, Lemmas 7 and 8). There are two reasons why we cannot exploit this bound. First, we consider the expectations taken w.r.t. a different target model P 1, while the estimates are conducted on the source one P. Second, we are required, based on Theorem 3.1, to estimate the variance of p PV and V is unknown and hard to estimate. 7. Related Works The IRL problem was introduced by Ng & Russell (2000). Most early IRL algorithms assume that the dynamics of the system are known. 
Many criteria were proposed for selecting a good reward function in the feasible reward set, based on features matching (Abbeel & Ng, 2004), maximum margin (Ratliff et al., 2006a), maximum entropy (Ziebart et al., 2008; 2010), Bayesian framework (Ramachandran & Amir, 2007), boosting methods (Ratliff et al., 2006b) and Gaussian processes (Levine et al., 2011). A limited number of IRL algorithms can be considered model-free: Relative Entropy Inverse Reinforcement Learning (Boularias et al., 2011), Generative Adversarial Imitation Learning (Ho & Ermon, 2016), Gradient-based Inverse Reinforcement Learning (Pirotta & Restelli, 2016) and its extensions (Metelli et al., 2017; 2020; Ramponi et al., 2020b;a). Other works on Imitation Learning use an active approach, such as the one used in this paper. Judah et al. (2012) draw a reduction from active imitation learning to i.i.d. active learning. In (Ross & Bagnell, 2010) and (Ross et al., 2011), the authors propose two approaches based on executing the estimated policy and asking an oracle for a dataset containing the action performed by the expert. In these papers, however, no guarantees on the sample complexity are provided. The closest work to ours is by Lopes et al. (2009), which propose a method to actively ask for samples from a generator to perform IRL, adopting a Bayesian approach. However, they assume knowledge of the real transition model and the main effort lies in estimating the expert s policy. Since we do not assume the knowledge of the transition model, this work is not fully comparable to our setting. 8. Experimental Evaluation In this section, we provide the experimental evaluation of TRAVEL with a threefold goal. In the first experiment, we motivate the need for employing IRL over BC when our goal is to transfer knowledge to a target environment (Section 8.1). Then, we highlight the benefits of the sampling strategy of TRAVEL over Uniform Sampling (Section 8.2). 3In the RL with generative model setting, H oeffding s inequality leads to the same exponent 4. Finally, we show how TRAVEL can be combined with different IRL algorithms (Section 8.3). In all the experiments, we employ reward functions that depend on the state only and the algorithms are evaluated according to the following performance index }V M1Yr E V pπ M1Yr E}2 2, where the symbols are defined in the previous sections (we omit the subscript M1Y in the plots). The complete experimental results are provided in Appendix D. 8.1. IRL vs Behavioral Cloning In this section, we show the benefits of IRL over BC when we want to learn a transferable reward function. In BC, the collected samples are used to estimate a policy describing the expert s behavior directly. The recovered policy is typically highly dependent on the environment in which the expert is acting, therefore, in many cases, it cannot be transferred to different environments. We use a 3ˆ3 Gridworld environment with an obstacle in the central cell that makes the agent bouncing back with probability p and surpassing it with probability 1 p. If p 0 the optimal policy is to collide with the obstacle until the agent reaches the goal state. While, if p 1, an optimal agent gets around the obstacle. The source MDP has obstacle s probability p 0.8 and target MDPs are four Gridworlds with obstacle s probabilities p Pt0,0.2,0.5,0.8u. 
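Below is a sketch of ours showing how a Gridworld of this kind can be instantiated as a tabular MDP\R: a 3x3 grid with four movement actions, where a move into the central cell bounces back with probability p. The goal location, the deterministic moves elsewhere, and the absence of terminal states are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def gridworld_mdp(p_bounce, size=3, obstacle=(1, 1)):
    """3x3 Gridworld MDP without reward, with a bouncing obstacle in the central cell.

    Actions: 0=up, 1=down, 2=left, 3=right. Moving off the grid keeps the agent in
    place; moving into the obstacle cell succeeds with prob. 1 - p_bounce and
    bounces back with prob. p_bounce. Deterministic moves elsewhere are assumed.
    """
    S, A = size * size, 4
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    idx = lambda row, col: row * size + col
    p = np.zeros((S, A, S))
    for row in range(size):
        for col in range(size):
            s = idx(row, col)
            for a, (dr, dc) in enumerate(moves):
                nr = min(max(row + dr, 0), size - 1)
                nc = min(max(col + dc, 0), size - 1)
                s_next = idx(nr, nc)
                if (nr, nc) == obstacle:
                    p[s, a, s_next] += 1.0 - p_bounce   # surpass the obstacle
                    p[s, a, s] += p_bounce              # bounce back
                else:
                    p[s, a, s_next] += 1.0
    return p

# Source environment (p = 0.8) and the four target environments of Section 8.1.
p_source = gridworld_mdp(0.8)
p_targets = [gridworld_mdp(p) for p in (0.0, 0.2, 0.5, 0.8)]
print(p_source.shape, np.allclose(p_source.sum(axis=2), 1.0))  # (9, 4, 9) True
```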
For this experiment, we use a simple IRL algorithm that enforces the conditions of Lemma 3.1 and chooses the reward function that maximizes the minimum action gap; we call it MaxGap-IRL (details in Appendix C). The results in Figure 2 show that the performance of BC deteriorates as the source and target MDPs become more dissimilar, as expected. Differently, TRAVEL combined with MaxGap-IRL allows recovering a reward function that leads to an optimal policy. Thus, as long as the target and source environments are the same (Figure 2, first plot), BC is a valid alternative, but IRL becomes unavoidable when the need for transferring knowledge arises.

8.2. TRAVEL vs Uniform Sampling

As discussed in Section 5.3, Uniform Sampling IRL and TRAVEL differ in the strategy used to allocate samples to the state-action pairs. While Uniform Sampling queries the generative model uniformly, TRAVEL actively allocates samples to the state-action pairs that carry more information. We consider a chain MDP composed of 6 states $\mathcal S = \{s_0, \dots, s_4, s_b\}$ and 10 actions $\mathcal A = \{a_g, a_1, \dots, a_9\}$ (Figure 3). We have tested both algorithms and the results are shown in Figure 4. Although Uniform Sampling IRL seems to perform better with a small number of samples, we observe that TRAVEL recovers a reward function that achieves near-optimal performance with less than half of the samples needed by Uniform Sampling.

[Figure 2. Comparison between TRAVEL and Behavioral Cloning (BC) on the Gridworld environment, with different values of the obstacle probability for the target MDP $\mathcal M'$. 200 runs, 98% c.i.]
[Figure 3. MDP employed in Section 8.2. States $s_2$ and $s_3$ behave exactly as $s_1$. $a$ denotes any action in $\{a_1, \dots, a_9\}$.]
[Figure 4. Comparison between Uniform Sampling IRL and TRAVEL. 300 runs, 98% c.i.]
[Figure 5. TRAVEL using different IRL algorithms on random MDPs. 200 runs, 98% c.i.]

8.3. TRAVEL with Different IRL Algorithms

In this section, we show the performance of TRAVEL paired with different IRL algorithms: MaxGap-IRL, MaxEnt-IRL (Ziebart et al., 2008), Linear-IRL (Abbeel & Ng, 2004), and Random-IRL. Random-IRL selects a random reward function from the estimated feasible set of rewards. We compare the algorithms on 200 randomly generated MDPs. The results in Figure 5 show that MaxGap-IRL, Linear-IRL, and MaxEnt-IRL display a faster convergence rate than Random-IRL. This is the expected behavior, as these IRL algorithms choose the reward function in the feasible set with a meaningful criterion. However, the curve of Random-IRL still shows an improvement, proving that the feasible set shrinks, but it struggles harder to reach a near-zero error, as it likely selects less discriminating rewards. This underlines how a reasonable choice of the reward function within the feasible set can have a positive impact on performance.

9. Conclusions

In this paper, we have studied how to efficiently learn a transferable reward from a theoretical perspective. Using the concept of the feasible reward set, introduced by Ng & Russell (2000), we have derived novel bounds on the error of the reward function, given an error on the transition model and the expert's policy.
We have then obtained similar results on the performance in a target environment using the rewards recovered from a source environment, introducing new simulation lemmas. Based on these findings, we have proposed two algorithms, Uniform Sampling IRL and TRAVEL, which, given a generator model for the source MDP, decide the sampling strategy for querying the generator. These algorithms use an IRL algorithm, decided by the user, as choice function. We have derived from the Uniform Sampling IRL a sample complexity bound which, to the best of our knowledge, is the first sample complexity result for the IRL setting. TRAVEL, instead, adapts the sampling strategy to the specific environment at hand. Leveraging this characteristic of the algorithm, we have obtained a problemdependent bound on its sample complexity. Despite the limitations of the considered setting, we believe that this paper makes a first step towards a better understanding of the theoretical aspects of IRL. Many appealing future research directions arise. One central theoretical question is: (Q3) Is Inverse Reinforcement Learning intrinsically more difficult than Reinforcement Learning? Is the sample complexity r O SA p1 γ1q2p1 γq2ϵ2 tight? We are currently unable to answer. From an algorithmic perspective, our setting limits to tabular MDPs and assumes access to a generative model. Future investigations should include the extension to episode-based interaction and the introduction of function approximation techniques to cope with continuous problems. Provably Efficient Learning of Transferable Rewards Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Brodley, C. E. (ed.), Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, volume 69 of ACM International Conference Proceeding Series. ACM, 2004. doi: 10.1145/1015330. 1015430. Amin, K., Jiang, N., and Singh, S. P. Repeated inverse reinforcement learning. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 1815 1824, 2017. Azar, M. G., Munos, R., and Kappen, B. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress, 2012. Azar, M. G., Munos, R., and Kappen, H. J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3): 325 349, 2013. doi: 10.1007/s10994-013-5368-1. Boucheron, S., Lugosi, G., and Massart, P. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013. Boularias, A., Kober, J., and Peters, J. Relative entropy inverse reinforcement learning. In Gordon, G., Dunson, D., and Dud ık, M. (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pp. 182 189, Fort Lauderdale, FL, USA, 11 13 Apr 2011. JMLR Workshop and Conference Proceedings. Chatzigeorgiou, I. Bounds on the lambert function and their application to the outage analysis of user cooperation. IEEE Communications Letters, 17(8):1505 1508, 2013. Fu, J., Luo, K., and Levine, S. 
Learning robust rewards with adversarial inverse reinforcement learning. Co RR, abs/1710.11248, 2017. Hayat, A., Singh, U., and Namboodiri, V. P. Inforl: Interpretable reinforcement learning using information maximization. Co RR, abs/1905.10404, 2019. Ho, J. and Ermon, S. Generative adversarial imitation learning. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 4565 4573, 2016. Judah, K., Fern, A., and Dietterich, T. G. Active imitation learning via reduction to I.I.D. active learning. In de Freitas, N. and Murphy, K. P. (eds.), Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, August 14-18, 2012, pp. 428 437. AUAI Press, 2012. Juozapaitis, Z., Koul, A., Fern, A., Erwig, M., and Doshi Velez, F. Explainable reinforcement learning via reward decomposition. In IJCAI/ECAI Workshop on Explainable Artificial Intelligence, 2019. Kakade, S. M. and Langford, J. Approximately optimal approximate reinforcement learning. In Sammut, C. and Hoffmann, A. G. (eds.), Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12, 2002, pp. 267 274. Morgan Kaufmann, 2002. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Levine, S., Popovic, Z., and Koltun, V. Nonlinear inverse reinforcement learning with gaussian processes. In Shawe Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pp. 19 27, 2011. Likmeta, A., Metelli, A. M., Ramponi, G., Tirinzoni, A., Giuliani, M., and Restelli, M. Dealing with multiple experts and non-stationarity in inverse reinforcement learning: an application to real-life problems. Machine Learning, pp. 1 36, 2021. Lopes, M., Melo, F., and Montesano, L. Active learning for reward estimation in inverse reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 31 46. Springer, 2009. Metelli, A. M., Pirotta, M., and Restelli, M. Compatible reward inverse reinforcement learning. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 2050 2059, 2017. Metelli, A. M., Pirotta, M., and Restelli, M. On the use of the policy gradient and hessian in inverse reinforcement Provably Efficient Learning of Transferable Rewards learning. Intelligenza Artificiale, 14(1):117 150, 2020. doi: 10.3233/IA-180011. Ng, A. Y. and Russell, S. J. Algorithms for inverse reinforcement learning. In Langley, P. (ed.), Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pp. 663 670. Morgan Kaufmann, 2000. Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., and Peters, J. An algorithmic perspective on imitation learning. Found. Trends Robotics, 7(1-2):1 179, 2018. doi: 10.1561/2300000053. Pirotta, M. and Restelli, M. 
Inverse reinforcement learning through policy gradient minimization. In Schuurmans, D. and Wellman, M. P. (eds.), Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 1217, 2016, Phoenix, Arizona, USA, pp. 1993 1999. AAAI Press, 2016. Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Ramachandran, D. and Amir, E. Bayesian inverse reinforcement learning. In Veloso, M. M. (ed.), IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6-12, 2007, pp. 2586 2591, 2007. Ramponi, G., Drappo, G., and Restelli, M. Inverse reinforcement learning from a gradient-based learner. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020a. Ramponi, G., Likmeta, A., Metelli, A. M., Tirinzoni, A., and Restelli, M. Truly batch model-free inverse reinforcement learning about multiple intentions. In Chiappa, S. and Calandra, R. (eds.), The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 2628 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pp. 2359 2369. PMLR, 2020b. Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. Maximum margin planning. In Cohen, W. W. and Moore, A. W. (eds.), Machine Learning, Proceedings of the Twenty Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, volume 148 of ACM International Conference Proceeding Series, pp. 729 736. ACM, 2006a. doi: 10.1145/1143844.1143936. Ratliff, N. D., Bradley, D. M., Bagnell, J. A., and Chestnutt, J. E. Boosting structured prediction for imitation learning. In Sch olkopf, B., Platt, J. C., and Hofmann, T. (eds.), Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pp. 1153 1160. MIT Press, 2006b. Rockafellar, R. T. and Wets, R. J.-B. Variational analysis, volume 317. Springer Science & Business Media, 2009. Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In Teh, Y. W. and Titterington, D. M. (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, volume 9 of JMLR Proceedings, pp. 661 668. JMLR.org, 2010. Ross, S., Gordon, G. J., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Gordon, G. J., Dunson, D. B., and Dud ık, M. (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, volume 15 of JMLR Proceedings, pp. 627 635. JMLR.org, 2011. Russell, J. and Santos, E. Explaining reward functions in markov decision processes. In Bart ak, R. and Brawner, K. W. (eds.), Proceedings of the Thirty-Second International Florida Artificial Intelligence Research Society Conference, Sarasota, Florida, USA, May 19-22 2019, pp. 56 61. AAAI Press, 2019. Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. Nearoptimal time and sample complexities for solving markov decision processes with a generative model. In Bengio, S., Wallach, H. 
M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montr eal, Canada, pp. 5192 5202, 2018. Silver, D., Bagnell, J. A., and Stentz, A. Learning from demonstration for autonomous navigation in complex unstructured terrain. The International Journal of Robotics Research, 29(12):1565 1592, 2010. Sutton, R. S., Barto, A. G., et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998. Zanette, A., Kochenderfer, M. J., and Brunskill, E. Almost horizon-free structure-aware best policy identification with a generative model. In Wallach, H. M., Larochelle, Provably Efficient Learning of Transferable Rewards H., Beygelzimer, A., d Alch e-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 5626 5635, 2019. Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In Fox, D. and Gomes, C. P. (eds.), Proceedings of the Twenty Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, pp. 1433 1438. AAAI Press, 2008. Ziebart, B. D., Bagnell, J. A., and Dey, A. K. Modeling interaction via the principle of maximum causal entropy. In F urnkranz, J. and Joachims, T. (eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 1255 1262. Omnipress, 2010.