Iterative Value-Aware Model Learning

Amir-massoud Farahmand
Vector Institute, Toronto, Canada
farahmand@vectorinstitute.ai

Abstract

This paper introduces a model-based reinforcement learning (MBRL) framework that incorporates the underlying decision problem in learning the transition model of the environment. This is in contrast with conventional approaches to MBRL that learn the model of the environment, for example by finding the maximum likelihood estimate, without taking into account the decision problem. The Value-Aware Model Learning (VAML) framework argues that this might not be a good idea, especially if the true model of the environment does not belong to the model class from which we are estimating the model. The original VAML framework, however, may result in an optimization problem that is difficult to solve. This paper introduces a new class of MBRL algorithms, called Iterative VAML, that benefits from the structure of how the planning is performed (i.e., through approximate value iteration) to devise a simpler optimization problem. The paper theoretically analyzes Iterative VAML and provides a finite-sample error upper bound guarantee for it.

1 Introduction

Value-Aware Model Learning (VAML) is a novel framework for learning the model of the environment in Model-Based Reinforcement Learning (MBRL) [Farahmand et al., 2017a, 2016a]. The conventional approach to model learning in MBRL is based on minimizing some kind of probabilistic loss. A common choice is to minimize the KL-divergence between the empirical data and the model, which leads to the Maximum Likelihood Estimator (MLE). Farahmand et al. [2017a, 2016a] argue that minimizing a probabilistic loss function might not be a good idea because it does not take into account the underlying decision problem. Any knowledge about the reward, value function, or policy is ignored in the conventional model learning approaches in MBRL (some recent exceptions are Joseph et al. [2013], Silver et al. [2017], Oh et al. [2017], Farquhar et al. [2018]; refer to the supplementary material for a detailed literature review of MBRL).

The main thesis behind decision-aware model learning, including VAML, is that the knowledge about the underlying decision problem, which is often available, should be considered in the model learning itself. VAML, as its name suggests, uses the information about the value function. In particular, the formulation by Farahmand et al. [2017a] incorporates the knowledge about the value function space in learning the model.

In this work, we suggest an alternative, and possibly simpler, approach called Iterative VAML (IterVAML, for short). VAML defines a robust loss function and has a $\min_{\hat{\mathcal{P}} \in \mathcal{M}} \max_{V \in \mathcal{F}}$ structure, where $\mathcal{M}$ is the transition probability model of the environment to be learned and $\mathcal{F}$ is the function space to which the value function belongs (we discuss this in more detail in Section 2). Solving this min-max optimization can be difficult in general, unless we impose some structure on $\mathcal{F}$, e.g., a linear function space. IterVAML mitigates this issue by benefiting from the special structure of how value functions are generated within the approximate value iteration (AVI) framework (Section 3).

Homepage: http://academic.sologen.net. Part of this work has been done when the author was affiliated with Mitsubishi Electric Research Laboratories (MERL), Cambridge, USA.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
We theoretically analyze IterVAML (Section 4). We provide a finite-sample error upper bound guarantee for the model learning that shows the effect of the number of samples and the complexity of the model on the error bound (Section 4.1). We also analyze how the errors in the learned model affect the quality of the outcome policy. This is in the form of an error propagation result (Section 4.2).

2 Background on Value-Aware Model Learning

To formalize the framework, let us consider a discounted Markov Decision Process (MDP) $(\mathcal{X}, \mathcal{A}, \mathcal{R}^*, \mathcal{P}^*, \gamma)$ [Szepesvári, 2010]. Here $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{R}^*$ is the reward distribution, $\mathcal{P}^*$ is the transition probability kernel, and $0 \le \gamma < 1$ is the discount factor. In the RL setting, $\mathcal{P}^*$ and $\mathcal{R}^*$ are not known to the agent. Instead, the agent can interact with the environment to collect samples from these distributions. The collected data is in the form of

$$D_n = \{(X_i, A_i, R_i, X'_i)\}_{i=1}^{n}, \qquad (1)$$

with the current state-action being distributed according to $Z_i = (X_i, A_i) \sim \nu(\mathcal{X} \times \mathcal{A}) \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$, the reward $R_i \sim \mathcal{R}^*(\cdot|X_i, A_i)$, and the next-state $X'_i \sim \mathcal{P}^*(\cdot|X_i, A_i)$. We denote the expected reward by $r(x,a) = \mathbb{E}\left[\mathcal{R}^*(\cdot|x,a)\right]$ [2].

The goal of model learning is to find a $\hat{\mathcal{P}}$ that is close to $\mathcal{P}^*$ [3]. The learned model $\hat{\mathcal{P}}$ is then used by an MBRL algorithm to find a policy. To formalize this, let us denote Planner as an algorithm that receives a model $\hat{\mathcal{P}}$ and returns a policy, i.e., $\pi \leftarrow \text{Planner}(\hat{\mathcal{P}})$. We assume that the reward function is already known to Planner, so we do not explicitly pass it as an argument. There are many variations on how Planner may use the learned model to obtain a new policy. For example, Planner might be a value function-based approach that computes an estimate of the optimal value function based on $\hat{\mathcal{P}}$, and then returns the greedy policy of the estimated value function. Or it might be a policy gradient method that computes the gradient of the performance with respect to (w.r.t.) the policy using the learned model.

A fundamental question is how we should measure the closeness of $\hat{\mathcal{P}}$ to $\mathcal{P}^*$. The answer to this question depends on how Planner is going to use the model. It is possible that some aspects of the dynamics are irrelevant to Planner. The usual approaches based on probabilistic losses, such as the KL-divergence that leads to MLE, ignore this dependency. Therefore, they might be less efficient than an approach that considers how Planner is going to use the learned model.

VAML, introduced by Farahmand et al. [2016a, 2017a], is a value-based approach and assumes that Planner uses the Bellman optimality operator defined based on $\hat{\mathcal{P}}$ to find a $\hat{Q}^*$, that is

$$T^*_{\hat{\mathcal{P}}}: Q \mapsto r + \gamma \hat{\mathcal{P}} \max_a Q, \qquad (2)$$

and then outputs $\pi = \hat{\pi}(\cdot; \hat{Q}^*)$, the greedy policy w.r.t. $\hat{Q}^*$ defined as $\hat{\pi}(x; Q) = \operatorname{argmax}_{a \in \mathcal{A}} Q(x, a)$. For brevity, we sometimes use $\hat{T}^*$ instead of $T^*_{\hat{\mathcal{P}}}$. The use of the Bellman [optimality] operator is central to value-based approaches such as the family of (Approximate) Value Iteration [Gordon, 1995, Szepesvári and Smart, 2004, Ernst et al., 2005, Munos and Szepesvári, 2008, Farahmand et al., 2009, Farahmand and Precup, 2012, Mnih et al., 2015, Tosatto et al., 2017, Farahmand et al., 2017b] or (Approximate) Policy Iteration (API) algorithms [Lagoudakis and Parr, 2003, Antos et al., 2008, Bertsekas, 2011, Lazaric et al., 2012, Scherrer et al., 2012, Farahmand et al., 2016b]. VAML focuses on finding $\hat{\mathcal{P}}$ such that the difference between $T^* Q$ and $\hat{T}^* Q$ is small.
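As a concrete illustration of the operator in (2), the following is a minimal sketch for a finite MDP, where the learned model and the expected reward are stored as arrays; the array layout and the function name are illustrative assumptions, not part of the paper:

```python
import numpy as np

def bellman_optimality_operator(Q, r, P_hat, gamma):
    """Apply T*_{P_hat} from (2) for a finite MDP:
    (T*_{P_hat} Q)(x, a) = r(x, a) + gamma * sum_{x'} P_hat(x'|x, a) * max_{a'} Q(x', a').

    Q:     (S, A) array of action-values
    r:     (S, A) array of expected rewards
    P_hat: (S, A, S) array, learned transition kernel P_hat(x'|x, a)
    gamma: discount factor, 0 <= gamma < 1
    """
    V = Q.max(axis=1)             # V(x') = max_{a'} Q(x', a')
    return r + gamma * P_hat @ V  # batched (A, S) x (S,) products, one per state
```

Iterating this map to (approximate) convergence and returning the greedy policy of the resulting $\hat{Q}^*$ is one instance of such a Planner.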
VAML starts from assuming that $V$ is known and defines the pointwise loss (or cost) between $\hat{\mathcal{P}}$ and $\mathcal{P}^*$ as

$$c(\hat{\mathcal{P}}, \mathcal{P}^*; V)(x,a) = \left\langle \mathcal{P}^*(\cdot|x,a) - \hat{\mathcal{P}}(\cdot|x,a),\, V \right\rangle \triangleq \int \left[ \mathcal{P}^*(\mathrm{d}x'|x,a) - \hat{\mathcal{P}}(\mathrm{d}x'|x,a) \right] V(x'), \qquad (3)$$

in which we substituted $\max_a Q(\cdot, a)$ in (2) with $V$ to simplify the presentation.

[2] Given a set $\Omega$ and its $\sigma$-algebra $\sigma_\Omega$, $\mathcal{M}(\Omega)$ refers to the set of all probability distributions defined over $\sigma_\Omega$. As we do not get involved in the measure theoretic issues in this paper, we do not explicitly define the $\sigma$-algebra, and simply use a well-defined and standard one, e.g., Borel sets defined for metric spaces.
[3] Learning the expected reward $r$ is also a part of model learning, which can be formulated as a regression problem. For simplicity of presentation, we assume that $r$ is known.

In the rest of the paper, we sometimes use $\mathcal{P}_z(\cdot)$ with $z = (x,a) \in \mathcal{Z} = \mathcal{X} \times \mathcal{A}$ to refer to the probability distribution $\mathcal{P}(\cdot|x,a)$, so $\mathcal{P}_z V = \int \mathcal{P}(\mathrm{d}x'|x,a) V(x')$. Given a probability distribution $\nu \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$, which can be the same distribution as the data generating one, VAML defines the expected loss function

$$c^2_{2,\nu}(\hat{\mathcal{P}}, \mathcal{P}^*; V) = \int \mathrm{d}\nu(x,a) \left| \int \left[ \mathcal{P}^*(\mathrm{d}x'|x,a) - \hat{\mathcal{P}}(\mathrm{d}x'|x,a) \right] V(x') \right|^2. \qquad (4)$$

Notice that the value function $V$ is unknown, so we cannot readily minimize this loss function, or its empirical version. What differentiates this work from the original VAML formulation is how this unknown $V$ is handled. VAML takes a robust approach: It considers the worst-case choice of $V$ in the value function space $\mathcal{F}$ that is used by Planner. Therefore, it minimizes

$$c^2_{2,\nu}(\hat{\mathcal{P}}, \mathcal{P}^*) = \int \mathrm{d}\nu(x,a) \sup_{V \in \mathcal{F}} \left| \int \left[ \mathcal{P}^*(\mathrm{d}x'|x,a) - \hat{\mathcal{P}}(\mathrm{d}x'|x,a) \right] V(x') \right|^2. \qquad (5)$$

As argued by Farahmand et al. [2017a], this is still a tighter objective to minimize than the KL-divergence. To see this, consider a fixed $z = (x,a)$. We have

$$\sup_{V \in \mathcal{F}} \left| \left\langle \mathcal{P}^*(\cdot|x,a) - \hat{\mathcal{P}}(\cdot|x,a),\, V \right\rangle \right| \le \left\| \mathcal{P}^*_z - \hat{\mathcal{P}}_z \right\|_1 \sup_{V \in \mathcal{F}} \|V\|_\infty \le \sqrt{2\,\mathrm{KL}(\mathcal{P}^*_z \,\|\, \hat{\mathcal{P}}_z)}\, \sup_{V \in \mathcal{F}} \|V\|_\infty,$$

where we used Pinsker's inequality. As MLE is the minimizer of the KL-divergence based on data, these upper bounds suggest that if we find a good MLE (with small KL-divergence), we also have an accurate Bellman operator. These sequences of upper bounding, however, might be quite loose. For an extreme, but instructive, example, suppose that the value function space consists of bounded constant functions ($\mathcal{F} = \{ x \mapsto c : |c| < \infty \}$). In that case, $\sup_{V \in \mathcal{F}} |\langle \mathcal{P}^*(\cdot|x,a) - \hat{\mathcal{P}}(\cdot|x,a), V \rangle|$ is always zero, no matter how large the total variation and the KL-divergence of the two distributions are. MLE does not explicitly benefit from these interactions of the value function and the model. Asadi et al. [2018] show that the VAML objective $\sup_{V \in \mathcal{F}} |\langle \mathcal{P}^*(\cdot|x,a) - \hat{\mathcal{P}}(\cdot|x,a), V \rangle|$ with the choice of 1-Lipschitz functions for $\mathcal{F}$ is equivalent to the Wasserstein metric between $\mathcal{P}^*(\cdot|x,a)$ and $\hat{\mathcal{P}}(\cdot|x,a)$. Refer to Farahmand et al. [2017a] for more detail and discussion about VAML and its properties.

The loss function (5) defines the population version of the VAML loss. The empirical version, which is minimized in practice, replaces $\mathcal{P}^*$ and $\nu$ by their empirical distributions. The result is

$$c^2_{2,n}(\hat{\mathcal{P}}) = \frac{1}{n} \sum_{(X_i, A_i, X'_i) \in D_n} \sup_{V \in \mathcal{F}} \left| V(X'_i) - \int \hat{\mathcal{P}}(\mathrm{d}x'|X_i, A_i) V(x') \right|^2. \qquad (6)$$

The estimated probability distribution $\hat{\mathcal{P}}$ is obtained by solving the following optimization problem:

$$\hat{\mathcal{P}} \leftarrow \operatorname*{argmin}_{\mathcal{P} \in \mathcal{M}} c^2_{2,n}(\mathcal{P}). \qquad (7)$$

Farahmand et al. [2017a] provide an expression for the gradient of this objective function when $\mathcal{F}$ is the space of linear function approximators, commonly used in the RL literature, and $\mathcal{M}$ is an exponential family. They also provide a finite sample error upper bound guarantee showing that the minimizer of (6) converges to the minimizer of (5).
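To make the worst-case objective (6) more concrete, here is a minimal sketch for a finite MDP and the norm-ball linear value class $\mathcal{F} = \{ x \mapsto \phi(x)^\top w : \|w\|_2 \le B \}$, for which the inner supremum has the closed form $B\,\|\phi(X'_i) - \mathbb{E}_{x' \sim \hat{\mathcal{P}}(\cdot|X_i,A_i)}[\phi(x')]\|_2$ by the Cauchy-Schwarz inequality; the array layout and names are illustrative assumptions:

```python
import numpy as np

def empirical_vaml_loss(P_hat, phi, S_idx, A_idx, Snext_idx, B=1.0):
    """Empirical VAML loss (6) for a finite MDP and the linear class
    F = { x -> phi(x)^T w : ||w||_2 <= B }, where the inner sup equals
    B * || phi(X'_i) - E_{x' ~ P_hat(.|X_i, A_i)}[phi(x')] ||_2.

    P_hat:   (S, A, S) candidate transition model
    phi:     (S, d) feature matrix, phi[s] = phi(s)
    S_idx, A_idx, Snext_idx: length-n integer arrays encoding (X_i, A_i, X'_i)
    """
    pred_feat = P_hat[S_idx, A_idx] @ phi   # (n, d): expected next-state features under P_hat
    residual = phi[Snext_idx] - pred_feat   # (n, d): per-sample worst-case direction
    return (B ** 2) * np.mean(np.sum(residual ** 2, axis=1))
```

For nonlinear $\mathcal{F}$ the inner supremum generally has no such closed form, which is part of what makes the min-max problem (7) difficult and what the formulation introduced next avoids.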
3 Iterative Value-Aware Model Learning

In this section we describe an alternative approach to formulating a value-aware model learning method. As opposed to the original formulation (5), and its empirical version (6), it is not based on a worst-case formulation. Instead, it defines the loss function based on the actual sequence of value functions generated by an (Approximate) Value Iteration (AVI) type of Planner.

Consider the Value Iteration (VI) procedure: At iteration $k = 0, 1, \dots$,

$$Q_{k+1}(x,a) \leftarrow r(x,a) + \gamma \int \mathcal{P}^*(\mathrm{d}x'|x,a) \max_{a'} Q_k(x', a'), \qquad (8)$$

or more succinctly, $Q_{k+1} \leftarrow T^*_{\mathcal{P}^*} Q_k \triangleq r + \gamma \mathcal{P}^* V_k$, with $V_k(x) \triangleq \max_a Q_k(x,a)$. Here $T^*_{\mathcal{P}^*}$ is the Bellman optimality operator defined based on the true transition probability kernel $\mathcal{P}^*$ (we similarly define $T^*_{\mathcal{P}}$ for any generic transition probability kernel $\mathcal{P}$). Because of the contraction property of the Bellman optimality operator for discounted MDPs, we have $Q_k \to Q^*$ as $k \to \infty$. This is the basis of the VI procedure.

The intuition behind IterVAML can be developed by studying the sequence $Q_0, Q_1, \dots$ generated by VI. Starting from $Q_0 \leftarrow r$, VI generates the following sequence:

$$\begin{aligned}
Q_0 &\leftarrow r, \\
Q_1 &\leftarrow T^*_{\mathcal{P}^*} Q_0 = r + \gamma \mathcal{P}^* V_0 = r + \gamma \mathcal{P}^* r, \\
Q_2 &\leftarrow T^*_{\mathcal{P}^*} Q_1 = r + \gamma \mathcal{P}^* V_1, \\
&\;\;\vdots
\end{aligned}$$

To obtain the value of $Q_0$, we do not need the knowledge of $\mathcal{P}^*$. To obtain the value of $Q_1$, we only need to compute $\mathcal{P}^* V_0 = \mathcal{P}^* r$. If we find a $\hat{\mathcal{P}}$ such that $\hat{\mathcal{P}} r = \mathcal{P}^* r$, we may replace it with $\mathcal{P}^*$ to obtain $Q_1$ without any error, i.e., $Q_1 = r + \gamma \mathcal{P}^* r = r + \gamma \hat{\mathcal{P}} r$. Likewise, for any $k \ge 1$ and given $Q_k$, in order to compute $Q_{k+1}$ exactly we only need to find a $\hat{\mathcal{P}}$ such that $\hat{\mathcal{P}} V_k = \mathcal{P}^* V_k$.

We have two sources of errors though. The first is that we may only guarantee that $\hat{\mathcal{P}} V_k \approx \mathcal{P}^* V_k$. As a result, $T^*_{\hat{\mathcal{P}}} V_k$ is not exactly the same as $T^*_{\mathcal{P}^*} V_k$. IterVAML's goal is to make the error between $\hat{\mathcal{P}} V_k$ and $\mathcal{P}^* V_k$ as small as possible. Based on this, we define the following idealized optimization problem: Given a model space $\mathcal{M}$ and the current approximation of the value function $\hat{Q}_k$ (and therefore $\hat{V}_k$), solve

$$\hat{\mathcal{P}}^{(k)} \leftarrow \operatorname*{argmin}_{\mathcal{P} \in \mathcal{M}} \left\| (\mathcal{P} - \mathcal{P}^*) \hat{V}_k \right\|_2^2 = \int \left| \int (\mathcal{P} - \mathcal{P}^*)(\mathrm{d}x'|z) \max_{a'} \hat{Q}_k(x', a') \right|^2 \mathrm{d}\nu(z), \qquad (9)$$

where $\nu \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$ is a user-defined distribution over the state-action space. Oftentimes, this distribution is the same as the empirical distribution generating $D_n$ (1), hence our use of the same notation. Afterwards, we use $\hat{\mathcal{P}}^{(k)}$ to find $\hat{V}_{k+1}$ by using the usual VI approach, that is,

$$\hat{Q}_{k+1} \leftarrow T^*_{\hat{\mathcal{P}}^{(k)}} \hat{Q}_k. \qquad (10)$$

This procedure is repeated. This is using the exact VI based on $\hat{\mathcal{P}}^{(k)}$.

This formulation is idealized as we do not have access to $\mathcal{P}^*$ and $\nu$, but only samples from them. We use the empirical distribution instead:

$$\hat{\mathcal{P}}^{(k+1)} \leftarrow \operatorname*{argmin}_{\mathcal{P} \in \mathcal{M}} \frac{1}{n} \sum_{(X_i, A_i, X'_i) \in D_n} \left| \hat{V}_k(X'_i) - \int \mathcal{P}(\mathrm{d}x'|X_i, A_i) \hat{V}_k(x') \right|^2. \qquad (11)$$

The optimization problem minimizes the distance between the next-state expectation of $\hat{V}_k$ according to $\mathcal{P}$ and the samples $\hat{V}_k(X')$ with $X'$ being drawn from the true next-state distribution. In case the integral is difficult to compute, we may replace it by samples from $\mathcal{P}$, i.e.,

$$\int \mathcal{P}(\mathrm{d}x'|X_i, A_i) \hat{V}_k(x') \approx \frac{1}{m} \sum_{j=1}^{m} \hat{V}_k(X'_{i,j}), \qquad \text{with } X'_{i,j} \sim \mathcal{P}(\cdot|X_i, A_i) \text{ for } j = 1, \dots, m.$$

These are virtual samples generated from the model.

The second source of error is that the VI cannot be performed exactly (for example because the state space is very large), and can only be performed approximately. This leads to the Approximate Value Iteration (AVI) procedure (also known as Fitted Value or Q-Iteration) with a function space $\mathcal{F}^{|\mathcal{A}|}$ (the space of action-value functions), see e.g., Ernst et al. [2005], Munos and Szepesvári [2008], Farahmand et al. [2009], Farahmand and Precup [2012], Mnih et al. [2015], Tosatto et al. [2017].
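Before turning to the AVI step, here is a minimal sketch of the model-fitting objective (11) with the integral replaced by virtual samples. The model's `sample(x, a, size)` method and the other names are illustrative assumptions rather than an interface specified in the paper:

```python
import numpy as np

def itervaml_model_loss(model, V_k, transitions, m=8):
    """Empirical IterVAML objective (11), with the integral over the candidate
    model approximated by m virtual next-state samples.

    model:       candidate transition model from M; sample(x, a, size) is an
                 assumed interface returning next states drawn from P(.|x, a)
    V_k:         callable, current value estimate V_k(x')
    transitions: iterable of (X_i, A_i, X'_i) collected from the true environment
    """
    losses = []
    for x, a, x_next in transitions:
        virtual_next = model.sample(x, a, size=m)                  # X'_{i,j} ~ P(.|X_i, A_i)
        model_expectation = np.mean([V_k(xp) for xp in virtual_next])
        losses.append((V_k(x_next) - model_expectation) ** 2)
    return float(np.mean(losses))  # to be minimized over the model space M
```

In practice this scalar loss would be minimized over the parameters of the model class, e.g., by gradient-based optimization when the expectation under the model is available in closed form or is differentiable.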
Instead of setting $\hat{Q}_{k+1} \leftarrow T^*_{\hat{\mathcal{P}}^{(k)}} \hat{Q}_k$, in AVI we have

$$\hat{Q}_{k+1} \leftarrow \operatorname*{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \frac{1}{n} \sum_{(X_i, A_i, R_i) \in D_n} \left| Q(X_i, A_i) - \left( R_i + \gamma \int \hat{\mathcal{P}}^{(k+1)}(\mathrm{d}x'|X_i, A_i) \hat{V}_k(x') \right) \right|^2. \qquad (12)$$

Notice that finding $\hat{Q}_{k+1}$ is a regression problem, for which many methods are available, including the regularized variants of this empirical risk minimization problem [Farahmand et al., 2009]. As before, we may replace the integral with virtual samples, i.e.,

$$\int \hat{\mathcal{P}}^{(k+1)}(\mathrm{d}x'|X_i, A_i) \hat{V}_k(x') \approx \frac{1}{m} \sum_{j=1}^{m} \hat{V}_k(X'_{i,j}), \qquad \text{with } X'_{i,j} \sim \hat{\mathcal{P}}^{(k+1)}(\cdot|X_i, A_i) \text{ for } j = 1, \dots, m, \text{ for each } i = 1, \dots, n.$$

These virtual samples from the model play the same role as the hypothetical experience in the Dyna architecture [Sutton, 1990] or imagination in imagination-augmented agents by Racanière et al. [2017].

Algorithm 1 summarizes a generic IterVAML procedure. The algorithm receives a model space $\mathcal{M}$, the action-value function space $\mathcal{F}^{|\mathcal{A}|}$, the space of reward functions $\mathcal{G}$, and the number of iterations $K$. At each iteration $k = 0, 1, \dots$, it generates a fresh training dataset $D^{(k)}_n = \{(X_i, A_i, R_i, X'_i)\}_{i=1}^{n}$ by interacting with the environment. It learns the transition model $\hat{\mathcal{P}}^{(k+1)}$ by solving (11). It also learns the reward function $\hat{r}$ by minimizing $\text{Loss}^{\mathrm{R}}$, which can be the squared loss (or a robust variant). We do not analyze learning the reward function in this work. Afterwards, it performs one step of AVI by solving (12). These steps are repeated for $K$ iterations.

Algorithm 1: Model-based Reinforcement Learning Algorithm with Iterative VAML
// MDP $(\mathcal{X}, \mathcal{A}, \mathcal{R}^*, \mathcal{P}^*, \gamma)$
// $K$: Number of iterations
// $\mathcal{M}$: Space of transition probability kernels
// $\mathcal{F}^{|\mathcal{A}|}$: Space of action-value functions
// $\mathcal{G}$: Space of reward functions
Initialize a policy $\pi_0$ and a value function $\hat{V}_0$.
for $k = 0$ to $K-1$ do
  Generate training set $D^{(k)}_n = \{(X_i, A_i, R_i, X'_i)\}_{i=1}^{n}$ by interacting with the true environment (potentially using $\pi_k$), i.e., $(X_i, A_i) \sim \nu_k$ with $X'_i \sim \mathcal{P}^*(\cdot|X_i, A_i)$ and $R_i \sim \mathcal{R}^*(\cdot|X_i, A_i)$.
  $\hat{\mathcal{P}}^{(k+1)} \leftarrow \operatorname{argmin}_{\mathcal{P} \in \mathcal{M}} \sum_{\cup_{i=0}^{k} D^{(i)}_n} \big| \hat{V}_k(X'_i) - \int \mathcal{P}(\mathrm{d}x'|X_i, A_i) \hat{V}_k(x') \big|^2$.
  $\hat{r} \leftarrow \operatorname{argmin}_{r \in \mathcal{G}} \text{Loss}^{\mathrm{R}}(r; \cup_{i=0}^{k} D^{(i)}_n)$.
  $\hat{Q}_{k+1} \leftarrow \operatorname{argmin}_{Q \in \mathcal{F}^{|\mathcal{A}|}} \sum_{\cup_{i=0}^{k} D^{(i)}_n} \big| Q(X_i, A_i) - \big( \hat{r}(X_i, A_i) + \gamma \int \hat{\mathcal{P}}^{(k+1)}(\mathrm{d}x'|X_i, A_i) \hat{V}_k(x') \big) \big|^2$.
  $\pi_{k+1} \leftarrow \hat{\pi}(\cdot; \hat{Q}_{k+1})$.
end for
return $\pi_K$

Many variations of this algorithm are possible. We briefly remark on some of them. Here the AVI step only uses the model $\hat{\mathcal{P}}$. In practice, however, one may use both the learned model $\hat{\mathcal{P}}$ and the data $D_n$ in solving the optimization problem (12) in order to obtain better solutions, as in the Dyna architecture [Sutton, 1990]. Moreover, the summation in (12) is over $D_n$ (or $\cup_{i=0}^{k} D^{(i)}_n$ as stated in the algorithm), which is the dataset of true samples. If we also learn a distribution model $\hat{\nu}$ of $\nu$, we can sample from it too. In that case, we can increase the number of samples used in solving the regression step of IterVAML. We have more discussion about the algorithm in the supplementary material. We can also similarly define a policy evaluation version of IterVAML.

4 Theoretical Analysis of Iterative VAML

We analyze the statistical properties of the IterVAML procedure. The analysis is divided into two parts. First we analyze one iteration of model learning (cf. (11)) and provide an upper bound on the error in learning the model (Theorem 1 in Section 4.1). Afterwards, we consider how errors at each iteration propagate throughout the iterations of IterVAML and affect the quality of the learned policy (Theorem 2 in Section 4.2). Theorem 3 combines these two results and shows how the model learning errors affect the quality of the outcome policy. The proofs and more extensive discussions are all deferred to the extended version of the paper, which is provided as supplementary material.

4.1 Error Analysis for a Single Iteration

We analyze the $k$-th iteration of IterVAML (11) and provide an error bound on $\|(\hat{\mathcal{P}}^{(k+1)} - \mathcal{P}^*) V_k\|_2$. To reduce clutter, we do not specify the iteration index $k$, e.g., the analyzed loss would be denoted by $\|(\hat{\mathcal{P}} - \mathcal{P}^*) V\|_2$ for a fixed $V$.
Consider a fixed value function $V : \mathcal{X} \to \mathbb{R}$. We are given a dataset $D_n = \{(X_i, A_i, X'_i)\}_{i=1}^{n}$ with $Z_i = (X_i, A_i) \sim \nu(\mathcal{X} \times \mathcal{A}) \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$, and the next-state $X'_i \sim \mathcal{P}^*(\cdot|X_i, A_i)$, as specified in (1). We now list our set of assumptions. Some of them are technical assumptions to simplify the analysis, and some characterize crucial aspects of the model learning. We shall remark on these as we introduce them.

Assumption A1 (Samples) At the $k$-th iteration we are given a dataset $D_n$ ($= D^{(k)}_n$)

$$D_n = \{(X_i, A_i, X'_i)\}_{i=1}^{n}, \qquad (13)$$

with $Z_i = (X_i, A_i)$ being independent and identically distributed (i.i.d.) samples drawn from $\nu(\mathcal{X} \times \mathcal{A}) \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$ and the next-state $X'_i \sim \mathcal{P}^*(\cdot|X_i, A_i)$. Furthermore, we assume that $D^{(k)}_n$ and $D^{(k')}_n$ for $k \ne k'$ are independent.

The i.i.d. assumption is to simplify the analysis, and with extra effort one can provide similar results for dependent processes that gradually forget their past. The forgetting behaviour can be characterized by the mixing behaviour of the stochastic process [Doukhan, 1994]. One can then provide statistical guarantees for learning algorithms under various mixing conditions [Yu, 1994, Meir, 2000, Steinwart and Christmann, 2009, Mohri and Rostamizadeh, 2009, 2010, Farahmand and Szepesvári, 2012]. In this assumption we also require that the datasets of two different iterations are independent. This is again to simplify the analysis. In practice, we might reuse the same dataset in all iterations. Theoretical results by Munos and Szepesvári [2008] suggest that the dependence between iterations may not lead to significant performance degradation.

We need to make some assumptions about the model space $\mathcal{M}$ and its complexity (i.e., capacity). We use the covering number (and its logarithm, i.e., metric entropy) of a function space (here the model space $\mathcal{M}$) as the characterizer of its complexity. The covering number at resolution $\varepsilon$ is the minimum number of balls with radius $\varepsilon$ required to cover the model space $\mathcal{M}$ according to a particular metric, and is denoted by $\mathcal{N}(\varepsilon, \mathcal{M})$ (see the supplementary material for definitions). As $\varepsilon$ decreases, the covering number increases (or more accurately, the covering number is non-decreasing). For example, the covering number for a $p$-dimensional linear function approximator with a constraint on the magnitude of its functions behaves like $O(\frac{1}{\varepsilon^p})$. A similar result holds when the subgraphs of a function space have a VC-dimension $p$. Model spaces whose covering number grows faster are more complex, and estimating a function within them is more difficult. This leads to larger estimation error, as we shall see. On the other hand, those model spaces often (but not always) have better model approximation properties too.

In order to show the finer behaviour of the error bound, we define $\mathcal{M}$ as a subset of a larger family of probability distributions $\mathcal{M}_0$. Let $J : \mathcal{M}_0 \to [0, \infty)$ be a pseudo-norm defined on functions in $\mathcal{M}_0$. We then define $\mathcal{M} = \{ \mathcal{P} \in \mathcal{M}_0 : J(\mathcal{P}) \le R \}$ for some $R > 0$. One may think of $J$ as a measure of complexity of functions in $\mathcal{M}_0$, so $\mathcal{M}$ would be a ball with a fixed radius $R$ w.r.t. $J$.
If $\mathcal{M}_0$ is defined based on a reproducing kernel Hilbert space (RKHS), we can think of $J$ as the inner product norm of the RKHS.

Assumption A2 (Model Space) For $R > 0$, let $\mathcal{M} = \mathcal{M}_R = \{ \mathcal{P} \in \mathcal{M}_0 : J(\mathcal{P}) \le R \}$. There exist constants $c > 0$ and $0 < \alpha < 1$ such that for any $\varepsilon, R > 0$ and all sequences $z_{1:n} \triangleq z_1, \dots, z_n \in \mathcal{X} \times \mathcal{A}$, the following metric entropy condition is satisfied:

$$\log \mathcal{N}\left(\varepsilon, \mathcal{M}, L_2(z_{1:n})\right) \le c \left( \frac{R}{\varepsilon} \right)^{2\alpha}.$$

Furthermore, the model space $\mathcal{M}$ is convex, and compact w.r.t. $d_{\infty,\mathrm{TV}}(\mathcal{P}_1, \mathcal{P}_2) = \sup_{z \in \mathcal{Z}} \int |\mathcal{P}_1(\mathrm{d}y|z) - \mathcal{P}_2(\mathrm{d}y|z)|$.

This form of the metric entropy of $\mathcal{M} = \mathcal{M}_R$ is suitable to capture the complexity of large function spaces such as some RKHS and Sobolev spaces. For example, for $W^k(\mathbb{R}^d) = W^{k,2}(\mathbb{R}^d)$, the Sobolev space defined w.r.t. the $L_2$-norm of the weak derivatives up to order $k$, we can set $\alpha = \frac{d}{2k}$, see e.g., Lemma 20.6 of Györfi et al. [2002]. For smaller function spaces, such as the $p$-dimensional linear function approximator mentioned above, the behaviour of the metric entropy is $p \log(\frac{1}{\varepsilon})$, which can be seen as having $\alpha \to 0$ at a certain rate. For many examples of covering number and metric entropy results, refer to van de Geer [2000], Györfi et al. [2002], Zhou [2003], Steinwart and Christmann [2008], Giné and Nickl [2015]. Also note that here we require the convexity and compactness of $\mathcal{M}$. The convexity is a crucial assumption for obtaining a fast convergence rate, as was shown and discussed by Lee et al. [1998, 2008], Mendelson [2008]. The compactness w.r.t. this particular metric is a technical assumption and it may be possible to relax it.

Assumption A3 (Value Function) The value function $V$ is fixed (i.e., not dependent on $D_n$) and is $V_{\max}$-bounded with $V_{\max} \ge 1$.

This assumption is for the simplicity of the analysis. We use it in large deviation results that require the boundedness of the involved random variables, e.g., Theorem 2.1 of Bartlett et al. [2005] or Theorem 19.1 of Györfi et al. [2002], which we use in our proofs.

We are now ready to state the main result of this section.

Theorem 1. Suppose that Assumptions A1, A2, and A3 hold. Consider $\hat{\mathcal{P}}$ obtained by solving (11). There exists a finite $c(\alpha) > 0$, depending only on $\alpha$, such that for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\left\| (\hat{\mathcal{P}}_z - \mathcal{P}^*_z) V \right\|_{2,\nu}^2 \le \inf_{\mathcal{P} \in \mathcal{M}} \left\| (\mathcal{P}_z - \mathcal{P}^*_z) V \right\|_{2,\nu}^2 + c(\alpha)\, V_{\max}^2\, R^{\frac{2\alpha}{1+\alpha}} \sqrt{\frac{\ln(1/\delta)}{n^{\frac{2}{1+\alpha}}}}.$$

This result upper bounds the error of $\hat{\mathcal{P}}$ in approximating the next-state expectation of the value function. The upper bound has the model (or function) approximation error (the first term) and the estimation error (the second term). It is notable that the constant in front of the model approximation error is one, so the best we can hope from this algorithm, in the limit of infinite data, is to be as good as the best model in the model class $\mathcal{M}$.

The estimation error behaves as $n^{-1/(1+\alpha)}$ ($0 < \alpha < 1$). This is a fast rate and can reach $n^{-1}$ whenever $\alpha \to 0$. We do not know whether this is an optimal rate for this particular problem, but results from regression with the least-squares loss suggest that it might indeed be optimal: For a regression function belonging to a function space that has a packing entropy in the same form as in the upper bound of Assumption A2, the rate $\Omega(n^{-1/(1+\alpha)})$ is its minimax lower bound [Yang and Barron, 1999].

We use local Rademacher complexity and analyze the modulus of continuity of empirical processes to obtain rates faster than what could be achieved by more conventional techniques of analyzing the supremum of the empirical process.
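As a rough numerical illustration of the gap between the fast rate $n^{-1/(1+\alpha)}$ and the $n^{-1/2}$ rate discussed next (a sketch that ignores constants and logarithmic factors; the Sobolev choice $d = 4$, $k = 4$ is arbitrary):

```python
# Compare the two estimation rates, ignoring constants and log factors.
# For the Sobolev example W^k(R^d), alpha = d / (2k); d = 4, k = 4 is illustrative.
alpha = 4 / (2 * 4)  # = 0.5, so the fast rate is n^(-2/3)
for n in [10**3, 10**4, 10**5]:
    fast = n ** (-1.0 / (1.0 + alpha))  # rate of Theorem 1
    slow = n ** (-0.5)                  # typical supremum-of-empirical-process rate
    print(f"n = {n:>6}:  n^(-1/(1+alpha)) = {fast:.5f}   n^(-1/2) = {slow:.5f}")
```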
Farahmand et al. [2017a] used the supremum of the empirical process to analyze VAML and obtained an $n^{-1/2}$ rate, which is slower than $n^{-1/(1+\alpha)}$. Notice that the loss functions of VAML and IterVAML are different, so this is only an approximate comparison. The rate $n^{-1/2}$, however, is common in analyses based on the supremum of the empirical process, so we would expect it to hold if we used those techniques to analyze IterVAML. Finally, notice that the error rate of VAML is not necessarily slower than IterVAML's; the present difference is at least somewhat due to the shortcoming of their simpler proof technique.

4.2 Error Propagation

We analyze how the errors incurred at each step of IterVAML propagate throughout the iterations and affect the quality of the outcome. For policy evaluation, the quality is defined as the difference between $V^\pi$ and $\hat{V}_K$, weighted according to a user-defined probability distribution $\rho_{\mathcal{X}} \in \mathcal{M}(\mathcal{X})$, i.e., $\|V^\pi - \hat{V}_K\|_{1,\rho_{\mathcal{X}}}$. For the control case we consider the performance loss, which is defined as the difference between the value of following the greedy policy w.r.t. $\hat{Q}_K$ compared to the value of the optimal policy $Q^*$, weighted according to a user-defined probability distribution $\rho \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$, i.e., $\rho(Q^* - Q^{\pi_K})$ (cf. Algorithm 1). This type of error propagation analysis has been performed before by Munos [2007], Antos et al. [2008], Farahmand et al. [2010], Scherrer et al. [2012], Huang et al. [2015], Mann et al. [2015], Farahmand et al. [2016c].

Recall that there are two sources of errors in the IterVAML procedure. The first is the error in model learning, which is caused because the model $\hat{\mathcal{P}}^{(k+1)}$ learned by solving the minimization problem (11) only satisfies $\hat{\mathcal{P}}^{(k+1)} \hat{V}_k \approx \mathcal{P}^* \hat{V}_k$ instead of being exact. This error is studied in Section 4.1, and Theorem 1 provides an upper bound on it. The second source of error is that the AVI performs the Bellman update only approximately. So instead of having $\hat{Q}_{k+1} = T^*_{\hat{\mathcal{P}}^{(k)}} \hat{Q}_k$ (or its policy evaluation equivalent), the function $\hat{Q}_{k+1}$ obtained by solving (12) is only approximately equal to $T^*_{\hat{\mathcal{P}}^{(k)}} \hat{Q}_k$. As already mentioned, this step is essentially solving a regression problem. Hence, many of the standard error guarantees for regression can be used here too, with possibly some minor changes.

Consider a sequence $\hat{Q}_0, \hat{Q}_1, \dots, \hat{Q}_K$ with $\hat{Q}_{k+1} \approx T^*_{\hat{\mathcal{P}}^{(k+1)}} \hat{Q}_k$, with $\hat{\mathcal{P}}^{(k+1)}$ being an approximation of the true $\mathcal{P}^*$. IterVAML, which consists of repeated solving of (11) and (12), is an example of a procedure that generates such $\hat{Q}_k$. The result, however, is more general and does not depend on the particular way $\hat{\mathcal{P}}^{(k+1)}$ and $\hat{Q}_{k+1}$ are produced.

We define the following concentrability coefficients, similar to the coefficient introduced by Farahmand et al. [2010], which itself is a relaxation of the coefficient introduced by Munos [2007]. These coefficients are defined in terms of the Radon-Nikodym (R-N) derivative of the multi-step-ahead state-action distribution w.r.t. the distribution $\nu$. The R-N derivative can be thought of as the ratio of two probability density functions.

Definition 1 (Expected Concentrability of the Future State-Action Distribution). Given $\rho, \nu \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$, an integer $k \ge 0$, and an arbitrary sequence of policies $(\pi_i)_{i=1}^{k}$, the distribution $\rho \mathcal{P}^{\pi_1} \cdots \mathcal{P}^{\pi_k}$ denotes the future state-action distribution obtained when the first state-action is distributed according to $\rho$ and the agent follows the sequence of policies $\pi_1, \pi_2, \dots, \pi_k$. Define

$$c_{\mathrm{VI},\rho,\nu}(k) = \sup_{\pi_1, \dots, \pi_k} \left\| \frac{\mathrm{d}\left( \rho \mathcal{P}^{\pi_1} \cdots \mathcal{P}^{\pi_k} \right)}{\mathrm{d}\nu} \right\|_{2,\nu}.$$

If the future state-action distribution $\rho \mathcal{P}^{\pi_1} \cdots \mathcal{P}^{\pi_k}$ is not absolutely continuous w.r.t. $\nu$, we take $c_{\mathrm{VI},\rho,\nu}(k) = \infty$.
Moreover, for a discount factor $0 \le \gamma < 1$, define the discounted weighted average concentrability coefficient as

$$C(\rho, \nu) = (1-\gamma)^2 \sum_{k \ge 1} k \gamma^{k-1} c_{\mathrm{VI},\rho,\nu}(k).$$

The definition of $C(\rho, \nu)$ is similar to the second order discounted future state distribution concentration coefficient of Munos [2007], with the main difference being that it is defined for the expectation of the R-N derivative instead of its supremum.

The following theorem is our main error propagation result. It can be seen as a generalization of the results of Farahmand et al. [2010], Munos [2007] to the case when we use a model that has an error, whereas the aforementioned papers are for the model-free case (or when the model is exact). Because of this similarity, several steps of the proof are similar to theirs.

Theorem 2. Consider a sequence of action-value functions $(\hat{Q}_k)_{k=0}^{K}$, and their corresponding $(\hat{V}_k)_{k=0}^{K}$, each of which is defined as $\hat{V}_k(x) = \max_a \hat{Q}_k(x,a)$. Suppose that the MDP is such that the expected rewards are $R_{\max}$-bounded, and $\hat{Q}_0$ is initialized such that it is $V_{\max}$-bounded with $V_{\max} \le \frac{R_{\max}}{1-\gamma}$. Let $\varepsilon_k = T^*_{\hat{\mathcal{P}}^{(k+1)}} \hat{Q}_k - \hat{Q}_{k+1}$ (regression error) and $e_k = (\mathcal{P}^* - \hat{\mathcal{P}}^{(k+1)}) \hat{V}_k$ (modelling error) for $k = 0, 1, \dots, K-1$. Let $\pi_K$ be the greedy policy w.r.t. $\hat{Q}_K$, i.e., $\pi_K(x) = \operatorname{argmax}_{a \in \mathcal{A}} \hat{Q}_K(x, a)$ for all $x \in \mathcal{X}$. Consider probability distributions $\rho, \nu \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$. We have

$$\left\| Q^* - Q^{\pi_K} \right\|_{1,\rho} \le \frac{2\gamma}{(1-\gamma)^2} \left[ C(\rho, \nu) \max_{0 \le k \le K-1} \left( \|\varepsilon_k\|_{2,\nu} + \gamma \|e_k\|_{2,\nu} \right) + 2\gamma^K R_{\max} \right].$$

We compare this result with the results of Munos [2007], Farahmand et al. [2010], Farahmand [2011] in the same section of the supplementary material.

Before stating Theorem 3, which is the direct implication of Theorems 1 and 2, we state another assumption.

Assumption A4 (Value Function Space) The value function space $\mathcal{F}^{|\mathcal{A}|}$ is $V_{\max}$-bounded with $V_{\max} \le \frac{R_{\max}}{1-\gamma}$, and $V_{\max} \ge 1$.

This assumption requires that all the value functions $\hat{Q}_k$ and $\hat{V}_k$ generated by performing a step of AVI (12) and used in the model learning steps (11) are $V_{\max}$-bounded. This ensures that Assumption A3, which is required by Theorem 1, is satisfied in all iterations. This assumption is easy to satisfy in practice by clipping the output of the value function estimator at the level of $V_{\max}$. Theoretical analysis of such a clipped value function estimator, however, is more complicated. As we do not analyze the value function estimation steps of IterVAML, which depend on the choice of $\mathcal{F}^{|\mathcal{A}|}$, we ignore this issue.

Theorem 3. Consider the IterVAML procedure in which at the $k$-th iteration the model $\hat{\mathcal{P}}^{(k+1)}$ is obtained by solving (11) and $\hat{Q}_{k+1}$ is obtained by solving (12). Let $\varepsilon_k = T^*_{\hat{\mathcal{P}}^{(k+1)}} \hat{Q}_k - \hat{Q}_{k+1}$ be the regression error. Suppose that Assumptions A1, A2, and A4 hold. Consider the greedy policy $\pi_K$ w.r.t. $\hat{Q}_K$. For any $\rho \in \mathcal{M}(\mathcal{X} \times \mathcal{A})$, there exists a finite $c(\alpha) > 0$, depending only on $\alpha$, such that for any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\left\| Q^* - Q^{\pi_K} \right\|_{1,\rho} \le \frac{2\gamma}{(1-\gamma)^2} \left[ C(\rho, \nu) \left( \max_{0 \le k \le K-1} \|\varepsilon_k\|_{2,\nu} + \gamma\, e_{\mathrm{model}}(n) \right) + 2\gamma^K R_{\max} \right],$$

where

$$e_{\mathrm{model}}(n) = \sup_{V \in \mathcal{F}^+} \inf_{\mathcal{P} \in \mathcal{M}} \left\| (\mathcal{P}_z - \mathcal{P}^*_z) V \right\|_{2,\nu} + c(\alpha)\, V_{\max}\, R^{\frac{\alpha}{1+\alpha}} \sqrt[4]{\frac{\ln(K/\delta)}{n^{\frac{2}{1+\alpha}}}},$$

and $\mathcal{F}^+ = \left\{ \max_a Q(\cdot, a) : Q \in \mathcal{F}^{|\mathcal{A}|} \right\}$.

This result provides an upper bound on the quality of the learned policy $\pi_K$, as a function of the number of samples and the properties of the model space $\mathcal{M}$ and the MDP. The estimation error due to the model learning is $n^{-\frac{1}{2(1+\alpha)}}$, which is discussed in some detail after Theorem 1. The model approximation error term $\sup_{V \in \mathcal{F}^+} \inf_{\mathcal{P} \in \mathcal{M}} \|(\mathcal{P}_z - \mathcal{P}^*_z) V\|_{2,\nu}$ shows the interaction between the model $\mathcal{M}$ and the value function space $\mathcal{F}^{|\mathcal{A}|}$. This quantity is likely to be conservative and can be improved.
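To get a feel for how the terms in the bounds of Theorems 2 and 3 trade off, here is a back-of-the-envelope evaluation under purely illustrative values for the errors and the concentrability coefficient (none of these numbers come from the paper):

```python
# Back-of-the-envelope evaluation of the error-propagation bound, using the
# structure shared by Theorems 2 and 3. All numbers below are illustrative assumptions.
gamma, K, R_max = 0.9, 50, 1.0
C_rho_nu = 2.0    # assumed concentrability coefficient C(rho, nu)
eps_max = 0.01    # assumed max_k ||eps_k||_{2,nu}  (regression error)
e_max = 0.005     # assumed max_k ||e_k||_{2,nu}    (modelling error)

bound = (2 * gamma / (1 - gamma) ** 2) * (
    C_rho_nu * (eps_max + gamma * e_max) + 2 * gamma ** K * R_max
)
print(f"||Q* - Q^pi_K||_(1,rho) <= {bound:.3f}")  # the gamma^K term is the vanishing initialization error
```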
We also note that an upper bound on $\|\varepsilon_k\|_{2,\nu}$ depends on the regression method, the choice of $\mathcal{F}^{|\mathcal{A}|}$, and the number of samples generated from $\hat{\mathcal{P}}^{(k+1)}$.

5 Conclusion

We have introduced IterVAML, a decision-aware model-based RL algorithm. We proved a finite-sample error upper bound for the model learning procedure (Theorem 1) and a generic error propagation result for an approximate value iteration algorithm that uses an inaccurate model (Theorem 2). The consequence of these two results was Theorem 3, which provides an error upper bound guarantee on the quality of the outcome policy of IterVAML. There are several possible future research directions. One is empirical studies of IterVAML and comparing it with non-decision-aware methods. Another direction is to investigate other approaches to decision-aware model-based RL algorithms.

Acknowledgments

I would like to thank the anonymous reviewers for their helpful feedback, and Mehdi Ghasemi and Murat A. Erdogdu for discussions.

References

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008.

Kavosh Asadi, Evan Cater, Dipendra Misra, and Michael L. Littman. Equivalence between Wasserstein and value-aware model-based reinforcement learning. In FAIM Workshop on Prediction and Generative Modeling in Reinforcement Learning, 2018.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

Dimitri P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011.

Paul Doukhan. Mixing: Properties and Examples, volume 85 of Lecture Notes in Statistics. Springer-Verlag, Berlin, 1994.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR), 6:503–556, 2005.

Amir-massoud Farahmand. Regularization in Reinforcement Learning. PhD thesis, University of Alberta, 2011.

Amir-massoud Farahmand and Doina Precup. Value pursuit iteration. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS - 25), pages 1349–1357. Curran Associates, Inc., 2012.

Amir-massoud Farahmand and Csaba Szepesvári. Regularized least-squares regression: Learning from a β-mixing sequence. Journal of Statistical Planning and Inference, 142(2):493–505, 2012.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian Decision Problems. In Proceedings of American Control Conference (ACC), pages 725–730, June 2009.

Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS - 23), pages 568–576, 2010.

Amir-massoud Farahmand, André M.S. Barreto, and Daniel N. Nikovski. Value-aware loss function for model learning in reinforcement learning. In 13th European Workshop on Reinforcement Learning (EWRL), December 2016a.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research (JMLR), 17(139):1–66, 2016b.
Amir-massoud Farahmand, Daniel N. Nikovski, Yuji Igarashi, and Hiroki Konaka. Truncated approximate dynamic programming with task-dependent terminal value. In AAAI Conference on Artificial Intelligence, February 2016c.

Amir-massoud Farahmand, André M.S. Barreto, and Daniel N. Nikovski. Value-aware loss function for model-based reinforcement learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1486–1494, April 2017a.

Amir-massoud Farahmand, Saleh Nabi, and Daniel N. Nikovski. Deep reinforcement learning for partial differential equation control. In American Control Conference (ACC), 2017b.

Gregory Farquhar, Tim Rocktaeschel, Maximilian Igl, and Shimon Whiteson. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2018.

Evarist Giné and Richard Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2015.

Geoffrey Gordon. Stable function approximation in dynamic programming. In International Conference on Machine Learning (ICML), 1995.

László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, 2002.

De-An Huang, Amir-massoud Farahmand, Kris M. Kitani, and J. Andrew Bagnell. Approximate MaxEnt inverse optimal control and its application for mental simulation of human interactions. In AAAI Conference on Artificial Intelligence, January 2015.

Joshua Joseph, Alborz Geramifard, John W. Roberts, Jonathan P. How, and Nicholas Roy. Reinforcement learning with misspecified model classes. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pages 939–946. IEEE, 2013.

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research (JMLR), 4:1107–1149, 2003.

Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research (JMLR), 13:3041–3074, October 2012.

Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998.

Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. Correction to "The importance of convexity in learning with squared loss". IEEE Transactions on Information Theory, 54(9):4395, 2008.

Timothy A. Mann, Shie Mannor, and Doina Precup. Approximate value iteration with temporally extended actions. Journal of Artificial Intelligence Research (JAIR), 53:375–438, 2015.

Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, 2000.

Shahar Mendelson. Lower bounds for the empirical risk minimization algorithm. IEEE Transactions on Information Theory, 54(8):3797–3803, August 2008.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems 21, pages 1097–1104. Curran Associates, Inc., 2009.
Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research (JMLR), 11:789–814, 2010.

Rémi Munos. Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization, pages 541–561, 2007.

Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research (JMLR), 9:815–857, 2008.

Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems (NIPS - 30), pages 6118–6128. Curran Associates, Inc., 2017.

Sébastien Racanière, Theophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adrià Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, Demis Hassabis, David Silver, and Daan Wierstra. Imagination-augmented agents for deep reinforcement learning. In Advances in Neural Information Processing Systems (NIPS - 30), pages 5690–5701. Curran Associates, Inc., 2017.

Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, and Matthieu Geist. Approximate modified policy iteration. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, André M.S. Barreto, and Thomas Degris. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3191–3199, 2017.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.

Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In Advances in Neural Information Processing Systems (NIPS - 22), pages 1768–1776. Curran Associates, Inc., 2009.

Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning (ICML), 1990.

Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.

Csaba Szepesvári and William D. Smart. Interpolation-based Q-learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), 2004.

Samuele Tosatto, Matteo Pirotta, Carlo D'Eramo, and Marcello Restelli. Boosted fitted Q-iteration. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3434–3443, August 2017.

Sara A. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

Yuhong Yang and Andrew R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999.

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, January 1994.

Ding-Xuan Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49:1743–1752, 2003.