# Reinforcement Learning under Model Mismatch

Aurko Roy¹, Huan Xu², and Sebastian Pokutta²

¹ Google (work done while at Georgia Tech). Email: aurkor@google.com
² ISyE, Georgia Institute of Technology, Atlanta, GA, USA. Email: huan.xu@isye.gatech.edu
² ISyE, Georgia Institute of Technology, Atlanta, GA, USA. Email: sebastian.pokutta@isye.gatech.edu

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

## Abstract

We study reinforcement learning under model misspecification, where we do not have access to the true environment but only to a reasonably close approximation of it. We address this problem by extending the framework of robust MDPs of [1, 15, 11] to the model-free reinforcement learning setting, where we do not have access to the model parameters but can only sample states from it. We define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy and approximate value function, respectively. We scale up the robust algorithms to large MDPs via function approximation and prove convergence under two different settings. We prove convergence of robust approximate policy iteration and robust approximate value iteration for linear architectures (under mild assumptions). We also define a robust loss function, the mean squared robust projected Bellman error, and give stochastic gradient descent algorithms that are guaranteed to converge to a local minimum.

## 1 Introduction

Reinforcement learning is concerned with learning a good policy for sequential decision making problems modeled as a Markov Decision Process (MDP), via interacting with the environment [20, 18]. In this work we address the problem of reinforcement learning from a misspecified model. As a motivating example, consider the scenario where the problem of interest is not directly accessible, but instead the agent can interact with a simulator whose dynamics are reasonably close to the true problem. Another plausible application is when the parameters of the model may evolve over time but can still be reasonably approximated by an MDP.

To address this problem we use the framework of robust MDPs, which was proposed by [1, 15, 11] to solve the planning problem under model misspecification. The robust MDP framework considers a class of models and finds the robust optimal policy, i.e., a policy that performs best under the worst model. It was shown by [1, 15, 11] that the robust optimal policy satisfies the robust Bellman equation, which naturally leads to exact dynamic programming algorithms for finding an optimal policy. However, this approach is model dependent and does not immediately generalize to the model-free case where the parameters of the model are unknown.

Essentially, reinforcement learning is a model-free framework for solving the Bellman equation using samples. Therefore, to learn policies from misspecified models, we develop sample based methods to solve the robust Bellman equation. In particular, we develop robust versions of classical reinforcement learning algorithms such as Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal policy under mild assumptions on the discount factor. We also show that the nominal versions of these iterative algorithms converge to policies that may be arbitrarily worse compared to the optimal policy.
We also scale up these robust algorithms to large scale MDPs via function approximation, where we prove convergence under two different settings. Under a technical assumption similar to [5, 24] we show convergence of robust approximate policy iteration and value iteration algorithms for linear architectures. We also study function approximation with nonlinear architectures, by defining an appropriate mean squared robust projected Bellman error (MSRPBE) loss function, which is a generalization of the mean squared projected Bellman error (MSPBE) loss function of [22, 21, 6]. We propose robust versions of the stochastic gradient descent algorithms of [22, 21, 6] and prove convergence to a local minimum, under some assumptions, for function approximation with arbitrary smooth functions.

Contribution. In summary, we have the following contributions:

1. We extend the robust MDP framework of [1, 15, 11] to the model-free reinforcement learning setting. We then define robust versions of Q-learning, SARSA, and TD-learning and prove convergence to an approximately optimal robust policy.
2. We provide robust reinforcement learning algorithms for the function approximation case and prove convergence of robust approximate policy iteration and value iteration algorithms for linear architectures. We also define the MSRPBE loss function, which contains the robust optimal policy as a local minimum, and we derive stochastic gradient descent algorithms to minimize this loss function, as well as establish convergence to a local minimum in the case of function approximation by arbitrary smooth functions.
3. Finally, we demonstrate empirically the improvement in performance of the robust algorithms compared to their nominal counterparts. For this we used various reinforcement learning test environments from OpenAI Gym [9] as benchmarks to assess the improvement in performance as well as to ensure reproducibility and consistency of our results.

Related Work. Recently, several approaches have been proposed to address the degradation in model performance due to parameter uncertainty for Markov Decision Processes (MDPs). A Bayesian approach was proposed by [19], which requires perfect knowledge of the prior distribution on transition matrices. Other probabilistic and risk based settings were studied by [10, 25, 23], which propose various mechanisms to incorporate percentile risk into the model. A framework for robust MDPs was first proposed by [1, 15, 11], who consider the transition matrices to lie in some uncertainty set and propose a dynamic programming algorithm to solve the robust MDP. Recent work by [24] extended the robust MDP framework to the function approximation setting, where under a technical assumption the authors prove convergence to an optimal policy for linear architectures. Note that these algorithms for robust MDPs do not readily generalize to the model-free reinforcement learning setting, where the parameters of the environment are not explicitly known. For reinforcement learning in the non-robust model-free setting, several iterative algorithms such as Q-learning, TD-learning, and SARSA are known to converge to an optimal policy under mild assumptions; see [4] for a survey. Robustness in reinforcement learning for MDPs was studied by [13], who introduced a robust learning framework for learning with disturbances. Similarly, [16] also studied learning in the presence of an adversary who might apply disturbances to the system.
However, for the algorithms proposed in [13, 16] no theoretical guarantees are known, and there is only limited empirical evidence. Another recent work on robust reinforcement learning is [12], where the authors propose an online algorithm for MDPs in which some transitions are stochastic and others are adversarial, and the devised algorithm ensures low regret. For reinforcement learning with large MDPs using function approximation, theoretical guarantees for most TD-learning based algorithms are only known for linear architectures [2]. Recent work by [6] extended the results of [22, 21] and proved that a stochastic gradient descent algorithm minimizing the mean squared projected Bellman error (MSPBE) loss function converges to a local minimum, even for nonlinear architectures. However, these algorithms do not apply to robust MDPs; in this work we extend them to the robust setting.

## 2 Preliminaries

We consider an infinite horizon Markov Decision Process (MDP) [18] with finite state space $X$ of size $n$ and finite action space $A$ of size $m$. At every time step $t$ the agent is in a state $i \in X$ and can choose an action $a \in A$, incurring a cost $c_t(i, a)$. We make the standard assumption that future cost is discounted (see e.g. [20]) with a discount factor $\vartheta < 1$, i.e., $c_t(i, a) := \vartheta^t c(i, a)$, where $c(i, a)$ is a fixed constant independent of the time step $t$ for every $i \in X$ and $a \in A$. The states transition according to probability transition matrices $\tau := \{P^a\}_{a \in A}$, which depend only on the last action $a$ taken. A policy of the agent is a sequence $\pi = (a_0, a_1, \dots)$, where every $a_t(i)$ corresponds to an action in $A$ if the system is in state $i$ at time $t$. For every policy $\pi$, we have a corresponding value function $v_\pi \in \mathbb{R}^n$, where $v_\pi(i)$ for a state $i \in X$ measures the expected cost of that state if the agent were to follow policy $\pi$. This can be expressed by the recurrence relation

$$v_\pi(i) := c(i, a_0(i)) + \vartheta\, \mathbb{E}_{j \in X}\left[v_\pi(j)\right], \qquad (1)$$

where the expectation is over the next state $j$ sampled according to the transition probabilities. The goal is to devise algorithms to learn an optimal policy $\pi^*$ that minimizes the expected total cost:

Definition 2.1 (Optimal policy). Given an MDP with state space $X$, action space $A$ and transition matrices $P^a$, let $\Pi$ be the strategy space of all possible policies. Then an optimal policy $\pi^*$ is one that minimizes the expected total cost, i.e.,

$$\pi^* := \arg\min_{\pi \in \Pi}\ \mathbb{E}\left[\sum_{t=0}^{\infty} \vartheta^t c(i_t, a_t(i_t))\right]. \qquad (2)$$

In the robust case we assume, as in [15, 11], that the transition matrices $P^a$ are not fixed: they may come from some uncertainty region $\mathcal{P}^a$ and may be chosen adversarially by nature in future runs of the model. In this setting, [15, 11] prove the following robust analogue of the Bellman recursion. A policy of nature is a sequence $\tau := (P_0, P_1, \dots)$, where every $P_t(a) \in \mathcal{P}^a$ corresponds to a transition probability matrix chosen from $\mathcal{P}^a$. Let $\mathcal{T}$ denote the set of all such policies of nature; in other words, a policy $\tau \in \mathcal{T}$ of nature is a sequence of transition matrices that may be played by it in response to the actions of the agent. For any set $\mathcal{P} \subseteq \mathbb{R}^n$ and vector $v \in \mathbb{R}^n$, let $\sigma_{\mathcal{P}}(v) := \sup\{ p^\top v \mid p \in \mathcal{P}\}$ be the support function of the set $\mathcal{P}$. For a state $i \in X$, let $\mathcal{P}^a_i$ be the projection of $\mathcal{P}^a$ onto its $i$th row.

Theorem 2.2 ([15]). We have the following perfect duality relation

$$\min_{\pi \in \Pi} \max_{\tau \in \mathcal{T}} \mathbb{E}_\tau\left[\sum_{t=0}^{\infty} \vartheta^t c(i_t, a_t(i_t))\right] = \max_{\tau \in \mathcal{T}} \min_{\pi \in \Pi} \mathbb{E}_\tau\left[\sum_{t=0}^{\infty} \vartheta^t c(i_t, a_t(i_t))\right]. \qquad (3)$$

The optimal value function $v_{\pi^*}$ corresponding to the optimal policy $\pi^*$ satisfies

$$v_{\pi^*}(i) = \min_{a \in A}\left\{ c(i, a) + \vartheta\, \sigma_{\mathcal{P}^a_i}(v_{\pi^*})\right\}, \qquad (4)$$

and $\pi^*$ can then be obtained in a greedy fashion, i.e.,

$$a^*(i) \in \arg\min_{a \in A}\left\{ c(i, a) + \vartheta\, \sigma_{\mathcal{P}^a_i}(v_{\pi^*})\right\}. \qquad (5)$$
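To make the robust Bellman recursion (4)–(5) concrete, the following is a minimal planning-style sketch of robust value iteration for the case where the uncertainty sets are known explicitly, as in [1, 15, 11]. The $\ell_2$-ball uncertainty geometry, the function name, and the array shapes are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def robust_value_iteration(c, P, r, theta, n_iters=1000):
    """Minimal sketch of the robust Bellman recursion (4)-(5).

    c     : (n, m) array of costs c(i, a)
    P     : (m, n, n) array of nominal transition matrices P^a
    r     : radius of an assumed l2-ball uncertainty set around each row P^a_i
            (simplex constraints are ignored, as for the proxy sets of Section 3)
    theta : discount factor, theta < 1
    """
    n, m = c.shape
    v = np.zeros(n)
    for _ in range(n_iters):
        # For the l2 ball {p^a_i + x : ||x||_2 <= r}, the support function is
        # sigma_{P^a_i}(v) = p^a_i . v + r * ||v||_2.
        sigma = P @ v + r * np.linalg.norm(v)   # shape (m, n): sigma[a, i]
        q = c + theta * sigma.T                 # robust Bellman backup, cf. (4)
        v = q.min(axis=1)
    policy = q.argmin(axis=1)                   # greedy policy, cf. (5)
    return v, policy
```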
The main shortcoming of this approach is that it does not generalize to the model-free case, where the transition probabilities are not explicitly known and the agent can only sample states according to them. In the absence of this knowledge, we cannot compute the support functions of the uncertainty sets $\mathcal{P}^a_i$. On the other hand, it is often easy to have a confidence region $U^a_i$, e.g., a ball or an ellipsoid, corresponding to every state-action pair $i \in X, a \in A$, that quantifies our uncertainty in the simulation, with the uncertainty set $\mathcal{P}^a_i$ being the confidence region $U^a_i$ centered around the unknown simulator probabilities. Formally, we define the uncertainty sets corresponding to every state-action pair in the following fashion.

Definition 2.3 (Uncertainty sets). Corresponding to every state-action pair $(i, a)$ we have a confidence region $U^a_i$, so that the uncertainty region $\mathcal{P}^a_i$ of the probability transition matrix corresponding to $(i, a)$ is defined as

$$\mathcal{P}^a_i := \{x + p^a_i \mid x \in U^a_i\}, \qquad (6)$$

where $p^a_i$ is the unknown state transition probability vector from the state $i \in X$ to every other state in $X$ given action $a$ during the simulation.

As a simple example, we have the ellipsoid $U^a_i := \{x \mid x^\top A^a_i x \le 1,\ \sum_{i \in X} x_i = 0\}$ for some $n \times n$ positive semidefinite matrix $A^a_i$, with the uncertainty set $\mathcal{P}^a_i$ being $\mathcal{P}^a_i := \{x + p^a_i \mid x \in U^a_i\}$, where $p^a_i$ is the unknown simulator state transition probability vector with which the agent transitioned to a new state during training. Note that while it may be easy to come up with good descriptions of the confidence region $U^a_i$, the approach of [15, 11] breaks down since we have no knowledge of $p^a_i$ and merely observe the new state $j$ sampled from this distribution. In the following sections we develop robust versions of Q-learning, SARSA, and TD-learning which are guaranteed to converge to an approximately optimal policy that is robust with respect to this confidence region. The robust versions of these iterative algorithms involve an additional linear optimization step over the set $U^a_i$, which in the case of $U^a_i = \{\|x\|_2 \le r\}$ simply corresponds to adding fixed noise during every update. In later sections we extend this to the function approximation case, where we study linear as well as nonlinear architectures; in the latter case we derive new stochastic gradient descent algorithms for computing approximately robust policies.

## 3 Robust exact dynamic programming algorithms

In this section we develop robust versions of exact dynamic programming algorithms such as Q-learning, SARSA, and TD-learning. These methods are suitable for small MDPs where the size $n$ of the state space is not too large. Note that the confidence region $U^a_i$ must also be constrained so that $\mathcal{P}^a_i$ lies within the probability simplex $\Delta_n$. However, since we do not have knowledge of the simulator probabilities $p^a_i$, we do not know how far $p^a_i$ is from the boundary of $\Delta_n$, and so the algorithms will make use of a proxy confidence region $\widehat{U}^a_i$, obtained by dropping the constraints that keep $\mathcal{P}^a_i$ inside $\Delta_n$, to compute the robust optimal policies. With a suitable choice of step lengths and discount factors we can prove convergence to an approximately optimal $U^a_i$-robust policy, where the approximation depends on the difference between the unconstrained proxy region $\widehat{U}^a_i$ and the true confidence region $U^a_i$. Below we give specific examples of possible choices for simple confidence regions; a sketch of the corresponding support-function computations follows the two examples.

Ellipsoid: Let $\{A^a_i\}_{i,a}$ be a sequence of $n \times n$ positive semidefinite matrices. Then we can define the confidence region as

$$U^a_i := \left\{x \;\middle|\; x^\top A^a_i x \le 1,\ \sum_{i \in X} x_i = 0,\ -p^a_{ij} \le x_j \le 1 - p^a_{ij}\ \forall j \in X\right\}. \qquad (7)$$

Note that $U^a_i$ has some additional linear constraints so that the uncertainty set $\mathcal{P}^a_i := \{p^a_i + x \mid x \in U^a_i\}$ lies inside $\Delta_n$. Since we do not know $p^a_i$, we will make use of the proxy confidence region $\widehat{U}^a_i := \{x \mid x^\top A^a_i x \le 1,\ \sum_{i \in X} x_i = 0\}$. In particular, when $A^a_i = r^{-1} I_n$ for every $i \in X, a \in A$, this corresponds to a spherical confidence interval of $[-r, r]$ in every direction. In other words, each uncertainty set $\mathcal{P}^a_i$ is an $\ell_2$ ball of radius $r$.

Parallelepiped: Let $\{B^a_i\}_{i,a}$ be a sequence of $n \times n$ invertible matrices. Then we can define the confidence region as

$$U^a_i := \left\{x \;\middle|\; \|B^a_i x\|_1 \le 1,\ \sum_{i \in X} x_i = 0,\ -p^a_{ij} \le x_j \le 1 - p^a_{ij}\ \forall j \in X\right\}. \qquad (8)$$

As before, we will use the unconstrained parallelepiped $\widehat{U}^a_i$, without the $-p^a_{ij} \le x_j \le 1 - p^a_{ij}$ constraints, as a proxy for $U^a_i$ since we do not have knowledge of $p^a_i$. In particular, if $B^a_i = D$ for a diagonal matrix $D$, then the proxy confidence region $\widehat{U}^a_i$ corresponds to a rectangle. If, in addition, every diagonal entry is $r$, then every uncertainty set $\mathcal{P}^a_i$ is an $\ell_1$ ball of radius $r$.
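The proxy regions above admit closed-form support functions in the simple uniform-radius cases mentioned in the text. The following is a minimal sketch of these computations; restricting to an $\ell_2$ ball of radius $r$ and an $\ell_1$ ball of radius $r$ (each intersected with the sum-zero constraint) is a simplifying assumption, and the function names are illustrative.

```python
import numpy as np

def sigma_l2_proxy(v, r):
    """Support function of the proxy region {x : ||x||_2 <= r, sum(x) = 0}.

    The maximizer is r times the normalized projection of v onto the sum-zero
    subspace, so the value is r * ||v - mean(v)||_2.
    """
    return r * np.linalg.norm(v - v.mean())

def sigma_l1_proxy(v, r):
    """Support function of the proxy region {x : ||x||_1 <= r, sum(x) = 0}.

    The maximizer puts mass +r/2 on the largest coordinate of v and -r/2 on the
    smallest one, giving r * (max(v) - min(v)) / 2.
    """
    return 0.5 * r * (v.max() - v.min())

# Example: the "fixed noise" interpretation from Section 2 -- for the spherical
# proxy the robust update just adds a term proportional to the spread of v.
v = np.array([1.0, 0.5, 0.0, 2.0])
print(sigma_l2_proxy(v, r=0.1), sigma_l1_proxy(v, r=0.1))
```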
### 3.1 Robust Q-learning

Let us recall the notion of a Q-factor of a state-action pair $(i, a)$ and a policy $\pi$, which in the non-robust setting is defined as

$$Q(i, a) := c(i, a) + \vartheta\, \mathbb{E}_{j \in X}\left[v(j)\right], \qquad (9)$$

where $v$ is the value function of the policy $\pi$. In other words, the Q-factor represents the expected cost if we start at state $i$, take the action $a$, and follow the policy $\pi$ subsequently. One may similarly define the robust Q-factors using the same interpretation and the minimax characterization of Theorem 2.2. Let $Q^*$ denote the Q-factors of the optimal robust policy and let $v^* \in \mathbb{R}^n$ be its value function. Note that we may write the value function in terms of the Q-factors as $v^*(i) = \min_{a \in A} Q^*(i, a)$. From Theorem 2.2 we have the following expression for $Q^*$:

$$Q^*(i, a) = c(i, a) + \vartheta\, \sigma_{\mathcal{P}^a_i}(v^*) \qquad (10)$$
$$\phantom{Q^*(i, a)} = c(i, a) + \vartheta\, \sigma_{U^a_i}(v^*) + \vartheta \sum_{j \in X} p^a_{ij} \min_{a' \in A} Q^*(j, a'), \qquad (11)$$

where equation (11) follows from Definition 2.3. For an estimate $Q_t$ of $Q^*$, let $v_t \in \mathbb{R}^n$ be its value vector, i.e., $v_t(i) := \min_{a \in A} Q_t(i, a)$. The robust Q-iteration is defined as

$$Q_t(i, a) := (1 - \gamma_t)\, Q_{t-1}(i, a) + \gamma_t \left( c(i, a) + \vartheta\, \sigma_{\widehat{U}^a_i}(v_{t-1}) + \vartheta \min_{a' \in A} Q_{t-1}(j, a') \right), \qquad (12)$$

where the state $j \in X$ is sampled with the unknown transition probability $p^a_{ij}$ using the simulator. Note that the robust Q-iteration of equation (12) involves an additional linear optimization step to compute the support function $\sigma_{\widehat{U}^a_i}(v_t)$ of $v_t$ over the proxy confidence region $\widehat{U}^a_i$. We will prove that iterating equation (12) converges to an approximately optimal policy. The following definition introduces the notion of an $\varepsilon$-optimal policy, see e.g. [4]; the error factor $\varepsilon$ is also referred to as the amplification factor. We treat the Q-factors as an $|X| \times |A|$ matrix in the definition, so that its $\ell_\infty$ norm is defined as usual.

Definition 3.1 ($\varepsilon$-optimal policy). A policy $\pi$ with Q-factors $Q'$ is $\varepsilon$-optimal with respect to the optimal policy $\pi^*$ with corresponding Q-factors $Q^*$ if $\|Q' - Q^*\|_\infty \le \varepsilon\, \|Q^*\|_\infty$.

The following simple lemma allows us to decompose the optimization of a linear function over the proxy uncertainty set $\widehat{\mathcal{P}}^a_i$ in terms of linear optimization over $\mathcal{P}^a_i$, $U^a_i$, and $\widehat{U}^a_i$.

Lemma 3.2. Let $v \in \mathbb{R}^n$ be any vector and let $\beta^a_i := \max_{y \in \widehat{U}^a_i} \min_{x \in U^a_i} \|y - x\|_1$. Then we have $\sigma_{\widehat{\mathcal{P}}^a_i}(v) \le \sigma_{\mathcal{P}^a_i}(v) + \beta^a_i\, \|v\|_\infty$.
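As a concrete illustration of the update (12) before the formal convergence statement, here is a minimal sketch of robust Q-learning with the spherical proxy region $\widehat{U}^a_i = \{x : \|x\|_2 \le r,\ \sum_j x_j = 0\}$, whose support function has the closed form given earlier. The environment interface (`reset`, `step`), the $\varepsilon$-greedy exploration, and all hyperparameter values are assumptions made for the sake of the sketch, not part of the paper.

```python
import numpy as np

def robust_q_learning(env, n, m, theta=0.9, r=0.05, episodes=5000, eps=0.1):
    """Sketch of the robust Q-iteration (12) with an l2-ball proxy region.

    env : environment with reset() -> state and step(a) -> (next_state, cost, done)
    n,m : number of states and actions
    """
    Q = np.zeros((n, m))
    t = 0
    for _ in range(episodes):
        i, done = env.reset(), False
        while not done:
            a = np.random.randint(m) if np.random.rand() < eps else Q[i].argmin()
            j, cost, done = env.step(a)
            t += 1
            gamma = 1.0 / t                            # simple choice with sum = inf, sum of squares < inf
            v = Q.min(axis=1)                          # v_t(i) = min_a Q_t(i, a)
            sigma = r * np.linalg.norm(v - v.mean())   # support function of the spherical proxy
            target = cost + theta * sigma + theta * Q[j].min()
            Q[i, a] = (1 - gamma) * Q[i, a] + gamma * target   # robust Q-update, cf. (12)
            i = j
    return Q
```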
The following theorem proves that, under a suitable choice of step lengths $\gamma_t$ and discount factor $\vartheta$, the iteration of equation (12) converges to an $\varepsilon$-approximately optimal policy with respect to the confidence regions $U^a_i$.

Theorem 3.3. Let the step lengths $\gamma_t$ of the Q-iteration algorithm be chosen such that $\sum_{t=0}^\infty \gamma_t = \infty$ and $\sum_{t=0}^\infty \gamma_t^2 < \infty$, and let the discount factor $\vartheta < 1$. Let $\beta^a_i$ be as in Lemma 3.2 and let $\beta := \max_{i \in X, a \in A} \beta^a_i$. If $\vartheta(1 + \beta) < 1$, then with probability 1 the iteration of equation (12) converges to an $\varepsilon$-optimal policy, where $\varepsilon := \frac{\vartheta\beta}{1 - \vartheta(1+\beta)}$.

Remark 3.4. If $\beta = 0$, then by Theorem 3.3 the robust Q-iterations converge to the exact optimal Q-factors, since $\varepsilon = 0$. Since $\beta^a_i := \max_{y \in \widehat{U}^a_i} \min_{x \in U^a_i} \|y - x\|_1$, it follows that $\beta = 0$ if and only if $\widehat{U}^a_i = U^a_i$ for every $i \in X, a \in A$. This happens when the confidence region is small enough that the simplex constraints $-p^a_{ij} \le x_j \le 1 - p^a_{ij},\ j \in X$, in the description of $\mathcal{P}^a_i$ become redundant for every $i \in X, a \in A$. Equivalently, every $p^a_i$ is far from the boundary of the simplex $\Delta_n$ compared to the size of the confidence region $U^a_i$.

Remark 3.5. Note that simply using the nominal Q-iteration, without the $\sigma_{\widehat{U}^a_i}(v)$ term, does not guarantee convergence to $Q^*$. Indeed, the nominal Q-iterations converge to Q-factors $\bar{Q}$ for which $\|\bar{Q} - Q^*\|_\infty$ may be arbitrarily large. This follows easily from observing that

$$|\bar{Q}(i, a) - Q^*(i, a)| = \sigma_{\widehat{U}^a_i}(\bar{v}), \qquad (13)$$

where $\bar{v}$ is the value function of $\bar{Q}$, and so

$$\|\bar{Q} - Q^*\|_\infty = \max_{i \in X, a \in A} \sigma_{\widehat{U}^a_i}(\bar{v}), \qquad (14)$$

which can be as high as $\|\bar{v}\|_\infty = \|\bar{Q}\|_\infty$.

### 3.2 Robust TD-Learning

Let $(i_0, i_1, \dots)$ be a trajectory of the agent, where $i_m$ denotes the state of the agent at time step $m$. The main idea behind the TD($\lambda$)-learning method is to estimate the value function $v_\pi$ of a policy $\pi$ using the temporal difference errors $d_m$, defined as

$$d_m := c(i_m, \pi(i_m)) + \vartheta\, v_t(i_{m+1}) - v_t(i_m). \qquad (15)$$

For a parameter $\lambda \in (0, 1)$, the TD-learning iteration is defined in terms of the temporal difference errors as

$$v_{t+1}(i_k) := v_t(i_k) + \gamma_t \sum_{m=k}^{\infty} (\vartheta\lambda)^{m-k}\, d_m. \qquad (16)$$

In the robust setting, we have a confidence region $U^a_i$ with proxy $\widehat{U}^a_i$ for every temporal difference error, which leads us to define the robust temporal difference errors as

$$\widetilde{d}_m := d_m + \vartheta\, \sigma_{\widehat{U}^{\pi(i_m)}_{i_m}}(v_t), \qquad (17)$$

where $d_m$ is the non-robust temporal difference error. The robust TD-update is the usual TD-update, with the robust temporal difference errors $\widetilde{d}_m$ replacing the usual temporal difference errors $d_m$. We define an $\varepsilon$-suboptimal value function for a fixed policy $\pi$ similarly to Definition 3.1.

Definition 3.6 ($\varepsilon$-approximate value function). Given a policy $\pi$, we say that a vector $v \in \mathbb{R}^n$ is an $\varepsilon$-approximation of $v_\pi$ if $\|v - v_\pi\|_\infty \le \varepsilon\, \|v_\pi\|_\infty$.

The following theorem guarantees convergence of the robust TD-iteration to an approximate value function for $\pi$; we refer the reader to the supplementary material for a proof.

Theorem 3.7. Let $\beta^a_i$ be as in Lemma 3.2 and let $\beta := \max_{i \in X, a \in A} \beta^a_i$. Let $\rho := \frac{\vartheta\lambda}{1 - \vartheta\lambda}$. If $\vartheta(1 + \rho\beta) < 1$, then the robust TD-iteration converges to an $\varepsilon$-approximate value function, where $\varepsilon := \frac{\vartheta\beta}{1 - \vartheta(1+\rho\beta)}$. In particular, if $\beta^a_i = \beta = 0$, i.e., the proxy confidence region $\widehat{U}^a_i$ is the same as the true confidence region $U^a_i$, then the convergence is exact, i.e., $\varepsilon = 0$.
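Before moving to function approximation, here is a minimal sketch of a tabular robust TD($\lambda$) update using the standard eligibility-trace (backward-view) implementation of the forward view (16), with the robust errors (17). The spherical proxy region, the trace-based implementation, and the environment interface are assumptions made for the sketch.

```python
import numpy as np

def robust_td_lambda(env, policy, n, theta=0.9, lam=0.5, r=0.05, episodes=2000):
    """Sketch of robust TD(lambda) with the robust TD errors of (17).

    The robust error simply adds theta * sigma_{U_proxy}(v) to the usual
    temporal difference before the trace-weighted update.
    """
    v = np.zeros(n)
    t = 0
    for _ in range(episodes):
        i, done = env.reset(), False
        z = np.zeros(n)                                 # eligibility trace
        while not done:
            j, cost, done = env.step(policy[i])
            t += 1
            gamma = 1.0 / t
            d = cost + theta * v[j] - v[i]              # nominal TD error, cf. (15)
            sigma = r * np.linalg.norm(v - v.mean())    # support function of the spherical proxy
            d_robust = d + theta * sigma                # robust TD error, cf. (17)
            z *= theta * lam
            z[i] += 1.0
            v += gamma * d_robust * z                   # robust TD(lambda) update
            i = j
    return v
```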
## 4 Robust Reinforcement Learning with function approximation

In Section 3 we derived robust versions of exact dynamic programming algorithms such as Q-learning, SARSA, and TD-learning. If the state space $X$ of the MDP is large, then it is prohibitive to maintain a lookup table entry for every state. A standard approach for large scale MDPs is the approximate dynamic programming (ADP) framework [17]. In this setting, the problem is parametrized by a smaller dimensional vector $\theta \in \mathbb{R}^d$, where $d \ll n = |X|$.

The natural generalizations of the Q-learning, SARSA, and TD-learning algorithms of Section 3 are via the projected Bellman equation, where we project back onto the space of value functions representable by the model, i.e., those parametrized by $\theta \in \mathbb{R}^d$. Convergence of these algorithms, even in the non-robust setting, is known only for linear architectures, see e.g. [2]. Recent work by [6] proposed stochastic gradient descent algorithms with convergence guarantees for smooth nonlinear function architectures, where the problem is framed as minimizing a loss function. We give robust versions of both these approaches.

### 4.1 Robust approximations with linear architectures

In the approximate setting with linear architectures, we approximate the value function $v_\pi$ of a policy $\pi$ by $\Phi\theta$, where $\theta \in \mathbb{R}^d$ and $\Phi$ is an $n \times d$ feature matrix with rows $\phi(j)$ for every state $j \in X$ representing its feature vector. Let $S$ be the span of the columns of $\Phi$, i.e., $S := \{\Phi\theta \mid \theta \in \mathbb{R}^d\}$. Define the operator $T_\pi : \mathbb{R}^n \to \mathbb{R}^n$ as $(T_\pi v)(i) := c(i, \pi(i)) + \vartheta \sum_{j \in X} p^{\pi(i)}_{ij} v(j)$, so that the true value function $v_\pi$ satisfies $T_\pi v_\pi = v_\pi$. A natural approach towards estimating $v_\pi$, given a current estimate $\Phi\theta_t$, is to compute $T_\pi(\Phi\theta_t)$ and project it back onto $S$ to get the next parameter $\theta_{t+1}$. The motivation behind such an iteration is the fact that the true value function would be a fixed point of this operation if it belonged to the subspace $S$. This gives rise to the projected Bellman equation, where the projection $\Pi$ is typically taken with respect to a weighted Euclidean norm $\|\cdot\|_\xi$, i.e., $\|x\|_\xi = \sqrt{\sum_{i \in X} \xi_i x_i^2}$, where $\xi$ is some probability distribution over the states $X$. In the model-free case, where we do not have explicit knowledge of the transition probabilities, various methods such as LSTD($\lambda$), LSPE($\lambda$), and TD($\lambda$) have been proposed [3, 8, 7, 14, 22, 21]. The key idea behind proving convergence for these methods is to show that the mapping $\Pi T_\pi$ is a contraction with respect to $\|\cdot\|_\xi$ for some distribution $\xi$ over the states $X$. While the operator $T_\pi$ in the non-robust case is linear and is a contraction in the $\ell_\infty$ norm as in Section 3, the projection operator with respect to such norms is not guaranteed to be a contraction. However, it is known that if $\xi$ is the steady state distribution of the policy $\pi$ under evaluation, then $\Pi$ is non-expansive in $\|\cdot\|_\xi$ [4, 2].

In the robust setting, we have the same methods but with the robust Bellman operators $T_\pi$ defined as $(T_\pi v)(i) := c(i, \pi(i)) + \vartheta\, \sigma_{\mathcal{P}^{\pi(i)}_i}(v)$. Since we do not have access to the simulator probabilities $p^a_i$, we use a proxy set $\widehat{\mathcal{P}}^a_i$ as in Section 3, with the corresponding proxy operator denoted by $\widehat{T}_\pi$. While the iterative methods of the non-robust setting generalize via the robust operator $T_\pi$ and the robust projected Bellman equation $\Phi\theta = \Pi T_\pi(\Phi\theta)$, it is not clear how to choose the distribution $\xi$ under which the projected operator $\Pi T_\pi$ is a contraction in order to show convergence. Let $\xi$ be the steady state distribution of the exploration policy $\widehat{\pi}$ of the MDP with transition probability matrix $P^{\widehat{\pi}}$. We make the following assumption on the discount factor $\vartheta$, as in [24].

Assumption 4.1. For every state $i \in X$ and action $a \in A$, there exists a constant $\alpha \in (0, 1)$ such that for any $p \in \mathcal{P}^a_i$ we have $\vartheta\, p_j \le \alpha\, P^{\widehat{\pi}}_{ij}$ for every $j \in X$.

Assumption 4.1 might appear artificially restrictive; however, it is necessary to prove that $\Pi T_\pi$ is a contraction.
While [24] require this assumption for proving convergence of robust MDPs, a similar assumption is also required to prove convergence of the off-policy reinforcement learning methods of [5], where the states are sampled from an exploration policy $\widehat{\pi}$ which is not necessarily the same as the policy $\pi$ under evaluation. Note that in the robust setting, all methods are necessarily off-policy, since the transition matrices are not fixed for a given policy. The following lemma is a $\xi$-weighted Euclidean norm version of Lemma 3.2.

Lemma 4.2. Let $v \in \mathbb{R}^n$ be any vector and let $\beta^a_i := \frac{\max_{y \in \widehat{U}^a_i} \min_{x \in U^a_i} \|y - x\|_\xi}{\xi_{\min}}$, where $\xi_{\min} := \min_{i \in X} \xi_i$. Then we have $\sigma_{\widehat{\mathcal{P}}^a_i}(v) \le \sigma_{\mathcal{P}^a_i}(v) + \beta^a_i\, \|v\|_\xi$.

The following theorem shows that the robust projected Bellman equation is a contraction under some assumptions on the discount factor $\vartheta$.

Theorem 4.3. Let $\beta^a_i$ be as in Lemma 4.2 and let $\beta := \max_{i \in X} \beta^{\pi(i)}_i$. If the discount factor $\vartheta$ satisfies Assumption 4.1 and $\alpha^2 + \vartheta^2\beta^2 < \frac{1}{2}$, then the operator $\widehat{T}_\pi$ is a contraction with respect to $\|\cdot\|_\xi$. In other words, for any two $\theta, \theta' \in \mathbb{R}^d$ we have

$$\left\|\widehat{T}_\pi(\Phi\theta) - \widehat{T}_\pi(\Phi\theta')\right\|_\xi^2 \le 2\left(\alpha^2 + \vartheta^2\beta^2\right) \left\|\Phi\theta - \Phi\theta'\right\|_\xi^2 < \left\|\Phi\theta - \Phi\theta'\right\|_\xi^2. \qquad (18)$$

If $\beta_i = \beta = 0$, so that $\widehat{U}^{\pi(i)}_i = U^{\pi(i)}_i$, then we have a simpler contraction under the assumption that $\alpha < 1$.

The following corollary shows that the solution to the proxy projected Bellman equation converges to a solution that is not too far away from the true value function $v_\pi$.

Corollary 4.4. Let Assumption 4.1 hold and let $\beta$ be as in Theorem 4.3. Let $\widetilde{v}_\pi$ be the fixed point of the projected Bellman equation for the proxy operator $\widehat{T}_\pi$, i.e., $\Pi\widehat{T}_\pi \widetilde{v}_\pi = \widetilde{v}_\pi$. Let $\widehat{v}_\pi$ be the fixed point of the proxy operator $\widehat{T}_\pi$, i.e., $\widehat{T}_\pi \widehat{v}_\pi = \widehat{v}_\pi$. Let $v_\pi$ be the true value function of the policy $\pi$, i.e., $T_\pi v_\pi = v_\pi$. Then it follows that

$$\left\|\widetilde{v}_\pi - v_\pi\right\|_\xi \le \frac{\vartheta\beta\, \|v_\pi\|_\xi + \|\Pi v_\pi - v_\pi\|_\xi}{1 - \sqrt{2\left(\alpha^2 + \vartheta^2\beta^2\right)}}. \qquad (19)$$

In particular, if $\beta_i = \beta = 0$, i.e., the proxy confidence region is actually the true confidence region, then the proxy projected Bellman equation has a solution satisfying $\|\widetilde{v}_\pi - v_\pi\|_\xi \le \|\Pi v_\pi - v_\pi\|_\xi$.

Theorem 4.3 guarantees that the robust projected Bellman iterations of the LSTD($\lambda$), LSPE($\lambda$), and TD($\lambda$) methods converge, while Corollary 4.4 guarantees that the solution they converge to is not too far away from the true value function $v_\pi$.
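To illustrate how the linear-architecture iterations change in the robust setting, here is a minimal sketch of a robust TD(0)-style update with linear features, where the only modification to the standard update is the additional support-function term over the proxy region. The spherical sum-zero proxy region shared across states, the TD(0) (rather than LSTD/LSPE) form, and the environment interface are simplifying assumptions for illustration only.

```python
import numpy as np

def robust_linear_td0(env, policy, Phi, theta_dim, discount=0.9, r=0.05, episodes=2000):
    """Sketch of robust TD(0) with linear value function v_theta = Phi @ theta.

    The only change from nominal TD(0) is the extra term
    discount * sigma_{U_proxy}(Phi @ theta) in the temporal difference,
    computed here for a spherical sum-zero proxy region of radius r.
    (This costs O(n) per step; cheaper approximations are discussed in the text.)
    """
    theta = np.zeros(theta_dim)
    t = 0
    for _ in range(episodes):
        i, done = env.reset(), False
        while not done:
            j, cost, done = env.step(policy[i])
            t += 1
            step = 1.0 / t
            v = Phi @ theta
            sigma = r * np.linalg.norm(v - v.mean())
            d = cost + discount * sigma + discount * v[j] - v[i]   # robust TD error
            theta += step * d * Phi[i]                             # semi-gradient update
            i = j
    return theta
```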
### 4.2 Robust approximations with nonlinear architectures

In this section we consider the situation where the function approximator $v_\theta$ is a smooth, but not necessarily linear, function of $\theta$. This section generalizes the results of [6] to the robust setting with confidence regions. We define robust analogues of the nonlinear GTD2 and nonlinear TDC algorithms, respectively. Let $\mathcal{M} := \{v_\theta \mid \theta \in \mathbb{R}^d\}$ be the manifold spanned by all possible value functions representable by our model, and let $PM_\theta$ be the tangent plane of $\mathcal{M}$ at $\theta$. Let $TM_\theta$ be the tangent space, i.e., the translation of $PM_\theta$ to the origin. In other words, $TM_\theta := \{\Phi_\theta u \mid u \in \mathbb{R}^d\}$, where $\Phi_\theta$ is an $n \times d$ matrix with entries $\Phi_\theta(i, j) := \partial_{\theta_j} v_\theta(i)$. In the nonlinear case, we project onto the tangent space $TM_\theta$, since projecting onto $\mathcal{M}$ is computationally hard. We denote this projection by $\Pi_\theta$; it is also taken with respect to a weighted Euclidean norm $\|\cdot\|_\xi$. The mean squared projected Bellman error (MSPBE) loss function was proposed by [6] as an extension of [22, 21]: $\mathrm{MSPBE}(\theta) = \|v_\theta - \Pi_\theta T_\pi v_\theta\|_\xi^2$, where we now project onto the tangent space $TM_\theta$. Since the number $n$ of states is prohibitively large, we want stochastic gradient algorithms that run in time polynomial in $d$. Therefore, we assume that the confidence region of every state-action pair is the same: $U^a_i = U$ and $\widehat{U}^a_i = \widehat{U}$. The robust version of the MSPBE loss function, the mean squared robust projected Bellman error (MSRPBE) loss, can then be defined in terms of the robust Bellman operator with the proxy confidence region $\widehat{U}$ and proxy uncertainty set $\widehat{\mathcal{P}}^{\pi(i)}_i$ as

$$\mathrm{MSRPBE}(\theta) = \left\| v_\theta - \Pi_\theta \widehat{T}_\pi v_\theta \right\|_\xi^2. \qquad (20)$$

In order to derive stochastic gradient descent algorithms for minimizing the MSRPBE loss function, we need the gradient of $\sigma_{\mathcal{P}}(v_\theta)$ for a convex set $\mathcal{P}$. The gradient $\mu$ is given by

$$\mu_{\mathcal{P}}(\theta) := \nabla_\theta \max_{y \in \mathcal{P}} y^\top v_\theta = \Phi_\theta^\top \arg\max_{y \in \mathcal{P}} y^\top v_\theta, \qquad (21)$$

where $\Phi_\theta(i) := \nabla_\theta v_\theta(i)$. Let us denote $\Phi_\theta(i)$ simply by $\phi$ and $\Phi_\theta(i')$ by $\phi'$, where $i'$ is the next sampled state, and let us denote by $\widehat{U}$ the proxy confidence region $\widehat{U}^{\pi(i)}_i$ of state $i$ and the policy $\pi$ under evaluation. Let

$$h(\theta, u) := \mathbb{E}\left[ \left( \widetilde{d} - \phi^\top u \right) \nabla^2 v_\theta(i)\, u \right], \qquad (22)$$

where $\widetilde{d}$ is the robust temporal difference error. As in [6], we may express $\nabla \mathrm{MSRPBE}(\theta)$ in terms of $h(\theta, w)$, where $w = \mathbb{E}\left[\phi\phi^\top\right]^{-1} \mathbb{E}\left[\widetilde{d}\,\phi\right]$; we refer the reader to the supplementary material for the details. This leads us to the following robust analogues of nonlinear GTD2 and nonlinear TDC, where we update the estimators $w_k$ of $w$ as $w_{k+1} := w_k + \beta_k \left( \widetilde{d}_k - \phi_k^\top w_k \right) \phi_k$, with the parameters $\theta_k$ being updated on a slower timescale as

$$\theta_{k+1} := \Gamma\left( \theta_k + \alpha_k \left\{ \left( \phi_k - \vartheta\phi'_k - \vartheta\mu_{\widehat{U}}(\theta) \right) \left(\phi_k^\top w_k\right) - h_k \right\} \right) \quad \text{(robust-nonlinear-GTD2)}, \qquad (23)$$

$$\theta_{k+1} := \Gamma\left( \theta_k + \alpha_k \left\{ \widetilde{d}_k\, \phi_k - \left( \vartheta\phi'_k + \vartheta\mu_{\widehat{U}}(\theta) \right) \left(\phi_k^\top w_k\right) - h_k \right\} \right) \quad \text{(robust-nonlinear-TDC)}, \qquad (24)$$

where $h_k := \left( \widetilde{d}_k - \phi_k^\top w_k \right) \nabla^2 v_{\theta_k}(i_k)\, w_k$ and $\Gamma$ is a projection onto an appropriately chosen compact set $C$ with a smooth boundary, as in [6]. Under the assumption of Lipschitz continuous gradients and suitable assumptions on the step lengths $\alpha_k$ and $\beta_k$ and the confidence region $\widehat{U}$, the updates of equations (23) and (24) converge with probability 1 to a local optimum of $\mathrm{MSRPBE}(\theta)$; see the supplementary material for the exact statement and proof of convergence. Note that, in general, computing $\mu_{\widehat{U}}(\theta)$ would take time polynomial in $n$, but it can be done in $O(d^2)$ time using a rank-$d$ approximation to $\widehat{U}$.
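The two-timescale update (23) can be written compactly; the following is a minimal sketch of a single robust-nonlinear-GTD2 step for a generic smooth approximator, with derivatives supplied as user callables. The callables, the spherical sum-zero proxy region used for $\mu_{\widehat{U}}$, the $O(nd)$ computation of $\mu$ (rather than the rank-$d$ approximation mentioned above), and the omission of the projection $\Gamma$ are all simplifying assumptions for illustration.

```python
import numpy as np

def robust_gtd2_step(theta, w, i, i_next, cost, v, grad_v, hess_vec_v,
                     discount=0.9, r=0.05, alpha=1e-3, beta=1e-2):
    """One robust-nonlinear-GTD2 step, cf. (23). Derivative callables:
      v(theta)              -> length-n vector of values v_theta
      grad_v(theta, i)      -> gradient of v_theta(i) w.r.t. theta (row phi_i of Phi_theta)
      hess_vec_v(theta,i,w) -> Hessian-vector product  nabla^2 v_theta(i) @ w
    The projection Gamma onto a compact set is omitted for brevity.
    """
    vals = v(theta)
    phi, phi_next = grad_v(theta, i), grad_v(theta, i_next)

    # robust TD error with the spherical sum-zero proxy region of radius r
    centered = vals - vals.mean()
    sigma = r * np.linalg.norm(centered)
    d = cost + discount * sigma + discount * vals[i_next] - vals[i]

    # mu_{U_proxy}(theta) = Phi_theta^T y*, with y* the maximizer over the proxy set
    y_star = r * centered / (np.linalg.norm(centered) + 1e-12)
    mu = sum(y_star[j] * grad_v(theta, j) for j in range(len(vals)))  # O(n d)

    h = (d - phi @ w) * hess_vec_v(theta, i, w)
    theta_new = theta + alpha * ((phi - discount * phi_next - discount * mu) * (phi @ w) - h)
    w_new = w + beta * (d - phi @ w) * phi
    return theta_new, w_new
```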
## 5 Experiments

We implemented robust versions of Q-learning and SARSA as in Section 3 and evaluated their performance against the nominal algorithms using the OpenAI Gym framework [9]. To test the performance of the robust algorithms, we perturb the models slightly by choosing, with a small probability $p$, a random state after every action. The size of the confidence region $U^a_i$ for the robust model is chosen by 10-fold cross validation via line search. After the value functions are learned for the robust and the nominal algorithms, we evaluate their performance on the true environment. To compare the algorithms we report both the cumulative reward and the tail distribution function (complementary cumulative distribution function) as in [24], which for every value $a$ plots the probability that the algorithm earned a reward of at least $a$. Note that there is a tradeoff in the performance of the robust versus the nominal algorithms with the value of $p$, due to the presence of the $\beta$ term in the convergence results. See Figure 1 for a comparison; more figures and detailed results are included in the supplementary material.

Figure 1: Line search, tail distribution, and cumulative rewards during the transient phase of robust vs. nominal Q-learning on FrozenLake-v0 with $p = 0.01$. Note that the instability of the reward as a function of the size of the uncertainty set (left) is due to the small sample size used in line search.

## Acknowledgments

The authors would like to thank Guy Tennenholtz and the anonymous reviewers for helping improve the presentation of the paper.

## References

[1] J. A. Bagnell, A. Y. Ng, and J. G. Schneider. Solving uncertain Markov decision processes. 2001.
[2] D. P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011.
[3] D. P. Bertsekas and S. Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Lab. for Info. and Decision Systems Report LIDS-P-2349, MIT, Cambridge, MA, 1996.
[4] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pages 560–564. IEEE, 1995.
[5] D. P. Bertsekas and H. Yu. Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics, 227(1):27–50, 2009.
[6] S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.
[7] J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233–246, 2002.
[8] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
[9] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[10] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.
[11] G. N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
[12] S. H. Lim, H. Xu, and S. Mannor. Reinforcement learning in robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 701–709, 2013.
[13] J. Morimoto and K. Doya. Robust reinforcement learning. Neural Computation, 17(2):335–359, 2005.
[14] A. Nedić and D. P. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1):79–110, 2003.
[15] A. Nilim and L. El Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. In NIPS, pages 839–846, 2003.
[16] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta. Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702, 2017.
[17] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality, volume 703. John Wiley & Sons, 2007.
[18] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[19] A. Shapiro and A. Kleywegt. Minimax analysis of stochastic problems. Optimization Methods and Software, 17(3):523–542, 2002.
[20] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[21] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.
[22] R. S. Sutton, H. R. Maei, and C. Szepesvári. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 1609–1616, 2009.
[23] A. Tamar, Y. Glassner, and S. Mannor. Optimizing the CVaR via sampling. arXiv preprint arXiv:1404.3862, 2014.
[24] A. Tamar, S. Mannor, and H. Xu. Scaling up robust MDPs using function approximation. In ICML, volume 32, 2014.
[25] W. Wiesemann, D. Kuhn, and B. Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.