# Robust Reinforcement Learning using Offline Data

Kishan Panaganti¹, Zaiyan Xu¹, Dileep Kalathil¹, Mohammad Ghavamzadeh²
¹Texas A&M University, ²Google Research. Emails: {kpb, zxu43, dileep.kalathil}@tamu.edu, ghavamza@google.com

Abstract

The goal of robust reinforcement learning (RL) is to learn a policy that is robust against the uncertainty in model parameters. Parameter uncertainty commonly occurs in many real-world RL applications due to simulator modeling errors, changes in the real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn the policy that maximizes the value against the worst possible models that lie in an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems.

1 Introduction

Reinforcement learning (RL) algorithms often require a large number of data samples to learn a control policy. As a result, training them directly on real-world systems is expensive and potentially dangerous. This problem is typically overcome by training them on a simulator (online RL) or using a pre-collected offline dataset (offline RL). The offline dataset is usually collected either from a sophisticated simulator of the real-world system or from historical measurements. The trained RL policy is then deployed assuming that the training environment, the simulator or the offline data, faithfully represents the model of the real-world system. This assumption is often incorrect due to multiple factors, such as the approximation errors incurred while modeling, changes in the real-world parameters over time, and possible adversarial disturbances in the real world. For example, the standard simulator settings of the sensor noise, action delay, friction, and mass of a mobile robot can be different from those of the actual real-world robot, in addition to changes in the terrain, weather conditions, lighting, and obstacle densities of the testing environment. Unfortunately, current RL control policies can fail dramatically when faced with even mild changes between the training and testing environments (Sünderhauf et al., 2018; Tobin et al., 2017; Peng et al., 2018).

The goal in robust RL is to learn a policy that is robust against model parameter mismatches between the training and testing environments. The robust planning problem is formalized using the framework of the Robust Markov Decision Process (RMDP) (Iyengar, 2005; Nilim and El Ghaoui, 2005). Unlike the standard MDP, which considers a single model (transition probability function), the RMDP formulation considers a set of models called the uncertainty set. The goal is to find an optimal robust policy that performs the best under the worst possible model in this uncertainty set.
The minimization over the uncertainty set makes the robust MDP and robust RL problems significantly more challenging than their non-robust counterparts.

In this work, we study the problem of developing a robust RL algorithm with provably optimal performance for an RMDP with arbitrarily large state spaces, using only offline data with function approximation. Before stating the contributions of our work, we provide a brief overview of the results in offline and robust RL that are directly related to ours. We leave a more thorough discussion of related works to Appendix D.

Offline RL: Offline RL considers the problem of learning the optimal policy using only a pre-collected (offline) dataset. The offline RL problem has been addressed extensively in the literature (Antos et al., 2008; Bertsekas, 2011; Lange et al., 2012; Chen and Jiang, 2019; Xie and Jiang, 2020; Levine et al., 2020; Xie et al., 2021). Many recent works develop deep RL algorithms and heuristics for the offline RL problem, focusing on the algorithmic and empirical aspects (Fujimoto et al., 2019; Kumar et al., 2019, 2020; Yu et al., 2020; Zhang and Jiang, 2021). A number of theoretical works focus on analyzing variations of the Fitted Q-Iteration (FQI) algorithm (Gordon, 1995; Ernst et al., 2005), by identifying the necessary and sufficient conditions for the learned policy to be approximately optimal and characterizing the performance in terms of sample complexity (Munos and Szepesvári, 2008; Farahmand et al., 2010; Lazaric et al., 2012; Chen and Jiang, 2019; Liu et al., 2020; Xie et al., 2021). All of these works assume that the offline data is generated according to a single model, and the goal is to find the optimal policy for the MDP with that same model. In particular, none of these works consider the offline robust RL problem, where the offline data is generated according to a (training) model which can be different from the one used in testing, and the goal is to learn a policy that is robust w.r.t. an uncertainty set.

Robust RL: The RMDP framework was first introduced in Iyengar (2005); Nilim and El Ghaoui (2005). The RMDP problem has been analyzed extensively in the literature (Xu and Mannor, 2010; Wiesemann et al., 2013; Yu and Xu, 2015; Mannor et al., 2016; Russel and Petrik, 2019), providing computationally efficient algorithms, but these works are limited to the planning problem. Robust RL algorithms with provable guarantees have also been proposed (Lim et al., 2013; Tamar et al., 2014; Roy et al., 2017; Panaganti and Kalathil, 2021; Wang and Zou, 2021), but they are limited to tabular or linear function approximation settings and only provide asymptotic convergence guarantees. The robust RL problem has also been addressed using deep RL methods (Pinto et al., 2017; Derman et al., 2018, 2020; Mankowitz et al., 2020; Zhang et al., 2020a). However, these works do not provide any theoretical guarantees on the performance of the learned policies. The works closest to ours are Zhou et al. (2021); Yang et al. (2021); Panaganti and Kalathil (2022), which address the robust RL problem in a tabular setting under the generative model assumption. Due to the generative model assumption, the offline data contains the same uniform number of samples for each and every state-action pair, and the tabular setting allows the estimation of the uncertainty set followed by solving the planning problem.
Our work is significantly different from these in the following ways: (i) we consider a robust RL problem with an arbitrarily large state space, instead of the small tabular setting; (ii) we consider a true offline RL setting where the state-action pairs are sampled according to an arbitrary distribution, instead of relying on the generative model assumption; (iii) we focus on a function approximation approach where the goal is to directly learn the optimal robust value/policy using function approximation techniques, instead of solving the tabular planning problem with an estimated model. To the best of our knowledge, this is the first work that addresses the offline robust RL problem with an arbitrarily large state space using function approximation, with provable guarantees on the performance of the learned policy.

Offline Robust RL: Challenges and Our Contributions: Offline robust RL is significantly more challenging than its non-robust counterpart mainly because of the following key difficulties. (i) Data generation: The optimal robust policy is computed by taking the infimum over all models in the uncertainty set $\mathcal{P}$. However, generating data according to all models in $\mathcal{P}$ is clearly infeasible. It may only be possible to get data from a nominal (training) model $P^o$. How do we use the data from a nominal model to account for the behavior of all the models in the uncertainty set $\mathcal{P}$? (ii) Optimization over the uncertainty set $\mathcal{P}$: The robust Bellman operator (defined in (3)) involves a minimization over $\mathcal{P}$, which is a significant computational challenge. Moreover, the uncertainty set $\mathcal{P}$ itself is unknown in the RL setting. How do we solve the optimization over $\mathcal{P}$? (iii) Function approximation: Approximating the robust Bellman update requires a modified target function which also depends on the approximate solution of the optimization over the uncertainty set. How do we perform the offline RL update accounting for both approximations?

As the key technical contributions of this work, we first derive a dual reformulation of the robust Bellman operator which replaces the expectation w.r.t. all models in the uncertainty set $\mathcal{P}$ with an expectation only w.r.t. the nominal (training) model $P^o$. This enables using the offline data generated by $P^o$ for learning, without relying on high-variance importance sampling techniques to account for all models in $\mathcal{P}$. Following the same reformulation, we then show that the optimization problem over $\mathcal{P}$ can be further reformulated as a functional optimization. We solve this functional optimization problem using empirical risk minimization and obtain performance guarantees using Rademacher complexity based bounds. We then use the approximate solution obtained from the empirical risk minimization to generate modified target samples that are used to approximate the robust Bellman update through a generalized least-squares approach with provably bounded errors. Performing these operations iteratively results in our proposed Robust Fitted Q-Iteration (RFQI) algorithm, for which we prove that its learned policy achieves non-asymptotic and approximately optimal performance guarantees.

Notation: For a set $X$, we denote its cardinality as $|X|$. The set of probability distributions over $X$ is denoted as $\Delta(X)$, and its power-set sigma-algebra as $\Sigma(X)$. For any $x\in\mathbb{R}$, we denote $\max\{x,0\}$ as $(x)_+$. For any function $f : S\times A\to\mathbb{R}$, state-action distribution $\nu\in\Delta(S\times A)$, and real number $p\ge 1$, the $\nu$-weighted $p$-norm of $f$ is defined as $\|f\|_{p,\nu} = \big(\mathbb{E}_{s,a\sim\nu}\big[|f(s,a)|^{p}\big]\big)^{1/p}$.
2 Preliminaries

A Markov Decision Process (MDP) is a tuple $(S, A, r, P, \gamma, d_0)$, where $S$ is the state space, $A$ is the action space, $r : S\times A\to\mathbb{R}$ is the reward function, $\gamma\in(0,1)$ is the discount factor, and $d_0\in\Delta(S)$ is the initial state distribution. The transition probability function $P_{s,a}(s')$ is the probability of transitioning to state $s'$ when action $a$ is taken at state $s$. In the literature, $P$ is also called the model of the MDP. We consider a setting where $|S|$ and $|A|$ are finite but can be arbitrarily large. We will also assume that $r(s,a)\in[0,1]$ for all $(s,a)\in S\times A$, without loss of generality. A policy $\pi : S\to\Delta(A)$ is a conditional distribution over actions given a state. The value function $V_{\pi,P}$ and the state-action value function $Q_{\pi,P}$ of a policy $\pi$ for an MDP with model $P$ are defined as
$$V_{\pi,P}(s) = \mathbb{E}_{\pi,P}\Big[\sum_{t=0}^{\infty}\gamma^{t} r(s_t,a_t)\,\Big|\, s_0 = s\Big], \qquad Q_{\pi,P}(s,a) = \mathbb{E}_{\pi,P}\Big[\sum_{t=0}^{\infty}\gamma^{t} r(s_t,a_t)\,\Big|\, s_0 = s,\, a_0 = a\Big],$$
where the expectation is over the randomness induced by the policy $\pi$ and the model $P$. The optimal value function $V^*_P$ and the optimal policy $\pi^*_P$ of an MDP with model $P$ are defined as $V^*_P = \max_{\pi} V_{\pi,P}$ and $\pi^*_P = \arg\max_{\pi} V_{\pi,P}$. The optimal state-action value function is given by $Q^*_P = \max_{\pi} Q_{\pi,P}$. The optimal policy can be obtained as $\pi^*_P(s) = \arg\max_{a} Q^*_P(s,a)$. The discounted state-action occupancy of a policy $\pi$ for an MDP with model $P$, denoted as $d_{\pi,P}\in\Delta(S\times A)$, is defined as $d_{\pi,P}(s,a) = (1-\gamma)\,\mathbb{E}_{\pi,P}\big[\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{1}(s_t = s, a_t = a)\big]$.

Robust Markov Decision Process (RMDP): Unlike the standard MDP, which considers a single model (transition probability function), the RMDP formulation considers a set of models. We refer to this set as the uncertainty set and denote it as $\mathcal{P}$. We consider $\mathcal{P}$ that satisfies the standard $(s,a)$-rectangularity condition (Iyengar, 2005). We note that a similar uncertainty set can be considered for the reward function at the expense of additional notation. However, since the analysis would be similar and the sample complexity guarantee would be identical up to a constant factor, without loss of generality, we assume that the reward function is known and deterministic. We specify an RMDP as $M = (S, A, r, \mathcal{P}, \gamma, d_0)$, where the uncertainty set $\mathcal{P}$ is typically defined as
$$\mathcal{P} = \bigotimes_{(s,a)\in S\times A}\mathcal{P}_{s,a}, \quad \text{where} \quad \mathcal{P}_{s,a} = \{P_{s,a}\in\Delta(S) : D(P_{s,a}, P^{o}_{s,a})\le\rho\}, \tag{1}$$
$P^o = (P^{o}_{s,a},\, (s,a)\in S\times A)$ is the nominal model, $D(\cdot,\cdot)$ is a distance metric between two probability distributions, and $\rho > 0$ is the radius of the uncertainty set that indicates the level of robustness. The nominal model $P^o$ can be thought of as the model of the training environment. It is either the model of the simulator on which the (online) RL algorithm is trained, or, in our setting, the model according to which the offline data is generated. The uncertainty set $\mathcal{P}$ in (1) is the set of all valid transition probability functions (valid testing models) in the neighborhood of the nominal model $P^o$, which by definition satisfies the $(s,a)$-rectangularity condition (Iyengar, 2005), where the neighborhood is defined using the distance metric $D(\cdot,\cdot)$ and radius $\rho$. In this work, we consider the Total Variation (TV) uncertainty set defined using the TV distance, i.e., $D(P_{s,a}, P^{o}_{s,a}) = (1/2)\|P_{s,a} - P^{o}_{s,a}\|_1$.

The RMDP problem is to find the optimal robust policy which maximizes the value against the worst possible model in the uncertainty set $\mathcal{P}$. The robust value function $V^{\pi}$ corresponding to a policy $\pi$ and the optimal robust value function $V^{*}$ are defined as (Iyengar, 2005; Nilim and El Ghaoui, 2005)
$$V^{\pi} = \inf_{P\in\mathcal{P}} V_{\pi,P}, \qquad V^{*} = \sup_{\pi}\inf_{P\in\mathcal{P}} V_{\pi,P}. \tag{2}$$
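To make the TV uncertainty set concrete, the following toy sketch (not part of the paper's method) computes the basic primitive behind these definitions, the worst-case next-state expectation $\inf_{P_{s,a}\in\mathcal{P}_{s,a}}\mathbb{E}_{s'\sim P_{s,a}}[V(s')]$ for a single $(s,a)$ pair, by posing it as a small linear program and solving it with SciPy. The nominal distribution, value vector, and radius below are made-up numbers.

```python
# Toy illustration (not the paper's algorithm): worst-case expectation over a TV ball.
# For a fixed (s, a), solve  min_p  p^T v  s.t.  (1/2)||p - p_o||_1 <= rho,
# p >= 0, sum(p) = 1, by introducing slack variables u >= |p - p_o|.
import numpy as np
from scipy.optimize import linprog

def worst_case_expectation(p_o, v, rho):
    """min_{p in Delta(S): TV(p, p_o) <= rho} sum_s p[s] * v[s]."""
    n = len(p_o)
    # Decision variables x = [p (n entries), u (n entries)]; minimize v^T p.
    c = np.concatenate([v, np.zeros(n)])
    # u >= p - p_o  and  u >= p_o - p  (so u >= |p - p_o|), plus (1/2) sum(u) <= rho.
    A_ub = np.vstack([np.block([[np.eye(n), -np.eye(n)],
                                [-np.eye(n), -np.eye(n)]]),
                      np.concatenate([np.zeros(n), 0.5 * np.ones(n)])])
    b_ub = np.concatenate([p_o, -p_o, [rho]])
    # p must be a probability distribution: sum_s p_s = 1.
    A_eq = np.concatenate([np.ones(n), np.zeros(n)]).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(0, None)] * n, method="highs")
    return res.fun

p_o = np.array([0.5, 0.3, 0.2])   # made-up nominal next-state distribution P^o_{s,a}
v = np.array([1.0, 0.2, 0.6])     # made-up value function over next states
print(worst_case_expectation(p_o, v, rho=0.1))  # strictly below the nominal value p_o @ v = 0.71
```

The worst-case model shifts probability mass (up to the TV budget) toward the lowest-value next states, which is exactly the adversary the robust value functions in (2) guard against.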
The optimal robust policy $\pi^{*}$ is such that the robust value function corresponding to it matches the optimal robust value function, i.e., $V^{\pi^*} = V^{*}$. It is known that there exists a deterministic optimal policy for the RMDP (Iyengar, 2005). The robust Bellman operator is defined as (Iyengar, 2005)
$$(TQ)(s,a) = r(s,a) + \gamma\inf_{P_{s,a}\in\mathcal{P}_{s,a}}\mathbb{E}_{s'\sim P_{s,a}}\big[\max_{b} Q(s',b)\big]. \tag{3}$$
It is known that $T$ is a contraction mapping in the infinity norm, and hence it has a unique fixed point $Q^{*}$ with $V^{*}(s) = \max_{a} Q^{*}(s,a)$ and $\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)$ (Iyengar, 2005). The Robust Q-Iteration (RQI) can now be defined using the robust Bellman operator as $Q_{k+1} = TQ_k$. Since $T$ is a contraction, it follows that $Q_k\to Q^{*}$. So, RQI can be used to compute (i.e., to solve the planning problem for) $Q^{*}$ and $\pi^{*}$ in the tabular setting with a known $\mathcal{P}$. Due to the optimization over the uncertainty set $\mathcal{P}_{s,a}$ for each $(s,a)$ pair, solving the planning problem in an RMDP using RQI is much more computationally intensive than solving it in an MDP using Q-Iteration.

Offline RL: Offline RL considers the problem of learning the optimal policy of an MDP when the algorithm does not have direct access to the environment and cannot generate data samples in an online manner. For learning the optimal policy $\pi^*_P$ of an MDP with model $P$, the algorithm has access only to an offline dataset $D_P = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$, where $(s_i,a_i)\sim\mu$ for some distribution $\mu\in\Delta(S\times A)$, and $s'_i\sim P_{s_i,a_i}$. Fitted Q-Iteration (FQI) is a popular offline RL approach which is amenable to theoretical analysis while achieving impressive empirical performance. In addition to the dataset $D_P$, FQI uses a function class $\mathcal{F} = \{f : S\times A\to[0, 1/(1-\gamma)]\}$ to approximate $Q^*_P$. The typical FQI update is given by
$$f_{k+1} = \arg\min_{f\in\mathcal{F}}\sum_{i=1}^{N}\big(r(s_i,a_i) + \gamma\max_{b} f_k(s'_i, b) - f(s_i,a_i)\big)^2,$$
which aims to approximate the non-robust Bellman update using offline data with function approximation. Under suitable assumptions, it is possible to obtain provable performance guarantees for FQI (Szepesvári and Munos, 2005; Chen and Jiang, 2019; Liu et al., 2020).

3 Offline Robust Reinforcement Learning

The goal of an offline robust RL algorithm is to learn the optimal robust policy $\pi^{*}$ using a pre-collected offline dataset $D$. The data is typically generated according to a nominal (training) model $P^o$, i.e., $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$, where $(s_i,a_i)\sim\mu$ for some data-generating distribution $\mu\in\Delta(S\times A)$, and $s'_i\sim P^{o}_{s_i,a_i}$. The uncertainty set $\mathcal{P}$ is defined around this nominal model $P^o$ as given in (1) w.r.t. the total variation distance metric. We emphasize that the learning algorithm does not know the nominal model $P^o$, as it only has access to $D$, and hence it also does not know $\mathcal{P}$. Moreover, the learning algorithm does not have data generated according to any other model in $\mathcal{P}$ and has to rely only on $D$ to account for the behavior w.r.t. all models in $\mathcal{P}$.

Learning policies for RL problems with large state-action spaces is computationally intractable. RL algorithms typically overcome this issue by using function approximation. In this paper, we consider two function classes, $\mathcal{F} = \{f : S\times A\to[0, 1/(1-\gamma)]\}$ and $\mathcal{G} = \{g : S\times A\to[0, 2/(\rho(1-\gamma))]\}$. We use $\mathcal{F}$ to approximate $Q^{*}$ and $\mathcal{G}$ to approximate the dual variable functions which we will introduce in the next section. For simplicity, we will first assume that these function classes are finite but exponentially large, and we will use the standard log-cardinality to characterize the sample complexity results, as given in Theorem 1. We note that, at the cost of additional notation and analysis, infinite function classes can also be considered, where the log-cardinalities are replaced by the appropriate notions of covering number.
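For concreteness, here is a minimal sketch of the offline data format assumed above: tuples $(s_i, a_i, r_i, s'_i)$ with $(s_i, a_i)$ drawn from a data-generating distribution $\mu$ and $s'_i$ drawn from the nominal model $P^o$. The tabular model, reward, and $\mu$ below are made-up placeholders, not the environments used in this paper.

```python
# Minimal sketch (toy setup, not the paper's): generating an offline dataset
# D = {(s_i, a_i, r_i, s'_i)} from a nominal tabular model P^o under a
# data-generating distribution mu over state-action pairs.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, N = 5, 2, 10_000

# Made-up nominal model P^o and reward; each row P_o[s, a] is a distribution over next states.
P_o = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# Made-up behavior distribution mu over (s, a); Assumption 1 below asks that the
# occupancy of any policy under P^o is dominated by mu (bounded density ratio).
mu = rng.dirichlet(np.ones(n_states * n_actions)).reshape(n_states, n_actions)

idx = rng.choice(n_states * n_actions, size=N, p=mu.ravel())
s, a = np.unravel_index(idx, (n_states, n_actions))
s_next = np.array([rng.choice(n_states, p=P_o[si, ai]) for si, ai in zip(s, a)])
dataset = list(zip(s, a, r[s, a], s_next))   # D = {(s_i, a_i, r_i, s'_i)}
print(dataset[:3])
```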
Similar to the non-robust offline RL setting, we make the following standard assumptions about the data-generating distribution $\mu$ and the representation power of $\mathcal{F}$.

Assumption 1 (Concentratability). There exists a finite constant $C > 0$ such that for any $\nu\in\{d_{\pi,P^o} : \pi \text{ is any policy}\}\subseteq\Delta(S\times A)$, we have $\|\nu/\mu\|_{\infty}\le C$.

Assumption 1 states that the ratio of the distribution $\nu$ and the data-generating distribution $\mu$, $\nu(s,a)/\mu(s,a)$, is uniformly bounded. This assumption is widely used in the offline RL literature (Munos, 2003; Agarwal et al., 2019; Chen and Jiang, 2019; Wang et al., 2021; Xie et al., 2021) in many different forms. We borrow this assumption from Chen and Jiang (2019), where it was used for non-robust offline RL. In particular, we note that for robust RL the distribution $\nu$ ranges only over the collection of discounted state-action occupancies under the model $P^o$.

Assumption 2 (Approximate completeness). Let $\mu\in\Delta(S\times A)$ be the data distribution. Then, $\sup_{f\in\mathcal{F}}\inf_{f'\in\mathcal{F}}\|f' - Tf\|^2_{2,\mu}\le\varepsilon_c$.

Assumption 2 states that the function class $\mathcal{F}$ is approximately closed under the robust Bellman operator $T$. This assumption has also been widely used in the offline RL literature (Agarwal et al., 2019; Chen and Jiang, 2019; Wang et al., 2021; Xie et al., 2021). One of the most important properties that the function class $\mathcal{F}$ should have is that there must exist a function $f\in\mathcal{F}$ which well-approximates $Q^{*}$. This property is typically called approximate realizability in the offline RL literature and is formalized by assuming $\inf_{f\in\mathcal{F}}\|f - Tf\|^2_{2,\mu}\le\varepsilon_r$ (Chen and Jiang, 2019). It is known that the approximate completeness assumption and the concentratability assumption together imply the realizability assumption (Chen and Jiang, 2019; Xie et al., 2021).

4 Robust Fitted Q-Iteration: Algorithm and Main Results

In this section, we give a step-by-step approach to overcome the challenges of offline robust RL outlined in Section 1. We then combine these intermediate steps to obtain our proposed RFQI algorithm. Finally, we present our main result on the performance guarantee of the RFQI algorithm, followed by a brief description of the proof approach.

4.1 Dual Reformulation of the Robust Bellman Operator

One key challenge in directly using the standard definition of the optimal robust value function in (2), or of the robust Bellman operator in (3), for developing and analyzing robust RL algorithms is that both involve computing an expectation w.r.t. each model $P\in\mathcal{P}$. Given that the data is generated only according to the nominal model $P^o$, estimating these expectations is challenging. We show that we can overcome this difficulty through the following dual reformulation of the robust Bellman operator.

Proposition 1. Let $M$ be an RMDP with the uncertainty set $\mathcal{P}$ specified by (1) using the total variation distance $D(P_{s,a}, P^{o}_{s,a}) = (1/2)\|P_{s,a} - P^{o}_{s,a}\|_1$. Then, for any $Q : S\times A\to[0, 1/(1-\gamma)]$, the robust Bellman operator $T$ in (3) can be equivalently written as
$$(TQ)(s,a) = r(s,a) - \gamma\inf_{\eta\in[0,\,2/(\rho(1-\gamma))]}\Big(\mathbb{E}_{s'\sim P^{o}_{s,a}}\big[(\eta - V(s'))_+\big] - \eta + \rho\,\big(\eta - \inf_{s'} V(s')\big)_+\Big), \tag{4}$$
where $V(s) = \max_{a\in A} Q(s,a)$. Moreover, the inner optimization problem in (4) is convex in $\eta$.

This result mainly relies on Shapiro (2017, Section 3.2) and Duchi and Namkoong (2018, Proposition 1). Note that in (4) the expectation is now only w.r.t. the nominal model $P^o$, which opens up the possibility of using empirical estimates obtained from the data generated according to $P^o$. This avoids the need to use importance sampling based techniques to account for all models in $\mathcal{P}$, which often have high variance and are thus not desirable.
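The identity in Proposition 1 can be sanity-checked numerically on a toy example. The sketch below (illustrative only, assuming SciPy is available) compares the scalar dual minimization over $\eta$ from (4) with a direct linear-programming solution of the inner problem $\inf_{P_{s,a}\in\mathcal{P}_{s,a}}\mathbb{E}_{s'\sim P_{s,a}}[V(s')]$; the nominal distribution and value function are random placeholders.

```python
# Toy numerical check (not the paper's code) of the dual form in (4):
#   inf_{P in TV-ball} E_P[V]  =  - inf_eta ( E_{P^o}[(eta - V)_+] - eta + rho*(eta - min V)_+ )
import numpy as np
from scipy.optimize import linprog, minimize_scalar

rng = np.random.default_rng(1)
n, rho, gamma = 6, 0.3, 0.9
p_o = rng.dirichlet(np.ones(n))                     # nominal next-state distribution P^o_{s,a}
V = rng.uniform(0.0, 1.0 / (1 - gamma), size=n)     # a value function bounded by 1/(1-gamma)

# Primal: min_p p^T V  s.t.  (1/2)||p - p_o||_1 <= rho,  p in the simplex (LP with slacks u >= |p - p_o|).
c = np.concatenate([V, np.zeros(n)])
A_ub = np.vstack([np.block([[np.eye(n), -np.eye(n)], [-np.eye(n), -np.eye(n)]]),
                  np.concatenate([np.zeros(n), 0.5 * np.ones(n)])])
b_ub = np.concatenate([p_o, -p_o, [rho]])
A_eq = np.concatenate([np.ones(n), np.zeros(n)]).reshape(1, -1)
primal = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                 bounds=[(0, 1)] * n + [(0, None)] * n, method="highs").fun

# Dual: scalar convex minimization over eta in [0, 2/(rho*(1-gamma))], as in (4).
def dual_obj(eta):
    return p_o @ np.maximum(eta - V, 0.0) - eta + rho * max(eta - V.min(), 0.0)

res = minimize_scalar(dual_obj, bounds=(0.0, 2.0 / (rho * (1 - gamma))), method="bounded")
print(primal, -res.fun)   # the two values should agree up to solver tolerance
```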
While (4) provides a form that is amenable to estimation using offline data, it involves finding $\inf_{s'} V(s')$. Though this computation is straightforward in a tabular setting, it is infeasible in a function approximation setting. In order to overcome this issue, we make the following assumption.

Assumption 3 (Fail-state). The RMDP $M$ has a fail-state $s_f$ such that $r(s_f, a) = 0$ and $P_{s_f,a}(s_f) = 1$ for all $a\in A$ and all $P\in\mathcal{P}$.

We note that this is not a very restrictive assumption, because such a fail-state is quite natural in most simulated or real-world systems. For example, a state where a robot collapses and is not able to get up, either in a simulation environment like MuJoCo or in a real-world setting, is such a fail-state. Assumption 3 immediately implies that $V_{\pi,P}(s_f) = 0$ for all $P\in\mathcal{P}$, and hence $V^{*}(s_f) = 0$ and $Q^{*}(s_f, a) = 0$ for all $a\in A$. It is also straightforward to see that $Q_{k+1}(s_f, a) = 0$ for all $a\in A$, where the $Q_k$'s are the RQI iterates given by the robust Bellman update $Q_{k+1} = TQ_k$ with the initialization $Q_0 = 0$. By the contraction property of $T$, we have $Q_k\to Q^{*}$. So, under Assumption 3, without loss of generality, we can always keep $Q_k(s_f, a) = 0$ for all $a\in A$ and all $k$ in RQI (and later in RFQI). In light of the above discussion, for the rest of the paper we use the robust Bellman operator $T$ with $\inf_{s'} V(s') = 0$. In particular, for any function $f : S\times A\to[0, 1/(1-\gamma)]$ with $f(s_f, a) = 0$, the robust Bellman operator $T$ is now given by
$$(Tf)(s,a) = r(s,a) - \gamma\inf_{\eta\in[0,\,2/(\rho(1-\gamma))]}\Big(\mathbb{E}_{s'\sim P^{o}_{s,a}}\big[\big(\eta - \max_{a'} f(s',a')\big)_+\big] - (1-\rho)\,\eta\Big). \tag{5}$$

4.2 Approximately Solving the Dual Optimization using Empirical Risk Minimization

Another key challenge in directly using the standard definition of the optimal robust value function in (2), or of the robust Bellman operator in (3), is that both involve an optimization over $\mathcal{P}$. The dual reformulation in (5) partially overcomes this challenge as well, since the optimization over $\mathcal{P}$ is replaced by a convex optimization over a scalar $\eta\in[0, 2/(\rho(1-\gamma))]$. However, this still requires solving an optimization problem for each $(s,a)\in S\times A$, which is clearly infeasible even for moderately sized state-action spaces, not to mention the function approximation setting. Our key idea to overcome this difficulty is to reformulate this as a single functional optimization problem instead of solving multiple scalar optimization problems. This functional optimization formulation makes the dual problem amenable to approximate solution via empirical risk minimization with offline data.

Consider the probability (measure) space $(S\times A, \Sigma(S\times A), \mu)$ and let $L^1(S\times A, \Sigma(S\times A), \mu)$ be the set of all absolutely integrable functions defined on this space (in the following, we simply denote it as $L^1$ for conciseness). In other words, $L^1$ is the set of all functions $g : S\times A\to C\subset\mathbb{R}$ such that $\|g\|_{1,\mu}$ is finite. We set $C = [0, 2/(\rho(1-\gamma))]$, anticipating the solution of the dual optimization problem (5). We also note that $\mu$, the data-generating distribution, is a $\sigma$-finite measure. For any given function $f : S\times A\to[0, 1/(1-\gamma)]$, we define the loss function $L_{\mathrm{dual}}(\cdot\,; f)$ as
$$L_{\mathrm{dual}}(g; f) = \mathbb{E}_{s,a\sim\mu}\Big[\mathbb{E}_{s'\sim P^{o}_{s,a}}\big[\big(g(s,a) - \max_{a'} f(s',a')\big)_+\big] - (1-\rho)\,g(s,a)\Big], \qquad g\in L^1. \tag{6}$$

In the following lemma, we show that the scalar optimization over $\eta$ for each $(s,a)$ pair in (5) can be replaced by a single functional optimization w.r.t. the loss function $L_{\mathrm{dual}}$.

Lemma 1. Let $L_{\mathrm{dual}}$ be the loss function defined in (6). Then, for any function $f : S\times A\to[0, 1/(1-\gamma)]$, we have
$$\inf_{g\in L^1} L_{\mathrm{dual}}(g; f) = \mathbb{E}_{s,a\sim\mu}\Big[\inf_{\eta\in[0,\,2/(\rho(1-\gamma))]}\Big(\mathbb{E}_{s'\sim P^{o}_{s,a}}\big[\big(\eta - \max_{a'} f(s',a')\big)_+\big] - (1-\rho)\,\eta\Big)\Big]. \tag{7}$$

Note that the RHS of (7) has a minimization over $\eta$ for each $(s,a)$ pair, and this minimization is inside the expectation $\mathbb{E}_{s,a\sim\mu}[\cdot]$. The LHS of (7), however, has a single functional minimization over $g\in L^1$, and this minimization is outside the expectation. For interchanging the expectation and minimization, and for moving from point-wise optimization to functional optimization, we use the result from Rockafellar and Wets (2009, Theorem 14.60), along with the fact that $L^1$ is a decomposable space. We also note that this result has been used in many recent works on distributionally robust optimization (Shapiro, 2017; Duchi and Namkoong, 2018) (see Appendix A for more details). We can now define the empirical loss function $\widehat{L}_{\mathrm{dual}}$ corresponding to the true loss $L_{\mathrm{dual}}$ as
$$\widehat{L}_{\mathrm{dual}}(g; f) = \frac{1}{N}\sum_{i=1}^{N}\Big[\big(g(s_i, a_i) - \max_{a'} f(s'_i, a')\big)_+ - (1-\rho)\,g(s_i, a_i)\Big]. \tag{8}$$

Now, for any given $f$, we can find an approximately optimal dual function through the empirical risk minimization approach as $\inf_{g\in L^1}\widehat{L}_{\mathrm{dual}}(g; f)$. As mentioned in Section 3, our offline robust RL algorithm is given an input function class $\mathcal{G} = \{g : S\times A\to[0, 2/(\rho(1-\gamma))]\}$ to approximate the dual variable functions. So, in the empirical risk minimization, instead of taking the infimum over all functions in $L^1$, we only take the infimum over the functions in $\mathcal{G}$. For this to be meaningful, $\mathcal{G}$ should have sufficient representation power. In particular, the result in Lemma 1 should hold approximately even if we replace the infimum over $L^1$ with the infimum over $\mathcal{G}$. One can see that this is similar to the realizability requirement for the function class $\mathcal{F}$ described in Section 3. We formalize the representation power of $\mathcal{G}$ in the following assumption.

Assumption 4 (Approximate dual realizability). For all $f\in\mathcal{F}$, there exists a uniform constant $\varepsilon_{\mathrm{dual}}$ such that $\inf_{g\in\mathcal{G}} L_{\mathrm{dual}}(g; f) - \inf_{g\in L^1} L_{\mathrm{dual}}(g; f)\le\varepsilon_{\mathrm{dual}}$.

Using the above assumption, for any given $f\in\mathcal{F}$, we can find an approximately optimal dual function $\widehat{g}_f\in\mathcal{G}$ through the empirical risk minimization approach as $\widehat{g}_f = \arg\min_{g\in\mathcal{G}}\widehat{L}_{\mathrm{dual}}(g; f)$. In order to characterize the performance of this approach, consider the operator $T_g$, for any $g\in\mathcal{G}$, defined as
$$(T_g f)(s,a) = r(s,a) - \gamma\Big(\mathbb{E}_{s'\sim P^{o}_{s,a}}\big[\big(g(s,a) - \max_{a'} f(s',a')\big)_+\big] - (1-\rho)\,g(s,a)\Big), \tag{9}$$
for all $f\in\mathcal{F}$ and $(s,a)\in S\times A$. We show in Lemma 6 in Appendix C that the error $\sup_{f\in\mathcal{F}}\|Tf - T_{\widehat{g}_f} f\|_{1,\mu}$ is $O(\sqrt{\log(|\mathcal{F}|/\delta)/N})$ with probability at least $1-\delta$.
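As an illustration of this dual-variable ERM step, the sketch below (assuming PyTorch; the network size, learning rate, and data shapes are arbitrary choices, not the paper's implementation) fits a function $\widehat{g}_f\in\mathcal{G}$ by minimizing the empirical dual loss (8) with stochastic gradient descent, enforcing the range constraint of $\mathcal{G}$ through a scaled sigmoid output.

```python
# Sketch of the dual-variable ERM step (eq. (8)), assuming PyTorch; all sizes and
# hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

gamma, rho = 0.99, 0.3
g_max = 2.0 / (rho * (1.0 - gamma))          # range of functions in the class G

class DualNet(nn.Module):
    """A parametric g_theta : (s, a) -> [0, g_max]."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())
    def forward(self, s, a):
        return g_max * self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def fit_dual(g_net, f_next_max, s, a, steps=200, lr=1e-3):
    """Minimize the empirical dual loss (8):
       (1/N) sum_i [ (g(s_i,a_i) - max_a' f(s'_i,a'))_+ - (1 - rho) g(s_i,a_i) ],
       where f_next_max[i] = max_a' f(s'_i, a') has been precomputed."""
    opt = torch.optim.Adam(g_net.parameters(), lr=lr)
    for _ in range(steps):
        g = g_net(s, a)
        loss = (torch.clamp(g - f_next_max, min=0.0) - (1.0 - rho) * g).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return g_net

# Example usage with made-up batch data:
N, state_dim, action_dim = 256, 4, 2
s, a = torch.randn(N, state_dim), torch.randn(N, action_dim)
f_next_max = torch.rand(N) / (1.0 - gamma)   # stands in for max_a' f(s'_i, a')
g_hat = fit_dual(DualNet(state_dim, action_dim), f_next_max, s, a)
```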
4.3 Robust Fitted Q-Iteration

The intuitive idea behind our Robust Fitted Q-Iteration (RFQI) algorithm is to approximate the exact RQI update step $Q_{k+1} = TQ_k$ with function approximation using offline data. The exact RQI step requires updating each $(s,a)$ pair separately, which is not scalable to large state-action spaces. So, this step is replaced by the function approximation update $Q_{k+1} = \arg\min_{f\in\mathcal{F}}\|TQ_k - f\|^2_{2,\mu}$. It is still infeasible to perform this update, as it requires computing the expectations (w.r.t. $P^o$ and $\mu$) exactly and solving the dual problem accurately. We overcome these issues by replacing both of these exact computations with empirical estimates obtained from the offline data. We note that this intuitive idea is similar to that of the FQI algorithm in the non-robust case. However, RFQI poses unique challenges due to the nature of the robust Bellman operator $T$ and the presence of the dual optimization problem within $T$.

Given a dataset $D$, we also follow the standard non-robust offline RL choice of least-squares residual minimization (Chen and Jiang, 2019; Xie et al., 2021; Wang et al., 2021). Define the empirical loss of $f$, given $f'$ (which represents the Q-function from the last iteration) and a dual variable function $g$, as
$$\widehat{L}_{\mathrm{RFQI}}(f; f', g) = \frac{1}{N}\sum_{i=1}^{N}\Big(r(s_i, a_i) - \gamma\Big[\big(g(s_i, a_i) - \max_{a'} f'(s'_i, a')\big)_+ - (1-\rho)\,g(s_i, a_i)\Big] - f(s_i, a_i)\Big)^2. \tag{10}$$
The correct dual variable function to be used in (10) is the optimal dual variable $g^*_{f'} = \arg\min_{g\in\mathcal{G}} L_{\mathrm{dual}}(g; f')$ corresponding to the last iterate $f'$, which we approximate by $\widehat{g}_{f'} = \arg\min_{g\in\mathcal{G}}\widehat{L}_{\mathrm{dual}}(g; f')$. The RFQI update is then obtained as $\arg\min_{f\in\mathcal{F}}\widehat{L}_{\mathrm{RFQI}}(f; f', \widehat{g}_{f'})$.
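The robust Q-update in (10) is a standard least-squares regression onto a modified target. The sketch below (again assuming PyTorch, with made-up data and a placeholder $\widehat{g}$) shows one such update for a Q-network over a finite action set.

```python
# Sketch of the robust Q-update (eq. (10)), assuming PyTorch; the networks, data,
# and the fitted dual values g_hat are illustrative placeholders.
import torch
import torch.nn as nn

gamma, rho = 0.99, 0.3
N, state_dim, n_actions = 256, 4, 3

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_prev = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))  # Q_k, frozen

# Made-up batch: states, integer actions, rewards, next states, and g_hat(s_i, a_i)
# (the dual function fitted in the previous ERM step).
s = torch.randn(N, state_dim); a = torch.randint(0, n_actions, (N,))
r = torch.rand(N); s_next = torch.randn(N, state_dim)
g_hat = torch.rand(N) * 2.0 / (rho * (1.0 - gamma))

opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
for _ in range(200):
    with torch.no_grad():
        # Robust empirical target from (10):
        #   r_i - gamma * [ (g(s_i,a_i) - max_a' Q_k(s'_i,a'))_+ - (1 - rho) g(s_i,a_i) ]
        max_next = q_prev(s_next).max(dim=1).values
        target = r - gamma * (torch.clamp(g_hat - max_next, min=0.0)
                              - (1.0 - rho) * g_hat)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((q_sa - target) ** 2).mean()     # least-squares regression onto the robust target
    opt.zero_grad(); loss.backward(); opt.step()
```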
Summarizing the individual steps described above, we formally state our RFQI algorithm below.

Algorithm 1 Robust Fitted Q-Iteration (RFQI) Algorithm
1: Input: offline dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$, function classes $\mathcal{F}$ and $\mathcal{G}$.
2: Initialize: $Q_0 \equiv 0 \in \mathcal{F}$.
3: for $k = 0, \ldots, K-1$ do
4: Dual variable function optimization: compute the dual variable function corresponding to $Q_k$ through empirical risk minimization as $g_k = \widehat{g}_{Q_k} = \arg\min_{g\in\mathcal{G}}\widehat{L}_{\mathrm{dual}}(g; Q_k)$ (see (8)).
5: Robust Q-update: compute the next iterate $Q_{k+1}$ through least-squares regression as $Q_{k+1} = \arg\min_{Q\in\mathcal{F}}\widehat{L}_{\mathrm{RFQI}}(Q; Q_k, g_k)$ (see (10)).
6: end for
7: Output: $\pi_K(s) = \arg\max_{a} Q_K(s,a)$.

Now we state our main theoretical result on the performance of the RFQI algorithm.

Theorem 1. Let Assumptions 1-4 hold, and let $\pi_K$ be the output of the RFQI algorithm after $K$ iterations. Denote $J_{\pi} = \mathbb{E}_{s\sim d_0}[V^{\pi}(s)]$, where $d_0$ is the initial state distribution. Then, for any $\delta\in(0,1)$, with probability at least $1-2\delta$, we have
$$J_{\pi^*} - J_{\pi_K} \;\le\; \frac{\gamma^{K}}{(1-\gamma)^2} \;+\; \frac{\sqrt{C}\,\big(\sqrt{6\varepsilon_c} + \gamma\,\varepsilon_{\mathrm{dual}}\big)}{(1-\gamma)^2} \;+\; \frac{16}{\rho\,(1-\gamma)^3}\sqrt{\frac{18\,C\,\log(2|\mathcal{F}||\mathcal{G}|/\delta)}{N}}.$$

Remark 1. Theorem 1 states that the RFQI algorithm can achieve approximate optimality. To see this, note that with $K \ge O\big(\tfrac{1}{\log(1/\gamma)}\log\tfrac{1}{\varepsilon(1-\gamma)}\big)$, and neglecting the second term corresponding to the (inevitable) approximation errors $\varepsilon_c$ and $\varepsilon_{\mathrm{dual}}$, we get $J_{\pi^*} - J_{\pi_K} \le \varepsilon/(1-\gamma)$ with probability greater than $1-2\delta$ for any $\varepsilon, \delta\in(0,1)$, as long as the number of samples $N \ge O\big(\tfrac{1}{(\rho\varepsilon)^2(1-\gamma)^4}\log\tfrac{|\mathcal{F}||\mathcal{G}|}{\delta}\big)$. So, the above theorem can also be interpreted as a sample complexity result.

Remark 2. The known sample complexity of robust RL in the tabular setting is $\widetilde{O}\big(\tfrac{|S|^2|A|}{(\rho\varepsilon)^2(1-\gamma)^4}\big)$ (Yang et al., 2021; Panaganti and Kalathil, 2022). Considering $\widetilde{O}(\log(|\mathcal{F}||\mathcal{G}|))$ to be $\widetilde{O}(|S||A|)$, we can recover the same bound as in the tabular setting (we save a factor of $|S|$ due to the use of the Bernstein inequality).

Remark 3. Under similar Bellman completeness and concentratability assumptions, the RFQI sample complexity is comparable to that of a non-robust offline RL algorithm, i.e., $O\big(\tfrac{1}{\varepsilon^2(1-\gamma)^4}\log\tfrac{|\mathcal{F}|}{\delta}\big)$ (Chen and Jiang, 2019). As a consequence of robustness, we have additional $\rho^{-2}$ and $\log(|\mathcal{G}|)$ factors in our bound.
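To read Theorem 1 as a sample-complexity statement, the short computation below plugs illustrative values into the expressions in Remark 1; the absolute constants hidden in the $O(\cdot)$ notation are set to one, so the resulting numbers are only indicative.

```python
# Indicative evaluation of the expressions in Remark 1 (constants set to 1; the
# values of eps, gamma, rho, delta and the class-size term are made up).
import math

eps, gamma, rho, delta = 0.1, 0.99, 0.5, 0.05
log_card = 1e4                     # stands in for log(|F||G|)

K = math.log(1.0 / (eps * (1.0 - gamma))) / math.log(1.0 / gamma)
N = (log_card + math.log(1.0 / delta)) / ((rho * eps) ** 2 * (1.0 - gamma) ** 4)

print(f"iterations K ~ {K:.0f}")   # O( log(1/(eps(1-gamma))) / log(1/gamma) )
print(f"samples    N ~ {N:.2e}")   # O( log(|F||G|/delta) / ((rho*eps)^2 (1-gamma)^4) )
```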
4.4 Proof Sketch

Here we briefly explain the key ideas used in the analysis of RFQI for obtaining the optimality-gap bound in Theorem 1. The complete proof is provided in Appendix C.

Step 1: To bound $J_{\pi^*} - J_{\pi_K}$, we connect it to the error $\|Q^{\pi^*} - Q_K\|_{1,\nu}$ for any state-action distribution $\nu$. While the analogous step follows almost immediately from the well-known performance lemma in the analysis of non-robust FQI, such a result is not known in the robust RL setting. So, we derive the basic inequalities to get a recursive form and to obtain the bound $J_{\pi^*} - J_{\pi_K} \le 2\|Q^{\pi^*} - Q_K\|_{1,\nu}/(1-\gamma)$ (see (22) and the steps before it in Appendix C).

Step 2: To bound $\|Q^{\pi^*} - Q_K\|_{1,\nu}$ for any state-action distribution $\nu$ such that $\|\nu/\mu\|_{\infty}\le C$, we decompose it to get a recursion, with approximation terms based on the least-squares regression and the empirical risk minimization. Recall that $\widehat{g}_f$ is the dual variable function returned by the algorithm for the state-action value function $f\in\mathcal{F}$. Denote by $\widehat{f}_g$ the least-squares solution returned by the algorithm for the state-action value function $f\in\mathcal{F}$ and dual variable function $g\in\mathcal{G}$, i.e., $\widehat{f}_g = \arg\min_{Q\in\mathcal{F}}\widehat{L}_{\mathrm{RFQI}}(Q; f, g)$. By recursive use of the obtained inequality (23) (see Appendix C) and using a uniform bound, we get
$$\|Q^{\pi^*} - Q_K\|_{1,\nu} \;\le\; \frac{\gamma^{K}}{1-\gamma} \;+\; \frac{\sqrt{C}}{1-\gamma}\,\sup_{f\in\mathcal{F}}\|Tf - T_{\widehat{g}_f} f\|_{1,\mu} \;+\; \frac{\sqrt{C}}{1-\gamma}\,\sup_{f\in\mathcal{F}}\sup_{g\in\mathcal{G}}\|T_g f - \widehat{f}_g\|_{2,\mu}.$$

Step 3: We recognize that $\sup_{f\in\mathcal{F}}\|Tf - T_{\widehat{g}_f} f\|_{1,\mu}$ is an empirical risk minimization error term. Using Rademacher complexity based bounds, we show in Lemma 6 that this error is $O(\sqrt{\log(|\mathcal{F}|/\delta)/N})$ with high probability.

Step 4: Similarly, we recognize that $\sup_{f\in\mathcal{F}}\sup_{g\in\mathcal{G}}\|T_g f - \widehat{f}_g\|_{2,\mu}$ is a least-squares regression error term, and we show that this error is $O(\sqrt{\log(|\mathcal{F}||\mathcal{G}|/\delta)/N})$ with high probability. We adapt the generalized least-squares regression result to accommodate the modified target functions resulting from the robust Bellman operator to obtain this bound (see Lemma 7).

The proof is complete after combining Steps 1-4 above.

5 Experiments

Figure 1: CartPole, "force_mag" perturbation (x-axis: percentage change from the nominal value; y-axis: average cumulative reward in 20 games; methods: RFQI, SR-DQN, FQI, DQN).
Figure 2: CartPole, action perturbation (x-axis: probability of picking a random action; y-axis: average cumulative reward in 20 games; methods: RFQI, SR-DQN, FQI, DQN).
Figure 3: Hopper, "leg_joint_stiffness" perturbation (x-axis: leg_joint_stiffness values, default 0.0; y-axis: average cumulative reward in 20 games; methods: RFQI, SR-DDPG, FQI, TD3).

Here, we demonstrate the robust performance of our RFQI algorithm by evaluating it on the CartPole and Hopper environments in OpenAI Gym (Brockman et al., 2016). In all the figures shown, the quantity on the vertical axis is averaged over 20 different seeded runs, depicted by the thick line, and the band around it is 0.5 standard deviation. A more detailed description of the experiments, and results on additional experiments, are deferred to Appendix E. We provide our code at https://github.com/zaiyan-x/RFQI, containing instructions to reproduce all results in this paper.
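As a rough illustration of this evaluation protocol (perturb one environment parameter away from its nominal value and average the episodic return over repeated rollouts), the sketch below uses the classic Gym API for CartPole; force_mag matches the perturbation in Figure 1, but the random policy and episode budget are placeholders for a trained RFQI policy and the actual protocol described in Appendix E.

```python
# Sketch of the robustness-evaluation loop (classic Gym API assumed, i.e., reset()
# returns an observation and step() returns a 4-tuple): perturb force_mag and
# average episodic reward over several rollouts of a given policy.
import gym
import numpy as np

def evaluate(policy, perturbation_pct, episodes=20):
    env = gym.make("CartPole-v1")
    env.unwrapped.force_mag *= (1.0 + perturbation_pct / 100.0)  # perturb the nominal parameter
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        returns.append(total)
    env.close()
    return np.mean(returns)

random_policy = lambda obs: np.random.randint(2)   # placeholder for a trained RFQI policy
for pct in [-80, -60, -40, -20, 0]:
    print(pct, evaluate(random_policy, pct))
```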
For the CartPole environment, we compare the RFQI algorithm against the non-robust RL algorithms FQI and DQN, and against the soft-robust RL algorithm (here, soft-robust DQN) proposed in Derman et al. (2018). We test the robustness of the algorithms by changing the parameter force_mag (to model an external force disturbance), and also by introducing action perturbations (to model actuator noise). Fig. 1 and Fig. 2 show the superior robust performance of RFQI compared to the non-robust FQI and DQN. The RFQI performance is similar to that of soft-robust DQN. We note, however, that the soft-robust RL algorithm is an online deep RL algorithm (and not an offline RL algorithm) and has no provable performance guarantee. Moreover, the soft-robust RL algorithm requires generating online data according to a number of models in the uncertainty set, whereas RFQI only requires offline data generated according to a single nominal training model.

For the Hopper environment, we compare the RFQI algorithm against the non-robust RL algorithms FQI and TD3 (Fujimoto et al., 2018), and against the soft-robust RL algorithm (here, soft-robust DDPG) proposed in Derman et al. (2018). We test the robustness of the algorithms by changing the parameter leg_joint_stiffness. Fig. 3 shows the superior performance of our RFQI algorithm against the non-robust algorithms and the soft-robust DDPG algorithm. The average episodic reward of RFQI remains almost the same initially, and later decays much less, and more gracefully, than that of the non-robust FQI and TD3.

6 Conclusion

In this work, we presented a novel robust RL algorithm, called the Robust Fitted Q-Iteration algorithm, with provably optimal performance for an RMDP with arbitrarily large state space, using only offline data with function approximation. We also demonstrated the superior performance of the proposed algorithm on standard benchmark problems. One limitation of our present work is that we considered only the uncertainty set defined with respect to the total variation distance. In future work, we will consider uncertainty sets defined with respect to other f-divergences, such as the KL-divergence and the chi-square divergence. Finding a lower bound for the sample complexity and relaxing the assumptions used are also important and challenging open problems.

7 Acknowledgements

This work was supported in part by the National Science Foundation (NSF) grants NSF-CAREER-EPCN-2045783 and NSF ECCS 2038963. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsoring agencies.

References

Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. (2019). Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep. 4, 5, 19, 21
Agarwal, A., Kakade, S., and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67 83. 23
Antos, A., Szepesvári, C., and Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89 129. 2, 23
Azar, M. G., Munos, R., and Kappen, H. J. (2013). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325 349. 23
Bertsekas, D. P. (2011). Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310 335. 2, 23
Borkar, V. S. (2002). Q-learning for risk-sensitive control. Mathematics of Operations Research, 27(2):294 311. 23
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540. 8, 24, 27
Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042 1051. 2, 4, 5, 7, 8, 23
Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229 318. 16
Derman, E., Mankowitz, D., Mann, T., and Mannor, S. (2020). A Bayesian approach to robust reinforcement learning. In Uncertainty in Artificial Intelligence, pages 648 658. 2
Derman, E., Mankowitz, D. J., Mann, T. A., and Mannor, S. (2018). Soft-robust actor-critic policy-gradient. In AUAI press for Association for Uncertainty in Artificial Intelligence, pages 208 218.
2, 9, 23, 26 Duchi, J. and Namkoong, H. (2018). Learning models with uniform performance via distributionally robust optimization. ar Xiv preprint ar Xiv:1810.08750. 5, 6, 16 Dullerud, G. E. and Paganini, F. (2013). A course in robust control theory: a convex approach, volume 36. Springer Science & Business Media. 23 Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503 556. 2, 23 Farahmand, A.-m., Szepesvári, C., and Munos, R. (2010). Error propagation for approximate policy and value iteration. Advances in Neural Information Processing Systems, 23. 2, 23 Fei, Y., Yang, Z., Chen, Y., and Wang, Z. (2021). Exponential bellman equation and improved regret bounds for risk-sensitive reinforcement learning. In Annual Conference on Neural Information Processing Systems 2021, pages 20436 20446. 23 Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. (2020). D4rl: Datasets for deep data-driven reinforcement learning. 26, 27, 28 Fujimoto, S., Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1582 1591. 9, 27 Fujimoto, S., Meger, D., and Precup, D. (2019). Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052 2062. 2, 23, 24, 25, 27 Gordon, G. J. (1995). Stable function approximation in dynamic programming. In Machine learning proceedings 1995, pages 261 268. 2, 23 Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861 1870. 25, 26 Haskell, W. B., Jain, R., and Kalathil, D. (2016). Empirical dynamic programming. Mathematics of Operations Research, 41(2):402 429. 23 Huang, P., Xu, M., Fang, F., and Zhao, D. (2022). Robust reinforcement learning as a stackelberg game via adaptively-regularized adversarial training. ar Xiv preprint ar Xiv:2202.09514. 23 Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257 280. 1, 2, 3, 4, 21, 23 Kalathil, D., Borkar, V. S., and Jain, R. (2021). Empirical Q-Value Iteration. Stochastic Systems, 11(1):1 18. 23 Kaufman, D. L. and Schaefer, A. J. (2013). Robust modified policy iteration. INFORMS Journal on Computing, 25(3):396 410. 23 Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980. 24 Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114. 24 Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. (2019). Stabilizing off-policy q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11784 11794. 2, 23 Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179 1191. 2, 23 Lange, S., Gabel, T., and Riedmiller, M. (2012). Batch reinforcement learning. In Reinforcement learning, pages 45 73. Springer. 2, 23 Lazaric, A., Ghavamzadeh, M., and Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13:3041 3074. 2, 23 Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643. 
2, 23, 26, 27, 28 Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. (2020). Breaking the sample size barrier in model-based reinforcement learning with a generative model. In Advances in Neural Information Processing Systems, volume 33, pages 12861 12872. 23 Lim, S. H. and Autef, A. (2019). Kernel-based reinforcement learning in robust Markov decision processes. In International Conference on Machine Learning, pages 3973 3981. 23 Lim, S. H., Xu, H., and Mannor, S. (2013). Reinforcement learning in robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 701 709. 2 Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2020). Provably good batch off-policy reinforcement learning without great exploration. In Neural Information Processing Systems. 2, 4, 23, 24, 25, 26, 27, 28 Mankowitz, D. J., Levine, N., Jeong, R., Abdolmaleki, A., Springenberg, J. T., Shi, Y., Kay, J., Hester, T., Mann, T., and Riedmiller, M. (2020). Robust reinforcement learning for continuous control with model misspecification. In International Conference on Learning Representations. 2, 23 Mannor, S., Mebel, O., and Xu, H. (2016). Robust mdps with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484 1509. 2 Moses, A. K. and Sundaresan, R. (2011). Further results on geometric properties of a family of relative entropies. In 2011 IEEE International Symposium on Information Theory Proceedings, pages 1940 1944. 16 Munos, R. (2003). Error bounds for approximate policy iteration. In ICML, volume 3, pages 560 567. Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(27):815 857. 2, 23 Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780 798. 1, 2, 4, 23 Panaganti, K. and Kalathil, D. (2021). Robust reinforcement learning using least squares policy iteration with provable performance guarantees. In Proceedings of the 38th International Conference on Machine Learning, pages 511 520. 2, 23 Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pages 9582 9602. 2, 8, 24 Peng, X. B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2018). Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), pages 3803 3810. IEEE. 1 Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. (2017). Robust adversarial reinforcement learning. In International Conference on Machine Learning, pages 2817 2826. 2, 23 Prashanth, L. A. and Ghavamzadeh, M. (2016). Variance-constrained actor-critic algorithms for discounted and average reward mdps. Mach. Learn., 105(3):367 417. 23 Raffin, A. (2020). Rl baselines3 zoo. https://github.com/DLR-RM/rl-baselines3-zoo. 25 Rockafellar, R. T. and Wets, R. J.-B. (2009). Variational analysis, volume 317. Springer Science & Business Media. 6, 15, 16 Roy, A., Xu, H., and Pokutta, S. (2017). Reinforcement learning under model mismatch. In Advances in Neural Information Processing Systems, pages 3043 3052. 2, 23 Russel, R. H. and Petrik, M. (2019). Beyond confidence regions: Tight bayesian ambiguity sets for robust mdps. Advances in Neural Information Processing Systems. 2 Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). 
Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347. 25 Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press. 15 Shapiro, A. (2017). Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258 2275. 5, 6, 16 Sidford, A., Wang, M., Wu, X., Yang, L. F., and Ye, Y. (2018). Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 5192 5202. 23 Singh, S. P. and Yee, R. C. (1994). An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227 233. 23 Sünderhauf, N., Brock, O., Scheirer, W., Hadsell, R., Fox, D., Leitner, J., Upcroft, B., Abbeel, P., Burgard, W., Milford, M., et al. (2018). The limits and potentials of deep learning for robotics. The International journal of robotics research, 37(4-5):405 420. 1 Szepesvári, C. and Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd international conference on Machine learning, pages 880 887. 4 Tamar, A., Mannor, S., and Xu, H. (2014). Scaling up robust mdps using function approximation. In International Conference on Machine Learning, pages 181 189. 2, 23 Tessler, C., Efroni, Y., and Mannor, S. (2019). Action robust reinforcement learning and applications in continuous control. In International Conference on Machine Learning, pages 6215 6224. 23 Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23 30. 1 Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University press. 15 Wang, R., Foster, D., and Kakade, S. M. (2021). What are the statistical limits of offline {rl} with linear function approximation? In International Conference on Learning Representations. 4, 5, 7 Wang, Y. and Zou, S. (2021). Online robust reinforcement learning with model uncertainty. Advances in Neural Information Processing Systems, 34. 2 Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust Markov decision processes. Mathematics of Operations Research, 38(1):153 183. 2, 23 Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34. 2, 4, 5, 7, 23 Xie, T. and Jiang, N. (2020). Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550 559. 2, 23 Xu, H. and Mannor, S. (2010). Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 2505 2513. 2, 23 Yang, W., Zhang, L., and Zhang, Z. (2021). Towards theoretical understandings of robust markov decision processes: Sample complexity and asymptotics. ar Xiv preprint ar Xiv:2105.03863. 2, 8, 24 Yu, P. and Xu, H. (2015). Distributionally robust counterpart in Markov decision processes. IEEE Transactions on Automatic Control, 61(9):2538 2543. 2 Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J., Levine, S., Finn, C., and Ma, T. (2020). Mopo: Model-based offline policy optimization. 
In Advances in Neural Information Processing Systems. 2, 23
Zhang, H., Chen, H., Xiao, C., Li, B., Liu, M., Boning, D., and Hsieh, C.-J. (2020a). Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems, 33:21024 21037. 2
Zhang, K., Hu, B., and Basar, T. (2020b). Policy optimization for H2 linear control with H-infinity robustness guarantee: Implicit regularization and global convergence. In Proceedings of the 2nd Annual Conference on Learning for Dynamics and Control, volume 120, pages 179 190. 23
Zhang, S. and Jiang, N. (2021). Towards hyperparameter-free policy selection for offline reinforcement learning. In Advances in Neural Information Processing Systems, pages 12864 12875. 2, 23
Zhang, X., Chen, Y., Zhu, X., and Sun, W. (2022). Corruption-robust offline reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 5757 5773. PMLR. 23
Zhang, Y., Yang, Z., and Wang, Z. (2021). Provably efficient actor-critic for risk-sensitive and robust adversarial RL: A linear-quadratic case. In International Conference on Artificial Intelligence and Statistics, pages 2764 2772. 23
Zhou, K., Doyle, J. C., Glover, K., et al. (1996). Robust and optimal control, volume 40. Prentice Hall, New Jersey. 23
Zhou, Z., Bai, Q., Zhou, Z., Qiu, L., Blanchet, J., and Glynn, P. (2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3331 3339. 2, 24