Conservative Offline Distributional Reinforcement Learning

Yecheng Jason Ma, Dinesh Jayaraman, Osbert Bastani
University of Pennsylvania
{jasonyma, dineshj, obastani}@seas.upenn.edu

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Many reinforcement learning (RL) problems in practice are offline, learning purely from observational data. A key challenge is how to ensure the learned policy is safe, which requires quantifying the risk associated with different actions. In the online setting, distributional RL algorithms do so by learning the distribution over returns (i.e., cumulative rewards) instead of the expected return; beyond quantifying risk, they have also been shown to learn better representations for planning. We propose Conservative Offline Distributional Actor Critic (CODAC), an offline RL algorithm suitable for both risk-neutral and risk-averse domains. CODAC adapts distributional RL to the offline setting by penalizing the predicted quantiles of the return for out-of-distribution actions. We prove that CODAC learns a conservative return distribution; in particular, for finite MDPs, CODAC converges to a uniform lower bound on the quantiles of the return distribution. Our proof relies on a novel analysis of the distributional Bellman operator. In our experiments, on two challenging robot navigation tasks, CODAC successfully learns risk-averse policies using offline data collected purely from risk-neutral agents. Furthermore, CODAC is state-of-the-art on the D4RL MuJoCo benchmark in terms of both expected and risk-sensitive performance. Code is available at: https://github.com/JasonMa2016/CODAC

1 Introduction

In many applications of reinforcement learning, actively gathering data through interactions with the environment can be risky and unsafe. Offline (or batch) reinforcement learning (RL) avoids this problem by learning a policy solely from historical data (called observational data) [9, 22, 23]. A shortcoming of most existing approaches to offline RL [11, 46, 20, 21, 48, 18] is that they are designed to maximize the expected value of the cumulative reward (which we call the return) of the policy. As a consequence, they are unable to quantify risk and ensure that the learned policy acts in a safe way. In the online setting, there has been recent work on distributional RL algorithms [7, 6, 27, 38, 17], which instead learn the full distribution over future returns. They can use this distribution to plan in a way that avoids taking risky, unsafe actions. Furthermore, when coupled with deep neural network function approximation, they can learn better state representations due to the richer distributional learning signal [4, 26], enabling them to outperform traditional RL algorithms even on the risk-neutral, expected-return objective [4, 7, 6, 47, 14].

We propose Conservative Offline Distributional Actor-Critic (CODAC), which adapts distributional RL to the offline setting. A key challenge in offline RL is accounting for high uncertainty on out-of-distribution (OOD) state-action pairs for which observational data is limited [23, 20]; the value estimates for these state-action pairs are intrinsically high variance, and may be exploited by the policy without correction due to the lack of online data gathering and feedback. We build on conservative Q-learning [21], which penalizes Q-values for OOD state-action pairs to ensure that the learned Q-function lower bounds the true Q-function.
Analogously, CODAC uses a penalty to ensure that the quantiles of the learned return distribution lower bound those of the true return distribution. Crucially, the lower bound is data-driven and selectively penalizes the quantile estimates of state-action pairs that are less frequent in the offline dataset; see Figure 1.

Figure 1: CODAC obtains conservative estimates of the true return quantiles (black); it penalizes out-of-distribution actions, $\mu(a \mid s)$, more heavily than in-distribution actions, $\hat\beta(a \mid s)$.

We prove that for finite MDPs, CODAC converges to an estimate of the return distribution whose quantiles uniformly lower bound the quantiles of the true return distribution; in addition, this data-driven lower bound is tight up to the approximation error in estimating the quantiles using finite data. Thus, CODAC obtains a uniform lower bound on all integrations of the quantiles, including the standard RL objective of expected return, the risk-sensitive conditional value-at-risk (CVaR) objective [35], as well as many other risk-sensitive objectives. We additionally prove that CODAC expands the gap in quantile estimates between in-distribution and OOD actions, thus avoiding overconfidence when extrapolating to OOD actions [11]. Our theoretical guarantees rely on novel techniques for analyzing the distributional Bellman operator, which is challenging since it acts on the infinite-dimensional function space of return distributions (whereas the traditional Bellman operator acts on a finite-dimensional vector space). We provide several novel results that may be of independent interest; for instance, our techniques can be used to bound the error of the fixed point of the empirical distributional Bellman operator; see Appendix A.6. Finally, to obtain a practical algorithm, CODAC builds on existing distributional RL algorithms by integrating conservative return distribution estimation into a quantile-based actor-critic framework.

In our experiments, we demonstrate the effectiveness of CODAC on both risk-sensitive and risk-neutral RL. First, on two novel risk-sensitive robot navigation tasks, we show that CODAC successfully learns risk-averse policies using offline datasets collected purely from a risk-neutral agent, a challenging task that all our baselines fail to solve. Next, on the D4RL MuJoCo suite [10], a popular offline RL benchmark, we show that CODAC achieves state-of-the-art results on both the original risk-neutral version as well as a modified risk-sensitive version [43]. Finally, we empirically show that CODAC computes quantile lower bounds and gap-expanded quantiles even on high-dimensional continuous-control problems, validating our key theoretical insights into the effectiveness of CODAC.

Related work. There has been growing interest in offline (or batch) RL [22, 23]. The key challenge in offline RL is to avoid overestimating the value of out-of-distribution (OOD) actions rarely taken in the observational dataset [40, 44, 20]. The problem is that policy learning optimizes against the value estimates; thus, even if the estimation error is i.i.d., policy optimization biases towards taking actions with high-variance value estimates, since some of these values will be large by random chance. In risk-sensitive or safety-critical settings, these actions are exactly the ones that should be avoided.
One solution is to constrain the learned policy to take actions similar to the ones in the dataset (similar to imitation learning), e.g., by performing support matching [46] or distributional matching [20, 12]. However, these approaches tend to perform poorly when data is gathered from suboptimal policies. An alternative is to regularize the Q-function estimates to be conservative at OOD actions [21, 48, 18]. CODAC builds on these approaches, but obtains conservative estimates of all quantile values of the return distribution rather than just the expected return. Traditionally, the literature on off-policy evaluation (OPE) [32, 16, 39, 25, 37] aims to estimate the expected return of a policy using precollected offline data; CODAC proposes an OPE procedure amenable to all objectives that can be expressed as integrals of the return quantiles. Consequently, our fine-grained approach not only enables risk-sensitive policy learning, but also improves performance on risk-neutral domains.

In particular, CODAC builds on recent works on distributional RL [4, 7, 6, 47], which parameterize and estimate the entire return distribution instead of just a point estimate of the expected return (i.e., the Q-function) [29, 28]. Distributional RL algorithms have been shown to achieve state-of-the-art performance on Atari and continuous control domains [14, 3]; intuitively, they provide richer training signals that stabilize value network training [4]. Existing distributional RL algorithms parameterize the return distribution in many different ways, including canonical return atoms [4], expectiles [36], moments [31], and quantiles [7, 6, 47]. CODAC builds on the quantile approach due to its suitability for risk-sensitive policy optimization. The quantile representation provides a unified framework for optimizing different objectives of interest [7], such as the risk-neutral expected return and a family of risk-sensitive objectives representable by the quantiles; this family includes, for example, the conditional value-at-risk (CVaR) return [35, 5, 38, 17], the Wang measure [45], and the cumulative probability weighting (CPW) metric [42]. Recent work has provided theoretical guarantees on learning CVaR policies [17]; however, their approach cannot provide bounds on the quantiles of the estimated return distribution, which is significantly more challenging since there is no closed-form expression for the Bellman update on the return quantiles.

Finally, there has been some recent work adapting distributional RL to the offline setting [1, 43]. First, REM [1] builds on QR-DQN [7], an online distributional RL algorithm; however, REM can be applied to regular DQN [29] and does not directly utilize the distributional aspect of QR-DQN. The closest work to ours is ORAAC [43], which uses distributional RL to learn a CVaR policy in the offline setting. ORAAC uses imitation learning to avoid OOD actions and stay close to the data distribution. As discussed above, imitation learning strategies can perform poorly unless the dataset comes from an optimal policy; in our experiments, we find that CODAC significantly outperforms ORAAC. Furthermore, unlike ORAAC, we provide theoretical guarantees on our approach.

2 Background
Offline RL. Consider a Markov Decision Process (MDP) [33] $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s' \mid s, a)$ is the transition distribution, $R(r \mid s, a)$ is the reward distribution, and $\gamma \in (0, 1)$ is the discount factor, and consider a stochastic policy $\pi(a \mid s) : \mathcal{S} \to \Delta(\mathcal{A})$. A rollout using $\pi$ from state $s$ with initial action $a$ is the random sequence $\xi = ((s_0, a_0, r_0), (s_1, a_1, r_1), \ldots)$ such that $s_0 = s$, $a_0 = a$, $a_t \sim \pi(\cdot \mid s_t)$ (for $t > 0$), $r_t \sim R(\cdot \mid s_t, a_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; we denote the distribution over rollouts by $D^\pi(\cdot \mid s, a)$. The Q-function $Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ of $\pi$ is its expected discounted cumulative return $Q^\pi(s, a) = \mathbb{E}_{D^\pi(\cdot \mid s, a)}[\sum_{t=0}^{\infty} \gamma^t r_t]$. Assuming the rewards satisfy $r_t \in [R_{\min}, R_{\max}]$, we have $Q^\pi(s, a) \in [V_{\min}, V_{\max}] := [R_{\min}/(1-\gamma), R_{\max}/(1-\gamma)]$. A standard goal of reinforcement learning (RL), which we call risk-neutral RL, is to learn the optimal policy $\pi^*$ such that $Q^{\pi^*}(s, a) \ge Q^\pi(s, a)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, and all $\pi$.

In offline RL, the learning algorithm only has access to a fixed dataset $\mathcal{D} := \{(s, a, r, s')\}$, where $r \sim R(\cdot \mid s, a)$ and $s' \sim P(\cdot \mid s, a)$. The goal is to learn the optimal policy without any interaction with the environment. Though we do not assume that $\mathcal{D}$ necessarily comes from a single behavior policy, we define the empirical behavior policy to be $\hat\beta(a \mid s) := \frac{\sum_{(s_i, a_i) \in \mathcal{D}} \mathbb{1}(s_i = s, a_i = a)}{\sum_{s_i \in \mathcal{D}} \mathbb{1}(s_i = s)}$. With a slight abuse of notation, we write $(s, a, r, s') \sim \mathcal{D}$ to denote a uniformly random sample from the dataset. Also, in this paper, we broadly refer to actions not drawn from $\hat\beta(\cdot \mid s)$ (i.e., with low probability density) as out-of-distribution (OOD).

Fitted Q-evaluation (FQE) [9, 34] uses Q-learning for offline RL, leveraging the fact that $Q^\pi = \mathcal{T}^\pi Q^\pi$ is the unique fixed point of the Bellman operator $\mathcal{T}^\pi : \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \to \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ defined by
$$\mathcal{T}^\pi Q(s, a) = \mathbb{E}_{R(r \mid s, a)}[r] + \gamma\, \mathbb{E}_{P^\pi(s', a' \mid s, a)}[Q(s', a')], \quad \text{where } P^\pi(s', a' \mid s, a) = P(s' \mid s, a)\, \pi(a' \mid s').$$
In the offline setting, we do not have access to $\mathcal{T}^\pi$; instead, FQE uses an approximation $\hat{\mathcal{T}}^\pi$ obtained by replacing $R$ and $P$ in $\mathcal{T}^\pi$ with estimates $\hat{R}$ and $\hat{P}$ based on $\mathcal{D}$. Then, we can estimate $Q^\pi$ by starting from an arbitrary $\hat{Q}_0$ and iteratively computing
$$\hat{Q}_{k+1} := \arg\min_Q L(Q, \hat{\mathcal{T}}^\pi \hat{Q}_k), \quad \text{where } L(Q, Q') = \mathbb{E}_{\mathcal{D}(s,a)}\big[(Q(s, a) - Q'(s, a))^2\big].$$
Assuming we search over the space of all possible $Q$ (i.e., we do not use function approximation), the minimizer is $\hat{Q}_{k+1} = \hat{\mathcal{T}}^\pi \hat{Q}_k$, so $\hat{Q}_k = (\hat{\mathcal{T}}^\pi)^k \hat{Q}_0$. If $\hat{\mathcal{T}}^\pi = \mathcal{T}^\pi$, then $\lim_{k \to \infty} \hat{Q}_k = Q^\pi$.

Distributional RL. In distributional RL, the goal is to learn the distribution of the discounted cumulative rewards (i.e., returns) [4]. Given a policy $\pi$, we denote its return distribution by the random variable $Z^\pi(s, a) = \sum_{t=0}^\infty \gamma^t r_t$, which is a function of a random rollout $\xi \sim D^\pi(\cdot \mid s, a)$; note that $Z^\pi$ includes three sources of randomness: (1) the reward $R(\cdot \mid s, a)$, (2) the transition $P(\cdot \mid s, a)$, and (3) the policy $\pi(\cdot \mid s)$. Also, note that $Q^\pi(s, a) = \mathbb{E}_{D^\pi(\cdot \mid s, a)}[Z^\pi(s, a)]$. Analogous to the Q-function Bellman operator, the distributional Bellman operator for $\pi$ is
$$\mathcal{T}^\pi Z(s, a) \overset{D}{=} r + \gamma Z(s', a'), \quad \text{where } r \sim R(\cdot \mid s, a),\ s' \sim P(\cdot \mid s, a),\ a' \sim \pi(\cdot \mid s'), \quad (1)$$
and $\overset{D}{=}$ indicates equality in distribution. As with $Q^\pi$, $Z^\pi$ is the unique fixed point of $\mathcal{T}^\pi$ in Eq. (1). Next, let $F_{Z(s,a)}(x) : [V_{\min}, V_{\max}] \to [0, 1]$ be the cumulative distribution function (CDF) of the return distribution $Z(s, a)$, and let $F_{R(s,a)}$ be the CDF of $R(\cdot \mid s, a)$. Then, we have the following equality, which captures how the distributional Bellman operator $\mathcal{T}^\pi$ operates on the CDF $F_{Z(s,a)}$ [17]:
$$F_{\mathcal{T}^\pi Z(s,a)}(x) = \sum_{s', a'} P^\pi(s', a' \mid s, a) \int F_{Z(s',a')}\!\left(\frac{x - r}{\gamma}\right) dF_{R(s,a)}(r). \quad (2)$$
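As a concrete illustration of Eq. (1), the following minimal sketch (our own toy construction, not the paper's codebase; the 2-state MDP, its rewards, and the fixed policy are all hypothetical) approximates $Z^\pi(s, a)$ by sampled rollouts and checks numerically that $Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)]$, while the sampled return distribution also exposes tail information that the scalar Q-function cannot.

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
# P[s, a] is a distribution over next states; R[s, a] is a deterministic reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[1.0, 0.0], [0.5, 2.0]])      # R[s, a]
pi = np.array([[0.7, 0.3], [0.4, 0.6]])     # pi[s, a], a fixed stochastic policy
gamma, horizon, n_samples = 0.9, 100, 2000
rng = np.random.default_rng(0)

def sample_return(s, a):
    """One sample of Z^pi(s, a): roll out pi and accumulate discounted rewards."""
    g, disc = 0.0, 1.0
    for _ in range(horizon):
        g += disc * R[s, a]
        disc *= gamma
        s = rng.choice(2, p=P[s, a])
        a = rng.choice(2, p=pi[s])
    return g

# Empirical return distribution for (s=0, a=0): samples of Z^pi(0, 0).
z = np.array([sample_return(0, 0) for _ in range(n_samples)])

# Exact Q^pi by solving the linear Bellman equations, for comparison.
P_pi = np.einsum('sax,xb->saxb', P, pi).reshape(4, 4)        # P^pi[(s,a), (s',a')]
Q = np.linalg.solve(np.eye(4) - gamma * P_pi, R.reshape(4))  # Q^pi = (I - gamma P^pi)^{-1} r

print("E[Z^pi(0,0)] ~", z.mean(), " vs  Q^pi(0,0) =", Q[0])
print("10%-quantile of Z^pi(0,0):", np.quantile(z, 0.1))  # tail info Q alone cannot give
```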
Let $X$ and $Y$ be two random variables. Then, the quantile function (i.e., inverse CDF) $F_X^{-1}$ of $X$ is $F_X^{-1}(\tau) := \inf\{x \in \mathbb{R} \mid \tau \le F_X(x)\}$, and the $p$-Wasserstein distance between $X$ and $Y$ is
$$W_p(X, Y) = \left(\int_0^1 \big|F_X^{-1}(\tau) - F_Y^{-1}(\tau)\big|^p\, d\tau\right)^{1/p}.$$
The distributional Bellman operator $\mathcal{T}^\pi$ is a $\gamma$-contraction in $W_p$ [4]: letting $d_p(Z_1, Z_2) := \sup_{s,a} W_p(Z_1(s, a), Z_2(s, a))$ be the largest Wasserstein distance over $(s, a)$, and $\mathcal{Z} = \{Z : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathbb{R}) \mid \forall (s, a).\ \mathbb{E}[|Z(s, a)|^p] < \infty\}$ be the space of return distributions with bounded $p$-th moment, we have
$$d_p(\mathcal{T}^\pi Z_1, \mathcal{T}^\pi Z_2) \le \gamma\, d_p(Z_1, Z_2) \quad (\forall Z_1, Z_2 \in \mathcal{Z}). \quad (3)$$
As a result, $Z^\pi$ may be obtained by iteratively applying $\mathcal{T}^\pi$ to an initial distribution $Z$. As before, in the offline setting, we can approximate $\mathcal{T}^\pi$ by $\hat{\mathcal{T}}^\pi$ using $\mathcal{D}$. Then, we can compute $Z^\pi$ (represented by $F^{-1}_{Z(s,a)}$; see below) by starting from an arbitrary $\hat{Z}_0$ and iteratively computing
$$\hat{Z}_{k+1} = \arg\min_Z L_p(Z, \hat{\mathcal{T}}^\pi \hat{Z}_k), \quad \text{where } L_p(Z, Z') = \mathbb{E}_{\mathcal{D}(s,a)}\big[W_p(Z(s, a), Z'(s, a))^p\big]. \quad (4)$$
We call this procedure fitted distributional evaluation (FDE).

One distributional RL algorithmic framework is quantile-based distributional RL [7, 6, 47, 27, 38, 43], where the return distribution $Z$ is represented by its quantile function $F^{-1}_{Z(s,a)}(\tau) : [0, 1] \to \mathbb{R}$. Given a distribution $g(\tau)$ over $[0, 1]$, the distorted expectation of $Z$ is $\Phi_g(Z(s, a)) = \int_0^1 F^{-1}_{Z(s,a)}(\tau)\, g(\tau)\, d\tau$, and the corresponding policy is $\pi_g(s) := \arg\max_a \Phi_g(Z(s, a))$ [7]. If $g = \mathrm{Uniform}([0, 1])$, then $Q^\pi(s, a) = \Phi_g(Z^\pi(s, a))$; alternatively, $g = \mathrm{Uniform}([0, \eta])$ for a level $\eta \in (0, 1]$ corresponds to the CVaR$_\eta$ objective [35, 5, 6], where only the bottom $\eta$-fraction of returns is considered. Additional risk-sensitive objectives are also compatible. For example, CPW [42] amounts to $g(\tau) = \tau^\beta / \big(\tau^\beta + (1 - \tau)^\beta\big)^{1/\beta}$, and Wang [45] uses $g(\tau) = F_N(F_N^{-1}(\tau) + \beta)$, where $F_N$ is the standard Gaussian CDF.

A drawback of FDE is that it does not account for estimation error, especially for pairs $(s, a)$ that rarely appear in the given dataset $\mathcal{D}$; thus, $\hat{Z}_k(s, a)$ may be an overestimate of $Z_k(s, a)$ [12, 20, 21], even in distributional RL (since the learned distribution does not include randomness in the dataset) [14, 3]. Importantly, since we act by optimizing with respect to $\hat{Z}_k(s, a)$, the optimization algorithm will exploit these errors, biasing towards actions with higher uncertainty, which is the opposite of what is desired. In Section 3, we propose and analyze a penalty designed to avoid this issue.

3 Conservative offline distributional policy evaluation

We describe our algorithm for computing a conservative estimate of $Z^\pi(s, a)$, and provide theoretical guarantees for finite MDPs. In particular, we modify Eq. (4) to include a penalty term:
$$Z_{k+1} = \arg\min_Z\ \alpha\, \mathbb{E}_{U(\tau), \mathcal{D}(s,a)}\big[c_0(s, a)\, F^{-1}_{Z(s,a)}(\tau)\big] + L_p(Z, \hat{\mathcal{T}}^\pi Z_k) \quad (5)$$
for some state-action dependent scale factor $c_0$; here, $U = \mathrm{Uniform}([0, 1])$. This objective adapts the conservative penalty in prior work [21] to the distributional RL setting; in particular, the first term in the objective is a penalty that aims to shrink the quantile values for out-of-distribution (OOD) actions compared to those of in-distribution actions; intuitively, $c_0(s, a)$ should be larger for OOD actions. For now, we let $c_0$ be arbitrary; we describe our choice in Section 4. $\alpha \in \mathbb{R}_{>0}$ is a hyperparameter controlling the magnitude of the penalty term relative to the usual FDE objective. We call this iterative algorithm conservative distribution evaluation (CDE).
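For concreteness, the distorted-expectation objectives $\Phi_g$ introduced above (risk-neutral, CVaR, Wang) can be evaluated directly from a quantile representation. The sketch below is our own illustration, not the paper's released code: the quantile values are hypothetical, and it uses the common implementation trick (as in IQN-style methods) of applying a distortion to sampled quantile fractions, which is equivalent to weighting the quantile integral by the distortion's density.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical quantile function F^{-1}_{Z(s,a)} represented by interpolation over a
# fixed grid of quantile fractions (values are illustrative, not from the paper).
taus = np.linspace(0.01, 0.99, 99)
qvals = np.sort(np.random.default_rng(1).normal(10.0, 3.0, size=taus.size))
F_inv = lambda t: np.interp(t, taus, qvals)          # F^{-1}_{Z(s,a)}(tau)

def distorted_expectation(F_inv, distortion, n=10000, rng=np.random.default_rng(0)):
    """Phi_g(Z) approximated as E_{tau ~ U[0,1]}[F^{-1}_Z(distortion(tau))]."""
    tau = rng.uniform(0.0, 1.0, size=n)
    return F_inv(distortion(tau)).mean()

expected_return = distorted_expectation(F_inv, lambda t: t)        # risk-neutral
cvar_01 = distorted_expectation(F_inv, lambda t: 0.1 * t)          # CVaR_0.1: tau ~ U([0, 0.1])
beta = -0.75                                                        # beta < 0 => risk-averse Wang
wang = distorted_expectation(F_inv, lambda t: norm.cdf(norm.ppf(t) + beta))

print(f"E[Z] ~ {expected_return:.2f}  CVaR_0.1 ~ {cvar_01:.2f}  Wang(-0.75) ~ {wang:.2f}")
```

These are exactly the objectives that the conservative estimates produced by CDE will be shown to lower bound.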
Next, we analyze the theoretical properties of CDE in the setting of finite MDPs; Figure 2 gives an overview of these results.

Figure 2: Overview of our theoretical results.

First, we prove that CDE iteratively obtains conservative quantile estimates (Lemma 3.4) and defines a contraction operator on return distributions (Lemma 3.5). Then, our main result (Theorem 3.6) is that CDE converges to a fixed point $\tilde{Z}^\pi$ whose quantile function lower bounds that of the true return distribution $Z^\pi$. We also prove that CDE is gap-expanding (Theorem 3.8), i.e., it is more conservative for actions that are rare in $\mathcal{D}$. These results translate to RL objectives computed by integrating the return quantiles, including expected and CVaR returns (Corollaries 3.7 & 3.9).

We begin by describing our assumptions on the MDP and dataset. First, we assume that our dataset $\mathcal{D}$ has nonzero coverage of all actions for states in the dataset [23, 21].

Assumption 3.1. For all $s \in \mathcal{D}$ and $a \in \mathcal{A}$, we have $\hat\beta(a \mid s) > 0$.

This assumption is only needed by our theoretical analysis to avoid division by zero and to ensure that all estimates are well-defined; alternatively, we could assign some very small value to $\hat\beta(a \mid s)$ for all actions not visited at state $s$ in the offline dataset and renormalize $\hat\beta(a \mid s)$ accordingly. Next, we impose regularity conditions on the fixed point $Z^\pi$ of the distributional Bellman operator $\mathcal{T}^\pi$.

Assumption 3.2. For all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, $F_{Z^\pi(s,a)}$ is smooth. Furthermore, there exists $\zeta \in \mathbb{R}_{>0}$ such that for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, $F_{Z^\pi(s,a)}$ is $\zeta$-strongly monotone, i.e., we have $F'_{Z^\pi(s,a)}(x) \ge \zeta$.

The smoothness assumption ensures that the $p$-th moments of $Z^\pi(s, a)$ are bounded (since $Z^\pi(s, a) \in [V_{\min}, V_{\max}]$ is also bounded), which in turn ensures that $Z^\pi \in \mathcal{Z}$. The monotonicity assumption is needed to ensure convergence of $F^{-1}_{Z(s,a)}$. Next, we assume that the search space in Eq. (5) includes all possible functions (i.e., no function approximation).

Assumption 3.3. The search space of the minimum over $Z$ in Eq. (5) is over all smooth functions $F_{Z(s,a)}$ (for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$) with support on $[V_{\min}, V_{\max}]$.

This assumption is required for us to analytically characterize the solution $Z_{k+1}$ of the CDE objective. Finally, we also assume $p > 1$ (i.e., we use the $p$-Wasserstein distance for some $p > 1$).

Now, we describe our key results. Our first result characterizes the CDE iterates $Z_{k+1}$; importantly, if $c_0(s, a) > 0$, then these iterates encode successively more conservative quantile estimates.

Lemma 3.4. For all $s \in \mathcal{D}$, $a \in \mathcal{A}$, $k \in \mathbb{N}$, and $\tau \in [0, 1]$, we have
$$F^{-1}_{Z_{k+1}(s,a)}(\tau) = F^{-1}_{\hat{\mathcal{T}}^\pi Z_k(s,a)}(\tau) - c(s, a), \quad \text{where } c(s, a) = \big|\alpha p^{-1} c_0(s, a)\big|^{1/(p-1)} \operatorname{sign}(c_0(s, a)).$$

We give a proof in Appendix A.1; roughly speaking, it follows by setting the gradient of Eq. (5) equal to zero, relying on results from the calculus of variations to handle the fact that $F^{-1}_{Z(s,a)}$ is a function. Next, we define the CDE operator $\tilde{\mathcal{T}}^\pi = O_c \circ \hat{\mathcal{T}}^\pi$ to be the composition of $\hat{\mathcal{T}}^\pi$ with the shift operator $O_c : \mathcal{Z} \to \mathcal{Z}$ defined by $F^{-1}_{O_c Z(s,a)}(\tau) = F^{-1}_{Z(s,a)}(\tau) - c(s, a)$; thus, Lemma 3.4 says $Z_{k+1} = \tilde{\mathcal{T}}^\pi Z_k$. Now, we show that $\tilde{\mathcal{T}}^\pi$ is a contraction in the maximum Wasserstein distance $d_p$.

Lemma 3.5. $\tilde{\mathcal{T}}^\pi$ is a $\gamma$-contraction in $d_p$, so $Z_k$ converges to a unique fixed point $\tilde{Z}^\pi$.

The first part follows since $\hat{\mathcal{T}}^\pi$ is a $\gamma$-contraction in $d_p$ [4, 7] and $O_c$ is a non-expansion in $d_p$, so by composition, $\tilde{\mathcal{T}}^\pi$ is a $\gamma$-contraction in $d_p$; the second part follows by the Banach fixed point theorem.
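These lemmas can be checked numerically. The following sketch is our own toy construction (a single-state MDP with a hypothetical Gaussian reward and a particle representation of the return distribution), not the paper's code: it iterates the CDE operator $\tilde{\mathcal{T}}^\pi = O_c \circ \hat{\mathcal{T}}^\pi$ and observes that the iterates converge (Lemma 3.5) to quantiles shifted down relative to the unpenalized fixed point; because $c$ is constant in this toy, the shift is roughly $c/(1-\gamma)$ at every quantile level.

```python
import numpy as np

# One-state MDP with a single self-loop action and stochastic reward r ~ N(1, 0.5);
# the return distribution is represented by M "particles" (equivalently, M quantiles).
rng = np.random.default_rng(0)
gamma, M, c = 0.9, 1000, 0.2          # c = c(s, a): the conservative shift of Lemma 3.4

def backup(z, c, n_iter=200):
    """Iterate Z <- O_c T^pi Z on a particle representation of the return distribution."""
    for _ in range(n_iter):
        r = rng.normal(1.0, 0.5, size=M)              # sample rewards
        z_next = rng.choice(z, size=M, replace=True)  # sample Z(s', a') (same state here)
        z = np.sort(r + gamma * z_next) - c           # distributional backup, then shift by c
    return z

z0 = np.zeros(M)
z_plain = backup(z0, c=0.0)   # fixed point of T^pi (no penalty)
z_cde = backup(z0, c=c)       # fixed point of the CDE operator O_c T^pi

for t in [0.1, 0.5, 0.9]:
    gap = np.quantile(z_plain, t) - np.quantile(z_cde, t)
    print(f"tau={t}: quantile gap ~ {gap:.2f} (predicted c/(1-gamma) = {c/(1-gamma):.2f})")
```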
Now, our first main theorem says that the fixed point $\tilde{Z}^\pi$ of $\tilde{\mathcal{T}}^\pi$ is a conservative estimate of $Z^\pi$ at all quantiles, i.e., CDE computes quantile estimates that lower bound the quantiles of the true return; furthermore, it says that this lower bound is tight.

Theorem 3.6. For any $\delta \in \mathbb{R}_{>0}$ and $c_0(s, a) > 0$, with probability at least $1 - \delta$, we have
$$F^{-1}_{Z^\pi(s,a)}(\tau) \ge F^{-1}_{\tilde{Z}^\pi(s,a)}(\tau) + (1 - \gamma)^{-1} \min_{s', a'}\big\{c(s', a') - \epsilon(s', a')\big\},$$
$$F^{-1}_{Z^\pi(s,a)}(\tau) \le F^{-1}_{\tilde{Z}^\pi(s,a)}(\tau) + (1 - \gamma)^{-1} \max_{s', a'}\big\{c(s', a') + \epsilon(s', a')\big\}$$
for all $s \in \mathcal{D}$, $a \in \mathcal{A}$, and $\tau \in [0, 1]$, where $\epsilon(s, a) = \zeta^{-1} \sqrt{\frac{5|\mathcal{S}|}{n(s,a)} \log \frac{4|\mathcal{S}||\mathcal{A}|}{\delta}}$ (here $n(s, a)$ is the number of occurrences of $(s, a)$ in $\mathcal{D}$). Furthermore, for $\alpha$ sufficiently large (i.e., $\alpha \ge \max_{s,a}\{p\, \epsilon(s, a)^{p-1} / c_0(s, a)\}$), we have $F^{-1}_{\tilde{Z}^\pi(s,a)}(\tau) \le F^{-1}_{Z^\pi(s,a)}(\tau)$.

We give a proof in Appendix A.2. The first inequality says that the quantile estimates computed by CDE form a lower bound on the true quantiles; this bound is non-vacuous as long as $\alpha$ satisfies the given condition. Furthermore, the second inequality states that this lower bound is tight. Many RL objectives (e.g., the expected or CVaR return) are distorted expectations (i.e., integrals of the return quantiles). We can extend Theorem 3.6 to obtain conservative estimates for all such objectives:

Corollary 3.7. For any $\delta \in \mathbb{R}_{>0}$, $c_0(s, a) > 0$, $\alpha$ sufficiently large, and any $g(\tau)$, with probability at least $1 - \delta$, for all $s \in \mathcal{D}$ and $a \in \mathcal{A}$, we have $\Phi_g(\tilde{Z}^\pi(s, a)) \le \Phi_g(Z^\pi(s, a))$.

Choosing $g = \mathrm{Uniform}([0, 1])$ gives $\tilde{Q}^\pi(s, a) \le Q^\pi(s, a)$, i.e., a lower bound on the Q-function. CQL [21] obtains a similar lower bound; thus, CDE generalizes CQL to other objectives, e.g., it can be used in conjunction with any distorted expectation objective (e.g., CVaR, Wang, CPW, etc.) for risk-sensitive offline RL.

Note that Theorem 3.6 does not preclude the possibility that the lower bounds are more conservative for good actions (i.e., ones for which $\hat\beta(a \mid s)$ is larger). We prove that under the choice¹
$$c_0(s, a) = \frac{\mu(a \mid s) - \hat\beta(a \mid s)}{\hat\beta(a \mid s)} \quad (6)$$
for some $\mu(a \mid s) \ne \hat\beta(a \mid s)$, $\tilde{\mathcal{T}}^\pi$ is gap-expanding, i.e., the difference in quantile values between in-distribution and out-of-distribution actions is larger under $\tilde{\mathcal{T}}^\pi$ than under $\mathcal{T}^\pi$. Intuitively, $c_0(s, a)$ is large for actions $a$ with higher probability under $\mu$ than under $\hat\beta$ (i.e., OOD actions).

Theorem 3.8. For $p = 2$, $\alpha$ sufficiently large, and $c_0$ as in Eq. (6), for all $s \in \mathcal{S}$ and $\tau \in [0, 1]$,
$$\mathbb{E}_{\hat\beta(a \mid s)}\big[F^{-1}_{\tilde{Z}^\pi(s,a)}(\tau)\big] - \mathbb{E}_{\mu(a \mid s)}\big[F^{-1}_{\tilde{Z}^\pi(s,a)}(\tau)\big] \ge \mathbb{E}_{\hat\beta(a \mid s)}\big[F^{-1}_{Z^\pi(s,a)}(\tau)\big] - \mathbb{E}_{\mu(a \mid s)}\big[F^{-1}_{Z^\pi(s,a)}(\tau)\big].$$

As before, the gap-expansion property implies gap-expansion of integrals of the quantiles, i.e.:

Corollary 3.9. For $p = 2$, $\alpha$ sufficiently large, $c_0$ as in Eq. (6), and any $g(\tau)$, for all $s \in \mathcal{S}$,
$$\mathbb{E}_{\hat\beta(a \mid s)}\big[\Phi_g(\tilde{Z}^\pi(s, a))\big] - \mathbb{E}_{\mu(a \mid s)}\big[\Phi_g(\tilde{Z}^\pi(s, a))\big] \ge \mathbb{E}_{\hat\beta(a \mid s)}\big[\Phi_g(Z^\pi(s, a))\big] - \mathbb{E}_{\mu(a \mid s)}\big[\Phi_g(Z^\pi(s, a))\big].$$

Together, Corollaries 3.7 & 3.9 say that CDE provides conservative lower bounds on the return quantiles while being less conservative for in-distribution actions. Finally, we briefly discuss the condition on $\alpha$ in Theorems 3.6 & 3.8. In general, $\alpha$ can be taken to be small as long as $\epsilon(s, a)$ is small for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, which in turn holds as long as $n(s, a)$ is large, i.e., the dataset $\mathcal{D}$ has wide coverage.

¹ We may have $c_0(s, a) \le 0$; we can use $c_0'(s, a) = c_0(s, a) + (1 - \min_{s,a} c_0(s, a))$ to avoid this issue.
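For intuition, the sketch below (our own illustration; the action probabilities are hypothetical) evaluates the penalty weight $c_0(s, a)$ of Eq. (6) at a single state with four actions, together with the shifted weights $c_0'$ from footnote 1: actions that $\mu$ prefers but $\hat\beta$ rarely takes receive the largest penalty.

```python
import numpy as np

# Hypothetical action probabilities at one state s with four actions (illustration only).
beta_hat = np.array([0.60, 0.30, 0.08, 0.02])   # empirical behavior policy beta_hat(a|s)
mu = np.array([0.10, 0.10, 0.30, 0.50])         # policy mu(a|s) proposing OOD actions

# Eq. (6): large where mu puts mass that beta_hat does not (OOD actions).
c0 = (mu - beta_hat) / beta_hat

# Footnote 1: shift so that every weight is strictly positive,
# preserving which actions are penalized more than others.
c0_prime = c0 + (1.0 - c0.min())

for a in range(4):
    print(f"a={a}: beta_hat={beta_hat[a]:.2f} mu={mu[a]:.2f} "
          f"c0={c0[a]:+.2f} c0'={c0_prime[a]:.2f}")
# Actions 2 and 3 (rare under beta_hat, favored by mu) get the largest penalty weights.
```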
4 Conservative offline distributional actor critic

Next, we incorporate the distributional evaluation algorithm from Section 3 into an actor-critic framework. Following [21], we propose a min-max objective where the inner loop chooses the current policy $\mu$ to maximize the CDE objective, and the outer loop minimizes the CDE objective for this policy:
$$\hat{Z}_{k+1} = \arg\min_Z \max_\mu\ \alpha \left(\mathbb{E}_{U(\tau), \mathcal{D}(s), \mu(a \mid s)}\big[F^{-1}_{Z(s,a)}(\tau)\big] - \mathbb{E}_{U(\tau), \mathcal{D}(s,a)}\big[F^{-1}_{Z(s,a)}(\tau)\big]\right) + L_p(Z, \hat{\mathcal{T}}^{\pi_k} \hat{Z}_k),$$
where we have used $c_0$ as in Eq. (6). We can interpret $\mu$ as an actor policy, the first term as the objective for $\mu$, and the overall objective as an actor-critic algorithm [8]. In this framework, a natural choice for $\mu$ is a maximum entropy policy $\mu(a \mid s) \propto \exp(Q(s, a))$ [49]. Then, our objective becomes
$$\hat{Z}_{k+1} = \arg\min_Z\ \alpha\, \mathbb{E}_{U(\tau), \mathcal{D}(s)}\Big[\log \sum_a \exp\big(F^{-1}_{Z(s,a)}(\tau)\big)\Big] - \alpha\, \mathbb{E}_{U(\tau), \mathcal{D}(s,a)}\big[F^{-1}_{Z(s,a)}(\tau)\big] + L_p(Z, \hat{\mathcal{T}}^{\pi_k} \hat{Z}_k),$$
where $U = \mathrm{Uniform}([0, 1])$; we provide a derivation in Appendix B. We call this strategy conservative offline distributional actor critic (CODAC).

To optimize over $Z$, we represent the quantile function as a DNN $G_\theta(\tau; s, a) \approx F^{-1}_{Z(s,a)}(\tau)$. The main challenge is optimizing the term $L_p(Z, \hat{\mathcal{T}}^\pi \hat{Z}_k) = W_p(Z, \hat{\mathcal{T}}^\pi \hat{Z}_k)^p$. We do so using distributional temporal differences (TD) [6]. For a sample $(s, a, r, s') \sim \mathcal{D}$, $a' \sim \pi(\cdot \mid s')$, and random quantile fractions $\tau, \tau' \sim U$, we have $L_p(Z, \hat{\mathcal{T}}^\pi \hat{Z}_k) \approx L_\kappa(\delta; \tau)$, where $\delta = r + \gamma G_{\theta'}(\tau'; s', a') - G_\theta(\tau; s, a)$. Here, $\delta$ is the distributional TD error, $\theta'$ are the parameters of the target network [30], and
$$L_\kappa(\delta; \tau) = \begin{cases} |\tau - \mathbb{1}(\delta < 0)|\, \dfrac{\delta^2}{2\kappa} & \text{if } |\delta| \le \kappa \\[4pt] |\tau - \mathbb{1}(\delta < 0)|\, \big(|\delta| - \kappa/2\big) & \text{otherwise} \end{cases} \quad (7)$$
is the $\kappa$-Huber quantile regression loss at threshold $\kappa$ [15]; then, $\mathbb{E}_{U(\tau)}[L_\kappa(\delta; \tau)]$ is an unbiased estimator of the Wasserstein distance that can be optimized using stochastic gradient descent (SGD) [19]. With this strategy, our overall objective can be optimized using any off-policy actor-critic method [24, 13, 11]; we use distributional soft actor-critic (DSAC) [27], which replaces the Q-network in SAC [13] with a quantile distributional critic network [7]. We provide the full CODAC pseudocode in Algorithm 1 of Appendix B.
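As a simplified illustration of the critic update (our own sketch in plain NumPy with our own helper names and shapes, not the released implementation), the snippet below combines the $\kappa$-Huber quantile regression loss of Eq. (7) with the logsumexp conservative penalty from the objective above; a real training loop would compute the same quantities with an automatic-differentiation framework and backpropagate through them.

```python
import numpy as np

def quantile_huber_loss(delta, tau, kappa=1.0):
    """Eq. (7): the kappa-Huber quantile regression loss L_kappa(delta; tau)."""
    weight = np.abs(tau - (delta < 0.0).astype(float))          # |tau - 1{delta < 0}|
    huber = np.where(np.abs(delta) <= kappa,
                     delta ** 2 / (2.0 * kappa),
                     np.abs(delta) - kappa / 2.0)
    return weight * huber

def codac_critic_loss(q_sa, tau, target, q_s_all, alpha=1.0, kappa=1.0):
    """
    Simplified forward pass of the CODAC critic objective for one minibatch
    (names and shapes are ours, for illustration):
      q_sa    [B]    : G_theta(tau_i; s_i, a_i) at dataset state-action pairs
      tau     [B]    : sampled quantile fractions tau_i ~ U([0, 1])
      target  [B]    : r_i + gamma * G_theta'(tau'_i; s'_i, a'_i), with a'_i ~ pi(.|s'_i)
      q_s_all [B, A] : G_theta(tau_i; s_i, a) for a set of candidate actions a
    """
    # Distributional TD term: quantile regression of q_sa onto the Bellman target.
    td = quantile_huber_loss(target - q_sa, tau, kappa).mean()
    # Conservative term (maximum-entropy mu): the logsumexp over candidate actions pushes
    # down quantiles of actions the soft actor prefers, while -q_sa pushes up the
    # quantiles of in-distribution (dataset) actions.
    penalty = (np.log(np.exp(q_s_all).sum(axis=1)) - q_sa).mean()
    return td + alpha * penalty

# Tiny usage example with random numbers (batch of 4, 5 candidate actions).
rng = np.random.default_rng(0)
B, A = 4, 5
loss = codac_critic_loss(rng.normal(size=B), rng.uniform(size=B),
                         rng.normal(size=B), rng.normal(size=(B, A)))
print("critic loss:", loss)
```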
5 Experiments

We show that CODAC achieves state-of-the-art results on both risk-sensitive (Sections 5.1 & 5.2) and risk-neutral (Section 5.3) offline RL tasks, including our risky robot navigation tasks and D4RL² [10]. We also show that our lower bound (Theorem 3.6) and gap-expansion (Theorem 3.8) results approximately hold in practice (Section 5.4), validating our theory of CODAC's effectiveness. We provide additional details (e.g., environment descriptions, hyperparameters, and additional results) in Appendix C.

² The D4RL dataset license is Apache License 2.0.

5.1 Risky robot navigation

Tasks. We consider an Ant robot whose goal is to navigate from a random initial state to the green circle as quickly as possible (see Figure 4 for a visualization). Passing through the red circle triggers a high cost with small probability, introducing risk. A risk-neutral agent may pass through the red region, but a risk-aware agent should not. We also consider a Point Mass variant for illustrative purposes.

Figure 4: Risky Ant.

We construct an offline dataset that is the replay buffer of a risk-neutral distributional SAC [27] agent. Intuitively, this choice matches the practical goals of offline RL, where data is gathered from a diverse range of sources with no assumptions on their quality or risk tolerance [23]. See Appendix C.1 for details on the environments and datasets.

Approaches. We consider two variants of CODAC: (i) CODAC-N, which maximizes the expected return, and (ii) CODAC-C, which optimizes the CVaR$_{0.1}$ objective. We compare to Offline Risk-Averse Actor Critic (ORAAC) [43], a state-of-the-art offline risk-averse RL algorithm that combines a distributional critic with an imitation-learning based policy to optimize a risk-sensitive objective, and to Conservative Q-Learning (CQL) [21], a state-of-the-art offline RL algorithm that is non-distributional.

Results. We evaluate each approach using 100 test episodes, reporting the mean, median, and CVaR$_{0.1}$ (i.e., average over the bottom 10 episodes) returns, as well as the total number of violations (i.e., time steps spent inside the risky region), all averaged over 5 random seeds. We also report the performance of the online DSAC agent used to collect the data. See Appendix C for details. Results are in Table 1.

Table 1: Risky robot navigation quantitative evaluation. CODAC-C achieves the best performance on most metrics and is the only method that learns risk-averse behavior. This table is reproduced with standard deviations in Table 6 in Appendix C.

| Algorithm | Risky Point Mass: Mean | Median | CVaR0.1 | Violations | Risky Ant: Mean | Median | CVaR0.1 | Violations |
|---|---|---|---|---|---|---|---|---|
| DSAC (Online) | -7.69 | -3.82 | -49.9 | 94 | -866.1 | -833.3 | -1422.7 | 2247 |
| CODAC-C (Ours) | -6.05 | -4.89 | -14.73 | 0.0 | -456.0 | -433.4 | -686.6 | 347.8 |
| CODAC-N (Ours) | -8.60 | -4.05 | -51.96 | 108.3 | -432.7 | -395.1 | -847.1 | 936.0 |
| ORAAC | -10.67 | -4.55 | -64.12 | 138.7 | -788.1 | -795.3 | -1247.2 | 1196 |
| CQL | -7.51 | -4.18 | -43.44 | 93.4 | -967.8 | -858.5 | -1887.3 | 1854.3 |

Performance of CODAC. CODAC-C consistently outperforms the other approaches on the CVaR$_{0.1}$ return, as well as on the number of violations, demonstrating that CODAC-C is able to avoid risky actions. It is also competitive in terms of mean return due to its high CVaR$_{0.1}$ performance, but performs slightly worse on median return, since it is not designed to optimize this objective. Remarkably, on Risky Point Mass, CODAC-C learns a safe policy that completely avoids the risky region (i.e., zero violations), even though such behavior is absent from the dataset. In Appendix C.1, we also show that CODAC can successfully optimize alternative risk-sensitive objectives such as Wang and CPW.

Comparison to ORAAC. While ORAAC also optimizes the CVaR objective, it uses imitation learning to regularize the learned policy to stay close to the empirical behavior policy. However, the dataset contains many sub-optimal trajectories generated early in training, and is furthermore risk-neutral. Thus, imitating the behavioral policy encourages poor performance. In practice, a key use of offline RL is to leverage large datasets available for training, and such datasets will rarely consist of data from a single, high-quality behavioral policy. Our results demonstrate that CODAC is significantly better suited to learning in these settings than ORAAC.

Comparison to CQL. On Risky Point Mass, CQL learns a risky policy with poor tail performance, indicated by its high median performance but low CVaR$_{0.1}$ performance. Interestingly, its mean performance is also poor; intuitively, the mean is highly sensitive to outliers that may not be present in the training dataset. On Risky Ant, possibly due to the added challenge of high dimensionality, CQL performs poorly on all metrics, failing to reach the goal and to avoid the risky region. As expected, these results show that accounting for risk is necessary in risky environments.

Qualitative analysis. In Figure 3, we show the 100 evaluation rollouts from each policy on Risky Point Mass. As can be seen, CODAC-C extrapolates a safe policy that distances itself from the risky region before proceeding to the goal; in contrast, all other agents traverse the risky region. For Ant, we include plots of the trajectories of trained agents in Appendix C.1, and videos in the supplement.

Figure 3: 2D visualization of evaluation trajectories on the Risky Point Mass environment (panels: CODAC-C, CODAC-N, ORAAC, CQL). The red region is risky, the solid blue circles indicate initial states, and the blue lines are trajectories. CODAC-C is the only algorithm that successfully avoids the risky region.
5.2 Risk-sensitive D4RL

Tasks. Next, we consider stochastic D4RL [43]. The original D4RL benchmark [10] consists of datasets collected by SAC agents of varying performance (Mixed, Medium, and Expert) on the Hopper, Walker2d, and HalfCheetah MuJoCo environments [41]; stochastic D4RL relabels the rewards to represent stochastic robot damage for behaviors such as unnatural gaits or high velocities; see Appendix C.2. The Expert dataset consists of rollouts from a fixed SAC agent trained to convergence; the Medium dataset is constructed the same way, except the agent is trained to achieve only 50% of the expert agent's return. The Mixed dataset is the replay buffer of the Medium agent.

Results. In Table 2 (Left), we report the mean and CVaR$_{0.1}$ returns on test episodes for each approach, averaged over 5 random seeds. We show results on the Expert dataset in Appendix C.2; CODAC still achieves the strongest performance. As can be seen, CODAC-C and CODAC-N outperform both CQL and ORAAC on most datasets. Surprisingly, CODAC-N is quite effective on the CVaR$_{0.1}$ metric despite its risk-neutral objective; a likely explanation is that, for these datasets, mean and CVaR performance are highly correlated. Furthermore, we observe that directly optimizing CVaR may lead to unstable training, potentially because CVaR estimates have higher variance. This instability occurs for both CODAC-C and ORAAC on Walker2d-Medium, where they perform worse than the risk-neutral algorithms. Overall, CODAC-C outperforms CODAC-N in terms of CVaR$_{0.1}$ on about half of the datasets, and often improves mean performance as well. Next, while ORAAC is generally effective on Medium datasets, it performs poorly on Mixed datasets; these results mirror those in Section 5.1. Finally, CQL's performance varies drastically across datasets; we hypothesize that learning the full distribution helps stabilize training in CODAC. In Appendix C.2, we also qualitatively analyze the behavior learned by CODAC compared to the baselines, demonstrating that the better CVaR performance CODAC obtains indeed translates to safer locomotion behaviors.

Table 2: D4RL results. CODAC achieves the best overall performance in both the risk-sensitive (Left) and risk-neutral (Right) variants of the benchmark. These tables are reproduced with standard deviations in Tables 7 & 9 in Appendix C.

Left (risk-sensitive; rows are grouped by environment):

| Algorithm | Medium Mean | Medium CVaR0.1 | Mixed Mean | Mixed CVaR0.1 |
|---|---|---|---|---|
| CQL | 33.2 | -15.0 | 214.1 | 12.0 |
| ORAAC | 361.4 | 91.3 | 307.1 | 118.9 |
| CODAC-N | 338 | -41 | 347.7 | 149.2 |
| CODAC-C | 335 | -27 | 396.4 | 238.5 |
| CQL | 877.9 | 693.0 | 189.2 | -21.4 |
| ORAAC | 1007.1 | 767.6 | 876.3 | 524.9 |
| CODAC-N | 993.7 | 952.5 | 1483.9 | 1457.6 |
| CODAC-C | 1014.0 | 976.4 | 1551.2 | 1449.6 |
| CQL | 1524.3 | 1343.8 | 74.3 | -64.0 |
| ORAAC | 1134.1 | 663.0 | 222.0 | -69.6 |
| CODAC-N | 1537.3 | 1158.8 | 358.7 | 106.4 |
| CODAC-C | 1120.8 | 902.3 | 450.0 | 261.4 |

Right (risk-neutral):

| Dataset | BCQ | MOPO | CQL | ORAAC | CODAC |
|---|---|---|---|---|---|
| halfcheetah-random | 2.2 | 35.4 | 35.4 | 13.5 | 34.6 |
| hopper-random | 10.6 | 11.7 | 10.8 | 9.8 | 11.0 |
| walker2d-random | 4.9 | 13.6 | 7.0 | 3.2 | 18.7 |
| halfcheetah-medium | 40.7 | 42.3 | 44.4 | 41.0 | 46.3 |
| walker2d-medium | 53.1 | 17.8 | 79.2 | 27.3 | 82.0 |
| hopper-medium | 54.5 | 28.0 | 58.0 | 1.48 | 70.8 |
| halfcheetah-mixed | 38.2 | 53.1 | 46.2 | 30.0 | 44.1 |
| hopper-mixed | 33.1 | 67.5 | 48.6 | 16.3 | 100.2 |
| walker2d-mixed | 15.0 | 39.0 | 26.7 | 28 | 33.2 |
| halfcheetah-med-exp | 64.7 | 63.3 | 62.4 | 24.0 | 70.4 |
| walker2d-med-exp | 57.5 | 44.6 | 98.7 | 28.2 | 106.0 |
| hopper-med-exp | 110.9 | 23.7 | 111.0 | 18.2 | 112.0 |

5.3 Risk-neutral D4RL

Task. Next, we show that CODAC is effective even when the goal is to optimize the standard expected return.
To this end, we evaluate CODAC-N on the popular D4RL MuJoCo benchmark [10].

Baselines. We compare to state-of-the-art algorithms benchmarked in [10] and [48], including Batch-Constrained Q-Learning (BCQ), Model-Based Offline Policy Optimization (MOPO) [48], and CQL. We also include ORAAC as an offline distributional RL baseline. We omit less competitive baselines included in [10] from the main text; a full comparison is included in Appendix C.3.

Results. Results for non-distributional approaches are taken directly from [10]; for ORAAC and CODAC, we evaluate using 10 test episodes in the environment, averaged over 5 random seeds. As shown in Table 2 (Right), CODAC achieves strong performance across all 12 datasets, obtaining state-of-the-art results on 5 datasets (walker2d-random, hopper-medium, hopper-mixed, halfcheetah-medium-expert, and walker2d-medium-expert), demonstrating that performance improvements from distributional learning also apply in the offline setting. Note that CODAC's advantage is not solely due to distributional RL: ORAAC also uses distributional RL, but in most cases underperforms the prior state of the art. These results suggest that CODAC's use of a conservative penalty is critical for it to achieve strong performance.

5.4 Analysis of Theoretical Insights

We perform additional experiments to validate that our theoretical insights from Section 3 hold in practice, suggesting that they help explain CODAC's empirical performance.

Lower bound. We show that, in practice, CODAC obtains conservative estimates of the Q and CVaR objectives across different dataset types (i.e., Medium vs. Mixed vs. Medium-Expert). Given an initial state $s_0$, we obtain a Monte Carlo (MC) estimate of Q and CVaR for $(s_0, \pi(s_0))$ based on sampled rollouts from $s_0$, and compare them to the values predicted by the critic. In Table 3, we show results averaged over 10 random $s_0$ and with 100 MC samples for each $s_0$. CODAC obtains conservative estimates for both Q and CVaR; in contrast, ORAAC overestimates these values, especially on Mixed datasets, and CQL only obtains conservative estimates for Q, not CVaR.

Table 3: Monte Carlo estimate vs. critic prediction. The CODAC-predicted expected and CVaR$_{0.1}$ returns are lower bounds on MC estimates of the true values.

Regular:

| Algorithm | Walker2d-Medium MC Return | Q-Estimate | Walker2d-Mixed MC Return | Q-Estimate | Walker2d-Medium-Expert MC Return | Q-Estimate |
|---|---|---|---|---|---|---|
| CODAC | 240.2 | 55.7 | 127.1 | 97.6 | 370. | 39.7 |
| CQL | 247.2 | 53.0 | 124.5 | -45.2 | 369.7 | 116.4 |
| ORAAC | 245.2 | 302.2 | 118.2 | 7.70 × 10^5 | 68.2 | 322.2 |

Stochastic:

| Algorithm | Walker2d-Medium MC CVaR0.1 | Z-Estimate | Walker2d-Mixed MC CVaR0.1 | Z-Estimate | Walker2d-Medium-Expert MC CVaR0.1 | Z-Estimate |
|---|---|---|---|---|---|---|
| CODAC | 185.7 | 204.2 | 85.6 | 59.9 | 265.3 | -127.8 |
| ORAAC | 201.9 | 367.6 | 50.9 | 1.54 × 10^6 | 199.5 | 343.5 |
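The following sketch shows one way such a check can be implemented (our own illustration with made-up rollout returns and critic quantiles, not the paper's evaluation script): the MC side averages sampled returns, while the critic side evaluates the same objectives as distorted expectations of its learned quantiles.

```python
import numpy as np

# Made-up Monte Carlo returns from 100 rollouts starting at a fixed s0 with a = pi(s0),
# and made-up critic quantile predictions F^{-1}_{Z(s0, a)}(tau_i); illustration only.
rng = np.random.default_rng(0)
mc_returns = rng.normal(250.0, 60.0, size=100)
taus = (np.arange(32) + 0.5) / 32
critic_quantiles = np.sort(rng.normal(230.0, 55.0, size=32))

# Monte Carlo estimates of the true values.
mc_q = mc_returns.mean()
mc_cvar = np.sort(mc_returns)[: int(0.1 * len(mc_returns))].mean()   # mean of worst 10%

# Critic predictions as distorted expectations of the learned quantiles.
critic_q = critic_quantiles.mean()                                   # g = U([0, 1])
critic_cvar = critic_quantiles[taus <= 0.1].mean()                   # g = U([0, 0.1])

print(f"MC Q ~ {mc_q:.1f}  vs critic Q-estimate {critic_q:.1f}")
print(f"MC CVaR_0.1 ~ {mc_cvar:.1f}  vs critic Z-estimate {critic_cvar:.1f}")
# A conservative critic should (approximately) satisfy critic <= MC in both rows.
```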
Gap-expansion. Next, we verify that CODAC's quantile estimates expand the gap between in-distribution and out-of-distribution actions. We use the D4RL Medium-Expert datasets, where CODAC uniformly performs well, making them ideal for understanding the source of CODAC's empirical performance. We train "CODAC w.o. Penalty", a non-conservative variant of CODAC (i.e., $\alpha = 0$), and use its actor as $\mu$ and its critic as $F^{-1}_Z$. Next, for each dataset, we randomly sample 1000 state-action pairs and 32 quantiles $\tau$, resulting in 32000 $(s, a, \tau)$ tuples; for each one, we compute the quantile gaps for CODAC and CODAC w.o. Penalty. In Table 4, we show the percentage of tuples for which each CODAC variant has the larger quantile gap, along with their average returns. As can be seen, CODAC has a larger gap for more than 90% of the tuples on all datasets, as well as significantly higher returns. These results show that gap-expansion holds in practice and suggest that it helps CODAC achieve good performance.

Table 4: Gap-expansion: CODAC expands the quantile gap and obtains higher returns than an ablation without the conservative penalty.

| Algorithm | HalfCheetah-Medium-Expert Positive Gap % | Return | Hopper-Medium-Expert Positive Gap % | Return | Walker2d-Medium-Expert Positive Gap % | Return |
|---|---|---|---|---|---|---|
| CODAC | 95.3 | 93.6 | 91.3 | 111.9 | 91.1 | 111.3 |
| CODAC w.o. Penalty | 4.7 | 12.1 | 8.7 | 25.8 | 8.9 | 5.9 |

6 Conclusion

We have introduced Conservative Offline Distributional Actor-Critic (CODAC), a general-purpose offline distributional reinforcement learning algorithm. We have proven that CODAC obtains conservative estimates of the return quantiles, which translate into lower bounds on Q and CVaR values. In our experiments, CODAC outperforms prior approaches on both stochastic, risk-sensitive offline RL benchmarks and traditional, risk-neutral benchmarks. One limitation of our work is that CODAC has hyperparameters that must be tuned (in particular, the penalty magnitude $\alpha$). As in prior work, we choose these hyperparameters by evaluating online rollouts in the environment. Designing better hyperparameter selection strategies for offline RL is an important direction for future work. Finally, we do not foresee any societal impacts or ethical concerns for our work, other than the usual risks around algorithms for improving robotics capabilities.

Acknowledgments and Disclosure of Funding

This work is funded in part by an Amazon Research Award, gift funding from NEC Laboratories America, NSF Award CCF-1910769, and ARO Award W911NF-20-1-0080. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

[1] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pages 104-114. PMLR, 2020.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
[4] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449-458. PMLR, 2017.
[5] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070-6120, 2017.
[6] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pages 1096-1105. PMLR, 2018.
[7] Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[8] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
[9] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503-556, 2005.
[10] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
[11] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587-1596. PMLR, 2018.
[12] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052-2062. PMLR, 2019.
[13] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861-1870. PMLR, 2018.
[14] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[15] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73-101, 1964.
[16] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning. PMLR, 2016.
[17] Ramtin Keramati, Christoph Dann, Alex Tamkin, and Emma Brunskill. Being optimistic to be conservative: Quickly learning a CVaR policy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4436-4443, 2020.
[18] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.
[19] Roger Koenker and Kevin F Hallock. Quantile regression. Journal of Economic Perspectives, 15(4):143-156, 2001.
[20] Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
[21] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1179-1191. Curran Associates, Inc., 2020.
[22] Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45-73. Springer, 2012.
[23] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
[24] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[25] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. arXiv preprint arXiv:1810.12429, 2018.
[26] Clare Lyle, Marc G Bellemare, and Pablo Samuel Castro. A comparative analysis of expected and distributional reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4504-4511, 2019.
[27] Xiaoteng Ma, Qiyuan Zhang, Li Xia, Zhengyuan Zhou, Jun Yang, and Qianchuan Zhao. Distributional soft actor critic for risk-sensitive learning. arXiv preprint arXiv:2004.14547, 2020.
[28] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937. PMLR, 2016.
[29] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[30] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[31] Thanh Tang Nguyen, Sunil Gupta, and Svetha Venkatesh. Distributional reinforcement learning via moment matching, 2020.
[32] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
[33] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
[34] Martin Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317-328. Springer, 2005.
[35] R Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443-1471, 2002.
[36] Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G Bellemare, and Will Dabney. Statistics and samples in distributional reinforcement learning. In International Conference on Machine Learning, pages 5528-5536. PMLR, 2019.
[37] Simon P. Shen, Yecheng Jason Ma, Omer Gottesman, and Finale Doshi-Velez. State relevance for off-policy evaluation, 2021.
[38] Yichuan Charlie Tang, Jian Zhang, and Ruslan Salakhutdinov. Worst cases policy gradients. arXiv preprint arXiv:1911.03618, 2019.
[39] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139-2148. PMLR, 2016.
[40] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School, pages 255-263. Hillsdale, NJ, 1993.
[41] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033. IEEE, 2012.
[42] Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297-323, 1992.
[43] Núria Armengol Urpí, Sebastian Curi, and Andreas Krause. Risk-averse offline reinforcement learning. arXiv preprint arXiv:2102.05371, 2021.
[44] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[45] Shaun S Wang. A class of distortion operators for pricing financial and insurance risks. Journal of Risk and Insurance, pages 15-36, 2000.
[46] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
[47] Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tieyan Liu. Fully parameterized quantile function for distributional reinforcement learning. arXiv preprint arXiv:1911.02140, 2019.
[48] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 14129-14142. Curran Associates, Inc., 2020.
[49] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433-1438. Chicago, IL, USA, 2008.