Offline RL Without Off-Policy Evaluation

David Brandfonbrener, William F. Whitney, Rajesh Ranganath, Joan Bruna
Department of Computer Science, Center for Data Science, New York University
david.brandfonbrener@nyu.edu

Abstract

Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. In this paper we show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark. The one-step baseline achieves this strong performance while being notably simpler and more robust to hyperparameters than previously proposed iterative algorithms. We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in doing off-policy evaluation and magnified by the repeated optimization of policies against those estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and behavior policy.

1 Introduction

An important step towards effective real-world RL is to improve sample efficiency. One avenue towards this goal is offline RL (also known as batch RL), where we attempt to learn a new policy from data collected by some other behavior policy, without interacting with the environment. Recent work in offline RL is well summarized by Levine et al. [2020].

In this paper, we challenge the dominant paradigm in the deep offline RL literature that primarily relies on actor-critic style algorithms which alternate between policy evaluation and policy improvement [Fujimoto et al., 2018a, 2019, Peng et al., 2019, Kumar et al., 2019, 2020, Wang et al., 2020b, Wu et al., 2019, Kostrikov et al., 2021, Jaques et al., 2019, Siegel et al., 2020, Nachum et al., 2019]. All of these algorithms rely heavily on off-policy evaluation to learn the critic. Instead, we find that a simple baseline which only performs one step of policy improvement using the behavior Q function often outperforms the more complicated iterative algorithms. Explicitly, we find that our one-step algorithm beats prior results of iterative algorithms on most of the gym-mujoco [Brockman et al., 2016] and Adroit [Rajeswaran et al., 2017] tasks in the D4RL benchmark suite [Fu et al., 2020].

We then dive deeper to understand why such a simple baseline is effective. First, we examine what goes wrong for the iterative algorithms. When these algorithms struggle, it is often due to poor off-policy evaluation leading to inaccurate Q values. We attribute this to two causes: (1) distribution shift between the behavior policy and the policy to be evaluated, and (2) iterative error exploitation, whereby policy optimization introduces bias and dynamic programming propagates this bias across the state space. We show empirically that both issues exist in the benchmark tasks and that one way to avoid them is to simply avoid off-policy evaluation entirely.

Finally, we recognize that while the one-step algorithm is a strong baseline, it is not always the best choice. In the final section we provide some guidance about when iterative algorithms can perform better than the simple one-step baseline.
Namely, when the dataset is large and the behavior policy has good coverage of the state-action space, off-policy evaluation can succeed and iterative algorithms can be effective. In contrast, if the behavior policy is already fairly good, but as a result does not have full coverage, then one-step algorithms are often preferable.

Figure 1: A cartoon illustration of the difference between one-step and multi-step methods. All algorithms constrain themselves to a neighborhood of safe policies around $\beta$. A one-step approach (left) only uses the on-policy $\hat Q^\beta$, while a multi-step approach (right) repeatedly uses the off-policy $\hat Q^{\pi_i}$.

Our main contributions are:

- A demonstration that a simple baseline of one step of policy improvement outperforms more complicated iterative algorithms on a broad set of offline RL problems.
- An examination of the failure modes of off-policy evaluation in iterative offline RL algorithms.
- A description of when one-step algorithms are likely to outperform iterative approaches.

2 Setting and notation

We consider an offline RL setup as follows. Let $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \rho, P, R, \gamma\}$ be a discounted infinite-horizon MDP. In this work we focus on applications in continuous control, so we will generally assume that both $\mathcal{S}$ and $\mathcal{A}$ are continuous and bounded. We consider the offline setting where, rather than interacting with $\mathcal{M}$, we only have access to a dataset $D_N$ of $N$ tuples $(s_i, a_i, r_i)$ collected by some behavior policy $\beta$ with initial state distribution $\rho$. Let $r(s,a) = \mathbb{E}_{r|s,a}[r]$ be the expected reward. Define the state-action value function for any policy $\pi$ by $Q^\pi(s,a) := \mathbb{E}_{P,\pi \mid s_0 = s,\, a_0 = a}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$. The objective is to maximize the expected return $J$ of the learned policy:

$$J(\pi) := \mathbb{E}_{\rho, P, \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] = \mathbb{E}_{s \sim \rho,\; a \sim \pi \mid s}\left[Q^\pi(s, a)\right]. \qquad (1)$$

Following Fu et al. [2020] and others in this line of work, we allow access to the environment to tune a small (< 10) set of hyperparameters. See Paine et al. [2020] for a discussion of the active area of research on hyperparameter tuning for offline RL. We also discuss this further in Appendix C.

3 Related work

Iterative algorithms. Most prior work on deep offline RL consists of iterative actor-critic algorithms. The primary innovation of each paper is to propose a different mechanism to ensure that the learned policy does not stray too far from the data generated by the behavior policy. Broadly, we group these methods into three camps: policy constraints/regularization, modifications of imitation learning, and Q regularization:

1. The majority of prior work acts directly on the policy. Some authors have proposed explicit constraints on the learned policy to only select actions where $(s, a)$ has sufficient support under the data generating distribution [Fujimoto et al., 2018a, 2019, Laroche et al., 2019]. Another proposal is to regularize the learned policy towards the behavior policy [Wu et al., 2019], usually either with a KL divergence [Jaques et al., 2019] or MMD [Kumar et al., 2019]. This is a straightforward way to stay close to the behavior, with a hyperparameter that determines just how close. All of these algorithms are iterative and rely on off-policy evaluation.

2. Siegel et al. [2020], Wang et al. [2020b], Chen et al. [2020] all use algorithms that filter out datapoints with low Q values and then perform imitation learning. Wang et al. [2018], Peng et al. [2019] use a weighted imitation learning algorithm where the weights are determined by exponentiated Q values.
These algorithms are iterative.

3. Another way to prevent the learned policy from choosing unknown actions is to incorporate some form of regularization that encourages staying near the behavior and being pessimistic about unknown state-action pairs [Wu et al., 2019, Nachum et al., 2019, Kumar et al., 2020, Kostrikov et al., 2021, Gulcehre et al., 2021]. However, properly quantifying uncertainty about unknown states is notoriously difficult when dealing with neural network value functions [Buckman et al., 2020].

One-step algorithms. Some recent work has also noted that optimizing policies based on the behavior value function can perform surprisingly well. As we do, Goo and Niekum [2020] study the continuous control tasks from the D4RL benchmark, but they examine a complicated algorithm involving ensembles, distributional Q functions, and a novel regularization technique. In contrast, we analyze a substantially simpler algorithm and get better performance on the D4RL tasks. We also focus more of our contribution on understanding and explaining this performance. Gulcehre et al. [2021] study the discrete-action setting and find that a one-step algorithm (which they call behavior value estimation) outperforms prior work on Atari games and other discrete-action tasks from the RL Unplugged benchmark [Gulcehre et al., 2020]. They also introduce a novel regularizer for the evaluation step. In contrast, we consider the continuous control setting. This is a substantial difference in setting, since continuous control requires actor-critic algorithms with parametric policies, while in the discrete setting the policy improvement step can be computed exactly from the Q function. Moreover, while Gulcehre et al. [2021] attribute the poor performance of iterative algorithms to overestimation, we define and separate the issues of distribution shift and iterative error exploitation, which can combine to cause overestimation. This separation helps to distinguish the fundamental limits of off-policy evaluation from the specific problems induced by iterative algorithms, and will hopefully be a useful distinction to inspire future work. Finally, a one-step variant is also briefly discussed in Nadjahi et al. [2019], but it is not the focus of that work.

There are also important connections between the one-step algorithm and the literature on conservative policy improvement [Kakade and Langford, 2002, Schulman et al., 2015, Achiam et al., 2017], which we discuss in more detail in Appendix B.

4 Defining the algorithms

In this section we provide a unified algorithmic template for model-free offline RL algorithms as offline approximate modified policy iteration. We show how this template captures our one-step algorithm as well as a multi-step policy iteration algorithm and an iterative actor-critic algorithm. Any choice of policy evaluation and policy improvement operators can then be used to define one-step, multi-step, and iterative algorithms.

4.1 Algorithmic template

We consider a generic offline approximate modified policy iteration (OAMPI) scheme, shown in Algorithm 1 (based on Puterman and Shin [1978] and Scherrer et al. [2012]).

Algorithm 1: OAMPI
Input: number of iterations $K$, dataset $D_N$, estimated behavior $\hat\beta$
Set $\pi_0 = \hat\beta$. Initialize $\hat Q^{\pi_{-1}}$ randomly.
for $k = 1, \dots, K$ do
    Policy evaluation: $\hat Q^{\pi_{k-1}} = \mathcal{E}(\pi_{k-1}, D_N, \hat Q^{\pi_{k-2}})$
    Policy improvement: $\pi_k = \mathcal{I}(\hat Q^{\pi_{k-1}}, \hat\beta, D_N, \pi_{k-1})$
end for

Essentially, the algorithm alternates between two steps; a minimal code sketch of this loop is given below.
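The sketch below is our own illustration (not the authors' released code): it assumes PyTorch-style networks and a replay buffer exposing a `sample` method; `improve`, `make_q_net`, `dataset.sample`, and `pi.sample` are assumed interfaces. The evaluation operator is the fitted Q evaluation with a target network described in Section 4.2, and `improve` stands in for any of the improvement operators of Section 4.3.

```python
# A minimal sketch of the OAMPI template (Algorithm 1). This is our illustration,
# not the authors' released code; `dataset.sample`, `pi.sample`, `improve`, and
# `make_q_net` are assumed interfaces.
import copy
import torch


def fitted_q_evaluation(pi, dataset, q_net, gamma=0.99, steps=100_000,
                        batch_size=256, polyak=0.005, lr=3e-4):
    """Evaluation operator E: TD-style fitted Q evaluation with a target network."""
    q_target = copy.deepcopy(q_net)
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(steps):
        s, a, r, s_next, done = dataset.sample(batch_size)
        with torch.no_grad():
            a_next = pi.sample(s_next)            # actions from the policy being evaluated
            target = r + gamma * (1.0 - done) * q_target(s_next, a_next)
        loss = ((q_net(s, a) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        # Polyak-averaged target network update.
        with torch.no_grad():
            for p, p_t in zip(q_net.parameters(), q_target.parameters()):
                p_t.mul_(1 - polyak).add_(polyak * p)
    return q_net


def oampi(K, dataset, beta_hat, improve, make_q_net):
    """Offline approximate modified policy iteration; K == 1 gives the one-step algorithm."""
    pi = beta_hat              # pi_0 = estimated behavior policy
    q_net = make_q_net()       # Q-hat^{pi_{-1}}: random initialization
    for _ in range(K):
        q_net = fitted_q_evaluation(pi, dataset, q_net)   # warm-started from the previous estimate
        pi = improve(q_net, beta_hat, dataset, pi)        # constrained / regularized improvement step
    return pi
```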
First, there is a policy evaluation step where we estimate the Q function of the current policy $\pi_{k-1}$ by $\hat Q^{\pi_{k-1}}$ using only the dataset $D_N$. Implementations also often use the prior Q estimate $\hat Q^{\pi_{k-2}}$ to warm-start the approximation process. Second, there is a policy improvement step. This step takes in the estimated Q function $\hat Q^{\pi_{k-1}}$, the estimated behavior $\hat\beta$, and the dataset $D_N$ and produces a new policy $\pi_k$. Again, an algorithm may use $\pi_{k-1}$ to warm-start the optimization. Moreover, we expect this improvement step to be regularized or constrained to ensure that $\pi_k$ remains in the support of $\beta$ and $D_N$. Choices for this step are discussed below.

Now we discuss a few ways to instantiate the template.

One-step. The simplest algorithm sets the number of iterations $K = 1$. We learn $\hat\beta$ by maximum likelihood and train the policy evaluation step to estimate $Q^\beta$. Then we use any one of the policy improvement operators discussed below to learn $\pi_1$. Importantly, this algorithm completely avoids off-policy evaluation.

Multi-step. The multi-step algorithm sets $K > 1$. The evaluation operator must now evaluate off-policy, since $D_N$ is collected by $\beta$ but the evaluation steps for $k \ge 2$ require evaluating policies $\pi_{k-1} \neq \beta$. Each iteration is trained to convergence in both the estimation and improvement steps.

Iterative actor-critic. An actor-critic approach looks somewhat like the multi-step algorithm, but does not attempt to train to convergence at each iteration and uses a much larger $K$. Here each iteration consists of one gradient step to update the Q estimate and one gradient step to improve the policy. Since all of the evaluation and improvement operators that we consider are gradient-based, this algorithm can use the same evaluation and improvement operators as the multi-step algorithm. Most algorithms from the literature fall into this category [Fujimoto et al., 2018a, Kumar et al., 2019, 2020, Wu et al., 2019, Wang et al., 2020b, Siegel et al., 2020].

4.2 Policy evaluation operator

Following prior work on continuous state and action problems, we always evaluate by simple fitted Q evaluation [Fujimoto et al., 2018a, Kumar et al., 2019, Siegel et al., 2020, Wang et al., 2020b, Paine et al., 2020, Wang et al., 2021]. In practice this is optimized by TD-style learning with a target network [Mnih et al., 2015], as in DDPG [Lillicrap et al., 2015]. We do not use any double Q learning or Q ensembles [Fujimoto et al., 2018b]. For the one-step and multi-step algorithms we train the evaluation procedure to convergence on each iteration, while for the iterative algorithm each iteration takes a single stochastic gradient step. See Voloshin et al. [2019], Wang et al. [2021] for more comprehensive examinations of policy evaluation and some evidence that this simple fitted Q iteration approach is reasonable. It is an interesting direction for future work to consider other operators that use techniques like importance weighting [Munos et al., 2016] or pessimism [Kumar et al., 2020, Buckman et al., 2020].

4.3 Policy improvement operators

To instantiate the template, we also need to choose a specific policy improvement operator $\mathcal{I}$. We consider the following improvement operators, selected from those discussed in the related work section. Each operator has a hyperparameter controlling deviation from the behavior policy.

Behavior cloning. The simplest baseline worth including is to just return $\hat\beta$ as the new policy $\pi$. Any policy improvement operator ought to perform at least as well as this baseline. A minimal sketch of such a maximum-likelihood fit is given below.
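The sketch is ours (not the released implementation) and uses a simple diagonal Gaussian policy for illustration; the resulting `beta_hat` also serves as the density estimate used by the regularized operators described next.

```python
# Minimal behavior cloning / beta-hat estimation sketch (our illustration).
# A diagonal Gaussian policy is fit by maximizing the log-likelihood of the
# dataset actions; the exact policy parameterization used in the paper may differ.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, s):
        return torch.distributions.Normal(self.mean(self.trunk(s)), self.log_std.exp())

    def log_prob(self, s, a):
        return self.dist(s).log_prob(a).sum(-1)

    def sample(self, s):
        return self.dist(s).rsample()   # reparameterized, so gradients can flow through actions


def behavior_clone(dataset, state_dim, action_dim, steps=50_000, batch_size=256, lr=3e-4):
    """Fit beta_hat by maximum likelihood on the offline dataset."""
    beta_hat = GaussianPolicy(state_dim, action_dim)
    opt = torch.optim.Adam(beta_hat.parameters(), lr=lr)
    for _ in range(steps):
        s, a, *_ = dataset.sample(batch_size)       # assumed replay-buffer API
        loss = -beta_hat.log_prob(s, a).mean()      # negative log-likelihood
        opt.zero_grad(); loss.backward(); opt.step()
    return beta_hat
```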
Constrained policy updates. Algorithms like BCQ [Fujimoto et al., 2018a] and SPIBB [Laroche et al., 2019] constrain the policy updates to be within the support of the data/behavior. In favor of simplicity, we implement a simplified version of the BCQ algorithm, which we call Easy BCQ, that removes the perturbation network. We define a new policy $\hat\pi^M_k$ by drawing $M$ samples from $\hat\beta$ and then executing the one with the highest value according to $\hat Q^\beta$. Explicitly:

$$\hat\pi^M_k(a|s) = \mathbb{1}\big[a = \arg\max_{a_j} \{\hat Q^{\pi_{k-1}}(s, a_j) : a_j \sim \pi_{k-1}(\cdot|s),\ 1 \le j \le M\}\big]. \qquad (2)$$

Regularized policy updates. Another common idea proposed in the literature is to regularize towards the behavior policy [Wu et al., 2019, Jaques et al., 2019, Kumar et al., 2019]. For a general divergence $D$ we can define an algorithm that maximizes a regularized objective:

$$\hat\pi^\alpha_k = \arg\max_\pi \sum_i \mathbb{E}_{a \sim \pi|s_i}\big[\hat Q^{\pi_{k-1}}(s_i, a)\big] - \alpha D\big(\hat\beta(\cdot|s_i), \pi(\cdot|s_i)\big). \qquad (3)$$

A comprehensive review of different variants of this method can be found in Wu et al. [2019], which does not find dramatic differences across regularization techniques. In practice, we use the reverse KL divergence, i.e. $\mathrm{KL}(\pi(\cdot|s_i) \,\|\, \hat\beta(\cdot|s_i))$. To compute the reverse KL, we draw samples from $\pi(\cdot|s_i)$ and use the density estimate $\hat\beta$ to compute the divergence. Intuitively, this regularization forces $\pi$ to remain within the support of $\beta$ rather than incentivizing $\pi$ to cover $\beta$.

Variants of imitation learning. Another idea, proposed by Wang et al. [2018], Siegel et al. [2020], Wang et al. [2020b], Chen et al. [2020], is to modify an imitation learning algorithm either by filtering or by weighting the observed actions to incentivize policy improvement. The weighted version that we implement uses exponentiated advantage estimates to weight the observed actions:

$$\hat\pi^\tau_k = \arg\max_\pi \sum_i \exp\big(\tau(\hat Q^{\pi_{k-1}}(s_i, a_i) - \hat V(s_i))\big) \log \pi(a_i|s_i). \qquad (4)$$

With these definitions, we can now move on to testing various combinations of algorithmic template (one-step, multi-step, or iterative) and improvement operator (Easy BCQ, reverse KL regularization, or exponentially weighted imitation). Minimal code sketches of these operators are given below.
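These sketches are our illustration (not the released code), reusing the `GaussianPolicy` and replay-buffer interfaces from the sketches above; `alpha`, `tau`, and `M` are the deviation hyperparameters, and `v_estimate` is an assumed value baseline (for example, an average of Q over actions sampled from `beta_hat`).

```python
# Sketches of the improvement operators I (our illustration, not the released code).
import torch


def easy_bcq_action(s, q_net, prev_pi, M=10):
    """Easy BCQ (Eq. 2): sample M candidate actions, take the one with the highest Q."""
    candidates = torch.stack([prev_pi.sample(s) for _ in range(M)])       # [M, B, action_dim]
    values = torch.stack([q_net(s, a).squeeze(-1) for a in candidates])   # [M, B]
    best = values.argmax(dim=0)                                           # best candidate index per state
    return candidates[best, torch.arange(s.shape[0])]


def reverse_kl_step(pi, pi_opt, s, q_net, beta_hat, alpha=0.1):
    """Regularized update (Eq. 3) with reverse KL: maximize Q minus alpha * KL(pi || beta_hat)."""
    a = pi.sample(s)                                          # reparameterized sample
    kl = pi.log_prob(s, a) - beta_hat.log_prob(s, a)          # single-sample Monte Carlo KL estimate
    loss = (-q_net(s, a).squeeze(-1) + alpha * kl).mean()
    pi_opt.zero_grad(); loss.backward(); pi_opt.step()


def exp_weighted_step(pi, pi_opt, s, a, q_net, v_estimate, tau=1.0, max_weight=100.0):
    """Weighted imitation (Eq. 4): exponentiated-advantage-weighted log-likelihood of dataset actions."""
    with torch.no_grad():
        adv = q_net(s, a).squeeze(-1) - v_estimate(s)
        w = torch.clamp(torch.exp(tau * adv), max=max_weight)   # clipping is our stabilizer, not from the paper
    loss = -(w * pi.log_prob(s, a)).mean()
    pi_opt.zero_grad(); loss.backward(); pi_opt.step()
```

In the one-step algorithm each operator is applied once against $\hat Q^\beta$; the multi-step and iterative variants call the same operators repeatedly against re-estimated Q functions.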
5 Benchmark Results

Our main empirical finding is that one step of policy improvement is sufficient to beat state-of-the-art results on much of the D4RL benchmark suite [Fu et al., 2020]. This is striking since prior work focuses on iteratively estimating the Q function of the current policy iterate, while we only use one step derived from $\hat Q^\beta$. Results are shown in Table 1. Full experimental details are in Appendix C and code can be found at https://github.com/davidbrandfonbrener/onestep-rl.

Table 1: Results of one-step algorithms on the D4RL benchmark. The first column gives the best results across several iterative algorithms considered in Fu et al. [2020]. Each algorithm is tuned over 6 values of its respective hyperparameter. We report the mean and standard error over 10 seeds of the training process, using 100 evaluation episodes per seed. We bold the best result on each dataset and color blue any result where a one-step algorithm beat the best reported iterative result from Fu et al. [2020]. We use m for medium, m-e for medium-expert, m-re for medium-replay, r for random, and c for cloned.

| Dataset | Iterative (Fu et al. [2020]) | BC | Easy BCQ | Rev. KL Reg | Exp. Weight |
|---|---|---|---|---|---|
| halfcheetah-m | 46.3 | 42.1 ± 0.1 | 52.6 ± 0.1 | 55.6 ± 0.2 | 48.6 ± 0.0 |
| walker2d-m | 81.1 | 70.2 ± 1.3 | 86.9 ± 0.4 | 85.6 ± 0.4 | 80.3 ± 1.1 |
| hopper-m | 58.8 | 49.8 ± 0.6 | 69.7 ± 2.1 | 83.3 ± 1.4 | 56.7 ± 0.8 |
| halfcheetah-m-e | 64.7 | 60.1 ± 0.8 | 77.0 ± 0.9 | 93.5 ± 0.1 | 91.7 ± 0.9 |
| walker2d-m-e | 111.0 | 93.6 ± 5.6 | 111.8 ± 0.2 | 110.9 ± 0.1 | 112.9 ± 0.2 |
| hopper-m-e | 111.9 | 48.1 ± 1.5 | 81.4 ± 1.9 | 102.1 ± 1.3 | 83.1 ± 7.0 |
| halfcheetah-m-re | 47.7 | 34.9 ± 0.3 | 38.4 ± 0.3 | 42.4 ± 0.1 | 38.6 ± 0.5 |
| walker2d-m-re | 26.7 | 23.9 ± 1.6 | 66.4 ± 2.0 | 71.6 ± 3.1 | 49.3 ± 3.5 |
| hopper-m-re | 48.6 | 21.2 ± 1.3 | 77.3 ± 2.7 | 71.0 ± 8.1 | 94.1 ± 2.4 |
| halfcheetah-r | 35.4 | 2.2 ± 0.0 | 5.4 ± 0.1 | 6.9 ± 1.0 | 3.7 ± 0.2 |
| walker2d-r | 7.3 | 0.7 ± 0.1 | 4.2 ± 0.2 | 6.1 ± 0.3 | 5.2 ± 0.2 |
| hopper-r | 12.2 | 2.6 ± 0.4 | 6.7 ± 0.1 | 7.8 ± 0.3 | 5.6 ± 0.6 |
| pen-c | 56.9 | 49.3 ± 2.2 | 67.0 ± 1.1 | 55.3 ± 1.9 | 54.7 ± 2.3 |
| hammer-c | 2.1 | 0.5 ± 0.1 | 2.8 ± 0.5 | 0.2 ± 0.0 | 1.2 ± 0.2 |
| relocate-c | -0.1 | 0.0 ± 0.0 | 0.3 ± 0.0 | 0.1 ± 0.0 | 0.1 ± 0.0 |
| door-c | 0.4 | 0.0 ± 0.0 | 0.4 ± 0.2 | 0.0 ± 0.1 | 0.1 ± 0.1 |

As we can see in the table, the one-step algorithms usually outperform the best iterative algorithms tested by Fu et al. [2020]. The one notable exception is the case of random data (especially on halfcheetah), where iterative algorithms have a clear advantage. We discuss potential causes of this further in Section 7.

To give a more direct comparison that controls for any potential implementation details, we use our implementation of reverse KL regularization to create multi-step and iterative algorithms. We do not use algorithmic modifications like Q ensembles, regularized Q values, or early stopping that have been used in prior work, yet our iterative algorithm recovers similar performance to prior regularized actor-critic approaches. These results are shown in Table 2.

Table 2: Results of reverse KL regularization on the D4RL benchmark across one-step, multi-step, and iterative algorithms. Again we tune over 6 hyperparameter values and report the mean and standard error across 10 seeds using 100 evaluation episodes.

| Dataset | One-step | Multi-step | Iterative |
|---|---|---|---|
| halfcheetah-m | 55.6 ± 0.2 | 40.8 ± 8.6 | 47.4 ± 3.5 |
| walker2d-m | 85.6 ± 0.4 | 75.9 ± 0.5 | 75.4 ± 0.8 |
| hopper-m | 83.3 ± 1.4 | 53.0 ± 1.0 | 54.2 ± 0.6 |
| halfcheetah-m-e | 93.5 ± 0.1 | 93.6 ± 0.3 | 93.6 ± 0.2 |
| walker2d-m-e | 110.9 ± 0.1 | 76.3 ± 15.9 | 108.2 ± 0.3 |
| hopper-m-e | 102.1 ± 1.3 | 101.3 ± 3.9 | 82.7 ± 7.4 |
| halfcheetah-r | 6.9 ± 1.0 | 13.7 ± 1.7 | 16.3 ± 1.6 |
| walker2d-r | 6.1 ± 0.3 | 5.0 ± 0.3 | 5.1 ± 0.3 |
| hopper-r | 7.8 ± 0.3 | 15.4 ± 2.9 | 9.7 ± 0.1 |

Put together, these results immediately suggest some guidance for the practitioner: it is worthwhile to run the one-step algorithm as a baseline before trying something more elaborate. The one-step algorithm is substantially simpler than prior work, but frequently achieves better performance.

6 What goes wrong for iterative algorithms?

The benchmark experiments show that one step of policy improvement often beats iterative and multi-step algorithms. In this section we dive deeper to understand why this happens. First, by examining the learning curves of each of the algorithms, we note that iterative algorithms require stronger regularization to avoid instability. Then we identify two causes of this instability: distribution shift and iterative error exploitation. Distribution shift causes evaluation error by reducing the effective sample size in the fixed dataset for evaluating the current policy; it has been extensively considered in prior work, as discussed below. Iterative error exploitation occurs when we repeatedly optimize policies against our Q estimates and exploit their errors. This introduces a bias towards overestimation at each step (much like the training error in supervised learning is biased to be lower than the test error).
Moreover, by iteratively re-using the data and using prior Q estimates to warm-start training at each step, the errors from one step are amplified at the next. This type of error is particular to multi-step and iterative algorithms.

6.1 Learning curves and hyperparameter sensitivity

To begin to understand why iterative and multi-step algorithms can fail, it is instructive to look at the learning curves. As shown in Figure 2, we often observe that the iterative algorithm will begin to learn and then crash. Regularization can help to prevent this crash, since strong enough regularization towards the behavior policy ensures that the evaluation is nearly on-policy.

Figure 2: Learning curves and final performance on halfcheetah-medium across different algorithms and regularization hyperparameters (all using the reverse KL regularized improvement operator). Error bars show min and max over 3 seeds. Similar figures for other datasets from D4RL can be found in Appendix D.

In contrast, the one-step algorithm is more robust to the regularization hyperparameter. The rightmost panel of the figure shows this clearly. While iterative and multi-step algorithms can have their performance degrade very rapidly with the wrong setting of the hyperparameter, the one-step approach is more stable. Moreover, we usually find that the optimal setting of the regularization hyperparameter is lower for the one-step algorithm than for the iterative or multi-step approaches.

6.2 Distribution shift

Any algorithm that relies on off-policy evaluation will struggle with distribution shift in the evaluation step. Trying to evaluate a policy that is substantially different from the behavior reduces the effective sample size and increases the variance of the estimates. Explicitly, by distribution shift we mean the shift between the behavior distribution (the distribution over state-action pairs in the dataset) and the evaluation distribution (the distribution that would be induced by the policy $\pi$ we want to evaluate).

Prior work. There is a substantial body of prior theoretical work suggesting that off-policy evaluation can be difficult and that this difficulty scales with some measure of distribution shift. Wang et al. [2020a], Amortila et al. [2020], Zanette [2021] give exponential (in horizon) lower bounds on sample complexity in the linear setting, even with good feature representations that can represent the desired Q function and assuming good data coverage. Upper bounds generally require very strong assumptions on both the representation and limits on the distribution shift [Wang et al., 2021, Duan et al., 2020, Chen and Jiang, 2019]. Moreover, the assumed bounds on distribution shift can be exponential in horizon in the worst case. On the empirical side, Wang et al. [2021] demonstrate issues with distribution shift when learning from pre-trained features and provide a nice discussion of why distribution shift causes error amplification. Fujimoto et al. [2018a] raise a similar issue under the name "extrapolation error." Regularization and constraints are meant to reduce issues stemming from distribution shift, but they also reduce the potential for improvement over the behavior.

Empirical evidence. Both the multi-step and iterative algorithms in our experiments rely on off-policy evaluation as a key subroutine. We examine how easy it is to evaluate the policies encountered along the learning trajectory.
To control for issues of iterative error exploitation (discussed in the next subsection), we train Q estimators from scratch on a heldout evaluation dataset sampled from the behavior policy. We then evaluate these trained Q functions on rollouts from 1000 datapoints sampled from the replay buffer. Results are shown in Figure 3. The results show a correlation between KL and MSE. Moreover, we see that the MSE generally increases over training. One way to mitigate this, as seen in the figure, is to use a large value of $\alpha$; we simply cannot take a very large step before running into problems with distribution shift. But when we take such a small step, the information from the on-policy $\hat Q^\beta$ is about as useful as the newly estimated $\hat Q^\pi$. This is seen, for example, in Figure 2, where we get very similar performance across algorithms at high levels of regularization.

Figure 3: Results of running the iterative algorithm on halfcheetah-medium. Each checkpointed policy is evaluated by a Q function trained from scratch on heldout data. MSE refers to $\mathbb{E}_{s,a \sim \beta}[(\hat Q^{\pi_i}(s,a) - Q^{\pi_i}(s,a))^2]$ and KL refers to $\mathbb{E}_{s \sim \beta}[\mathrm{KL}(\pi(\cdot|s) \,\|\, \beta(\cdot|s))]$. Left: 90 policies taken from various points in training with various hyperparameters and random seeds. Center: MSE learning curves. Right: KL learning curves. Error bars show min and max over 3 random seeds.

6.3 Iterative error exploitation

The previous subsection identifies how any algorithm that uses off-policy evaluation is fundamentally limited by distribution shift, even if we were given fresh data and trained Q functions from scratch at every iteration. But in practice, iterative algorithms repeatedly alternate between optimizing policies against estimated Q functions and re-estimating the Q functions using the same data, with the Q function from the previous step warm-starting the re-estimation. This induces dependence between steps that causes a problem we call iterative error exploitation.

Intuition about the problem. In short, iterative error exploitation happens because $\pi_i$ tends to choose overestimated actions in the policy improvement step, and this overestimation then propagates via dynamic programming in the policy evaluation step. To illustrate this issue more formally, consider the following: at each $(s, a)$ we suffer some Bellman error $\epsilon^\pi_\beta(s, a)$ based on our fixed dataset collected by $\beta$. Formally,

$$\hat Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \mid s,a,\ a' \sim \pi \mid s'}\big[\hat Q^\pi(s', a')\big] + \epsilon^\pi_\beta(s,a). \qquad (5)$$

Intuitively, $\epsilon^\pi_\beta$ will be larger at state-actions with less coverage in the dataset collected by $\beta$. Note that $\epsilon^\pi_\beta$ can absorb all error, whether it is caused by the finite sample size or by function approximation. All that is needed to cause iterative error exploitation is that the $\epsilon^\pi_\beta$ are highly correlated across different $\pi$, but for simplicity we will assume that $\epsilon^\pi_\beta$ is the same for all policies $\pi$ estimated from our fixed offline dataset and instead write $\epsilon_\beta$. Now that the errors do not depend on the policy, we can treat them as auxiliary rewards that obscure the true rewards and see that

$$\hat Q^\pi(s,a) = Q^\pi(s,a) + \tilde Q^\pi_\beta(s,a), \qquad \tilde Q^\pi_\beta(s,a) := \mathbb{E}_{\pi \mid s_0, a_0 = s, a}\Big[\sum_{t=0}^{\infty} \gamma^t \epsilon_\beta(s_t, a_t)\Big].$$

This assumption is somewhat reasonable since we expect the error to primarily depend on the data. And when the prior Q function is used to warm-start the current one (as is generally the case in practice), the approximation errors are automatically passed between steps.
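To spell out the decomposition above (a short derivation, ours, following directly from the stated constant-error assumption): subtracting the true Bellman equation $Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'\mid s,a,\ a'\sim\pi\mid s'}[Q^\pi(s',a')]$ from (5) with $\epsilon^\pi_\beta = \epsilon_\beta$ gives

$$\hat Q^\pi(s,a) - Q^\pi(s,a) = \gamma\, \mathbb{E}_{s'\mid s,a,\ a'\sim\pi\mid s'}\big[\hat Q^\pi(s',a') - Q^\pi(s',a')\big] + \epsilon_\beta(s,a),$$

and unrolling this recursion along trajectories generated by $\pi$ yields $\hat Q^\pi(s,a) - Q^\pi(s,a) = \mathbb{E}_{\pi \mid s_0, a_0 = s, a}\big[\sum_{t=0}^{\infty} \gamma^t \epsilon_\beta(s_t, a_t)\big] = \tilde Q^\pi_\beta(s,a)$, i.e. the estimation error is exactly the value of $\epsilon_\beta$ treated as an auxiliary reward.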
Now we can explain the problem. Recall that under our assumption the $\epsilon_\beta$ are fixed once we have a dataset and are likely to have larger magnitude the further we go from the support of the dataset. So, with each step, $\pi_i$ is able to better maximize $\epsilon_\beta$, thus moving further from $\beta$ and increasing the magnitude of $\tilde Q^{\pi_i}_\beta$ relative to $Q^{\pi_i}$. Even though $Q^{\pi_i}$ may provide better signal than $Q^\beta$, it can easily be drowned out by $\tilde Q^{\pi_i}_\beta$. In contrast, $\tilde Q^\beta_\beta$ has small magnitude, so the one-step algorithm is robust to errors.¹

An example. Now we consider a simple gridworld example to illustrate iterative error exploitation. This example fits exactly into the setup outlined above since all errors are due to reward estimation, so $\epsilon_\beta$ is indeed constant over all $\pi$. The gridworld we consider has one deterministic good state with reward 1 and many stochastic bad states with rewards distributed as $\mathcal{N}(-0.5, 1)$. We collect a dataset of 100 trajectories, each of length 100. One run of the multi-step offline regularized policy iteration algorithm is illustrated in Figure 4.

In the example we see that one step often outperforms multiple steps of improvement. Intuitively, when there are so many noisy states, it is likely that a few of them will be overestimated. Since the data is re-used for each step, these overestimations persist and propagate across the state space due to iterative error exploitation. This property of having many bad, but poorly estimated, states likely also exists in the high-dimensional control problems encountered in the benchmark, where there are many ways for the robots to fall down that are not observed in the data for non-random behavior. Moreover, both settings have larger errors in areas where we have less data. So even though the errors in the gridworld are caused by noise in the rewards, while the errors in D4RL are caused by function approximation, we think this is a useful mental model of the problem.

¹ We should note that iterative error exploitation is similar to, but distinct from, the overestimation addressed by double Q learning [Van Hasselt et al., 2016, Fujimoto et al., 2018b]. Since we are in the offline setting, the errors due to our finite dataset can be iteratively exploited more and more, while in the online setting considered by double Q learning, fresh data prevents this issue. We are also considering an algorithm based on policy iteration rather than value iteration.

Figure 4: An illustration of multi-step offline regularized policy iteration. The leftmost panel in each row shows the true reward (top) or the error $\epsilon_\beta$ (bottom). Each subsequent panel plots $\pi_i$ (with arrow size proportional to $\pi_i(a|s)$) over either $Q^{\pi_i}$ (top) or $\tilde Q^{\pi_i}_\beta$ (bottom), averaged over actions at each state. The one-step policy ($\pi_1$) has the highest value. The behavior policy here is a mixture of the optimal policy $\pi^*$ and the uniform policy $u$ with coefficient 0.2, so that $\beta = 0.2\,\pi^* + 0.8\,u$. We set $\alpha = 0.1$ as the regularization parameter for reverse KL regularization.

Figure 5: Histograms of overestimation error ($\hat Q^{\pi_i}(s,a) - Q^{\pi_i}(s,a)$) on halfcheetah-medium with the iterative algorithm. Left: errors from the training Q function. Right: errors from an independently trained Q function.

Empirical evidence. In practice we cannot easily visualize the progression of errors. However, the dependence between steps still arises as overestimation of the Q values. We can track the overestimation of the Q values over training as a way to measure how much bias is being induced by optimizing against our dependent Q estimators; a minimal sketch of this measurement is given below.
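The sketch is ours, not the released code: it assumes the simulator can be restored to a stored state via a hypothetical `reset_to` helper, and it compares the learned Q estimate against a Monte Carlo return from rollouts of the current policy.

```python
# Sketch of the overestimation measurement (our illustration, not the released code).
# For each tracked (s, a) pair we compare Q-hat(s, a) against a Monte Carlo estimate
# of Q^pi(s, a) obtained by rolling out the current policy from that pair.
# `reset_to(env, sim_state)` is an assumed helper that restores the simulator state.
import numpy as np
import torch


def monte_carlo_q(env, reset_to, pi, sim_state, action, gamma=0.99, horizon=1000):
    reset_to(env, sim_state)
    obs, reward, done, _ = env.step(np.asarray(action))      # old gym-style step API assumed
    ret, discount = reward, gamma
    for _ in range(horizon):
        if done:
            break
        with torch.no_grad():
            a = pi.sample(torch.as_tensor(obs, dtype=torch.float32)).numpy()
        obs, reward, done, _ = env.step(a)
        ret += discount * reward
        discount *= gamma
    return ret


def overestimation(env, reset_to, pi, q_net, tracked, n_rollouts=3):
    """Q-hat(s, a) minus a Monte Carlo estimate of Q^pi(s, a) for each tracked pair."""
    errors = []
    for sim_state, s, a in tracked:                # raw simulator state plus (s, a) tensors
        mc = np.mean([monte_carlo_q(env, reset_to, pi, sim_state, a)
                      for _ in range(n_rollouts)])
        with torch.no_grad():
            q_hat = q_net(s.unsqueeze(0), a.unsqueeze(0)).item()
        errors.append(q_hat - mc)
    return np.asarray(errors)
```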
As a control, we also train Q estimators from scratch on independently sampled evaluation data. These independently trained Q functions do not have the same overestimation bias, even though the squared error does tend to increase as the policy moves further from the behavior (as seen in Figure 3). Explicitly, we track 1000 state-action pairs from the replay buffer over training. For each checkpointed policy we perform 3 rollouts at each state to get an estimate of the true Q value and compare this to the estimated Q value. Results are shown in Figure 5.

7 When are multiple steps useful?

So far we have focused on why the one-step algorithm often works better than the multi-step and iterative algorithms. However, we do not want to give the impression that one-step is always better. Indeed, our own experiments in Section 5 show a clear advantage for the multi-step and iterative approaches when we have randomly collected data. While we cannot offer a precise delineation of when one-step will outperform multi-step, in this section we offer some intuition as to when we can expect to see benefits from multiple steps of policy improvement.

As seen in Section 6, multi-step and iterative algorithms have problems when they propagate estimation errors. This is especially problematic in noisy and/or high-dimensional environments. While the multi-step algorithms propagate this noise more widely than the one-step algorithm, they also propagate the signal. So, when we have sufficient coverage to reduce the magnitude of the noise, this increased propagation of signal can be beneficial. The D4RL experiments suggest that we are usually on the side of the tradeoff where the errors are large enough to make one-step preferable.

Figure 6: Performance of all three algorithms with reverse KL regularization across mixtures between halfcheetah-random and halfcheetah-medium. Error bars indicate min and max over 3 seeds.

In Appendix A we illustrate a simple gridworld example where a slight modification of the behavior policy from Figure 4 makes multi-step dramatically outperform one-step. This modified behavior policy (1) has better coverage of the noisy states (which reduces error, helping multi-step), and (2) does a worse job of propagating the reward from the good state (hurting one-step).

We can also test empirically how the behavior policy affects the tradeoff between error and signal propagation. To do this we construct a simple experiment where we mix data from the random behavior policy with data from the medium behavior policy. Explicitly, we construct a dataset $D$ out of the datasets $D_r$ (random) and $D_m$ (medium) such that each trajectory in $D$ comes from the medium dataset with probability $p_m$. So for $p_m = 0$ we have the random dataset, for $p_m = 1$ we have the medium dataset, and in between we have various mixtures. Results are shown in Figure 6. It takes surprisingly little data from the medium policy for one-step to outperform the iterative algorithm.

8 Discussion, limitations, and future work

This paper presents the surprising effectiveness of a simple one-step baseline for offline RL. We examine the failure modes of iterative algorithms and the conditions where we might expect them to outperform the simple one-step baseline. This provides guidance to a practitioner that the simple one-step baseline is a good place to start when approaching an offline RL problem. But we leave many questions unanswered.
One main limitation is that we lack a clear theoretical characterization of which environments and behaviors can guarantee that one-step outperforms multi-step, or vice versa. Such results will likely require strong assumptions, but could provide useful insight. We don't expect this to be easy, since it requires understanding policy iteration, which has been notoriously difficult to analyze and often converges much faster than the theory would suggest [Sutton and Barto, 2018, Agarwal et al., 2019]. Another limitation is that while only using one step is perhaps the simplest way to avoid the problems of off-policy evaluation, there are possibly other, more elaborate algorithmic solutions that we did not consider here. However, our strong empirical results suggest that the one-step algorithm is at least a strong baseline.

Broader impact. Our paper studies a simple and effective baseline approach to the offline RL problem. The effectiveness of this baseline raises some serious questions about the utility of prior work proposing substantially more complicated methods. By making this observation of prior shortcomings, our paper has the potential to encourage researchers to derive new and better methods for offline RL. This has many potential impacts on fields as diverse as robotics and healthcare, where better offline decision making can lead to better real-world performance. As always, we note that machine learning improvements come in the form of "building machines to do X better." For a sufficiently malicious or ill-informed choice of X, almost any progress in machine learning might indirectly lead to a negative outcome, and our work is not excluded from that.

Acknowledgements

This work is partially supported by the Alfred P. Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, NSF CHS-1901091, Samsung Electronics, and the Institute for Advanced Study. DB is supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.

References

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22-31. PMLR, 2017.

Alekh Agarwal, Nan Jiang, and S. Kakade. Reinforcement learning: Theory and algorithms. 2019.

P. Amortila, Nan Jiang, and Tengyang Xie. A variant of the Wang-Foster-Kakade lower bound for the discounted setting. arXiv preprint arXiv:2011.01075, 2020.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016. URL http://arxiv.org/abs/1606.01540.

Jacob Buckman, Carles Gelada, and Marc G. Bellemare. The importance of pessimism in fixed-dataset policy optimization, 2020.

John Burkhardt. The truncated normal distribution, 2014.

Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.

Xinyue Chen, Zijian Zhou, Zheng Wang, Che Wang, Yanqiu Wu, and Keith Ross. BAIL: Best-action imitation learning for batch deep reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020.

Yaqi Duan, Zeyu Jia, and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pages 2701-2709. PMLR, 2020.

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900, 2018a.

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018b.

Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708, 2019.

Wonjoon Goo and Scott Niekum. You only evaluate once: a simple baseline algorithm for offline RL. In Offline Reinforcement Learning Workshop at Neural Information Processing Systems, 2020.

Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gómez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, et al. RL Unplugged: Benchmarks for offline reinforcement learning. arXiv preprint arXiv:2006.13888, 2020.

Caglar Gulcehre, Sergio Gómez Colmenarejo, Ziyu Wang, Jakub Sygnowski, Thomas Paine, Konrad Zolna, Yutian Chen, Matthew Hoffman, Razvan Pascanu, and Nando de Freitas. Regularized behavior value estimation. arXiv preprint arXiv:2103.09575, 2021.

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog, 2019.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267-274, 2002.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Ilya Kostrikov, Jonathan Tompson, Rob Fergus, and Ofir Nachum. Offline reinforcement learning with Fisher divergence critic regularization. arXiv preprint arXiv:2103.08050, 2021.

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11761-11771, 2019.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.

Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with baseline bootstrapping. In International Conference on Machine Learning, pages 3652-3661. PMLR, 2019.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054-1062, 2016.

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.

Kimia Nadjahi, Romain Laroche, and Rémi Tachet des Combes. Safe policy improvement with soft baseline bootstrapping, 2019.
Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, and Nando de Freitas. Hyperparameter selection for offline reinforcement learning, 2020.

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Martin L Puterman and Moon Chirl Shin. Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24(11):1127-1137, 1978.

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.

Bruno Scherrer, Victor Gabillon, Mohammad Ghavamzadeh, and Matthieu Geist. Approximate modified policy iteration. arXiv preprint arXiv:1205.3054, 2012.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.

Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. In International Conference on Learning Representations, 2020.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Cameron Voloshin, Hoang M Le, Nan Jiang, and Yisong Yue. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.

Qing Wang, Jiechao Xiong, Lei Han, Han Liu, Tong Zhang, et al. Exponentially weighted imitation learning for batched historical data. In Advances in Neural Information Processing Systems, pages 6288-6297, 2018.

Ruosong Wang, Dean P. Foster, and Sham M. Kakade. What are the statistical limits of offline RL with linear function approximation?, 2020a.

Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, and Sham M Kakade. Instabilities of offline RL with pre-trained neural representation. arXiv preprint arXiv:2103.04947, 2021.

Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. Advances in Neural Information Processing Systems, 33, 2020b.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning, 2019.

Andrea Zanette. Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL, 2021.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 8 and Section 7.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 8.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplement.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix C.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] In all relevant figures.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix C.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] Data from Fu et al. [2020].
   (b) Did you mention the license of the assets? [Yes] The license is Apache 2.0.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] Code in supplement.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] Data is simulated.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] Data is simulated.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]