# Offline Reinforcement Learning with Implicit Q-Learning

Published as a conference paper at ICLR 2022

Ilya Kostrikov, Ashvin Nair & Sergey Levine
Department of Electrical Engineering and Computer Science
University of California, Berkeley
{kostrikov,anair17}@berkeley.edu, svlevine@eecs.berkeley.edu

ABSTRACT

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose a new offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state-conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function, without any explicit policy. Then, we extract the policy via advantage-weighted behavioral cloning, which also avoids querying out-of-sample actions. We dub our method implicit Q-learning (IQL). IQL is easy to implement, computationally efficient, and only requires fitting an additional critic with an asymmetric L2 loss. IQL demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance when fine-tuning with online interaction after an offline RL initialization.

1 INTRODUCTION

Offline reinforcement learning (RL) addresses the problem of learning effective policies entirely from previously collected data, without online interaction (Fujimoto et al., 2019; Lange et al., 2012). This is very appealing in a range of real-world domains, from robotics to logistics and operations research, where real-world exploration with untrained policies is costly or dangerous, but prior data is available. However, this also carries with it major challenges: improving the policy beyond the level of the behavior policy that collected the data requires estimating values for actions other than those that were seen in the dataset, and this, in turn, requires trading off policy improvement against distributional shift, since the values of actions that are too different from those in the data are unlikely to be estimated accurately.
Prior methods generally address this by either constraining the policy to limit how far it deviates from the behavior policy (Fujimoto et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021; Kumar et al., 2019; Nair et al., 2020; Wang et al., 2020), or by regularizing the learned value functions to assign low values to out-of-distribution actions (Kumar et al., 2020; Kostrikov et al., 2021). Nevertheless, this imposes a trade-off between how much the policy improves and how vulnerable it is to misestimation due to distributional shift. Can we devise an offline RL method that avoids this issue by never needing to directly query or estimate values for actions that were not seen in the data?

In this work, we start from the observation that the in-distribution constraints widely used in prior work might not be sufficient to avoid value function extrapolation, and we ask whether it is possible to learn an optimal policy with in-sample learning, without ever querying the values of any unseen actions. The key idea in our method is to approximate an upper expectile of the distribution over values with respect to the distribution of dataset actions for each state. We alternate between fitting this value function with expectile regression and using it to compute Bellman backups for training the Q-function. We show that we can do this simply by modifying the loss function in a SARSA-style TD backup, without ever using out-of-sample actions in the target value. Once this Q-function has converged, we extract the corresponding policy using advantage-weighted behavioral cloning. This approach does not require explicit constraints or explicit regularization of out-of-distribution actions during value function training, though our policy extraction step does implicitly enforce a constraint, as discussed in prior work on advantage-weighted regression (Peters & Schaal, 2007; Peng et al., 2019; Nair et al., 2020; Wang et al., 2020).

Our main contribution is implicit Q-learning (IQL), a new offline RL algorithm that avoids ever querying values of unseen actions while still being able to perform multi-step dynamic programming updates. Our method is easy to implement by making a small change to the loss function in a simple SARSA-like TD update and is computationally very efficient. Furthermore, our approach demonstrates state-of-the-art performance on D4RL, a popular benchmark for offline reinforcement learning. In particular, our approach significantly improves over the prior state-of-the-art on the challenging AntMaze tasks, which require stitching together several sub-optimal trajectories. Finally, we demonstrate that our approach is suitable for fine-tuning: after initialization from offline RL, IQL is capable of improving policy performance using additional online interaction.

2 RELATED WORK

A significant portion of recently proposed offline RL methods are based on either constrained or regularized approximate dynamic programming (e.g., Q-learning or actor-critic methods), with the constraint or regularizer serving to limit deviation from the behavior policy. We will refer to these methods as multi-step dynamic programming algorithms, since they perform true dynamic programming for multiple iterations, and therefore can in principle recover the optimal policy if provided with high-coverage data.
The constraints can be implemented via an explicit density model (Wu et al., 2019; Fujimoto et al., 2019; Kumar et al., 2019; Ghasemipour et al., 2021), implicit divergence constraints (Nair et al., 2020; Wang et al., 2020; Peters & Schaal, 2007; Peng et al., 2019; Siegel et al., 2020), or by adding a supervised learning term to the policy improvement objective (Fujimoto & Gu, 2021). Several works have also proposed to directly regularize the Q-function to produce low values for out-of-distribution actions (Kostrikov et al., 2021; Kumar et al., 2020; Fakoor et al., 2021). Our method is also a multi-step dynamic programming algorithm. However, in contrast to prior works, our method completely avoids directly querying the learned Q-function with unseen actions during training, removing the need for any constraint during this stage, though the subsequent policy extraction, which is based on advantage-weighted regression (Peng et al., 2019; Nair et al., 2020), does apply an implicit constraint. However, this policy does not actually influence value function training.

In contrast to multi-step dynamic programming methods, several recent works have proposed methods that rely either on a single step of policy iteration, fitting the value function or Q-function of the behavior policy and then extracting the corresponding greedy policy (Peng et al., 2019; Brandfonbrener et al., 2021; Gulcehre et al., 2021), or else avoid value functions completely and utilize behavioral cloning-style objectives (Chen et al., 2021). We collectively refer to these as single-step approaches. These methods avoid needing to query unseen actions as well, since they either use no value function at all, or learn the value function of the behavior policy. Although these methods are simple to implement and effective on the MuJoCo locomotion tasks in D4RL, we show that such single-step methods perform very poorly on more complex datasets in D4RL, which require combining parts of suboptimal trajectories ("stitching"). Prior multi-step dynamic programming methods perform much better in such settings, as does our method. We discuss this distinction in more detail in Section 5.1. Our method also shares the simplicity and computational efficiency of single-step approaches, providing an appealing combination of the strengths of both types of methods.

Our method is based on estimating the characteristics of a random variable. Several recent works involve approximating statistical quantities of the value function distribution. In particular, quantile regression (Koenker & Hallock, 2001) has been previously used in reinforcement learning to estimate the quantile function of a state-action value function (Dabney et al., 2018b;a; Kuznetsov et al., 2020). Although our method is related, in that we perform expectile regression, our aim is not to estimate the distribution of values that results from stochastic transitions, but rather to estimate expectiles of the state value function with respect to random actions. This is a very different statistic: our aim is not to determine how the Q-value can vary with different future outcomes, but how the Q-value can vary with different actions while averaging together future outcomes due to stochastic dynamics. While prior work on distributional RL could also be used for offline RL, it would suffer from the same action extrapolation issues as other methods, and would require similar constraints or regularization, while our method does not.
3 PRELIMINARIES

The RL problem is formulated in the context of a Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, p_0(s), p(s'|s,a), r(s,a), \gamma)$, where $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $p_0(s)$ is a distribution of initial states, $p(s'|s,a)$ is the environment dynamics, $r(s,a)$ is a reward function, and $\gamma$ is a discount factor. The agent interacts with the MDP according to a policy $\pi(a|s)$. The goal is to obtain a policy that maximizes the cumulative discounted returns:

$$\max_\pi \; \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big], \qquad s_0 \sim p_0(\cdot),\; a_t \sim \pi(\cdot|s_t),\; s_{t+1} \sim p(\cdot|s_t, a_t).$$

Off-policy RL methods based on approximate dynamic programming typically utilize a state-action value function (Q-function), referred to as $Q^\pi(s,a)$, which corresponds to the discounted returns obtained by starting from the state $s$ and action $a$, and then following the policy $\pi$.

Offline reinforcement learning. In contrast to online (on-policy or off-policy) RL methods, offline RL uses previously collected data without any additional data collection. Like many recent offline RL methods, our work builds on approximate dynamic programming methods that minimize the temporal difference error, according to the following loss:

$$L_{TD}(\theta) = \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(r(s,a) + \gamma \max_{a'} Q_{\hat\theta}(s',a') - Q_\theta(s,a)\big)^2\Big], \qquad (1)$$

where $\mathcal{D}$ is the dataset, $Q_\theta(s,a)$ is a parameterized Q-function, $Q_{\hat\theta}(s,a)$ is a target network (e.g., with soft parameter updates defined via Polyak averaging), and the policy is defined as $\pi(s) = \arg\max_a Q_\theta(s,a)$. Most recent offline RL methods either modify the value function loss (above) to regularize the value function in a way that keeps the resulting policy close to the data, or constrain the $\arg\max$ policy directly. This is important because out-of-distribution actions $a'$ can produce erroneous values for $Q_{\hat\theta}(s',a')$ in the above objective, often leading to overestimation, as the policy is defined to maximize the (estimated) Q-value.

4 IMPLICIT Q-LEARNING

In this work, we aim to entirely avoid querying out-of-sample (unseen) actions in our TD loss. Although the goal of this work is to approximate the optimal Q-function, we start by considering fitted Q evaluation with a SARSA-style objective, which has been considered in prior work on offline reinforcement learning (Brandfonbrener et al., 2021; Gulcehre et al., 2021). This objective aims to learn the value of the dataset policy $\pi_\beta$ (also called the behavior policy):

$$L(\theta) = \mathbb{E}_{(s,a,s',a')\sim\mathcal{D}}\Big[\big(r(s,a) + \gamma Q_{\hat\theta}(s',a') - Q_\theta(s,a)\big)^2\Big]. \qquad (2)$$

This objective never queries values for out-of-sample actions, in contrast to Eqn. (1). One specific property of this objective that is important for this work is that it uses a mean squared error (MSE) loss that fits $Q_\theta(s,a)$ to predict the mean statistics of the TD targets. Thus, if we assume unlimited capacity and no sampling error, the optimal parameters should satisfy

$$Q_\theta(s,a) \approx r(s,a) + \gamma\, \mathbb{E}_{s'\sim p(\cdot|s,a),\, a'\sim\pi_\beta(\cdot|s')}\big[Q_{\hat\theta}(s',a')\big]. \qquad (3)$$
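To make the distinction between Eqn. (1) and Eqn. (2) concrete, the following is a minimal sketch (not code from the paper); the toy tabular Q-values, variable names, and numbers are purely illustrative assumptions. The target in Eqn. (1) maximizes over all actions, including ones never seen in the dataset, while the SARSA-style target in Eqn. (2) only queries the next action that was actually logged.

```python
import jax.numpy as jnp

gamma = 0.99
# toy tabular Q-values for a 2-state, 3-action problem (illustrative numbers only)
q_table = jnp.array([[1.0, 5.0, 0.2],   # Q(s0, .)
                     [0.3, 0.1, 2.0]])  # Q(s1, .)

# one logged transition (s, a, r, s', a'); a' is the next action actually taken in the dataset
s, a, r, s_next, a_next = 0, 1, 1.0, 1, 0

target_eqn1 = r + gamma * jnp.max(q_table[s_next])   # Eqn. (1): max over all a', including unseen ones
target_eqn2 = r + gamma * q_table[s_next, a_next]    # Eqn. (2): only the logged dataset action a'

print(float(target_eqn1), float(target_eqn2))  # 2.98 vs. 1.297
```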
Prior work (Brandfonbrener et al., 2021; Gulcehre et al., 2021; Peng et al., 2019) has proposed directly using this objective to learn $Q^{\pi_\beta}$, and then train the policy to maximize $Q^{\pi_\beta}$. This avoids any issues with out-of-distribution actions, since the TD loss only uses dataset actions. However, while this procedure works well empirically on simple MuJoCo locomotion tasks in D4RL, we will show that it performs very poorly on more complex tasks that benefit from multi-step dynamic programming. In our method, which we derive next, we retain the benefits of using this SARSA-like objective, but modify it so that it allows us to perform multi-step dynamic programming and learn a near-optimal Q-function.

Our method will perform a Q-function update similar to Eqn. (2), but we will aim to estimate the maximum Q-value over actions that are in the support of the data distribution. Crucially, we will show that it is possible to do this without ever querying the learned Q-function on out-of-sample actions by utilizing expectile regression. Formally, the value function we aim to learn is given by:

$$L(\theta) = \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Bigg[\Big(r(s,a) + \gamma \max_{\substack{a'\in\mathcal{A} \\ \text{s.t. } \pi_\beta(a'|s')>0}} Q_{\hat\theta}(s',a') - Q_\theta(s,a)\Big)^2\Bigg]. \qquad (4)$$

Our algorithm, implicit Q-learning (IQL), aims to estimate this objective while evaluating the Q-function only on the state-action pairs in the dataset. To this end, we propose to fit $Q_\theta(s,a)$ to estimate state-conditional expectiles of the target values, and show that specific expectiles approximate the maximization defined above. In Section 4.4 we show that this approach performs multi-step dynamic programming in theory, and in Section 5.1 we show that it does so in practice.

Figure 1: Left: the asymmetric squared loss used for expectile regression; $\tau = 0.5$ corresponds to the standard mean squared error loss, while $\tau = 0.9$ gives more weight to positive differences. Center: expectiles of a normal distribution. Right: an example of estimating state-conditional expectiles of a two-dimensional random variable. Each $x$ corresponds to a distribution over $y$. We can approximate the maximum of this random variable with expectile regression: $\tau = 0.5$ corresponds to the conditional mean of the distribution, while $\tau \approx 1$ approximates the maximum operator over in-support values of $y$.

4.1 EXPECTILE REGRESSION

Practical methods for estimating various statistics of a random variable have been thoroughly studied in applied statistics and econometrics. The $\tau \in (0,1)$ expectile of some random variable $X$ is defined as a solution to the asymmetric least squares problem

$$m_\tau = \arg\min_{m} \mathbb{E}_{x\sim X}\big[L_2^\tau(x - m)\big], \qquad \text{where } L_2^\tau(u) = |\tau - \mathbb{1}(u<0)|\,u^2.$$

That is, for $\tau > 0.5$, this asymmetric loss function downweights the contributions of $x$ values smaller than $m_\tau$ while giving more weight to larger values (see Fig. 1, left). Expectile regression is closely related to quantile regression (Koenker & Hallock, 2001), a popular technique for estimating quantiles of a distribution that is widely used in reinforcement learning (Dabney et al., 2018b;a); the quantile regression loss is defined as an asymmetric $\ell_1$ loss.¹ We can also use this formulation to predict expectiles of a conditional distribution:

$$\arg\min_{m_\tau} \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[L_2^\tau(y - m_\tau(x))\big].$$

Fig. 1 (right) illustrates conditional expectile regression on a simple two-dimensional distribution. Note that we can optimize this objective with stochastic gradient descent. It provides unbiased gradients and is easy to implement with standard machine learning libraries.

¹Our method could also be derived with quantiles, but since we are not interested in learning all of the expectiles/quantiles, unlike prior work (Dabney et al., 2018b;a), it is more convenient to estimate a single expectile, because this involves a simple modification to the MSE loss that is already used in standard RL methods. We found it to work somewhat better than quantile regression with its corresponding $\ell_1$ loss.
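As a concrete illustration of the asymmetric loss above, the following is a minimal sketch (not code from the paper; the function names are illustrative) that fits a single expectile of a sample by gradient descent on $L_2^\tau$. As $\tau$ approaches 1, the fitted value moves from the sample mean toward the sample maximum.

```python
import jax
import jax.numpy as jnp

def expectile_loss(u, tau):
    # L2_tau(u) = |tau - 1(u < 0)| * u^2; tau = 0.5 recovers the ordinary squared error
    weight = jnp.where(u < 0, 1.0 - tau, tau)
    return weight * u ** 2

def fit_expectile(x, tau, lr=0.1, steps=2000):
    m = jnp.mean(x)  # initialize at the mean, i.e., the tau = 0.5 solution
    grad_fn = jax.jit(jax.grad(lambda m: jnp.mean(expectile_loss(x - m, tau))))
    for _ in range(steps):
        m = m - lr * grad_fn(m)
    return m

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (10_000,))
for tau in (0.5, 0.7, 0.9, 0.99):
    print(tau, float(fit_expectile(x, tau)))  # increases toward the sample maximum as tau -> 1
```

The same asymmetric loss, applied to TD targets rather than raw samples, is what the next section uses to learn the value function.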
4.2 LEARNING THE VALUE FUNCTION WITH EXPECTILE REGRESSION

Expectile regression provides us with a powerful framework to estimate statistics of a random variable beyond mean regression. We can use expectile regression to modify the policy evaluation objective in Eqn. (2) to predict an upper expectile of the TD targets that approximates the maximum of $r(s,a) + \gamma Q_{\hat\theta}(s',a')$ over actions $a'$ constrained to the dataset actions, as in Eqn. (4). This leads to the following expectile regression objective:

$$L(\theta) = \mathbb{E}_{(s,a,s',a')\sim\mathcal{D}}\big[L_2^\tau\big(r(s,a) + \gamma Q_{\hat\theta}(s',a') - Q_\theta(s,a)\big)\big].$$

However, this formulation has a significant drawback. Instead of estimating expectiles just with respect to the actions in the support of the data, it also incorporates stochasticity that comes from the environment dynamics $s' \sim p(\cdot|s,a)$. Therefore, a large target value might not necessarily reflect the existence of a single action that achieves that value, but rather a lucky sample that happened to transition into a good state. We resolve this by introducing a separate value function that approximates an expectile only with respect to the action distribution, leading to the following loss:

$$L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[L_2^\tau\big(Q_{\hat\theta}(s,a) - V_\psi(s)\big)\big]. \qquad (5)$$

We can then use this estimate to update the Q-function with an MSE loss, which averages over the stochasticity from the transitions and avoids the "lucky sample" issue mentioned above:

$$L_Q(\theta) = \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\big[\big(r(s,a) + \gamma V_\psi(s') - Q_\theta(s,a)\big)^2\big]. \qquad (6)$$

Note that these losses do not use any explicit policy and only utilize actions from the dataset for both objectives, similarly to SARSA-style policy evaluation. In Section 4.4, we will show that this procedure recovers the optimal Q-function under some assumptions. Also, even though only one action is available for every state in the dataset for continuous action spaces, due to neural network generalization the expectile regression does not reduce to SARSA-style policy evaluation, as shown in Section 5.2.

4.3 POLICY EXTRACTION AND ALGORITHM SUMMARY

Algorithm 1 Implicit Q-learning
  Initialize parameters $\psi$, $\theta$, $\hat\theta$, $\phi$.
  TD learning (IQL):
  for each gradient step do
    $\psi \leftarrow \psi - \lambda_V \nabla_\psi L_V(\psi)$
    $\theta \leftarrow \theta - \lambda_Q \nabla_\theta L_Q(\theta)$
    $\hat\theta \leftarrow (1-\alpha)\hat\theta + \alpha\theta$
  end for
  Policy extraction (AWR):
  for each gradient step do
    $\phi \leftarrow \phi + \lambda_\pi \nabla_\phi L_\pi(\phi)$
  end for

While our modified TD learning procedure learns an approximation to the optimal Q-function, it does not explicitly represent the corresponding policy, and therefore requires a separate policy extraction step. While one can consider any technique for policy extraction that constrains the learned policy to stay close to the dataset actions, we aim for a simple method for policy extraction. As before, we aim to avoid using out-of-sample actions. Therefore, we extract the policy with advantage-weighted regression (Peters & Schaal, 2007; Peng et al., 2019), previously successfully used for policy extraction in offline RL (Wang et al., 2018; Nair et al., 2020; Brandfonbrener et al., 2021):

$$L_\pi(\phi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\exp\big(\beta\big(Q_{\hat\theta}(s,a) - V_\psi(s)\big)\big)\log \pi_\phi(a|s)\big], \qquad (7)$$

where $\beta \in [0, \infty)$ is an inverse temperature. Note that this objective does not clone all actions from the dataset; as shown in prior work, it learns a policy that maximizes the Q-values subject to a distribution constraint (Peters & Schaal, 2007; Peng et al., 2019; Nair et al., 2020). This step can be seen as selecting and cloning the best actions in the dataset.

Our final algorithm consists of two stages. First, we fit the value function $V_\psi$ and the Q-function $Q_\theta$, performing a number of gradient updates alternating between Eqn. (5) and (6). Second, we perform stochastic gradient descent on Eqn. (7).
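The three losses above translate almost directly into code. Below is a minimal JAX sketch of the per-batch loss computations in Eqns. (5)-(7); it is not the authors' released implementation, and the function handles (`q_fn`, `v_fn`, `q_target_fn`, `log_prob_fn`), parameter pytrees, batch dictionary keys, the terminal mask, and the weight clipping are illustrative assumptions one would adapt to a concrete network library.

```python
import jax.numpy as jnp

def expectile_loss(u, tau):
    # L2_tau(u) = |tau - 1(u < 0)| * u^2
    return jnp.abs(tau - (u < 0.0)) * u ** 2

def value_loss(v_params, v_fn, q_target_fn, batch, tau):
    # Eqn. (5): fit V_psi to an upper expectile of the (target-network) Q-values of dataset actions.
    q = q_target_fn(batch["observations"], batch["actions"])
    v = v_fn(v_params, batch["observations"])
    return jnp.mean(expectile_loss(q - v, tau))

def critic_loss(q_params, q_fn, v_fn_frozen, batch, gamma):
    # Eqn. (6): MSE toward r + gamma * V_psi(s'); only dataset state-action pairs are queried.
    # Masking terminal transitions with a `dones` flag is a practical detail, not part of Eqn. (6).
    next_v = v_fn_frozen(batch["next_observations"])
    target = batch["rewards"] + gamma * (1.0 - batch["dones"]) * next_v
    q = q_fn(q_params, batch["observations"], batch["actions"])
    return jnp.mean((target - q) ** 2)

def awr_policy_loss(pi_params, log_prob_fn, q_target_fn, v_fn_frozen, batch, beta, max_weight=100.0):
    # Negated Eqn. (7), so that minimizing it maximizes the advantage-weighted log-likelihood.
    # Clipping the exponentiated advantages is a common practical choice, not part of Eqn. (7).
    adv = q_target_fn(batch["observations"], batch["actions"]) - v_fn_frozen(batch["observations"])
    weights = jnp.minimum(jnp.exp(beta * adv), max_weight)
    log_probs = log_prob_fn(pi_params, batch["observations"], batch["actions"])
    return -jnp.mean(weights * log_probs)
```

Wrapping each loss in `jax.grad` and applying the gradient steps and Polyak target update from Algorithm 1 completes the training loop; `q_target_fn` can be the minimum over two target critics, as described next.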
For both steps, we use a version of clipped double Q-learning (Fujimoto et al., 2018), taking a minimum of two Q-functions for the V-function and policy updates. We summarize our final method in Algorithm 1. Note that the policy does not influence the value function in any way, and therefore policy extraction could be performed either concurrently with or after TD learning. Concurrent learning provides a way to use IQL with online fine-tuning, as we discuss in Section 5.3.

4.4 ANALYSIS

In this section, we will show that IQL can recover the optimal value function under the dataset support constraints. First, we prove a simple lemma that we will then use to show how our approach can enable learning the optimal value function.

Lemma 1. Let $X$ be a real-valued random variable with bounded support, and let $x^*$ be the supremum of the support. Then

$$\lim_{\tau\to 1} m_\tau = x^*.$$

Proof Sketch. One can show that expectiles of a random variable have the same supremum $x^*$. Moreover, for all $\tau_1$ and $\tau_2$ such that $\tau_1 < \tau_2$, we get $m_{\tau_1} \le m_{\tau_2}$. Therefore, the limit follows from the properties of bounded monotonically non-decreasing functions.

In the following theorems, we show that under certain assumptions, our method indeed approximates the optimal state-action value $Q^*$ and performs multi-step dynamic programming. We first prove a technical lemma relating different expectiles of the Q-function, and then derive our main result regarding the optimality of our method. For the sake of simplicity, we introduce the following notation for our analysis. Let $\mathbb{E}^\tau_{x\sim X}[x]$ be the $\tau$-th expectile of $X$ (e.g., $\mathbb{E}^{0.5}$ corresponds to the standard expectation). Then, we define $V_\tau(s)$ and $Q_\tau(s,a)$, which correspond to the optimal solutions of Eqn. (5) and (6) respectively, recursively as

$$V_\tau(s) = \mathbb{E}^\tau_{a\sim\pi_\beta(\cdot|s)}\big[Q_\tau(s,a)\big], \qquad Q_\tau(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(\cdot|s,a)}\big[V_\tau(s')\big].$$

Lemma 2. For all $s$, $\tau_1$ and $\tau_2$ such that $\tau_1 < \tau_2$, we get $V_{\tau_1}(s) \le V_{\tau_2}(s)$.

Proof. The proof follows the policy improvement proof (Sutton & Barto, 2018). See Appendix A.

Corollary 2.1. For any $\tau$ and $s$ we have

$$V_\tau(s) \le \max_{\substack{a\in\mathcal{A} \\ \text{s.t. } \pi_\beta(a|s)>0}} Q^*(s,a),$$

where $V_\tau(s)$ is defined as above and $Q^*(s,a)$ is the optimal state-action value function constrained to the dataset, defined as

$$Q^*(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(\cdot|s,a)}\Bigg[\max_{\substack{a'\in\mathcal{A} \\ \text{s.t. } \pi_\beta(a'|s')>0}} Q^*(s',a')\Bigg].$$

Proof. The proof follows from the observation that a convex combination is smaller than the maximum.

Theorem 3. For all $s$,

$$\lim_{\tau\to 1} V_\tau(s) = \max_{\substack{a\in\mathcal{A} \\ \text{s.t. } \pi_\beta(a|s)>0}} Q^*(s,a).$$

Proof. Follows from combining Lemma 1 and Corollary 2.1.

Therefore, for a larger value of $\tau < 1$, we get a better approximation of the maximum. On the other hand, it also becomes a more challenging optimization problem. Thus, we treat $\tau$ as a hyperparameter. Due to the property discussed in Theorem 3, we dub our method implicit Q-learning (IQL). We also emphasize that our value learning method defines an entire spectrum of methods between SARSA ($\tau = 0.5$) and Q-learning ($\tau \to 1$). Note that, in contrast to other multi-step methods, IQL absorbs the policy improvement step into value learning. Therefore, fitting the Q-function corresponds to the policy evaluation step, while fitting the value function with IQL corresponds to the implicit policy improvement step.
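As a small numerical check of Lemma 1 and Theorem 3 (an illustrative toy, not part of the paper), the sketch below computes the $\tau$-expectile of a discrete random variable by bisection on its first-order condition, $\tau\,\mathbb{E}[(x - m)_+] = (1-\tau)\,\mathbb{E}[(m - x)_+]$. As $\tau \to 1$, the expectile climbs from the mean toward the supremum of the support, here interpreted as the largest in-support Q-value at a state.

```python
import jax.numpy as jnp

def expectile(values, probs, tau, iters=100):
    # Bisection on the first-order condition tau*E[(x - m)+] = (1 - tau)*E[(m - x)+].
    lo, hi = jnp.min(values), jnp.max(values)
    for _ in range(iters):
        m = 0.5 * (lo + hi)
        balance = tau * jnp.sum(probs * jnp.maximum(values - m, 0.0)) \
                - (1.0 - tau) * jnp.sum(probs * jnp.maximum(m - values, 0.0))
        lo, hi = jnp.where(balance > 0, m, lo), jnp.where(balance > 0, hi, m)
    return 0.5 * (lo + hi)

# Q-values of three in-support actions at some state, and their behavior-policy probabilities
q_values = jnp.array([1.0, 2.0, 5.0])
probs = jnp.array([0.6, 0.3, 0.1])

for tau in (0.5, 0.9, 0.99, 0.999):
    print(tau, float(expectile(q_values, probs, tau)))
# climbs from the mean (1.7) toward the maximum in-support Q-value (5.0) as tau -> 1
```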
[Figure 2 panels: (a) toy maze MDP; (b) true optimal $V^*$; (c) one-step policy evaluation; (d) our method.]
Figure 2: Evaluation of our algorithm on a toy u-maze environment (a). When the static dataset is heavily corrupted by suboptimal actions, one-step policy evaluation results in a value function that degrades to zero far from the rewarding states too quickly (c). Our algorithm aims to learn a near-optimal value function, combining the best properties of SARSA-style evaluation with the ability to perform multi-step dynamic programming, leading to value functions that are much closer to optimality (shown in (b)) and producing a much better policy (d).

5 EXPERIMENTAL EVALUATION

Our experiments aim to evaluate our method comparatively, in contrast to prior offline RL methods, and in particular to understand how our approach compares both to single-step methods and to multi-step dynamic programming approaches. We will first demonstrate the benefits of multi-step dynamic programming methods, such as ours, over single-step methods, showing that on some problems this difference can be extremely large. We will then compare IQL with state-of-the-art single-step and multi-step algorithms on the D4RL (Fu et al., 2020) benchmark tasks, studying the degree to which we can learn effective policies using only the actions in the dataset. We examine domains that contain near-optimal trajectories, where single-step methods perform well, as well as domains with no optimal trajectories at all, which require multi-step dynamic programming. Finally, we will study how IQL compares to prior methods when fine-tuning with online RL starting from an offline RL initialization.

5.1 THE DIFFERENCE BETWEEN ONE-STEP POLICY IMPROVEMENT AND IQL

We first use a simple maze environment to illustrate the importance of multi-step dynamic programming for offline RL. The maze has a u-shape, a single start state, and a single goal state (see Fig. 2a). The agent receives a reward of 10 for entering the goal state and zero reward for all other transitions. With probability 0.25, the agent transitions to a random state, and otherwise to the commanded state. The dataset consists of 1 optimal trajectory and 99 trajectories with uniform random actions. Due to the short horizon of the problem, we use $\gamma = 0.9$.

Fig. 2 (c, d) illustrates the difference between single-step methods, which fit $Q^{\pi_\beta}(s,a)$ via a SARSA-style objective, here represented by Onestep RL (Brandfonbrener et al., 2021; Wang et al., 2018; Gulcehre et al., 2021), and IQL with $\tau = 0.95$. Note that these single-step methods represent a special case of our method with $\tau = 0.5$. Although states closer to the high-reward state will still have higher values, these values decay much faster as we move further away than they would for the optimal value function, and the resulting policy is highly suboptimal. Since IQL (d) performs iterative dynamic programming, it correctly propagates the signal, and the values are no longer dominated by noise. The resulting value function closely matches the true optimal value function (b).
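To make the role of $\tau$ in this comparison concrete, here is a small tabular sketch (an illustrative toy chain with deterministic dynamics, not the paper's maze or its implementation) that iterates tabular analogues of Eqns. (5) and (6): $V(s)$ is set to the $\tau$-expectile of $Q(s,\cdot)$ under the behavior policy, and $Q(s,a)$ is set to $r + \gamma V(s')$. With $\tau = 0.5$ this reduces to SARSA-style evaluation of the random behavior policy, whose values decay quickly away from the goal; with $\tau$ close to 1 the values approach the in-support optimum.

```python
import jax.numpy as jnp

n_states, gamma, goal = 6, 0.9, 5
actions = (-1, +1)                                  # step left / step right along the chain

def step(s, a):
    # Deterministic chain dynamics; the goal state is absorbing with zero reward.
    if s == goal:
        return s, 0.0, True
    s_next = min(max(s + a, 0), n_states - 1)
    return s_next, (10.0 if s_next == goal else 0.0), s_next == goal

def expectile(values, probs, tau, iters=30):
    # tau-expectile of a discrete distribution, by bisection on its first-order condition.
    lo, hi = jnp.min(values), jnp.max(values)
    for _ in range(iters):
        m = 0.5 * (lo + hi)
        balance = tau * jnp.sum(probs * jnp.maximum(values - m, 0.0)) \
                - (1.0 - tau) * jnp.sum(probs * jnp.maximum(m - values, 0.0))
        lo, hi = jnp.where(balance > 0, m, lo), jnp.where(balance > 0, hi, m)
    return 0.5 * (lo + hi)

pi_beta = jnp.full((n_states, len(actions)), 0.5)   # uniformly random behavior policy

def backup(v, s, a):
    # Tabular analogue of Eqn. (6): Q(s, a) = r + gamma * V(s'), with no bootstrap past the goal.
    s_next, r, done = step(s, a)
    return r + gamma * (0.0 if done else float(v[s_next]))

def run(tau, sweeps=80):
    v = jnp.zeros(n_states)
    for _ in range(sweeps):
        q = jnp.array([[backup(v, s, a) for a in actions] for s in range(n_states)])
        # Tabular analogue of Eqn. (5): V(s) = tau-expectile of Q(s, .) under pi_beta(.|s).
        v = jnp.array([expectile(q[s], pi_beta[s], tau) for s in range(n_states)])
    return v

print("tau = 0.50 (SARSA-style evaluation of pi_beta):", run(0.50))
print("tau = 0.95 (approaches the in-support optimum):", run(0.95))
# With tau = 0.5 the values decay quickly away from the goal; with tau close to 1 they
# approach the optimal values 10 * gamma**(k - 1) for a state k steps from the goal.
```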
5.2 COMPARISONS ON OFFLINE RL BENCHMARKS

Next, we evaluate our approach on the D4RL benchmark in comparison to prior methods (see Table 1). The MuJoCo tasks in D4RL consist of the Gym locomotion tasks, the AntMaze tasks, and the Adroit and Kitchen robotic manipulation environments. Some prior works, particularly those proposing one-step methods, focus entirely on the Gym locomotion tasks. However, these tasks include a significant fraction of near-optimal trajectories in the dataset. In contrast, the AntMaze tasks, especially the medium and large ones, contain very few or no near-optimal trajectories, making them very challenging for one-step methods. These domains require stitching parts of suboptimal trajectories that travel between different states to find a path from the start to the goal of the maze (Fu et al., 2020). As we will show, multi-step dynamic programming is essential in these domains. The Adroit and Kitchen tasks are comparatively less discriminating, and we found that most RL methods perform similarly to imitation learning in these domains (Florence et al., 2021). We therefore focus our analysis on the Gym locomotion and AntMaze domains, but include full Adroit and Kitchen results in Appendix B for completeness.

Table 1: Averaged normalized scores on MuJoCo locomotion and AntMaze tasks. Our method outperforms prior methods on the challenging AntMaze tasks, which require dynamic programming, and is competitive with the best prior methods on the locomotion tasks.

| Dataset | BC | 10%BC | BCQ | DT | ABM | AWAC | Onestep RL | TD3+BC | CQL | IQL (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| halfcheetah-m-v2 | 42.6 | 42.5 | 47.0 | 42.6 ± 0.1 | 53.6 | 43.5 | 48.4 ± 0.1 | 48.3 ± 0.3 | 44.0 ± 5.4 | 47.4 ± 0.2 |
| hopper-m-v2 | 52.9 | 56.9 | 56.7 | 67.6 ± 1.0 | 0.7 | 57.0 | 59.6 ± 2.5 | 59.3 ± 4.2 | 58.5 ± 2.1 | 66.2 ± 5.7 |
| walker2d-m-v2 | 75.3 | 75.0 | 72.6 | 74.0 ± 1.4 | 0.5 | 72.4 | 81.8 ± 2.2 | 83.7 ± 2.1 | 72.5 ± 0.8 | 78.3 ± 8.7 |
| halfcheetah-m-r-v2 | 36.6 | 40.6 | 40.4 | 36.6 ± 0.8 | 50.5 | 40.5 | 38.1 ± 1.3 | 44.6 ± 0.5 | 45.5 ± 0.5 | 44.2 ± 1.2 |
| hopper-m-r-v2 | 18.1 | 75.9 | 53.3 | 82.7 ± 7.0 | 49.6 | 37.2 | 97.5 ± 0.7 | 60.9 ± 18.8 | 95.0 ± 6.4 | 94.7 ± 8.6 |
| walker2d-m-r-v2 | 26.0 | 62.5 | 52.1 | 66.6 ± 3.0 | 53.8 | 27.0 | 49.5 ± 12.0 | 81.8 ± 5.5 | 77.2 ± 5.5 | 73.8 ± 7.1 |
| halfcheetah-m-e-v2 | 55.2 | 92.9 | 89.1 | 86.8 ± 1.3 | 18.5 | 42.8 | 93.4 ± 1.6 | 90.7 ± 4.3 | 91.6 ± 2.8 | 86.7 ± 5.3 |
| hopper-m-e-v2 | 52.5 | 110.9 | 81.8 | 107.6 ± 1.8 | 0.7 | 55.8 | 103.3 ± 1.9 | 98.0 ± 9.4 | 105.4 ± 6.8 | 91.5 ± 14.3 |
| walker2d-m-e-v2 | 107.5 | 109.0 | 109.5 | 108.1 ± 0.2 | 3.5 | 74.5 | 113.0 ± 0.4 | 110.1 ± 0.5 | 108.8 ± 0.7 | 109.6 ± 1.0 |
| locomotion-v2 total | 466.7 | 666.2 | 602.5 | 672.6 ± 16.6 | 231.4 | 450.7 | 684.6 ± 22.7 | 677.4 ± 44.5 | 698.5 ± 31.0 | 692.4 ± 52.1 |
| antmaze-u-v0 | 54.6 | 62.8 | 89.8 | 59.2 | 59.9 | 56.7 | 64.3 | 78.6 | 74.0 | 87.5 ± 2.6 |
| antmaze-u-d-v0 | 45.6 | 50.2 | 83.0 | 53.0 | 48.7 | 49.3 | 60.7 | 71.4 | 84.0 | 62.2 ± 13.8 |
| antmaze-m-p-v0 | 0.0 | 5.4 | 15.0 | 0.0 | 0.0 | 0.0 | 0.3 | 10.6 | 61.2 | 71.2 ± 7.3 |
| antmaze-m-d-v0 | 0.0 | 9.8 | 0.0 | 0.0 | 0.5 | 0.7 | 0.0 | 3.0 | 53.7 | 70.0 ± 10.9 |
| antmaze-l-p-v0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 | 15.8 | 39.6 ± 5.8 |
| antmaze-l-d-v0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 14.9 | 47.5 ± 9.5 |
| antmaze-v0 total | 100.2 | 134.2 | 187.8 | 112.2 | 109.1 | 107.7 | 125.3 | 163.8 | 303.6 | 378.0 ± 49.9 |
| total | 566.9 | 800.4 | 790.3 | 784.8 | 340.5 | 558.4 | 809.9 | 841.2 | 1002.1 | 1070.4 ± 102.0 |
| kitchen-v0 total | 154.5 | - | - | - | - | - | - | - | 144.6 | 159.8 ± 22.6 |
| adroit-v0 total | 104.5 | - | - | - | - | - | - | - | 93.6 | 118.1 ± 30.7 |
| total+kitchen+adroit | 825.9 | - | - | - | - | - | - | - | 1240.3 | 1348.3 ± 155.3 |
| runtime | 10m | 10m | - | 960m | - | 20m | 20m | 20m | 80m | 20m |

† Note that it is challenging to compare one-step and multi-step methods directly. Also, Brandfonbrener et al. (2021) report results for a set of hyperparameters, such as batch and network size, that is significantly different from other methods. We report results for the original hyperparameters and runtime for a comparable set of hyperparameters.

Figure 3: Left: estimating a larger expectile $\tau$ is crucial for AntMaze tasks that require dynamic programming ("stitching"). Right: clipped double Q-learning (CDQ) is crucial for learning values with $\tau = 0.9$.

Comparisons and baselines. We compare to methods that are representative of both multi-step dynamic programming and one-step approaches. In the former category, we compare to CQL (Kumar et al., 2020), TD3+BC (Fujimoto & Gu, 2021), and AWAC (Nair et al., 2020). In the latter category, we compare to Onestep RL (Brandfonbrener et al., 2021) and Decision Transformers (Chen et al., 2021). We obtained the Decision Transformers results on the AntMaze subsets of the D4RL tasks using the author-provided implementation² and following the authors' instructions communicated over email.
We obtained results for TD3+BC and Onestep RL (Exp. Weight) directly from the authors. Note that Chen et al. (2021) and Brandfonbrener et al. (2021) incorrectly report results for some prior methods, such as CQL, using the -v0 environments. These generally produce lower scores than the -v2 environments that these papers use for their own methods. We use the -v2 environments for all methods to ensure a fair comparison, resulting in higher values for CQL. Because of this fix, our reported CQL scores are higher than those in other prior papers. We obtained results for the -v2 datasets using an author-suggested implementation.³

On the Gym locomotion tasks (halfcheetah, hopper, walker2d), we find that IQL performs comparably to the best-performing prior method, CQL. On the more challenging AntMaze tasks, IQL outperforms CQL, and outperforms the one-step methods by a very large margin.

Runtime. Our approach is also computationally faster than the baselines (see Table 1). For the baselines, we measure runtime for our reimplementations of the methods in JAX (Bradbury et al., 2018) built on top of JAXRL (Kostrikov, 2021), which are typically faster than the original implementations. For example, the original implementation of CQL takes more than 4 hours to perform 1M updates, while ours takes only 80 minutes. Even so, IQL still requires about 4x less time than our reimplementation of CQL on average, and is comparable to the fastest prior one-step methods. We did not reimplement Decision Transformers due to their complexity and report the runtime of the original implementation.

²https://github.com/kzl/decision-transformer
³https://github.com/young-geng/CQL

Effect of the τ hyperparameter. We also demonstrate that it is crucial to compute a larger expectile on tasks that require stitching (see Fig. 3). We provide complete results in Appendix B. With larger values of $\tau$, our method approximates Q-learning better, leading to better performance on the AntMaze tasks. Moreover, due to neural network generalization, values learned with expectile regression increase with a larger $\tau$ and do not degrade to the behavior policy values ($\tau = 0.5$). Finally, clipped double Q-learning is crucial for estimating values with a larger $\tau = 0.9$.

5.3 ONLINE FINE-TUNING AFTER OFFLINE RL

Table 2: Online fine-tuning results showing the initial performance after offline RL, and performance after 1M steps of online RL. In all tasks, IQL is able to fine-tune to a significantly higher performance than the offline initialization, with final performance that is comparable to or better than the best of either AWAC or CQL on all tasks except pen-binary-v0.

| Dataset | AWAC | CQL | IQL (Ours) |
| --- | --- | --- | --- |
| antmaze-umaze-v0 | 56.7 → 59.0 | 70.1 → 99.4 | 88.0 → 96.3 |
| antmaze-umaze-diverse-v0 | 49.3 → 49.0 | 31.1 → 99.4 | 67.0 → 49.0 |
| antmaze-medium-play-v0 | 0.0 → 0.0 | 23.0 → 0.0 | 69.0 → 89.2 |
| antmaze-medium-diverse-v0 | 0.7 → 0.3 | 23.0 → 32.3 | 71.8 → 91.4 |
| antmaze-large-play-v0 | 0.0 → 0.0 | 1.0 → 0.0 | 36.8 → 51.8 |
| antmaze-large-diverse-v0 | 1.0 → 0.0 | 1.0 → 0.0 | 42.2 → 59.8 |
| antmaze-v0 total | 107.7 → 108.3 | 151.5 → 231.1 | 374.8 → 437.5 |
| pen-binary-v0 | 44.6 → 70.3 | 31.2 → 9.9 | 37.4 → 60.7 |
| door-binary-v0 | 1.3 → 30.1 | 0.2 → 0.0 | 0.7 → 32.3 |
| relocate-binary-v0 | 0.8 → 2.7 | 0.1 → 0.0 | 0.0 → 31.0 |
| hand-v0 total | 46.7 → 103.1 | 31.5 → 9.9 | 38.1 → 124.0 |
| total | 154.4 → 211.4 | 182.8 → 241.0 | 412.9 → 561.5 |

The policies obtained by offline RL can often be improved with a small amount of online interaction. IQL is well-suited for online fine-tuning for two reasons.
First, IQL has strong offline performance, as shown in the previous section, which provides a good initialization. Second, IQL implements a weighted behavioral cloning policy extraction step, which has previously been shown to allow for better online policy improvement compared to other types of offline constraints (Nair et al., 2020).

To evaluate the fine-tuning capability of various RL algorithms, we first run offline RL on each dataset, then run 1M steps of online RL, and then report the final performance. We compare to AWAC (Nair et al., 2020), which has been proposed specifically for online fine-tuning, and CQL (Kumar et al., 2020), which showed the best performance among prior methods in our experiments in the previous section. Exact experimental details are provided in Appendix C. We use the challenging AntMaze D4RL domains (Fu et al., 2020), as well as the high-dimensional dexterous manipulation environments from Rajeswaran et al. (2018), which Nair et al. (2020) propose to use to study online adaptation with AWAC. Results are shown in Table 2.

On the AntMaze domains, IQL significantly outperforms both prior methods after online fine-tuning. CQL attains the second-best score, while AWAC performs comparatively worse due to a much weaker offline initialization. On the dexterous hand tasks, IQL performs significantly better than AWAC on relocate-binary-v0, comparably on door-binary-v0, and slightly worse on pen-binary-v0, with the best overall score.

6 CONCLUSION

We presented implicit Q-learning (IQL), a general algorithm for offline RL that completely avoids any queries to values of out-of-sample actions during training while still enabling multi-step dynamic programming. To our knowledge, this is the first method that combines both of these features. This has a number of important benefits. First, our algorithm is computationally efficient: we can perform 1M updates on one GTX 1080 GPU in less than 20 minutes. Second, it is simple to implement, requiring only minor modifications over a standard SARSA-like TD algorithm, and performing policy extraction with a simple weighted behavioral cloning procedure resembling supervised learning. Finally, despite the simplicity and efficiency of this method, we show that it attains excellent performance across all of the tasks in the D4RL benchmark, matching the best prior methods on the MuJoCo locomotion tasks and exceeding the state-of-the-art performance on the challenging AntMaze environments, where multi-step dynamic programming is essential for good performance.

ACKNOWLEDGEMENTS

We thank Dibya Ghosh and the anonymous reviewers for helpful comments on earlier drafts of the paper. This research was supported by the Office of Naval Research, C3.ai, and Intel, with compute support from Google.

REFERENCES

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

David Brandfonbrener, William F Whitney, Rajesh Ranganath, and Joan Bruna. Offline RL without off-policy evaluation. arXiv preprint arXiv:2106.08909, 2021.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345, 2021.
Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pp. 1096-1105. PMLR, 2018a.

Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.

Rasool Fakoor, Jonas Mueller, Kavosh Asadi, Pratik Chaudhari, and Alexander J Smola. Continuous doubly constrained batch reinforcement learning. arXiv preprint arXiv:2102.09225, 2021.

Pete Florence, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit behavioral cloning. arXiv preprint arXiv:2109.00137, 2021.

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. arXiv preprint arXiv:2106.06860, 2021.

Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587-1596. PMLR, 2018.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052-2062. PMLR, 2019.

Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. EMaQ: Expected-max Q-learning operator for simple yet effective offline and online RL. In International Conference on Machine Learning, pp. 3682-3691. PMLR, 2021.

Caglar Gulcehre, Sergio Gómez Colmenarejo, Ziyu Wang, Jakub Sygnowski, Thomas Paine, Konrad Zolna, Yutian Chen, Matthew Hoffman, Razvan Pascanu, and Nando de Freitas. Regularized behavior value estimation. arXiv preprint arXiv:2103.09575, 2021.

Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2020. URL http://github.com/google/flax.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Roger Koenker and Kevin F Hallock. Quantile regression. Journal of Economic Perspectives, 15(4):143-156, 2001.

Ilya Kostrikov. JAXRL: Implementations of Reinforcement Learning algorithms in JAX, 10 2021. URL https://github.com/ikostrikov/jaxrl.

Ilya Kostrikov, Rob Fergus, Jonathan Tompson, and Ofir Nachum. Offline reinforcement learning with Fisher divergence critic regularization. In International Conference on Machine Learning, pp. 5774-5783. PMLR, 2021.

Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779, 2020.

Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In International Conference on Machine Learning, pp. 5556-5566. PMLR, 2020.

Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pp. 45-73. Springer, 2012.
Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets, 2020.

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pp. 745-750, 2007.

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Robotics: Science and Systems, 2018. URL https://arxiv.org/pdf/1709.10087.pdf.

Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396, 2020.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, and Tong Zhang. Exponentially weighted imitation learning for batched historical data. In NeurIPS, pp. 6291-6300, 2018.

Ziyu Wang, Alexander Novikov, Konrad Zolna, Jost Tobias Springenberg, Scott Reed, Bobak Shahriari, Noah Siegel, Josh Merel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. arXiv preprint arXiv:2006.15134, 2020.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.