Published as a conference paper at ICLR 2021

REPRESENTATION BALANCING OFFLINE MODEL-BASED REINFORCEMENT LEARNING

Byung-Jun Lee1,3, Jongmin Lee1 & Kee-Eung Kim1,2
1School of Computing, KAIST, Daejeon, Republic of Korea
2Graduate School of AI, KAIST, Daejeon, Republic of Korea
3Gauss Labs Inc., Seoul, Republic of Korea
{bjlee,jmlee}@ai.kaist.ac.kr, kekim@kaist.ac.kr

ABSTRACT

One of the main challenges in offline and off-policy reinforcement learning is to cope with the distribution shift that arises from the mismatch between the target policy and the data collection policy. In this paper, we focus on a model-based approach, particularly on learning the representation for a robust model of the environment under the distribution shift, which was first studied by Representation Balancing MDP (RepBM). Although this prior work has shown promising results, a number of shortcomings still hinder its applicability to practical tasks. In particular, we address the curse of horizon exhibited by RepBM, which causes it to reject most of the pre-collected data in long-term tasks. We present a new objective for model learning motivated by recent advances in the estimation of stationary distribution corrections. This effectively overcomes the aforementioned limitation of RepBM, while naturally extending to continuous action spaces and stochastic policies. We also present an offline model-based policy optimization algorithm using this new objective, yielding state-of-the-art performance in a representative set of benchmark offline RL tasks.

1 INTRODUCTION

Reinforcement learning (RL) has accomplished remarkable results in a wide range of domains, but its successes have mostly relied on a large number of online interactions with the environment. However, in many real-world tasks, exploratory online interactions are either very expensive or dangerous (e.g. robotics, autonomous driving, and healthcare), and applying standard online RL would be impractical. Consequently, the ability to optimize RL agents reliably without online interactions has been considered a key to practical deployment, which is the main goal of batch RL, also known as offline RL (Fujimoto et al., 2019; Levine et al., 2020).

In an offline RL algorithm, accurate policy evaluation and reliable policy improvement are both crucial for the successful training of the agent. Evaluating policies in offline RL is essentially an off-policy evaluation (OPE) task, which aims to evaluate the target policy given a dataset collected from the behavior policy. The difference between the target and the behavior policies causes a distribution shift in the estimation, which needs to be adequately addressed for accurate policy evaluation. OPE itself is one of the long-standing hard problems in RL (Sutton et al., 1998; 2009; Thomas & Brunskill, 2016; Hallak & Mannor, 2017). However, recent offline RL studies mainly focus on how to improve the policy conservatively while using a common policy evaluation technique without much consideration for the distribution shift, e.g. mean squared temporal difference error minimization or maximum-likelihood training of an environment model (Fujimoto et al., 2019; Kumar et al., 2019; Yu et al., 2020).
While conservative policy improvement helps policy evaluation by reducing the off-policyness, we hypothesize that addressing the distribution shift explicitly during policy evaluation can further improve the overall performance, since it provides a better foundation for policy improvement. To this end, we aim to explicitly address the distribution shift of the OPE estimator used in the offline RL algorithm. In particular, we focus on the model-based approach, where we train an environment model robust to the distribution shift. One of the notable prior works is Representation Balancing MDP (RepBM) (Liu et al., 2018b), which regularizes the representation learning of the model to be invariant between the distributions. However, despite the promising results of RepBM, its step-wise estimation of the distance between the distributions has a few drawbacks that limit the algorithm from being practical: not only does it assume a discrete-action task with a deterministic target policy, but it also performs poorly in long-term tasks due to the curse of horizon of step-wise importance sampling (IS) estimators (Liu et al., 2018a).

To address these limitations, we present the Representation Balancing with Stationary Distribution Estimation (RepB-SDE) framework, where we aim to learn a balanced representation by regularizing, in the representation space, the distance between the data distribution and the discounted stationary distribution induced by the target policy. Motivated by recent advances in estimating stationary distribution corrections, we present a new representation balancing objective to train a model of the environment that no longer suffers from the curse of horizon. We empirically show that the model trained by the RepB-SDE objective is robust to the distribution shift in the OPE task, particularly when the difference between the target and the behavior policies is large. We also introduce a model-based offline RL algorithm based on the RepB-SDE framework and report its performance on the D4RL benchmark (Fu et al., 2020), showing state-of-the-art performance in a representative set of tasks.

2 RELATED WORK

Learning balanced representation  Learning a representation invariant to specific aspects of the data is an established method for overcoming the distribution shift that arises in unsupervised domain adaptation (Ben-David et al., 2007; Zemel et al., 2013) and in causal inference from observational data (Shalit et al., 2017; Johansson et al., 2018). These works show that imposing a bound on the generalization error under the distribution shift leads to an objective that learns a balanced representation under which the training and the test distributions look similar. RepBM (Liu et al., 2018b) can be seen as a direct extension to the sequential case, which encourages the representation to be invariant under the target and behavior policies at each timestep.

Stationary distribution correction estimation (DICE)  Step-wise importance sampling (IS) estimators (Precup, 2000) compute importance weights by taking the product of per-step distribution ratios. Consequently, these methods suffer from variance that grows exponentially in the length of trajectories, a phenomenon called the curse of horizon (Liu et al., 2018a).
Recently, techniques for computing a stationary DIstribution Correction Estimation (DICE) have made remarkable progress in effectively addressing the curse of horizon (Liu et al., 2018a; Nachum et al., 2019a; Tang et al., 2020; Zhang et al., 2020; Mousavi et al., 2020). DICE has also been used to explicitly address the distribution shift in online model-free RL, by directly applying IS to the policy and action-value objectives (Liu et al., 2019; Gelada & Bellemare, 2019). We adopt one of these estimation techniques, DualDICE (Nachum et al., 2019a), to measure the distance between the stationary distribution and the data distribution in the representation space.

Offline reinforcement learning  There are extensive studies on improving standard online model-free RL algorithms (Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018) for stable learning in the offline setting. The main idea behind them is to conservatively improve the policy by (1) quantifying the uncertainty of the value function estimate, e.g. using bootstrapped ensembles (Kumar et al., 2019; Agarwal et al., 2020), and/or (2) constraining the optimized target policy to be close to the behavior policy, i.e. behavior regularization approaches (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Lee et al., 2020). A notable exception is AlgaeDICE (Nachum et al., 2019b), which implicitly uses DICE to regularize the discounted stationary distribution induced by the target policy to stay inside the data support, similar to this work. On the other hand, Yu et al. (2020) argued that the model-based approach can be advantageous due to its ability to generalize predictions to states outside of the data support. They introduce MOPO (Yu et al., 2020), which uses truncated rollouts and penalized rewards for conservative policy improvement. MOReL (Kidambi et al., 2020) trains a state-action novelty detector and uses it to penalize rewards in the data-sparse region. Matsushima et al. (2020), MOOSE (Swazinna et al., 2020) and MBOP (Argenson & Dulac-Arnold, 2020) guide their policy optimization using the behavior policy, similar to the behavior regularization approaches.

Note that these offline RL methods build on the standard approximate dynamic programming algorithm for action-value estimation (model-free) or on a maximum-likelihood environment model (model-based), without explicitly addressing the distribution shift in the estimator. In contrast, we augment the objective for model learning to obtain a model robust to the distribution shift, which is, to the best of our knowledge, the first such attempt in offline RL.

3 PRELIMINARIES

A Markov Decision Process (MDP) is specified by a tuple $M = \langle S, A, T, R, d_0, \gamma \rangle$, consisting of the state space $S$, action space $A$, transition function $T : S \times A \rightarrow \Delta(S)$, reward function $R : S \times A \rightarrow \Delta([0, r_{\max}])$, initial state distribution $d_0$, and discount rate $\gamma$. In this paper, we mainly focus on a continuous state space $S \subseteq \mathbb{R}^{d_s}$ and conduct experiments on both discrete action spaces $A = \{a_0, \ldots, a_{n_a}\}$ and continuous action spaces $A \subseteq \mathbb{R}^{d_a}$. Given the MDP $M$ and a policy $\pi$, which is a (stochastic) mapping from states to actions, a trajectory is generated in the form $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$, where $s_0 \sim d_0$ and, for each timestep $t \geq 0$, $a_t \sim \pi(s_t)$, $r_t \sim R(s_t, a_t)$, and $s_{t+1} \sim T(s_t, a_t)$. The goal of RL is to optimize or evaluate a policy based on the normalized expected discounted return: $R^\pi \triangleq (1-\gamma)\,\mathbb{E}_{M,\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$.
A useful and important concept throughout the paper is the discounted stationary distribution, which represents the long-term occupancy of state-action pairs:
$$d^\pi(s, a) \triangleq (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s, a_t = a \mid M, \pi).$$
From the definition, it can be observed that $R^\pi$ can be obtained by $R^\pi = \mathbb{E}_{(s,a) \sim d^\pi}[r(s, a)]$.

Offline RL and off-policy evaluation  In this paper, we focus on the offline RL problem, where the agent can only access a static dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ for the maximization of $R^\pi$. We consider a behavior-agnostic setting where we do not have any knowledge of the data collection process. We denote the empirical distribution of the dataset by $d^D$.

Before improving the policy, we first aim to better evaluate $R^\pi$ given a target policy $\pi$ and a static dataset $D$, which corresponds to an off-policy evaluation (OPE) problem. We mainly focus on a model-based approach, where the algorithm first estimates the unknown dynamics $\hat{T}, \hat{R}$ using the dataset $D$. This defines an approximate MDP $\hat{M} = \langle S, A, \hat{T}, \hat{R}, d_0, \gamma \rangle$, with the approximate expected discounted return $\hat{R}^\pi \triangleq (1-\gamma)\,\mathbb{E}_{\hat{M},\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$ obtained from $\hat{M}$. In this paper, we are interested in an MDP estimate $\hat{M}$ that effectively reduces the error in the evaluation of the policy $\pi$, $|R^\pi - \hat{R}^\pi|$.

In order to do so, we need to learn a good representation of a model that results in a small OPE error. We assume a bijective representation function $\phi : S \times A \rightarrow Z$, where $Z \subseteq \mathbb{R}^{d_z}$ is the representation space. We define the transition and the reward models in terms of the representation function $\phi$, i.e. $\hat{T} = \hat{T}_z \circ \phi$ and $\hat{R} = \hat{R}_z \circ \phi$. In practice, where we use a neural network for $\hat{T}$ and $\hat{R}$, $z$ can be chosen to be the output of an intermediate hidden layer, so that $\phi$ is represented by the lower layers and $\hat{T}_z, \hat{R}_z$ by the remaining upper layers. We denote by $d^\pi_\phi(z)$ the discounted stationary distribution on $Z$ induced by $d^\pi(s, a)$ under the representation function $z = \phi(s, a)$, and similarly for $d^D_\phi(z)$.

4 REPRESENTATION BALANCING OFFLINE MODEL-BASED RL

4.1 GENERALIZATION ERROR BOUND FOR MODEL-BASED OFF-POLICY EVALUATION

We aim to construct a model $\hat{M}$ from the dataset $D$ that can accurately evaluate the policy $\pi$, by minimizing a good upper bound of the policy evaluation error $|R^\pi - \hat{R}^\pi|$. We define the following point-wise model loss for notational convenience:
$$\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) = c_R \, D_{TV}\big(R(\cdot|s, a) \,\|\, \hat{R}_z(\cdot|\phi(s, a))\big) + c_T \, D_{TV}\big(T(\cdot|s, a) \,\|\, \hat{T}_z(\cdot|\phi(s, a))\big),$$
where $c_R = 2(1-\gamma)$ and $c_T = 2\gamma r_{\max}$. Then, we start by restating the simulation lemma (Kearns & Singh, 2002) to bound the policy evaluation error in terms of the point-wise model loss. The proof is available in Appendix A.

Lemma 4.1. Given an MDP $M$ and its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, the policy evaluation error of a policy $\pi$ can be bounded by:
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^\pi}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big]. \tag{1}$$

Lemma 4.1 has a natural interpretation: if the model error is small in the states frequently visited by following the policy $\pi$, the resulting policy evaluation error will also be small. However, minimizing the RHS of Eq. (1) in the off-policy evaluation (OPE) task is generally intractable since the distribution $d^\pi$ is not directly accessible. Therefore, the common practice has been to construct a maximum-likelihood MDP using $D$ while ignoring the distribution shift, but its OPE performance is not guaranteed. Instead, we will derive a tractable upper bound on the policy evaluation error that eliminates the direct dependence on $d^\pi$ in Eq. (1).
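Before moving on, it may help to see concretely what the expectation under $d^\pi$ in Eq. (1) would require. The sketch below (an illustration, not part of the paper's implementation; all names are hypothetical) estimates an expectation under $d^\pi$ by rolling out $\pi$ in the true environment and weighting the visit at timestep $t$ by $(1-\gamma)\gamma^t$; with $f(s,a) = r(s,a)$ this recovers $R^\pi$. In the offline setting only trajectories from the behavior policy are available, which is exactly why the RHS of Eq. (1) cannot be evaluated directly.

```python
import numpy as np

def dpi_expectation(trajectories, f, gamma):
    """Monte Carlo estimate of E_{(s,a)~d^pi}[f(s,a)] from on-policy rollouts of pi.

    trajectories: list of rollouts, each a list of (state, action, reward) tuples.
    The visit at timestep t is weighted by (1 - gamma) * gamma**t, matching the
    definition of the discounted stationary distribution d^pi. The estimate is
    self-normalized so that truncated rollouts still yield a proper average.
    """
    total, weight_sum = 0.0, 0.0
    for traj in trajectories:
        for t, (s, a, _r) in enumerate(traj):
            w = (1.0 - gamma) * gamma ** t
            total += w * f(s, a)
            weight_sum += w
    return total / weight_sum

# Example: with f(s, a) = r(s, a), the same weighting recovers R^pi
# (the normalized expected discounted return) from on-policy rollouts of pi.
```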
To eliminate this dependence on $d^\pi$, we adopt a distance between the two distributions over representations, $d^\pi_\phi$ and $d^D_\phi$, that can bound their difference in expectations: the Integral Probability Metric (IPM) (Müller, 1997):
$$\mathrm{IPM}_{\mathcal{G}}(p, q) = \sup_{g \in \mathcal{G}} \Big| \mathbb{E}_{z \sim p}[g(z)] - \mathbb{E}_{z \sim q}[g(z)] \Big|, \tag{2}$$
where particular choices of $\mathcal{G}$ make the IPM equivalent to well-known distances between distributions, e.g. the total variation distance or the Wasserstein distance (Sriperumbudur et al., 2009).

Theorem 4.2. Given an MDP $M$ and its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, assume that there exist a constant $B_\phi > 0$ and a function class $\mathcal{G} \subseteq \{g : Z \rightarrow \mathbb{R}\}$ such that $\frac{1}{B_\phi} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}\big(\phi^{-1}(\cdot)\big) \in \mathcal{G}$. Then, for any policy $\pi$,
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big] + B_\phi \, \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi). \tag{3}$$

This theorem is an adaptation of Lemma 1 of Shalit et al. (2017) to infinite-horizon model-based policy evaluation and follows from the definition of $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)$, which serves as an upper bound on the difference in the expectations of any function in $\mathcal{G}$. The first term in Eq. (3) corresponds to the fit to the data following $d^D$, while the second term serves as a regularizer. To see this, note that minimizing only the second term would yield a near-constant representation function, which would hurt the first term since such a representation cannot distinguish states and actions well enough. The bound thus exhibits a natural trade-off between optimizing the model to fit the data better and learning a representation that is invariant with respect to $d^\pi_\phi$ and $d^D_\phi$.

Nevertheless, the RHS of Eq. (3) still cannot be evaluated naively due to its dependence on $d^\pi$ in the IPM. We address this challenge via a change of variables, known as the DualDICE trick (Nachum et al., 2019a). Define $\nu : Z \rightarrow \mathbb{R}$ as an arbitrary function of (the representation of) state-action pairs that satisfies:
$$\nu\big(\phi(s, a)\big) = g\big(\phi(s, a)\big) + \gamma \, \mathbb{E}_{s' \sim T(s,a),\, a' \sim \pi(s')}\big[\nu\big(\phi(s', a')\big)\big], \quad \forall (s, a) \in S \times A.$$
Then we can rewrite the IPM as:
$$\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi) = \sup_{g \in \mathcal{G}} \Big| \mathbb{E}_{(s,a) \sim d^\pi}\big[g(\phi(s, a))\big] - \mathbb{E}_{(s,a) \sim d^D}\big[g(\phi(s, a))\big] \Big|$$
$$= \sup_{\nu \in \mathcal{F}} \Big| (1-\gamma)\, \mathbb{E}_{s \sim d_0,\, a \sim \pi(s)}\big[\nu(\phi(s, a))\big] - \mathbb{E}_{(s,a,s') \sim d^D,\, a' \sim \pi(s')}\big[\nu(\phi(s, a)) - \gamma \nu(\phi(s', a'))\big] \Big|, \tag{4}$$
$$\mathcal{F} = \Big\{ \nu : \nu(z) = \mathbb{E}_{T,\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t g(\phi(s_t, a_t)) \,\Big|\, (s_0, a_0) = \phi^{-1}(z)\Big],\; g \in \mathcal{G} \Big\}.$$
In other words, we are now taking a supremum over the new function class $\mathcal{F}$, whose members return the expected discounted sum of $g(\phi(s_t, a_t))$ when following the policy $\pi$ in the MDP $M$ from a given initial representation $z$. While it is now difficult to derive $\mathcal{F}$ from $\mathcal{G}$ explicitly, Eq. (3) remains valid as long as a sufficiently rich function class is used for $\mathcal{F}$. In this work, we choose $\mathcal{F}$ to be the family of functions in the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ with kernel $k$, which allows the following closed-form expression (see Lemma A.3 in the Appendix for details):
$$\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)^2 = \mathbb{E}\Big[ k\big(\phi(s,a), \phi(\bar{s},\bar{a})\big) + (1-\gamma)^2 k\big(\phi(s_0,a_0), \phi(\bar{s}_0,\bar{a}_0)\big) + \gamma^2 k\big(\phi(s',a'), \phi(\bar{s}',\bar{a}')\big)$$
$$\qquad - 2(1-\gamma)\, k\big(\phi(s_0,a_0), \phi(\bar{s},\bar{a})\big) - 2\gamma\, k\big(\phi(s',a'), \phi(\bar{s},\bar{a})\big) + 2\gamma(1-\gamma)\, k\big(\phi(s',a'), \phi(\bar{s}_0,\bar{a}_0)\big) \Big], \tag{5}$$
where the expectation is over two independent draws $s_0 \sim d_0$, $a_0 \sim \pi(s_0)$, $(s,a,s') \sim d^D$, $a' \sim \pi(s')$ and $\bar{s}_0 \sim d_0$, $\bar{a}_0 \sim \pi(\bar{s}_0)$, $(\bar{s},\bar{a},\bar{s}') \sim d^D$, $\bar{a}' \sim \pi(\bar{s}')$.

This completes our derivation of a tractable upper bound on the policy evaluation error (Eq. (3)), whose direct dependence on $d^\pi$ is eliminated by Eq. (5). Finally, we can train a model by minimizing this upper bound, which encourages learning a balanced representation while improving the data fit, and the trained model can readily be used for model-based OPE. The $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)^2$ in Eq. (5) can be estimated from finite random samples, and we denote its sample-based estimator by $\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)^2$.
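For concreteness, a minimal NumPy sketch of this sample-based estimator of Eq. (5) is given below. It assumes the representations have already been computed ($z_0 = \phi(s_0, a_0)$ for initial state-action pairs, $z = \phi(s, a)$ and $z' = \phi(s', a')$ for dataset transitions with $a' \sim \pi(s')$) and uses an RBF kernel as an example choice; the function names and the bandwidth are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gram matrix with k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def ipm_squared(z0, z, z_next, gamma, bandwidth=1.0):
    """Sample-based estimate of IPM(d^pi_phi, d^D_phi)^2 following Eq. (5).

    z0:     representations phi(s0, a0) with s0 ~ d0, a0 ~ pi(s0)   (n x d array)
    z:      representations phi(s, a) of dataset transitions        (n x d array)
    z_next: representations phi(s', a') with a' ~ pi(s')            (n x d array)
    """
    k = lambda A, B: rbf_kernel(A, B, bandwidth).mean()  # 1/n^2 sum over all pairs
    return (k(z, z)
            + (1.0 - gamma) ** 2 * k(z0, z0)
            + gamma ** 2 * k(z_next, z_next)
            - 2.0 * (1.0 - gamma) * k(z0, z)
            - 2.0 * gamma * k(z_next, z)
            + 2.0 * gamma * (1.0 - gamma) * k(z_next, z0))
```

Each `mean()` over a Gram matrix implements the $\frac{1}{n^2}\sum_{i,j}$ average over pairs of samples from the two independent draws in Eq. (5).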
We show in the following that a valid upper bound can be established using the sample-based estimators in place of the exact terms on the RHS of Eq. (3), under certain conditions.

Theorem 4.3. Given an MDP $M$, its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, and an RKHS $\mathcal{H}_k \subseteq (Z \rightarrow \mathbb{R})$ induced by a universal kernel $k$ with $\sup_{z \in Z} k(z, z) = \bar{k}$, assume that $f_{\phi, \hat{R}_z, \hat{T}_z}(z) = \mathbb{E}_{T,\pi}\big[\sum_{t=0}^{\infty} \gamma^t \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s_t, a_t) \,\big|\, (s_0, a_0) = \phi^{-1}(z)\big] \in \mathcal{H}_k$ with $B_\phi = \|f_{\phi, \hat{R}_z, \hat{T}_z}\|_{\mathcal{H}_k}$, and that the loss is bounded by $\bar{\mathcal{E}} = \sup_{s \in S, a \in A} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)$. Let $n$ be the number of data points in $D$. With probability at least $1 - 2\delta$,
$$\big|R^\pi - \hat{R}^\pi\big| \leq \frac{1}{n} \sum_{(s,a) \in D} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) + B_\phi \, \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) + \epsilon(n, \delta),$$
where $\epsilon(n, \delta)$ collects the concentration terms from Lemma A.3 and Hoeffding's inequality, each of order $O(\sqrt{\log(1/\delta)/n})$.

This result is proved by adapting the convergence results for the empirical estimate of the MMD (Gretton et al., 2012) and Hoeffding's inequality (Hoeffding, 1963). With the choice of an RKHS $\mathcal{H}_k$, we can now interpret $B_\phi$ as the RKHS norm $\|f_{\phi, \hat{R}_z, \hat{T}_z}\|_{\mathcal{H}_k}$, which captures the magnitude and the smoothness of the expected cumulative model loss $f_{\phi, \hat{R}_z, \hat{T}_z}$. In general, assuming smooth underlying dynamics, we can expect $B_\phi$ to be small when the model error is small. Furthermore, although $\bar{k}$ depends on the kernel function we use, we can always let $\bar{k} = 1$ and subsume it into $B_\phi$ as long as it is bounded, i.e. using $B_\phi \sqrt{\bar{k}}$ in place of $B_\phi$. In the next section, we develop algorithms based on practical approximations of Eq. (3).

Detailed comparison to RepBM  As previously stated, RepBM (Liu et al., 2018b) is a model-based finite-horizon OPE algorithm that trains the model to have a balanced representation $\phi$, which is encouraged to be invariant under the target and behavior policies. Specifically, given the deterministic target policy $\pi$ and the behavior policy $\mu$, at each timestep $t$ it defines the factual distribution on $Z$, conditioned on the actions up to timestep $t$ having been executed according to the policy $\pi$:
$$p^F_{\phi,t}(z) = \Pr\big(z_t \,\big|\, M, \mu, a_{0:t} = \pi(s_{0:t})\big),$$
and the counterfactual distribution on $Z$ under the same condition except for the action at timestep $t$:
$$p^{CF}_{\phi,t}(z) = \Pr\big(z_t \,\big|\, M, \mu, a_{0:t-1} = \pi(s_{0:t-1}), a_t \neq \pi(s_t)\big).$$
Then, RepBM bounds the OPE error as,¹
$$\big|R^\pi - \hat{R}^\pi\big| \leq (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Big( \mathbb{E}_{p(s_t | M, \mu, a_{0:t} = \pi(s_{0:t}))}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s_t, \pi(s_t))\big] + B_{\phi,t} \, \mathrm{IPM}_{\mathcal{G}_t}(p^F_{\phi,t}, p^{CF}_{\phi,t}) \Big).$$
Although RepBM achieves performance improvements over other OPE algorithms, we found a number of practical challenges: by the definition of $\mathrm{IPM}_{\mathcal{G}_t}(p^F_{\phi,t}, p^{CF}_{\phi,t})$, it requires a discrete-action environment and a deterministic policy $\pi$, conditions that many practical RL settings cannot meet. In addition, since the sample-based estimation of $\mathrm{IPM}_{\mathcal{G}_t}(p^F_{\phi,t}, p^{CF}_{\phi,t})$ requires samples consistent with the policy $\pi$, i.e. $a_{0:t-1} = \pi(s_{0:t-1})$, the algorithm rejects exponentially many samples with respect to $t$, which is the curse of horizon (Liu et al., 2018a). When there is a large difference between the behavior and the target policies in long-term tasks, their implementation becomes close to using the maximum-likelihood objective, which we also observe empirically in our experiments. In contrast, our work is free from the above practical limitations: it performs balancing between the discounted stationary distribution $d^\pi$ and the data distribution $d^D$, leveraging recent advances in stationary distribution correction estimation (i.e. the DualDICE trick) to overcome the difficulty of evaluating the expectation with respect to $d^\pi$ required by the IPM in the objective.

¹We adapted their formulation to the infinite-horizon discounted MDP setting.
4.2 REPRESENTATION BALANCING WITH STATIONARY DISTRIBUTION ESTIMATION

In the following, we describe algorithms for OPE and offline RL based on practical approximations of Eq. (3), which we call the RepB-SDE framework.

Objective for off-policy evaluation  As mentioned earlier, we aim to minimize the upper bound of the OPE error $|R^\pi - \hat{R}^\pi|$ specified in Theorem 4.2. To make the RHS of Eq. (3) tractable for optimization, we replace the intractable total variation distance with the KL-divergence, which can be minimized by maximizing the data log-likelihood. We also replace the IPM with its sample-based estimator to obtain the learning objective:
$$\min_{\phi, \hat{R}_z, \hat{T}_z} \; \mathbb{E}_{(s,a,r,s') \sim D}\Big[ \underbrace{-\log \hat{R}_z\big(r \,|\, \phi(s,a)\big)}_{\approx\, D_{KL}(R(\cdot|s,a) \,\|\, \hat{R}_z(\cdot|\phi(s,a)))} \; \underbrace{-\log \hat{T}_z\big(s' \,|\, \phi(s,a)\big)}_{\approx\, D_{KL}(T(\cdot|s,a) \,\|\, \hat{T}_z(\cdot|\phi(s,a)))} \Big] + \alpha_M \, \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi). \tag{6}$$
The constant $B_\phi$ in Theorem 4.2 depends on the function classes and cannot be estimated, so we replace it with a tunable hyperparameter $\alpha_M$ that balances data fit against representation invariance. Note that $\alpha_M = 0$ recovers the simple maximum-likelihood objective. By simulating the target policy $\pi$ under the environment model $(\hat{T}, \hat{R})$ obtained by minimizing Eq. (6), we can perform a model-based OPE that approximately minimizes the upper bound of the OPE error.

Objectives for offline model-based RL  By rearranging Eq. (3), we have
$$R^\pi \geq \hat{R}^\pi - \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big] - B_\phi \, \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi). \tag{7}$$
We can thus maximize the RHS of Eq. (7) to obtain the model and the policy that maximize a lower bound of the true return $R^\pi$. Similar to the derivation of Eq. (6), we replace the total variation distance with the KL-divergence to obtain the following learning objectives:
$$\mathcal{L}_{\hat{M}}(\hat{M}, \pi, \alpha_M) = \mathbb{E}_{d^D}\big[ -\log \hat{R}_z\big(r \,|\, \phi(s,a)\big) - \log \hat{T}_z\big(s' \,|\, \phi(s,a)\big) \big] + \alpha_M \, \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi), \tag{8}$$
$$J_\pi(\pi, \hat{M}, \alpha_\pi) = \hat{R}^\pi - \alpha_\pi \, \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi), \tag{9}$$
where the return term in Eq. (9) can be optimized using various model-based RL algorithms, e.g. with planning (Chua et al., 2018) or with a model-free learner (Janner et al., 2019). By alternating between minimizing $\mathcal{L}_{\hat{M}}$ with respect to $\hat{M}$ and maximizing $J_\pi$ with respect to $\pi$ using stochastic gradient methods, we can perform offline model-based RL that approximately maximizes the lower bound of the true return $R^\pi$.

Implementation details  Following recent practice (Chua et al., 2018; Janner et al., 2019; Yu et al., 2020), we model the dynamics $(\hat{T}, \hat{R})$ using a bootstrapped ensemble of neural networks. To optimize a policy based on the objective, we perform full rollouts (until reaching a terminal state or the maximum number of timesteps) using the learned dynamics $(\hat{T}, \hat{R})$. While generating the rollouts, we pessimistically augment the estimated reward function with a penalty proportional to the bootstrapped uncertainty, which helped the algorithm perform robustly; we suspect that the difficulty of computing an accurate IPM estimate is what makes this additional pessimism beneficial. We store the generated experiences in a separate dataset $\hat{D}$ and update the policy $\pi$ with IPM-regularized soft actor-critic (SAC) (Haarnoja et al., 2018) using samples from both datasets $D \cup \hat{D}$, similar to MBPO (Janner et al., 2019). Since the presented model objective requires a policy $\pi$ to perform balancing, we initially train the model and the policy with $\alpha_M = \alpha_\pi = 0$: $\hat{M}_0 = \arg\min_{\hat{M}} \mathcal{L}_{\hat{M}}(\hat{M}, \cdot, 0)$ and $\pi_0 = \arg\max_{\pi} J_\pi(\pi, \hat{M}_0, 0)$. We then retrain the model and the policy using $\pi_0$: $\hat{M}_1 = \arg\min_{\hat{M}} \mathcal{L}_{\hat{M}}(\hat{M}, \pi_0, \alpha_M)$ and $\pi_1 = \arg\max_{\pi} J_\pi(\pi, \hat{M}_1, \alpha_\pi)$ for some non-negative $\alpha_M$ and $\alpha_\pi$. While it is desirable to repeat the optimization of the model and the policy until convergence, we did not observe significant improvement after the first iteration and report the performance of the policy after the first iteration, $\pi_1$.
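As an illustration of how the data-fit part of Eq. (6)/(8) could be implemented, the PyTorch sketch below combines the Gaussian negative log-likelihood of the next-state difference and reward with the IPM regularizer. The network sizes, the `ipm_estimate` argument (e.g. a differentiable square root of the kernel estimator sketched earlier), and all names are assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class RepresentationDynamics(nn.Module):
    """Representation phi (lower layers) and a Gaussian head for [delta_state, reward]."""
    def __init__(self, obs_dim, act_dim, rep_dim=200, hidden=200):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, rep_dim), nn.Tanh())
        self.head = nn.Linear(rep_dim, 2 * (obs_dim + 1))  # means and log-stds

    def forward(self, obs, act):
        z = self.phi(torch.cat([obs, act], dim=-1))
        mean, log_std = self.head(z).chunk(2, dim=-1)
        return z, mean, log_std.clamp(-10.0, 2.0)

def model_objective(model, batch, ipm_estimate, alpha_m):
    """Negative log-likelihood of the observed (s' - s, r) plus the IPM regularizer."""
    obs, act, rew, next_obs = batch
    _, mean, log_std = model(obs, act)
    target = torch.cat([next_obs - obs, rew.unsqueeze(-1)], dim=-1)
    nll = -torch.distributions.Normal(mean, log_std.exp()).log_prob(target).sum(-1).mean()
    return nll + alpha_m * ipm_estimate
```

With `alpha_m = 0` this reduces to the maximum-likelihood baseline; gradients flow into `phi` both through the likelihood terms and, when the IPM estimate is computed from `phi`'s outputs, through the regularizer.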
Figure 1: The OPE results of different model learning algorithms (Baseline, RepBM, RepB-SDE (ours)) with varying off-policyness on the x-axis, for CartPole-v0, Acrobot-v1, and the HIV simulator. The y-axis plots the normalized individual MSE on test trajectories, where the performance of the baseline model is set to 1. The tasks are infinite-horizon discounted environments (γ = 0.98 for HIV, γ = 0.99 for the others), truncated at t = 1000. The experiments are repeated 200 times, and the error bars indicate 95% confidence intervals.

5 EXPERIMENTS

We demonstrate the effectiveness of the RepB-SDE framework by comparing the OPE performance of Eq. (6) to that of RepBM and by evaluating the presented model-based offline RL algorithm on benchmark tasks. The code used to produce the results is available online.² A detailed description of the experiments can be found in Appendix B.

5.1 MODEL-BASED OFF-POLICY EVALUATION

For the sake of comparison with RepBM, we test our OPE algorithm on three continuous-state, discrete-action tasks where the goal is to evaluate a deterministic target policy. We trained a suboptimal deterministic target policy π and used an ϵ-greedy version of it with various values of ϵ as the data collection policy. In each experiment, we trained environment models for a fixed number of epochs with three different objectives: the simple maximum-likelihood baseline, the step-wise representation balancing objective used in RepBM (Liu et al., 2018b), and the presented OPE objective of RepB-SDE (Eq. (6)). We measured the individual mean squared error (Liu et al., 2018b). The normalized error of each algorithm, relative to the error of the baseline valued at 1, is presented in Figure 1.

We observe that the presented RepB-SDE objective significantly reduces the OPE error relative to the baseline, outperforming RepBM in most of the cases. As the off-policyness between the policies (ϵ) increases, representation balancing algorithms should in principle benefit more compared to the maximum-likelihood baseline. However, the results show that the performance of RepBM barely improves, due to its increased sample rejection rate under large ϵ.

²https://github.com/dlqudwns/repb-sde

Table 1: Normalized scores on D4RL MuJoCo benchmark datasets (Fu et al., 2020), where a score of 0 corresponds to a random policy and 100 corresponds to a converged SAC policy. All results (except MF, which is taken from Fu et al. (2020) and Kumar et al. (2020)) are averaged over 5 runs, where ± denotes the standard error. The highest scores are highlighted in boldface.
| Dataset type | Environment | RepB-SDE (ours) | RP | Base | MOPO | MF | BC |
|---|---|---|---|---|---|---|---|
| Random | Walker2d | **21.1 ± 1.0** | 18.4 | 16.4 | 1.3 | 7.3 (BEAR) | 0.0 |
| Medium | Walker2d | 72.1 ± 1.9 | 56.3 | 5.5 | -0.1 | **81.1** (BRAC) | 7.7 |
| Med-Replay | Walker2d | **49.8 ± 11.4** | 41.1 | 6.2 | 47.8 | 26.7 (CQL) | 8.0 |
| Med-Expert | Walker2d | 88.8 ± 6.9 | 72.6 | 51.7 | 32.4 | **98.7** (CQL) | 3.3 |
| Random | Hopper | 8.6 ± 1.0 | 8.3 | 8.3 | 9.1 | **12.2** (BRAC) | 9.0 |
| Medium | Hopper | 34.0 ± 2.8 | 27.5 | 19.8 | 19.2 | **58.0** (CQL) | 34.5 |
| Med-Replay | Hopper | 62.2 ± 6.7 | 49.8 | 32.9 | **80.8** | 48.6 (CQL) | 13.2 |
| Med-Expert | Hopper | 82.6 ± 7.0 | 74.0 | 19.1 | 23.2 | **111.0** (CQL) | 41.9 |
| Random | HalfCheetah | 32.9 ± 1.1 | 31.3 | 26.1 | 29.9 | **35.4** (CQL) | -2.4 |
| Medium | HalfCheetah | **49.1 ± 0.3** | 47.3 | 22.1 | 45.0 | 46.3 (BRAC) | 31.0 |
| Med-Replay | HalfCheetah | **57.5 ± 0.8** | 53.6 | 54.0 | 53.8 | 47.7 (BRAC) | 14.7 |
| Med-Expert | HalfCheetah | 55.4 ± 8.3 | 36.7 | 31.9 | **91.0** | 64.7 (BCQ) | 29.5 |

5.2 OFFLINE REINFORCEMENT LEARNING WITH REPRESENTATION-BALANCED MODEL

We evaluate the offline model-based RL algorithm presented in Section 4.2 on a subset of the D4RL benchmark (Fu et al., 2020): four types of datasets (Random, Medium, Medium-Replay, and Medium-Expert) from three MuJoCo environments (HalfCheetah-v2, Hopper-v2, and Walker2d-v2) (Todorov et al., 2012). The Random dataset contains $10^6$ experience tuples from a random policy. The Medium dataset contains $10^6$ experience tuples from a policy trained to approximately 1/3 of the performance of the expert, where the expert is an agent trained to completion with SAC. The Med-Replay dataset contains $10^5$ ($2 \times 10^5$ for Walker2d-v2) experience tuples taken from the replay buffer of a policy trained up to the performance of the medium agent. The Med-Expert dataset is a Medium dataset combined with $10^6$ samples from the expert. This experimental setting exactly follows that of Yu et al. (2020) and Argenson & Dulac-Arnold (2020).

The normalized score of each algorithm is presented in Table 1. MF denotes the best score among offline model-free algorithms (taken from Fu et al. (2020) and Kumar et al. (2020)), including SAC (Haarnoja et al., 2018), BCQ (Fujimoto et al., 2019), BEAR (Kumar et al., 2019), BRAC (Wu et al., 2019), AWR (Peng et al., 2019), cREM (Agarwal et al., 2020), AlgaeDICE (Nachum et al., 2019b), and CQL (Kumar et al., 2020); the algorithm achieving the reported score is shown next to the number. Base denotes the most naive baseline, which maximizes the estimated policy return under the maximum-likelihood model. RP denotes Base equipped with the reward penalty based on the bootstrapped uncertainty of the model, which is equivalent to π₀ described in Section 4.2. RepB-SDE denotes the performance after a single iteration of our algorithm, corresponding to π₁. We also provide BC, the performance of direct behavior cloning from the data, and MOPO (Yu et al., 2020), an offline model-based RL algorithm that optimizes the policy on truncated rollouts with a heuristic reward penalty.

The significant gap between RepB-SDE and RP shows the advantage brought by our framework that encourages a balanced representation. Our approach was less successful on some of the datasets (mostly in the Hopper-v2 environment); we hypothesize that the different conservative training techniques: the behavior regularization exploited by the model-free algorithms, the rollout truncation in MOPO, and the pessimistic training based on bootstrapped uncertainty estimates adopted in our algorithm exhibit their strengths on different datasets.
For example, it may be the case that the ensemble models are overconfident especially in Hopper-v2 and should be regularized with more explicit methods. Nevertheless, we emphasize that the presented framework can be used jointly with any other conservative training technique to improve its performance.

6 CONCLUSION AND FUTURE WORK

In this paper, we presented RepB-SDE, a framework for balancing the model representation with stationary distribution estimation, aiming to obtain a model robust to the distribution shift that arises in off-policy and offline RL. We started from the theoretical observation that the model-based policy evaluation error can be upper-bounded by the data fit and the distance between two distributions in the representation space. Motivated by the bound, we presented a model learning objective for off-policy evaluation and model-based offline policy optimization. RepB-SDE can be seen as an extension of RepBM that addresses the curse of horizon by leveraging recent advances in stationary distribution correction estimation (i.e. the DualDICE trick). Using the stationary distribution also frees us from other limitations of RepBM, allowing application to more practical settings. To the best of our knowledge, this is the first attempt to introduce an augmented objective for learning a model robust to a specific distribution shift in offline RL. In the experiments, we empirically demonstrated that we can significantly reduce the OPE error relative to the baseline, outperforming RepBM in most cases. We also showed that the robust model helps in offline model-based policy optimization, yielding state-of-the-art performance on a representative set of D4RL benchmarks. We emphasize that our approach can be directly adopted in many other model-based offline RL algorithms.

There are a number of promising directions for future work. Most importantly, we have not leveraged the learned representation when optimizing the policy, yet it is very natural to do so. We could incorporate the representation into the policy by assuming energy-based models, but this would make the computation of entropy intractable in entropy-regularized policy optimization algorithms. It would also be interesting to see whether the proposed framework for learning balanced representations can benefit off-policy (and offline) model-free methods.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation (NRF) of Korea (NRF-2019M3F2A1072238 and NRF-2019R1A2C1087634), and the Ministry of Science and Information communication Technology (MSIT) of Korea (IITP No. 2019-0-00075, IITP No. 2020-0-00940 and IITP No. 2017-0-01779 XAI).

REFERENCES

Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.

Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. arXiv preprint arXiv:2008.05556, 2020.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, 2007.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019.

Carles Gelada and Marc G Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In AAAI Conference on Artificial Intelligence, 2019.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.

Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In International Conference on Machine Learning, 2017.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 2019.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.

Fredrik D Johansson, Nathan Kallus, Uri Shalit, and David Sontag. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018.

Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020.

Byung-Jun Lee, Jongmin Lee, Peter Vrancx, Dongho Kim, and Kee-Eung Kim. Batch reinforcement learning with hyperparameter gradients. In International Conference on Machine Learning, 2020.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, 2018a.
Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo A Faisal, Finale Doshi-Velez, and Emma Brunskill. Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems, 2018b.

Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.

Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Gu. Deployment-efficient reinforcement learning via model-based offline optimization. arXiv preprint arXiv:2006.03647, 2020.

Colin McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148-188, 1989.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Ali Mousavi, Lihong Li, Qiang Liu, and Denny Zhou. Black-box off-policy estimation for infinite-horizon reinforcement learning. In International Conference on Learning Representations, 2020.

Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, 2019a.

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: A self-gated activation function. arXiv preprint arXiv:1710.05941, 2017.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In International Conference on Machine Learning, 2017.

Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.

Richard S Sutton, Andrew G Barto, et al. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, 2009.

Phillip Swazinna, Steffen Udluft, and Thomas Runkler. Overcoming model bias for robust offline deep reinforcement learning. arXiv preprint arXiv:2008.05533, 2020.

Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. In International Conference on Learning Representations, 2020.
Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. In Advances in Neural Information Processing Systems, 2020.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, 2013.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020.

A PROOFS

Lemma 4.1. Given an MDP $M$ and its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, the policy evaluation error of a policy $\pi$ can be bounded by:
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^\pi}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big]. \tag{1}$$

Proof. We define the value function $V^\pi(s) = \mathbb{E}_{M,\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t \,\big|\, s_0 = s\big]$, the expected discounted return starting from the state $s$. By definition, $R^\pi = (1-\gamma)\,\mathbb{E}_{s \sim d_0}[V^\pi(s)]$, and the recursion $V^\pi(s) = \mathbb{E}_{a \sim \pi(s)}\big[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim T(s,a)}[V^\pi(s')] \big]$ holds; the approximate value function $\hat{V}^\pi(s) = \mathbb{E}_{\hat{M},\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t \,\big|\, s_0 = s\big]$ of the MDP estimate $\hat{M}$ satisfies the analogous recursion. Expanding the difference and adding and subtracting $\hat{V}^\pi(s_1) T(s_1|s_0, a_0)$,
$$V^\pi(s_0) - \hat{V}^\pi(s_0) = \mathbb{E}_{a_0 \sim \pi(s_0)}\Big[ r_0 - \hat{r}_0 + \gamma \int \big(V^\pi(s_1) - \hat{V}^\pi(s_1)\big) T(s_1|s_0,a_0)\, ds_1 + \gamma \int \hat{V}^\pi(s_1) \big(T(s_1|s_0,a_0) - \hat{T}(s_1|s_0,a_0)\big)\, ds_1 \Big],$$
where the first integral has the same form as the left-hand side and can be unrolled recursively. Unrolling over timesteps and collecting the discounted visitation probabilities into $d^\pi$ yields
$$R^\pi - \hat{R}^\pi = \mathbb{E}_{(s,a) \sim d^\pi}\Big[ \int r \big(R(r|s,a) - \hat{R}(r|s,a)\big)\, dr + \gamma \int \hat{V}^\pi(s') \big(T(s'|s,a) - \hat{T}(s'|s,a)\big)\, ds' \Big].$$
Bounding the reward term by the total variation distance between $R(\cdot|s,a)$ and $\hat{R}_z(\cdot|\phi(s,a))$, and the transition term using $0 \leq \hat{V}^\pi(s') \leq r_{\max}/(1-\gamma)$ together with the total variation distance between $T(\cdot|s,a)$ and $\hat{T}_z(\cdot|\phi(s,a))$, gives
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^\pi}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s,a)\big]. \qquad \square$$

Theorem 4.2. Given an MDP $M$ and its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, assume that there exist a constant $B_\phi > 0$ and a function class $\mathcal{G} \subseteq \{g : Z \rightarrow \mathbb{R}\}$ such that $\frac{1}{B_\phi}\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(\phi^{-1}(\cdot)) \in \mathcal{G}$. Then, for any policy $\pi$,
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big] + B_\phi \, \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi). \tag{3}$$

Proof. From Lemma 4.1, we directly have:
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big] + \int \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) \big(d^\pi(s,a) - d^D(s,a)\big)\, ds\, da.$$
Then, with the change of variables $z = \phi(s, a)$ (valid since $\phi$ is bijective),
$$\int \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) \big(d^\pi(s,a) - d^D(s,a)\big)\, ds\, da = \int \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}\big(\phi^{-1}(z)\big) \big(d^\pi_\phi(z) - d^D_\phi(z)\big)\, dz$$
$$= B_\phi \int \tfrac{1}{B_\phi} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}\big(\phi^{-1}(z)\big) \big(d^\pi_\phi(z) - d^D_\phi(z)\big)\, dz \leq B_\phi \sup_{g \in \mathcal{G}} \Big| \int g(z) \big(d^\pi_\phi(z) - d^D_\phi(z)\big)\, dz \Big| = B_\phi \, \mathrm{IPM}_{\mathcal{G}}\big(d^\pi_\phi, d^D_\phi\big),$$
since $B_\phi > 0$ and $\frac{1}{B_\phi}\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(\phi^{-1}(\cdot)) \in \mathcal{G}$. $\square$

We next state some previous results that are required for the remaining proofs.

Theorem A.1 (McDiarmid's inequality (McDiarmid, 1989)).
Let $\{X_i\}_{i=1}^n$ be independent random variables taking values in a set $\mathcal{X}$, and assume that $f : \mathcal{X}^n \rightarrow \mathbb{R}$ satisfies the bounded-differences condition
$$\sup_{\{x_i\}_{i=1}^n \in \mathcal{X}^n,\, x' \in \mathcal{X}} \big| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x', x_{i+1}, \ldots, x_n) \big| \leq c_i. \tag{10}$$
Then for every $\epsilon > 0$,
$$\Pr\Big\{ f(\{X_i\}_{i=1}^n) - \mathbb{E}_{\{X_i\}_{i=1}^n}\big[f(\{X_i\}_{i=1}^n)\big] \geq \epsilon \Big\} \leq \exp\Big( \frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2} \Big).$$

Lemma A.2 (Rademacher complexity of an RKHS ball (Bartlett & Mendelson, 2002)). Let $\mathcal{F}$ be the unit ball in a universal RKHS on the compact domain $\mathcal{X}$, with kernel bounded according to $0 \leq k(x, x') \leq \bar{k}$. Let $\{x_i\}_{i=1}^n$ be an i.i.d. sample of size $n$ drawn according to a probability measure $p$ on $\mathcal{X}$, and let $\sigma_i$ be i.i.d. taking values in $\{-1, 1\}$ with equal probability. Then the Rademacher complexity
$$R_n(\mathcal{F}) = \mathbb{E}_{\{x_i\}_{i=1}^n, \sigma}\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{2}{n} \sum_{i=1}^n \sigma_i f(x_i) \Big| \Big]$$
is upper bounded by $2\sqrt{\bar{k}/n}$, which follows from Lemma 22 of Bartlett & Mendelson (2002).

Now we prove the following using the results above.

Lemma A.3. Let $\mathcal{H}_k$ be an RKHS associated with a universal kernel $k(\cdot, \cdot)$, and let $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ be the inner product of $\mathcal{H}_k$, which satisfies the reproducing property $\nu(z) = \langle \nu, k(\cdot, z) \rangle_{\mathcal{H}_k}$. When $\mathcal{G}$ is chosen such that
$$\mathcal{G} = \Big\{ g \in (Z \rightarrow \mathbb{R}) : g(z) = \nu(z) - \gamma\, \mathbb{E}_{s' \sim T(\phi^{-1}(z)),\, a' \sim \pi(s')}\big[\nu(\phi(s', a'))\big],\; \nu \in (Z \rightarrow \mathbb{R}),\; \langle \nu, \nu \rangle_{\mathcal{H}_k} \leq 1 \Big\},$$
the $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)$ has the closed form
$$\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)^2 = \mathbb{E}\Big[ k\big(\phi(s,a), \phi(\bar{s},\bar{a})\big) + (1-\gamma)^2 k\big(\phi(s_0,a_0), \phi(\bar{s}_0,\bar{a}_0)\big) + \gamma^2 k\big(\phi(s',a'), \phi(\bar{s}',\bar{a}')\big)$$
$$\qquad - 2(1-\gamma)\, k\big(\phi(s_0,a_0), \phi(\bar{s},\bar{a})\big) - 2\gamma\, k\big(\phi(s',a'), \phi(\bar{s},\bar{a})\big) + 2\gamma(1-\gamma)\, k\big(\phi(s',a'), \phi(\bar{s}_0,\bar{a}_0)\big) \Big],$$
where the expectation is over independent draws $s_0 \sim d_0$, $a_0 \sim \pi(s_0)$, $(s,a,s') \sim d^D$, $a' \sim \pi(s')$ and $\bar{s}_0 \sim d_0$, $\bar{a}_0 \sim \pi(\bar{s}_0)$, $(\bar{s},\bar{a},\bar{s}') \sim d^D$, $\bar{a}' \sim \pi(\bar{s}')$. Furthermore, suppose that $\bar{k} \geq \sup_{z \in Z} k(z, z)$. The sample-based estimator $\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)^2$ of $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)^2$ from $n$ samples satisfies, with probability at least $1-\delta$,
$$\Big| \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi) - \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) \Big| \leq O\Big( \sqrt{\bar{k} \log(1/\delta) / n} \Big). \tag{13}$$

Proof. Below we write $\mathrm{IPM}_{\mathcal{G}}$ as shorthand for $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)$. As in Eq. (4), we can rewrite the IPM as
$$\mathrm{IPM}_{\mathcal{G}} = \sup_{\nu \in \mathcal{F}} \Big| (1-\gamma)\, \mathbb{E}_{s \sim d_0,\, a \sim \pi(s)}\big[\nu(\phi(s, a))\big] - \mathbb{E}_{(s,a,s') \sim d^D,\, a' \sim \pi(s')}\big[\nu(\phi(s, a)) - \gamma \nu(\phi(s', a'))\big] \Big|,$$
where $\mathcal{F} = \{\nu \in (Z \rightarrow \mathbb{R}) : \langle \nu, \nu \rangle_{\mathcal{H}_k} \leq 1\}$ is the unit ball in the RKHS $\mathcal{H}_k$. Using the reproducing property of $\mathcal{H}_k$, the term inside the supremum equals $\langle \nu, \nu^* \rangle_{\mathcal{H}_k}$, where
$$\nu^*(\cdot) = (1-\gamma)\, \mathbb{E}_{s \sim d_0,\, a \sim \pi(s)}\big[k(\cdot, \phi(s, a))\big] - \mathbb{E}_{(s,a,s') \sim d^D,\, a' \sim \pi(s')}\big[k(\cdot, \phi(s, a)) - \gamma k(\cdot, \phi(s', a'))\big].$$
By the Cauchy-Schwarz inequality and $\langle \nu, \nu \rangle_{\mathcal{H}_k} \leq 1$ for $\nu \in \mathcal{F}$, $\langle \nu, \nu^* \rangle_{\mathcal{H}_k}^2 \leq \langle \nu, \nu \rangle_{\mathcal{H}_k} \langle \nu^*, \nu^* \rangle_{\mathcal{H}_k} \leq \langle \nu^*, \nu^* \rangle_{\mathcal{H}_k}$, so $\sup_{\nu \in \mathcal{F}} \langle \nu, \nu^* \rangle_{\mathcal{H}_k}^2 = \langle \nu^*, \nu^* \rangle_{\mathcal{H}_k}$. Using the property $\langle k(\cdot, z), k(\cdot, z') \rangle_{\mathcal{H}_k} = k(z, z')$, expanding $\langle \nu^*, \nu^* \rangle_{\mathcal{H}_k}$ gives the closed-form expression in the lemma.

Now we prove the error bound of the estimator. First, we write $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi) = \sup_{\nu \in \mathcal{F}} |f_1(\nu) + f_2(\nu) + f_3(\nu)|$, where
$$f_1(\nu) = (1-\gamma)\, \mathbb{E}_{s \sim d_0,\, a \sim \pi(s)}\big[\nu(\phi(s, a))\big], \quad f_2(\nu) = -\mathbb{E}_{(s,a) \sim d^D}\big[\nu(\phi(s, a))\big], \quad f_3(\nu) = \gamma\, \mathbb{E}_{(s,a,s') \sim d^D,\, a' \sim \pi(s')}\big[\nu(\phi(s', a'))\big].$$
Given $n$ samples $\{s_0^{(i)}, a_0^{(i)}, s^{(i)}, a^{(i)}, s'^{(i)}, a'^{(i)}\}_{i=1}^n$ from the generative process $s_0^{(i)} \sim d_0$, $a_0^{(i)} \sim \pi(s_0^{(i)})$, $(s^{(i)}, a^{(i)}, s'^{(i)}) \sim d^D$, $a'^{(i)} \sim \pi(s'^{(i)})$ for all $i$, we define the sample-based estimator
$$\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)^2 = \frac{1}{n^2} \sum_{i,j} \Big[ k\big(\phi(s^{(i)},a^{(i)}), \phi(s^{(j)},a^{(j)})\big) + (1-\gamma)^2 k\big(\phi(s_0^{(i)},a_0^{(i)}), \phi(s_0^{(j)},a_0^{(j)})\big) + \gamma^2 k\big(\phi(s'^{(i)},a'^{(i)}), \phi(s'^{(j)},a'^{(j)})\big)$$
$$\qquad - 2(1-\gamma)\, k\big(\phi(s_0^{(i)},a_0^{(i)}), \phi(s^{(j)},a^{(j)})\big) - 2\gamma\, k\big(\phi(s'^{(i)},a'^{(i)}), \phi(s^{(j)},a^{(j)})\big) + 2\gamma(1-\gamma)\, k\big(\phi(s'^{(i)},a'^{(i)}), \phi(s_0^{(j)},a_0^{(j)})\big) \Big].$$
Deriving in the reverse order, the estimator can equivalently be written as a supremum, which we split into three parts: $\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) = \sup_{\nu \in \mathcal{F}} |\hat{f}_1(\nu) + \hat{f}_2(\nu) + \hat{f}_3(\nu)|$, where
$$\hat{f}_1(\nu) = \frac{1-\gamma}{n} \sum_{i=1}^n \nu\big(\phi(s_0^{(i)}, a_0^{(i)})\big), \quad \hat{f}_2(\nu) = -\frac{1}{n} \sum_{i=1}^n \nu\big(\phi(s^{(i)}, a^{(i)})\big), \quad \hat{f}_3(\nu) = \frac{\gamma}{n} \sum_{i=1}^n \nu\big(\phi(s'^{(i)}, a'^{(i)})\big).$$
We can bound the error of the sample-based estimator by the individual errors:
$$\Big| \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi) - \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) \Big| = \Big| \sup_{\nu \in \mathcal{F}} |f_1(\nu) + f_2(\nu) + f_3(\nu)| - \sup_{\nu \in \mathcal{F}} |\hat{f}_1(\nu) + \hat{f}_2(\nu) + \hat{f}_3(\nu)| \Big|$$
$$\leq \sup_{\nu \in \mathcal{F}} \big| f_1(\nu) - \hat{f}_1(\nu) \big| + \sup_{\nu \in \mathcal{F}} \big| f_2(\nu) - \hat{f}_2(\nu) \big| + \sup_{\nu \in \mathcal{F}} \big| f_3(\nu) - \hat{f}_3(\nu) \big|.$$
Changing a single sample $(s_0^{(i)}, a_0^{(i)})$ changes $\sup_{\nu \in \mathcal{F}} |f_1(\nu) - \hat{f}_1(\nu)|$ by at most $2(1-\gamma)\sqrt{\bar{k}}/n$, where $\bar{k} = \sup_{z \in Z} k(z, z)$, so McDiarmid's inequality (Theorem A.1) controls its deviation from its expectation; that expectation is in turn bounded, via the standard symmetrization argument, by the Rademacher complexity of $\mathcal{F}$ and hence by Lemma A.2. Combining the two results bounds $\sup_{\nu \in \mathcal{F}} |f_1(\nu) - \hat{f}_1(\nu)|$ by a term of order $(1-\gamma)\sqrt{\bar{k}\log(1/\delta)/n}$; the analogous arguments bound $\sup_{\nu \in \mathcal{F}} |f_2(\nu) - \hat{f}_2(\nu)|$ and $\sup_{\nu \in \mathcal{F}} |f_3(\nu) - \hat{f}_3(\nu)|$ with factors $1$ and $\gamma$, respectively. Letting each failure probability be $\delta/3$ and applying a union bound yields the stated bound on $|\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi) - \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)|$ with probability at least $1-\delta$. $\square$

The relationship between $\mathcal{F}$ and $\mathcal{G}$  The definition
$$\mathcal{G} = \Big\{ g \in (Z \rightarrow \mathbb{R}) : g(z) = \nu(z) - \gamma\, \mathbb{E}_{s' \sim T(\phi^{-1}(z)),\, a' \sim \pi(s')}\big[\nu(\phi(s', a'))\big],\; \nu \in (Z \rightarrow \mathbb{R}),\; \langle \nu, \nu \rangle_{\mathcal{H}_k} \leq 1 \Big\}$$
shows that when the conditional expectation $\mathbb{E}_{s' \sim T(\phi^{-1}(\cdot)),\, a' \sim \pi(s')}[\nu(\phi(s', a'))] : Z \rightarrow \mathbb{R}$ is a function in the RKHS $\mathcal{H}_k$, $\mathcal{G}$ is also a subset of $\mathcal{H}_k$. We can then prove the following theorem.

Theorem 4.3. Given an MDP $M$, its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, and an RKHS $\mathcal{H}_k \subseteq (Z \rightarrow \mathbb{R})$ induced by a universal kernel $k$ with $\sup_{z \in Z} k(z, z) = \bar{k}$, assume that $f_{\phi, \hat{R}_z, \hat{T}_z}(z) = \mathbb{E}_{T,\pi}\big[\sum_{t=0}^{\infty} \gamma^t \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s_t, a_t) \,\big|\, (s_0, a_0) = \phi^{-1}(z)\big] \in \mathcal{H}_k$ with $B_\phi = \|f_{\phi, \hat{R}_z, \hat{T}_z}\|_{\mathcal{H}_k}$, and that the loss is bounded by $\bar{\mathcal{E}} = \sup_{s \in S, a \in A} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)$. Let $n$ be the number of data points in $D$. With probability at least $1-2\delta$,
$$\big|R^\pi - \hat{R}^\pi\big| \leq \frac{1}{n} \sum_{(s,a) \in D} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) + B_\phi \, \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) + \epsilon(n, \delta),$$
where $\epsilon(n, \delta)$ collects the concentration terms from Lemma A.3 and Hoeffding's inequality, each of order $O(\sqrt{\log(1/\delta)/n})$.

Proof. Applying Hoeffding's inequality (Hoeffding, 1963), with probability at least $1-\delta$, we have
$$\mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big] \leq \frac{1}{n} \sum_{(s,a) \in D} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) + \bar{\mathcal{E}} \sqrt{\frac{\log(1/\delta)}{2n}}.$$
Taking a union bound with Eq. (13) and substituting the terms into Eq. (3) recovers the result. $\square$

B EXPERIMENT DETAILS

B.1 COMPUTING INFRASTRUCTURE

All experiments were conducted on the Google Cloud Platform. Specifically, we used compute-optimized machines (c2-standard-4) providing 4 vCPUs and 16 GB of memory for the evaluation experiments of Section 5.1, and high-memory machines (n1-highmem-4) providing 4 vCPUs and 26 GB of memory, equipped with an Nvidia Tesla K80 GPU, for the RL experiments of Section 5.2.

B.2 DETAILS OF THE OPE EXPERIMENT

Task details  We did not modify the CartPole-v0 and Acrobot-v1 environments from the original implementation of OpenAI Gym (Brockman et al., 2016), except for the maximum trajectory length.
We ran PPO (Schulman et al., 2017) to optimize policies to a certain performance level and set them as the target policies for CartPole-v0 and Acrobot-v1. For the HIV simulator, we used the code adapted by Liu et al. (2018b), which originally comes from the RLPy implementation.³ We modified the environment to have more randomness in the initial state (up to 10% perturbation from the baseline initial state) and to use a reward function that gives the logarithm of the original reward values, as the original reward function scales up to $10^{10}$. We used a tree-based fitted Q-iteration algorithm implemented by Liu et al. (2018b) to optimize the target policy for the HIV simulator. All the other details are shown in Table 2. We assume that the termination conditions of the tasks are known a priori.

Table 2: Task settings of the OPE experiments.

| | CartPole-v0 | Acrobot-v1 | HIV simulator |
|---|---|---|---|
| state space dimension | 4 | 6 | 6 |
| # of actions | 2 | 3 | 4 |
| discount rate γ | 0.99 | 0.99 | 0.98 |
| # of trajectories | 200 | 200 | 50 |
| max length of training traj. | 200 | 200 | 200 |
| max length of rollouts for evaluation | 1000 | 1000 | 1000 |
| discounted return of target policy | 89.4 | -56.88 | 803.2 |

Model and algorithm details  The model we learn is composed of a representation module and a dynamics module. To be consistent with the experimental settings in Liu et al. (2018b), we use a representation module consisting of a single-hidden-layer feed-forward network that takes the state as input and outputs the representation. We squash the representation into $(-1, 1)$ using the tanh activation function. The dynamics module is also a single-hidden-layer feed-forward network that takes the representation as input and outputs the state-difference and reward prediction for each action. We use the swish activation function (Ramachandran et al., 2017) for the hidden layers of the two modules. As a whole, the model can also be seen as a feed-forward network with three hidden layers of varying activation functions, where the output of the second hidden layer is the representation we regularize. For the purpose of comparison, we minimize the L2 distance between the model prediction and the desired outcome from the data, which corresponds to using a model with a Gaussian predictive distribution of fixed variance. We standardized the inputs and outputs of the neural network and used Adam (Kingma & Ba, 2014) with a learning rate of $3 \times 10^{-4}$ for the optimization. Compared to the similar experiments conducted in Liu et al. (2018b), we used a larger and more expressive model with more optimization steps and a smaller learning rate for a more accurate comparison.

While the derivation of the RepB-SDE objective is based on a state-action representation function, we use a state representation in this experiment for direct comparison with RepBM, which uses a state representation (this can also be understood as using an action-invariant kernel). We follow the choice of Liu et al. (2018b) and use the dot-product kernel $k(\phi(s), \phi(\bar{s})) = \phi(s)^\top \phi(\bar{s})$ for the OPE experiment, which is not universal but allows us to avoid searching over kernel hyperparameters such as length scales.

³https://github.com/rlpy/rlpy

After training, we generate another 200 trajectories (50 in the case of the HIV simulator) and roll out in both the true and simulated (based on the learned model) environments to evaluate the models. We measure the individual MSE,
$$\text{Individual MSE} = \mathbb{E}_{s_0 \sim d_0}\Big[ \Big( \mathbb{E}_{\hat{M},\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0\Big] - \mathbb{E}_{M,\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0\Big] \Big)^2 \Big],$$
as the performance metric of each model.
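The computation behind this metric is straightforward. Below is a small NumPy sketch under the assumption that, for each sampled initial state, the discounted returns of rollouts in the true environment and in the learned model have already been collected; all names are illustrative rather than the paper's code.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted sum of rewards of a single rollout."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(rewards * gamma ** np.arange(len(rewards))))

def individual_mse(true_returns_per_s0, model_returns_per_s0):
    """Individual MSE: mean over initial states s0 of the squared gap between the
    value estimated by rolling out pi in the learned model and the true value.

    Each argument maps an initial-state index to a list of discounted returns;
    the inner mean approximates the conditional expectation given s0.
    """
    gaps = []
    for s0 in true_returns_per_s0:
        v_true = np.mean(true_returns_per_s0[s0])
        v_model = np.mean(model_returns_per_s0[s0])
        gaps.append((v_model - v_true) ** 2)
    return float(np.mean(gaps))
```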
The whole experiment, from sampling data to learning and evaluating the model, is repeated 200 times with different random seeds.

Choice and effect of the hyperparameter α  To choose the hyperparameter α for each algorithm, we searched over α ∈ {0.001, 0.01, 0.1, 1, 10} for each off-policyness ϵ and each environment. The chosen values were mainly α ∈ {0.001, 0.01} for CartPole-v0, α ∈ {1, 10} for Acrobot-v1, and α ∈ {0.01, 0.1} for the HIV simulator, for both algorithms. In general, a large α was beneficial when high off-policyness (ϵ) was present and/or the task was hard to generalize. On the right we show an example of the effect of varying α in CartPole-v0. [Figure: relative individual MSE in CartPole-v0 (ϵ = 0.5) for α ∈ {0.0001, 0.001, 0.01, 0.1}, comparing the baseline, RepBM, and our method.]

Comparison to other baselines  In Figure 2, the OPE results with additional model-free baselines are presented. FQE: fitted Q-evaluation; IS: step-wise importance sampling; DR: doubly robust estimator based on step-wise importance sampling using the value function learned with fitted Q-evaluation (Jiang & Li, 2016); DualDICE: stationary distribution correction algorithm (Nachum et al., 2019a). We used the implementation provided by the authors in the case of DualDICE. All results are normalized so that the MSE of the model-based baseline is 1. Here, we use average MSEs instead of individual MSEs, since importance sampling estimators are not suitable for computing individual MSEs. The results show that model-based methods are more robust to increasing off-policyness compared to FQE. The results of the model-based methods on Acrobot are relatively worse due to the difficult dynamics of the Acrobot environment.

Figure 2: The OPE results compared to other model-free baselines (Baseline, Ours, FQE, IS, DR, DualDICE) on CartPole-v0, Acrobot-v1, and the HIV simulator, with behavior off-policyness (ϵ) on the x-axis and normalized average MSE on the y-axis. All experiments are repeated 200 times and the error bars denote 95% confidence intervals.

B.3 DETAILS OF THE OFFLINE RL EXPERIMENTS

Task details  We used 12 datasets in D4RL (Fu et al., 2020), over four dataset types and three environments, as specified in the main text. In Table 1, the normalized score suggested by Fu et al. (2020) is used to report the results, where a score of 0 corresponds to a random policy and 100 corresponds to a converged SAC policy. In HalfCheetah-v2, a score of 0 corresponds to an undiscounted return of -280 and a score of 100 to an undiscounted return of 12135. In Hopper-v2, a score of 0 corresponds to an undiscounted return of -20 and a score of 100 to an undiscounted return of 3234. In Walker2d-v2, a score of 0 corresponds to an undiscounted return of 2 and a score of 100 to an undiscounted return of 4592. We assume that the termination conditions of the tasks are known a priori.
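For reference, the D4RL normalization used in Tables 1 and 4 amounts to the simple linear rescaling sketched below; the reference returns are the approximate values quoted above and are used purely for illustration.

```python
# Approximate reference returns quoted above: (random-policy return, expert return).
REFERENCE_RETURNS = {
    "halfcheetah": (-280.0, 12135.0),
    "hopper": (-20.0, 3234.0),
    "walker2d": (2.0, 4592.0),
}

def normalized_score(env_name, undiscounted_return):
    """D4RL-style normalization: 0 = random policy, 100 = converged SAC (expert) policy."""
    random_ret, expert_ret = REFERENCE_RETURNS[env_name]
    return 100.0 * (undiscounted_return - random_ret) / (expert_ret - random_ret)

# Example: normalized_score("hopper", 1600.0) is roughly 49.8.
```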
Consequently, to bound the error of $\bar{R}^\pi$, the representation should be balanced with another discounted stationary distribution
$$\bar{d}^\pi(s, a) \triangleq (1-\gamma)\sum_{t=0}^{\infty}\gamma^t \Pr(s_t = s, a_t = a \mid s_0 \sim d^{\mathcal{D}}, T, \pi),$$
the distribution induced by the policy π when the initial state is sampled from the data distribution. The derivations can be easily adapted by noting that
$$\mathrm{IPM}_{\mathcal{G}}(\bar{d}^\pi_\phi, d^{\mathcal{D}}_\phi) = \sup_{\nu \in \mathcal{F}} \Big\{ (1-\gamma)\,\mathbb{E}_{\bar{s} \sim d^{\mathcal{D}},\, \bar{a} \sim \pi}\big[\nu(\phi(\bar{s}, \bar{a}))\big] - \mathbb{E}_{(s,a,s') \sim d^{\mathcal{D}},\, a' \sim \pi}\big[\nu(\phi(s, a)) - \gamma\,\nu(\phi(s', a'))\big] \Big\},$$
and by changing the initial-state sampling distribution to $d^{\mathcal{D}}$ during the estimation of the IPM.

Model and algorithm details   Similar to the model used in the OPE experiment, the model we learn is composed of a representation module and a dynamics module. The representation module is a feed-forward network with two hidden layers that takes the state-action pair as input and outputs the representation through the tanh activation function. The dynamics module is a single-hidden-layer network that takes the representation as input and outputs the parameters of a diagonal Gaussian distribution predicting the state difference and the reward. We use 200 hidden units for all intermediate layers, including the representation layer. Across all domains, we train an ensemble of 7 models and pick the best 5 models according to their validation error on a hold-out set of 1000 transitions from the dataset. The inputs and outputs of the neural network are normalized. We present the pseudo-code of the presented Representation Balancing Offline Model-based RL algorithm below.

Algorithm 1 Representation Balancing Offline Model-based RL
Input: offline dataset $\mathcal{D}$, previous policy π
Output: optimized policy π
1: Sample $K$ independent datasets with replacement from $\mathcal{D}$.
2: Train a bootstrapped ensemble of $K$ models $\{\widehat{T}_i, \widehat{R}_i\}_{i=1}^{K}$ minimizing Eq. (8) (adapted with $\bar{d}^\pi_\phi$).
3: for repeat = 0, 1, . . . do
4:   for rollout = 0, 1, . . . , B do
5:     Sample an initial rollout state $s_0$ from $\mathcal{D}$.
6:     for t = 0, 1, . . . do
7:       Sample an action $a_t \sim \pi(s_t)$.
8:       Randomly pick $(\widehat{T}_i, \widehat{R}_i)$ and sample $(s_{t+1}, r_t) \sim (\widehat{T}_i, \widehat{R}_i)(s_t, a_t)$.
9:       Compute $\tilde{r}_t = r_t - \lambda\,\big\|\sqrt{\mathbb{V}_K[\mu(s_t, a_t)]}\big\|_2$ and store $(s_t, a_t, \tilde{r}_t, s_{t+1})$ in $\widehat{\mathcal{D}}$.
10:     end for
11:   end for
12:   Draw samples from $\mathcal{D}$ to compute $\widehat{\mathrm{IPM}}(\bar{d}^\pi_\phi, d^{\mathcal{D}}_\phi)$.
13:   Draw samples from $\mathcal{D}$ and $\widehat{\mathcal{D}}$ to update the critic $Q$.
14:   Maximize $\mathbb{E}_{s \sim \mathcal{D} \cup \widehat{\mathcal{D}},\, a \sim \pi(s)}\big[Q(s, a) - \tau \log \pi(a \mid s)\big] - \alpha_\pi\,\widehat{\mathrm{IPM}}(\bar{d}^\pi_\phi, d^{\mathcal{D}}_\phi)$ to update π.
15: end for

For the results of MOPO (Yu et al., 2020), we ran the code kindly provided by the authors (https://github.com/tianheyu927/mopo) without any modification to the algorithm or the hyperparameters. All algorithms we experimented with (Base, RP, Rep B-SDE) share all hyperparameters except the ones associated with the changed objectives. We run SAC on the full rollouts from the trained ensemble models, as shown in Algorithm 1. The common hyperparameters shared among the algorithms are listed in Table 3. We simply tried the listed hyperparameters and did not tune them further. For RP and Rep B-SDE, we penalized the reward from the simulated environments with the standard deviation of the prediction means of the neural network ensemble, using the standardized outputs of all 7 neural networks to compute the reward penalty.
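To illustrate the rollout phase of Algorithm 1 (steps 4-11) together with the ensemble-based reward penalty just described, the sketch below rolls the policy through randomly chosen ensemble members and penalizes each reward by the ℓ2 norm of the ensemble standard deviation of the (standardized) mean predictions. The model, policy, and buffer interfaces (sample, predict_mean, add) are assumptions for exposition, not the exact implementation.

import numpy as np

def rollout_with_penalty(ensemble, policy, dataset_states, buffer,
                         num_rollouts, max_len, lam):
    """Sketch of Algorithm 1, steps 4-11 (interfaces are assumed):
       - each model m in `ensemble` provides m.sample(s, a) -> (s_next, r) and
         m.predict_mean(s, a) -> (mean_s_next, mean_r) in standardized units,
       - policy(s) -> a,
       - buffer.add(s, a, r, s_next) stores a transition in the synthetic dataset D_hat."""
    for _ in range(num_rollouts):
        # Step 5: sample an initial rollout state from the offline dataset D.
        s = dataset_states[np.random.randint(len(dataset_states))]
        for _ in range(max_len):
            a = policy(s)                                       # Step 7
            model = ensemble[np.random.randint(len(ensemble))]  # Step 8: pick one model at random
            s_next, r = model.sample(s, a)
            # Step 9: penalty = lambda * || std over ensemble mean predictions ||_2
            means = np.stack([np.concatenate([np.ravel(mu_s), np.ravel(mu_r)])
                              for mu_s, mu_r in (m.predict_mean(s, a) for m in ensemble)])
            r_tilde = r - lam * np.linalg.norm(means.std(axis=0))
            buffer.add(s, a, r_tilde, s_next)                   # store in D_hat
            s = s_next

As described above, the penalty coefficient λ is shared with the RP baseline, and the penalty is computed from the standardized outputs of all 7 ensemble members.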
Table 3: Common hyperparameters used in offline RL experiments.

Parameter                                         Value
optimizer                                         Adam (Kingma & Ba, 2014)
learning rate                                     3 × 10⁻⁴
discount factor γ                                 0.99
number of samples per minibatch                   256
target smoothing coefficient τ                    5 × 10⁻³
[actor/critic] number of hidden layers            2
[actor/critic] number of hidden units per layer   256
[actor/critic] non-linearity                      ReLU
# of rollouts                                     10⁴
max length of rollouts                            10³
rollout buffer size                               5 × 10⁷

Table 4: Normalized scores on D4RL MuJoCo benchmark datasets (Fu et al., 2020) with standard errors fully specified.

Dataset type   Environment       Rep B-SDE (ours)   RP            Base          MOPO
Random         Walker2d-v2       21.1 ± 1.0         18.4 ± 1.8    16.4 ± 2.2    1.3 ± 0.5
Medium         Walker2d-v2       72.1 ± 1.9         56.3 ± 4.3    5.5 ± 0.4     -0.1 ± 0.0
Med-Replay     Walker2d-v2       49.8 ± 11.4        41.1 ± 10.3   6.2 ± 0.7     47.8 ± 4.8
Med-Expert     Walker2d-v2       88.8 ± 6.9         72.6 ± 5.1    51.7 ± 7.1    32.4 ± 8.8
Random         Hopper-v2         8.6 ± 1.0          8.3 ± 0.2     8.3 ± 0.2     9.1 ± 0.3
Medium         Hopper-v2         34.0 ± 2.8         27.5 ± 3.1    19.8 ± 0.8    19.2 ± 4.5
Med-Replay     Hopper-v2         62.2 ± 6.7         49.8 ± 8.7    32.9 ± 5.5    80.8 ± 2.9
Med-Expert     Hopper-v2         82.6 ± 7.0         74.0 ± 7.2    19.1 ± 1.1    23.2 ± 1.3
Random         HalfCheetah-v2    32.9 ± 1.1         31.3 ± 1.5    26.1 ± 6.2    29.9 ± 1.2
Medium         HalfCheetah-v2    49.1 ± 0.3         47.3 ± 0.3    22.1 ± 4.4    45.0 ± 0.4
Med-Replay     HalfCheetah-v2    57.5 ± 0.8         53.6 ± 0.8    54.0 ± 2.4    53.8 ± 1.1
Med-Expert     HalfCheetah-v2    55.4 ± 8.3         36.7 ± 12.0   31.9 ± 10.5   91.0 ± 2.5

We first searched over the reward penalty coefficient λ ∈ {0, 2, 5, 7, 10, 15} for the value that grants RP the best performance. We shared the same λ for Rep B-SDE and searched over $\alpha_M \in \{0.01, 0.1, 1\}$ and $\alpha_\pi \in \{0, 0.01, 0.1, 1, 10\}$. We ran the algorithms for 1.5 × 10⁶ gradient updates, except for HalfCheetah-Medium-Expert, where we ran for 5.0 × 10⁶. In Table 4, we present the standard errors of the results, which were omitted in the main text. In addition to the performance after the final iteration presented in Table 1, we also present the learning curves of the experiments in Figure 3 and the effect of the choice of $\alpha_M$ and $\alpha_\pi$ in Figure 4. Figure 4 shows that the presented regularization is robust to the choice of $\alpha_M$ and $\alpha_\pi$, except for too large an $\alpha_M$, consistently improving over RP, which corresponds to $\alpha_M = 0$, $\alpha_\pi = 0$.

Figure 3: The learning curves of the D4RL experiment. (Panels: average return versus gradient updates (10⁶) for the Random, Medium, Medium-Replay, and Medium-Expert datasets of Walker2d-v2, Hopper-v2, and HalfCheetah-v2; methods: MOPO, Base, RP, RP + MRB, BC, MF.)

Figure 4: The effect of varying hyperparameters in the D4RL experiment. (Panels: average return versus gradient updates (10⁶) on Medium-Walker2d-v2 for varying $\alpha_M$ and $\alpha_\pi$, with BC, MF, and RP ($\alpha_M = 0$, $\alpha_\pi = 0$) as references.)
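For reference, the normalized scores reported in Tables 1 and 4 follow the linear scaling described in the task details above (0 for a random policy, 100 for a converged SAC policy). A minimal sketch, using the rounded reference returns quoted above rather than the exact D4RL constants:

# Normalized D4RL score: 0 corresponds to a random policy, 100 to a converged
# SAC policy (reference returns as quoted above, rounded).
REFERENCE_RETURNS = {            # (random return, expert return)
    "HalfCheetah-v2": (-280.0, 12135.0),
    "Hopper-v2": (-20.0, 3234.0),
    "Walker2d-v2": (2.0, 4592.0),
}

def normalized_score(env_name, undiscounted_return):
    lo, hi = REFERENCE_RETURNS[env_name]
    return 100.0 * (undiscounted_return - lo) / (hi - lo)

# Example: an undiscounted return of 3310 on Walker2d-v2 maps to a score of about 72,
# roughly the Medium-Walker2d-v2 result of Rep B-SDE in Table 4.
print(round(normalized_score("Walker2d-v2", 3310.0), 1))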