Published as a conference paper at ICLR 2021

REPRESENTATION BALANCING OFFLINE MODEL-BASED REINFORCEMENT LEARNING

Byung-Jun Lee1,3, Jongmin Lee1 & Kee-Eung Kim1,2
1School of Computing, KAIST, Daejeon, Republic of Korea
2Graduate School of AI, KAIST, Daejeon, Republic of Korea
3Gauss Labs Inc., Seoul, Republic of Korea
{bjlee,jmlee}@ai.kaist.ac.kr, kekim@kaist.ac.kr

ABSTRACT

One of the main challenges in offline and off-policy reinforcement learning is to cope with the distribution shift that arises from the mismatch between the target policy and the data collection policy. In this paper, we focus on a model-based approach, particularly on learning the representation for a robust model of the environment under the distribution shift, which was first studied by Representation Balancing MDP (RepBM). Although this prior work has shown promising results, a number of shortcomings still hinder its applicability to practical tasks. In particular, we address the curse of horizon exhibited by RepBM, which causes it to reject most of the pre-collected data in long-term tasks. We present a new objective for model learning motivated by recent advances in the estimation of stationary distribution corrections. This effectively overcomes the aforementioned limitation of RepBM, while naturally extending to continuous action spaces and stochastic policies. We also present an offline model-based policy optimization algorithm using this new objective, yielding state-of-the-art performance in a representative set of benchmark offline RL tasks.

1 INTRODUCTION

Reinforcement learning (RL) has accomplished remarkable results in a wide range of domains, but its successes have mostly relied on a large number of online interactions with the environment. However, in many real-world tasks, exploratory online interactions are either very expensive or dangerous (e.g. robotics, autonomous driving, and healthcare), and applying standard online RL would be impractical. Consequently, the ability to optimize RL agents reliably without online interactions has been considered a key to practical deployment, which is the main goal of batch RL, also known as offline RL (Fujimoto et al., 2019; Levine et al., 2020).

In an offline RL algorithm, accurate policy evaluation and reliable policy improvement are both crucial for the successful training of the agent. Evaluating policies in offline RL is essentially an off-policy evaluation (OPE) task, which aims to evaluate the target policy given a dataset collected from the behavior policy. The difference between the target and the behavior policies causes a distribution shift in the estimation, which needs to be adequately addressed for accurate policy evaluation. OPE itself is one of the long-standing hard problems in RL (Sutton et al., 1998; 2009; Thomas & Brunskill, 2016; Hallak & Mannor, 2017). However, recent offline RL studies mainly focus on how to improve the policy conservatively while using a common policy evaluation technique without much consideration for the distribution shift, e.g. mean squared temporal difference error minimization or maximum-likelihood training of an environment model (Fujimoto et al., 2019; Kumar et al., 2019; Yu et al., 2020).
While conservative policy improvement helps policy evaluation by reducing the off-policyness, we hypothesize that addressing the distribution shift explicitly during policy evaluation can further improve the overall performance, since it provides a better foundation for policy improvement. To this end, we aim to explicitly address the distribution shift of the OPE estimator used in the offline RL algorithm. In particular, we focus on the model-based approach, where we train an environment model robust to the distribution shift. One of the notable prior works is Representation Balancing MDP (RepBM) (Liu et al., 2018b), which regularizes the representation learning of the model to be invariant between the distributions. However, despite the promising results of RepBM, its step-wise estimation of the distance between the distributions has a few drawbacks that limit the algorithm from being practical: not only does it assume a discrete-action task with a deterministic target policy, but it also performs poorly in long-term tasks due to the curse of horizon of step-wise importance sampling (IS) estimators (Liu et al., 2018a).

To address these limitations, we present the Representation Balancing with Stationary Distribution Estimation (RepB-SDE) framework, where we aim to learn a balanced representation by regularizing, in the representation space, the distance between the data distribution and the discounted stationary distribution induced by the target policy. Motivated by recent advances in estimating stationary distribution corrections, we present a new representation balancing objective to train a model of the environment that no longer suffers from the curse of horizon. We empirically show that the model trained by the RepB-SDE objective is robust to the distribution shift in the OPE task, particularly when the difference between the target and the behavior policies is large. We also introduce a model-based offline RL algorithm based on the RepB-SDE framework and report its performance on the D4RL benchmark (Fu et al., 2020), showing state-of-the-art performance in a representative set of tasks.

2 RELATED WORK

Learning balanced representation  Learning a representation invariant to specific aspects of the data is an established method for overcoming the distribution shift that arises in unsupervised domain adaptation (Ben-David et al., 2007; Zemel et al., 2013) and in causal inference from observational data (Shalit et al., 2017; Johansson et al., 2018). These works show that imposing a bound on the generalization error under the distribution shift leads to an objective that learns a balanced representation under which the training and the test distributions look similar. RepBM (Liu et al., 2018b) can be seen as a direct extension to the sequential case, which encourages the representation to be invariant under the target and behavior policies at each timestep.

Stationary distribution correction estimation (DICE)  Step-wise importance sampling (IS) estimators (Precup, 2000) compute importance weights by taking the product of per-step distribution ratios. Consequently, these methods suffer from variance that grows exponentially in the length of trajectories, a phenomenon called the curse of horizon (Liu et al., 2018a).
Recently, techniques for computing a stationary DIstribution Correction Estimation (DICE) have made remarkable progress in effectively addressing the curse of horizon (Liu et al., 2018a; Nachum et al., 2019a; Tang et al., 2020; Zhang et al., 2020; Mousavi et al., 2020). DICE has also been used to explicitly address the distribution shift in online model-free RL, by directly applying IS to the policy and action-value objectives (Liu et al., 2019; Gelada & Bellemare, 2019). We adopt one of these estimation techniques, DualDICE (Nachum et al., 2019a), to measure the distance between the stationary distribution and the data distribution in the representation space.

Offline reinforcement learning  There are extensive studies on improving standard online model-free RL algorithms (Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018) for stable learning in the offline setting. The main idea behind them is to conservatively improve the policy by (1) quantifying the uncertainty of the value function estimate, e.g. using bootstrapped ensembles (Kumar et al., 2019; Agarwal et al., 2020), and/or (2) constraining the optimized target policy to be close to the behavior policy, i.e. behavior regularization approaches (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Lee et al., 2020). A notable exception is AlgaeDICE (Nachum et al., 2019b), which implicitly uses DICE to regularize the discounted stationary distribution induced by the target policy to stay inside the data support, similar to this work. On the other hand, Yu et al. (2020) argued that the model-based approach can be advantageous due to its ability to generalize predictions to states outside of the data support. They introduce MOPO (Yu et al., 2020), which uses truncated rollouts and penalized rewards for conservative policy improvement. MOReL (Kidambi et al., 2020) trains a state-action novelty detector and uses it to penalize rewards in the data-sparse region. Matsushima et al. (2020), MOOSE (Swazinna et al., 2020) and MBOP (Argenson & Dulac-Arnold, 2020) guide their policy optimization using the behavior policy, similar to the behavior regularization approaches.

Note that these offline RL methods build on the standard approximate dynamic programming algorithm for action-value estimation (model-free) or on a maximum-likelihood environment model (model-based), without explicitly addressing the distribution shift in the estimator. In contrast, we augment the objective for model learning to obtain a model robust to the distribution shift, which is, to the best of our knowledge, the first such attempt in offline RL.

3 PRELIMINARIES

A Markov Decision Process (MDP) is specified by a tuple $M = \langle S, A, T, R, d_0, \gamma \rangle$, consisting of the state space $S$, action space $A$, transition function $T : S \times A \rightarrow \Delta(S)$, reward function $R : S \times A \rightarrow \Delta([0, r_{\max}])$, initial state distribution $d_0$, and discount rate $\gamma$. In this paper, we mainly focus on a continuous state space $S \subseteq \mathbb{R}^{d_s}$ and conduct experiments on both discrete action spaces $A = \{a_0, \ldots, a_{n_a}\}$ and continuous action spaces $A \subseteq \mathbb{R}^{d_a}$. Given the MDP $M$ and a policy $\pi$, which is a (stochastic) mapping from states to actions, a trajectory is generated in the form $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$, where $s_0 \sim d_0$ and, for each timestep $t \geq 0$, $a_t \sim \pi(s_t)$, $r_t \sim R(s_t, a_t)$, and $s_{t+1} \sim T(s_t, a_t)$. The goal of RL is to optimize or evaluate a policy based on the normalized expected discounted return: $R^\pi \triangleq (1-\gamma)\,\mathbb{E}_{M,\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$.
A useful and important concept throughout the paper is the discounted stationary distribution, which represents the long-term occupancy of state-action pairs:
$$d^\pi(s, a) \triangleq (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s, a_t = a \mid M, \pi).$$
From the definition, it can be observed that $R^\pi$ can be obtained by $R^\pi = \mathbb{E}_{(s,a) \sim d^\pi}[r(s, a)]$.

Offline RL and off-policy evaluation  In this paper, we focus on the offline RL problem, where the agent can only access a static dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ for the maximization of $R^\pi$. We consider a behavior-agnostic setting where we do not have any knowledge of the data collection process. We denote the empirical distribution of the dataset by $d^D$.

Before improving the policy, we first aim to better evaluate $R^\pi$ given a target policy $\pi$ and a static dataset $D$, which corresponds to an off-policy evaluation (OPE) problem. We mainly focus on a model-based approach, where the algorithm first estimates the unknown dynamics $\hat{T}, \hat{R}$ using the dataset $D$. This defines an approximate MDP $\hat{M} = \langle S, A, \hat{T}, \hat{R}, d_0, \gamma \rangle$, with the approximate expected discounted return $\hat{R}^\pi \triangleq (1-\gamma)\,\mathbb{E}_{\hat{M},\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$ obtained from $\hat{M}$. In this paper, we are interested in an MDP estimate $\hat{M}$ that effectively reduces the error in the evaluation of the policy $\pi$, $|R^\pi - \hat{R}^\pi|$.

In order to do so, we need to learn a good representation of a model that results in a small OPE error. We assume a bijective representation function $\phi : S \times A \rightarrow Z$, where $Z \subseteq \mathbb{R}^{d_z}$ is the representation space. We define the transition and the reward models in terms of the representation function $\phi$, i.e. $\hat{T} = \hat{T}_z \circ \phi$ and $\hat{R} = \hat{R}_z \circ \phi$. In practice, where we use a neural network for $\hat{T}$ and $\hat{R}$, $z$ can be chosen to be the output of an intermediate hidden layer, so that $\phi$ is represented by the lower layers and $\hat{T}_z, \hat{R}_z$ by the remaining upper layers. We denote by $d^\pi_\phi(z)$ the discounted stationary distribution on $Z$ induced by $d^\pi(s, a)$ under the representation function $z = \phi(s, a)$, and similarly for $d^D_\phi(z)$.

4 REPRESENTATION BALANCING OFFLINE MODEL-BASED RL

4.1 GENERALIZATION ERROR BOUND FOR MODEL-BASED OFF-POLICY EVALUATION

We aim to construct a model $\hat{M}$ from the dataset $D$ that can accurately evaluate the policy $\pi$, by minimizing a good upper bound of the policy evaluation error $|R^\pi - \hat{R}^\pi|$. We define the following point-wise model loss for notational convenience:
$$\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) = c_R \, D_{TV}\big(R(\cdot|s, a) \,\|\, \hat{R}_z(\cdot|\phi(s, a))\big) + c_T \, D_{TV}\big(T(\cdot|s, a) \,\|\, \hat{T}_z(\cdot|\phi(s, a))\big),$$
where $c_R = 2(1-\gamma)$ and $c_T = 2\gamma r_{\max}$. Then, we start by restating the simulation lemma (Kearns & Singh, 2002) to bound the policy evaluation error in terms of the point-wise model loss. The proof is available in Appendix A.

Lemma 4.1. Given an MDP $M$ and its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, the policy evaluation error of a policy $\pi$ can be bounded by:
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^\pi}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big]. \tag{1}$$

Lemma 4.1 has a natural interpretation: if the model error is small in the states frequently visited by following the policy $\pi$, the resulting policy evaluation error will also be small. However, minimizing the RHS of Eq. (1) in the off-policy evaluation (OPE) task is generally intractable since the distribution $d^\pi$ is not directly accessible. Therefore, the common practice has been to construct a maximum-likelihood MDP using $D$ while ignoring the distribution shift, but its OPE performance is not guaranteed. Instead, we will derive a tractable upper bound on the policy evaluation error that eliminates the direct dependence on $d^\pi$ in Eq. (1).
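Before moving on, it may help to see concretely what the expectation under $d^\pi$ in Eq. (1) would require. The sketch below (an illustration, not part of the paper's implementation; all names are hypothetical) estimates an expectation under $d^\pi$ by rolling out $\pi$ in the true environment and weighting the visit at timestep $t$ by $(1-\gamma)\gamma^t$; with $f(s,a) = r(s,a)$ this recovers $R^\pi$. In the offline setting only trajectories from the behavior policy are available, which is exactly why the RHS of Eq. (1) cannot be evaluated directly.

```python
import numpy as np

def dpi_expectation(trajectories, f, gamma):
    """Monte Carlo estimate of E_{(s,a)~d^pi}[f(s,a)] from on-policy rollouts of pi.

    trajectories: list of rollouts, each a list of (state, action, reward) tuples.
    The visit at timestep t is weighted by (1 - gamma) * gamma**t, matching the
    definition of the discounted stationary distribution d^pi. The estimate is
    self-normalized so that truncated rollouts still yield a proper average.
    """
    total, weight_sum = 0.0, 0.0
    for traj in trajectories:
        for t, (s, a, _r) in enumerate(traj):
            w = (1.0 - gamma) * gamma ** t
            total += w * f(s, a)
            weight_sum += w
    return total / weight_sum

# Example: with f(s, a) = r(s, a), the same weighting recovers R^pi
# (the normalized expected discounted return) from on-policy rollouts of pi.
```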
To eliminate this dependence on $d^\pi$, we adopt a distance between the two distributions over representations, $d^\pi_\phi$ and $d^D_\phi$, that can bound their difference in expectations: the Integral Probability Metric (IPM) (Müller, 1997):
$$\mathrm{IPM}_{\mathcal{G}}(p, q) = \sup_{g \in \mathcal{G}} \Big| \mathbb{E}_{z \sim p}[g(z)] - \mathbb{E}_{z \sim q}[g(z)] \Big|, \tag{2}$$
where particular choices of $\mathcal{G}$ make the IPM equivalent to well-known distances between distributions, e.g. the total variation distance or the Wasserstein distance (Sriperumbudur et al., 2009).

Theorem 4.2. Given an MDP $M$ and its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, assume that there exist a constant $B_\phi > 0$ and a function class $\mathcal{G} \subseteq \{g : Z \rightarrow \mathbb{R}\}$ such that $\frac{1}{B_\phi} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}\big(\phi^{-1}(\cdot)\big) \in \mathcal{G}$. Then, for any policy $\pi$,
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big] + B_\phi \, \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi). \tag{3}$$

This theorem is an adaptation of Lemma 1 of Shalit et al. (2017) to infinite-horizon model-based policy evaluation and follows from the definition of $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)$, which serves as an upper bound on the difference in the expectations of any function in $\mathcal{G}$. The first term in Eq. (3) corresponds to the fit to the data following $d^D$, while the second term serves as a regularizer. To see this, note that minimizing only the second term would yield a near-constant representation function, which would hurt the first term since such a representation cannot distinguish states and actions well enough. The bound thus exhibits a natural trade-off between optimizing the model to fit the data better and learning a representation that is invariant with respect to $d^\pi_\phi$ and $d^D_\phi$.

Nevertheless, the RHS of Eq. (3) still cannot be evaluated naively due to its dependence on $d^\pi$ in the IPM. We address this challenge via a change of variables, known as the DualDICE trick (Nachum et al., 2019a). Define $\nu : Z \rightarrow \mathbb{R}$ as an arbitrary function of (the representation of) state-action pairs that satisfies:
$$\nu\big(\phi(s, a)\big) = g\big(\phi(s, a)\big) + \gamma \, \mathbb{E}_{s' \sim T(s,a),\, a' \sim \pi(s')}\big[\nu\big(\phi(s', a')\big)\big], \quad \forall (s, a) \in S \times A.$$
Then we can rewrite the IPM as:
$$\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi) = \sup_{g \in \mathcal{G}} \Big| \mathbb{E}_{(s,a) \sim d^\pi}\big[g(\phi(s, a))\big] - \mathbb{E}_{(s,a) \sim d^D}\big[g(\phi(s, a))\big] \Big|$$
$$= \sup_{\nu \in \mathcal{F}} \Big| (1-\gamma)\, \mathbb{E}_{s \sim d_0,\, a \sim \pi(s)}\big[\nu(\phi(s, a))\big] - \mathbb{E}_{(s,a,s') \sim d^D,\, a' \sim \pi(s')}\big[\nu(\phi(s, a)) - \gamma \nu(\phi(s', a'))\big] \Big|, \tag{4}$$
$$\mathcal{F} = \Big\{ \nu : \nu(z) = \mathbb{E}_{T,\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t g(\phi(s_t, a_t)) \,\Big|\, (s_0, a_0) = \phi^{-1}(z)\Big],\; g \in \mathcal{G} \Big\}.$$
In other words, we are now taking a supremum over the new function class $\mathcal{F}$, whose members return the expected discounted sum of $g(\phi(s_t, a_t))$ when following the policy $\pi$ in the MDP $M$ from a given initial representation $z$. While it is now difficult to derive $\mathcal{F}$ from $\mathcal{G}$ explicitly, Eq. (3) remains valid as long as a sufficiently rich function class is used for $\mathcal{F}$. In this work, we choose $\mathcal{F}$ to be the family of functions in the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ with kernel $k$, which allows the following closed-form expression (see Lemma A.3 in the Appendix for details):
$$\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)^2 = \mathbb{E}\Big[ k\big(\phi(s,a), \phi(\bar{s},\bar{a})\big) + (1-\gamma)^2 k\big(\phi(s_0,a_0), \phi(\bar{s}_0,\bar{a}_0)\big) + \gamma^2 k\big(\phi(s',a'), \phi(\bar{s}',\bar{a}')\big)$$
$$\qquad - 2(1-\gamma)\, k\big(\phi(s_0,a_0), \phi(\bar{s},\bar{a})\big) - 2\gamma\, k\big(\phi(s',a'), \phi(\bar{s},\bar{a})\big) + 2\gamma(1-\gamma)\, k\big(\phi(s',a'), \phi(\bar{s}_0,\bar{a}_0)\big) \Big], \tag{5}$$
where the expectation is over two independent draws $s_0 \sim d_0$, $a_0 \sim \pi(s_0)$, $(s,a,s') \sim d^D$, $a' \sim \pi(s')$ and $\bar{s}_0 \sim d_0$, $\bar{a}_0 \sim \pi(\bar{s}_0)$, $(\bar{s},\bar{a},\bar{s}') \sim d^D$, $\bar{a}' \sim \pi(\bar{s}')$.

This completes our derivation of a tractable upper bound on the policy evaluation error (Eq. (3)), whose direct dependence on $d^\pi$ is eliminated by Eq. (5). Finally, we can train a model by minimizing this upper bound, which encourages learning a balanced representation while improving the data fit, and the trained model can readily be used for model-based OPE. The $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)^2$ in Eq. (5) can be estimated from finite random samples, and we denote its sample-based estimator by $\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)^2$.
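For concreteness, a minimal NumPy sketch of this sample-based estimator of Eq. (5) is given below. It assumes the representations have already been computed ($z_0 = \phi(s_0, a_0)$ for initial state-action pairs, $z = \phi(s, a)$ and $z' = \phi(s', a')$ for dataset transitions with $a' \sim \pi(s')$) and uses an RBF kernel as an example choice; the function names and the bandwidth are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gram matrix with k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def ipm_squared(z0, z, z_next, gamma, bandwidth=1.0):
    """Sample-based estimate of IPM(d^pi_phi, d^D_phi)^2 following Eq. (5).

    z0:     representations phi(s0, a0) with s0 ~ d0, a0 ~ pi(s0)   (n x d array)
    z:      representations phi(s, a) of dataset transitions        (n x d array)
    z_next: representations phi(s', a') with a' ~ pi(s')            (n x d array)
    """
    k = lambda A, B: rbf_kernel(A, B, bandwidth).mean()  # 1/n^2 sum over all pairs
    return (k(z, z)
            + (1.0 - gamma) ** 2 * k(z0, z0)
            + gamma ** 2 * k(z_next, z_next)
            - 2.0 * (1.0 - gamma) * k(z0, z)
            - 2.0 * gamma * k(z_next, z)
            + 2.0 * gamma * (1.0 - gamma) * k(z_next, z0))
```

Each `mean()` over a Gram matrix implements the $\frac{1}{n^2}\sum_{i,j}$ average over pairs of samples from the two independent draws in Eq. (5).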
We show in the following that a valid upper bound can be established using the sample-based estimators in place of the exact terms on the RHS of Eq. (3), under certain conditions.

Theorem 4.3. Given an MDP $M$, its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, and an RKHS $\mathcal{H}_k \subseteq (Z \rightarrow \mathbb{R})$ induced by a universal kernel $k$ with $\sup_{z \in Z} k(z, z) = \bar{k}$, assume that $f_{\phi, \hat{R}_z, \hat{T}_z}(z) = \mathbb{E}_{T,\pi}\big[\sum_{t=0}^{\infty} \gamma^t \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s_t, a_t) \,\big|\, (s_0, a_0) = \phi^{-1}(z)\big] \in \mathcal{H}_k$ with $B_\phi = \|f_{\phi, \hat{R}_z, \hat{T}_z}\|_{\mathcal{H}_k}$, and that the loss is bounded by $\bar{\mathcal{E}} = \sup_{s \in S, a \in A} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)$. Let $n$ be the number of data points in $D$. With probability at least $1 - 2\delta$,
$$\big|R^\pi - \hat{R}^\pi\big| \leq \frac{1}{n} \sum_{(s,a) \in D} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) + B_\phi \, \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) + \epsilon(n, \delta),$$
where $\epsilon(n, \delta)$ collects the concentration terms from Lemma A.3 and Hoeffding's inequality, each of order $O(\sqrt{\log(1/\delta)/n})$.

This result is proved by adapting the convergence results for the empirical estimate of the MMD (Gretton et al., 2012) and Hoeffding's inequality (Hoeffding, 1963). With the choice of an RKHS $\mathcal{H}_k$, we can now interpret $B_\phi$ as the RKHS norm $\|f_{\phi, \hat{R}_z, \hat{T}_z}\|_{\mathcal{H}_k}$, which captures the magnitude and the smoothness of the expected cumulative model loss $f_{\phi, \hat{R}_z, \hat{T}_z}$. In general, assuming smooth underlying dynamics, we can expect $B_\phi$ to be small when the model error is small. Furthermore, although $\bar{k}$ depends on the kernel function we use, we can always let $\bar{k} = 1$ and subsume it into $B_\phi$ as long as it is bounded, i.e. using $B_\phi \sqrt{\bar{k}}$ in place of $B_\phi$. In the next section, we develop algorithms based on practical approximations of Eq. (3).

Detailed comparison to RepBM  As previously stated, RepBM (Liu et al., 2018b) is a model-based finite-horizon OPE algorithm that trains the model to have a balanced representation $\phi$, which is encouraged to be invariant under the target and behavior policies. Specifically, given the deterministic target policy $\pi$ and the behavior policy $\mu$, at each timestep $t$ it defines the factual distribution on $Z$, conditioned on the actions up to timestep $t$ having been executed according to the policy $\pi$:
$$p^F_{\phi,t}(z) = \Pr\big(z_t \,\big|\, M, \mu, a_{0:t} = \pi(s_{0:t})\big),$$
and the counterfactual distribution on $Z$ under the same condition except for the action at timestep $t$:
$$p^{CF}_{\phi,t}(z) = \Pr\big(z_t \,\big|\, M, \mu, a_{0:t-1} = \pi(s_{0:t-1}), a_t \neq \pi(s_t)\big).$$
Then, RepBM bounds the OPE error as,¹
$$\big|R^\pi - \hat{R}^\pi\big| \leq (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Big( \mathbb{E}_{p(s_t | M, \mu, a_{0:t} = \pi(s_{0:t}))}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s_t, \pi(s_t))\big] + B_{\phi,t} \, \mathrm{IPM}_{\mathcal{G}_t}(p^F_{\phi,t}, p^{CF}_{\phi,t}) \Big).$$
Although RepBM achieves performance improvements over other OPE algorithms, we found a number of practical challenges: by the definition of $\mathrm{IPM}_{\mathcal{G}_t}(p^F_{\phi,t}, p^{CF}_{\phi,t})$, it requires a discrete-action environment and a deterministic policy $\pi$, conditions that many practical RL settings cannot meet. In addition, since the sample-based estimation of $\mathrm{IPM}_{\mathcal{G}_t}(p^F_{\phi,t}, p^{CF}_{\phi,t})$ requires samples consistent with the policy $\pi$, i.e. $a_{0:t-1} = \pi(s_{0:t-1})$, the algorithm rejects exponentially many samples with respect to $t$, which is the curse of horizon (Liu et al., 2018a). When there is a large difference between the behavior and the target policies in long-term tasks, their implementation becomes close to using the maximum-likelihood objective, which we also observe empirically in our experiments. In contrast, our work is free from the above practical limitations: it performs balancing between the discounted stationary distribution $d^\pi$ and the data distribution $d^D$, leveraging recent advances in stationary distribution correction estimation (i.e. the DualDICE trick) to overcome the difficulty of evaluating the expectation with respect to $d^\pi$ required by the IPM in the objective.

¹We adapted their formulation to the infinite-horizon discounted MDP setting.
4.2 REPRESENTATION BALANCING WITH STATIONARY DISTRIBUTION ESTIMATION

In the following, we describe algorithms for OPE and offline RL based on practical approximations of Eq. (3), which we call the RepB-SDE framework.

Objective for off-policy evaluation  As mentioned earlier, we aim to minimize the upper bound of the OPE error $|R^\pi - \hat{R}^\pi|$ specified in Theorem 4.2. To make the RHS of Eq. (3) tractable for optimization, we replace the intractable total variation distance with the KL-divergence, which can be minimized by maximizing the data log-likelihood. We also replace the IPM with its sample-based estimator to obtain the learning objective:
$$\min_{\phi, \hat{R}_z, \hat{T}_z} \; \mathbb{E}_{(s,a,r,s') \sim D}\Big[ \underbrace{-\log \hat{R}_z\big(r \,|\, \phi(s,a)\big)}_{\approx\, D_{KL}(R(\cdot|s,a) \,\|\, \hat{R}_z(\cdot|\phi(s,a)))} \; \underbrace{-\log \hat{T}_z\big(s' \,|\, \phi(s,a)\big)}_{\approx\, D_{KL}(T(\cdot|s,a) \,\|\, \hat{T}_z(\cdot|\phi(s,a)))} \Big] + \alpha_M \, \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi). \tag{6}$$
The constant $B_\phi$ in Theorem 4.2 depends on the function classes and cannot be estimated, so we replace it with a tunable hyperparameter $\alpha_M$ that balances data fit against representation invariance. Note that $\alpha_M = 0$ recovers the simple maximum-likelihood objective. By simulating the target policy $\pi$ under the environment model $(\hat{T}, \hat{R})$ obtained by minimizing Eq. (6), we can perform a model-based OPE that approximately minimizes the upper bound of the OPE error.

Objectives for offline model-based RL  By rearranging Eq. (3), we have
$$R^\pi \geq \hat{R}^\pi - \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big] - B_\phi \, \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi). \tag{7}$$
We can thus maximize the RHS of Eq. (7) to obtain the model and the policy that maximize a lower bound of the true return $R^\pi$. Similar to the derivation of Eq. (6), we replace the total variation distance with the KL-divergence to obtain the following learning objectives:
$$\mathcal{L}_{\hat{M}}(\hat{M}, \pi, \alpha_M) = \mathbb{E}_{d^D}\big[ -\log \hat{R}_z\big(r \,|\, \phi(s,a)\big) - \log \hat{T}_z\big(s' \,|\, \phi(s,a)\big) \big] + \alpha_M \, \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi), \tag{8}$$
$$J_\pi(\pi, \hat{M}, \alpha_\pi) = \hat{R}^\pi - \alpha_\pi \, \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi), \tag{9}$$
where the return term in Eq. (9) can be optimized using various model-based RL algorithms, e.g. with planning (Chua et al., 2018) or with a model-free learner (Janner et al., 2019). By alternating between minimizing $\mathcal{L}_{\hat{M}}$ with respect to $\hat{M}$ and maximizing $J_\pi$ with respect to $\pi$ using stochastic gradient methods, we can perform offline model-based RL that approximately maximizes the lower bound of the true return $R^\pi$.

Implementation details  Following recent practice (Chua et al., 2018; Janner et al., 2019; Yu et al., 2020), we model the dynamics $(\hat{T}, \hat{R})$ using a bootstrapped ensemble of neural networks. To optimize a policy based on the objective, we perform full rollouts (until reaching a terminal state or the maximum number of timesteps) using the learned dynamics $(\hat{T}, \hat{R})$. While generating the rollouts, we pessimistically augment the estimated reward function with a penalty proportional to the bootstrapped uncertainty, which helped the algorithm perform robustly; we suspect that the difficulty of computing an accurate IPM estimate is what makes this additional pessimism beneficial. We store the generated experiences in a separate dataset $\hat{D}$ and update the policy $\pi$ with IPM-regularized soft actor-critic (SAC) (Haarnoja et al., 2018) using samples from both datasets $D \cup \hat{D}$, similar to MBPO (Janner et al., 2019). Since the presented model objective requires a policy $\pi$ to perform balancing, we initially train the model and the policy with $\alpha_M = \alpha_\pi = 0$: $\hat{M}_0 = \arg\min_{\hat{M}} \mathcal{L}_{\hat{M}}(\hat{M}, \cdot, 0)$ and $\pi_0 = \arg\max_{\pi} J_\pi(\pi, \hat{M}_0, 0)$. We then retrain the model and the policy using $\pi_0$: $\hat{M}_1 = \arg\min_{\hat{M}} \mathcal{L}_{\hat{M}}(\hat{M}, \pi_0, \alpha_M)$ and $\pi_1 = \arg\max_{\pi} J_\pi(\pi, \hat{M}_1, \alpha_\pi)$ for some non-negative $\alpha_M$ and $\alpha_\pi$. While it is desirable to repeat the optimization of the model and the policy until convergence, we did not observe significant improvement after the first iteration and report the performance of the policy after the first iteration, $\pi_1$.
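As an illustration of how the data-fit part of Eq. (6)/(8) could be implemented, the PyTorch sketch below combines the Gaussian negative log-likelihood of the next-state difference and reward with the IPM regularizer. The network sizes, the `ipm_estimate` argument (e.g. a differentiable square root of the kernel estimator sketched earlier), and all names are assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class RepresentationDynamics(nn.Module):
    """Representation phi (lower layers) and a Gaussian head for [delta_state, reward]."""
    def __init__(self, obs_dim, act_dim, rep_dim=200, hidden=200):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, rep_dim), nn.Tanh())
        self.head = nn.Linear(rep_dim, 2 * (obs_dim + 1))  # means and log-stds

    def forward(self, obs, act):
        z = self.phi(torch.cat([obs, act], dim=-1))
        mean, log_std = self.head(z).chunk(2, dim=-1)
        return z, mean, log_std.clamp(-10.0, 2.0)

def model_objective(model, batch, ipm_estimate, alpha_m):
    """Negative log-likelihood of the observed (s' - s, r) plus the IPM regularizer."""
    obs, act, rew, next_obs = batch
    _, mean, log_std = model(obs, act)
    target = torch.cat([next_obs - obs, rew.unsqueeze(-1)], dim=-1)
    nll = -torch.distributions.Normal(mean, log_std.exp()).log_prob(target).sum(-1).mean()
    return nll + alpha_m * ipm_estimate
```

With `alpha_m = 0` this reduces to the maximum-likelihood baseline; gradients flow into `phi` both through the likelihood terms and, when the IPM estimate is computed from `phi`'s outputs, through the regularizer.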
Figure 1: The OPE results of different model learning algorithms (Baseline, RepBM, RepB-SDE (ours)) with varying off-policyness on the x-axis, for CartPole-v0, Acrobot-v1, and the HIV simulator. The y-axis plots the normalized individual MSE on test trajectories, where the performance of the baseline model is set to 1. The tasks are infinite-horizon discounted environments (γ = 0.98 for HIV, γ = 0.99 for the others), truncated at t = 1000. The experiments are repeated 200 times, and the error bars indicate 95% confidence intervals.

5 EXPERIMENTS

We demonstrate the effectiveness of the RepB-SDE framework by comparing the OPE performance of Eq. (6) to that of RepBM and by evaluating the presented model-based offline RL algorithm on benchmark tasks. The code used to produce the results is available online.² A detailed description of the experiments can be found in Appendix B.

5.1 MODEL-BASED OFF-POLICY EVALUATION

For the sake of comparison with RepBM, we test our OPE algorithm on three continuous-state, discrete-action tasks where the goal is to evaluate a deterministic target policy. We trained a suboptimal deterministic target policy π and used an ϵ-greedy version of it with various values of ϵ as the data collection policy. In each experiment, we trained environment models for a fixed number of epochs with three different objectives: the simple maximum-likelihood baseline, the step-wise representation balancing objective used in RepBM (Liu et al., 2018b), and the presented OPE objective of RepB-SDE (Eq. (6)). We measured the individual mean squared error (Liu et al., 2018b). The normalized error of each algorithm, relative to the error of the baseline valued at 1, is presented in Figure 1.

We observe that the presented RepB-SDE objective significantly reduces the OPE error relative to the baseline, outperforming RepBM in most of the cases. As the off-policyness between the policies (ϵ) increases, representation balancing algorithms should in principle benefit more compared to the maximum-likelihood baseline. However, the results show that the performance of RepBM barely improves, due to its increased sample rejection rate under large ϵ.

²https://github.com/dlqudwns/repb-sde

Table 1: Normalized scores on D4RL MuJoCo benchmark datasets (Fu et al., 2020), where a score of 0 corresponds to a random policy and 100 corresponds to a converged SAC policy. All results (except MF, which is taken from Fu et al. (2020) and Kumar et al. (2020)) are averaged over 5 runs, where ± denotes the standard error. The highest scores are highlighted in boldface.
| Dataset type | Environment | RepB-SDE (ours) | RP | Base | MOPO | MF | BC |
|---|---|---|---|---|---|---|---|
| Random | Walker2d | **21.1 ± 1.0** | 18.4 | 16.4 | 1.3 | 7.3 (BEAR) | 0.0 |
| Medium | Walker2d | 72.1 ± 1.9 | 56.3 | 5.5 | -0.1 | **81.1** (BRAC) | 7.7 |
| Med-Replay | Walker2d | **49.8 ± 11.4** | 41.1 | 6.2 | 47.8 | 26.7 (CQL) | 8.0 |
| Med-Expert | Walker2d | 88.8 ± 6.9 | 72.6 | 51.7 | 32.4 | **98.7** (CQL) | 3.3 |
| Random | Hopper | 8.6 ± 1.0 | 8.3 | 8.3 | 9.1 | **12.2** (BRAC) | 9.0 |
| Medium | Hopper | 34.0 ± 2.8 | 27.5 | 19.8 | 19.2 | **58.0** (CQL) | 34.5 |
| Med-Replay | Hopper | 62.2 ± 6.7 | 49.8 | 32.9 | **80.8** | 48.6 (CQL) | 13.2 |
| Med-Expert | Hopper | 82.6 ± 7.0 | 74.0 | 19.1 | 23.2 | **111.0** (CQL) | 41.9 |
| Random | HalfCheetah | 32.9 ± 1.1 | 31.3 | 26.1 | 29.9 | **35.4** (CQL) | -2.4 |
| Medium | HalfCheetah | **49.1 ± 0.3** | 47.3 | 22.1 | 45.0 | 46.3 (BRAC) | 31.0 |
| Med-Replay | HalfCheetah | **57.5 ± 0.8** | 53.6 | 54.0 | 53.8 | 47.7 (BRAC) | 14.7 |
| Med-Expert | HalfCheetah | 55.4 ± 8.3 | 36.7 | 31.9 | **91.0** | 64.7 (BCQ) | 29.5 |

5.2 OFFLINE REINFORCEMENT LEARNING WITH REPRESENTATION-BALANCED MODEL

We evaluate the offline model-based RL algorithm presented in Section 4.2 on a subset of the D4RL benchmark (Fu et al., 2020): four types of datasets (Random, Medium, Medium-Replay, and Medium-Expert) from three MuJoCo environments (HalfCheetah-v2, Hopper-v2, and Walker2d-v2) (Todorov et al., 2012). The Random dataset contains $10^6$ experience tuples from a random policy. The Medium dataset contains $10^6$ experience tuples from a policy trained to approximately 1/3 of the performance of the expert, where the expert is an agent trained to completion with SAC. The Med-Replay dataset contains $10^5$ ($2 \times 10^5$ for Walker2d-v2) experience tuples taken from the replay buffer of a policy trained up to the performance of the medium agent. The Med-Expert dataset is a Medium dataset combined with $10^6$ samples from the expert. This experimental setting exactly follows that of Yu et al. (2020) and Argenson & Dulac-Arnold (2020).

The normalized score of each algorithm is presented in Table 1. MF denotes the best score among offline model-free algorithms (taken from Fu et al. (2020) and Kumar et al. (2020)), including SAC (Haarnoja et al., 2018), BCQ (Fujimoto et al., 2019), BEAR (Kumar et al., 2019), BRAC (Wu et al., 2019), AWR (Peng et al., 2019), cREM (Agarwal et al., 2020), AlgaeDICE (Nachum et al., 2019b), and CQL (Kumar et al., 2020); the algorithm achieving the reported score is shown next to the number. Base denotes the most naive baseline, which maximizes the estimated policy return under the maximum-likelihood model. RP denotes Base equipped with the reward penalty based on the bootstrapped uncertainty of the model, which is equivalent to π₀ described in Section 4.2. RepB-SDE denotes the performance after a single iteration of our algorithm, corresponding to π₁. We also provide BC, the performance of direct behavior cloning from the data, and MOPO (Yu et al., 2020), an offline model-based RL algorithm that optimizes the policy on truncated rollouts with a heuristic reward penalty.

The significant gap between RepB-SDE and RP shows the advantage brought by our framework that encourages a balanced representation. Our approach was less successful on some of the datasets (mostly in the Hopper-v2 environment); we hypothesize that the different conservative training techniques: the behavior regularization exploited by the model-free algorithms, the rollout truncation in MOPO, and the pessimistic training based on bootstrapped uncertainty estimates adopted in our algorithm exhibit their strengths on different datasets.
For example, it may be the case that the ensemble models are overconfident especially in Hopper-v2 and should be regularized with more explicit methods. Nevertheless, we emphasize that the presented framework can be used jointly with any other conservative training technique to improve its performance.

6 CONCLUSION AND FUTURE WORK

In this paper, we presented RepB-SDE, a framework for balancing the model representation with stationary distribution estimation, aiming to obtain a model robust to the distribution shift that arises in off-policy and offline RL. We started from the theoretical observation that the model-based policy evaluation error can be upper-bounded by the data fit and the distance between two distributions in the representation space. Motivated by the bound, we presented a model learning objective for off-policy evaluation and model-based offline policy optimization. RepB-SDE can be seen as an extension of RepBM that addresses the curse of horizon by leveraging recent advances in stationary distribution correction estimation (i.e. the DualDICE trick). Using the stationary distribution also frees us from other limitations of RepBM, allowing application to more practical settings. To the best of our knowledge, this is the first attempt to introduce an augmented objective for learning a model robust to a specific distribution shift in offline RL. In the experiments, we empirically demonstrated that we can significantly reduce the OPE error relative to the baseline, outperforming RepBM in most cases. We also showed that the robust model helps in offline model-based policy optimization, yielding state-of-the-art performance on a representative set of D4RL benchmarks. We emphasize that our approach can be directly adopted in many other model-based offline RL algorithms.

There are a number of promising directions for future work. Most importantly, we have not leveraged the learned representation when optimizing the policy, yet it is very natural to do so. We could incorporate the representation into the policy by assuming energy-based models, but this would make the computation of entropy intractable in entropy-regularized policy optimization algorithms. It would also be interesting to see whether the proposed framework for learning balanced representations can benefit off-policy (and offline) model-free methods.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation (NRF) of Korea (NRF-2019M3F2A1072238 and NRF-2019R1A2C1087634), and the Ministry of Science and Information communication Technology (MSIT) of Korea (IITP No. 2019-0-00075, IITP No. 2020-0-00940 and IITP No. 2017-0-01779 XAI).

REFERENCES

Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.

Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. arXiv preprint arXiv:2008.05556, 2020.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, 2007.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018.

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019.

Carles Gelada and Marc G Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In AAAI Conference on Artificial Intelligence, 2019.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.

Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In International Conference on Machine Learning, 2017.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 2019.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.

Fredrik D Johansson, Nathan Kallus, Uri Shalit, and David Sontag. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018.

Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209-232, 2002.

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, 2019.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020.

Byung-Jun Lee, Jongmin Lee, Peter Vrancx, Dongho Kim, and Kee-Eung Kim. Batch reinforcement learning with hyperparameter gradients. In International Conference on Machine Learning, 2020.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, 2018a.
Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo A Faisal, Finale Doshi-Velez, and Emma Brunskill. Representation balancing MDPs for off-policy policy evaluation. In Advances in Neural Information Processing Systems, 2018b.

Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.

Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Gu. Deployment-efficient reinforcement learning via model-based offline optimization. arXiv preprint arXiv:2006.03647, 2020.

Colin McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148-188, 1989.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Ali Mousavi, Lihong Li, Qiang Liu, and Denny Zhou. Black-box off-policy estimation for infinite-horizon reinforcement learning. In International Conference on Learning Representations, 2020.

Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, 2019a.

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019b.

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: A self-gated activation function. arXiv preprint arXiv:1710.05941, 2017.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Uri Shalit, Fredrik D Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In International Conference on Machine Learning, 2017.

Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.

Richard S Sutton, Andrew G Barto, et al. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, 2009.

Phillip Swazinna, Steffen Udluft, and Thomas Runkler. Overcoming model bias for robust offline deep reinforcement learning. arXiv preprint arXiv:2008.05533, 2020.

Ziyang Tang, Yihao Feng, Lihong Li, Dengyong Zhou, and Qiang Liu. Doubly robust bias reduction in infinite horizon off-policy estimation. In International Conference on Learning Representations, 2020.
Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. In Advances in Neural Information Processing Systems, 2020.

Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, 2013.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020.

A PROOFS

Lemma 4.1. Given an MDP $M$ and its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, the policy evaluation error of a policy $\pi$ can be bounded by:
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^\pi}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big]. \tag{1}$$

Proof. We define the value function $V^\pi(s) = \mathbb{E}_{M,\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t \,\big|\, s_0 = s\big]$, the expected discounted return starting from the state $s$. By definition, $R^\pi = (1-\gamma)\,\mathbb{E}_{s \sim d_0}[V^\pi(s)]$, and the recursion $V^\pi(s) = \mathbb{E}_{a \sim \pi(s)}\big[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim T(s,a)}[V^\pi(s')] \big]$ holds; the approximate value function $\hat{V}^\pi(s) = \mathbb{E}_{\hat{M},\pi}\big[\sum_{t=0}^{\infty} \gamma^t r_t \,\big|\, s_0 = s\big]$ of the MDP estimate $\hat{M}$ satisfies the analogous recursion. Expanding the difference and adding and subtracting $\hat{V}^\pi(s_1) T(s_1|s_0, a_0)$,
$$V^\pi(s_0) - \hat{V}^\pi(s_0) = \mathbb{E}_{a_0 \sim \pi(s_0)}\Big[ r_0 - \hat{r}_0 + \gamma \int \big(V^\pi(s_1) - \hat{V}^\pi(s_1)\big) T(s_1|s_0,a_0)\, ds_1 + \gamma \int \hat{V}^\pi(s_1) \big(T(s_1|s_0,a_0) - \hat{T}(s_1|s_0,a_0)\big)\, ds_1 \Big],$$
where the first integral has the same form as the left-hand side and can be unrolled recursively. Unrolling over timesteps and collecting the discounted visitation probabilities into $d^\pi$ yields
$$R^\pi - \hat{R}^\pi = \mathbb{E}_{(s,a) \sim d^\pi}\Big[ \int r \big(R(r|s,a) - \hat{R}(r|s,a)\big)\, dr + \gamma \int \hat{V}^\pi(s') \big(T(s'|s,a) - \hat{T}(s'|s,a)\big)\, ds' \Big].$$
Bounding the reward term by the total variation distance between $R(\cdot|s,a)$ and $\hat{R}_z(\cdot|\phi(s,a))$, and the transition term using $0 \leq \hat{V}^\pi(s') \leq r_{\max}/(1-\gamma)$ together with the total variation distance between $T(\cdot|s,a)$ and $\hat{T}_z(\cdot|\phi(s,a))$, gives
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^\pi}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s,a)\big]. \qquad \square$$

Theorem 4.2. Given an MDP $M$ and its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, assume that there exist a constant $B_\phi > 0$ and a function class $\mathcal{G} \subseteq \{g : Z \rightarrow \mathbb{R}\}$ such that $\frac{1}{B_\phi}\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(\phi^{-1}(\cdot)) \in \mathcal{G}$. Then, for any policy $\pi$,
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big] + B_\phi \, \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi). \tag{3}$$

Proof. From Lemma 4.1, we directly have:
$$\big|R^\pi - \hat{R}^\pi\big| \leq \mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big] + \int \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) \big(d^\pi(s,a) - d^D(s,a)\big)\, ds\, da.$$
Then, with the change of variables $z = \phi(s, a)$ (valid since $\phi$ is bijective),
$$\int \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) \big(d^\pi(s,a) - d^D(s,a)\big)\, ds\, da = \int \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}\big(\phi^{-1}(z)\big) \big(d^\pi_\phi(z) - d^D_\phi(z)\big)\, dz$$
$$= B_\phi \int \tfrac{1}{B_\phi} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}\big(\phi^{-1}(z)\big) \big(d^\pi_\phi(z) - d^D_\phi(z)\big)\, dz \leq B_\phi \sup_{g \in \mathcal{G}} \Big| \int g(z) \big(d^\pi_\phi(z) - d^D_\phi(z)\big)\, dz \Big| = B_\phi \, \mathrm{IPM}_{\mathcal{G}}\big(d^\pi_\phi, d^D_\phi\big),$$
since $B_\phi > 0$ and $\frac{1}{B_\phi}\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(\phi^{-1}(\cdot)) \in \mathcal{G}$. $\square$

We next state some previous results that are required for the remaining proofs.

Theorem A.1 (McDiarmid's inequality (McDiarmid, 1989)).
Let $\{X_i\}_{i=1}^n$ be independent random variables taking values in a set $\mathcal{X}$, and assume that $f : \mathcal{X}^n \rightarrow \mathbb{R}$ satisfies the bounded-differences condition
$$\sup_{\{x_i\}_{i=1}^n \in \mathcal{X}^n,\, x' \in \mathcal{X}} \big| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x', x_{i+1}, \ldots, x_n) \big| \leq c_i. \tag{10}$$
Then for every $\epsilon > 0$,
$$\Pr\Big\{ f(\{X_i\}_{i=1}^n) - \mathbb{E}_{\{X_i\}_{i=1}^n}\big[f(\{X_i\}_{i=1}^n)\big] \geq \epsilon \Big\} \leq \exp\Big( \frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2} \Big).$$

Lemma A.2 (Rademacher complexity of an RKHS ball (Bartlett & Mendelson, 2002)). Let $\mathcal{F}$ be the unit ball in a universal RKHS on the compact domain $\mathcal{X}$, with kernel bounded according to $0 \leq k(x, x') \leq \bar{k}$. Let $\{x_i\}_{i=1}^n$ be an i.i.d. sample of size $n$ drawn according to a probability measure $p$ on $\mathcal{X}$, and let $\sigma_i$ be i.i.d. taking values in $\{-1, 1\}$ with equal probability. Then the Rademacher complexity
$$R_n(\mathcal{F}) = \mathbb{E}_{\{x_i\}_{i=1}^n, \sigma}\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{2}{n} \sum_{i=1}^n \sigma_i f(x_i) \Big| \Big]$$
is upper bounded by $2\sqrt{\bar{k}/n}$, which follows from Lemma 22 of Bartlett & Mendelson (2002).

Now we prove the following using the results above.

Lemma A.3. Let $\mathcal{H}_k$ be an RKHS associated with a universal kernel $k(\cdot, \cdot)$, and let $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ be the inner product of $\mathcal{H}_k$, which satisfies the reproducing property $\nu(z) = \langle \nu, k(\cdot, z) \rangle_{\mathcal{H}_k}$. When $\mathcal{G}$ is chosen such that
$$\mathcal{G} = \Big\{ g \in (Z \rightarrow \mathbb{R}) : g(z) = \nu(z) - \gamma\, \mathbb{E}_{s' \sim T(\phi^{-1}(z)),\, a' \sim \pi(s')}\big[\nu(\phi(s', a'))\big],\; \nu \in (Z \rightarrow \mathbb{R}),\; \langle \nu, \nu \rangle_{\mathcal{H}_k} \leq 1 \Big\},$$
the $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)$ has the closed form
$$\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)^2 = \mathbb{E}\Big[ k\big(\phi(s,a), \phi(\bar{s},\bar{a})\big) + (1-\gamma)^2 k\big(\phi(s_0,a_0), \phi(\bar{s}_0,\bar{a}_0)\big) + \gamma^2 k\big(\phi(s',a'), \phi(\bar{s}',\bar{a}')\big)$$
$$\qquad - 2(1-\gamma)\, k\big(\phi(s_0,a_0), \phi(\bar{s},\bar{a})\big) - 2\gamma\, k\big(\phi(s',a'), \phi(\bar{s},\bar{a})\big) + 2\gamma(1-\gamma)\, k\big(\phi(s',a'), \phi(\bar{s}_0,\bar{a}_0)\big) \Big],$$
where the expectation is over independent draws $s_0 \sim d_0$, $a_0 \sim \pi(s_0)$, $(s,a,s') \sim d^D$, $a' \sim \pi(s')$ and $\bar{s}_0 \sim d_0$, $\bar{a}_0 \sim \pi(\bar{s}_0)$, $(\bar{s},\bar{a},\bar{s}') \sim d^D$, $\bar{a}' \sim \pi(\bar{s}')$. Furthermore, suppose that $\bar{k} \geq \sup_{z \in Z} k(z, z)$. The sample-based estimator $\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)^2$ of $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)^2$ from $n$ samples satisfies, with probability at least $1-\delta$,
$$\Big| \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi) - \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) \Big| \leq O\Big( \sqrt{\bar{k} \log(1/\delta) / n} \Big). \tag{13}$$

Proof. Below we write $\mathrm{IPM}_{\mathcal{G}}$ as shorthand for $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi)$. As in Eq. (4), we can rewrite the IPM as
$$\mathrm{IPM}_{\mathcal{G}} = \sup_{\nu \in \mathcal{F}} \Big| (1-\gamma)\, \mathbb{E}_{s \sim d_0,\, a \sim \pi(s)}\big[\nu(\phi(s, a))\big] - \mathbb{E}_{(s,a,s') \sim d^D,\, a' \sim \pi(s')}\big[\nu(\phi(s, a)) - \gamma \nu(\phi(s', a'))\big] \Big|,$$
where $\mathcal{F} = \{\nu \in (Z \rightarrow \mathbb{R}) : \langle \nu, \nu \rangle_{\mathcal{H}_k} \leq 1\}$ is the unit ball in the RKHS $\mathcal{H}_k$. Using the reproducing property of $\mathcal{H}_k$, the term inside the supremum equals $\langle \nu, \nu^* \rangle_{\mathcal{H}_k}$, where
$$\nu^*(\cdot) = (1-\gamma)\, \mathbb{E}_{s \sim d_0,\, a \sim \pi(s)}\big[k(\cdot, \phi(s, a))\big] - \mathbb{E}_{(s,a,s') \sim d^D,\, a' \sim \pi(s')}\big[k(\cdot, \phi(s, a)) - \gamma k(\cdot, \phi(s', a'))\big].$$
By the Cauchy-Schwarz inequality and $\langle \nu, \nu \rangle_{\mathcal{H}_k} \leq 1$ for $\nu \in \mathcal{F}$, $\langle \nu, \nu^* \rangle_{\mathcal{H}_k}^2 \leq \langle \nu, \nu \rangle_{\mathcal{H}_k} \langle \nu^*, \nu^* \rangle_{\mathcal{H}_k} \leq \langle \nu^*, \nu^* \rangle_{\mathcal{H}_k}$, so $\sup_{\nu \in \mathcal{F}} \langle \nu, \nu^* \rangle_{\mathcal{H}_k}^2 = \langle \nu^*, \nu^* \rangle_{\mathcal{H}_k}$. Using the property $\langle k(\cdot, z), k(\cdot, z') \rangle_{\mathcal{H}_k} = k(z, z')$, expanding $\langle \nu^*, \nu^* \rangle_{\mathcal{H}_k}$ gives the closed-form expression in the lemma.

Now we prove the error bound of the estimator. First, we write $\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi) = \sup_{\nu \in \mathcal{F}} |f_1(\nu) + f_2(\nu) + f_3(\nu)|$, where
$$f_1(\nu) = (1-\gamma)\, \mathbb{E}_{s \sim d_0,\, a \sim \pi(s)}\big[\nu(\phi(s, a))\big], \quad f_2(\nu) = -\mathbb{E}_{(s,a) \sim d^D}\big[\nu(\phi(s, a))\big], \quad f_3(\nu) = \gamma\, \mathbb{E}_{(s,a,s') \sim d^D,\, a' \sim \pi(s')}\big[\nu(\phi(s', a'))\big].$$
Given $n$ samples $\{s_0^{(i)}, a_0^{(i)}, s^{(i)}, a^{(i)}, s'^{(i)}, a'^{(i)}\}_{i=1}^n$ from the generative process $s_0^{(i)} \sim d_0$, $a_0^{(i)} \sim \pi(s_0^{(i)})$, $(s^{(i)}, a^{(i)}, s'^{(i)}) \sim d^D$, $a'^{(i)} \sim \pi(s'^{(i)})$ for all $i$, we define the sample-based estimator
$$\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)^2 = \frac{1}{n^2} \sum_{i,j} \Big[ k\big(\phi(s^{(i)},a^{(i)}), \phi(s^{(j)},a^{(j)})\big) + (1-\gamma)^2 k\big(\phi(s_0^{(i)},a_0^{(i)}), \phi(s_0^{(j)},a_0^{(j)})\big) + \gamma^2 k\big(\phi(s'^{(i)},a'^{(i)}), \phi(s'^{(j)},a'^{(j)})\big)$$
$$\qquad - 2(1-\gamma)\, k\big(\phi(s_0^{(i)},a_0^{(i)}), \phi(s^{(j)},a^{(j)})\big) - 2\gamma\, k\big(\phi(s'^{(i)},a'^{(i)}), \phi(s^{(j)},a^{(j)})\big) + 2\gamma(1-\gamma)\, k\big(\phi(s'^{(i)},a'^{(i)}), \phi(s_0^{(j)},a_0^{(j)})\big) \Big].$$
Deriving in the reverse order, the estimator can equivalently be written as a supremum, which we split into three parts: $\widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) = \sup_{\nu \in \mathcal{F}} |\hat{f}_1(\nu) + \hat{f}_2(\nu) + \hat{f}_3(\nu)|$, where
$$\hat{f}_1(\nu) = \frac{1-\gamma}{n} \sum_{i=1}^n \nu\big(\phi(s_0^{(i)}, a_0^{(i)})\big), \quad \hat{f}_2(\nu) = -\frac{1}{n} \sum_{i=1}^n \nu\big(\phi(s^{(i)}, a^{(i)})\big), \quad \hat{f}_3(\nu) = \frac{\gamma}{n} \sum_{i=1}^n \nu\big(\phi(s'^{(i)}, a'^{(i)})\big).$$
We can bound the error of the sample-based estimator by the individual errors:
$$\Big| \mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi) - \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) \Big| = \Big| \sup_{\nu \in \mathcal{F}} |f_1(\nu) + f_2(\nu) + f_3(\nu)| - \sup_{\nu \in \mathcal{F}} |\hat{f}_1(\nu) + \hat{f}_2(\nu) + \hat{f}_3(\nu)| \Big|$$
$$\leq \sup_{\nu \in \mathcal{F}} \big| f_1(\nu) - \hat{f}_1(\nu) \big| + \sup_{\nu \in \mathcal{F}} \big| f_2(\nu) - \hat{f}_2(\nu) \big| + \sup_{\nu \in \mathcal{F}} \big| f_3(\nu) - \hat{f}_3(\nu) \big|.$$
Changing a single sample $(s_0^{(i)}, a_0^{(i)})$ changes $\sup_{\nu \in \mathcal{F}} |f_1(\nu) - \hat{f}_1(\nu)|$ by at most $2(1-\gamma)\sqrt{\bar{k}}/n$, where $\bar{k} = \sup_{z \in Z} k(z, z)$, so McDiarmid's inequality (Theorem A.1) controls its deviation from its expectation; that expectation is in turn bounded, via the standard symmetrization argument, by the Rademacher complexity of $\mathcal{F}$ and hence by Lemma A.2. Combining the two results bounds $\sup_{\nu \in \mathcal{F}} |f_1(\nu) - \hat{f}_1(\nu)|$ by a term of order $(1-\gamma)\sqrt{\bar{k}\log(1/\delta)/n}$; the analogous arguments bound $\sup_{\nu \in \mathcal{F}} |f_2(\nu) - \hat{f}_2(\nu)|$ and $\sup_{\nu \in \mathcal{F}} |f_3(\nu) - \hat{f}_3(\nu)|$ with factors $1$ and $\gamma$, respectively. Letting each failure probability be $\delta/3$ and applying a union bound yields the stated bound on $|\mathrm{IPM}_{\mathcal{G}}(d^\pi_\phi, d^D_\phi) - \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi)|$ with probability at least $1-\delta$. $\square$

The relationship between $\mathcal{F}$ and $\mathcal{G}$  The definition
$$\mathcal{G} = \Big\{ g \in (Z \rightarrow \mathbb{R}) : g(z) = \nu(z) - \gamma\, \mathbb{E}_{s' \sim T(\phi^{-1}(z)),\, a' \sim \pi(s')}\big[\nu(\phi(s', a'))\big],\; \nu \in (Z \rightarrow \mathbb{R}),\; \langle \nu, \nu \rangle_{\mathcal{H}_k} \leq 1 \Big\}$$
shows that when the conditional expectation $\mathbb{E}_{s' \sim T(\phi^{-1}(\cdot)),\, a' \sim \pi(s')}[\nu(\phi(s', a'))] : Z \rightarrow \mathbb{R}$ is a function in the RKHS $\mathcal{H}_k$, $\mathcal{G}$ is also a subset of $\mathcal{H}_k$. We can then prove the following theorem.

Theorem 4.3. Given an MDP $M$, its estimate $\hat{M}$ with a bijective representation function $\phi$, i.e. $(\hat{T}, \hat{R}) = (\hat{T}_z \circ \phi, \hat{R}_z \circ \phi)$, and an RKHS $\mathcal{H}_k \subseteq (Z \rightarrow \mathbb{R})$ induced by a universal kernel $k$ with $\sup_{z \in Z} k(z, z) = \bar{k}$, assume that $f_{\phi, \hat{R}_z, \hat{T}_z}(z) = \mathbb{E}_{T,\pi}\big[\sum_{t=0}^{\infty} \gamma^t \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s_t, a_t) \,\big|\, (s_0, a_0) = \phi^{-1}(z)\big] \in \mathcal{H}_k$ with $B_\phi = \|f_{\phi, \hat{R}_z, \hat{T}_z}\|_{\mathcal{H}_k}$, and that the loss is bounded by $\bar{\mathcal{E}} = \sup_{s \in S, a \in A} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)$. Let $n$ be the number of data points in $D$. With probability at least $1-2\delta$,
$$\big|R^\pi - \hat{R}^\pi\big| \leq \frac{1}{n} \sum_{(s,a) \in D} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) + B_\phi \, \widehat{\mathrm{IPM}}(d^\pi_\phi, d^D_\phi) + \epsilon(n, \delta),$$
where $\epsilon(n, \delta)$ collects the concentration terms from Lemma A.3 and Hoeffding's inequality, each of order $O(\sqrt{\log(1/\delta)/n})$.

Proof. Applying Hoeffding's inequality (Hoeffding, 1963), with probability at least $1-\delta$, we have
$$\mathbb{E}_{(s,a) \sim d^D}\big[\mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a)\big] \leq \frac{1}{n} \sum_{(s,a) \in D} \mathcal{E}_{\phi, \hat{R}_z, \hat{T}_z}(s, a) + \bar{\mathcal{E}} \sqrt{\frac{\log(1/\delta)}{2n}}.$$
Taking a union bound with Eq. (13) and substituting the terms into Eq. (3) recovers the result. $\square$

B EXPERIMENT DETAILS

B.1 COMPUTING INFRASTRUCTURE

All experiments were conducted on the Google Cloud Platform. Specifically, we used compute-optimized machines (c2-standard-4) providing 4 vCPUs and 16 GB of memory for the evaluation experiments of Section 5.1, and high-memory machines (n1-highmem-4) providing 4 vCPUs and 26 GB of memory, equipped with an Nvidia Tesla K80 GPU, for the RL experiments of Section 5.2.

B.2 DETAILS OF THE OPE EXPERIMENT

Task details  We did not modify the CartPole-v0 and Acrobot-v1 environments from the original implementation of OpenAI Gym (Brockman et al., 2016), except for the maximum trajectory length.
We ran PPO (Schulman et al., 2017) to optimize policies to a certain performance level and set them as the target policies for CartPole-v0 and Acrobot-v1. For the HIV simulator, we used the code adapted by Liu et al. (2018b), which originally comes from the RLPy implementation.³ We modified the environment to have more randomness in the initial state (up to 10% perturbation from the baseline initial state) and to use a reward function that gives the logarithm of the original reward values, as the original reward function scales up to $10^{10}$. We used a tree-based fitted Q-iteration algorithm implemented by Liu et al. (2018b) to optimize the target policy for the HIV simulator. All the other details are shown in Table 2. We assume that the termination conditions of the tasks are known a priori.

Table 2: Task settings of the OPE experiments.

| | CartPole-v0 | Acrobot-v1 | HIV simulator |
|---|---|---|---|
| state space dimension | 4 | 6 | 6 |
| # of actions | 2 | 3 | 4 |
| discount rate γ | 0.99 | 0.99 | 0.98 |
| # of trajectories | 200 | 200 | 50 |
| max length of training traj. | 200 | 200 | 200 |
| max length of rollouts for evaluation | 1000 | 1000 | 1000 |
| discounted return of target policy | 89.4 | -56.88 | 803.2 |

Model and algorithm details  The model we learn is composed of a representation module and a dynamics module. To be consistent with the experimental settings in Liu et al. (2018b), we use a representation module consisting of a single-hidden-layer feed-forward network that takes the state as input and outputs the representation. We squash the representation into $(-1, 1)$ using the tanh activation function. The dynamics module is also a single-hidden-layer feed-forward network that takes the representation as input and outputs the state-difference and reward prediction for each action. We use the swish activation function (Ramachandran et al., 2017) for the hidden layers of the two modules. As a whole, the model can also be seen as a feed-forward network with three hidden layers of varying activation functions, where the output of the second hidden layer is the representation we regularize. For the purpose of comparison, we minimize the L2 distance between the model prediction and the desired outcome from the data, which corresponds to using a model with a Gaussian predictive distribution of fixed variance. We standardized the inputs and outputs of the neural network and used Adam (Kingma & Ba, 2014) with a learning rate of $3 \times 10^{-4}$ for the optimization. Compared to the similar experiments conducted in Liu et al. (2018b), we used a larger and more expressive model with more optimization steps and a smaller learning rate for a more accurate comparison.

While the derivation of the RepB-SDE objective is based on a state-action representation function, we use a state representation in this experiment for direct comparison with RepBM, which uses a state representation (this can also be understood as using an action-invariant kernel). We follow the choice of Liu et al. (2018b) and use the dot-product kernel $k(\phi(s), \phi(\bar{s})) = \phi(s)^\top \phi(\bar{s})$ for the OPE experiment, which is not universal but allows us to avoid searching over kernel hyperparameters such as length scales.

³https://github.com/rlpy/rlpy

After training, we generate another 200 trajectories (50 in the case of the HIV simulator) and roll out in both the true and simulated (based on the learned model) environments to evaluate the models. We measure the individual MSE,
$$\text{Individual MSE} = \mathbb{E}_{s_0 \sim d_0}\Big[ \Big( \mathbb{E}_{\hat{M},\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0\Big] - \mathbb{E}_{M,\pi}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0\Big] \Big)^2 \Big],$$
as the performance metric of each model.
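The computation behind this metric is straightforward. Below is a small NumPy sketch under the assumption that, for each sampled initial state, the discounted returns of rollouts in the true environment and in the learned model have already been collected; all names are illustrative rather than the paper's code.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted sum of rewards of a single rollout."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(rewards * gamma ** np.arange(len(rewards))))

def individual_mse(true_returns_per_s0, model_returns_per_s0):
    """Individual MSE: mean over initial states s0 of the squared gap between the
    value estimated by rolling out pi in the learned model and the true value.

    Each argument maps an initial-state index to a list of discounted returns;
    the inner mean approximates the conditional expectation given s0.
    """
    gaps = []
    for s0 in true_returns_per_s0:
        v_true = np.mean(true_returns_per_s0[s0])
        v_model = np.mean(model_returns_per_s0[s0])
        gaps.append((v_model - v_true) ** 2)
    return float(np.mean(gaps))
```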
The whole experiment, from sampling data to learning and evaluating the model, is repeated 200 times with different random seeds.

Choice and effect of the hyperparameter α  To choose the hyperparameter α for each algorithm, we searched over α ∈ {0.001, 0.01, 0.1, 1, 10} for each off-policyness ϵ and each environment. The chosen values were mainly α ∈ {0.001, 0.01} for CartPole-v0, α ∈ {1, 10} for Acrobot-v1, and α ∈ {0.01, 0.1} for the HIV simulator, for both algorithms. In general, a large α was beneficial when high off-policyness (ϵ) was present and/or the task was hard to generalize. On the right we show an example of the effect of varying α in CartPole-v0. [Figure: relative individual MSE in CartPole-v0 (ϵ = 0.5) for α ∈ {0.0001, 0.001, 0.01, 0.1}, comparing the baseline, RepBM, and our method.]

Comparison to other baselines  In Figure 2, the OPE results with additional model-free baselines are presented. FQE: fitted Q-evaluation; IS: step-wise importance sampling; DR: doubly robust estimator based on step-wise importance sampling using the value function learned with fitted Q-evaluation (Jiang & Li, 2016); DualDICE: stationary distribution correction algorithm (Nachum et al., 2019a). We used the implementation provided by the authors in the case of DualDICE. All results are normalized so that the MSE of the model-based baseline is 1. Here, we use average MSEs instead of individual MSEs, since importance sampling estimators are not suitable for computing individual MSEs. The results show that model-based methods are more robust to increasing off-policyness compared to FQE. The results of the model-based methods on Acrobot are relatively worse due to the difficult dynamics of the Acrobot environment.

Figure 2: The OPE results compared to other model-free baselines (Baseline, Ours, FQE, IS, DR, DualDICE) on CartPole-v0, Acrobot-v1, and the HIV simulator, with behavior off-policyness (ϵ) on the x-axis and normalized average MSE on the y-axis. All experiments are repeated 200 times and the error bars denote 95% confidence intervals.

B.3 DETAILS OF THE OFFLINE RL EXPERIMENTS

Task details  We used 12 datasets in D4RL (Fu et al., 2020), over four dataset types and three environments, as specified in the main text. In Table 1, the normalized score suggested by Fu et al. (2020) is used to report the results, where a score of 0 corresponds to a random policy and 100 corresponds to a converged SAC policy. In HalfCheetah-v2, a score of 0 corresponds to an undiscounted return of -280 and a score of 100 to an undiscounted return of 12135. In Hopper-v2, a score of 0 corresponds to an undiscounted return of -20 and a score of 100 to an undiscounted return of 3234. In Walker2d-v2, a score of 0 corresponds to an undiscounted return of 2 and a score of 100 to an undiscounted return of 4592. We assume that the termination conditions of the tasks are known a priori.
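For reference, the D4RL normalization used in Tables 1 and 4 amounts to the simple linear rescaling sketched below; the reference returns are the approximate values quoted above and are used purely for illustration.

```python
# Approximate reference returns quoted above: (random-policy return, expert return).
REFERENCE_RETURNS = {
    "halfcheetah": (-280.0, 12135.0),
    "hopper": (-20.0, 3234.0),
    "walker2d": (2.0, 4592.0),
}

def normalized_score(env_name, undiscounted_return):
    """D4RL-style normalization: 0 = random policy, 100 = converged SAC (expert) policy."""
    random_ret, expert_ret = REFERENCE_RETURNS[env_name]
    return 100.0 * (undiscounted_return - random_ret) / (expert_ret - random_ret)

# Example: normalized_score("hopper", 1600.0) is roughly 49.8.
```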
Consequently, to bound the error of $\bar{R}^\pi$, the representation should be balanced with another discounted stationary distribution
$$\bar{d}^\pi(s, a) \triangleq (1-\gamma)\sum_{t=0}^{\infty}\gamma^t \Pr(s_t = s, a_t = a \mid s_0 \sim d^{\mathcal{D}}, T, \pi),$$
the distribution induced by the policy π when the initial state is sampled from the data distribution. The derivations can be easily adapted by noting that
$$\mathrm{IPM}_{\mathcal{G}}(\bar{d}^\pi_\phi, d^{\mathcal{D}}_\phi) = \sup_{\nu \in \mathcal{F}} \Big\{ (1-\gamma)\,\mathbb{E}_{\bar{s} \sim d^{\mathcal{D}},\, \bar{a} \sim \pi}\big[\nu(\phi(\bar{s}, \bar{a}))\big] - \mathbb{E}_{(s,a,s') \sim d^{\mathcal{D}},\, a' \sim \pi}\big[\nu(\phi(s, a)) - \gamma\,\nu(\phi(s', a'))\big] \Big\},$$
and by changing the initial-state sampling distribution to $d^{\mathcal{D}}$ during the estimation of the IPM.

Model and algorithm details   Similar to the model used in the OPE experiment, the model we learn is composed of a representation module and a dynamics module. The representation module is a feed-forward network with two hidden layers that takes the state-action pair as input and outputs the representation through the tanh activation function. The dynamics module is a single-hidden-layer network that takes the representation as input and outputs the parameters of a diagonal Gaussian distribution predicting the state difference and the reward. We use 200 hidden units for all intermediate layers, including the representation layer. Across all domains, we train an ensemble of 7 models and pick the best 5 models according to their validation error on a hold-out set of 1000 transitions from the dataset. The inputs and outputs of the neural network are normalized. We present the pseudo-code of the presented Representation Balancing Offline Model-based RL algorithm below.

Algorithm 1 Representation Balancing Offline Model-based RL
Input: offline dataset $\mathcal{D}$, previous policy π
Output: optimized policy π
1: Sample $K$ independent datasets with replacement from $\mathcal{D}$.
2: Train a bootstrapped ensemble of $K$ models $\{\widehat{T}_i, \widehat{R}_i\}_{i=1}^{K}$ minimizing Eq. (8) (adapted with $\bar{d}^\pi_\phi$).
3: for repeat = 0, 1, . . . do
4:   for rollout = 0, 1, . . . , B do
5:     Sample an initial rollout state $s_0$ from $\mathcal{D}$.
6:     for t = 0, 1, . . . do
7:       Sample an action $a_t \sim \pi(s_t)$.
8:       Randomly pick $(\widehat{T}_i, \widehat{R}_i)$ and sample $(s_{t+1}, r_t) \sim (\widehat{T}_i, \widehat{R}_i)(s_t, a_t)$.
9:       Compute $\tilde{r}_t = r_t - \lambda\,\big\|\sqrt{\mathbb{V}_K[\mu(s_t, a_t)]}\big\|_2$ and store $(s_t, a_t, \tilde{r}_t, s_{t+1})$ in $\widehat{\mathcal{D}}$.
10:     end for
11:   end for
12:   Draw samples from $\mathcal{D}$ to compute $\widehat{\mathrm{IPM}}(\bar{d}^\pi_\phi, d^{\mathcal{D}}_\phi)$.
13:   Draw samples from $\mathcal{D}$ and $\widehat{\mathcal{D}}$ to update the critic $Q$.
14:   Maximize $\mathbb{E}_{s \sim \mathcal{D} \cup \widehat{\mathcal{D}},\, a \sim \pi(s)}\big[Q(s, a) - \tau \log \pi(a \mid s)\big] - \alpha_\pi\,\widehat{\mathrm{IPM}}(\bar{d}^\pi_\phi, d^{\mathcal{D}}_\phi)$ to update π.
15: end for

For the results of MOPO (Yu et al., 2020), we ran the code kindly provided by the authors (https://github.com/tianheyu927/mopo) without any modification to the algorithm or the hyperparameters. All algorithms we experimented with (Base, RP, Rep B-SDE) share all hyperparameters except the ones associated with the changed objectives. We run SAC on the full rollouts from the trained ensemble models, as shown in Algorithm 1. The common hyperparameters shared among the algorithms are listed in Table 3. We simply tried the listed hyperparameters and did not tune them further. For RP and Rep B-SDE, we penalized the reward from the simulated environments with the standard deviation of the prediction means of the neural network ensemble, using the standardized outputs of all 7 neural networks to compute the reward penalty.
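To illustrate the rollout phase of Algorithm 1 (steps 4-11) together with the ensemble-based reward penalty just described, the sketch below rolls the policy through randomly chosen ensemble members and penalizes each reward by the ℓ2 norm of the ensemble standard deviation of the (standardized) mean predictions. The model, policy, and buffer interfaces (sample, predict_mean, add) are assumptions for exposition, not the exact implementation.

import numpy as np

def rollout_with_penalty(ensemble, policy, dataset_states, buffer,
                         num_rollouts, max_len, lam):
    """Sketch of Algorithm 1, steps 4-11 (interfaces are assumed):
       - each model m in `ensemble` provides m.sample(s, a) -> (s_next, r) and
         m.predict_mean(s, a) -> (mean_s_next, mean_r) in standardized units,
       - policy(s) -> a,
       - buffer.add(s, a, r, s_next) stores a transition in the synthetic dataset D_hat."""
    for _ in range(num_rollouts):
        # Step 5: sample an initial rollout state from the offline dataset D.
        s = dataset_states[np.random.randint(len(dataset_states))]
        for _ in range(max_len):
            a = policy(s)                                       # Step 7
            model = ensemble[np.random.randint(len(ensemble))]  # Step 8: pick one model at random
            s_next, r = model.sample(s, a)
            # Step 9: penalty = lambda * || std over ensemble mean predictions ||_2
            means = np.stack([np.concatenate([np.ravel(mu_s), np.ravel(mu_r)])
                              for mu_s, mu_r in (m.predict_mean(s, a) for m in ensemble)])
            r_tilde = r - lam * np.linalg.norm(means.std(axis=0))
            buffer.add(s, a, r_tilde, s_next)                   # store in D_hat
            s = s_next

As described above, the penalty coefficient λ is shared with the RP baseline, and the penalty is computed from the standardized outputs of all 7 ensemble members.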
Table 3: Common hyperparameters used in offline RL experiments.

Parameter                                         Value
optimizer                                         Adam (Kingma & Ba, 2014)
learning rate                                     3 × 10⁻⁴
discount factor γ                                 0.99
number of samples per minibatch                   256
target smoothing coefficient τ                    5 × 10⁻³
[actor/critic] number of hidden layers            2
[actor/critic] number of hidden units per layer   256
[actor/critic] non-linearity                      ReLU
# of rollouts                                     10⁴
max length of rollouts                            10³
rollout buffer size                               5 × 10⁷

Table 4: Normalized scores on D4RL MuJoCo benchmark datasets (Fu et al., 2020) with standard errors fully specified.

Dataset type   Environment       Rep B-SDE (ours)   RP            Base          MOPO
Random         Walker2d-v2       21.1 ± 1.0         18.4 ± 1.8    16.4 ± 2.2    1.3 ± 0.5
Medium         Walker2d-v2       72.1 ± 1.9         56.3 ± 4.3    5.5 ± 0.4     -0.1 ± 0.0
Med-Replay     Walker2d-v2       49.8 ± 11.4        41.1 ± 10.3   6.2 ± 0.7     47.8 ± 4.8
Med-Expert     Walker2d-v2       88.8 ± 6.9         72.6 ± 5.1    51.7 ± 7.1    32.4 ± 8.8
Random         Hopper-v2         8.6 ± 1.0          8.3 ± 0.2     8.3 ± 0.2     9.1 ± 0.3
Medium         Hopper-v2         34.0 ± 2.8         27.5 ± 3.1    19.8 ± 0.8    19.2 ± 4.5
Med-Replay     Hopper-v2         62.2 ± 6.7         49.8 ± 8.7    32.9 ± 5.5    80.8 ± 2.9
Med-Expert     Hopper-v2         82.6 ± 7.0         74.0 ± 7.2    19.1 ± 1.1    23.2 ± 1.3
Random         HalfCheetah-v2    32.9 ± 1.1         31.3 ± 1.5    26.1 ± 6.2    29.9 ± 1.2
Medium         HalfCheetah-v2    49.1 ± 0.3         47.3 ± 0.3    22.1 ± 4.4    45.0 ± 0.4
Med-Replay     HalfCheetah-v2    57.5 ± 0.8         53.6 ± 0.8    54.0 ± 2.4    53.8 ± 1.1
Med-Expert     HalfCheetah-v2    55.4 ± 8.3         36.7 ± 12.0   31.9 ± 10.5   91.0 ± 2.5

We first searched over the reward penalty coefficient λ ∈ {0, 2, 5, 7, 10, 15} for the value that grants RP the best performance. We shared the same λ for Rep B-SDE and searched over $\alpha_M \in \{0.01, 0.1, 1\}$ and $\alpha_\pi \in \{0, 0.01, 0.1, 1, 10\}$. We ran the algorithms for 1.5 × 10⁶ gradient updates, except for HalfCheetah-Medium-Expert, where we ran for 5.0 × 10⁶. In Table 4, we present the standard errors of the results, which were omitted in the main text. In addition to the performance after the final iteration presented in Table 1, we also present the learning curves of the experiments in Figure 3 and the effect of the choice of $\alpha_M$ and $\alpha_\pi$ in Figure 4. Figure 4 shows that the presented regularization is robust to the choice of $\alpha_M$ and $\alpha_\pi$, except for too large an $\alpha_M$, consistently improving over RP, which corresponds to $\alpha_M = 0$, $\alpha_\pi = 0$.

Figure 3: The learning curves of the D4RL experiment. (Panels: average return versus gradient updates (10⁶) for the Random, Medium, Medium-Replay, and Medium-Expert datasets of Walker2d-v2, Hopper-v2, and HalfCheetah-v2; methods: MOPO, Base, RP, RP + MRB, BC, MF.)

Figure 4: The effect of varying hyperparameters in the D4RL experiment. (Panels: average return versus gradient updates (10⁶) on Medium-Walker2d-v2 for varying $\alpha_M$ and $\alpha_\pi$, with BC, MF, and RP ($\alpha_M = 0$, $\alpha_\pi = 0$) as references.)
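For reference, the normalized scores reported in Tables 1 and 4 follow the linear scaling described in the task details above (0 for a random policy, 100 for a converged SAC policy). A minimal sketch, using the rounded reference returns quoted above rather than the exact D4RL constants:

# Normalized D4RL score: 0 corresponds to a random policy, 100 to a converged
# SAC policy (reference returns as quoted above, rounded).
REFERENCE_RETURNS = {            # (random return, expert return)
    "HalfCheetah-v2": (-280.0, 12135.0),
    "Hopper-v2": (-20.0, 3234.0),
    "Walker2d-v2": (2.0, 4592.0),
}

def normalized_score(env_name, undiscounted_return):
    lo, hi = REFERENCE_RETURNS[env_name]
    return 100.0 * (undiscounted_return - lo) / (hi - lo)

# Example: an undiscounted return of 3310 on Walker2d-v2 maps to a score of about 72,
# roughly the Medium-Walker2d-v2 result of Rep B-SDE in Table 4.
print(round(normalized_score("Walker2d-v2", 3310.0), 1))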