# Adaptation Augmented Model-based Policy Optimization

Journal of Machine Learning Research 24 (2023) 1-35. Submitted 5/22; Revised 4/23; Published 6/23.

Jian Shen (rockyshen@apex.sjtu.edu.cn), Hang Lai (laihang99@sjtu.edu.cn), Minghuan Liu (minghuanliu@sjtu.edu.cn), Han Zhao (hanzhao@illinois.edu), Yong Yu (yyu@sjtu.edu.cn), Weinan Zhang (wnzhang@sjtu.edu.cn)

Department of Computer Science, Shanghai Jiao Tong University; Department of Computer Science, University of Illinois Urbana-Champaign

Editor: Laurent Orseau

Abstract

Compared to model-free reinforcement learning (RL), model-based RL is often more sample efficient by leveraging a learned dynamics model to help decision making. However, the learned model is usually not perfectly accurate, and its error compounds over multi-step predictions, which can lead to poor asymptotic performance. In this paper, we first derive an upper bound on the return discrepancy between the real dynamics and the learned model, which reveals the fundamental problem of distribution shift between simulated data and real data. Inspired by this theoretical analysis, we propose an adaptation augmented model-based policy optimization (AMPO) framework to address the distribution shift problem from the perspectives of feature learning and instance re-weighting, respectively. Specifically, the feature-based variant, namely FAMPO, introduces unsupervised model adaptation to minimize the integral probability metric (IPM) between feature distributions from real and simulated data, while the instance-based variant, termed IAMPO, utilizes importance sampling to re-weight the real samples used to train the model. Besides model learning, we also investigate how to improve policy optimization in the model usage phase by selecting simulated samples with different probabilities according to their uncertainty.
Extensive experiments on challenging continuous control tasks show that FAMPO and IAMPO, coupled with our model usage technique, achieve superior performance against baselines, which demonstrates the effectiveness of the proposed methods.

Keywords: model-based reinforcement learning, distribution shift, occupancy measure, integral probability metric, importance sampling

Equal contribution. Corresponding author. ©2023 Jian Shen, Hang Lai, Minghuan Liu, Han Zhao, Yong Yu and Weinan Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0606.html.

1. Introduction

Reinforcement learning (RL) algorithms can be roughly divided into two categories according to whether they utilize an environmental dynamics model: model-free RL (MFRL) and model-based RL (MBRL). MFRL methods, which directly learn a value function or a policy (or both), have achieved great success on a wide range of tasks such as video games (Mnih et al., 2015; Hessel et al., 2018) and robotic control (Gu et al., 2017; Haarnoja et al., 2018). However, MFRL is notoriously sample-inefficient and requires a tremendous number of interactive samples to learn a good policy. In many high-stakes real-world applications, e.g., autonomous driving and online education, it is often expensive, or even infeasible, to collect such large-scale datasets. In contrast, MBRL methods, which learn a dynamics model first and use it to alleviate the sampling cost, are widely considered to be an appealing alternative (Sun et al., 2018; Langlois et al., 2019). Despite their higher sample efficiency, model-based methods tend to have poor asymptotic performance compared to their model-free counterparts due to their vulnerability to model errors (Nagabandi et al., 2018; Chua et al., 2018).
To be precise, even when equipped with a high-capacity model, such errors still exist due to the potential distribution shift between the training and generating phases, i.e., the state-action input distribution used to train the model differs from the one generated by the model (Talvitie, 2014). Specifically, the training data is usually collected by the behavior policies in the real environment, but the model is required to make predictions on the data collected by the target policy in the model. When an imperfect model is used for multi-step rollouts, the error in one-step prediction tends to accumulate over subsequent steps, a phenomenon known as the multi-step compounding error challenge (Asadi et al., 2018).

In light of the distribution shift problem in MBRL, many efforts in the literature have been devoted to tackling it, either by improving the approximation accuracy of model learning or by designing careful strategies for using the model in simulation. For model learning, different architectures (Asadi et al., 2018, 2019; Chua et al., 2018) and loss functions (Farahmand, 2018; Wu et al., 2019) have been proposed to mitigate overfitting or to improve multi-step predictions, so that the distributions of the simulated data generated by the model are closer to the real ones. Besides, Talvitie (2014, 2017) proposed a self-correcting mechanism that trains the model on predicted states as inputs along with the real data, gradually bridging the gap between the simulated data and the real data. On the other hand, for model usage, delicate rollout schemes (Janner et al., 2019; Pan et al., 2020) have been adopted to stop the model rollouts before the simulated data deviate too much from the real distribution. Although these existing methods help alleviate the distribution shift to some extent, the problem has not been explicitly addressed.
In this paper, we investigate how to explicitly address the distribution shift problem in a principled way. To begin with, we derive an upper bound on the return disparity between the real dynamics and the learned model, which naturally inspires a bound minimization algorithm. To this end, we propose our AMPO (Adaptation augmented Model-based Policy Optimization) framework, built upon the existing MBPO (Janner et al., 2019) method, with two variants dubbed FAMPO and IAMPO, respectively. More specifically, FAMPO introduces a model adaptation procedure that encourages the model to learn invariant feature representations by minimizing the integral probability metric (IPM) between the feature distributions of real data and simulated data. In addition to aligning the feature distributions, we also handle the distribution shift problem at the data level and propose an instance-based model adaptation method, IAMPO. The intuition behind IAMPO is that the model should focus more on the state-action pairs that are more likely to appear in the simulated data distribution. The design principle of IAMPO is simple and straightforward: optimize the dynamics model by minimizing the return discrepancy via gradient descent. In practice, IAMPO re-weights all the training samples by importance scores when learning the dynamics model, where the importance score is defined as the density ratio between the occupancy measures of the simulated data and the real data. While model adaptation helps to achieve better generalization, some inaccurate predictions can still catastrophically affect the performance due to the imperfect approximation. For this reason, during the model usage phase, we adopt a weighted sampling strategy, which samples the simulated data with different probabilities according to the model uncertainty, to reduce the proportion of uncertain samples in the simulated data used for policy optimization.
We evaluate our algorithms on a range of continuous control benchmark tasks, and the results demonstrate that FAMPO and IAMPO, coupled with the weighted sampling strategy, achieve higher sample efficiency and better asymptotic performance compared to various baselines.

2. Preliminaries

We first introduce the notation used throughout the paper and the problem setup of RL. Then, we briefly discuss the concepts related to the integral probability metric. Finally, we present a classical MBRL method, MBPO, which will be the underlying framework for our algorithm.

2.1 Reinforcement Learning

We consider an infinite-horizon Markov Decision Process (MDP), which is defined by the tuple (S, A, T, r, ν₀, γ), where S is the state space and A is the action space. Throughout the paper, we assume the state and action spaces are continuous. We use γ ∈ (0, 1) to denote the discount factor, and T(s′|s, a) to denote the transition density of state s′ given action a taken in state s. The initial state distribution is denoted as ν₀, and the reward function is denoted as r(s, a). We assume the reward function is bounded: r_max := sup_{s,a} |r(s, a)| < ∞. The agent maintains a policy π(a|s) that determines the probability of choosing an action a at a given state s. The goal in reinforcement learning (RL) is to find the optimal policy π* that maximizes the expected return (sum of discounted rewards), denoted by η_T:

$$\pi^* := \arg\max_{\pi} \eta_T(\pi) = \arg\max_{\pi} \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 \sim \nu_0\Big], \tag{1}$$

where s_{t+1} ∼ T(·|s_t, a_t) and a_t ∼ π(·|s_t). In general the true transition T(s′|s, a) is unknown, and MBRL methods often learn an approximate model T̂(s′|s, a) of the transition dynamics using samples collected from interactions with the MDP. Different from previous works (Luo et al., 2018; Chua et al., 2018), in this paper we also assume the reward function r(s, a) to be unknown, so an agent needs to learn the reward function simultaneously.
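As a concrete illustration of the objective in Eq. (1), the following minimal Python sketch estimates the discounted return of a single trajectory by truncating the infinite sum at the rollout length; the function name and toy rewards are our own, not from the paper:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Truncated discounted return: sum_t gamma^t * r_t over one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

# Sanity check: a constant reward of 1 has infinite-horizon return 1 / (1 - gamma),
# and a long finite rollout approaches that limit.
approx = discounted_return([1.0] * 2000, gamma=0.99)  # close to 100
```

In MBRL, the surrogate return η_T̂(π) is estimated by averaging such returns over rollouts generated by the learned model T̂ instead of the real environment.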
Given a policy π and a transition function T, we denote the density of being in state s at time step t as P^π_{T,t}(s) = P(s_t = s | π, T, s₀ ∼ ν₀). We then define the normalized occupancy measure (Ho and Ermon, 2016) of policy π under the dynamics T as

$$\rho^{\pi}_{T}(s, a) = (1 - \gamma)\, \pi(a|s) \sum_{t=0}^{\infty} \gamma^t P^{\pi}_{T,t}(s).$$

Similarly, ρ^π_{T̂}(s, a) represents the normalized occupancy measure of policy π under the approximate dynamics model T̂.

2.2 Integral Probability Metric

The integral probability metric (IPM) is a family of discrepancy measures between two distributions over the same space (Müller, 1997; Sriperumbudur et al., 2009). Specifically, given two probability distributions P and Q over X, the F-IPM is defined as

$$d_{\mathcal{F}}(P, Q) := \sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)], \tag{2}$$

where F is a class of witness functions f : X → ℝ. Following Bińkowski et al. (2018), we assume the IPM is symmetric; that is, if f ∈ F, we also have −f ∈ F. By choosing different function classes F, the IPM reduces to many well-known distance metrics between probability distributions. In particular, the Wasserstein-1 distance (Villani, 2008) is defined using the 1-Lipschitz functions {f : ‖f‖_L ≤ 1}, where the Lipschitz semi-norm ‖·‖_L is defined as ‖f‖_L = sup_{x ≠ y} |f(x) − f(y)| / |x − y|. Furthermore, the total variation distance is also a kind of IPM, and we use d_TV(·, ·) to denote it.

2.3 Model-based Policy Optimization

We briefly summarize the model-based policy optimization (MBPO) (Janner et al., 2019) algorithm, on top of which we build our algorithm. MBPO uses a bootstrapped ensemble of probabilistic dynamics models T̂_θ(s′|s, a), parameterized by θ. Each individual dynamics model is a probabilistic neural network that outputs a Gaussian distribution with diagonal covariance. The model ensemble is trained on the real data via minimizing the negative log-likelihood loss:

$$\mathcal{L}^{i}_{\hat{T}}(\theta) = \big[\mu^{i}_{\theta}(s_n, a_n) - s_{n+1}\big]^{\top} \Sigma^{i}_{\theta}(s_n, a_n)^{-1} \big[\mu^{i}_{\theta}(s_n, a_n) - s_{n+1}\big] + \log \det \Sigma^{i}_{\theta}(s_n, a_n). \tag{3}$$

The learned model T̂_θ(s′|s, a) is used to generate k-step rollouts starting from states sampled from the real data buffer D_env, with the actions taken by the current policy π_φ. The generated simulated data is then added to a separate buffer D_model. Finally, the policy π_φ is trained on both real and simulated data from D_env ∪ D_model in a fixed ratio using Soft Actor-Critic (SAC) (Haarnoja et al., 2018), which trains a stochastic policy with entropy regularization in an actor-critic architecture by minimizing the expected KL-divergence:

$$\mathcal{L}_{\pi}(\phi) = \mathbb{E}_{s}\big[D_{\mathrm{KL}}\big(\pi_{\phi}(\cdot|s) \,\big\|\, \exp(Q(s, \cdot) - V(s))\big)\big], \tag{4}$$

where the Q and V functions are estimated via the soft Bellman backup operator following Haarnoja et al. (2018).

3. Feature-based Adaptation-augmented MBPO

In this section, we first propose a feature-based adaptation approach to explicitly mitigate the distribution shift problem in MBRL, inspired by the return discrepancy upper bound derived as follows.

3.1 An Upper Bound for Return Discrepancy

One of the main benefits of model-based RL methods is that, once the model is learned, we can use it to simulate data in place of the real environment. If the dynamics model were perfect, we would not need the real environment anymore. However, if the dynamics model is highly erroneous, the policy learned on the model may perform poorly in the real environment, which can lower sample efficiency instead. Therefore, it is necessary in MBRL to derive an upper bound on the discrepancy between the expected return in the real environment η_T(π) and the expected return in the model η_{T̂}(π) under the same policy π, in the following form (Luo et al., 2018; Janner et al., 2019):

$$\big|\eta_{\hat{T}}(\pi) - \eta_{T}(\pi)\big| \le C. \tag{5}$$

Usually, the dynamics model T̂ will be learned from experiences (s, a, s′) collected by a behavioral policy π_D under the real environment dynamics T.
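To make the per-model loss in Eq. (3) concrete, here is a minimal NumPy sketch of the diagonal-Gaussian negative log-likelihood (up to additive constants); the function name and the toy arrays are our own illustration, not code from the paper:

```python
import numpy as np

def gaussian_nll(mu, var, target):
    """Eq. (3) for a diagonal covariance: Mahalanobis term plus log-determinant.

    mu, var: predicted mean and diagonal variance, shape (batch, state_dim).
    target:  observed next states s_{n+1}, same shape.
    """
    diff = mu - target
    return np.sum(diff ** 2 / var, axis=-1) + np.sum(np.log(var), axis=-1)

# An unbiased prediction scores a lower loss than a biased one at equal variance.
mu, var = np.zeros((1, 3)), np.ones((1, 3))
good = gaussian_nll(mu, var, np.zeros((1, 3)))       # loss 0 (constants dropped)
bad = gaussian_nll(mu + 1.0, var, np.zeros((1, 3)))  # loss 3
```

In MBPO, each ensemble member i minimizes this loss on its own bootstrap sample of the real data, which is what makes the ensemble members disagree on unfamiliar inputs.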
Typically, in an online MBRL method with iterative policy optimization, the behavioral policy π_D represents a collection of past policies. Once we have derived this bound, we can naturally design a model-based framework to optimize the RL objective by maximizing the surrogate return η_{T̂}(π) and minimizing the discrepancy C simultaneously. Specifically, previous works have derived the following lemma to give a precise return discrepancy (Luo et al., 2018; Yu et al., 2020).

Lemma 1 (Luo et al. (2018), Lemma 4.3; Yu et al. (2020), Lemma 4.1; Shen et al. (2020), Lemma E.1) Let two MDPs share the same reward function r(s, a) but have different dynamics T(·|s, a) and T̂(·|s, a), respectively. Define

$$V^{\pi}_{T}(s) := \mathbb{E}_{\pi, T}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big]$$

as the expected discounted return under π starting from state s, and

$$G^{\pi}_{\hat{T}}(s, a) := \mathbb{E}_{s' \sim \hat{T}}\big[V^{\pi}_{T}(s') \mid s, a\big] - \mathbb{E}_{s' \sim T}\big[V^{\pi}_{T}(s') \mid s, a\big].$$

For any policy π, we have that

$$\eta_{\hat{T}}(\pi) - \eta_{T}(\pi) = \kappa\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}}\big[G^{\pi}_{\hat{T}}(s, a)\big], \tag{6}$$

where κ = γ(1 − γ)⁻¹. For the sake of completeness, we provide a proof of Lemma 1, which is almost the same as in Yu et al. (2020); the only difference is that our occupancy measure is normalized.

Proof Let W_j be the expected return when executing π on T̂ for the first j steps, then switching to T for the remainder. That is, W_j = 𝔼_{a_t ∼ π, …}[…] …

… where λ > 0 is a constant coefficient. By further applying the Fenchel conjugate (Rockafellar, 1970) and the interchangeability principle as in Zhang et al. (2020a), optimizing the final objective of GradientDICE amounts to the minimax problem

$$\min_{\tau} \max_{f, \beta}\; L(\tau, \beta, f) = (1-\gamma)\,\mathbb{E}_{s \sim \nu_0, a \sim \pi}[f(s, a)] + \gamma\,\mathbb{E}_{(s,a) \sim \rho^{\pi_D}_{T},\, s' \sim \hat{T},\, a' \sim \pi}\big[\tau(s, a) f(s', a')\big] - \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_{T}}\Big[\tau(s, a) f(s, a) + \tfrac{1}{2} f(s, a)^2\Big] + \lambda\Big(\mathbb{E}_{(s,a) \sim \rho^{\pi_D}_{T}}\big[\beta \tau(s, a) - \beta\big] - \tfrac{1}{2}\beta^2\Big), \tag{20}$$

where τ : S × A → ℝ, f : S × A → ℝ, and β ∈ ℝ. In our practical implementation, we use feed-forward neural networks to model the functions f and τ. Since we use a model ensemble and the importance ratio is model-specific, we also construct a separate DICE network for each dynamics model.
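The downstream use of the estimated ratio τ(s, a) can be sketched as a per-sample re-weighting of the model loss; the function, the clip bounds standing in for [α₀, α₁], and the toy numbers are our own, not the paper's implementation:

```python
import numpy as np

def weighted_model_loss(per_sample_loss, ratio, clip_lo=0.1, clip_hi=10.0):
    """Average model loss with each real sample re-weighted by its estimated
    density ratio tau(s, a), clipped for stability."""
    w = np.clip(ratio, clip_lo, clip_hi)
    return float(np.mean(w * per_sample_loss))

losses = np.array([1.0, 1.0, 1.0])
ratios = np.array([0.01, 1.0, 100.0])   # extreme ratios are clipped away
val = weighted_model_loss(losses, ratios)
```

Real samples that the DICE network deems more represented in the simulated distribution thus contribute more to the model update, without any single estimated ratio dominating the gradient.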
Furthermore, because the simulated data is generated from some states in D_env, we use the collected real states to approximate the initial state distribution ν₀(s). After the DICE network estimates the ratio for each real sample, we clip it to the interval [α₀, α₁] with the intention of improving stability. We demonstrate the detailed process of IAMPO, built upon the MBPO (Janner et al., 2019) backbone, in Algorithm 2. Specifically, in each iteration, IAMPO first trains the DICE network (line 3), which is then used to estimate the importance ratio (line 4). As in Algorithm 1, the weighted sampling strategy (lines 12 to 13) will be introduced in Section 5.

Algorithm 2 IAMPO
1: Initialize policy π_φ, dynamics model T̂_θ, DICE network {τ, f, β}, environment buffer D_env, model buffer D_model
2: repeat
3:   Perform G₂ gradient steps to train the DICE network {τ, f, β} with data from D_env according to Eq. 20
4:   Use τ to estimate the ratio for each sample in D_env
5:   Perform G₁ gradient steps to train the model T̂_θ with the loss re-weighted by the ratio, using data in D_env
6:   for F model rollouts do
7:     Sample a state s uniformly from D_env
8:     Use the policy π_φ to perform a k-step rollout on the model T̂_θ starting from s; add to D_model
9:   end for
10:  for E timesteps in the real environment do
11:    Use the policy π_φ to take an action in the real environment; add the sample (s, a, s′, r) to D_env
12:    Use the model T̂_θ to estimate the uncertainty for each sample in D_model
13:    Perform G₃ gradient steps to train the policy π_φ with real data uniformly sampled from D_env and simulated data sampled from D_model according to the uncertainty
14:  end for
15: until a certain number of real samples has been collected

5. Weighted Sampling Strategy

Although the model adaptation methods proposed in the previous sections help to alleviate the distribution shift problem in model learning, some inaccurate predictions can still lead to catastrophic performance degradation due to the imperfect approximation. A straightforward solution is to discard the data with large model error when using them for policy optimization. In practice, it is non-trivial to estimate the model error; instead, we can use the model uncertainty, which is considered to be positively correlated with the model error, as a proxy. For example, Janner et al. (2019) used the model to generate short rollouts so that the longer rollouts with higher uncertainty are cut off. Pan et al. (2020) explicitly estimated the model uncertainty and masked the simulated data with high uncertainty. These methods directly force the occupancy measure of high-uncertainty simulated data to zero. However, this may be too restrictive, since there is no guarantee that high uncertainty means high model error, and the masked simulated data may still be useful if its ground-truth model error is small. To tackle the above challenges, we propose to utilize a weighted sampling strategy during the model usage phase for better generalization. Formally, as illustrated in Figure 1, when forming a mini-batch for policy training, the weighted sampling strategy chooses simulated data in the model buffer D_model according to a Boltzmann distribution based on the estimated uncertainty d(s, a). For a simulated sample x_i = (s_i, a_i, s′_i, r_i), the probability p(x_i) of being sampled into the mini-batch is

$$p(x_i) = \frac{\exp(-\sigma\, d(s_i, a_i))}{\sum_{x_j} \exp(-\sigma\, d(s_j, a_j))}, \tag{21}$$

where σ is a temperature parameter.
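A minimal sketch of the sampling rule in Eq. (21), together with the kind of ensemble-disagreement uncertainty it consumes (the concrete estimator used in the paper is described next; the function names, toy ensemble predictions and fixed σ here are our own):

```python
import numpy as np

def max_disagreement(preds):
    """Uncertainty proxy d(s, a): maximum pairwise L2 distance between the
    ensemble members' predictions for one (s, a). preds: (n_models, state_dim)."""
    n = len(preds)
    return max(np.linalg.norm(preds[i] - preds[j])
               for i in range(n) for j in range(i + 1, n))

def sampling_probs(uncertainties, sigma=1.0):
    """Boltzmann distribution of Eq. (21): lower uncertainty -> higher
    probability of entering the policy-training mini-batch."""
    logits = -sigma * np.asarray(uncertainties, dtype=float)
    logits -= logits.max()               # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Three simulated samples whose two-member ensembles disagree increasingly.
preds_per_sample = [np.array([[0.0], [0.1]]),
                    np.array([[0.0], [1.0]]),
                    np.array([[0.0], [5.0]])]
d = [max_disagreement(p) for p in preds_per_sample]
p = sampling_probs(d, sigma=1.0)         # p[0] > p[1] > p[2]
```

A mini-batch of indices could then be drawn with `np.random.default_rng().choice(len(p), size=batch_size, p=p)`; unlike hard masking, every sample keeps a non-zero probability.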
There are multiple ways to estimate the uncertainty, such as the maximum standard deviation of the learned models in the ensemble (Yu et al., 2020) or the prediction disagreement between one model and the rest of the models (Pan et al., 2020). In this paper, we follow the uncertainty estimation method in Kidambi et al. (2020) and use the maximum prediction discrepancy

$$d(s, a) = \max_{i,j} \big\| \hat{T}_{\theta_i}(s, a) - \hat{T}_{\theta_j}(s, a) \big\|_2,$$

where T̂_{θ_i} and T̂_{θ_j} are members of the model ensemble {T̂_{θ₁}, T̂_{θ₂}, …}. Besides, the sampling temperature σ is set as min{σ₀, σ₁/(d_max − d_min)} in practice, where σ₀ and σ₁ are hyperparameters, and d_max and d_min are the maximum and minimum uncertainty estimated in the model buffer, respectively. Alternatively, one can view the weighted sampling strategy as implicitly optimizing the second term of Eq. 14: that term, in its derivative form, suggests minimizing log ρ^π_{T̂_θ} weighted by d_F(T, T̂), and the weighted sampling strategy reduces the occupancy measure ρ^π_{T̂_θ} of simulated data whose uncertainty is high, which is positively correlated with the model error d_F(T, T̂).

6. Experiments

The experiments aim to answer the following three questions: 1) How do FAMPO and IAMPO perform compared to prior model-free and model-based methods? 2) Do the feature-based and instance-based adaptation methods address the distribution shift problem in MBRL as we expected? 3) What are the key ingredients of our algorithms?

6.1 Comparative Evaluation

Compared Methods. We compare the proposed FAMPO and IAMPO to other model-free and model-based algorithms: Soft Actor-Critic (SAC) (Haarnoja et al., 2018), the state-of-the-art model-free off-policy algorithm in terms of sample efficiency and asymptotic performance.
For model-based methods, we compare to MBPO (Janner et al., 2019), PETS (Chua et al., 2018) and SLBO (Luo et al., 2018), where PETS directly uses the model for planning without explicit policy learning, and SLBO trains the model with a multi-step L2-norm loss and updates the policy using TRPO (Schulman et al., 2015).

Environments. We evaluate our methods and the other baselines on six MuJoCo continuous control tasks from OpenAI Gym (Brockman et al., 2016) with a maximum horizon of 1000: InvertedPendulum, Swimmer, Hopper, Walker2d, Ant and HalfCheetah. For the Swimmer environment, we use the modified version introduced by Langlois et al. (2019), since the original version is quite difficult to solve. For the other five environments, we adopt the same settings as in Janner et al. (2019).

Implementation Details. We implement all our experiments using TensorFlow. For MBPO, FAMPO and IAMPO, we first apply a random policy to collect a certain number of real samples to pre-train the dynamics model. Every time we train the dynamics model, we randomly sample some real data as a validation set and stop the model training if the model loss does not decrease for five gradient steps, which means we do not choose a specific value for the hyperparameter G₁. Other important hyperparameters used in our methods are chosen by grid search, and the detailed hyperparameter settings can be found in Appendix E.

Results. The learning curves of all compared methods are presented in Figure 2. From the comparison, we observe that FAMPO and IAMPO are more sample efficient, as they learn faster than all other baselines in all six environments. Furthermore, our methods are capable of reaching asymptotic performance comparable to the state-of-the-art model-free baseline SAC. Compared with MBPO, our approaches achieve better performance in all the environments, which verifies the value of model adaptation.
This also indicates that even in the situation of reduced distribution shift from using short rollouts, model adaptation still helps. By comparing FAMPO and IAMPO, we can see that these two variants are comparably effective across different environments. Further analysis of the feature-based and instance-based adaptation is provided in Section 6.2.

Figure 2: Performance curves of our methods and other baselines on six MuJoCo tasks (InvertedPendulum, Swimmer, Hopper, Walker2d, Ant and HalfCheetah; each panel plots average return against environment steps). The results are averaged over eight random seeds, where solid curves depict the mean of the eight trials and shaded areas indicate the standard deviation. For each trial, the average return over ten episodes in the real environment is evaluated every 1000 environment timesteps. For MBPO, FAMPO and IAMPO, the policy is not updated during the dynamics model pre-training stage, so the performance is not plotted at the beginning.

6.2 Empirical Analysis

Distribution shift. To investigate how our proposed methods help mitigate the distribution shift problem in MBRL, we first visualize the real data and the simulated data in Figure 3(a). In particular, after training IAMPO for 50 epochs on Hopper, we randomly sample 500 real (s, a) pairs from D_env and 2000 simulated (s, a) pairs from D_model. Then, we plot the normalized t-SNE visualization of the state-action pairs, with the simulated data colored in blue and the real data colored differently according to the importance ratio estimated by IAMPO. It can be seen that the distribution of the simulated data deviates from that of the real data, which may lead to poor model predictions in these areas.
Moreover, if the real data is densely distributed and the simulated data is sparse, i.e., the ground-truth density ratio ρ^π_{T̂_θ}(s, a) / ρ^{π_D}_T(s, a) is small, the estimated ratio is also small (the black points in the figure), and vice versa. This visualization implies that the importance ratio of IAMPO is estimated as we expect.

Figure 3: Empirical results on the Hopper environment. (a) t-SNE visualization of state-action pairs randomly sampled from D_env and D_model; simulated data is drawn in blue while real data is drawn in different colors, where brighter points represent the real samples with larger importance ratios. (b) The Wasserstein-1 distance between the feature distributions. (c) Average compounding errors with different rollout lengths. (d) Relationship between the estimated uncertainty and the ground-truth error of the real samples; the orange points are the half of the data with high uncertainty, and the blue ones are the half with low uncertainty.

Wasserstein-1 Distance. Instead of re-weighting the real samples as shown in Figure 3(a), FAMPO explicitly aligns the feature distributions of the real samples and the simulated ones. To further verify the effect of feature-based model adaptation, we visualize the estimated Wasserstein-1 distance between the real features and the simulated ones. Besides MBPO and FAMPO, we additionally analyze the multi-step training loss of SLBO, since it also utilizes the model output as the input of model training, which may help learn invariant features.
According to the results shown in Figure 3(b), we find that: i) the vanilla model training in MBPO itself can slowly minimize the Wasserstein-1 distance between feature distributions; ii) the multi-step training loss in SLBO does help learn invariant features, but the improvement is limited; iii) the model adaptation loss in FAMPO is effective in promoting feature distribution alignment, which is consistent with our initial motivation.

Figure 4: The results of the ablation studies conducted on three environments with few interactions (40K steps for Hopper, 100K steps for Walker2d and Ant). The bars are average returns over five trials, and the black error lines indicate the standard deviation.

Compounding Errors. We also investigate whether model adaptation helps alleviate the compounding model errors (Nagabandi et al., 2018) of multi-step forward predictions, which are largely caused by the distribution shift problem. Concretely, we use the current policy to sample a trajectory (s₀, a₀, s₁, …, a_{h−1}, s_h) of length h in the real environment, and use the learned dynamics model to generate the corresponding simulated rollout (ŝ₀, a₀, ŝ₁, …, a_{h−1}, ŝ_h), where ŝ₀ = s₀ and ŝ_{i+1} = T̂_θ(ŝ_i, a_i). The empirical compounding error is then calculated as

$$\epsilon_h = \frac{1}{h} \sum_{i=1}^{h} \|\hat{s}_i - s_i\|_2.$$

We conduct experiments with different trajectory lengths h and plot the results in Figure 3(c). We find that both FAMPO and IAMPO achieve smaller compounding errors than MBPO, which meets our motivation that model adaptation can successfully mitigate the distribution shift.

Uncertainty. For our weighted sampling strategy, one critical question is whether the estimated uncertainty matches the real state prediction error (e.g., the IPM in the second term of Eq. 14).
However, since it is hard to obtain ground-truth labels for the simulated data due to the complexity of the MuJoCo simulator, we instead evaluate the uncertainty quantification using sampled real data (s, a, s′) ∈ D_env. To be specific, we use the dynamics model to predict the next state ŝ′ of newly collected real samples, on which the model hasn't been trained. Then we calculate the ground-truth error ‖ŝ′ − s′‖₂ and estimate the uncertainty d(s, a). Figure 3(d) shows the relationship between the estimated uncertainty and the ground-truth error, from which we see that in general there is a positive correlation between the two, justifying the incorporation of the uncertainty in the algorithm design. On the other hand, we further find that directly masking the half of the data with the highest uncertainty would also mask data with low ground-truth error, revealing the disadvantage of the hard-masking mechanism.

6.3 Ablation Studies

In this section, we aim to investigate the key ingredients of our algorithms through ablation studies. In particular, we compare the corresponding variants: (i) FAMPO w/o S, which only adopts feature-based model adaptation without incorporating the weighted sampling strategy; (ii) IAMPO w/o S, which only adopts instance-based model adaptation without incorporating the weighted sampling strategy; (iii) MBPO w/ S, which only adopts the weighted sampling strategy without incorporating model adaptation.

Figure 5: The results of the hyperparameter studies with respect to the adaptation iteration G₂ in FAMPO, the DICE training iteration G₁, and the temperature of the probabilistic sampling σ₀ in IAMPO. The experiments are conducted on Hopper with 40K steps and Walker2d with 100K steps.
We conduct the ablation experiments on three environments, and the results are shown in Figure 4. From the comparison, we observe that: 1) Both the feature-based and the instance-based model adaptation are quite effective, since in all three environments FAMPO w/o S and IAMPO w/o S improve the performance by a considerable margin compared to MBPO. 2) The effectiveness of the weighted sampling strategy varies across the three environments. To be more specific, the weighted sampling strategy shows its effectiveness on Hopper and Walker2d with different degrees of performance improvement, but it does not improve much on Ant. The reason may be that in complex environments it is difficult to estimate the uncertainty well enough to achieve a positive correlation with the ground-truth model error.

6.4 Hyperparameter Studies

In this section, we study the sensitivity of our methods to important hyperparameters. Specifically, for model adaptation, we investigate the sensitivity of FAMPO to the adaptation iteration G₂ and of IAMPO to the DICE training iteration G₁, while for the weighted sampling strategy, we assess the importance of the probabilistic sampling temperature σ. We conduct experiments with different values of these hyperparameters and plot the results in Figure 5. According to the results, we observe that increasing G₂ yields better performance up to a certain level, while a too large G₂ degrades the performance, which means we need to control the trade-off between model training and model adaptation to ensure the feature representations remain both invariant and discriminative. Similar trends can be observed for G₁ and σ, but in most cases the performance is still better than MBPO. To explain, a too small G₁ cannot train the DICE networks sufficiently, while a too large G₁ may cause overfitting.
And with a relatively small σ, the algorithm essentially adopts uniform sampling in model usage, while using too large a σ greatly degrades the performance, since focusing only on low-uncertainty simulated data may reduce the data diversity needed to obtain a good policy.

7. Related Work

Recently, model-based reinforcement learning has attracted increasing attention, mainly due to its high sample efficiency obtained by learning a model of the environment dynamics and using the model for policy optimization. In the MBRL literature, there are two stages, i.e., model learning and model usage. Model learning mainly involves two aspects: (1) the choice of function approximator, such as Gaussian processes (Deisenroth and Rasmussen, 2011), time-varying linear models (Sutton et al., 2012; Levine et al., 2016) and neural networks (Nagabandi et al., 2018); and (2) objective design, such as the multi-step L2-norm (Luo et al., 2018), the log-likelihood loss (Chua et al., 2018) and adversarial losses (Wu et al., 2019). In this regard, Chua et al. (2018) have demonstrated the effectiveness of using an ensemble of probabilistic models with the log-likelihood loss in reducing potential overfitting. Inspired by their success, we also use this model architecture in the design of our methods. For model usage, MBRL methods can be roughly categorized into four groups according to their specific model usage strategies (Lai et al., 2020; Zhu et al., 2020). The first group consists of analytic-gradient algorithms that use model derivatives to search for the policy by back-propagation (Deisenroth and Rasmussen, 2011; Heess et al., 2015). The second group includes shooting algorithms that directly plan forward by model predictive control (MPC) without an explicit policy (Nagabandi et al., 2018; Chua et al., 2018).
The third group mainly contains model-augmented value expansion algorithms that use model-based rollouts to improve targets for model-free temporal difference (TD) updates (Buckman et al., 2018; Feinberg et al., 2018). The last group consists of Dyna-style methods that use the learned model to generate simulated data to augment the real data for model-free policy training (Sutton, 1990; Luo et al., 2018). Under this taxonomy, our approaches are Dyna-style methods inspired by the recent MBPO (Janner et al., 2019) algorithm. One major challenge of Dyna-style MBRL in model usage is the distribution shift between the simulated and the real data, and this error is only exacerbated when the learned model is used to make multi-step predictions, because of error compounding. To this end, various solutions have been proposed to mitigate the distribution shift, which mainly fall into two types: the first aims at improving model learning, while the second proposes to use the model more conservatively. Our algorithms take ideas from both lines of work and make novel contributions to both. To facilitate model learning, Asadi et al. (2019) proposed to build multi-step models that directly predict the outcome of an action sequence. Furthermore, model-predicted outputs can be used to construct model training samples in addition to the real samples (Talvitie, 2017), so that the model generalizes to its own output distribution. Asadi et al. (2018) also proposed to add a Lipschitz continuity constraint on the model and provided a bound on the multi-step prediction error accordingly. As a comparison, FAMPO leverages feature-based domain adaptation and proposes a model adaptation strategy to learn a feature space where real data and simulated data are close in distribution. The design principle of IAMPO is heavily inspired by domain adaptation as well.
However, instead of feature-based approaches, IAMPO is more related to instance-based methods for adaptation, which often involve estimating an importance ratio to correct for the potential domain shift. Other works constrain the model usage so as not to explore regions with disastrous errors. For example, MBPO (Janner et al., 2019) uses the model to generate short rollouts to avoid the large departure from the real distribution incurred by long rollouts. In a similar vein, STEVE (Buckman et al., 2018) interpolates between rollouts of various lengths according to the estimated model uncertainty. Recently, Mu et al. (2021) trained an evaluator to assess the reliability of imagined trajectories, and M2AC (Pan et al., 2020) proposed a masking mechanism that, in the usage phase, masks the simulated data whose estimated uncertainty is high. Our model usage component also exploits model uncertainty to reduce the possibility of selecting unreliable simulated data for policy optimization. On the theoretical side, previous works on MBRL mostly focused on either tabular MDPs or linear dynamics (Szita and Szepesvári, 2010; Jaksch et al., 2010; Dean et al., 2019; Simchowitz et al., 2018), but not much has been done in the context of continuous state spaces and non-linear systems. Recently, Luo et al. (2018) gave a theoretical guarantee of monotonic improvement by introducing a reference policy and imposing constraints on policy optimization and model learning related to the reference policy. Janner et al. (2019) then derived a lower bound focusing on branched short rollouts, but their algorithm was a heuristic instead of being designed to maximize the lower bound. In contrast, our algorithms are naturally inspired by our theoretical results and directly minimize a derived upper bound on the return discrepancy without enforcing any constraint on the policy.

8.
Conclusion

In this paper, we investigate how to explicitly tackle the distribution shift problem in MBRL. We first provide an upper bound on the return discrepancy to justify the necessity of model adaptation for correcting the potential distribution bias in MBRL. With this insight, we then propose to incorporate model adaptation into model learning from both the feature and the instance perspective. In this way, the model gives more accurate predictions when generating simulated data, and therefore the follow-up policy optimization performance can be improved. Besides model adaptation, we additionally propose the weighted sampling strategy to further reduce the impact of inaccurate model predictions. Extensive experiments on continuous control tasks have shown the effectiveness of our work. We believe our work takes an important step towards more sample-efficient MBRL.

Acknowledgments

The Shanghai Jiao Tong University team is supported by Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and National Natural Science Foundation of China (62076161). The author Hang Lai is supported by the Wu Wen Jun Honorary Doctoral Scholarship, AI Institute, Shanghai Jiao Tong University. The work of Han Zhao was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Cooperative Agreement Number HR00112320012, a Facebook Research Award, and Amazon AWS Cloud Credit.

Appendix A. A Different View of Analysis

In this appendix, we provide an alternative perspective on the derivation of the upper bound on the expected return.

Lemma 3 Define the normalized state visit distribution as
$$
\nu^{\pi}_T(s) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t P^{\pi}_{T,t}(s), \quad \text{where } P^{\pi}_{T,t}(s) = P(s_t = s \mid \pi, T).
$$
Assume the initial state distributions of the real dynamics $T$ and the dynamics model $\hat T$ are the same. For any state $s'$, assume there exists a witness function class $\mathcal{F}_{s'} = \{f : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\}$ such that $\hat T(s' \mid \cdot, \cdot) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is in $\mathcal{F}_{s'}$.
Then the following holds:
$$
\big| \nu^{\pi_D}_T(s') - \nu^{\pi}_{\hat T}(s') \big| \le \gamma\, d_{\mathcal{F}_{s'}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big) + \gamma\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \big| T(s' \mid s, a) - \hat T(s' \mid s, a) \big|. \tag{22}
$$

Proof For the state visit distribution $\nu^{\pi}_{\hat T}(s')$, we have
$$
\nu^{\pi}_{\hat T}(s') = (1-\gamma)\, \nu_0(s') + \gamma \int \rho^{\pi}_{\hat T}(s, a)\, \hat T(s' \mid s, a) \, \mathrm{d}s \, \mathrm{d}a, \tag{23}
$$
where $\nu_0$ denotes the probability of the initial state being the state $s'$. Then we have
$$
\begin{aligned}
\big| \nu^{\pi_D}_T(s') - \nu^{\pi}_{\hat T}(s') \big|
&= \gamma \left| \int T(s' \mid s, a)\, \rho^{\pi_D}_T(s, a) - \hat T(s' \mid s, a)\, \rho^{\pi}_{\hat T}(s, a) \, \mathrm{d}s \, \mathrm{d}a \right| \\
&= \gamma \left| \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T}[T(s' \mid s, a)] - \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat T}}[\hat T(s' \mid s, a)] \right| \\
&\le \gamma \left| \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T}\big[T(s' \mid s, a) - \hat T(s' \mid s, a)\big] \right| + \gamma \left| \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T}[\hat T(s' \mid s, a)] - \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat T}}[\hat T(s' \mid s, a)] \right| \\
&\le \gamma\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \big| T(s' \mid s, a) - \hat T(s' \mid s, a) \big| + \gamma\, d_{\mathcal{F}_{s'}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big), \tag{24}
\end{aligned}
$$
which completes the proof.

Lemma 3 states that the discrepancy between the two state visit distributions at each state is bounded by the dynamics model error in predicting this state and the discrepancy between the two state-action occupancy measures. Intuitively, it means that when both the input state-action distributions and the conditional dynamics distributions are close, then the output state distributions will be close as well.

Theorem 4 Let $\mathcal{F} := \bigcup_{s' \in \mathcal{S}} \mathcal{F}_{s'}$ and define $\epsilon_\pi := 2\, d_{\mathrm{TV}}(\nu^{\pi}_T, \nu^{\pi_D}_T)$. Under the assumptions of Lemma 3, the expected return $\eta_T(\pi)$ admits the following bound:
$$
\big| \eta_{\hat T}(\pi) - \eta_T(\pi) \big| \le r_{\max}\, \epsilon_\pi + \gamma\, r_{\max}\, d_{\mathcal{F}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big)\, \mathrm{Vol}(\mathcal{S}) + \gamma\, r_{\max}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \sqrt{2 D_{\mathrm{KL}}\big(T(\cdot \mid s, a) \,\|\, \hat T(\cdot \mid s, a)\big)}, \tag{25}
$$
where $\mathrm{Vol}(\mathcal{S})$ is the volume of the state space $\mathcal{S}$.
Proof The return discrepancy is bounded as follows:
$$
\begin{aligned}
\big| \eta_{\hat T}(\pi) - \eta_T(\pi) \big|
&= \left| \int \big( \rho^{\pi}_T(s, a) - \rho^{\pi}_{\hat T}(s, a) \big)\, r(s, a) \, \mathrm{d}s \, \mathrm{d}a \right| \\
&= \left| \int \big( \nu^{\pi}_T(s)\, \pi(a \mid s) - \nu^{\pi}_{\hat T}(s)\, \pi(a \mid s) \big)\, r(s, a) \, \mathrm{d}s \, \mathrm{d}a \right| \\
&\le r_{\max} \int \big| \nu^{\pi}_T(s)\, \pi(a \mid s) - \nu^{\pi}_{\hat T}(s)\, \pi(a \mid s) \big| \, \mathrm{d}s \, \mathrm{d}a \\
&= r_{\max} \int \big| \nu^{\pi}_T(s) - \nu^{\pi}_{\hat T}(s) \big| \, \mathrm{d}s \\
&\le r_{\max} \int \big| \nu^{\pi_D}_T(s) - \nu^{\pi}_{\hat T}(s) \big| + \big| \nu^{\pi}_T(s) - \nu^{\pi_D}_T(s) \big| \, \mathrm{d}s \\
&\le r_{\max} \int \big| \nu^{\pi_D}_T(s) - \nu^{\pi}_{\hat T}(s) \big| \, \mathrm{d}s + r_{\max}\, \epsilon_\pi.
\end{aligned}
$$
Replacing the state $s$ above with the notation $s'$, then according to Lemma 3, we have
$$
\begin{aligned}
\big| \eta_{\hat T}(\pi) - \eta_T(\pi) \big|
&\le r_{\max}\, \epsilon_\pi + \gamma\, r_{\max} \int_{s'} d_{\mathcal{F}_{s'}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big) \, \mathrm{d}s' + \gamma\, r_{\max}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \int \big| T(s' \mid s, a) - \hat T(s' \mid s, a) \big| \, \mathrm{d}s' \\
&\le r_{\max}\, \epsilon_\pi + \gamma\, r_{\max}\, d_{\mathcal{F}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big)\, \mathrm{Vol}(\mathcal{S}) + \gamma\, r_{\max}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \int \big| T(s' \mid s, a) - \hat T(s' \mid s, a) \big| \, \mathrm{d}s' \\
&= r_{\max}\, \epsilon_\pi + \gamma\, r_{\max}\, d_{\mathcal{F}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big)\, \mathrm{Vol}(\mathcal{S}) + 2\gamma\, r_{\max}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T}\, d_{\mathrm{TV}}\big(T(\cdot \mid s, a), \hat T(\cdot \mid s, a)\big) \\
&\le r_{\max}\, \epsilon_\pi + \gamma\, r_{\max}\, d_{\mathcal{F}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big)\, \mathrm{Vol}(\mathcal{S}) + \gamma\, r_{\max}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \sqrt{2 D_{\mathrm{KL}}\big(T(\cdot \mid s, a) \,\|\, \hat T(\cdot \mid s, a)\big)}, \tag{28}
\end{aligned}
$$
where the last inequality holds due to Pinsker's inequality, which completes the proof.

Theorem 4 gives another upper bound on the discrepancy between the return in the model and the return in the real environment. In this bound, the first term denotes the divergence between the state visit distributions induced by the policy π and the behavioral policy π_D in the environment, which is an important objective in batch reinforcement learning (Fujimoto et al., 2019) for reliable exploitation of off-policy samples. The second term corresponds to the distribution shift problem, and the last term corresponds to the model estimation error on real data. Comparing this bound to the one in Theorem 2, the bound in Theorem 2 might seem tighter since there is no extra $\epsilon_\pi$ term. However, we should notice that the assumptions made in Theorem 2 are stronger: there we assume $G^{\pi}_{\hat T}$ satisfies the corresponding constraint, while here we only assume the model $\hat T$ satisfies the constraint, which is easier to hold.

Appendix B.
Interchange Justification

In this appendix, we provide a proof of the interchangeability of the gradient and the integration in Equation 14, which is guaranteed by the Leibniz integral rule (Protter et al., 1985; Talvila, 2001). Formally,

Lemma 5 (Leibniz Integral Rule) Let $f(x, t)$ be a function such that $f(x, t)$ and its partial derivative $\partial f(x, t) / \partial x$ are continuous in $t$ and $x$ in some region of the $xt$-plane including $a \le t \le b$ and $x_0 \le x \le x_1$. Then for $x_0 \le x \le x_1$,
$$
\frac{\mathrm{d}}{\mathrm{d}x} \int_a^b f(x, t) \, \mathrm{d}t = \int_a^b \frac{\partial}{\partial x} f(x, t) \, \mathrm{d}t. \tag{29}
$$

Let $f(\theta, (s, a)) = \rho^{\pi}_{\hat T_\theta}(s, a)\, d_{\mathcal{F}}\big(T(\cdot \mid s, a), \hat T_\theta(\cdot \mid s, a)\big)$. According to Lemma 5, we just need to prove that (i) $f$ is continuous, and (ii) the partial derivative $\partial f / \partial \theta$ is continuous in the state-action space. Note that $\pi$, $\hat T_\theta$ and $\partial \hat T_\theta / \partial \theta$ are continuous since they are constructed using neural networks. We assume that the ground-truth dynamics $T$ and $V^{\pi}_T \in \mathcal{F}$ are continuous and that the initial state $s_0$ is sampled from a continuous distribution.

Proof To prove (i), note that $f$ is continuous if $\rho^{\pi}_{\hat T_\theta}$ and $d_{\mathcal{F}}(T, \hat T_\theta)$ are both continuous. We first consider
$$
\rho^{\pi}_{\hat T_\theta}(s, a) = (1-\gamma)\, \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t P^{\pi}_{\hat T, t}(s).
$$
For $t = 0$, $P^{\pi}_{\hat T, 0}(s) = P(s_0 = s)$ is continuous since $s_0$ is sampled from a continuous distribution. For $t \ge 1$,
$$
P^{\pi}_{\hat T, t}(s) = \int P^{\pi}_{\hat T, t-1}(s_{t-1})\, \pi(a \mid s_{t-1})\, \hat T_\theta(s \mid s_{t-1}, a) \, \mathrm{d}s_{t-1} \, \mathrm{d}a
$$
is continuous if $P^{\pi}_{\hat T, t-1}(s)$ is continuous, since $\pi$ and $\hat T_\theta$ are continuous. Then by induction, $P^{\pi}_{\hat T, t}(s)$ is continuous for all $t \ge 0$. Therefore, $\rho^{\pi}_{\hat T_\theta}(s, a)$ is continuous as a discounted sum over $P^{\pi}_{\hat T, t}(s)\, \pi(a \mid s)$. We then consider $d_{\mathcal{F}}\big(T(\cdot \mid s, a), \hat T_\theta(\cdot \mid s, a)\big)$:
$$
d_{\mathcal{F}}\big(T(\cdot \mid s, a), \hat T_\theta(\cdot \mid s, a)\big) = \sup_{f_1 \in \mathcal{F}} \Big| \mathbb{E}_{s' \sim T}\big[f_1(s') \mid s, a\big] - \mathbb{E}_{s' \sim \hat T}\big[f_1(s') \mid s, a\big] \Big| = \sup_{f_1 \in \mathcal{F}} \left| \int_{s'} \big( T(s' \mid s, a) - \hat T_\theta(s' \mid s, a) \big)\, f_1(s') \, \mathrm{d}s' \right|. \tag{30}
$$
Note that the supremum of continuous functions is still continuous, thus $d_{\mathcal{F}}\big(T(\cdot \mid s, a), \hat T_\theta(\cdot \mid s, a)\big)$ is continuous under the assumption that $T$ and $\hat T_\theta$ are continuous. So far, we have proven (i). To prove (ii), note that
$$
\frac{\partial f}{\partial \theta} = \frac{\partial \rho^{\pi}_{\hat T_\theta}}{\partial \theta}\, d_{\mathcal{F}}(T, \hat T_\theta) + \rho^{\pi}_{\hat T_\theta}\, \frac{\partial d_{\mathcal{F}}(T, \hat T_\theta)}{\partial \theta}.
$$
We have proven that $d_{\mathcal{F}}(T, \hat T_\theta)$ and $\rho^{\pi}_{\hat T_\theta}$ are continuous, so $\partial f / \partial \theta$ is continuous if $\partial \rho^{\pi}_{\hat T_\theta} / \partial \theta$ and $\partial d_{\mathcal{F}}(T, \hat T_\theta) / \partial \theta$ are continuous. First,
$$
\begin{aligned}
\frac{\partial \rho^{\pi}_{\hat T_\theta}}{\partial \theta}
&= (1-\gamma)\, \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t \frac{\partial P^{\pi}_{\hat T, t}(s)}{\partial \theta}, \\
\frac{\partial P^{\pi}_{\hat T, t}(s)}{\partial \theta}
&= \frac{\partial}{\partial \theta} \int P^{\pi}_{\hat T, t-1}(s_{t-1})\, \pi(a \mid s_{t-1})\, \hat T_\theta(s \mid s_{t-1}, a) \, \mathrm{d}s_{t-1} \, \mathrm{d}a \\
&= \int \pi(a \mid s_{t-1}) \left[ \frac{\partial P^{\pi}_{\hat T, t-1}(s_{t-1})}{\partial \theta}\, \hat T_\theta(s \mid s_{t-1}, a) + P^{\pi}_{\hat T, t-1}(s_{t-1})\, \frac{\partial \hat T_\theta(s \mid s_{t-1}, a)}{\partial \theta} \right] \mathrm{d}s_{t-1} \, \mathrm{d}a, \tag{34}
\end{aligned}
$$
where the interchange of the integration and the gradient in the last equation can be proven via Lemma 5. Similarly, by induction, $\partial P^{\pi}_{\hat T, t}(s) / \partial \theta$ is continuous for all $t \ge 0$. Note that the integration of a continuous function is continuous. Therefore, $\partial \rho^{\pi}_{\hat T_\theta} / \partial \theta$ is continuous since $\partial P^{\pi}_{\hat T, t-1}(s_{t-1}) / \partial \theta$ and $\partial \hat T_\theta / \partial \theta$ are continuous. Secondly,
$$
\begin{aligned}
\frac{\partial d_{\mathcal{F}}(T, \hat T_\theta)}{\partial \theta}
&= \frac{\partial}{\partial \theta} \sup_{f_1 \in \mathcal{F}} \left| \int_{s'} \big( T(s' \mid s, a) - \hat T_\theta(s' \mid s, a) \big)\, f_1(s') \, \mathrm{d}s' \right| \\
&= \frac{\partial}{\partial \theta} \int_{s'} \big( T(s' \mid s, a) - \hat T_\theta(s' \mid s, a) \big)\, f_1^*(s') \, \mathrm{d}s' \\
&= -\int_{s'} f_1^*(s')\, \frac{\partial \hat T_\theta(s' \mid s, a)}{\partial \theta} \, \mathrm{d}s', \tag{37}
\end{aligned}
$$
where $f_1^* = \arg\max_{f_1 \in \mathcal{F}} \int_{s'} \big( T(s' \mid s, a) - \hat T_\theta(s' \mid s, a) \big)\, f_1(s') \, \mathrm{d}s'$. Here we can directly take the supremum since $\mathcal{F} := \big\{ f : \|f\|_\infty \le \frac{r_{\max}}{1-\gamma} \big\}$ is closed. And the supremum is almost surely unique, since the maximizer simply takes $f_1^*(s') = \frac{r_{\max}}{1-\gamma}$ wherever $T(s' \mid s, a) - \hat T_\theta(s' \mid s, a)$ is positive, and $f_1^*(s') = -\frac{r_{\max}}{1-\gamma}$ otherwise. The interchange of the integration and the gradient in the last step can be proven via Lemma 5 again. Similarly, as $\partial \hat T_\theta / \partial \theta$ is continuous, $\partial d_{\mathcal{F}}(T, \hat T_\theta) / \partial \theta$ is continuous too, which completes the proof of (ii). Combining (i) and (ii) completes the proof.

Appendix C. Design Evaluation

In this appendix, we evaluate the design choices for each part of our algorithm. More specifically, we compare different choices of feature alignment metrics, importance ratio estimation methods, and sampling strategy schemes.

C.1 MMD Variant of FAMPO

Besides the Wasserstein-1 distance, we can use other distribution divergence metrics to align the features.
MMD is another instance of an IPM, obtained when the witness function class is the unit ball in a reproducing kernel Hilbert space (RKHS). Let $k$ be the kernel of the RKHS $\mathcal{H}_k$ of functions on $\mathcal{X}$. Then the squared MMD in $\mathcal{H}_k$ between two feature distributions $P_{h_e}$ and $P_{h_m}$ is (Gretton et al., 2012):
$$
\mathrm{MMD}^2_k(P_{h_e}, P_{h_m}) := \mathbb{E}_{h_e, h_e'}\big[k(h_e, h_e')\big] + \mathbb{E}_{h_m, h_m'}\big[k(h_m, h_m')\big] - 2\, \mathbb{E}_{h_e, h_m}\big[k(h_e, h_m)\big], \tag{38}
$$
which is a non-parametric measure based on kernel mappings. In practice, given finite feature samples $\{h_e^1, \ldots, h_e^{N_e}\} \sim P_{h_e}$ and $\{h_m^1, \ldots, h_m^{N_m}\} \sim P_{h_m}$, where $N_e$ and $N_m$ are the numbers of real and simulated samples, one unbiased estimator of $\mathrm{MMD}^2_k(P_{h_e}, P_{h_m})$ can be written as follows:
$$
\mathcal{L}_{\mathrm{MMD}}(\theta_g) = \frac{1}{N_e (N_e - 1)} \sum_{i \ne i'} k(h_e^i, h_e^{i'}) + \frac{1}{N_m (N_m - 1)} \sum_{j \ne j'} k(h_m^j, h_m^{j'}) - \frac{2}{N_e N_m} \sum_{i=1}^{N_e} \sum_{j=1}^{N_m} k(h_e^i, h_m^j). \tag{39}
$$
To achieve model adaptation through MMD, we optimize the feature extractor to minimize the above adaptation loss $\mathcal{L}_{\mathrm{MMD}}$ with real $(s, a)$ data and simulated data as input. When implementing the MMD variant, choosing optimal kernels remains an open problem, and we use a linear combination of eight RBF kernels with bandwidths $\{0.001, 0.005, 0.01, 0.05, 0.1, 1, 5, 10\}$.

Figure 6: Learning curves of FAMPO using different metrics (Wasserstein-1 distance and MMD).

The results are shown in Figure 6. Note that we only compare different metrics for feature alignment here and do not incorporate the weighted sampling strategy. We can observe that using MMD as the distribution divergence measure is also effective in the FAMPO framework, being only slightly worse than using the Wasserstein-1 distance on Hopper.

C.2 Classification Variant of IAMPO

We compare two methods to estimate the importance ratio for IAMPO: the DICE network and binary classification, as introduced in Section 4.2. The results are shown in Figure 7.
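As a concrete illustration of the estimator in Equation (39), the following NumPy sketch computes the unbiased squared MMD with a single RBF kernel (our experiments combine eight such kernels; the function and variable names here are illustrative, not taken from our codebase):

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    """Gram matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * bandwidth^2))."""
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_unbiased(h_e, h_m, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between real features h_e (N_e, d)
    and simulated features h_m (N_m, d), following Equation (39)."""
    n_e, n_m = len(h_e), len(h_m)
    k_ee = rbf_kernel(h_e, h_e, bandwidth)
    k_mm = rbf_kernel(h_m, h_m, bandwidth)
    k_em = rbf_kernel(h_e, h_m, bandwidth)
    # Exclude the i == i' and j == j' diagonal terms for unbiasedness.
    term_e = (k_ee.sum() - np.trace(k_ee)) / (n_e * (n_e - 1))
    term_m = (k_mm.sum() - np.trace(k_mm)) / (n_m * (n_m - 1))
    return term_e + term_m - 2.0 * k_em.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(200, 3)), rng.normal(size=(200, 3)))
shifted = mmd2_unbiased(rng.normal(size=(200, 3)),
                        rng.normal(loc=2.0, size=(200, 3)))
# `shifted` is much larger than `same`, which fluctuates around zero.
```

In the MMD variant, this loss is minimized with respect to the feature extractor parameters, with real and simulated (s, a) features as the two samples.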
Similarly, here we do not use the weighted sampling strategy. While IAMPO with binary classification achieves better performance than MBPO, it is not as effective as IAMPO with DICE, mainly due to the imbalance between the amounts of real and simulated data.

Figure 7: Learning curves of IAMPO with DICE network (IAMPO-DICE) and IAMPO with classification (IAMPO-CLAS).

C.3 Masking Scheme of Sampling Strategy

We also use the same uncertainty estimation method as in the weighted sampling strategy to implement the masking scheme (Pan et al., 2020). To be more specific, after estimating the uncertainty, we directly mask the half of the simulated data in Dmodel with the highest uncertainty and use the remaining data to train the policy. In practice, we ensure that after masking, the size of Dmodel is still the same as in MBPO by increasing the rollout batch size. The preliminary results on Hopper and Walker2d are shown in Figure 8. We find that the masking scheme cannot achieve better performance than MBPO. The reason may be that after masking half of the simulated data, only the samples very close to the real data are retained to train the policy, as analyzed in Section 5. This observation further verifies the effectiveness of the weighted sampling strategy.

Figure 8: Comparison of original MBPO to MBPO with the masking scheme and with the weighted sampling strategy.

Figure 9: More empirical analysis for FAMPO. (a) Early stopping: FAMPO-NOSTOP denotes the FAMPO variant without early stopping of the model adaptation procedure.
(b) Adaptation strategy: FAMPO-SW denotes the FAMPO variant that shares the feature extractor weights between the two data distributions, and FAMPO-ADDA denotes the FAMPO variant that fixes the feature extractor of the real data.

Appendix D. More Experimental Results for FAMPO

D.1 Model Adaptation Early Stopping

In practice, we find that after a certain number of environment steps, the model loss difference between FAMPO and MBPO becomes small. So in FAMPO, we early stop the model adaptation procedure after collecting a certain amount of real data, such as 40K samples in the Hopper environment. We then conduct experiments without early stopping the model adaptation, and the results are shown in Figure 9(a). We find that keeping adapting the dynamics model throughout the whole learning process does not bring performance improvement. This indicates that model adaptation makes a difference only when the model training data is insufficient. So we set a model adaptation early stopping epoch for each environment (see Table 2 for details) to improve the computational efficiency.

D.2 Adaptation Strategy

In FAMPO, we untie the feature extractor weights for the two data distributions and learn the two feature extractors simultaneously, which is a variant of the adaptation strategy in Adversarial Discriminative Domain Adaptation (ADDA) (Tzeng et al., 2017). In contrast, in ADDA the feature mapping for the source domain (i.e., the real data) is fixed. Another alternative is to share the feature extractor weights between the two data distributions. From the comparison in Figure 9(b), we observe that the performance of these three adaptation strategies does not differ much, but FAMPO performs slightly better. The reason may be that, unlike in general domain adaptation, in MBRL scenarios the real and simulated data are collected and generated continuously, so learning a feature extractor specialized for the simulated data at each iteration is unnecessary.

Appendix E.
Hyperparameter Settings

Table 1 lists the common hyperparameters used in FAMPO and IAMPO. Table 2 and Table 3 list the distinct hyperparameters of FAMPO and IAMPO, respectively.

Table 1: Common hyperparameters for FAMPO and IAMPO. Columns range over the environments, from InvertedPendulum to HalfCheetah; values are listed per column where they differ.
- network architecture: MLP with four hidden layers of size 200 (feature extractor: four hidden layers; decoder: one output layer)
- real samples for model pretraining: 300 / 2000 / 5000
- E, real steps per epoch: 250 / 1000
- F, real steps between model training: 125 / 250
- B, model rollout batch size: 100000
- ensemble size: 7
- G3, policy updates per real step: 30 / 20 / 40

Table 2: Distinct hyperparameters for FAMPO. [a, b, x, y] denotes a linear schedule, i.e., at epoch e, f(e) = min(max(x + (e - a)/(b - a) * (y - x), x), y).
- model adaptation batch size: 64 / 256
- k, rollout length: 1 / 1 / [5, 45, 1, 15] / 1 / [10, 60, 1, 25] / [1, 30, 1, 5]
- G2, model adaptation updates: 6 / 40 / 400 / 1000 / 3000 / [1, 30, 100, 1000]
- model adaptation early stop epoch: 6 / 6 / 40 / 80 / 60 / 30
- σ0, temperature in weighted sampling: 1 / 0 / 1 / 1 / 0.2 / 0
- σ1, temperature in weighted sampling: 20 / 0 / 1 / 50 / 30 / 0

Table 3: Distinct hyperparameters for IAMPO. [a, b, x, y] denotes a linear schedule as in Table 2.
- DICE training batch size: 256
- k, rollout length: 1 / 3 / [10, 60, 1, 15] / 1 / [10, 60, 1, 25] / 5
- G1, DICE training updates: 100 / 100 / 800 / 600 / 300 / 450
- α0, ratio clipping minimum value: 0.1
- α1, ratio clipping maximum value: 5 / 20 / 5
- λ, coefficient in DICE loss: 0.1 / 3 / 5 / 0.1
- σ0, temperature in weighted sampling: 2 / 5 / 1 / 2 / 0.1 / 0
- σ1, temperature in weighted sampling: 20 / 20 / 40 / 50 / 7.5 / 0

References

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214-223, 2017.

Kavosh Asadi, Dipendra Misra, and Michael Littman. Lipschitz continuity in model-based reinforcement learning.
In International Conference on Machine Learning, pages 264-273, 2018.

Kavosh Asadi, Dipendra Misra, Seungchan Kim, and Michael L Littman. Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320, 2019.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137-144, 2007.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151-175, 2010.

Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pages 8224-8234, 2018.

Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754-4765, 2018.

Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1-47, 2019.

Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465-472, 2011.

Amir-massoud Farahmand. Iterative value-aware model learning. In Advances in Neural Information Processing Systems, pages 9072-9083, 2018.
Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052-2062, 2019.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096-2030, 2016.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.

Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389-3396. IEEE, 2017.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767-5777, 2017.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. Advances in Neural Information Processing Systems, 28:2944-2952, 2015.

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565-4573, 2016.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563-1600, 2010.

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498-12509, 2019.

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020.

Bartosz Krawczyk. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4):221-232, 2016.

Hang Lai, Jian Shen, Weinan Zhang, and Yong Yu. Bidirectional model-based policy optimization. In International Conference on Machine Learning, pages 5618-5627. PMLR, 2020.

Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334-1373, 2016.

Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
Yao Mu, Yuzheng Zhuang, Bin Wang, Guangxiang Zhu, Wulong Liu, Jianyu Chen, Ping Luo, Shengbo Li, Chongjie Zhang, and Jianye Hao. Model-based reinforcement learning via imagination with derived memory. Advances in Neural Information Processing Systems, 34, 2021.

Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733, 2019.

Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559-7566. IEEE, 2018.

Feiyang Pan, Jia He, Dandan Tu, and Qing He. Trust the model when it is confident: Masked model-based actor-critic. Advances in Neural Information Processing Systems, 33, 2020.

Murray H Protter and Charles B Morrey. Differentiation under the integral sign. Improper integrals. The gamma function. Intermediate Calculus, pages 421-453, 1985.

R Tyrrell Rockafellar. Convex Analysis, volume 36. Princeton University Press, 1970.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.

Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Jian Shen, Han Zhao, Weinan Zhang, and Yong Yu. Model-based policy optimization with unsupervised model adaptation. Advances in Neural Information Processing Systems, 33, 2020.

Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht.
Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.

Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.

Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540, 2018.

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216-224. Elsevier, 1990.

Richard S Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael P Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285, 2012.

István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031-1038, 2010.

Erik Talvila. Necessary and sufficient conditions for differentiating under the integral sign. The American Mathematical Monthly, 108(6):544-548, 2001.

Erik Talvitie. Model regularization for stable sample rollouts. In UAI, pages 780-789, 2014.

Erik Talvitie. Self-correcting models for model-based reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167-7176, 2017.

Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Yueh-Hua Wu, Ting-Han Fan, Peter J Ramadge, and Hao Su. Model imitation for model-based reinforcement learning. arXiv preprint arXiv:1909.11821, 2019.
Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33, 2020.

Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, page 114, 2004.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072, 2020a.

Shangtong Zhang, Bo Liu, and Shimon Whiteson. GradientDICE: Rethinking generalized offline estimation of stationary values. In International Conference on Machine Learning, pages 11194-11203. PMLR, 2020b.

Han Zhao, Remi Tachet des Combes, Kun Zhang, and Geoffrey J Gordon. On learning invariant representation for domain adaptation. arXiv preprint arXiv:1901.09453, 2019.

Guangxiang Zhu, Minghao Zhang, Honglak Lee, and Chongjie Zhang. Bridging imagination and reality for model-based deep reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020.