# Adaptation Augmented Model-based Policy Optimization

Journal of Machine Learning Research 24 (2023) 1-35. Submitted 5/22; Revised 4/23; Published 6/23.

Jian Shen (rockyshen@apex.sjtu.edu.cn), Hang Lai (laihang99@sjtu.edu.cn), Minghuan Liu (minghuanliu@sjtu.edu.cn), Han Zhao (hanzhao@illinois.edu), Yong Yu (yyu@sjtu.edu.cn), Weinan Zhang (wnzhang@sjtu.edu.cn)

Department of Computer Science, Shanghai Jiao Tong University; Department of Computer Science, University of Illinois Urbana-Champaign

Editor: Laurent Orseau

Abstract

Compared to model-free reinforcement learning (RL), model-based RL is often more sample efficient by leveraging a learned dynamics model to help decision making. However, the learned model is usually not perfectly accurate, and its error compounds over multi-step predictions, which can lead to poor asymptotic performance. In this paper, we first derive an upper bound on the return discrepancy between the real dynamics and the learned model, which reveals the fundamental problem of distribution shift between simulated data and real data. Inspired by this theoretical analysis, we propose an adaptation augmented model-based policy optimization (AMPO) framework to address the distribution shift problem from the perspectives of feature learning and instance re-weighting, respectively. Specifically, the feature-based variant, namely FAMPO, introduces unsupervised model adaptation to minimize the integral probability metric (IPM) between feature distributions from real and simulated data, while the instance-based variant, termed IAMPO, utilizes importance sampling to re-weight the real samples used to train the model. Besides model learning, we also investigate how to improve policy optimization in the model usage phase by selecting simulated samples with different probabilities according to their uncertainty.
Extensive experiments on challenging continuous control tasks show that FAMPO and IAMPO, coupled with our model usage technique, achieve superior performance against baselines, which demonstrates the effectiveness of the proposed methods.

Keywords: model-based reinforcement learning, distribution shift, occupancy measure, integral probability metric, importance sampling

Equal contribution. Corresponding author. ©2023 Jian Shen, Hang Lai, Minghuan Liu, Han Zhao, Yong Yu and Weinan Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0606.html.

1. Introduction

Reinforcement learning (RL) algorithms can be roughly divided into two categories according to whether they utilize an environmental dynamics model: model-free RL (MFRL) and model-based RL (MBRL). MFRL methods, which directly learn a value function or a policy (or both), have achieved great success on a wide range of tasks such as video games (Mnih et al., 2015; Hessel et al., 2018) and robotic control (Gu et al., 2017; Haarnoja et al., 2018). However, MFRL is notoriously sample-inefficient and requires a tremendous number of interactive samples to learn a good policy. In many high-stakes real-world applications, e.g., autonomous driving and online education, it is often expensive, or even infeasible, to collect such large-scale datasets. In contrast, MBRL methods, which learn a dynamics model first and use it to alleviate the sampling cost, are widely considered to be an appealing alternative (Sun et al., 2018; Langlois et al., 2019). Despite their higher sample efficiency, model-based methods tend to have poor asymptotic performance compared to their model-free counterparts due to their vulnerability to model errors (Nagabandi et al., 2018; Chua et al., 2018).
To be precise, even when equipped with a high-capacity model, such errors still exist due to the potential distribution shift between the training and generating phases, i.e., the state-action input distribution used to train the model differs from the one generated by the model (Talvitie, 2014). Specifically, the training data is usually collected by the behavior policies in the real environment, but the model is required to make predictions on the data collected by the target policy in the model. When an imperfect model is used for multi-step rollouts, the error in one-step prediction tends to accumulate over subsequent steps, a phenomenon known as the multi-step compounding error challenge (Asadi et al., 2018).

In light of the distribution shift problem in MBRL, many efforts in the literature have been devoted to tackling it, either by improving the approximation accuracy of model learning or by designing careful strategies for using the model in simulation. For model learning, different architectures (Asadi et al., 2018, 2019; Chua et al., 2018) and loss functions (Farahmand, 2018; Wu et al., 2019) have been proposed to mitigate overfitting or to improve multi-step predictions, so that the distributions of the simulated data generated by the model are closer to the real ones. Besides, Talvitie (2014, 2017) proposed a self-correcting mechanism that trains the model on predicted states as inputs along with the real data, gradually bridging the gap between the simulated data and the real data. On the other hand, for model usage, delicate rollout schemes (Janner et al., 2019; Pan et al., 2020) have been adopted to stop the model rollouts before the simulated data deviate too much from the real distribution. Although these existing methods help alleviate the distribution shift to some extent, the problem has not been explicitly addressed.
In this paper, we investigate how to explicitly address the distribution shift problem in a principled way. To begin with, we derive an upper bound on the return disparity between the real dynamics and the learned model, which naturally inspires a bound minimization algorithm. To this end, we propose our AMPO (Adaptation augmented Model-based Policy Optimization) framework, built upon the existing MBPO (Janner et al., 2019) method, with two variants dubbed FAMPO and IAMPO, respectively. More specifically, FAMPO introduces a model adaptation procedure that encourages the model to learn invariant feature representations by minimizing the integral probability metric (IPM) between the feature distributions of real data and simulated data. In addition to aligning the feature distributions, we also handle the distribution shift problem at the data level and propose an instance-based model adaptation method, IAMPO. The intuition behind IAMPO is that the model should focus more on the state-action pairs that are more likely to appear in the simulated data distribution. The design principle of IAMPO is simple and straightforward: optimize the dynamics model by minimizing the return discrepancy via gradient descent. In practice, IAMPO re-weights all the training samples by importance scores when learning the dynamics model, where the importance score is defined as the density ratio between the occupancy measures of the simulated data and the real data. While model adaptation helps to achieve better generalization, some inaccurate predictions can still catastrophically affect the performance due to the imperfect approximation. For this reason, during the model usage phase, we adopt a weighted sampling strategy, which samples the simulated data with different probabilities according to the model uncertainty, to reduce the proportion of uncertain samples in the simulated data used for policy optimization.
We evaluate our algorithms on a range of continuous control benchmark tasks, and the results demonstrate that FAMPO and IAMPO, coupled with the weighted sampling strategy, achieve higher sample efficiency and better asymptotic performance compared to various baselines.

2. Preliminaries

We first introduce the notation used throughout the paper and the problem setup of RL. Then, we briefly discuss the concepts related to the integral probability metric. Finally, we present a classical MBRL method, MBPO, which will be the underlying framework for our algorithm.

2.1 Reinforcement Learning

We consider an infinite-horizon Markov Decision Process (MDP), which is defined by the tuple (S, A, T, r, ν₀, γ), where S is the state space and A is the action space. Throughout the paper, we assume the state and action spaces are continuous. We use γ ∈ (0, 1) to denote the discount factor, and T(s′|s, a) to denote the transition density of state s′ given action a taken in state s. The initial state distribution is denoted as ν₀, and the reward function is denoted as r(s, a). We assume the reward function is bounded: r_max := sup_{s,a} |r(s, a)| < ∞. The agent maintains a policy π(a|s) that determines the probability of choosing an action a at a given state s. The goal in reinforcement learning (RL) is to find the optimal policy π* that maximizes the expected return (sum of discounted rewards), denoted by η_T:

$$\pi^* := \arg\max_{\pi} \eta_T(\pi) = \arg\max_{\pi} \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 \sim \nu_0\Big], \tag{1}$$

where s_{t+1} ∼ T(·|s_t, a_t) and a_t ∼ π(·|s_t). In general the true transition T(s′|s, a) is unknown, and MBRL methods often learn an approximate model T̂(s′|s, a) of the transition dynamics using samples collected from interactions with the MDP. Different from previous works (Luo et al., 2018; Chua et al., 2018), in this paper we also assume the reward function r(s, a) to be unknown, so an agent needs to learn the reward function simultaneously.
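As a concrete illustration of the objective in Eq. (1), the following minimal Python sketch estimates the discounted return of a single trajectory by truncating the infinite sum at the rollout length; the function name and toy rewards are our own, not from the paper:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Truncated discounted return: sum_t gamma^t * r_t over one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

# Sanity check: a constant reward of 1 has infinite-horizon return 1 / (1 - gamma),
# and a long finite rollout approaches that limit.
approx = discounted_return([1.0] * 2000, gamma=0.99)  # close to 100
```

In MBRL, the surrogate return η_T̂(π) is estimated by averaging such returns over rollouts generated by the learned model T̂ instead of the real environment.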
Given a policy π and a transition function T, we denote the density of being in state s at time step t as P^π_{T,t}(s) = P(s_t = s | π, T, s₀ ∼ ν₀). We then define the normalized occupancy measure (Ho and Ermon, 2016) of policy π under the dynamics T as

$$\rho^{\pi}_{T}(s, a) = (1 - \gamma)\, \pi(a|s) \sum_{t=0}^{\infty} \gamma^t P^{\pi}_{T,t}(s).$$

Similarly, ρ^π_{T̂}(s, a) represents the normalized occupancy measure of policy π under the approximate dynamics model T̂.

2.2 Integral Probability Metric

The integral probability metric (IPM) is a family of discrepancy measures between two distributions over the same space (Müller, 1997; Sriperumbudur et al., 2009). Specifically, given two probability distributions P and Q over X, the F-IPM is defined as

$$d_{\mathcal{F}}(P, Q) := \sup_{f \in \mathcal{F}} \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)], \tag{2}$$

where F is a class of witness functions f : X → ℝ. Following Bińkowski et al. (2018), we assume the IPM is symmetric; that is, if f ∈ F, we also have −f ∈ F. By choosing different function classes F, the IPM reduces to many well-known distance metrics between probability distributions. In particular, the Wasserstein-1 distance (Villani, 2008) is defined using the 1-Lipschitz functions {f : ‖f‖_L ≤ 1}, where the Lipschitz semi-norm ‖·‖_L is defined as ‖f‖_L = sup_{x ≠ y} |f(x) − f(y)| / |x − y|. Furthermore, the total variation distance is also a kind of IPM, and we use d_TV(·, ·) to denote it.

2.3 Model-based Policy Optimization

We briefly summarize the model-based policy optimization (MBPO) (Janner et al., 2019) algorithm, on top of which we build our algorithm. MBPO uses a bootstrapped ensemble of probabilistic dynamics models T̂_θ(s′|s, a), parameterized by θ. Each individual dynamics model is a probabilistic neural network that outputs a Gaussian distribution with diagonal covariance. The model ensemble is trained on the real data via minimizing the negative log-likelihood loss:

$$\mathcal{L}^{i}_{\hat{T}}(\theta) = \big[\mu^{i}_{\theta}(s_n, a_n) - s_{n+1}\big]^{\top} \Sigma^{i}_{\theta}(s_n, a_n)^{-1} \big[\mu^{i}_{\theta}(s_n, a_n) - s_{n+1}\big] + \log \det \Sigma^{i}_{\theta}(s_n, a_n). \tag{3}$$

The learned model T̂_θ(s′|s, a) is used to generate k-step rollouts starting from states sampled from the real data buffer D_env, with the actions taken by the current policy π_φ. The generated simulated data is then added to a separate buffer D_model. Finally, the policy π_φ is trained on both real and simulated data from D_env ∪ D_model in a fixed ratio using Soft Actor-Critic (SAC) (Haarnoja et al., 2018), which trains a stochastic policy with entropy regularization in an actor-critic architecture by minimizing the expected KL-divergence:

$$\mathcal{L}_{\pi}(\phi) = \mathbb{E}_{s}\big[D_{\mathrm{KL}}\big(\pi_{\phi}(\cdot|s) \,\big\|\, \exp(Q(s, \cdot) - V(s))\big)\big], \tag{4}$$

where the Q and V functions are estimated via the soft Bellman backup operator following Haarnoja et al. (2018).

3. Feature-based Adaptation-augmented MBPO

In this section, we first propose a feature-based adaptation approach to explicitly mitigate the distribution shift problem in MBRL, inspired by the return discrepancy upper bound derived as follows.

3.1 An Upper Bound for Return Discrepancy

One of the main benefits of model-based RL methods is that, once the model is learned, we can use it to simulate data in place of the real environment. If the dynamics model were perfect, we would not need the real environment anymore. However, if the dynamics model is highly erroneous, the policy learned on the model may perform poorly in the real environment, which can lower sample efficiency instead. Therefore, it is necessary in MBRL to derive an upper bound on the discrepancy between the expected return in the real environment η_T(π) and the expected return in the model η_{T̂}(π) under the same policy π, in the following form (Luo et al., 2018; Janner et al., 2019):

$$\big|\eta_{\hat{T}}(\pi) - \eta_{T}(\pi)\big| \le C. \tag{5}$$

Usually, the dynamics model T̂ will be learned from experiences (s, a, s′) collected by a behavioral policy π_D under the real environment dynamics T.
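To make the per-model loss in Eq. (3) concrete, here is a minimal NumPy sketch of the diagonal-Gaussian negative log-likelihood (up to additive constants); the function name and the toy arrays are our own illustration, not code from the paper:

```python
import numpy as np

def gaussian_nll(mu, var, target):
    """Eq. (3) for a diagonal covariance: Mahalanobis term plus log-determinant.

    mu, var: predicted mean and diagonal variance, shape (batch, state_dim).
    target:  observed next states s_{n+1}, same shape.
    """
    diff = mu - target
    return np.sum(diff ** 2 / var, axis=-1) + np.sum(np.log(var), axis=-1)

# An unbiased prediction scores a lower loss than a biased one at equal variance.
mu, var = np.zeros((1, 3)), np.ones((1, 3))
good = gaussian_nll(mu, var, np.zeros((1, 3)))       # loss 0 (constants dropped)
bad = gaussian_nll(mu + 1.0, var, np.zeros((1, 3)))  # loss 3
```

In MBPO, each ensemble member i minimizes this loss on its own bootstrap sample of the real data, which is what makes the ensemble members disagree on unfamiliar inputs.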
Typically, in an online MBRL method with iterative policy optimization, the behavioral policy π_D represents a collection of past policies. Once we have derived this bound, we can naturally design a model-based framework to optimize the RL objective by maximizing the surrogate return η_{T̂}(π) and minimizing the discrepancy C simultaneously. Specifically, previous works have derived the following lemma to give a precise return discrepancy (Luo et al., 2018; Yu et al., 2020).

Lemma 1 (Luo et al. (2018), Lemma 4.3; Yu et al. (2020), Lemma 4.1; Shen et al. (2020), Lemma E.1) Let two MDPs share the same reward function r(s, a) but have different dynamics T(·|s, a) and T̂(·|s, a), respectively. Define

$$V^{\pi}_{T}(s) := \mathbb{E}_{\pi, T}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big]$$

as the expected discounted return under π starting from state s, and

$$G^{\pi}_{\hat{T}}(s, a) := \mathbb{E}_{s' \sim \hat{T}}\big[V^{\pi}_{T}(s') \mid s, a\big] - \mathbb{E}_{s' \sim T}\big[V^{\pi}_{T}(s') \mid s, a\big].$$

For any policy π, we have that

$$\eta_{\hat{T}}(\pi) - \eta_{T}(\pi) = \kappa\, \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat{T}}}\big[G^{\pi}_{\hat{T}}(s, a)\big], \tag{6}$$

where κ = γ(1 − γ)⁻¹. For the sake of completeness, we provide a proof of Lemma 1, which is almost the same as in Yu et al. (2020); the only difference is that our occupancy measure is normalized.

Proof Let W_j be the expected return when executing π on T̂ for the first j steps, then switching to T for the remainder. That is, W_j = 𝔼_{a_t ∼ π, …}[…] …

… where λ > 0 is a constant coefficient. By further applying the Fenchel conjugate (Rockafellar, 1970) and the interchangeability principle as in Zhang et al. (2020a), optimizing the final objective of GradientDICE amounts to the minimax problem

$$\min_{\tau} \max_{f, \beta}\; L(\tau, \beta, f) = (1-\gamma)\,\mathbb{E}_{s \sim \nu_0, a \sim \pi}[f(s, a)] + \gamma\,\mathbb{E}_{(s,a) \sim \rho^{\pi_D}_{T},\, s' \sim \hat{T},\, a' \sim \pi}\big[\tau(s, a) f(s', a')\big] - \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_{T}}\Big[\tau(s, a) f(s, a) + \tfrac{1}{2} f(s, a)^2\Big] + \lambda\Big(\mathbb{E}_{(s,a) \sim \rho^{\pi_D}_{T}}\big[\beta \tau(s, a) - \beta\big] - \tfrac{1}{2}\beta^2\Big), \tag{20}$$

where τ : S × A → ℝ, f : S × A → ℝ, and β ∈ ℝ. In our practical implementation, we use feed-forward neural networks to model the functions f and τ. Since we use a model ensemble and the importance ratio is model-specific, we also construct a separate DICE network for each dynamics model.
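The downstream use of the estimated ratio τ(s, a) can be sketched as a per-sample re-weighting of the model loss; the function, the clip bounds standing in for [α₀, α₁], and the toy numbers are our own, not the paper's implementation:

```python
import numpy as np

def weighted_model_loss(per_sample_loss, ratio, clip_lo=0.1, clip_hi=10.0):
    """Average model loss with each real sample re-weighted by its estimated
    density ratio tau(s, a), clipped for stability."""
    w = np.clip(ratio, clip_lo, clip_hi)
    return float(np.mean(w * per_sample_loss))

losses = np.array([1.0, 1.0, 1.0])
ratios = np.array([0.01, 1.0, 100.0])   # extreme ratios are clipped away
val = weighted_model_loss(losses, ratios)
```

Real samples that the DICE network deems more represented in the simulated distribution thus contribute more to the model update, without any single estimated ratio dominating the gradient.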
Furthermore, because the simulated data is generated from some states in D_env, we use the collected real states to approximate the initial state distribution ν₀(s). After the DICE network estimates the ratio for each real sample, we clip it to the interval [α₀, α₁] with the intention of improving stability. We demonstrate the detailed process of IAMPO, built upon the MBPO (Janner et al., 2019) backbone, in Algorithm 2. Specifically, in each iteration, IAMPO first trains the DICE network (line 3), which is then used to estimate the importance ratio (line 4). As in Algorithm 1, the weighted sampling strategy (lines 12 to 13) will be introduced in Section 5.

Algorithm 2 IAMPO
1: Initialize policy π_φ, dynamics model T̂_θ, DICE network {τ, f, β}, environment buffer D_env, model buffer D_model
2: repeat
3:   Perform G₂ gradient steps to train the DICE network {τ, f, β} with data from D_env according to Eq. 20
4:   Use τ to estimate the ratio for each sample in D_env
5:   Perform G₁ gradient steps to train the model T̂_θ with the loss re-weighted by the ratio, using data in D_env
6:   for F model rollouts do
7:     Sample a state s uniformly from D_env
8:     Use the policy π_φ to perform a k-step rollout on the model T̂_θ starting from s; add to D_model
9:   end for
10:  for E timesteps in the real environment do
11:    Use the policy π_φ to take an action in the real environment; add the sample (s, a, s′, r) to D_env
12:    Use the model T̂_θ to estimate the uncertainty for each sample in D_model
13:    Perform G₃ gradient steps to train the policy π_φ with real data uniformly sampled from D_env and simulated data sampled from D_model according to the uncertainty
14:  end for
15: until a certain number of real samples has been collected

5. Weighted Sampling Strategy

Although the model adaptation methods proposed in the previous sections help to alleviate the distribution shift problem in model learning, some inaccurate predictions can still lead to catastrophic performance degradation due to the imperfect approximation. A straightforward solution is to discard the data with large model error when using them for policy optimization. In practice, it is non-trivial to estimate the model error; instead, we can use the model uncertainty, which is considered to be positively correlated with the model error, as a proxy. For example, Janner et al. (2019) used the model to generate short rollouts so that the longer rollouts with higher uncertainty are cut off. Pan et al. (2020) explicitly estimated the model uncertainty and masked the simulated data with high uncertainty. These methods directly force the occupancy measure of high-uncertainty simulated data to zero. However, this may be too restrictive, since there is no guarantee that high uncertainty means high model error, and the masked simulated data may still be useful if its ground-truth model error is small. To tackle the above challenges, we propose to utilize a weighted sampling strategy during the model usage phase for better generalization. Formally, as illustrated in Figure 1, when forming a mini-batch for policy training, the weighted sampling strategy chooses simulated data in the model buffer D_model according to a Boltzmann distribution based on the estimated uncertainty d(s, a). For a simulated sample x_i = (s_i, a_i, s′_i, r_i), the probability p(x_i) of being sampled into the mini-batch is

$$p(x_i) = \frac{\exp(-\sigma\, d(s_i, a_i))}{\sum_{x_j} \exp(-\sigma\, d(s_j, a_j))}, \tag{21}$$

where σ is a temperature parameter.
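A minimal sketch of the sampling rule in Eq. (21), together with the kind of ensemble-disagreement uncertainty it consumes (the concrete estimator used in the paper is described next; the function names, toy ensemble predictions and fixed σ here are our own):

```python
import numpy as np

def max_disagreement(preds):
    """Uncertainty proxy d(s, a): maximum pairwise L2 distance between the
    ensemble members' predictions for one (s, a). preds: (n_models, state_dim)."""
    n = len(preds)
    return max(np.linalg.norm(preds[i] - preds[j])
               for i in range(n) for j in range(i + 1, n))

def sampling_probs(uncertainties, sigma=1.0):
    """Boltzmann distribution of Eq. (21): lower uncertainty -> higher
    probability of entering the policy-training mini-batch."""
    logits = -sigma * np.asarray(uncertainties, dtype=float)
    logits -= logits.max()               # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Three simulated samples whose two-member ensembles disagree increasingly.
preds_per_sample = [np.array([[0.0], [0.1]]),
                    np.array([[0.0], [1.0]]),
                    np.array([[0.0], [5.0]])]
d = [max_disagreement(p) for p in preds_per_sample]
p = sampling_probs(d, sigma=1.0)         # p[0] > p[1] > p[2]
```

A mini-batch of indices could then be drawn with `np.random.default_rng().choice(len(p), size=batch_size, p=p)`; unlike hard masking, every sample keeps a non-zero probability.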
There are multiple ways to estimate the uncertainty, such as the maximum standard deviation of the learned models in the ensemble (Yu et al., 2020) or the prediction disagreement between one model and the rest of the models (Pan et al., 2020). In this paper, we follow the uncertainty estimation method in Kidambi et al. (2020) and use the maximum prediction discrepancy

$$d(s, a) = \max_{i,j} \big\| \hat{T}_{\theta_i}(s, a) - \hat{T}_{\theta_j}(s, a) \big\|_2,$$

where T̂_{θ_i} and T̂_{θ_j} are members of the model ensemble {T̂_{θ₁}, T̂_{θ₂}, …}. Besides, the sampling temperature σ is set as min{σ₀, σ₁/(d_max − d_min)} in practice, where σ₀ and σ₁ are hyperparameters, and d_max and d_min are the maximum and minimum uncertainty estimated in the model buffer, respectively. Alternatively, one can view the weighted sampling strategy as implicitly optimizing the second term of Eq. 14: that term, in its derivative form, suggests minimizing log ρ^π_{T̂_θ} weighted by d_F(T, T̂), and the weighted sampling strategy reduces the occupancy measure ρ^π_{T̂_θ} of simulated data whose uncertainty is high, which is positively correlated with the model error d_F(T, T̂).

6. Experiments

The experiments aim to answer the following three questions: 1) How do FAMPO and IAMPO perform compared to prior model-free and model-based methods? 2) Do the feature-based and instance-based adaptation methods address the distribution shift problem in MBRL as we expected? 3) What are the key ingredients of our algorithms?

6.1 Comparative Evaluation

Compared Methods. We compare the proposed FAMPO and IAMPO to other model-free and model-based algorithms: Soft Actor-Critic (SAC) (Haarnoja et al., 2018), the state-of-the-art model-free off-policy algorithm in terms of sample efficiency and asymptotic performance.
For model-based methods, we compare to MBPO (Janner et al., 2019), PETS (Chua et al., 2018) and SLBO (Luo et al., 2018), where PETS directly uses the model for planning without explicit policy learning, and SLBO trains the model with a multi-step L2-norm loss and updates the policy using TRPO (Schulman et al., 2015).

Environments. We evaluate our methods and the other baselines on six MuJoCo continuous control tasks from OpenAI Gym (Brockman et al., 2016) with a maximum horizon of 1000: InvertedPendulum, Swimmer, Hopper, Walker2d, Ant and HalfCheetah. For the Swimmer environment, we use the modified version introduced by Langlois et al. (2019), since the original version is quite difficult to solve. For the other five environments, we adopt the same settings as in Janner et al. (2019).

Implementation Details. We implement all our experiments using TensorFlow. For MBPO, FAMPO and IAMPO, we first apply a random policy to collect a certain number of real samples to pre-train the dynamics model. Every time we train the dynamics model, we randomly sample some real data as a validation set and stop the model training if the model loss does not decrease for five gradient steps, which means we do not choose a specific value for the hyperparameter G₁. Other important hyperparameters used in our methods are chosen by grid search, and the detailed hyperparameter settings can be found in Appendix E.

Results. The learning curves of all compared methods are presented in Figure 2. From the comparison, we observe that FAMPO and IAMPO are more sample efficient, as they learn faster than all other baselines in all six environments. Furthermore, our methods are capable of reaching asymptotic performance comparable to the state-of-the-art model-free baseline SAC. Compared with MBPO, our approaches achieve better performance in all the environments, which verifies the value of model adaptation.
This also indicates that even in the situation of reduced distribution shift from using short rollouts, model adaptation still helps. By comparing FAMPO and IAMPO, we can see that these two variants are comparably effective across different environments. Further analysis of the feature-based and instance-based adaptation is provided in Section 6.2.

Figure 2: Performance curves of our methods and other baselines on six MuJoCo tasks (InvertedPendulum, Swimmer, Hopper, Walker2d, Ant and HalfCheetah; each panel plots average return against environment steps). The results are averaged over eight random seeds, where solid curves depict the mean of the eight trials and shaded areas indicate the standard deviation. For each trial, the average return over ten episodes in the real environment is evaluated every 1000 environment timesteps. For MBPO, FAMPO and IAMPO, the policy is not updated during the dynamics model pre-training stage, so the performance is not plotted at the beginning.

6.2 Empirical Analysis

Distribution shift. To investigate how our proposed methods help mitigate the distribution shift problem in MBRL, we first visualize the real data and the simulated data in Figure 3(a). In particular, after training IAMPO for 50 epochs on Hopper, we randomly sample 500 real (s, a) pairs from D_env and 2000 simulated (s, a) pairs from D_model. Then, we plot the normalized t-SNE visualization of the state-action pairs, with the simulated data colored in blue and the real data colored differently according to the importance ratio estimated by IAMPO. It can be seen that the distribution of the simulated data deviates from that of the real data, which may lead to poor model predictions in these areas.
Moreover, if the real data is densely distributed and the simulated data is sparse, i.e., the ground-truth density ratio ρ^π_{T̂_θ}(s, a) / ρ^{π_D}_T(s, a) is small, the estimated ratio is also small (the black points in the figure), and vice versa. This visualization implies that the importance ratio of IAMPO is estimated as we expect.

Figure 3: Empirical results on the Hopper environment. (a) t-SNE visualization of state-action pairs randomly sampled from D_env and D_model; simulated data is drawn in blue while real data is drawn in different colors, where brighter points represent the real samples with larger importance ratios. (b) The Wasserstein-1 distance between the feature distributions. (c) Average compounding errors with different rollout lengths. (d) Relationship between the estimated uncertainty and the ground-truth error of the real samples; the orange points are the half of the data with high uncertainty, and the blue ones are the half with low uncertainty.

Wasserstein-1 Distance. Instead of re-weighting the real samples as shown in Figure 3(a), FAMPO explicitly aligns the feature distributions of the real samples and the simulated ones. To further verify the effect of feature-based model adaptation, we visualize the estimated Wasserstein-1 distance between the real features and the simulated ones. Besides MBPO and FAMPO, we additionally analyze the multi-step training loss of SLBO, since it also utilizes the model output as the input of model training, which may help learn invariant features.
According to the results shown in Figure 3(b), we find that: i) the vanilla model training in MBPO itself can slowly minimize the Wasserstein-1 distance between feature distributions; ii) the multi-step training loss in SLBO does help learn invariant features, but the improvement is limited; iii) the model adaptation loss in FAMPO is effective in promoting feature distribution alignment, which is consistent with our initial motivation.

Figure 4: The results of the ablation studies conducted on three environments with few interactions (40K steps for Hopper, 100K steps for Walker2d and Ant). The bars are average returns over five trials, and the black error lines indicate the standard deviation.

Compounding Errors. We also investigate whether model adaptation helps alleviate the compounding model errors (Nagabandi et al., 2018) of multi-step forward predictions, which are largely caused by the distribution shift problem. Concretely, we use the current policy to sample a trajectory (s₀, a₀, s₁, …, a_{h−1}, s_h) of length h in the real environment, and use the learned dynamics model to generate the corresponding simulated rollout (ŝ₀, a₀, ŝ₁, …, a_{h−1}, ŝ_h), where ŝ₀ = s₀ and ŝ_{i+1} = T̂_θ(ŝ_i, a_i). The empirical compounding error is then calculated as

$$\epsilon_h = \frac{1}{h} \sum_{i=1}^{h} \|\hat{s}_i - s_i\|_2.$$

We conduct experiments with different trajectory lengths h and plot the results in Figure 3(c). We find that both FAMPO and IAMPO achieve smaller compounding errors than MBPO, which meets our motivation that model adaptation can successfully mitigate the distribution shift.

Uncertainty. For our weighted sampling strategy, one critical question is whether the estimated uncertainty matches the real state prediction error (e.g., the IPM in the second term of Eq. 14).
However, since it is hard to obtain ground-truth labels for the simulated data due to the complexity of the MuJoCo simulator, we instead evaluate the uncertainty quantification using sampled real data (s, a, s′) ∈ D_env. To be specific, we use the dynamics model to predict the next state ŝ′ of newly collected real samples, on which the model hasn't been trained. Then we calculate the ground-truth error ‖ŝ′ − s′‖₂ and estimate the uncertainty d(s, a). Figure 3(d) shows the relationship between the estimated uncertainty and the ground-truth error, from which we see that in general there is a positive correlation between the two, justifying the incorporation of the uncertainty in the algorithm design. On the other hand, we further find that directly masking the half of the data with the highest uncertainty would also mask data with low ground-truth error, revealing the disadvantage of the hard-masking mechanism.

6.3 Ablation Studies

In this section, we aim to investigate the key ingredients of our algorithms through ablation studies. In particular, we compare the corresponding variants: (i) FAMPO w/o S, which only adopts feature-based model adaptation without incorporating the weighted sampling strategy; (ii) IAMPO w/o S, which only adopts instance-based model adaptation without incorporating the weighted sampling strategy; (iii) MBPO w/ S, which only adopts the weighted sampling strategy without incorporating model adaptation.

Figure 5: The results of the hyperparameter studies with respect to the adaptation iteration G₂ in FAMPO, the DICE training iteration G₁, and the temperature of the probabilistic sampling σ₀ in IAMPO. The experiments are conducted on Hopper with 40K steps and Walker2d with 100K steps.
We conduct the ablation experiments on three environments, and the results are shown in Figure 4. From the comparison, we observe that: 1) Both the feature-based and the instance-based model adaptation are quite effective, since in all three environments FAMPO w/o S and IAMPO w/o S improve the performance by a considerable margin compared to MBPO. 2) The effectiveness of the weighted sampling strategy varies across the three environments. To be more specific, the weighted sampling strategy shows its effectiveness on Hopper and Walker2d with different degrees of performance improvement, but it does not improve much on Ant. The reason may be that in complex environments it is difficult to estimate the uncertainty well enough to achieve a positive correlation with the ground-truth model error.

6.4 Hyperparameter Studies

In this section, we study the sensitivity of our methods to important hyperparameters. Specifically, for model adaptation, we investigate the sensitivity of FAMPO to the adaptation iteration G₂ and of IAMPO to the DICE training iteration G₁, while for the weighted sampling strategy, we assess the importance of the probabilistic sampling temperature σ. We conduct experiments with different values of these hyperparameters and plot the results in Figure 5. According to the results, we observe that increasing G₂ yields better performance up to a certain level, while a too large G₂ degrades the performance, which means we need to control the trade-off between model training and model adaptation to ensure the feature representations remain both invariant and discriminative. Similar trends can be observed for G₁ and σ, but in most cases the performance is still better than MBPO. To explain, a too small G₁ cannot train the DICE networks sufficiently, while a too large G₁ may cause overfitting.
And with a relatively small σ, the algorithm essentially adopts uniform sampling in model usage, while using too large a σ greatly degrades the performance, since focusing only on low-uncertainty simulated data may reduce the data diversity needed to obtain a good policy.

7. Related Work

Recently, model-based reinforcement learning has attracted increasing attention, mainly due to its high sample efficiency obtained by learning a model of the environment dynamics and using the model for policy optimization. In the MBRL literature, there are two stages, i.e., model learning and model usage. Model learning mainly involves two aspects: (1) the choice of function approximator, such as Gaussian processes (Deisenroth and Rasmussen, 2011), time-varying linear models (Sutton et al., 2012; Levine et al., 2016) and neural networks (Nagabandi et al., 2018); and (2) objective design, such as the multi-step L2-norm (Luo et al., 2018), the log-likelihood loss (Chua et al., 2018) and adversarial losses (Wu et al., 2019). In this regard, Chua et al. (2018) have demonstrated the effectiveness of using an ensemble of probabilistic models with the log-likelihood loss in reducing potential overfitting. Inspired by their success, we also use this model architecture in the design of our methods. For model usage, MBRL methods can be roughly categorized into four groups according to their specific model usage strategies (Lai et al., 2020; Zhu et al., 2020). The first group consists of analytic-gradient algorithms that use model derivatives to search for the policy by back-propagation (Deisenroth and Rasmussen, 2011; Heess et al., 2015). The second group includes shooting algorithms that directly plan forward by model predictive control (MPC) without an explicit policy (Nagabandi et al., 2018; Chua et al., 2018).
The third group mainly contains model-augmented value expansion algorithms that use model-based rollouts to improve targets for model-free temporal difference (TD) updates (Buckman et al., 2018; Feinberg et al., 2018). The last group consists of Dyna-style methods that use the learned model to generate simulated data to augment the real data for model-free policy training (Sutton, 1990; Luo et al., 2018). Under this taxonomy, our approaches are Dyna-style methods inspired by the recent MBPO (Janner et al., 2019) algorithm. One major challenge of Dyna-style MBRL in model usage is the distribution shift between the simulated and the real data, and this error is only exacerbated when the learned model is used to make multi-step predictions, because of error compounding. To this end, various solutions have been proposed to mitigate the distribution shift, which mainly fall into two types: the first aims at improving model learning, while the second proposes to use the model more conservatively. Our algorithms take ideas from both lines of work and make novel contributions to both. To facilitate model learning, Asadi et al. (2019) proposed to build multi-step models that directly predict the outcome of an action sequence. Furthermore, model-predicted outputs can be used to construct model training samples in addition to the real samples (Talvitie, 2017), so that the model generalizes to its own output distribution. Asadi et al. (2018) also proposed to add a Lipschitz continuity constraint on the model and provided a bound on the multi-step prediction error accordingly. As a comparison, FAMPO leverages feature-based domain adaptation and proposes a model adaptation strategy to learn a feature space where real data and simulated data are close in distribution. The design principle of IAMPO is heavily inspired by domain adaptation as well.
However, instead of feature-based approaches, IAMPO is more related to instance-based methods for adaptation, which often involve estimating an importance ratio to correct for the potential domain shift. Other works constrain the model usage so as not to explore regions with disastrous errors. For example, MBPO (Janner et al., 2019) uses the model to generate short rollouts to avoid the large departure from the real distribution incurred by long rollouts. In a similar vein, STEVE (Buckman et al., 2018) interpolates between rollouts of various lengths according to the estimated model uncertainty. Recently, Mu et al. (2021) trained an evaluator to assess the reliability of imagined trajectories, and M2AC (Pan et al., 2020) proposed a masking mechanism that, in the usage phase, masks the simulated data whose estimated uncertainty is high. Our model usage component also exploits model uncertainty to reduce the possibility of selecting unreliable simulated data for policy optimization. On the theoretical side, previous works on MBRL mostly focused on either tabular MDPs or linear dynamics (Szita and Szepesvári, 2010; Jaksch et al., 2010; Dean et al., 2019; Simchowitz et al., 2018), but not much has been done in the context of continuous state spaces and non-linear systems. Recently, Luo et al. (2018) gave a theoretical guarantee of monotonic improvement by introducing a reference policy and imposing constraints on policy optimization and model learning related to the reference policy. Janner et al. (2019) then derived a lower bound focusing on branched short rollouts, but their algorithm was a heuristic instead of being designed to maximize the lower bound. In contrast, our algorithms are naturally inspired by our theoretical results and directly minimize a derived upper bound on the return discrepancy without enforcing any constraint on the policy.

8.
Conclusion

In this paper, we investigate how to explicitly tackle the distribution shift problem in MBRL. We first provide an upper bound on the return discrepancy to justify the necessity of model adaptation for correcting the potential distribution bias in MBRL. With this insight, we then propose to incorporate model adaptation into model learning from both the feature and the instance perspective. In this way, the model gives more accurate predictions when generating simulated data, and therefore the follow-up policy optimization performance can be improved. Besides model adaptation, we additionally propose the weighted sampling strategy to further reduce the impact of inaccurate model predictions. Extensive experiments on continuous control tasks have shown the effectiveness of our work. We believe our work takes an important step towards more sample-efficient MBRL.

Acknowledgments

The Shanghai Jiao Tong University team is supported by Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and National Natural Science Foundation of China (62076161). The author Hang Lai is supported by the Wu Wen Jun Honorary Doctoral Scholarship, AI Institute, Shanghai Jiao Tong University. The work of Han Zhao was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Cooperative Agreement Number HR00112320012, a Facebook Research Award, and Amazon AWS Cloud Credit.

Appendix A. A Different View of Analysis

In this appendix, we provide an alternative perspective on the derivation of the upper bound on the expected return.

Lemma 3 Define the normalized state visit distribution as
$$
\nu^{\pi}_T(s) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^t P^{\pi}_{T,t}(s), \quad \text{where } P^{\pi}_{T,t}(s) = P(s_t = s \mid \pi, T).
$$
Assume the initial state distributions of the real dynamics $T$ and the dynamics model $\hat T$ are the same. For any state $s'$, assume there exists a witness function class $\mathcal{F}_{s'} = \{f : \mathcal{S} \times \mathcal{A} \to \mathbb{R}\}$ such that $\hat T(s' \mid \cdot, \cdot) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is in $\mathcal{F}_{s'}$.
Then the following holds:
$$
\big| \nu^{\pi_D}_T(s') - \nu^{\pi}_{\hat T}(s') \big| \le \gamma\, d_{\mathcal{F}_{s'}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big) + \gamma\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \big| T(s' \mid s, a) - \hat T(s' \mid s, a) \big|. \tag{22}
$$

Proof For the state visit distribution $\nu^{\pi}_{\hat T}(s')$, we have
$$
\nu^{\pi}_{\hat T}(s') = (1-\gamma)\, \nu_0(s') + \gamma \int \rho^{\pi}_{\hat T}(s, a)\, \hat T(s' \mid s, a) \, \mathrm{d}s \, \mathrm{d}a, \tag{23}
$$
where $\nu_0$ denotes the probability of the initial state being the state $s'$. Then we have
$$
\begin{aligned}
\big| \nu^{\pi_D}_T(s') - \nu^{\pi}_{\hat T}(s') \big|
&= \gamma \left| \int T(s' \mid s, a)\, \rho^{\pi_D}_T(s, a) - \hat T(s' \mid s, a)\, \rho^{\pi}_{\hat T}(s, a) \, \mathrm{d}s \, \mathrm{d}a \right| \\
&= \gamma \left| \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T}[T(s' \mid s, a)] - \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat T}}[\hat T(s' \mid s, a)] \right| \\
&\le \gamma \left| \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T}\big[T(s' \mid s, a) - \hat T(s' \mid s, a)\big] \right| + \gamma \left| \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T}[\hat T(s' \mid s, a)] - \mathbb{E}_{(s,a) \sim \rho^{\pi}_{\hat T}}[\hat T(s' \mid s, a)] \right| \\
&\le \gamma\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \big| T(s' \mid s, a) - \hat T(s' \mid s, a) \big| + \gamma\, d_{\mathcal{F}_{s'}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big), \tag{24}
\end{aligned}
$$
which completes the proof.

Lemma 3 states that the discrepancy between the two state visit distributions at each state is bounded by the dynamics model error in predicting this state and the discrepancy between the two state-action occupancy measures. Intuitively, it means that when both the input state-action distributions and the conditional dynamics distributions are close, then the output state distributions will be close as well.

Theorem 4 Let $\mathcal{F} := \bigcup_{s' \in \mathcal{S}} \mathcal{F}_{s'}$ and define $\epsilon_\pi := 2\, d_{\mathrm{TV}}(\nu^{\pi}_T, \nu^{\pi_D}_T)$. Under the assumptions of Lemma 3, the expected return $\eta_T(\pi)$ admits the following bound:
$$
\big| \eta_{\hat T}(\pi) - \eta_T(\pi) \big| \le r_{\max}\, \epsilon_\pi + \gamma\, r_{\max}\, d_{\mathcal{F}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big)\, \mathrm{Vol}(\mathcal{S}) + \gamma\, r_{\max}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \sqrt{2 D_{\mathrm{KL}}\big(T(\cdot \mid s, a) \,\|\, \hat T(\cdot \mid s, a)\big)}, \tag{25}
$$
where $\mathrm{Vol}(\mathcal{S})$ is the volume of the state space $\mathcal{S}$.
Proof The return discrepancy is bounded as follows:
$$
\begin{aligned}
\big| \eta_{\hat T}(\pi) - \eta_T(\pi) \big|
&= \left| \int \big( \rho^{\pi}_T(s, a) - \rho^{\pi}_{\hat T}(s, a) \big)\, r(s, a) \, \mathrm{d}s \, \mathrm{d}a \right| \\
&= \left| \int \big( \nu^{\pi}_T(s)\, \pi(a \mid s) - \nu^{\pi}_{\hat T}(s)\, \pi(a \mid s) \big)\, r(s, a) \, \mathrm{d}s \, \mathrm{d}a \right| \\
&\le r_{\max} \int \big| \nu^{\pi}_T(s)\, \pi(a \mid s) - \nu^{\pi}_{\hat T}(s)\, \pi(a \mid s) \big| \, \mathrm{d}s \, \mathrm{d}a \\
&= r_{\max} \int \big| \nu^{\pi}_T(s) - \nu^{\pi}_{\hat T}(s) \big| \, \mathrm{d}s \\
&\le r_{\max} \int \big| \nu^{\pi_D}_T(s) - \nu^{\pi}_{\hat T}(s) \big| + \big| \nu^{\pi}_T(s) - \nu^{\pi_D}_T(s) \big| \, \mathrm{d}s \\
&\le r_{\max} \int \big| \nu^{\pi_D}_T(s) - \nu^{\pi}_{\hat T}(s) \big| \, \mathrm{d}s + r_{\max}\, \epsilon_\pi.
\end{aligned}
$$
Replacing the state $s$ above with the notation $s'$, then according to Lemma 3, we have
$$
\begin{aligned}
\big| \eta_{\hat T}(\pi) - \eta_T(\pi) \big|
&\le r_{\max}\, \epsilon_\pi + \gamma\, r_{\max} \int_{s'} d_{\mathcal{F}_{s'}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big) \, \mathrm{d}s' + \gamma\, r_{\max}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \int \big| T(s' \mid s, a) - \hat T(s' \mid s, a) \big| \, \mathrm{d}s' \\
&\le r_{\max}\, \epsilon_\pi + \gamma\, r_{\max}\, d_{\mathcal{F}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big)\, \mathrm{Vol}(\mathcal{S}) + \gamma\, r_{\max}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \int \big| T(s' \mid s, a) - \hat T(s' \mid s, a) \big| \, \mathrm{d}s' \\
&= r_{\max}\, \epsilon_\pi + \gamma\, r_{\max}\, d_{\mathcal{F}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big)\, \mathrm{Vol}(\mathcal{S}) + 2\gamma\, r_{\max}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T}\, d_{\mathrm{TV}}\big(T(\cdot \mid s, a), \hat T(\cdot \mid s, a)\big) \\
&\le r_{\max}\, \epsilon_\pi + \gamma\, r_{\max}\, d_{\mathcal{F}}\big(\rho^{\pi_D}_T, \rho^{\pi}_{\hat T}\big)\, \mathrm{Vol}(\mathcal{S}) + \gamma\, r_{\max}\, \mathbb{E}_{(s,a) \sim \rho^{\pi_D}_T} \sqrt{2 D_{\mathrm{KL}}\big(T(\cdot \mid s, a) \,\|\, \hat T(\cdot \mid s, a)\big)}, \tag{28}
\end{aligned}
$$
where the last inequality holds due to Pinsker's inequality, which completes the proof.

Theorem 4 gives another upper bound on the discrepancy between the return in the model and the return in the real environment. In this bound, the first term denotes the divergence between the state visit distributions induced by the policy π and the behavioral policy π_D in the environment, which is an important objective in batch reinforcement learning (Fujimoto et al., 2019) for reliable exploitation of off-policy samples. The second term corresponds to the distribution shift problem, and the last term corresponds to the model estimation error on real data. Comparing this bound to the one in Theorem 2, the bound in Theorem 2 might seem tighter since there is no extra $\epsilon_\pi$ term. However, we should notice that the assumptions made in Theorem 2 are stronger: there we assume $G^{\pi}_{\hat T}$ satisfies the corresponding constraint, while here we only assume the model $\hat T$ satisfies the constraint, which is easier to hold.

Appendix B.
Interchange Justification

In this appendix, we provide a proof of the interchangeability of the gradient and the integration in Equation 14, which is guaranteed by the Leibniz integral rule (Protter et al., 1985; Talvila, 2001). Formally,

Lemma 5 (Leibniz Integral Rule) Let $f(x, t)$ be a function such that $f(x, t)$ and its partial derivative $\partial f(x, t) / \partial x$ are continuous in $t$ and $x$ in some region of the $xt$-plane including $a \le t \le b$ and $x_0 \le x \le x_1$. Then for $x_0 \le x \le x_1$,
$$
\frac{\mathrm{d}}{\mathrm{d}x} \int_a^b f(x, t) \, \mathrm{d}t = \int_a^b \frac{\partial}{\partial x} f(x, t) \, \mathrm{d}t. \tag{29}
$$

Let $f(\theta, (s, a)) = \rho^{\pi}_{\hat T_\theta}(s, a)\, d_{\mathcal{F}}\big(T(\cdot \mid s, a), \hat T_\theta(\cdot \mid s, a)\big)$. According to Lemma 5, we just need to prove that (i) $f$ is continuous, and (ii) the partial derivative $\partial f / \partial \theta$ is continuous in the state-action space. Note that $\pi$, $\hat T_\theta$ and $\partial \hat T_\theta / \partial \theta$ are continuous since they are constructed using neural networks. We assume that the ground-truth dynamics $T$ and $V^{\pi}_T \in \mathcal{F}$ are continuous and that the initial state $s_0$ is sampled from a continuous distribution.

Proof To prove (i), note that $f$ is continuous if $\rho^{\pi}_{\hat T_\theta}$ and $d_{\mathcal{F}}(T, \hat T_\theta)$ are both continuous. We first consider
$$
\rho^{\pi}_{\hat T_\theta}(s, a) = (1-\gamma)\, \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t P^{\pi}_{\hat T, t}(s).
$$
For $t = 0$, $P^{\pi}_{\hat T, 0}(s) = P(s_0 = s)$ is continuous since $s_0$ is sampled from a continuous distribution. For $t \ge 1$,
$$
P^{\pi}_{\hat T, t}(s) = \int P^{\pi}_{\hat T, t-1}(s_{t-1})\, \pi(a \mid s_{t-1})\, \hat T_\theta(s \mid s_{t-1}, a) \, \mathrm{d}s_{t-1} \, \mathrm{d}a
$$
is continuous if $P^{\pi}_{\hat T, t-1}(s)$ is continuous, since $\pi$ and $\hat T_\theta$ are continuous. Then by induction, $P^{\pi}_{\hat T, t}(s)$ is continuous for all $t \ge 0$. Therefore, $\rho^{\pi}_{\hat T_\theta}(s, a)$ is continuous as a discounted sum over $P^{\pi}_{\hat T, t}(s)\, \pi(a \mid s)$. We then consider $d_{\mathcal{F}}\big(T(\cdot \mid s, a), \hat T_\theta(\cdot \mid s, a)\big)$:
$$
d_{\mathcal{F}}\big(T(\cdot \mid s, a), \hat T_\theta(\cdot \mid s, a)\big) = \sup_{f_1 \in \mathcal{F}} \Big| \mathbb{E}_{s' \sim T}\big[f_1(s') \mid s, a\big] - \mathbb{E}_{s' \sim \hat T}\big[f_1(s') \mid s, a\big] \Big| = \sup_{f_1 \in \mathcal{F}} \left| \int_{s'} \big( T(s' \mid s, a) - \hat T_\theta(s' \mid s, a) \big)\, f_1(s') \, \mathrm{d}s' \right|. \tag{30}
$$
Note that the supremum of continuous functions is still continuous, thus $d_{\mathcal{F}}\big(T(\cdot \mid s, a), \hat T_\theta(\cdot \mid s, a)\big)$ is continuous under the assumption that $T$ and $\hat T_\theta$ are continuous. So far, we have proven (i). To prove (ii), note that
$$
\frac{\partial f}{\partial \theta} = \frac{\partial \rho^{\pi}_{\hat T_\theta}}{\partial \theta}\, d_{\mathcal{F}}(T, \hat T_\theta) + \rho^{\pi}_{\hat T_\theta}\, \frac{\partial d_{\mathcal{F}}(T, \hat T_\theta)}{\partial \theta}.
$$
We have proven that $d_{\mathcal{F}}(T, \hat T_\theta)$ and $\rho^{\pi}_{\hat T_\theta}$ are continuous, so $\partial f / \partial \theta$ is continuous if $\partial \rho^{\pi}_{\hat T_\theta} / \partial \theta$ and $\partial d_{\mathcal{F}}(T, \hat T_\theta) / \partial \theta$ are continuous. First,
$$
\begin{aligned}
\frac{\partial \rho^{\pi}_{\hat T_\theta}}{\partial \theta}
&= (1-\gamma)\, \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t \frac{\partial P^{\pi}_{\hat T, t}(s)}{\partial \theta}, \\
\frac{\partial P^{\pi}_{\hat T, t}(s)}{\partial \theta}
&= \frac{\partial}{\partial \theta} \int P^{\pi}_{\hat T, t-1}(s_{t-1})\, \pi(a \mid s_{t-1})\, \hat T_\theta(s \mid s_{t-1}, a) \, \mathrm{d}s_{t-1} \, \mathrm{d}a \\
&= \int \pi(a \mid s_{t-1}) \left[ \frac{\partial P^{\pi}_{\hat T, t-1}(s_{t-1})}{\partial \theta}\, \hat T_\theta(s \mid s_{t-1}, a) + P^{\pi}_{\hat T, t-1}(s_{t-1})\, \frac{\partial \hat T_\theta(s \mid s_{t-1}, a)}{\partial \theta} \right] \mathrm{d}s_{t-1} \, \mathrm{d}a, \tag{34}
\end{aligned}
$$
where the interchange of the integration and the gradient in the last equation can be proven via Lemma 5. Similarly, by induction, $\partial P^{\pi}_{\hat T, t}(s) / \partial \theta$ is continuous for all $t \ge 0$. Note that the integration of a continuous function is continuous. Therefore, $\partial \rho^{\pi}_{\hat T_\theta} / \partial \theta$ is continuous since $\partial P^{\pi}_{\hat T, t-1}(s_{t-1}) / \partial \theta$ and $\partial \hat T_\theta / \partial \theta$ are continuous. Secondly,
$$
\begin{aligned}
\frac{\partial d_{\mathcal{F}}(T, \hat T_\theta)}{\partial \theta}
&= \frac{\partial}{\partial \theta} \sup_{f_1 \in \mathcal{F}} \left| \int_{s'} \big( T(s' \mid s, a) - \hat T_\theta(s' \mid s, a) \big)\, f_1(s') \, \mathrm{d}s' \right| \\
&= \frac{\partial}{\partial \theta} \int_{s'} \big( T(s' \mid s, a) - \hat T_\theta(s' \mid s, a) \big)\, f_1^*(s') \, \mathrm{d}s' \\
&= -\int_{s'} f_1^*(s')\, \frac{\partial \hat T_\theta(s' \mid s, a)}{\partial \theta} \, \mathrm{d}s', \tag{37}
\end{aligned}
$$
where $f_1^* = \arg\max_{f_1 \in \mathcal{F}} \int_{s'} \big( T(s' \mid s, a) - \hat T_\theta(s' \mid s, a) \big)\, f_1(s') \, \mathrm{d}s'$. Here we can directly take the supremum since $\mathcal{F} := \big\{ f : \|f\|_\infty \le \frac{r_{\max}}{1-\gamma} \big\}$ is closed. And the supremum is almost surely unique, since the maximizer simply takes $f_1^*(s') = \frac{r_{\max}}{1-\gamma}$ wherever $T(s' \mid s, a) - \hat T_\theta(s' \mid s, a)$ is positive, and $f_1^*(s') = -\frac{r_{\max}}{1-\gamma}$ otherwise. The interchange of the integration and the gradient in the last step can be proven via Lemma 5 again. Similarly, as $\partial \hat T_\theta / \partial \theta$ is continuous, $\partial d_{\mathcal{F}}(T, \hat T_\theta) / \partial \theta$ is continuous too, which completes the proof of (ii). Combining (i) and (ii) completes the proof.

Appendix C. Design Evaluation

In this appendix, we evaluate the design choices for each part of our algorithm. More specifically, we compare different choices of feature alignment metrics, importance ratio estimation methods, and sampling strategy schemes.

C.1 MMD Variant of FAMPO

Besides the Wasserstein-1 distance, we can use other distribution divergence metrics to align the features.
MMD is another instance of an IPM, obtained when the witness function class is the unit ball in a reproducing kernel Hilbert space (RKHS). Let $k$ be the kernel of the RKHS $\mathcal{H}_k$ of functions on $\mathcal{X}$. Then the squared MMD in $\mathcal{H}_k$ between two feature distributions $P_{h_e}$ and $P_{h_m}$ is (Gretton et al., 2012):
$$
\mathrm{MMD}^2_k(P_{h_e}, P_{h_m}) := \mathbb{E}_{h_e, h_e'}\big[k(h_e, h_e')\big] + \mathbb{E}_{h_m, h_m'}\big[k(h_m, h_m')\big] - 2\, \mathbb{E}_{h_e, h_m}\big[k(h_e, h_m)\big], \tag{38}
$$
which is a non-parametric measure based on kernel mappings. In practice, given finite feature samples $\{h_e^1, \ldots, h_e^{N_e}\} \sim P_{h_e}$ and $\{h_m^1, \ldots, h_m^{N_m}\} \sim P_{h_m}$, where $N_e$ and $N_m$ are the numbers of real and simulated samples, one unbiased estimator of $\mathrm{MMD}^2_k(P_{h_e}, P_{h_m})$ can be written as follows:
$$
\mathcal{L}_{\mathrm{MMD}}(\theta_g) = \frac{1}{N_e (N_e - 1)} \sum_{i \ne i'} k(h_e^i, h_e^{i'}) + \frac{1}{N_m (N_m - 1)} \sum_{j \ne j'} k(h_m^j, h_m^{j'}) - \frac{2}{N_e N_m} \sum_{i=1}^{N_e} \sum_{j=1}^{N_m} k(h_e^i, h_m^j). \tag{39}
$$
To achieve model adaptation through MMD, we optimize the feature extractor to minimize the above adaptation loss $\mathcal{L}_{\mathrm{MMD}}$ with real $(s, a)$ data and simulated data as input. When implementing the MMD variant, choosing optimal kernels remains an open problem, and we use a linear combination of eight RBF kernels with bandwidths $\{0.001, 0.005, 0.01, 0.05, 0.1, 1, 5, 10\}$.

Figure 6: Learning curves of FAMPO using different metrics (Wasserstein-1 distance and MMD).

The results are shown in Figure 6. Note that we only compare different metrics for feature alignment here and do not incorporate the weighted sampling strategy. We can observe that using MMD as the distribution divergence measure is also effective in the FAMPO framework, being only slightly worse than using the Wasserstein-1 distance on Hopper.

C.2 Classification Variant of IAMPO

We compare two methods to estimate the importance ratio for IAMPO: the DICE network and binary classification, as introduced in Section 4.2. The results are shown in Figure 7.
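As a concrete illustration of the estimator in Equation (39), the following NumPy sketch computes the unbiased squared MMD with a single RBF kernel (our experiments combine eight such kernels; the function and variable names here are illustrative, not taken from our codebase):

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    """Gram matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 * bandwidth^2))."""
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_unbiased(h_e, h_m, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between real features h_e (N_e, d)
    and simulated features h_m (N_m, d), following Equation (39)."""
    n_e, n_m = len(h_e), len(h_m)
    k_ee = rbf_kernel(h_e, h_e, bandwidth)
    k_mm = rbf_kernel(h_m, h_m, bandwidth)
    k_em = rbf_kernel(h_e, h_m, bandwidth)
    # Exclude the i == i' and j == j' diagonal terms for unbiasedness.
    term_e = (k_ee.sum() - np.trace(k_ee)) / (n_e * (n_e - 1))
    term_m = (k_mm.sum() - np.trace(k_mm)) / (n_m * (n_m - 1))
    return term_e + term_m - 2.0 * k_em.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(200, 3)), rng.normal(size=(200, 3)))
shifted = mmd2_unbiased(rng.normal(size=(200, 3)),
                        rng.normal(loc=2.0, size=(200, 3)))
# `shifted` is much larger than `same`, which fluctuates around zero.
```

In the MMD variant, this loss is minimized with respect to the feature extractor parameters, with real and simulated (s, a) features as the two samples.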
Similarly, here we do not use the weighted sampling strategy. While IAMPO with binary classification achieves better performance than MBPO, it is not as effective as IAMPO with DICE, mainly due to the imbalance between the amounts of real and simulated data.

Figure 7: Learning curves of IAMPO with DICE network (IAMPO-DICE) and IAMPO with classification (IAMPO-CLAS).

C.3 Masking Scheme of Sampling Strategy

We also use the same uncertainty estimation method as in the weighted sampling strategy to implement the masking scheme (Pan et al., 2020). To be more specific, after estimating the uncertainty, we directly mask the half of the simulated data in Dmodel with the highest uncertainty and use the remaining data to train the policy. In practice, we ensure that after masking, the size of Dmodel is still the same as in MBPO by increasing the rollout batch size. The preliminary results on Hopper and Walker2d are shown in Figure 8. We find that the masking scheme cannot achieve better performance than MBPO. The reason may be that after masking half of the simulated data, only the samples very close to the real data are retained to train the policy, as analyzed in Section 5. This observation further verifies the effectiveness of the weighted sampling strategy.

Figure 8: Comparison of original MBPO to MBPO with the masking scheme and with the weighted sampling strategy.

Figure 9: More empirical analysis for FAMPO. (a) Early stopping: FAMPO-NOSTOP denotes the FAMPO variant without early stopping of the model adaptation procedure.
(b) Adaptation strategy: FAMPO-SW denotes the FAMPO variant that shares the feature extractor weights between the two data distributions, and FAMPO-ADDA denotes the FAMPO variant that fixes the feature extractor of the real data.

Appendix D. More Experimental Results for FAMPO

D.1 Model Adaptation Early Stopping

In practice, we find that after a certain number of environment steps, the model loss difference between FAMPO and MBPO becomes small. So in FAMPO, we early stop the model adaptation procedure after collecting a certain amount of real data, such as 40K samples in the Hopper environment. We then conduct experiments without early stopping the model adaptation, and the results are shown in Figure 9(a). We find that keeping adapting the dynamics model throughout the whole learning process does not bring performance improvement. This indicates that model adaptation makes a difference only when the model training data is insufficient. So we set a model adaptation early stopping epoch for each environment (see Table 2 for details) to improve the computational efficiency.

D.2 Adaptation Strategy

In FAMPO, we untie the feature extractor weights for the two data distributions and learn the two feature extractors simultaneously, which is a variant of the adaptation strategy in Adversarial Discriminative Domain Adaptation (ADDA) (Tzeng et al., 2017). In contrast, in ADDA the feature mapping for the source domain (i.e., the real data) is fixed. Another alternative is to share the feature extractor weights between the two data distributions. From the comparison in Figure 9(b), we observe that the performance of these three adaptation strategies does not differ much, but FAMPO performs slightly better. The reason may be that, unlike in general domain adaptation, in MBRL scenarios the real and simulated data are collected and generated continuously, so learning a feature extractor specialized for the simulated data at each iteration is unnecessary.

Appendix E.
Hyperparameter Settings

Table 1 lists the common hyperparameters used in FAMPO and IAMPO. Table 2 and Table 3 list the distinct hyperparameters of FAMPO and IAMPO, respectively.

Table 1: Common hyperparameters for FAMPO and IAMPO. Columns range over the environments, from InvertedPendulum to HalfCheetah; values are listed per column where they differ.
- network architecture: MLP with four hidden layers of size 200 (feature extractor: four hidden layers; decoder: one output layer)
- real samples for model pretraining: 300 / 2000 / 5000
- E, real steps per epoch: 250 / 1000
- F, real steps between model training: 125 / 250
- B, model rollout batch size: 100000
- ensemble size: 7
- G3, policy updates per real step: 30 / 20 / 40

Table 2: Distinct hyperparameters for FAMPO. [a, b, x, y] denotes a linear schedule, i.e., at epoch e, f(e) = min(max(x + (e - a)/(b - a) * (y - x), x), y).
- model adaptation batch size: 64 / 256
- k, rollout length: 1 / 1 / [5, 45, 1, 15] / 1 / [10, 60, 1, 25] / [1, 30, 1, 5]
- G2, model adaptation updates: 6 / 40 / 400 / 1000 / 3000 / [1, 30, 100, 1000]
- model adaptation early stop epoch: 6 / 6 / 40 / 80 / 60 / 30
- σ0, temperature in weighted sampling: 1 / 0 / 1 / 1 / 0.2 / 0
- σ1, temperature in weighted sampling: 20 / 0 / 1 / 50 / 30 / 0

Table 3: Distinct hyperparameters for IAMPO. [a, b, x, y] denotes a linear schedule as in Table 2.
- DICE training batch size: 256
- k, rollout length: 1 / 3 / [10, 60, 1, 15] / 1 / [10, 60, 1, 25] / 5
- G1, DICE training updates: 100 / 100 / 800 / 600 / 300 / 450
- α0, ratio clipping minimum value: 0.1
- α1, ratio clipping maximum value: 5 / 20 / 5
- λ, coefficient in DICE loss: 0.1 / 3 / 5 / 0.1
- σ0, temperature in weighted sampling: 2 / 5 / 1 / 2 / 0.1 / 0
- σ1, temperature in weighted sampling: 20 / 20 / 40 / 50 / 7.5 / 0

References

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214-223, 2017.

Kavosh Asadi, Dipendra Misra, and Michael Littman. Lipschitz continuity in model-based reinforcement learning.
In International Conference on Machine Learning, pages 264-273, 2018.

Kavosh Asadi, Dipendra Misra, Seungchan Kim, and Michael L Littman. Combating the compounding-error problem with a multi-step model. arXiv preprint arXiv:1905.13320, 2019.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, pages 137-144, 2007.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151-175, 2010.

Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pages 8224-8234, 2018.

Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754-4765, 2018.

Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1-47, 2019.

Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465-472, 2011.

Amir-massoud Farahmand. Iterative value-aware model learning. In Advances in Neural Information Processing Systems, pages 9072-9083, 2018.
Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052-2062, 2019.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096-2030, 2016.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.

Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389-3396. IEEE, 2017.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767-5777, 2017.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. Advances in Neural Information Processing Systems, 28:2944-2952, 2015.

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565-4573, 2016.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563-1600, 2010.

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12498-12509, 2019.

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020.

Bartosz Krawczyk. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4):221-232, 2016.

Hang Lai, Jian Shen, Weinan Zhang, and Yong Yu. Bidirectional model-based policy optimization. In International Conference on Machine Learning, pages 5618-5627. PMLR, 2020.

Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057, 2019.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334-1373, 2016.

Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
Yao Mu, Yuzheng Zhuang, Bin Wang, Guangxiang Zhu, Wulong Liu, Jianyu Chen, Ping Luo, Shengbo Li, Chongjie Zhang, and Jianye Hao. Model-based reinforcement learning via imagination with derived memory. Advances in Neural Information Processing Systems, 34, 2021.

Alfred Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733, 2019.

Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559-7566. IEEE, 2018.

Feiyang Pan, Jia He, Dandan Tu, and Qing He. Trust the model when it is confident: Masked model-based actor-critic. Advances in Neural Information Processing Systems, 33, 2020.

Murray H Protter and Charles B Morrey. Differentiation under the integral sign. Improper integrals. The gamma function. Intermediate Calculus, pages 421-453, 1985.

R Tyrrell Rockafellar. Convex Analysis, volume 36. Princeton University Press, 1970.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.

Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Jian Shen, Han Zhao, Weinan Zhang, and Yong Yu. Model-based policy optimization with unsupervised model adaptation. Advances in Neural Information Processing Systems, 33, 2020.

Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht.
Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334, 2018.

Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.

Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540, 2018.

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216-224. Elsevier, 1990.

Richard S Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael P Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285, 2012.

István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031-1038, 2010.

Erik Talvila. Necessary and sufficient conditions for differentiating under the integral sign. The American Mathematical Monthly, 108(6):544-548, 2001.

Erik Talvitie. Model regularization for stable sample rollouts. In UAI, pages 780-789, 2014.

Erik Talvitie. Self-correcting models for model-based reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167-7176, 2017.

Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Yueh-Hua Wu, Ting-Han Fan, Peter J Ramadge, and Hao Su. Model imitation for model-based reinforcement learning. arXiv preprint arXiv:1909.11821, 2019.
Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33, 2020.

Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, page 114, 2004.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072, 2020a.

Shangtong Zhang, Bo Liu, and Shimon Whiteson. GradientDICE: Rethinking generalized offline estimation of stationary values. In International Conference on Machine Learning, pages 11194-11203. PMLR, 2020b.

Han Zhao, Remi Tachet des Combes, Kun Zhang, and Geoffrey J Gordon. On learning invariant representation for domain adaptation. arXiv preprint arXiv:1901.09453, 2019.

Guangxiang Zhu, Minghao Zhang, Honglak Lee, and Chongjie Zhang. Bridging imagination and reality for model-based deep reinforcement learning. Advances in Neural Information Processing Systems, 33, 2020.