# Offline Model-based Adaptable Policy Learning

Xiong-Hui Chen1, Yang Yu1,3,*, Qingyang Li2, Fan-Ming Luo1, Zhiwei Qin2, Wenjie Shang2, Jieping Ye2
1 National Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China
2 AI Labs, Didi Chuxing
3 Polixir.ai
chenxh@lamda.nju.edu.cn, yuy@nju.edu.cn, qingyangli@didiglobal.com, luofm@lamda.nju.edu.cn, {qinzhiwei,shangwenjie,yejieping}@didiglobal.com
*Corresponding author
35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract: In reinforcement learning, a promising direction for avoiding online trial-and-error costs is learning from an offline dataset. Current offline reinforcement learning methods commonly learn in the policy space constrained to the in-support regions of the offline dataset, in order to ensure the robustness of the resulting policies. Such constraints, however, also limit the potential of those policies. In this paper, to release the potential of offline policy learning, we investigate the decision-making problem in out-of-support regions directly and propose offline Model-based Adaptable Policy LEarning (MAPLE). With this approach, instead of learning only in in-support regions, we learn an adaptable policy that can adapt its behavior in out-of-support regions when deployed. We conduct experiments on MuJoCo control tasks with offline datasets. The results show that the proposed method can make robust decisions in out-of-support regions and achieve better performance than SOTA algorithms.

## 1 Introduction

Recent studies have shown that reinforcement learning (RL) is a promising approach for real-world applications, e.g., sequential recommendation systems [1, 2, 3, 4] and robotic locomotion skill learning [5, 6]. However, the trial-and-error nature of RL in the real world [7] obstructs further applications in cost-sensitive scenarios [8]. Offline (batch) RL learns a policy from a static dataset collected by a behavior policy, without additional interactions with the environment [8, 9, 10, 11]. Since it avoids costly trial-and-error in real-world environments, offline RL is a promising way to handle the challenge in cost-sensitive applications.

A significant challenge of offline RL is answering counterfactual queries, i.e., estimating what the performance (e.g., the Q value) would have been if the agent had executed an unseen action sequence, and then learning to make optimal decisions based on those estimates [8]. Fujimoto et al. [10] have shown that the distributional shift of states and actions, which comes from the discrepancy between evaluated policies and behavior policies, often leads to large extrapolation error in value function estimation. In traditional model-free algorithms, this extrapolation error hurts the generalization performance of the learned policies. Since additional samples that could correct value estimation errors are unavailable in the offline setting, the performance of policies learned from the value function is unstable [10]. On the other hand, model-based RL techniques, which learn dynamics models from the collected dataset and then learn the value function and policy based on the dynamics models, do not need to estimate the value function directly from the collected dataset. However, similar challenges occur in dynamics model approximation.
The dynamics model might overfit the limited dataset and suffer extrapolation errors in regions that the behavior policies have not visited, which causes instability of the learned policy when deployed [12]. We call these out-of-support regions. Moreover, in model inference, the compounding error, that is, the accumulated prediction error between simulated trajectories and reality, can be large even if the one-step prediction error is small [13, 14].

Recent studies in offline model-based RL [12, 15] have made significant progress on MuJoCo tasks [16]. These methods constrain policy sampling in the dynamics models for robust policy learning. By using a large penalty [12] or trajectory truncation [15, 12] in regions with large prediction uncertainty (uncertainty is a designed metric to evaluate the confidence of prediction correctness) or compounding error, policy exploration is constrained to the regions of the dynamics models where the predictions are correct with high confidence, so as to avoid exploiting regions with a risk of large extrapolation error. However, the constraints on the dynamics models lead to a conservative policy learning process, which limits the potential of leveraging dynamics models: visits to states and actions in out-of-support regions are more likely to be inhibited by the constraints, so the learned policy restricts the agent to regions similar to those of the behavior policy. From the perspective of counterfactual queries, we consider model-based RL promising for offline RL: ideally reconstructed dynamics models can simulate transition data for any policy without the distributional-shift problem, and the value function can be estimated directly from the simulated transitions. The bottleneck of offline model-based RL comes from policy learning in an approximated dynamics model with extrapolation error.

In this paper, instead of learning by tightly constraining policy exploration to in-support regions, we investigate decision-making in out-of-support regions directly. We propose a new offline policy learning framework, offline Model-based Adaptable Policy LEarning (MAPLE), to address the aforementioned issues. Ideally, MAPLE tries to model all possible transition dynamics in the out-of-support regions. Then an adaptable policy is learned to be aware of each case and to adapt its behavior to reach optimal performance. In the practical version of MAPLE, we use an ensemble technique to construct ensemble dynamics models. To be aware of each case of the transition dynamics and learn an adaptable policy, we use a meta-learning technique that introduces an extra environment-context extractor to represent dynamics patterns, and the policy adjusts itself according to the environment contexts. We conduct experiments on MuJoCo tasks. The results show that the sampling regions for robust offline policy learning can be extended by constructing transition patterns in out-of-support regions that cover the real case. The output adaptable policy yields better performance than SOTA algorithms when deployed. MAPLE gives a new direction for handling the offline policy learning problem in dynamics models: besides constraining sampling and training dynamics models with better generalization, we can also model out-of-distribution regions by constructing all possible transition patterns.

## 2 Related Work

Reinforcement learning (RL) has shown to be a promising approach to complex real-world decision-making problems [1, 2, 3, 4].
However, unconstrained online trial-and-error in the training of RL agents prevents further applications of RL in safety-critical scenarios, since it might result in large economic losses [8, 17, 18, 19]. Many studies propose to overcome this problem with offline (batch) RL algorithms [20]. Prior works on offline RL are based on model-free algorithms. To overcome the extrapolation error, which is introduced by the discrepancy between the offline dataset and the true state-action distribution of the learned target policies [10], these methods are designed to constrain target policies to be close to the behavior policies [10, 11, 21], to apply ensemble methods for robust value function estimation [22], or to re-weight samples in datasets with importance sampling [23].

More recent studies have shown that policy learning with an approximated dynamics model has good potential to take robust actions outside the action distribution of behavior policies [15, 12]. The challenge comes from the extrapolation error of the dynamics models in out-of-support regions. To address this issue, these methods learn policies from dynamics models with uncertainty constraints. Uncertainty is a measure of prediction confidence on next states; it is often computed from the inconsistency of the ensemble dynamics models' predictions for each state-action pair. Kidambi et al. [15] construct terminating states based on a hard threshold on uncertainty, while Yu et al. [12] use a soft reward penalty to incorporate uncertainty. The penalty constrains policy exploration and optimization to the regions with high consistency for better worst-case performance in the deployment environment [15, 12].

Figure 1: Illustration of MAPLE compared with learning by constraining: (a) learn to adapt (MAPLE); (b) learn by constraining. The pointed lines represent the optimal trajectories of the learned policies. There are several policies in MAPLE since the method learns to adapt to multiple dynamics models. The gray oval represents the in-support region.

The difference between the aforementioned model-based methods and MAPLE is shown in Figure 1. Compared with previous methods [15, 12] that learn by constraining (Figure 1(b)), we learn to adapt to all possible dynamics transitions at states in out-of-support regions (Figure 1(a)).

## 3 Background and Notation

In the standard RL framework, an agent interacts with an environment governed by a Markov Decision Process (MDP) [24]. The agent learns a policy $\pi(a_t|s_t)$, which chooses an action $a_t \in A$ given a particular state $s_t \in S$ at each time-step $t \in \{0, 1, ..., T\}$, where $T$ is the trajectory length. $S$ and $A$ denote the state and action spaces, respectively. The reward function $r_t = r(s_t, a_t) \in \mathbb{R}$ evaluates the immediate performance of the action $a_t$ given the state $s_t$. The goal of RL is to find an optimal policy $\pi^*$ which maximizes the multi-step cumulative discounted reward (i.e., long-term performance). The objective of RL is $\max_\pi J_\rho(\pi) := \mathbb{E}_{\tau \sim p(\tau|\pi,\rho)}\left[\sum_{k=0}^{T} \gamma^k r_k\right]$, where $\gamma$ is the discount factor and $p(\tau|\pi, \rho)$ is the probability of generating a trajectory $\tau := [s_0, a_0, ..., a_{T-1}, s_T]$ under the policy $\pi$ and a dynamics model $\rho(s_{t+1}|s_t, a_t)$. In particular, $p(\tau \mid \pi, \rho) := d_0(s_0) \prod_{t=0}^{T-1} \rho(s_{t+1} \mid s_t, a_t)\, \pi(a_t|s_t)$, where $d_0(s_0)$ is the initial state distribution. A common way to find an optimal policy $\pi^*$ is to optimize the policy with gradient ascent along $J_\rho(\pi)$ [24, 25].
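To make the objective concrete, the following minimal sketch estimates $J_\rho(\pi)$ by Monte-Carlo rollouts when the policy, dynamics model, reward function, and initial state distribution are available as callables. The function names and default horizon/discount values are ours and purely illustrative.

```python
import numpy as np

def rollout_return(policy, dynamics, reward_fn, d0, T, gamma):
    """Sample one trajectory tau ~ p(tau | pi, rho) and return its discounted return."""
    s = d0()                      # draw s_0 from the initial state distribution
    ret = 0.0
    for t in range(T):
        a = policy(s)             # a_t ~ pi(a | s_t)
        ret += (gamma ** t) * reward_fn(s, a)
        s = dynamics(s, a)        # s_{t+1} ~ rho(s' | s_t, a_t)
    return ret

def estimate_J(policy, dynamics, reward_fn, d0, T=200, gamma=0.99, n=1000):
    """Monte-Carlo estimate of J_rho(pi) = E_tau[ sum_t gamma^t r_t ]."""
    return np.mean([rollout_return(policy, dynamics, reward_fn, d0, T, gamma)
                    for _ in range(n)])
```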
In the offline RL setting, we are given only a static dataset $D = \{(s_i, a_i, r_i, s_{i+1})\}$ collected by some unknown behavior policy. The goal is to obtain a policy that maximizes $J_\rho$ using only the static dataset.

## 4 Offline Model-based Adaptable Policy Learning

We argue that in current offline model-based methods, constraining sampling to the in-support regions of the dynamics model leads to a conservative policy learning process, limiting the potential of leveraging dynamics models. In this paper, to relax the constraints on the dynamics model, we investigate decision-making in out-of-support regions directly. In this section, we first give a motivating example to show our ideal solution to decision-making in out-of-support regions (Section 4.1). Then, we introduce a practical algorithm of the proposed solution for complex tasks, based on meta-learning techniques (Section 4.2 and Section 4.3).

### 4.1 Decision-Making in Out-of-Support Regions

By rethinking the scheme of offline model-based RL, without loss of generality, we first formulate the problem as decision-making with a partially known dynamics model (Pak-DM) as a surrogate objective. In this problem, we have two dynamics models: a target dynamics model $\rho$ and a partially known dynamics model $\rho'$, where $\rho$ is the deployment environment in the offline RL setting and $\rho'$ is used to approximate $\rho$. Due to the bias of data sampling in the offline setting and the limited capacity of the function approximator, we have $\rho'(s'|s, a) = \rho(s'|s, a)$ only in part of the state-action space, while the transitions in the rest of the space are uncertain. We call the former the accessible space (a.k.a. in-support regions) and its complement the inaccessible space (a.k.a. out-of-support regions). We assume the two subspaces have been predefined in some way, as in the offline setting (e.g., we can define the space in which an uncertainty quantification is larger than a threshold as the inaccessible space), and then discuss the decision-making problem in this setting. Formally, given an accessible space $X_a$ and an inaccessible space $X_i$, the partially known dynamics model is defined as
$$\rho'(s'|s, a) = \begin{cases} \rho(s'|s, a) & [s, a] \in X_a \\ \text{Unknown} & [s, a] \in X_i \end{cases},$$
where $X$ denotes the state-action concatenated space for brevity and $[s, a]$ denotes a vector concatenating $s$ and $a$. Our objective is to find a policy $\pi$ that maximizes $J_\rho$ by querying only the partially known dynamics model $\rho'$. If $X_i = \emptyset$, that is, $\rho'(s'|s, a) = \rho(s'|s, a)$ for all $s \in S, a \in A$, the problem reduces to a vanilla model-based policy learning problem with an oracle dynamics model. For simplicity, we assume the oracle reward function $r$ is given, but it can also be formulated as a partially known reward function in a similar way. Figure 2 gives an example of a Pak-DM.

Figure 2: An example of a Pak-DM. Each node denotes a state; in the figure, r(C1) = 100.

Here we consider an MDP with finite state space, including A, B, C1, C2, and D. The directed edges denote the transition process. There is a single action in all states except for state B. In state B, there are two actions, α and β, denoted as square nodes. By taking α in state B, the state changes to A. However, the transition after taking β in B is unknown. We use dashed directed edges to denote the possible transitions. In our formulation, edges to any node are valid, but we only consider the edges from B to C1 and C2 and omit the edges to A, B, and D for readability. The reward function r(s) gives a reward when the agent reaches A, C1, or C2.
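To make the Pak-DM formulation concrete, the sketch below encodes the Figure 2 example as plain Python dictionaries; the paradigm that builds and prunes candidate models is described next. The single action name "go", and all reward values other than r(C1) = 100 from the figure and the penalty at C2 discussed below, are illustrative assumptions rather than values taken from the paper.

```python
# A minimal encoding of the Figure 2 Pak-DM as Python dictionaries.
# Transitions in the accessible space X_a are known; the transition (B, beta)
# lies in the inaccessible space X_i, so rho' returns "unknown" for it.
UNKNOWN = None

accessible = {                     # rho'(s, a) = rho(s, a) for [s, a] in X_a
    ("A", "go"): "B",
    ("B", "alpha"): "A",
    ("C1", "go"): "D",
    ("C2", "go"): "D",
    ("D", "go"): "A",
}
inaccessible = {("B", "beta")}     # [s, a] in X_i: the next state is unknown

# Rewards: r(C1) = 100 is given in Figure 2; the remaining values, including
# the -20 penalty at C2 discussed in the text, are illustrative assumptions.
reward = {"A": 1.0, "B": 0.0, "C1": 100.0, "C2": -20.0, "D": 0.0}

def rho_prime(s, a):
    """The partially known dynamics model rho'."""
    if (s, a) in inaccessible:
        return UNKNOWN             # transition lies in an out-of-support region
    return accessible[(s, a)]

# The ideal MAPLE paradigm constructs one candidate model per possible
# completion of the unknown transition, e.g. rho_hat_1 with B --beta--> C1
# and rho_hat_2 with B --beta--> C2, and learns one optimal policy for each.
candidate_models = [{**accessible, ("B", "beta"): s_next}
                    for s_next in ("C1", "C2")]
```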
In this setting, the model-based policy learning algorithms with constraints [12, 15] can be summarized as: find the optimal policy without reaching the inaccessible space. This might output a conservative policy, since it avoids making decisions that might lead the agent to out-of-support regions. Taking Figure 2 as an example, the output policy would run in the loop A → B → (α) → A and avoid taking β in state B. If the real transition is ρ(B, β) = C2, the policy avoids the −20 penalty since C2 is never reached. On the other hand, if ρ(B, β) = C1, the policy misses the large bonus of 100 at C1. Our question is: how should we make good decisions in out-of-support regions directly, so that we can make better use of the approximated ρ' for better performance? We give a new paradigm to solve the problem, which is the ideal implementation of MAPLE:

1. (Training) Construct a dynamics model set $\{\hat\rho_i\}$ by modeling all possible transitions in $X_i$. (This is impractical with infinite state space; we give a practical solution in Section 4.3.)
2. (Training) Learn the optimal policy for each model $\hat\rho_i$ to form the optimal policy set $\{\pi^*_{\hat\rho_i}\}$.
3. (Deployment) Initialize a state $s_0$ from the deployment environment $\rho$.
4. (Deployment) Probe the environment $\rho$ by selecting an action $a$ such that $[s, a] \in X_i$. After getting the next state $s' = \rho(s'|s, a)$, store the tuple $(s, a, s')$ in a memory (e.g., a replay buffer) $D$. If there is no action $a$ such that $[s, a] \in X_i$, randomly select a policy from the policy set to take an action.
5. (Deployment) Reduce the policy set by keeping only the policies whose corresponding transition model $\hat\rho_i$ can explain the experiences in the memory: $\{\hat\rho_i\} \leftarrow \{\rho \mid \rho(s'|s, a) = s', \forall (s, a, s') \in D, \rho \in \{\hat\rho_i\}\}$ and $\{\pi^*_{\hat\rho_i}\} \leftarrow \{\pi^*_{\rho} \mid \rho \in \{\hat\rho_i\}\}$.
6. (Deployment) Repeat Steps 4 and 5 until the policy set is reduced to a single policy.

In this paradigm, we solve the decision-making problem in out-of-support regions by probing the uncertain part of the deployment environment and adapting the policy to the environment. In Figure 2, the ideal MAPLE solution would construct two dynamics models: $\hat\rho_1$ with $\hat\rho_1(B, \beta) = C1$ and $\hat\rho_2$ with $\hat\rho_2(B, \beta) = C2$. Then we learn two optimal policies $\{\pi^*_{\hat\rho_1}, \pi^*_{\hat\rho_2}\}$, one for each dynamics model. At deployment, we first randomly select a policy from the policy set to make decisions. Upon reaching B for the first time, where the transition under action β is uncertain, we take action β and observe the next state. If the next state is C1, the policy set reduces to $\pi^*_{\hat\rho_1}$, otherwise to $\pi^*_{\hat\rho_2}$. Therefore, if ρ(B, β) = C2, the policy initially runs A → B → (β) → C2 → D → A and then settles into the loop A → B → (α) → A because the latter yields higher rewards. If ρ(B, β) = C1, the policy always runs in the loop A → B → (β) → C1 → D → A.

We then dive into the performance difference between the two paradigms for decision-making. Formally, we give Theorem 1 to describe it. The full proof can be found in Appendix A.

Theorem 1. Given a target dynamics model ρ, a policy $\pi_a$ learned by adapting, a policy $\pi_c$ learned by constraints, and the maximum number of steps $N_m$ taken by $\pi_a$ to probe and reduce the policy set to a single policy, the performance gap between $\pi_a$ and $\pi_c$ satisfies
$$J_\rho(\pi_a) - J_\rho(\pi_c) \geq \epsilon_c - \epsilon_p - \gamma^{N_m+1}\, \Delta J_\rho(\pi^*),$$
where $\epsilon_c$ denotes the performance gap between the optimal policy $\pi^*$ and $\pi_c$, and $\epsilon_p$ denotes the performance degradation of MAPLE compared with $\pi^*$ due to the probing phase.
$\Delta J_\rho(\pi^*)$ denotes the performance degradation of $\pi^*$ on the dynamics model ρ caused by the different initial state distribution: $\Delta J_\rho(\pi^*) = \mathbb{E}_{d^{\pi^*}_{N_m+1}(s)}[V^*(s)] - \mathbb{E}_{d^{\pi_a}_{N_m+1}(s)}[V^*(s)]$, where $d^{\pi^*}_{N_m+1}(s)$ and $d^{\pi_a}_{N_m+1}(s)$ denote the state distributions induced by $\pi^*$ and $\pi_a$ at step $N_m + 1$, and $V^*(s)$ denotes the expected long-term reward of $\pi^*$ at state $s$. We can see that the performance gain of $\pi_a$ comes from its ability to automatically converge to the optimal policy after the probing-and-reducing loop (i.e., $\epsilon_c$), while the cost of $\pi_a$ comes from additional probing in the inaccessible space, including receiving less reward while probing (i.e., $\epsilon_p$) and a worse initial state distribution after probing (i.e., $\Delta J_\rho(\pi^*)$). Based on Theorem 1, we give principles for choosing between the paradigms: with a larger performance gap $\epsilon_c$, $\pi_a$ can reach better performance than $\pi_c$. On the other hand, tasks with large penalties on undesired behavior might make $\epsilon_p$ larger, which reduces the overall performance of $\pi_a$. Besides, tasks where sub-optimal behaviors easily lead agents to low-value states, e.g., unsafe states that are prone to terminate the trajectory, might make $\Delta J_\rho(\pi^*)$ large, which also reduces the overall performance of $\pi_a$.

### 4.2 Efficient Decision-Making in Out-of-Support Regions with Meta-learning Techniques

It is computationally inefficient to learn optimal policies independently for each dynamics model, since the policies' behaviors would be similar in in-support regions. For better efficiency, we introduce a context-aware adaptable policy, inspired by meta-learning techniques, to represent the set of learned policies. We first introduce a new notation: the environment-context vector $z \in Z$, where $Z$ denotes the space of context vectors. Given a set of dynamics models $\mathcal{T} := \{\hat\rho_i\}$, each $\hat\rho_i$ can be represented by a vector $z$. Formally, there is a mapping $\phi: \mathcal{T} \rightarrow Z$, which we call an environment-context extractor. The context-aware policy $\pi(a|s, z)$ takes actions based on the current state $s$ and the environment-context vector $z = \phi(\hat\rho)$ of a given environment $\hat\rho \in \mathcal{T}$. We define an optimal environment-context extractor $\phi^*$ to be one that satisfies: $\exists \pi_{\phi^*} \in \Pi$ such that $\forall \hat\rho \in \mathcal{T},\ J_{\hat\rho}(\pi_{\phi^*}) = \max_\pi J_{\hat\rho}(\pi)$, where $\pi_{\phi^*} := \pi(a_t|s_t, z_t)$ with $z_t$ given by $\phi^*$ is an adaptable policy and $\Pi$ denotes the policy class. We discuss the input of $\phi$ later. In addition, we define the optimal adaptable policy $\pi^*_{\phi^*}$ to be one that satisfies $\forall \hat\rho \in \mathcal{T},\ J_{\hat\rho}(\pi^*_{\phi^*}) = \max_\pi J_{\hat\rho}(\pi)$. With the optimal $\phi^*$ and $\pi^*_{\phi^*}$, and given any $\hat\rho$ in $\mathcal{T}$, the adaptable policy can make the best decisions based on the output environment-context $z$. To achieve this, given a dynamics model set $\mathcal{T}$, we optimize $\phi$ and $\pi_\phi$ with the following objective:
$$\phi^*, \pi^*_{\phi^*} = \arg\max_{\phi, \pi_\phi} \mathbb{E}_{\hat\rho \sim \mathcal{T}} \left[ J_{\hat\rho}(\pi_\phi) \right], \qquad (1)$$
where $\sim$ denotes a sampling strategy that draws dynamics models $\hat\rho$ from the dynamics model set $\mathcal{T}$ such that $P[\hat\rho] > 0$ for all $\hat\rho \in \mathcal{T}$. We take a uniform sampling strategy in the following analysis.

To learn the context $z$ via $\phi(z|\hat\rho)$, the main question is: what are suitable inputs to $\phi$ for context learning? In the robotics domain, similar environment contexts have been proposed recently [26, 27, 28]. The policy incorporates an online system identification module $\phi(z_t|s_t, \tau_{0:t})$, which utilizes the history of past states and actions $\tau_{0:t} = [s_0, a_0, ..., s_{t-1}, a_{t-1}, s_t]$ to predict the parameters of the dynamics in simulators. For example, τ could be a trajectory of robot interactions with varying friction coefficients, and $z$ is the value of the coefficient.
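As the next paragraph describes, MAPLE implements the extractor φ with a recurrent network whose hidden state serves as the environment context. The PyTorch-style sketch below is a minimal illustration of such an extractor together with a context-aware policy π(a|s, z); the GRU cell, network sizes, the deterministic action head, and all names are our assumptions, not the released MAPLE code (the paper trains a stochastic SAC policy).

```python
import torch
import torch.nn as nn

class ContextExtractor(nn.Module):
    """Recurrent environment-context extractor: z_t = phi(s_t, a_{t-1}, z_{t-1})."""
    def __init__(self, state_dim, action_dim, context_dim=16):
        super().__init__()
        self.rnn = nn.GRUCell(state_dim + action_dim, context_dim)

    def forward(self, s_t, a_prev, z_prev):
        # Embed the latest transition information into the context vector.
        return self.rnn(torch.cat([s_t, a_prev], dim=-1), z_prev)

class ContextAwarePolicy(nn.Module):
    """Adaptable policy pi(a | s, z) conditioned on the environment context."""
    def __init__(self, state_dim, action_dim, context_dim=16, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # deterministic mean action for brevity
        )

    def forward(self, s_t, z_t):
        return self.net(torch.cat([s_t, z_t], dim=-1))
```

During training, $z_t$ is computed recursively along each model rollout and fed to the policy together with $s_t$, so gradients of the objective in Equation (1) flow back through both modules.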
In practice, a recurrent neural network (RNN) is used to embed the sequential information into environment-context vectors $z_t = \phi(s_t, a_{t-1}, z_{t-1})$. In MAPLE, we follow the same structure to model the extractor, but the trajectories are rolled out in the constructed dynamics models. If the reward function is also partially known, $r_{t-1}$ should be considered as well, that is, $z_t = \phi(s_t, a_{t-1}, r_{t-1}, z_{t-1})$. With an RNN-based environment-context extractor φ optimized with Equation (1), the context-aware policy π can automatically probe environments and reduce the policy set. Consider a given i-step partial trajectory $\tau_{0:i}$ and a subset of deterministic dynamics models $\mathcal{T}' \subseteq \mathcal{T}$ in which each dynamics model $\hat\rho \in \mathcal{T}'$ is consistent with $\tau_{0:i}$. The objective from step $i+1$ to $T$ given $\tau_{0:i}$ can be rewritten as $\mathbb{E}_{\hat\rho \sim \mathcal{T}'}\left[\mathbb{E}_{\tau \sim p(\pi, \hat\rho)}\left[\sum_{k=i+1}^{T} \gamma^k r_k\right]\right]$. Since the dynamics models in $\mathcal{T}'$ are indistinguishable at step $i+1$, the optimal policy at this step would converge to a stochastic policy if the optimal actions differ among the dynamics models. If $\hat\rho$ is sampled uniformly from $\mathcal{T}'$ and the optimal cumulative rewards $\sum_{k=i+1}^{T} \gamma^k r^*_k$ are the same for each dynamics model, the optimal policy at step $i+1$ is to sample uniformly from the optimal actions of each dynamics model. If the optimal cumulative rewards are different, the action probabilities are weighted by the cumulative rewards of each dynamics model. On the other hand, partial trajectories from different dynamics models might predict the same $z$. If the optimal actions in the same state conflict, then to improve the objective in Equation (1), the policy gradient has to backpropagate from π to φ. If the partial trajectories $\tau_{0:i}$ are different, the contexts in these partial trajectories become distinctive. Finally, the output action distribution of the context-aware policy is optimized within the subset of dynamics models whose partial trajectories $\tau_{0:i}$ are consistent.

### 4.3 Practical Implementation of Offline Model-based Adaptable Policy Learning

Moving from the decision-making problem with a Pak-DM to the real offline setting, we additionally have to recognize the inaccessible space and construct the dynamics model set. Especially in tasks with infinite state-action space, it is impractical to find the exact inaccessible space and recover all possible transitions in it. As a practical implementation, we use the ensemble technique to learn the dynamics model set, which predicts similar transitions in the accessible space and tends to predict different transitions in the inaccessible space. With a large ensemble of models with differently initialized weights, we can construct a large number of transition cases in the inaccessible space. If the real transition pattern falls within the variation of the ensemble dynamics model set, then, relying on the interpolation ability of the environment-context extractor φ, the adaptable policy π can take appropriate actions. However, relying only on the randomness of initialization, covering the real cases in all out-of-support regions would require a sufficiently large dynamics model set, which would be highly expensive. To trade off the cost of model construction against better adaptivity, the practical implementation uses several tricks to constrain policy exploration in the ensemble dynamics model set: 1) We add some mild constraints to the exploration region. To mitigate the compounding error of the model, we constrain the maximum rollout length to a fixed number.
Besides, at each step, a penalty is calculated according to the model uncertainty $U(s_t, a_t)$, which measures the prediction uncertainty of the learned transition models at $(s_t, a_t)$. As a result, the reward provided to the agent consists of two parts: the reward given by the reward function and a penalty calculated from $U(s_t, a_t)$. These constraints are the same as in MOPO [12], but the coefficients are more relaxed. 2) As we increase the rollout length, the large compounding error might lead the agent to entirely unreal regions, in which the states would never appear in the deployment environment. These samples are useless for adaptable policy learning and might make the training process unstable. Given a task, we can construct some simple rules to discriminate the entirely unreal regions and terminate the trajectory rollout (i.e., set the done flag to True) when reaching them. In our implementation, we terminate trajectories when the predicted next states are out of the range $(s_{min}, s_{max})$, where $s_{min}$ and $s_{max}$ are two hyper-parameters defining a reasonable range of the state space. Based on the above techniques, we propose the practical implementation of offline model-based adaptable policy learning in Algorithm 1. More details can be found in Appendix D.

Algorithm 1: Offline model-based adaptable policy learning
Input: environment-context extractor $\phi_\varphi$ parameterized by $\varphi$; adaptable policy network $\pi_\theta$ parameterized by $\theta$; offline dataset $D_{off}$; trajectory termination rule $f$; rollout horizon $H$.
Process:
  Generate an ensemble of $m$ dynamics models $\{\hat\rho_i\}$ via supervised learning
  Initialize an empty buffer $D_{rollout}$; add hidden states $z$ to each tuple in $D_{off}$ and initialize them with 0
  for 1, 2, 3, ... do
    Randomly select a dynamics model $\hat\rho_i$ from $\{\hat\rho_i\}$ and sample $s_1, a_0, z_0$ from $D_{off}$
    for t = 1, 2, ..., H do
      Sample $z_t$ from $\phi_\varphi(z|s_t, a_{t-1}, z_{t-1})$ and then sample $a_t$ from $\pi_\theta(a|s_t, z_t)$
      Roll out one step: $s_{t+1} \sim \hat\rho_i(s|s_t, a_t)$ and $r_{t+1} = r(s_t, a_t)$
      Compute the terminal flag $d_{t+1} = f(s_{t+1})$
      Compute the reward penalty: $r_{t+1} \leftarrow r_{t+1} - \lambda U(s_t, a_t)$
      Add $(s_{t+1}, r_{t+1}, d_{t+1}, s_t, a_t, z_t)$ to $D_{rollout}$
      Break the rollout if $d_{t+1}$ is True
    end for
    Update the stored hidden states $z$ in $D_{off}$ with $\phi_\varphi$
    Use SAC [29] to update $\varphi$ and $\theta$ via Equation (1) with $D_{rollout}$ and $D_{off}$
  end for
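As a rough companion to Algorithm 1, the sketch below generates one H-step branched rollout in an ensemble of learned models, applying a penalty based on model disagreement and the out-of-range termination rule. Using the ensemble standard deviation as the uncertainty U(s, a), the numpy interface, and all names are illustrative assumptions on our part rather than the released implementation.

```python
import numpy as np

def maple_rollout(models, extractor, policy, reward_fn, s, a_prev, z_prev,
                  H=10, lam=1.0, s_min=-100.0, s_max=100.0):
    """Generate one H-step branched rollout for D_rollout (cf. Algorithm 1).

    `models` is a list of learned dynamics models, each mapping (s, a) -> s'.
    U(s, a) is approximated here by the disagreement (std) among ensemble
    predictions -- an illustrative choice, not the paper's exact penalty.
    """
    rho = models[np.random.randint(len(models))]   # randomly pick one ensemble member
    rollout = []
    for _ in range(H):
        z = extractor(s, a_prev, z_prev)           # z_t = phi(s_t, a_{t-1}, z_{t-1})
        a = policy(s, z)                           # a_t ~ pi(a | s_t, z_t)
        preds = np.stack([m(s, a) for m in models])
        s_next = rho(s, a)                         # roll out in the selected model
        u = preds.std(axis=0).max()                # ensemble disagreement as U(s_t, a_t)
        r = reward_fn(s, a) - lam * u              # penalized reward
        done = bool(np.any(s_next < s_min) or np.any(s_next > s_max))
        rollout.append((s_next, r, done, s, a, z))
        if done:                                   # terminate in "entirely unreal" regions
            break
        s, a_prev, z_prev = s_next, a, z
    return rollout
```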
## 5 Experiments

We evaluate MAPLE on multiple offline MuJoCo tasks [16]. All the details of MAPLE's training and evaluation are given in Appendix E and Appendix F. We release our code at https://github.com/xionghuichen/MAPLE.

### 5.1 Comparative Evaluation on Benchmark Tasks

We test MAPLE on standard offline RL tasks with D4RL datasets [30]. The ensemble dynamics model set is trained via supervised learning. We repeat each task with three random seeds. In the model learning stage, we train 20 models for each task and select 14 of them as the ensemble model for policy learning. The horizon H is set to 10 in these tasks. The policy is trained for 1000 iterations in the policy learning stage. We compare MAPLE with: (1) MOPO: learn an ensemble model via supervised learning and learn a policy in the ensemble models with an uncertainty penalty [12]; (2) MOPO-loose: the MOPO algorithm with the same constraint hyper-parameters as MAPLE, which are looser than those of MOPO; (3) BEAR: learn a policy via off-policy RL while constraining the maximum mean discrepancy between the current policy and the behavior policy [11]; (4) BC: imitate the behavior policy via supervised learning; (5) SAC: perform typical SAC updates with the static dataset [29]; (6) BRAC-v: a behavior-regularized actor-critic proposed by Wu et al. [21]; (7) CQL: learn an action-value function with regularization to obtain a conservative policy [31]. Results of (3)-(7) and (1) are taken from [30] and [12], respectively.

Table 1: Results on MuJoCo tasks. Each number is the normalized score proposed by Fu et al. [30] of the policy at the last iteration of training, ± standard deviation. Among the offline RL methods, we bold the highest mean for each task.

| Environment | Dataset | MAPLE | MOPO | MOPO-loose | SAC | BEAR | BC | BRAC-v | CQL |
|---|---|---|---|---|---|---|---|---|---|
| Walker2d | random | **21.7 ± 0.3** | 13.6 ± 2.6 | 8.0 ± 5.4 | 4.1 | 6.7 | 9.8 | 0.5 | 7.0 |
| Walker2d | medium | 56.3 ± 10.6 | 11.8 ± 19.3 | 32.6 ± 18.0 | 0.9 | 33.2 | 6.6 | **81.3** | 79.2 |
| Walker2d | mixed | **76.7 ± 3.8** | 39.0 ± 9.6 | 35.7 ± 2.2 | 3.5 | 25.3 | 11.3 | 0.4 | 26.7 |
| Walker2d | med-expert | 73.8 ± 8.0 | 44.6 ± 12.9 | 66.7 ± 14.8 | -0.1 | 26.0 | 6.4 | 66.6 | **111.0** |
| HalfCheetah | random | **38.4 ± 1.3** | 35.4 ± 1.5 | 35.4 ± 2.1 | 30.5 | 25.5 | 2.1 | 28.1 | 35.4 |
| HalfCheetah | medium | **50.4 ± 1.9** | 42.3 ± 1.6 | 44.0 ± 1.6 | -4.3 | 38.6 | 36.1 | 45.5 | 44.4 |
| HalfCheetah | mixed | **59.0 ± 0.6** | 53.1 ± 2.0 | 36.9 ± 15.0 | -2.4 | 36.2 | 38.4 | 45.9 | 46.2 |
| HalfCheetah | med-expert | **63.5 ± 6.5** | 63.3 ± 38.0 | 15.0 ± 6.0 | 1.8 | 51.7 | 35.8 | 45.3 | 62.4 |
| Hopper | random | 10.6 ± 0.1 | 11.7 ± 0.4 | 10.6 ± 0.6 | 11.3 | 9.5 | 1.6 | **12.0** | 10.8 |
| Hopper | medium | 21.1 ± 1.2 | 28.0 ± 12.4 | 16.9 ± 2.4 | 0.8 | 47.6 | 29.0 | 32.3 | **58.0** |
| Hopper | mixed | **87.5 ± 10.8** | 67.5 ± 24.7 | 83.1 ± 6.5 | 1.9 | 10.8 | 11.8 | 0.9 | 48.6 |
| Hopper | med-expert | 42.5 ± 4.1 | 23.7 ± 6.0 | 25.1 ± 1.8 | 1.6 | 4.0 | **111.9** | 0.8 | 98.7 |

Table 1 shows the performance on the 12 tasks. In summary, MAPLE performs better than the other SOTA algorithms on 7 tasks. Besides, MAPLE reaches the best performance among the SOTA model-based conservative policy learning algorithms in 10 out of the 12 tasks. These results demonstrate the superior generalization ability of MAPLE. We can also see that the performance of MAPLE and MOPO is higher than that of the model-free baseline algorithms in most of the tasks, which reveals that model-based methods can find better policies by taking actions outside of the action distribution of the behavior policies. However, in the Hopper experiments, the model-free methods BC, CQL, and BRAC-v often reach better performance. We consider this to be because the environment is unstable, so more diverse collected data is needed for robust dynamics model learning. Even in MAPLE, a finite number of ensemble dynamics models might not cover the real case well enough to allow robust adaptation. Therefore, only in the Hopper-mixed task, which has diverse collected data, can MOPO and MAPLE improve the deployment performance. We also run MOPO-loose for each task, which shares the same hyper-parameters with MAPLE, including the ensemble model size, rollout length, and penalty coefficient. The results show that in some cases (e.g., Walker2d-med-expert and Hopper-mixed), MOPO-loose can also enhance the performance.
We attribute this improvement to the fact that the constraints in the original implementation are too tight. However, in most of the tasks, the improvement is not as significant as that of MAPLE. This phenomenon indicates that the improvement of MAPLE does not come from parameter tuning but from our self-adaptation mechanism.

The implementation of MAPLE shares the cross-domain idea with meta-RL, particularly domain randomization and sim2real. In the current implementation, the online system identification (OSI) method from meta-RL is employed [26, 27, 28]. Other techniques, e.g., VariBAD [32], can also be integrated into MAPLE. We construct another variant of MAPLE, VariBAD-MAPLE, as an alternative implementation. VariBAD-MAPLE also reaches significantly better performance than MOPO. Compared with the original implementation of MAPLE, VariBAD-MAPLE does better than MAPLE on Walker2d-medium and HalfCheetah-random but worse than MAPLE on Walker2d-mixed and HalfCheetah-mixed. The detailed results can be found in Appendix F.6.

### 5.2 Analysis of MAPLE

Figure 3: The learning curves of MAPLE with different hyper-parameters m and H (panels for H=10, H=20, and H=40; curves for m=7, 50, 100, 200, and MOPO; x-axis: epochs; y-axis: normalized return). The solid curves are the mean of the normalized return and the shading is the standard error.

Figure 4: A comparison of trained and deployed rewards at the 1000th epoch, based on m = 50, across different horizons H (trained one-step reward vs. deployed normalized reward).

The ultimate target of MAPLE is handling the decision-making challenge in out-of-support regions. By constructing all possible transitions in out-of-support regions, the probing-and-reducing loop can finally find the optimal policy. However, it is impractical to construct a dynamics model set that covers all possible transitions in out-of-support regions. In practice, we construct an ensemble dynamics model set and use loose constraints on policy sampling to expand the exploration boundary for better asymptotic performance. Therefore, with a larger model set, the dynamics models can cover more real transitions in out-of-support regions, and MAPLE is expected to perform better as the constraints are relaxed. In this section, to verify this argument, we analyze the relationship among the constraint degree, the size of the ensemble model set, and the asymptotic performance. In MAPLE, the constraint degree can be evaluated by the rollout length (H) or the reward penalty coefficient (λ); the size of the ensemble model set is m; the asymptotic performance is evaluated by the cumulative reward at the 1000th iteration. We select Walker2d-medium to verify the argument. With λ fixed, we compare the asymptotic performance for different H and m. The results are given in Figure 3. First, for all H, increasing m improves the performance of the converged policies. On the other hand, without enough ensemble models, too-loose constraints result in worse performance. In particular, for the setting of m = 50, when H = 10 the final normalized return is near 0.7. However, as H increases, the asymptotic performance drops gradually; when H = 40, it is even worse than MOPO. As can be seen in Figure 4, the trained one-step reward is similar across the different horizons.
This means that the performance of the policies in the dynamics models is similar. The worse deployed performance thus indicates that the adaptable policy overfits the finite set of dynamics models, and that φ fails to infer a correct environment context for the adaptable policy when deployed. The issue can be remedied by constructing more dynamics models. As depicted in Figure 3(c), by increasing the model size to 200, the asymptotic performance recovers to around 80%. We also conducted additional experiments on Walker2d-mixed and HalfCheetah-mixed to visualize z. The results reveal that different dynamics models are distinguished by z and are approximately divided into several groups. Besides, we sample trajectories for 1000 time-steps in the deployment environment and find that the value of z oscillates within a region. The detailed results can be found in Appendix F.5.

### 5.3 MAPLE with a Large Dynamics Model Set

Table 2: Results on MuJoCo tasks with MAPLE-200.

| Environment | Dataset | MAPLE-200 | MAPLE |
|---|---|---|---|
| Walker2d | random | 22.1 ± 0.1 | 21.7 ± 0.3 |
| Walker2d | medium | 81.3 ± 0.1 | 56.3 ± 10.6 |
| Walker2d | mixed | 75.4 ± 0.9 | 76.7 ± 3.8 |
| Walker2d | med-expert | 107.0 ± 0.8 | 73.8 ± 8.0 |
| HalfCheetah | random | 41.5 ± 3.6 | 38.4 ± 1.3 |
| HalfCheetah | medium | 48.5 ± 1.4 | 50.4 ± 1.9 |
| HalfCheetah | mixed | 69.5 ± 0.2 | 59.0 ± 0.6 |
| HalfCheetah | med-expert | 55.4 ± 3.2 | 63.5 ± 6.5 |
| Hopper | random | 10.7 ± 0.2 | 10.6 ± 0.1 |
| Hopper | medium | 44.1 ± 2.6 | 21.1 ± 1.2 |
| Hopper | mixed | 85.0 ± 1.0 | 87.5 ± 10.8 |
| Hopper | med-expert | 95.3 ± 7.3 | 42.5 ± 4.1 |

Based on the above analysis, we can draw an empirical conclusion: increasing the model size is significantly helpful for finding a better and more robust adaptable policy by expanding the exploration boundary. Therefore, we run another variant of the MAPLE algorithm, MAPLE-200, which uses 200 ensemble dynamics models for policy learning and expands the rollout horizon to 20. The results of MAPLE-200 can be found in Table 2. We can see that the empirical conclusion holds not only for Walker2d-medium but also for the other tasks. In all of the tasks, MAPLE-200 reaches at least similar performance to MAPLE. On tasks like Walker2d-med-expert, HalfCheetah-mixed, Hopper-medium, and Hopper-med-expert, the performance improvement of MAPLE-200 is significant. Besides, on Hopper-medium and Hopper-med-expert, where MAPLE and MOPO fail to reach performance comparable to model-free offline methods, MAPLE-200 reaches similar or much better results. MAPLE-200 demonstrates a powerful adaptation ability. However, we point out that, by increasing the size of the ensemble dynamics models tenfold, the time overhead of MAPLE-200 training also grows: for example, when training on an NVIDIA Tesla P40 and a Xeon(R) E5-2630, the time overhead of MAPLE-200 is about 10 times that of MAPLE. Besides, to obtain dynamics models that cover more real cases in out-of-support regions than MAPLE-200, the ensemble size would have to become much larger, which is one of the limitations of the current MAPLE implementation.

## 6 Discussion and Future Work

Prior work has demonstrated the feasibility of model-based policy learning in the offline setting by using the conservative policy learning paradigm [12, 15]. It is an important breakthrough from zero to one in the offline setting, but it is far from the end of offline model-based policy learning. In this work, we investigate decision-making problems in out-of-support regions directly. We first formulate the problem as decision-making with a Pak-DM and propose MAPLE, a learn-to-adapt paradigm to solve the problem.
We also give a theorem describing the pros and cons of the two paradigms, which provides principles for paradigm selection. We verify MAPLE on the MuJoCo tasks and reach an empirical conclusion: by increasing the size of the model set, we can expand the exploration boundary in the approximated dynamics models and use an adaptable policy to make better and more robust decisions in deployment environments. MAPLE gives another direction for handling the offline model-based learning problem: besides constraining sampling and training dynamics models with better generalization, we can model out-of-distribution regions by constructing all possible transition patterns. The current limitations lie in the implementation: (1) The extractor's generalization ability depends on the neural network itself, which is uncontrollable to some degree; adding auxiliary tasks might handle this issue. (2) Ensembling is the direct way to cover real transitions in out-of-support regions; however, as the model set becomes large, it is hard to generate new, different dynamics models that cover the real cases only by increasing the ensemble size. More efficient ways to increase the coverage of real transitions by the dynamics model set should be further explored.

## Acknowledgements

This work is supported by the National Key Research and Development Program of China (2020AAA0107200) and the NSFC (61876077). We thank Zheng-Mao Zhu, Shengyi Jiang, Rong-Jun Qin, Yu-Ren Liu, and the anonymous reviewers for their useful suggestions on the paper.

## References

[1] Xiting Wang, Yiru Chen, Jie Yang, Le Wu, Zhengtao Wu, and Xing Xie. A reinforcement learning framework for explainable recommendation. In 2018 IEEE International Conference on Data Mining, pages 587-596. IEEE, 2018.
[2] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1040-1048, London, UK, 2018.
[3] Xiangyu Zhao, Changsheng Gu, Haoshenglun Zhang, Xiaobing Liu, Xiwang Yang, and Jiliang Tang. Deep reinforcement learning for online advertising in recommender systems. CoRR, abs/1909.03602, 2019.
[4] Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining, Cambridge, United Kingdom, 2017.
[5] Xue Bin Peng, Erwin Coumans, Tingnan Zhang, Tsang-Wei Edward Lee, Jie Tan, and Sergey Levine. Learning agile robotic locomotion skills by imitating animals. CoRR, abs/2004.00784, 2020.
[6] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. CoRR, abs/1812.05905, 2018.
[7] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421-436, 2018.
[8] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. CoRR, abs/2005.01643, 2020.
[9] Noah Y. Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin A. Riedmiller. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
[10] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, 2019.
[11] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Vancouver, Canada, 2019.
[12] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. CoRR, abs/2005.13239, 2020.
[13] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 12498-12509, 2019.
[14] Stéphane Ross and Drew Bagnell. Agnostic system identification for model-based reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, 2012.
[15] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOReL: Model-based offline reinforcement learning. CoRR, abs/2005.05951, 2020.
[16] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Proceedings of the 24th IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033, Vilamoura, Portugal, 2012.
[17] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline A/B testing for recommender systems. In Yi Chang, Chengxiang Zhai, Yan Liu, and Yoelle Maarek, editors, Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 198-206. ACM, 2018.
[18] Georgios Theocharous, Philip S. Thomas, and Mohammad Ghavamzadeh. Personalized ad recommendation systems for life-time value optimization with guarantees. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pages 1806-1812, Buenos Aires, Argentina, 2015. AAAI Press.
[19] Philip S. Thomas, Georgios Theocharous, Mohammad Ghavamzadeh, Ishan Durugkar, and Emma Brunskill. Predictive off-policy policy evaluation for nonstationary decision problems, with applications to digital marketing. In Satinder P. Singh and Shaul Markovitch, editors, Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 4740-4745, CA, USA, 2017. AAAI Press.
[20] Sascha Lange, Thomas Gabel, and Martin A. Riedmiller. Batch reinforcement learning. In Marco A. Wiering and Martijn van Otterlo, editors, Reinforcement Learning, volume 12 of Adaptation, Learning, and Optimization, pages 45-73. Springer, 2012.
[21] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. CoRR, abs/1911.11361, 2019.
[22] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 2020.
[23] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. CoRR, abs/1904.08473, 2019.
[24] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[25] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057-1063, Denver, Colorado, 1999.
[26] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In Proceedings of the 35th IEEE International Conference on Robotics and Automation, pages 1-8, Brisbane, Australia, 2018.
[27] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider, Nikolas Tezak, Jerry Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba, and Lei Zhang. Solving Rubik's Cube with a robot hand. CoRR, abs/1910.07113, 2019.
[28] Wenhao Yu, Jie Tan, C. Karen Liu, and Greg Turk. Preparing for the unknown: Learning a universal policy with online system identification. In Nancy M. Amato, Siddhartha S. Srinivasa, Nora Ayanian, and Scott Kuindersma, editors, Robotics: Science and Systems XIII, Massachusetts Institute of Technology, Cambridge, MA, July 12-16, 2017.
[29] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pages 1856-1865, Stockholmsmässan, Sweden, 2018.
[30] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. CoRR, abs/2004.07219, 2020.
[31] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
[32] Luisa M. Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, and Shimon Whiteson. VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.